Welcome to PyHealth’s documentation!
[Oct 2022] We will release a brand-new version of PyHealth in the next few weeks. It will include more EHR datasets, health-related tasks, and state-of-the-art models. Please stay tuned!

PyHealth is a comprehensive Python package for healthcare AI, designed for both ML researchers and healthcare and medical practitioners. PyHealth accepts diverse healthcare data such as longitudinal electronic health records (EHRs), continuous signals (ECG, EEG), and clinical notes (to be added), and supports various predictive modeling methods using deep learning and other advanced machine learning algorithms published in the literature.
The library is proudly developed and maintained by researchers from Carnegie Mellon University, IQVIA, and the University of Illinois at Urbana-Champaign. PyHealth makes many important healthcare tasks accessible, such as phenotyping prediction, mortality prediction, and ICU length-of-stay forecasting. Running these prediction tasks with deep learning models can take as few as 10 lines of code in PyHealth.
PyHealth comes with three major modules: (i) a data preprocessing module, (ii) a learning module, and (iii) an evaluation module. Typically, one runs the data preprocessing module to prepare the data, feeds the result to the learning module for prediction, and finally assesses the results with the evaluation module. Users can use the full system as described or only selected modules, depending on their needs:
- Deep learning researchers may directly use the processed data together with their own newly proposed models.
- Medical personnel may use the data preprocessing module to convert medical data into a format that the learning models can consume, and then run inference tasks to gain insights from the data.
PyHealth features:
- Unified APIs, detailed documentation, and interactive examples across various types of datasets and algorithms.
- Advanced models, including the latest deep learning models and classical machine learning models.
- Wide coverage, supporting sequence data, image data, series data, and text data such as clinical notes.
- Optimized performance with JIT compilation and parallelization where possible, using numba and joblib.
- Customizable modules and flexible design: each module may be turned on/off or replaced entirely by custom functions. Trained models can be easily exported and reloaded for fast execution and deployment.
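
The idea of modules that can be switched off or swapped for custom functions can be sketched in plain Python. This is only an illustration of the design, not PyHealth code: `run_pipeline`, `default_preprocess`, `threshold_model`, `accuracy`, and the record fields are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# All names below are hypothetical; they are not PyHealth APIs.
Record = Dict[str, float]
Dataset = Tuple[List[List[float]], List[int]]


def default_preprocess(records: Sequence[Record]) -> Dataset:
    """Turn raw records into feature rows and binary labels."""
    features = [[r["age"], r["n_visits"]] for r in records]
    labels = [int(r["label"]) for r in records]
    return features, labels


def threshold_model(features: List[List[float]]) -> List[int]:
    """Toy 'learning' stage: predict 1 when the visit count exceeds 3."""
    return [1 if row[1] > 3 else 0 for row in features]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def run_pipeline(
    records: Sequence[Record],
    preprocess: Callable[[Sequence[Record]], Dataset] = default_preprocess,
    predict: Callable[[List[List[float]]], List[int]] = threshold_model,
    evaluate: Optional[Callable[[List[int], List[int]], float]] = accuracy,
) -> Tuple[List[int], Optional[float]]:
    """Run preprocessing, prediction, and (optionally) evaluation."""
    features, labels = preprocess(records)
    predictions = predict(features)
    # Passing evaluate=None switches the evaluation stage off.
    score = evaluate(labels, predictions) if evaluate is not None else None
    return predictions, score


records = [
    {"age": 70.0, "n_visits": 5.0, "label": 1.0},
    {"age": 34.0, "n_visits": 1.0, "label": 0.0},
    {"age": 58.0, "n_visits": 4.0, "label": 0.0},
]
predictions, score = run_pipeline(records)
```

Replacing a stage is then a matter of passing a different callable, for example `run_pipeline(records, predict=my_model, evaluate=None)` to plug in a custom model and skip evaluation.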
API Demo for LSTM on Mortality Prediction with GPU:

    # load a pre-processed dataset (this experiment ID refers to MIMIC mortality data)
    from pyhealth.data.expdata_generator import sequencedata as expdata_generator

    expdata_id = '2020.0810.data.mortality.mimic'
    cur_dataset = expdata_generator(exp_id=expdata_id)  # pass the ID defined above
    cur_dataset.get_exp_data(sel_task='mortality')
    cur_dataset.load_exp_data()

    # initialize the model for training
    from pyhealth.models.sequence.lstm import LSTM

    # enable GPU
    expmodel_id = 'test.model.lstm.0001'
    clf = LSTM(expmodel_id=expmodel_id, n_batchsize=20, use_gpu=True, n_epoch=100)
    clf.fit(cur_dataset.train, cur_dataset.valid)

    # load the best model for inference
    clf.load_model()
    clf.inference(cur_dataset.test)
    pred_results = clf.get_results()

    # evaluate the model
    from pyhealth.evaluation.evaluator import func
    r = func(pred_results['hat_y'], pred_results['y'])
    print(r)

Citing PyHealth:
The PyHealth paper is under review at JMLR (machine learning open-source software track). If you use PyHealth in a scientific publication, we would appreciate citations to the following paper:
@article{zhao2021pyhealth,
title={PyHealth: A Python Library for Health Predictive Models},
author={Zhao, Yue and Qiao, Zhi and Xiao, Cao and Glass, Lucas and Sun, Jimeng},
journal={arXiv preprint arXiv:2101.04209},
year={2021}
}
or:
Zhao, Y., Qiao, Z., Xiao, C., Glass, L. and Sun, J., 2021. PyHealth: A Python Library for Health Predictive Models. arXiv preprint arXiv:2101.04209.
Preprocessed Datasets & Implemented Algorithms
(i) Preprocessed Datasets (custom data preprocessing functions are provided in the example folders):
Type | Abbr | Description | Processing Function | Link |
---|---|---|---|---|
Sequence: EHR-ICU | MIMIC III | A relational database containing tables of data relating to patients who stayed in the ICU. | \examples\data_generation\dataloader_mimic | |
Sequence: EHR-ICU | MIMIC_demo | The MIMIC-III demo database is limited to 100 patients and excludes the noteevents table. | \examples\data_generation\dataloader_mimic_demo | |
Sequence: EHR-Claim | CMS | DE-SynPUF: CMS 2008-2010 Data Entrepreneurs Synthetic Public Use File | \examples\data_generation\dataloader_cms | https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs |
Image: Chest X-ray | Pediatric | Pediatric Chest X-ray Pneumonia (Bacterial vs Viral vs Normal) Dataset | N/A | https://academictorrents.com/details/951f829a8eeb4d2839c4a535db95078a9175010b |
Series: ECG | PhysioNet | AF Classification from a short single lead ECG recording Dataset. | N/A | https://archive.physionet.org/challenge/2017/#challenge-data |

You may download the above datasets from the links. The structure of the generated datasets can be found in the datasets folder:
\datasets\cms\x_data\…csv
\datasets\cms\y_data\phenotyping.csv
\datasets\cms\y_data\mortality.csv
The processed datasets (X, y) should be placed in x_data and y_data, respectively, so that the deep learning models can read them. We include some sample datasets under the \datasets folder.
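
A quick way to confirm that a processed dataset follows this x_data / y_data layout is a small standard-library check. The sketch below is an assumption-laden illustration, not part of PyHealth: the helper names, the sample file name `patient_0.csv`, and the CSV columns are made up for the demonstration, and only the folder names and the `phenotyping.csv` / `mortality.csv` label files come from the structure listed above.

```python
from pathlib import Path
from typing import Iterable, List
import tempfile


def missing_label_files(root: Path, dataset: str, tasks: Iterable[str]) -> List[Path]:
    """Return expected y_data files that are absent under root/datasets/<dataset>."""
    base = root / "datasets" / dataset
    expected = [base / "y_data" / f"{task}.csv" for task in tasks]
    return [path for path in expected if not path.is_file()]


def has_feature_files(root: Path, dataset: str) -> bool:
    """True when the x_data folder holds at least one CSV file."""
    x_dir = root / "datasets" / dataset / "x_data"
    return x_dir.is_dir() and any(x_dir.glob("*.csv"))


# Build a throwaway layout to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "datasets" / "cms" / "x_data").mkdir(parents=True)
    (root / "datasets" / "cms" / "y_data").mkdir(parents=True)
    # Hypothetical file name and columns, used only for this demo.
    (root / "datasets" / "cms" / "x_data" / "patient_0.csv").write_text("t,code\n0,250\n")
    (root / "datasets" / "cms" / "y_data" / "mortality.csv").write_text("id,label\n0,1\n")

    features_ok = has_feature_files(root, "cms")
    missing = missing_label_files(root, "cms", ["phenotyping", "mortality"])
    missing_names = [path.name for path in missing]
```

In this demo only the mortality labels were written, so the check reports the phenotyping label file as missing.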
(ii) Machine Learning and Deep Learning Models:
For sequence data:
Type | Abbr | Class | Algorithm | Year | Ref |
---|---|---|---|---|---|
Classical Models | RandomForest | | Random forests | 2001 | [ABre01] |
Classical Models | XGBoost | | XGBoost: A scalable tree boosting system | 2016 | |
Neural Networks | LSTM | pyhealth.models.sequence.lstm | Long short-term memory | 1997 | |
Neural Networks | GRU | | Gated recurrent unit | 2014 | |
Neural Networks | RETAIN | | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism | 2016 | |
Neural Networks | Dipole | | Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks | 2017 | |
Neural Networks | tLSTM | | Patient Subtyping via Time-Aware LSTM Networks | 2017 | |
Neural Networks | RAIM | | RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data | 2018 | |
Neural Networks | StageNet | | StageNet: Stage-Aware Neural Networks for Health Risk Prediction | 2020 | |

For image data:
For ECG/EEG data:
Type | Abbr | Class | Algorithm | Year | Ref |
---|---|---|---|---|---|
Classical Models | RandomForest | pyhealth.models.ecg.rf | Random Forests | 2001 | |
Classical Models | XGBoost | pyhealth.models.ecg.xgboost | XGBoost: A scalable tree boosting system | 2016 | |
Neural Networks | BasicCNN1D | pyhealth.models.ecg.conv1d | Face recognition: A convolutional neural-network approach | 1997 | |
Neural Networks | DBLSTM-WS | pyhealth.models.ecg.dblstm_ws | A novel wavelet sequence based on deep bidirectional LSTM network model for ECG signal classification | 2018 | |
Neural Networks | DeepRes1D | pyhealth.models.ecg.deepres1d | Heartbeat classification using deep residual convolutional neural network from 2-lead electrocardiogram | 2019 | |
Neural Networks | AE+BiLSTM | pyhealth.models.ecg.sdaelstm | Automatic Classification of CAD ECG Signals With SDAE and Bidirectional Long Short-Term Network | 2019 | |
Neural Networks | KRCRnet | pyhealth.models.ecg.rcrnet | K-margin-based Residual-Convolution-Recurrent Neural Network for Atrial Fibrillation Detection | 2019 | |
Neural Networks | MINA | pyhealth.models.ecg.mina | MINA: Multilevel Knowledge-Guided Attention for Modeling Electrocardiography Signals | 2019 | |

Examples of running ML and DL models can be found below, or directly at \examples\learning_examples\
(iii) Evaluation Metrics:
Type | Abbr | Metric | Method |
---|---|---|---|
Binary Classification | average_precision_score | Compute micro/macro average precision (AP) from prediction scores | pyhealth.evaluation.xxx.get_avg_results |
Binary Classification | roc_auc_score | Compute micro/macro ROC AUC score from prediction scores | pyhealth.evaluation.xxx.get_avg_results |
Binary Classification | recall, precision, f1 | Get recall, precision, and f1 values | pyhealth.evaluation.xxx.get_predict_results |
Multi Classification | To be done here | | |

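
To make the binary classification metrics concrete, here is a minimal pure-Python sketch of precision, recall, F1, and ROC AUC. It is not PyHealth's evaluation code: the function names `precision_recall_f1` and `roc_auc`, the 0.5 threshold, and the sample data are assumptions made for illustration. ROC AUC is computed here from its pairwise-ranking definition (the chance that a positive example scores above a negative one).

```python
from typing import Dict, Sequence


def precision_recall_f1(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold the scores, then compute precision, recall, and F1."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and prediction scores for four patients.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.2]

prf = precision_recall_f1(y_true, y_score)
auc = roc_auc(y_true, y_score)
```

The pairwise loop is quadratic in the number of samples, which is fine for a sanity check but not for large test sets; a rank-based implementation from an established library is the better choice there.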
(iv) Supported Tasks:
Type | Abbr | Description | Method |
---|---|---|---|
Multi-classification | phenotyping | Predict the diagnosis code of a patient based on other information, e.g., procedures | \examples\data_generation\generate_phenotyping_xxx.py |
Binary Classification | mortality prediction | Predict whether a patient may pass away during the hospital stay | \examples\data_generation\generate_mortality_xxx.py |
Regression | ICU stay length pred | Forecast the length of an ICU stay | \examples\data_generation\generate_icu_length_xxx.py |

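
The three task types differ mainly in the shape of the prediction target. The sketch below shows one way such targets could be derived from a single ICU stay record. It does not reproduce the generation scripts listed in the table: the `IcuStay` class, its field names, and the indicator-vector encoding of diagnosis codes are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class IcuStay:
    """Hypothetical ICU stay record; the field names are illustrative only."""
    intime: datetime
    outtime: datetime
    died_in_hospital: bool
    diagnosis_codes: List[str]


def mortality_label(stay: IcuStay) -> int:
    """Binary classification target: 1 if the patient died during the stay."""
    return int(stay.died_in_hospital)


def length_of_stay_days(stay: IcuStay) -> float:
    """Regression target: ICU stay length in days."""
    return (stay.outtime - stay.intime).total_seconds() / 86400.0


def phenotyping_label(stay: IcuStay, code_vocab: List[str]) -> List[int]:
    """Phenotyping target: one indicator per diagnosis code in the vocabulary."""
    present = set(stay.diagnosis_codes)
    return [1 if code in present else 0 for code in code_vocab]


# Made-up example stay and code vocabulary.
stay = IcuStay(
    intime=datetime(2020, 1, 1, 8, 0),
    outtime=datetime(2020, 1, 4, 20, 0),
    died_in_hospital=False,
    diagnosis_codes=["4019", "25000"],
)
vocab = ["25000", "4019", "5849"]

mortality = mortality_label(stay)
los_days = length_of_stay_days(stay)
phenotypes = phenotyping_label(stay, vocab)
```

A model for each task would then be trained against the matching target: a 0/1 label for mortality, a real number for stay length, and a vector of code indicators for phenotyping.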
Algorithm Benchmark
A comparison of the implemented models will be made available later with a benchmark paper. TBA soon :)
References
- [ABre01] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.