Welcome to PyHealth!#


Introduction [Video]#

PyHealth supports diverse electronic health record (EHR) datasets, such as MIMIC and eICU, as well as all OMOP-CDM based databases. It provides a range of advanced deep learning algorithms for important healthcare tasks, such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.

Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.

Modules#

All healthcare tasks in our package follow a five-stage pipeline:

load dataset -> define task function -> build ML/DL model -> model training -> inference

We try hard to keep each stage as separate as possible, so that people can customize their own pipeline by using only our data processing steps or only the ML models. Each stage calls one module, and we introduce them through an example.

An ML Pipeline Example#

  • STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.

from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to ATC level-3 codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)

Users can also store their own dataset in our <pyhealth.datasets.SampleBaseDataset> structure and then follow the same pipeline below; see the Tutorial.
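To make the Patient-Visit-Event idea concrete, here is a minimal sketch built from plain dataclasses. The class and field names (Event, Visit, Patient, add_event, get_codes) are illustrative assumptions and are not PyHealth's actual classes; the point is only that events are nested under visits, and visits under patients, independently of any task.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """One coded clinical event, e.g. a diagnosis or a prescription."""
    code: str
    table: str        # source table, e.g. "DIAGNOSES_ICD"
    vocabulary: str   # coding system, e.g. "ICD9CM"


@dataclass
class Visit:
    """One hospital visit, holding its events grouped by source table."""
    visit_id: str
    events: Dict[str, List[Event]] = field(default_factory=dict)

    def add_event(self, event: Event) -> None:
        self.events.setdefault(event.table, []).append(event)

    def get_codes(self, table: str) -> List[str]:
        return [e.code for e in self.events.get(table, [])]


@dataclass
class Patient:
    """One patient, holding visits in chronological order."""
    patient_id: str
    visits: List[Visit] = field(default_factory=list)


# Toy data, invented for illustration only.
visit = Visit(visit_id="v1")
visit.add_event(Event("428.0", "DIAGNOSES_ICD", "ICD9CM"))
visit.add_event(Event("A03F", "PRESCRIPTIONS", "ATC"))
patient = Patient(patient_id="p1", visits=[visit])

diagnosis_codes = patient.visits[0].get_codes("DIAGNOSES_ICD")
print(diagnosis_codes)
```

Because the structure carries no task-specific labels, the same patient object could feed a drug recommendation task or a mortality prediction task without being rebuilt.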

  • STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object as input and defines how to process each patient’s data into a set of samples for the task. The package provides several example tasks, such as drug recommendation and length of stay prediction.

from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader

mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])

# create dataloaders (torch.utils.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
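
For readers who want to write their own task, the following sketch shows the general shape of a task function: it receives one patient's data and returns a list of samples. The patient layout (a list of dicts of code lists) and the name toy_drug_recommendation_fn are assumptions made for this example; the functions shipped in pyhealth.tasks operate on PyHealth's own patient objects.

```python
from typing import Dict, List

# A patient is modeled here as a list of visits; each visit is a dict of code lists.
# This toy layout is an assumption for illustration, not PyHealth's data model.
Visit = Dict[str, List[str]]


def toy_drug_recommendation_fn(patient_id: str, visits: List[Visit]) -> List[dict]:
    """Turn one patient's visits into zero or more samples."""
    samples = []
    for i, visit in enumerate(visits):
        conditions = visit.get("conditions", [])
        procedures = visit.get("procedures", [])
        drugs = visit.get("drugs", [])
        # Skip visits that lack any input feature or the label.
        if not conditions or not procedures or not drugs:
            continue
        samples.append({
            "patient_id": patient_id,
            "visit_index": i,
            "conditions": conditions,
            "procedures": procedures,
            "drugs": drugs,
        })
    return samples


visits = [
    {"conditions": ["428.0"], "procedures": ["3893"], "drugs": ["A03F", "A04A"]},
    {"conditions": ["4019"], "procedures": [], "drugs": ["A12B"]},  # no procedures
]
samples = toy_drug_recommendation_fn("p1", visits)
print(len(samples))
```

Keeping the patient identifier in every sample is what makes a patient-level split possible later, so that visits from the same patient never end up in both the training and the test set.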
  • STEP 3: <pyhealth.models> provides the healthcare ML models. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized ML architectures. Our model layers can be used as easily as torch.nn.Linear.

from pyhealth.models import Transformer

model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
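
The multilabel mode used above means that one sample can carry several correct labels at once (here, several drugs). As a rough sketch of the idea, and not of the model's internals, labels can be encoded as a multi-hot vector and predictions obtained by thresholding an independent sigmoid per label. The label space, logits, and the 0.5 threshold below are made-up values.

```python
import math
from typing import List

label_space = ["A03F", "A04A", "A12B"]  # toy drug vocabulary (assumed)
label_index = {code: i for i, code in enumerate(label_space)}


def to_multi_hot(drugs: List[str]) -> List[int]:
    """Encode a set of drugs as a 0/1 vector over the label space."""
    vector = [0] * len(label_space)
    for code in drugs:
        vector[label_index[code]] = 1
    return vector


def predict_labels(logits: List[float], threshold: float = 0.5) -> List[str]:
    """Apply an independent sigmoid to each logit and keep labels above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [code for code, p in zip(label_space, probs) if p > threshold]


target = to_multi_hot(["A03F", "A12B"])
predicted = predict_labels([2.0, -1.0, 0.3])
print(target, predicted)
```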
  • STEP 4: <pyhealth.trainer> is the training manager. Pass it the train_loader, the val_loader, and a validation metric to monitor, and specify other arguments such as epochs, optimizer, and learning rate. The trainer automatically saves the best model and outputs its path at the end.

from pyhealth.trainer import Trainer

trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
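
The "save the best model" behavior can be pictured as a loop that compares the monitored validation metric after every epoch. The sketch below is a simplified illustration under stated assumptions: train_with_monitor and run_epoch are hypothetical names, the scores are fake, and a higher score is assumed to be better. It does not reproduce the Trainer's actual implementation.

```python
from typing import Callable, Dict, Optional, Tuple


def train_with_monitor(
    run_epoch: Callable[[int], Dict[str, float]],
    epochs: int,
    monitor: str,
) -> Tuple[Optional[int], float]:
    """Run `epochs` epochs and remember the epoch with the best monitored score.

    `run_epoch` is a hypothetical callback that trains for one epoch and
    returns validation metrics as a dict; higher is assumed to be better.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch in range(epochs):
        scores = run_epoch(epoch)
        if scores[monitor] > best_score:
            best_epoch, best_score = epoch, scores[monitor]
            # A real trainer would write a model checkpoint to disk here.
    return best_epoch, best_score


# Fake validation scores standing in for real training.
fake_scores = [0.41, 0.55, 0.62, 0.60]
best_epoch, best_score = train_with_monitor(
    run_epoch=lambda epoch: {"pr_auc_samples": fake_scores[epoch]},
    epochs=len(fake_scores),
    monitor="pr_auc_samples",
)
print(best_epoch, best_score)
```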
  • STEP 5: <pyhealth.metrics> provides several common evaluation metrics (see the documentation for the available ones) as well as healthcare-specific metrics, such as the drug-drug interaction (DDI) rate.

trainer.evaluate(test_loader)
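
As an illustration of a healthcare-specific metric, the sketch below computes a DDI rate under one common definition: among all drug pairs that appear together in a recommendation, the fraction that is listed as a known interaction. The function name, the data layout, and the interaction set are assumptions for this example; the exact definition used by pyhealth.metrics may differ.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def ddi_rate(recommendations: List[List[str]], ddi_pairs: Set[FrozenSet[str]]) -> float:
    """Fraction of recommended drug pairs that appear in the known-DDI set."""
    total_pairs = 0
    interacting_pairs = 0
    for drugs in recommendations:
        for a, b in combinations(sorted(set(drugs)), 2):
            total_pairs += 1
            if frozenset((a, b)) in ddi_pairs:
                interacting_pairs += 1
    return interacting_pairs / total_pairs if total_pairs else 0.0


# Toy interaction set; the pairs are invented and carry no clinical meaning.
known_ddi = {frozenset(("A03F", "A12B"))}
recommended = [["A03F", "A04A", "A12B"], ["A04A", "A12B"]]
rate = ddi_rate(recommended, known_ddi)
print(rate)
```

A lower rate is better here: it means fewer of the recommended combinations contain drugs that are known to interact.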

Medical Code Map#

  • <pyhealth.medcode> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be used independently in your research.

  • For code mapping between two coding systems

from pyhealth.medcode import CrossMap

codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict

codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
  • For code ontology lookup within one system

from pyhealth.medcode import InnerMap

icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get parents
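
Conceptually, the two functionalities reduce to two lookup structures: a table from source codes to target codes, and a hierarchy of parent links within one system. The sketch below shows both with toy data. The cross-map entries are placeholders rather than real mappings, the hierarchy is a simplified fragment for illustration, and map_code is a hypothetical helper name, not part of the CrossMap or InnerMap API.

```python
from typing import Dict, List

# Toy cross-system table: one source code may map to several target codes.
# The entries below are placeholders, not real mappings.
cross_map: Dict[str, List[str]] = {
    "SRC001": ["TGT_A"],
    "SRC002": ["TGT_A", "TGT_B"],
}

# Toy within-system hierarchy stored as child -> parent links (simplified).
parent_of: Dict[str, str] = {
    "428.0": "428",
    "428": "420-429",
    "420-429": "390-459",
}


def map_code(code: str) -> List[str]:
    """Return all target codes for a source code, or an empty list if unmapped."""
    return cross_map.get(code, [])


def get_ancestors(code: str) -> List[str]:
    """Walk parent links from a code up to the root, nearest ancestor first."""
    ancestors = []
    while code in parent_of:
        code = parent_of[code]
        ancestors.append(code)
    return ancestors


print(map_code("SRC002"))
print(get_ancestors("428.0"))
```

Returning a list from the cross-system lookup reflects that mappings between coding systems are often one-to-many.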

Medical Code Tokenizer#

  • <pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.

from pyhealth.tokenizer import Tokenizer

# Example: we use a list of ATC3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', \
        'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C', \
        'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])

# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]

# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
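
To show what happens behind a 2D encode and decode, here is a small stand-alone sketch. ToyTokenizer is a hypothetical class written for this example and is not the pyhealth.tokenizer implementation; it assumes that special tokens take the lowest indices, that unknown tokens map to <unk>, and that shorter rows are right-padded with <pad>.

```python
from typing import Dict, List


class ToyTokenizer:
    """Minimal token <-> index mapper with padding and unknown-token handling."""

    def __init__(self, tokens: List[str], special_tokens: List[str]):
        # Special tokens take the lowest indices, followed by the regular tokens.
        vocabulary = special_tokens + tokens
        self.token2idx: Dict[str, int] = {t: i for i, t in enumerate(vocabulary)}
        self.idx2token: Dict[int, str] = {i: t for t, i in self.token2idx.items()}
        self.pad = self.token2idx["<pad>"]
        self.unk = self.token2idx["<unk>"]

    def batch_encode_2d(self, batch: List[List[str]]) -> List[List[int]]:
        """Encode a list of token lists and right-pad them to equal length."""
        max_len = max(len(row) for row in batch)
        encoded = []
        for row in batch:
            ids = [self.token2idx.get(t, self.unk) for t in row]
            encoded.append(ids + [self.pad] * (max_len - len(ids)))
        return encoded

    def batch_decode_2d(self, batch: List[List[int]]) -> List[List[str]]:
        """Decode index lists back to tokens, dropping padding."""
        return [[self.idx2token[i] for i in row if i != self.pad] for row in batch]


toy = ToyTokenizer(tokens=["A03C", "A03D", "A04A"], special_tokens=["<pad>", "<unk>"])
indices = toy.batch_encode_2d([["A03C", "A03D", "A04A"], ["A04A", "B035"]])
tokens = toy.batch_decode_2d(indices)
print(indices)  # [[2, 3, 4], [4, 1, 0]]
print(tokens)   # [['A03C', 'A03D', 'A04A'], ['A04A', '<unk>']]
```

Note that the round trip is lossy for unknown tokens: 'B035' comes back as '<unk>' because it was never part of the token space.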

Users can customize their healthcare AI pipeline as simply as calling one module:

  • process your OMOP data via pyhealth.datasets

  • process open EHR data (e.g., MIMIC, eICU) via pyhealth.datasets

  • define your own task on existing databases via pyhealth.tasks

  • use existing healthcare models or build upon them (e.g., RETAIN) via pyhealth.models

  • map codes for conditions and medications via pyhealth.medcode


Datasets#

We provide the following datasets for general purpose healthcare AI research:

| Dataset | Module | Year | Information |
|---|---|---|---|
| MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | MIMIC-III Clinical Database |
| MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | MIMIC-IV Clinical Database |
| eICU | pyhealth.datasets.eICUDataset | 2018 | eICU Collaborative Research Database |
| OMOP | pyhealth.datasets.OMOPDataset | | OMOP-CDM schema based dataset |
| SleepEDF | pyhealth.datasets.SleepEDFDataset | 2018 | Sleep-EDF dataset |
| SHHS | pyhealth.datasets.SHHSDataset | 2016 | Sleep Heart Health Study dataset |
| ISRUC | pyhealth.datasets.ISRUCDataset | 2016 | ISRUC-SLEEP dataset |

Machine/Deep Learning Models#

| Model Name | Type | Module | Year | Summary | Reference |
|---|---|---|---|---|---|
| Multi-layer Perceptron | deep learning | pyhealth.models.MLP | 1986 | MLP treats each feature as static | Backpropagation: theory, architectures, and applications |
| Convolutional Neural Network (CNN) | deep learning | pyhealth.models.CNN | 1989 | CNN runs on the conceptual patient-by-visit grids | Handwritten Digit Recognition with a Back-Propagation Network |
| Recurrent Neural Nets (RNN) | deep learning | pyhealth.models.RNN | 2011 | RNN (includes LSTM and GRU) can run on any sequential level (e.g., visit-by-visit sequences) | Recurrent neural network based language model |
| Transformer | deep learning | pyhealth.models.Transformer | 2017 | Transformer can run on any sequential level (e.g., visit-by-visit sequences) | Attention Is All You Need |
| RETAIN | deep learning | pyhealth.models.RETAIN | 2016 | RETAIN uses two RNNs to learn patient embeddings while providing feature-level and visit-level importance | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism |
| GAMENet | deep learning | pyhealth.models.GAMENet | 2019 | GAMENet uses memory networks; used only for the drug recommendation task | GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination |
| MICRON | deep learning | pyhealth.models.MICRON | 2021 | MICRON predicts the future drug combination by predicting the changes w.r.t. the current combination; used only for the drug recommendation task | Change Matters: Medication Change Prediction with Recurrent Residual Networks |
| SafeDrug | deep learning | pyhealth.models.SafeDrug | 2021 | SafeDrug encodes drug molecule structures with graph neural networks; used only for the drug recommendation task | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations |
| MoleRec | deep learning | pyhealth.models.MoleRec | 2023 | MoleRec encodes drug molecules at the substructure level, together with the patient’s information, into a drug combination representation; used only for the drug recommendation task | MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning |
| Deepr | deep learning | pyhealth.models.Deepr | 2017 | Deepr is based on 1D CNN. General purpose | Deepr: A Convolutional Net for Medical Records |
| ContraWR Encoder (STFT+CNN) | deep learning | pyhealth.models.ContraWR | 2021 | ContraWR encoder uses short-time Fourier transform (STFT) + 2D CNN; used for biosignal learning | Self-supervised EEG Representation Learning for Automatic Sleep Staging |
| SparcNet (1D CNN) | deep learning | pyhealth.models.SparcNet | 2023 | SparcNet is based on 1D CNN; used for biosignal learning | Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation |
| TCN | deep learning | pyhealth.models.TCN | 2018 | TCN is based on dilated 1D CNN. General purpose | Temporal Convolutional Networks |
| AdaCare | deep learning | pyhealth.models.AdaCare | 2020 | AdaCare uses CNNs with dilated filters to learn enriched patient embeddings. It uses a feature calibration module to provide feature-level and visit-level interpretability | AdaCare: Explainable Clinical Health Status Representation Learning via Scale-Adaptive Feature Extraction and Recalibration |
| ConCare | deep learning | pyhealth.models.ConCare | 2020 | ConCare uses transformers to learn patient embeddings and calculate inter-feature correlations | ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context |
| StageNet | deep learning | pyhealth.models.StageNet | 2020 | StageNet uses a stage-aware LSTM to conduct clinical predictive tasks while learning changes in patient disease progression stage in an unsupervised manner | StageNet: Stage-Aware Neural Networks for Health Risk Prediction |
| Dr. Agent | deep learning | pyhealth.models.Agent | 2020 | Dr. Agent uses two reinforcement learning agents to learn patient embeddings by mimicking clinical second opinions | Dr. Agent: Clinical predictive model via mimicked second opinions |
| GRASP | deep learning | pyhealth.models.GRASP | 2021 | GRASP uses a graph neural network to identify latent patient clusters and uses the clustering information to learn patient | GRASP: Generic Framework for Health Status Representation Learning Based on Incorporating Knowledge from Similar Patients |

Benchmark on Healthcare Tasks#

  • Here is our benchmark doc on healthcare tasks. You can also check this below.

We also provide a function for leaderboard generation; check it out in our GitHub repo.

Here are dynamic visualizations of the leaderboard. You can click the checkboxes to compare the performance of different models on different tasks and datasets.

import sys
sys.path.append('../..')

from leaderboard import leaderboard_gen, utils
args = leaderboard_gen.construct_args()
leaderboard_gen.plots_generation(args)