Welcome to PyHealth!#


PyHealth is designed for both ML researchers and medical practitioners. It makes healthcare AI applications easier to develop, test, and validate, and keeps your development process flexible and customizable. [GitHub]


[News!] PyHealth has been accepted to the KDD 2023 Tutorial Track! We will present a 3-hour tutorial on PyHealth at [KDD 2023], August 6-10, Long Beach, CA.

Introduction [Video]#

PyHealth supports diverse electronic health records (EHRs) such as MIMIC and eICU, as well as all OMOP-CDM based databases, and provides advanced deep learning algorithms for important healthcare tasks such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.

Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.

Modules#

All healthcare tasks in our package follow a five-stage pipeline:

load dataset -> define task function -> build ML/DL model -> model training -> inference

We try hard to keep each stage as separate as possible, so that users can customize their own pipeline by using only our data processing steps or only the ML models. Each stage calls one module, and we introduce them below with an example.

An ML Pipeline Example#

  • STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.

from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to ATC 3-rd level codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)

Users can also store their own dataset in our <pyhealth.datasets.SampleBaseDataset> structure and then follow the same pipeline below; see the Tutorial.
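If you already have task-ready samples, a minimal sketch of wrapping them directly (the "conditions" and "label" keys here are illustrative; only patient_id and visit_id are required):

from pyhealth.datasets import SampleEHRDataset

# hand-made samples; feature and label keys are illustrative
samples = [
    {"patient_id": "p001", "visit_id": "v001",
     "conditions": ["428.0", "401.9"], "label": 1},
    {"patient_id": "p001", "visit_id": "v002",
     "conditions": ["428.0"], "label": 0},
]
custom_dataset = SampleEHRDataset(samples=samples, task_name="my_task")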

  • STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object as input and defines how to process each patient’s data into a set of samples for the task. The package provides several task examples, such as drug recommendation and length of stay prediction.

from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader

mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])

# create dataloaders (torch.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
  • STEP 3: <pyhealth.models> provides the healthcare ML models. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized architectures. Our model layers can be used as easily as torch.nn.Linear (see the layer sketch after the Transformer example below).

from pyhealth.models import Transformer

model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
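For instance, a hedged sketch of using a layer directly in plain PyTorch (the constructor argument below is an assumption; check the RETAINLayer docs for the exact signature):

import torch
from pyhealth.models import RETAINLayer

# a batch of 3 patients, 10 visits each, 64-dim visit embeddings
x = torch.randn(3, 10, 64)
layer = RETAINLayer(feature_size=64)  # assumed argument name
out = layer(x)                        # pooled patient representation, shape [3, 64]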
  • STEP 4: <pyhealth.trainer> is the training manager. Pass in the train_loader, the val_loader, and a validation metric, and specify other arguments such as epochs, optimizer, and learning rate. The trainer will automatically save the best model and output its path at the end.

from pyhealth.trainer import Trainer

trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
  • STEP 5: <pyhealth.metrics> provides common evaluation metrics (refer to the docs to see what is available) as well as healthcare-specific metrics, such as the drug-drug interaction (DDI) rate.

trainer.evaluate(test_loader)
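Metrics can also be called directly on predictions outside the trainer. A hedged sketch using the multilabel helper (function and metric names assumed from the metrics module):

import numpy as np
from pyhealth.metrics import multilabel_metrics_fn

# toy ground truth and predicted probabilities: 2 samples x 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
print(multilabel_metrics_fn(y_true, y_prob, metrics=["pr_auc_samples", "f1_samples"]))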

Medical Code Map#

  • <pyhealth.medcode> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be independently applied to your research.

  • For code mapping between two coding systems

from pyhealth.medcode import CrossMap

codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict

codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
  • For code ontology lookup within one system

from pyhealth.medcode import InnerMap

icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get parents

Medical Code Tokenizer#

  • <pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.

from pyhealth.tokenizer import Tokenizer

# Example: we use a list of ATC3 code as the token
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', \
        'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C', \
        'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])

# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]

# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
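Nested (3D) inputs, e.g., several visits each holding a list of codes, are handled analogously; a short sketch (3D helper names assumed by analogy with the 2D ones):

# 3d encode: one patient = list of visits, one visit = list of codes
tokens = [[['A03C', 'A03D'], ['A04A', 'A05B']], [['A02B', 'A02X']]]
indices = tokenizer.batch_encode_3d(tokens)

# 3d decode back to the original tokens
tokens = tokenizer.batch_decode_3d(indices)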

Users can customize their healthcare AI pipeline as simply as calling one module:

  • process your OMOP data via pyhealth.datasets

  • process open EHR data (e.g., MIMIC-III, eICU) via pyhealth.datasets

  • define your own task on existing databases via pyhealth.tasks (see the sketch after this list)

  • use existing healthcare models (e.g., RETAIN) or build upon them via pyhealth.models

  • map codes between coding systems for conditions and medications via pyhealth.medcode
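For example, a hedged sketch of a user-defined task function that can be plugged into set_task (feature and label names are illustrative):

def my_mortality_task_fn(patient):
    # a minimal sketch: one sample per visit; returning an empty list
    # excludes the patient from the task dataset
    samples = []
    for i in range(len(patient.visits)):
        visit = patient.get_visit_by_index(i)
        conditions = visit.get_code_list(table="DIAGNOSES_ICD")
        if not conditions:
            continue
        samples.append({
            "patient_id": patient.patient_id,
            "visit_id": visit.visit_id,
            "conditions": conditions,
            "label": int(visit.discharge_status == 1),  # assumed label encoding
        })
    return samples

my_sample_dataset = mimic3base.set_task(task_fn=my_mortality_task_fn)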


Datasets#

We provide the following datasets for general purpose healthcare AI research:

| Dataset | Module | Year | Information |
|---|---|---|---|
| MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | MIMIC-III Clinical Database |
| MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | MIMIC-IV Clinical Database |
| eICU | pyhealth.datasets.eICUDataset | 2018 | eICU Collaborative Research Database |
| OMOP | pyhealth.datasets.OMOPDataset | - | OMOP-CDM schema based dataset |
| SleepEDF | pyhealth.datasets.SleepEDFDataset | 2018 | Sleep-EDF dataset |
| SHHS | pyhealth.datasets.SHHSDataset | 2016 | Sleep Heart Health Study dataset |
| ISRUC | pyhealth.datasets.ISRUCDataset | 2016 | ISRUC-SLEEP dataset |

Machine/Deep Learning Models#

| Model Name | Type | Module | Year | Summary | Reference |
|---|---|---|---|---|---|
| Multi-layer Perceptron | deep learning | pyhealth.models.MLP | 1986 | MLP treats each feature as static | Backpropagation: theory, architectures, and applications |
| Convolutional Neural Network (CNN) | deep learning | pyhealth.models.CNN | 1989 | CNN runs on the conceptual patient-by-visit grids | Handwritten Digit Recognition with a Back-Propagation Network |
| Recurrent Neural Nets (RNN) | deep learning | pyhealth.models.RNN | 2011 | RNN (includes LSTM and GRU) can run on any sequential level (e.g., visit-by-visit sequences) | Recurrent neural network based language model |
| Transformer | deep learning | pyhealth.models.Transformer | 2017 | Transformer can run on any sequential level (e.g., visit-by-visit sequences) | Attention Is All You Need |
| RETAIN | deep learning | pyhealth.models.RETAIN | 2016 | RETAIN uses two RNNs to learn patient embeddings while providing feature-level and visit-level importance | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism |
| GAMENet | deep learning | pyhealth.models.GAMENet | 2019 | GAMENet uses memory networks; drug recommendation task only | GAMENet: Graph Augmented Memory Networks for Recommending Medication Combination |
| MICRON | deep learning | pyhealth.models.MICRON | 2021 | MICRON predicts the future drug combination by instead predicting the changes w.r.t. the current combination; drug recommendation task only | Change Matters: Medication Change Prediction with Recurrent Residual Networks |
| SafeDrug | deep learning | pyhealth.models.SafeDrug | 2021 | SafeDrug encodes drug molecule structures with graph neural networks; drug recommendation task only | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations |
| MoleRec | deep learning | pyhealth.models.MoleRec | 2023 | MoleRec encodes drug molecules at the substructure level, together with the patient's information, into a drug combination representation; drug recommendation task only | MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning |
| Deepr | deep learning | pyhealth.models.Deepr | 2017 | Deepr is based on 1D CNN; general purpose | Deepr: A Convolutional Net for Medical Records |
| ContraWR Encoder (STFT+CNN) | deep learning | pyhealth.models.ContraWR | 2021 | ContraWR encoder uses short-time Fourier transform (STFT) + 2D CNN; used for biosignal learning | Self-supervised EEG Representation Learning for Automatic Sleep Staging |
| SparcNet (1D CNN) | deep learning | pyhealth.models.SparcNet | 2023 | SparcNet is based on 1D CNN; used for biosignal learning | Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation |
| TCN | deep learning | pyhealth.models.TCN | 2018 | TCN is based on dilated 1D CNN; general purpose | Temporal Convolutional Networks |
| AdaCare | deep learning | pyhealth.models.AdaCare | 2020 | AdaCare uses CNNs with dilated filters to learn enriched patient embeddings; its feature calibration module provides feature-level and visit-level interpretability | AdaCare: Explainable Clinical Health Status Representation Learning via Scale-Adaptive Feature Extraction and Recalibration |
| ConCare | deep learning | pyhealth.models.ConCare | 2020 | ConCare uses transformers to learn patient embeddings and calculate inter-feature correlations | ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context |
| StageNet | deep learning | pyhealth.models.StageNet | 2020 | StageNet uses a stage-aware LSTM for clinical predictive tasks while learning patient disease-progression stage changes in an unsupervised way | StageNet: Stage-Aware Neural Networks for Health Risk Prediction |
| Dr. Agent | deep learning | pyhealth.models.Agent | 2020 | Dr. Agent uses two reinforcement learning agents to learn patient embeddings by mimicking clinical second opinions | Dr. Agent: Clinical predictive model via mimicked second opinions |
| GRASP | deep learning | pyhealth.models.GRASP | 2021 | GRASP uses a graph neural network to identify latent patient clusters and uses the clustering information to learn patient representations | GRASP: Generic Framework for Health Status Representation Learning Based on Incorporating Knowledge from Similar Patients |

Benchmark on Healthcare Tasks#

  • Here is our benchmark doc on healthcare tasks. You can also check the results below.

We also provide a function for leaderboard generation; check it out in our GitHub repo.

Here are the dynamic visualizations of the leaderboard. You can click the checkboxes to easily compare the performance of different models across tasks and datasets!

import sys
sys.path.append('../..')

from leaderboard import leaderboard_gen, utils
args = leaderboard_gen.construct_args()
leaderboard_gen.plots_generation(args)

Installation#

You can install PyHealth from PyPI:

pip install pyhealth

or from the GitHub source:

git clone https://github.com/sunlabuiuc/PyHealth.git
cd PyHealth
pip install .

Required Dependencies:

python>=3.8
torch>=1.8.0
rdkit>=2022.03.4
scikit-learn>=0.24.2
networkx>=2.6.3
pandas>=1.3.2
tqdm

Warning 1:

PyHealth has multiple neural network based models, e.g., LSTM, which are implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you. This reduces the risk of interfering with your local copies. If you want to use neural-net based models, please make sure PyTorch is installed. Similarly, models depending on xgboost would NOT enforce xgboost installation by default.

CUDA Setting:

To run PyHealth on GPU, you also need CUDA and a cudatoolkit version that supports your GPU. More info

For example, if you use an NVIDIA RTX A6000 as your GPU for training, you should install a compatible cudatoolkit using:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
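After installation, a quick sanity check (plain PyTorch, not PyHealth-specific) confirms that the GPU is visible:

import torch

if torch.cuda.is_available():
    print("CUDA OK:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available; models will fall back to CPU")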

Tutorials#

We provide the following tutorials to help users get started with PyHealth.

Tutorial 0: Introduction to pyhealth.data [Video]

Tutorial 1: Introduction to pyhealth.datasets [Video]

Tutorial 2: Introduction to pyhealth.tasks [Video]

Tutorial 3: Introduction to pyhealth.models [Video]

Tutorial 4: Introduction to pyhealth.trainer [Video]

Tutorial 5: Introduction to pyhealth.metrics [Video]

Tutorial 6: Introduction to pyhealth.tokenizer [Video]

Tutorial 7: Introduction to pyhealth.medcode [Video]

The following tutorials will help users build their own task pipelines. [Video]

Pipeline 1: Drug Recommendation

Pipeline 2: Length of Stay Prediction

Pipeline 3: Readmission Prediction

Pipeline 4: Mortality Prediction

Pipeline 5: Sleep Staging


Advanced Tutorials#

We provide advanced tutorials to support various needs.

Advanced Tutorial 1: Fit your dataset into our pipeline [Video]

Advanced Tutorial 2: Define your own healthcare task

Advanced Tutorial 3: Adopt customized model into pyhealth [Video]

Advanced Tutorial 4: Load your own processed data into pyhealth and try out our ML models [Video]


Data#

pyhealth.data defines the atomic data structures of this package.

pyhealth.data.Event#

One basic data structure in the package. It is a simple container for a single event. It contains all necessary attributes for supporting various healthcare tasks.

class pyhealth.data.Event(code=None, table=None, vocabulary=None, visit_id=None, patient_id=None, timestamp=None, **attr)[source]#

Bases: object

Contains information about a single event.

An event can be anything from a diagnosis to a prescription or a lab test that happened in a visit of a patient at a specific time.

Parameters
  • code (Optional[str]) – code of the event. E.g., “428.0” for congestive heart failure.

  • table (Optional[str]) – name of the table where the event is recorded. This corresponds to the raw csv file name in the dataset. E.g., “DIAGNOSES_ICD”.

  • vocabulary (Optional[str]) – vocabulary of the code. E.g., “ICD9CM” for ICD-9 diagnosis codes.

  • visit_id (Optional[str]) – unique identifier of the visit.

  • patient_id (Optional[str]) – unique identifier of the patient.

  • timestamp (Optional[datetime]) – timestamp of the event. Default is None.

  • **attr – optional attributes to add to the event as key=value pairs.

attr_dict#

Dict, dictionary of event attributes. Each key is an attribute name and each value is the attribute’s value.

Examples

>>> from pyhealth.data import Event
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> event
Event with NDC code 00069153041 from table PRESCRIPTIONS
>>> event.attr_dict
{'dosage': '250mg'}

pyhealth.data.Visit#

Another basic data structure in the package. A Visit is a single encounter in a hospital. It is a container for a sequence of Event objects for each aspect of information, such as diagnoses or medications. It also contains other attributes needed for healthcare tasks, such as the date of the visit.

class pyhealth.data.Visit(visit_id, patient_id, encounter_time=None, discharge_time=None, discharge_status=None, **attr)[source]#

Bases: object

Contains information about a single visit.

A visit is a period of time in which a patient is admitted to a hospital or a specific department. Each visit is associated with a patient and contains a list of different events.

Parameters
  • visit_id (str) – unique identifier of the visit.

  • patient_id (str) – unique identifier of the patient.

  • encounter_time (Optional[datetime]) – timestamp of visit’s encounter. Default is None.

  • discharge_time (Optional[datetime]) – timestamp of visit’s discharge. Default is None.

  • discharge_status – patient’s status upon discharge. Default is None.

  • **attr – optional attributes to add to the visit as key=value pairs.

attr_dict#

Dict, dictionary of visit attributes. Each key is an attribute name and each value is the attribute’s value.

event_list_dict#

Dict[str, List[Event]], dictionary of event lists. Each key is a table name and each value is a list of events from that table ordered by timestamp.

Examples

>>> from pyhealth.data import Event, Visit, Patient
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> visit
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> visit.available_tables
['PRESCRIPTIONS']
>>> visit.num_events
1
>>> visit.get_event_list('PRESCRIPTIONS')
[Event with NDC code 00069153041 from table PRESCRIPTIONS]
>>> visit.get_code_list('PRESCRIPTIONS')
['00069153041']
>>> patient = Patient(patient_id="p001")
>>> patient.add_visit(visit)
>>> patient.available_tables
['PRESCRIPTIONS']
>>> patient.get_visit_by_index(0)
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> patient.get_visit_by_index(0).get_code_list(table="PRESCRIPTIONS")
['00069153041']
add_event(event)[source]#

Adds an event to the visit.

If the event’s table is not in the visit’s event list dictionary, it is added as a new key. The event is then added to the list of events of that table.

Parameters

event (Event) – event to add.

Note

As for now, there is no check on the order of the events. The new event is simply appended to the end of the list.

Return type

None

get_event_list(table)[source]#

Returns a list of events from a specific table.

If the table is not in the visit’s event list dictionary, an empty list is returned.

Parameters

table (str) – name of the table.

Return type

List[Event]

Returns

List of events from the specified table.

Note

As for now, there is no check on the order of the events. The list of events is simply returned as is.

get_code_list(table, remove_duplicate=True)[source]#

Returns a list of codes from a specific table.

If the table is not in the visit’s event list dictionary, an empty list is returned.

Parameters
  • table (str) – name of the table.

  • remove_duplicate (Optional[bool]) – whether to remove duplicate codes (but keep the relative order). Default is True.

Return type

List[str]

Returns

List of codes from the specified table.

Note

As for now, there is no check on the order of the codes. The list of codes is simply returned as is.

set_event_list(table, event_list)[source]#

Sets the list of events from a specific table.

This function will overwrite any existing list of events from the specified table.

Parameters
  • table (str) – name of the table.

  • event_list (List[Event]) – list of events to set.

Note

As for now, there is no check on the order of the events. The list of events is simply set as is.

Return type

None

property available_tables: List[str]#

Returns a list of available tables for the visit.

Return type

List[str]

Returns

List of available tables.

property num_events: int#

Returns the total number of events in the visit.

Return type

int

Returns

Total number of events.

pyhealth.data.Patient#

Another basic data structure in the package. A Patient is a collection of Visit objects for a single patient. It contains all necessary attributes of a patient, such as ethnicity, mortality status, gender, etc., and can support various healthcare tasks.

class pyhealth.data.Patient(patient_id, birth_datetime=None, death_datetime=None, gender=None, ethnicity=None, **attr)[source]#

Bases: object

Contains information about a single patient.

A patient is a person who is admitted at least once to a hospital or a specific department. Each patient is associated with a list of visits.

Parameters
  • patient_id (str) – unique identifier of the patient.

  • birth_datetime (Optional[datetime]) – timestamp of patient’s birth. Default is None.

  • death_datetime (Optional[datetime]) – timestamp of patient’s death. Default is None.

  • gender – gender of the patient. Default is None.

  • ethnicity – ethnicity of the patient. Default is None.

  • **attr – optional attributes to add to the patient as key=value pairs.

attr_dict#

Dict, dictionary of patient attributes. Each key is an attribute name and each value is the attribute’s value.

visits#

OrderedDict[str, Visit], an ordered dictionary of visits. Each key is a visit_id and each value is a visit.

index_to_visit_id#

Dict[int, str], dictionary that maps the index of a visit in the visits list to the corresponding visit_id.

Examples

>>> from pyhealth.data import Event, Visit, Patient
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> patient = Patient(
...     patient_id="p001",
... )
>>> patient.add_visit(visit)
>>> patient
Patient p001 with 1 visits
add_visit(visit)[source]#

Adds a visit to the patient.

If the visit’s visit_id is already in the patient’s visits dictionary, it will be overwritten by the new visit.

Parameters

visit (Visit) – visit to add.

Note

As for now, there is no check on the order of the visits. The new visit is simply added to the end of the ordered dictionary of visits.

Return type

None

add_event(event)[source]#

Adds an event to the patient.

If the event’s visit_id is not in the patient’s visits dictionary, this function will raise KeyError.

Parameters

event (Event) – event to add.

Note

As for now, there is no check on the order of the events. The new event is simply appended to the end of the list of events of the corresponding visit.

Return type

None

get_visit_by_id(visit_id)[source]#

Returns a visit by visit_id.

Parameters

visit_id (str) – unique identifier of the visit.

Return type

Visit

Returns

Visit with the given visit_id.

get_visit_by_index(index)[source]#

Returns a visit by its index.

Parameters

index (int) – int, index of the visit to return.

Return type

Visit

Returns

Visit with the given index.

property available_tables: List[str]#

Returns a list of available tables for the patient.

Return type

List[str]

Returns

List of available tables.

Datasets#

pyhealth.datasets.BaseEHRDataset#

This is the basic EHR dataset class. Any specific EHR dataset will inherit from this class.

class pyhealth.datasets.BaseEHRDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: ABC

Abstract base dataset class.

This abstract class defines a uniform interface for all EHR datasets (e.g., MIMIC-III, MIMIC-IV, eICU, OMOP).

Each specific dataset will be a subclass of this abstract class, which can then be converted to samples dataset for different tasks by calling self.set_task().

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]). Basic tables will be loaded by default.

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    • a str of the target code vocabulary, e.g., {"NDC": "ATC"};

    • a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys "source_kwargs" or "target_kwargs" and values of the corresponding kwargs for the CrossMap.map() method. E.g., {"NDC": ("ATC", {"target_kwargs": {"level": 3}})}.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

parse_tables()[source]#

Parses the tables in self.tables and returns a dict of patients.

Will be called in self.__init__() if the cache file does not exist or refresh_cache is True.

This function will first call self.parse_basic_info() to parse the basic patient information, and then call self.parse_[table_name]() to parse the table with name table_name. Both self.parse_basic_info() and self.parse_[table_name]() should be implemented in the subclass.

Return type

Dict[str, Patient]

Returns

A dict mapping patient_id to Patient object.

property available_tables: List[str]#

Returns a list of available tables for the dataset.

Return type

List[str]

Returns

List of available tables.

stat()[source]#

Returns some statistics of the base dataset.

Return type

str

static info()[source]#

Prints the output format.

set_task(task_fn, task_name=None)[source]#

Processes the base dataset to generate the task-specific sample dataset.

This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.

Parameters
  • task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as key). The samples will be concatenated to form the sample dataset.

  • task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.

Returns

the task-specific sample dataset.

Return type

sample_dataset

Note

In task_fn, a patient may be converted to multiple samples, e.g., a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.

pyhealth.datasets.BaseSignalDataset#

This is the basic Signal dataset class. Any specific Signal dataset will inherit from this class.

class pyhealth.datasets.BaseSignalDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#

Bases: ABC

Abstract base Signal dataset class.

This abstract class defines a uniform interface for all EEG datasets (e.g., SleepEDF, SHHS).

Each specific dataset will be a subclass of this abstract class, which can then be converted to samples dataset for different tasks by calling self.set_task().

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

stat()[source]#

Returns some statistics of the base dataset.

Return type

str

static info()[source]#

Prints the output format.

set_task(task_fn, task_name=None)[source]#

Processes the base dataset to generate the task-specific sample dataset.

This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.

Parameters
  • task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as key). The samples will be concatenated to form the sample dataset.

  • task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.

Returns

the task-specific sample (Base) dataset.

Return type

sample_dataset

Note

In task_fn, a patient may be converted to multiple samples, e.g., a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.

pyhealth.datasets.SampleEHRDataset#

This class takes a list of samples as input (either from BaseEHRDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.

class pyhealth.datasets.SampleEHRDataset(samples, dataset_name='', task_name='')[source]#

Bases: SampleBaseDataset

Sample EHR dataset class.

This class inherits from SampleBaseDataset and is specifically designed for EHR datasets.

Parameters
  • samples (List[Dict]) – a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key.

  • dataset_name – the name of the dataset. Default is None.

  • task_name – the name of the task. Default is None.

Currently, the following types of attributes are supported:
  • a single value. Type: int/float/str. Dim: 0.

  • a single vector. Type: int/float. Dim: 1.

  • a list of codes. Type: str. Dim: 2.

  • a list of vectors. Type: int/float. Dim: 2.

  • a list of list of codes. Type: str. Dim: 3.

  • a list of list of vectors. Type: int/float. Dim: 3.

input_info#

Dict, a dict whose keys are the same as the keys in the samples, and values are the corresponding input information:

  • "type": the element type of each key attribute, one of float, int, str.

  • "dim": the list dimension of each key attribute, one of 0, 1, 2, 3.

  • "len": the length of the vector, only valid for vector-based attributes.

patient_to_index#

Dict[str, List[int]], a dict mapping patient_id to a list of sample indices.

visit_to_index#

Dict[str, List[int]], a dict mapping visit_id to a list of sample indices.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "single_vector": [1, 2, 3],
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "single_vector": [1, 5, 8],
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...                 [[7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples)
>>> dataset.input_info
{'patient_id': {'type': <class 'str'>, 'dim': 0}, 'visit_id': {'type': <class 'str'>, 'dim': 0}, 'single_vector': {'type': <class 'int'>, 'dim': 1, 'len': 3}, 'list_codes': {'type': <class 'str'>, 'dim': 2}, 'list_vectors': {'type': <class 'float'>, 'dim': 2, 'len': 3}, 'list_list_codes': {'type': <class 'str'>, 'dim': 3}, 'list_list_vectors': {'type': <class 'float'>, 'dim': 3, 'len': 3}, 'label': {'type': <class 'int'>, 'dim': 0}}
>>> dataset.patient_to_index
{'patient-0': [0, 1]}
>>> dataset.visit_to_index
{'visit-0': [0], 'visit-1': [1]}
property available_keys: List[str]#

Returns a list of available keys for the dataset.

Return type

List[str]

Returns

List of available keys.

get_distribution_tokens(key)[source]#

Gets the distribution of tokens with a specific key in the samples.

Parameters

key (str) – the key of the tokens in the samples.

Returns

a dict mapping token to count.

Return type

distribution
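Continuing the SampleEHRDataset example above, a hedged illustration of the call (output truncated):

>>> dataset.get_distribution_tokens("list_codes")
{'505800458': 1, '50580045810': 1, '50580045811': 1, ...}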

stat()[source]#

Returns some statistics of the task-specific dataset.

Return type

str

pyhealth.datasets.SampleSignalDataset#

This class takes a list of samples as input (either from BaseSignalDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.

class pyhealth.datasets.SampleSignalDataset(samples, dataset_name='', task_name='')[source]#

Bases: SampleBaseDataset

Sample signal dataset class.

This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided input) and provides a uniform interface for accessing the samples.

Parameters
  • samples (List[Dict]) – a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key.

  • classes – a list of classes, e.g., [“W”, “1”, “2”, “3”, “R”].

  • dataset_name – the name of the dataset. Default is None.

  • task_name – the name of the task. Default is None.

stat()[source]#

Returns some statistics of the task-specific dataset.

Return type

str

pyhealth.datasets.MIMIC3Dataset#

The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to the doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC3Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseEHRDataset

Base dataset for MIMIC-III dataset.

The MIMIC-III dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • PATIENTS: defines a patient in the database, SUBJECT_ID.

  • ADMISSIONS: defines a patient’s hospital admission, HADM_ID.

We further support the following tables:
  • DIAGNOSES_ICD: contains ICD-9 diagnoses (ICD9CM code) for patients.

  • PROCEDURES_ICD: contains ICD-9 procedures (ICD9PROC code) for patients.

  • PRESCRIPTIONS: contains medication related order entries (NDC code)

    for patients.

  • LABEVENTS: contains laboratory measurements (MIMIC3_ITEMID code)

    for patients

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> dataset = MIMIC3Dataset(
...         root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...         tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses PATIENTS and ADMISSIONS tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id which is updated with the mimic-3 table result.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses DIAGNOSES_ICD table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set it to None.

parse_procedures_icd(patients)[source]#

Helper function which parses PROCEDURES_ICD table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the PROCEDURES_ICD table, so we set it to None.

parse_prescriptions(patients)[source]#

Helper function which parses PRESCRIPTIONS table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_labevents(patients)[source]#

Helper function which parses LABEVENTS table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.MIMICExtractDataset#

MIMIC-Extract is a cleaned, preprocessed cohort derived from the open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to the doc for more information. We process this data into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMICExtractDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False, pop_size=None, itemid_to_variable_map=None)[source]#

Bases: BaseEHRDataset

Base dataset for MIMIC-Extract dataset.

Reads the HDF5 data produced by [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract#step-4-set-cohort-selection-and-extraction-criteria). Works with files created with or without LEVEL2 grouping and with restricted cohort population sizes, other optional parameter values, and should work with many customized versions of the pipeline.

You can create or obtain a MIMIC-Extract dataset in several ways; any of these will provide you with a set of HDF5 files containing a cleaned subset of the MIMIC-III dataset. This class can be used to read that dataset (mainly the all_hourly_data.h5 file). Consult the MIMIC-Extract documentation for all the options available for dataset generation (cohort selection, aggregation level, etc.).

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain one or more HDF5 files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“vitals_labs”, “interventions”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

  • pop_size (Optional[int]) – If your MIMIC-Extract dataset was created with a pop_size parameter, include it here. This is used to find the correct filenames.

  • itemid_to_variable_map (Optional[str]) – Path to the CSV file used for aggregation mapping during your dataset’s creation. Probably the one located in the MIMIC-Extract repo at resources/itemid_to_variable_map.csv, or your own version if you have customized it.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import MIMICExtractDataset
>>> dataset = MIMICExtractDataset(
...         root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...         tables=["vitals_labs", "interventions"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses patients dataset (within all_hourly_data.h5)

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id which is updated with the mimic-3 table result.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5) in a way compatible with MIMIC3Dataset.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set it to None.

parse_c(patients)[source]#

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5).

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set it to None.

parse_labevents(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.

Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to LABEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done by MIMIC-Extract.

See also self.parse_vitals_labs()

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_chartevents(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.

Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to CHARTEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done in MIMIC-Extract.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_vitals_labs(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5).

Events are added using the MIMIC3_ITEMID vocabulary, and the mapping is determined by the CSV file passed to the constructor in itemid_to_variable_map. Since MIMIC-Extract aggregates like events, only a single MIMIC-III ITEMID will be used to represent all like items in the MIMIC-Extract dataset, so the data here will likely not match raw MIMIC-III data. Which ITEMIDs are used depends on the aggregation level in your dataset (i.e., whether you used --no_group_by_level2).

Will be called in self.parse_tables()

See also self.parse_chartevents() and self.parse_labevents()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_interventions(patients)[source]#

Helper function which parses the interventions dataset (within all_hourly_data.h5). Events are added using the MIMIC3_ITEMID vocabulary, using a manually derived mapping corresponding to general items descriptive of the intervention. Since the raw MIMIC-III data had multiple codes, and MIMIC-Extract aggregates like items, these will not match raw MIMIC-III data.

In particular, note that ITEMID 41491 (“fluid bolus”) is used for crystalloid_bolus and ITEMID 46729 (“Dextran”) is used for colloid_bolus because there is no existing general ITEMID for colloid boluses.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.MIMIC4Dataset#

The open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to the doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC4Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseEHRDataset

Base dataset for MIMIC-IV dataset.

The MIMIC-IV dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • patients: defines a patient in the database, subject_id.

  • admissions: defines a patient’s hospital admission, hadm_id.

We further support the following tables:
  • diagnoses_icd: contains ICD diagnoses (ICD9CM and ICD10CM code)

    for patients.

  • procedures_icd: contains ICD procedures (ICD9PROC and ICD10PROC

    code) for patients.

  • prescriptions: contains medication related order entries (NDC code)

    for patients.

  • labevents: contains laboratory measurements (MIMIC4_ITEMID code)

    for patients

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> dataset = MIMIC4Dataset(
...         root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...         tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper functions which parses patients and admissions tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses diagnosis_icd table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the diagnoses_icd table, so we set it to None.

parse_procedures_icd(patients)[source]#

Helper function which parses procedures_icd table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the procedures_icd table, so we set it to None.

parse_prescriptions(patients)[source]#

Helper function which parses prescriptions table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_labevents(patients)[source]#

Helper function which parses labevents table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_hcpcsevents(patients)[source]#

Helper function which parses hcpcsevents table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the hcpcsevents table, so we set it to None.

pyhealth.datasets.eICUDataset#

The open eICU Collaborative Research Database; refer to the doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.

class pyhealth.datasets.eICUDataset(**kwargs)[source]#

Bases: BaseEHRDataset

Base dataset for eICU dataset.

The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.

The basic information is stored in the following tables:
  • patient: defines a patient (uniquepid), a hospital admission

    (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.

  • hospital: contains information about a hospital (e.g., region).

Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time. Thus, we only know the order of ICU stays within a hospital admission, but not the order of hospital admissions within a patient. As a result, we use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.

We further support the following tables:
  • diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM code)

    and diagnosis information (under attr_dict) for patients

  • treatment: contains treatment information (eICU_TREATMENTSTRING code)

    for patients.

  • medication: contains medication related order entries (eICU_DRUGNAME

    code) for patients.

  • lab: contains laboratory measurements (eICU_LABNAME code)

    for patients

  • physicalExam: contains all physical exam (eICU_PHYSICALEXAMPATH)

    conducted for patients.

  • admissionDx: table contains the primary diagnosis for admission to

    the ICU per the APACHE scoring criteria. (eICU_ADMITDXPATH)

Parameters
  • dataset_name – name of the dataset.

  • root – root directory of the raw data (should contain many csv files).

  • tables – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...         root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...         tables=["diagnosis", "medication", "lab", "treatment", "physicalExam", "admissionDx"],
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper functions which parses patient and hospital tables.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

We use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.

parse_diagnosis(patients)[source]#

Helper function which parses diagnosis table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

This table contains both ICD9CM and ICD10CM codes in a single cell. We need to use medcode to distinguish them.

parse_treatment(patients)[source]#

Helper function which parses treatment table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_medication(patients)[source]#

Helper function which parses medication table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_lab(patients)[source]#

Helper function which parses lab table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_physicalexam(patients)[source]#

Helper function which parses physicalExam table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_admissiondx(patients)[source]#

Helper function which parses admissionDx (admission diagnosis) table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.OMOPDataset#

We can process any OMOP-CDM formatted database; refer to the doc for more information. We parse it into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.

class pyhealth.datasets.OMOPDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseEHRDataset

Base dataset for OMOP dataset.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence.

See: https://www.ohdsi.org/data-standardization/the-common-data-model/.

The basic information is stored in the following tables:
  • person: contains records that uniquely identify each person or patient, and some demographic information.

  • visit_occurrence: contains info for how a patient engages with the healthcare system for a duration of time.

  • death: contains info for how and when a patient dies.

We further support the following tables:
  • condition_occurrence.csv: contains the condition information (CONDITION_CONCEPT_ID code) of patients’ visits.

  • procedure_occurrence.csv: contains the procedure information (PROCEDURE_CONCEPT_ID code) of patients’ visits.

  • drug_exposure.csv: contains the drug information (DRUG_CONCEPT_ID code) of patients’ visits.

  • measurement.csv: contains all laboratory measurements (MEASUREMENT_CONCEPT_ID code) of patients’ visits.

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“condition_occurrence”, “procedure_occurrence”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> dataset = OMOPDataset(
...         root="/srv/local/data/zw12/pyhealth/raw_data/synpuf1k_omop_cdm_5.2.2",
...         tables=["condition_occurrence", "procedure_occurrence", "drug_exposure", "measurement",],
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses the person, visit_occurrence, and death tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_condition_occurrence(patients)[source]#

Helper function which parses condition_occurrence table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_procedure_occurrence(patients)[source]#

Helper function which parses procedure_occurrence table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_drug_exposure(patients)[source]#

Helper function which parses drug_exposure table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_measurement(patients)[source]#

Helper function which parses measurement table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.SleepEDFDataset#

The open Sleep-EDF Database Expanded; refer to the doc for more information.

class pyhealth.datasets.SleepEDFDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#

Bases: BaseSignalDataset

Base EEG dataset for SleepEDF

Dataset is available at https://www.physionet.org/content/sleep-edfx/1.0.0/

For the Sleep Cassette Study portion:
  • The 153 SC* files (SC = Sleep Cassette) were obtained in a 1987-1991 study of age effects on sleep in healthy Caucasians aged 25-101, without any sleep-related medication [2]. Two PSGs of about 20 hours each were recorded during two subsequent day-night periods at the subjects’ homes. Subjects continued their normal activities but wore a modified Walkman-like cassette-tape recorder described in chapter VI.4 (page 92) of Bob’s 1987 thesis [7].

  • Files are named in the form SC4ssNEO-PSG.edf where ss is the subject number, and N is the night. The first nights of subjects 36 and 52, and the second night of subject 13, were lost due to a failing cassette or laserdisk.

  • The EOG and EEG signals were each sampled at 100 Hz. The submental-EMG signal was electronically highpass filtered, rectified and low-pass filtered after which the resulting EMG envelope expressed in uV rms (root-mean-square) was sampled at 1Hz. Oro-nasal airflow, rectal body temperature and the event marker were also sampled at 1Hz.

  • Subjects and recordings are further described in the file headers, the descriptive spreadsheet SC-subjects.xls, and in [2].

For the Sleep Telemetry portion:
  • The 44 ST* files (ST = Sleep Telemetry) were obtained in a 1994 study of temazepam effects on sleep in 22 Caucasian males and females without other medication. Subjects had mild difficulty falling asleep but were otherwise healthy. The PSGs of about 9 hours were recorded in the hospital during two nights, one of which was after temazepam intake, and the other of which was after placebo intake. Subjects wore a miniature telemetry system with very good signal quality described in [8].

  • Files are named in the form ST7ssNJ0-PSG.edf where ss is the subject number, and N is the night.

  • EOG, EMG and EEG signals were sampled at 100 Hz, and the event marker at 1 Hz. The physical marker dimension ID+M-E relates to the fact that pressing the marker (M) button generated two-second deflections from a baseline value that either identifies the telemetry unit (ID = 1 or 2 if positive) or marks an error (E) in the telemetry link if negative. Subjects and recordings are further described in the file headers, the descriptive spreadsheet ST-subjects.xls, and in [1].

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data. You can choose to use the path to Cassette portion or the Telemetry portion.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “sleep staging”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import SleepEDFDataset
>>> dataset = SleepEDFDataset(
...         root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/sleep-cassette",
...     )
>>> dataset.stat()
>>> dataset.info()
process_EEG_data()[source]#

pyhealth.datasets.SHHSDataset#

The open Sleep Heart Health Study (SHHS) database; refer to the doc for more information.

class pyhealth.datasets.SHHSDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#

Bases: BaseSignalDataset

Base EEG dataset for Sleep Heart Health Study (SHHS)

Dataset is available at https://sleepdata.org/datasets/shhs

The Sleep Heart Health Study (SHHS) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing. It tests whether sleep-related breathing is associated with an increased risk of coronary heart disease, stroke, all cause mortality, and hypertension. In all, 6,441 men and women aged 40 years and older were enrolled between November 1, 1995 and January 31, 1998 to take part in SHHS Visit 1. During exam cycle 3 (January 2001- June 2003), a second polysomnogram (SHHS Visit 2) was obtained in 3,295 of the participants. CVD Outcomes data were monitored and adjudicated by parent cohorts between baseline and 2011. More than 130 manuscripts have been published investigating predictors and outcomes of sleep disorders.

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (e.g., the folder containing the polysomnography EDF files).

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “sleep staging”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import SHHSDataset
>>> dataset = SHHSDataset(
...         root="/srv/local/data/shhs/files/polysomnography/edfs",
...     )
>>> dataset.stat()
>>> dataset.info()
parse_patient_id(file_name)[source]#
Parameters

file_name – the file name of the shhs datasets. e.g., shhs1-200001.edf

Returns

the patient id of the shhs datasets. e.g., 200001

Return type

patient_id
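For illustration, a minimal sketch of the file-name parsing described above (a hypothetical re-implementation, not the actual library code):

def parse_patient_id_sketch(file_name):
    # "shhs1-200001.edf" -> "200001"
    return file_name.split("-")[1].split(".")[0]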

process_EEG_data()[source]#

pyhealth.datasets.ISRUCDataset#

The open ISRUC EEG database; refer to the doc for more information.

class pyhealth.datasets.ISRUCDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#

Bases: BaseSignalDataset

Base EEG dataset for ISRUC Group I.

Dataset is available at https://sleeptight.isr.uc.pt/

  • The EEG signals are sampled at 200 Hz.

  • There are 100 subjects in the original dataset.

  • Each subject’s data is about a night’s sleep.

Parameters
  • dataset_name (Optional[str]) – name of the dataset. Default is ‘ISRUCDataset’.

  • root (str) – root directory of the raw data. We expect root/raw to contain all extracted files (.txt, .rec, …). You can also download the data to a new directory by using download=True.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – Whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

  • download – Whether to download the data automatically. Default is False.

Examples

>>> from pyhealth.datasets import ISRUCDataset
>>> dataset = ISRUCDataset(
...         root="/srv/local/data/data/ISRUC-I",
...         download=True,
...     )
>>> dataset.stat()
>>> dataset.info()
process_EEG_data()[source]#

pyhealth.datasets.splitter#

Several data splitting functions for the pyhealth.datasets module to obtain training / validation / test sets.

pyhealth.datasets.splitter.split_by_visit(dataset, ratios, seed=None)[source]#

Splits the dataset by visit (i.e., samples).

Parameters
  • dataset – the sample dataset to be split (e.g., the output of set_task()).

  • ratios – the train / validation / test split ratios (e.g., [0.8, 0.1, 0.1]).

  • seed – random seed for the split. Default is None.

Returns

three subsets of the dataset of type torch.utils.data.Subset.

Return type

train_dataset, val_dataset, test_dataset

Note

The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.
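A minimal usage sketch (sample_dataset is a hypothetical sample dataset produced by set_task(), as in the task examples later on this page):

from pyhealth.datasets.splitter import split_by_visit

# randomly split the samples (visits) into train / validation / test subsets
train_ds, val_ds, test_ds = split_by_visit(sample_dataset, [0.8, 0.1, 0.1], seed=42)

# each subset is a torch.utils.data.Subset; the original dataset is still reachable
assert train_ds.dataset is sample_dataset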

pyhealth.datasets.splitter.split_by_patient(dataset, ratios, seed=None)[source]#

Splits the dataset by patient.

Parameters
  • dataset – the sample dataset to be split (e.g., the output of set_task()).

  • ratios – the train / validation / test split ratios (e.g., [0.8, 0.1, 0.1]).

  • seed – random seed for the split. Default is None.

Returns

three subsets of the dataset of type torch.utils.data.Subset.

Return type

train_dataset, val_dataset, test_dataset

Note

The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.

pyhealth.datasets.utils#

Several utility functions.

pyhealth.datasets.utils.hash_str(s)[source]#
pyhealth.datasets.utils.strptime(s)[source]#

Helper function which parses a string to datetime object.

Parameters

s (str) – str, string to be parsed.

Return type

Optional[datetime]

Returns

Optional[datetime], parsed datetime object. If s is nan, return None.

pyhealth.datasets.utils.padyear(year, month='1', day='1')[source]#

Pad a date time year of format ‘YYYY’ to format ‘YYYY-MM-DD’

Parameters
  • year (str) – str, year to be padded. Must be non-zero value.

  • month – str, month string to be used as padding. Must be in [1, 12]

  • day – str, day string to be used as padding. Must be in [1, 31]

Returns

str, padded year.

Return type

padded_date
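A minimal sketch consistent with the description above (a hypothetical re-implementation; in particular, whether the real padyear zero-pads month and day is not shown here):

def padyear_sketch(year, month="1", day="1"):
    # pad a 'YYYY' year string into a 'YYYY-MM-DD'-style date string using the given month/day
    return f"{year}-{month}-{day}"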

pyhealth.datasets.utils.flatten_list(l)[source]#

Flattens a list of lists.

Parameters

l (List) – List, the list of list to be flattened.

Return type

List

Returns

List, the flattened list.

Examples

>>> flatten_list([[1], [2, 3], [4]])
[1, 2, 3, 4]
>>> flatten_list([[1], [[2], 3], [4]])
[1, [2], 3, 4]
pyhealth.datasets.utils.list_nested_levels(l)[source]#

Gets all the different nested levels of a list.

Parameters

l (List) – the list to be checked.

Return type

Tuple[int]

Returns

All the different nested levels of the list.

Examples

>>> list_nested_levels([])
(1,)
>>> list_nested_levels([1, 2, 3])
(1,)
>>> list_nested_levels([[]])
(2,)
>>> list_nested_levels([[1, 2, 3], [4, 5, 6]])
(2,)
>>> list_nested_levels([1, [2, 3], 4])
(1, 2)
>>> list_nested_levels([[1, [2, 3], 4]])
(2, 3)
pyhealth.datasets.utils.is_homo_list(l)[source]#

Checks if a list is homogeneous.

Parameters

l (List) – the list to be checked.

Return type

bool

Returns

bool, True if the list is homogeneous, False otherwise.

Examples

>>> is_homo_list([1, 2, 3])
True
>>> is_homo_list([])
True
>>> is_homo_list([1, 2, "3"])
False
>>> is_homo_list([1, 2, 3, [4, 5, 6]])
False
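For intuition, a minimal sketch of the homogeneity check implied by the examples above (a hypothetical re-implementation, not the library's actual code):

def is_homo_list_sketch(l):
    # an empty list counts as homogeneous
    if not l:
        return True
    # homogeneous = every element has the same type as the first element
    return all(isinstance(x, type(l[0])) for x in l)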
pyhealth.datasets.utils.collate_fn_dict(batch)[source]#
pyhealth.datasets.utils.get_dataloader(dataset, batch_size, shuffle=False)[source]#

Tasks#

We support various real-world healthcare predictive tasks defined by function calls. The following example tasks are collected from top AI/Medical venues:

  1. Drug Recommendation [Yang et al. IJCAI 2021a, Yang et al. IJCAI 2021b, Shang et al. AAAI 2020]

  2. Readmission Prediction [Choi et al. AAAI 2021]

  3. Mortality Prediction [Choi et al. AAAI 2021]

  4. Length of Stay Prediction

  5. Sleep Staging [Yang et al. ArXiv 2021]

pyhealth.tasks.drug_recommendation#

pyhealth.tasks.drug_recommendation.drug_recommendation_mimic3_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys, like this:

{
    "patient_id": xxx,
    "visit_id": xxx,
    "conditions": [list of diag in visit 1, list of diag in visit 2, ..., list of diag in visit N],
    "procedures": [list of prod in visit 1, list of prod in visit 2, ..., list of prod in visit N],
    "drugs_hist": [list of drug in visit 1, list of drug in visit 2, ..., list of drug in visit (N-1)],
    "drugs": list of drug in visit N,  # this is the prediction target
}

Return type

samples

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(drug_recommendation_mimic3_fn)
>>> mimic3_sample.samples[0]
{
    'visit_id': '174162',
    'patient_id': '107',
    'conditions': [['139', '158', '237', '99', '60', '101', '51', '54', '53', '133', '143', '140', '117', '138', '55']],
    'procedures': [['4443', '4513', '3995']],
    'drugs_hist': [[]],
    'drugs': ['0000', '0033', '5817', '0057', '0090', '0053', '0', '0012', '6332', '1001', '6155', '1001', '6332', '0033', '5539', '6332', '5967', '0033', '0040', '5967', '5846', '0016', '5846', '5107', '5551', '6808', '5107', '0090', '5107', '5416', '0033', '1150', '0005', '6365', '0090', '6155', '0005', '0090', '0000', '6373'],
}
pyhealth.tasks.drug_recommendation.drug_recommendation_mimic4_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys, like this:

{
    "patient_id": xxx,
    "visit_id": xxx,
    "conditions": [list of diag in visit 1, list of diag in visit 2, ..., list of diag in visit N],
    "procedures": [list of prod in visit 1, list of prod in visit 2, ..., list of prod in visit N],
    "drugs_hist": [list of drug in visit 1, list of drug in visit 2, ..., list of drug in visit (N-1)],
    "drugs": list of drug in visit N,  # this is the prediction target
}

Return type

samples

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(drug_recommendation_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
pyhealth.tasks.drug_recommendation.drug_recommendation_eicu_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import drug_recommendation_eicu_fn
>>> eicu_sample = eicu_base.set_task(drug_recommendation_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
pyhealth.tasks.drug_recommendation.drug_recommendation_omop_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import drug_recommendation_omop_fn
>>> omop_sample = omop_base.set_task(drug_recommendation_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51'], ['98', '663', '58', '51']], 'procedures': [['1'], ['2', '3']], 'label': [['2', '3', '4'], ['0', '1', '4', '5']]}]

pyhealth.tasks.readmission_prediction#

pyhealth.tasks.readmission_prediction.readmission_prediction_mimic3_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.
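To make the labeling rule above concrete, a pseudocode-style sketch (the variable names are illustrative, not the actual implementation):

# readmission within `time_window` days of the current visit -> positive label
gap_days = (next_admission_time - current_discharge_time).days  # illustrative names
label = 1 if gap_days < time_window else 0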

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(readmission_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.readmission_prediction.readmission_prediction_mimic4_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(readmission_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn(patient, time_window=5)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Features key-value pairs:
  • using diagnosis table (ICD9CM and ICD10CM) as condition codes

  • using physicalExam table as procedure codes

  • using medication table as drug codes

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "physicalExam"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import readmission_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn2(patient, time_window=5)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Similar to readmission_prediction_eicu_fn, but with different code mapping:
  • using admissionDx table and diagnosisString under diagnosis table as condition codes

  • using treatment table as procedure codes

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "treatment", "admissionDx"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import readmission_prediction_eicu_fn2
>>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn2)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.readmission_prediction.readmission_prediction_omop_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import readmission_prediction_omop_fn
>>> omop_sample = omop_base.set_task(readmission_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]

pyhealth.tasks.mortality_prediction#

pyhealth.tasks.mortality_prediction.mortality_prediction_mimic3_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id,

visit_id, and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(mortality_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.mortality_prediction.mortality_prediction_mimic4_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id,

visit_id, and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(mortality_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).

Features key-value pairs:
  • using diagnosis table (ICD9CM and ICD10CM) as condition codes

  • using physicalExam table as procedure codes

  • using medication table as drug codes

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id,

visit_id, and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "physicalExam"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import mortality_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn2(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).

Similar to mortality_prediction_eicu_fn, but with different code mapping:
  • using admissionDx table and diagnosisString under diagnosis table as condition codes

  • using treatment table as procedure codes

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id,

visit_id, and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "admissionDx", "treatment"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import mortality_prediction_eicu_fn2
>>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn2)
>>> eicu_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}
pyhealth.tasks.mortality_prediction.mortality_prediction_omop_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id,

visit_id, and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import mortality_prediction_omop_fn
>>> omop_sample = omop_base.set_task(mortality_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]

pyhealth.tasks.length_of_stay_prediction#

pyhealth.tasks.length_of_stay_prediction.categorize_los(days)[source]#

Categorizes length of stay into 10 categories.

One for ICU stays shorter than a day, seven day-long categories for each day of the first week, one for stays of over one week but less than two, and one for stays of over two weeks.

Parameters

days (int) – int, length of stay in days

Returns

int, category of length of stay

Return type

category
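A minimal sketch of this categorization (a hypothetical re-implementation; the exact boundary handling of the real categorize_los may differ):

def categorize_los_sketch(days):
    # 10 categories: <1 day, one per day for days 1-7, over one week but under two, over two weeks
    if days < 1:
        return 0
    elif days <= 7:
        return days   # one category per day of the first week
    elif days <= 14:
        return 8      # over one week but less than two
    else:
        return 9      # over two weeks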

pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic3_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(length_of_stay_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 4}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic4_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(length_of_stay_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 2}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_eicu_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import length_of_stay_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(length_of_stay_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 5}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_omop_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_omop_fn
>>> omop_sample = omop_base.set_task(length_of_stay_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 7}]

pyhealth.tasks.sleep_staging#

pyhealth.tasks.sleep_staging.sleep_staging_isruc_fn(record, epoch_seconds=10, label_id=1)[source]#

Processes a single patient for the sleep staging task on ISRUC.

Sleep staging aims at predicting the sleep stages (Awake, N1, N2, N3, REM) based on the multichannel EEG signals. The task is defined as a multi-class classification.

Parameters
  • record

    a singleton list of one subject from the ISRUCDataset. The (single) record is a dictionary with the following keys:

    load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id

  • epoch_seconds – how long each epoch will be (in seconds). It has to be a factor of 30 because the original data was labeled every 30 seconds (e.g., with epoch_seconds=10, each 30-second annotation yields three labeled epochs).

  • label_id – which set of labels to use. ISRUC is labeled by two experts. By default we use the first set of labels (label_id=1).

Returns

a list of samples; each sample is a dict with patient_id, record_id, and epoch_path (the path to the saved epoch {“X”: signal, “Y”: label}) as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import ISRUCDataset
>>> isruc = ISRUCDataset(
...         root="/srv/local/data/data/ISRUC-I", download=True,
...     )
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> sleepstage_ds = isruc.set_task(sleep_staging_isruc_fn)
>>> sleepstage_ds.samples[0]
{
    'record_id': '1-0',
    'patient_id': '1',
    'epoch_path': '/home/zhenlin4/.cache/pyhealth/datasets/832afe6e6e8a5c9ea5505b47e7af8125/10-1/1/0.pkl',
    'label': 'W'
}
pyhealth.tasks.sleep_staging.sleep_staging_sleepedf_fn(record, epoch_seconds=30)[source]#

Processes a single patient for the sleep staging task on Sleep EDF.

Sleep staging aims at predicting the sleep stages (Awake, REM, N1, N2, N3, N4) based on the multichannel EEG signals. The task is defined as a multi-class classification.

Parameters
  • record – a list of (root, PSG, Hypnogram, save_to_path) tuples, where PSG is the signal file and Hypnogram contains the labels.

  • epoch_seconds – how long each epoch will be (in seconds).

Returns

a list of samples; each sample is a dict with patient_id, record_id, and epoch_path (the path to the saved epoch {“X”: signal, “Y”: label}) as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import SleepEDFDataset
>>> sleepedf = SleepEDFDataset(
...         root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/sleep-cassette",
...     )
>>> from pyhealth.tasks import sleep_staging_sleepedf_fn
>>> sleepstage_ds = sleepedf.set_task(sleep_staging_sleepedf_fn)
>>> sleepstage_ds.samples[0]
{
    'record_id': 'SC4001-0',
    'patient_id': 'SC4001',
    'epoch_path': '/home/chaoqiy2/.cache/pyhealth/datasets/70d6dbb28bd81bab27ae2f271b2cbb0f/SC4001-0.pkl',
    'label': 'W'
}

Models#

We implement the following models for supporting multiple healthcare predictive tasks.

pyhealth.models.MLP#

The separate callable MLP model.

class pyhealth.models.MLP(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, n_layers=2, activation='relu', **kwargs)[source]#

Bases: BaseModel

Multi-layer perceptron model.

This model applies a separate MLP layer for each feature, and then concatenates the final hidden states of each MLP layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate MLP layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the MLP model:
  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode each code into a vector; we use mean/sum pooling and then MLP

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we first use the embedding table to encode each code into a vector and then use mean/sum pooling to get one vector for each sample; we then use MLP layers

  • case 3. [1.5, 2.0, 0.0]
    • we run MLP directly

  • case 4. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and use MLP, similar to case 1 after the embedding table

  • case 5. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and use MLP, similar to case 2 after the embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • n_layers (int) – the number of layers. Default is 2.

  • activation (str) – the activation function. Default is “relu”.

  • **kwargs – other parameters for the MLP layers.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "conditions": ["cond-33", "cond-86", "cond-80"],
...             "procedures": [1.0, 2.0, 3.5, 4],
...             "label": 0,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "conditions": ["cond-33", "cond-86", "cond-80"],
...             "procedures": [5.0, 2.0, 3.5, 4],
...             "label": 1,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import MLP
>>> model = MLP(
...         dataset=dataset,
...         feature_keys=["conditions", "procedures"],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.6659, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.5680],
                    [0.5352]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                    [0.]]),
    'logit': tensor([[0.2736],
                    [0.1411]], grad_fn=<AddmmBackward0>)
}
>>>
static mean_pooling(x, mask)[source]#

Mean pooling over the middle dimension of the tensor.

Parameters
  • x – tensor of shape (batch_size, seq_len, embedding_dim)

  • mask – tensor of shape (batch_size, seq_len)

Returns

tensor of shape (batch_size, embedding_dim)

Return type

x

Examples

>>> x.shape
[128, 5, 32]
>>> mean_pooling(x).shape
[128, 32]
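For intuition, a minimal sketch of masked mean pooling over the sequence dimension (an illustrative implementation under the shapes stated above, not necessarily identical to the library's code):

import torch

def masked_mean_pooling(x, mask):
    # x: (batch_size, seq_len, embedding_dim); mask: (batch_size, seq_len) with 1 = valid, 0 = padding
    mask = mask.unsqueeze(-1).float()           # (batch_size, seq_len, 1)
    summed = (x * mask).sum(dim=1)              # masked sum over the sequence dimension
    counts = mask.sum(dim=1).clamp(min=1e-8)    # number of valid positions, avoiding division by zero
    return summed / counts                      # (batch_size, embedding_dim)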
static sum_pooling(x)[source]#

Sum pooling over the middle dimension of the tensor.

Parameters
  • x – tensor of shape (batch_size, seq_len, embedding_dim)

Returns

tensor of shape (batch_size, embedding_dim)

Return type

x

Examples

>>> x.shape
[128, 5, 32]
>>> sum_pooling(x).shape
[128, 32]
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.

  • y_prob: a tensor representing the predicted probabilities.

  • y_true: a tensor representing the true labels.

training: bool#

pyhealth.models.CNN#

The separate callable CNNLayer and the complete CNN model.

class pyhealth.models.CNNLayer(input_size, hidden_size, num_layers=1)[source]#

Bases: Module

Convolutional neural network layer.

This layer stacks multiple CNN blocks and applies adaptive average pooling at the end. It is used in the CNN model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • num_layers (int) – number of convolutional layers. Default is 1.

Examples

>>> from pyhealth.models import CNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = CNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
forward(x)[source]#

Forward propagation.

Parameters

x (tensor) – a tensor of shape [batch size, sequence len, input size].

Returns

a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.

pooled_outputs: a tensor of shape [batch size, hidden size], containing the pooled output features.

Return type

outputs

training: bool#
class pyhealth.models.CNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Convolutional neural network model.

This model applies a separate CNN layer for each feature, and then concatenates the final hidden states of each CNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate CNN layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the CNN model:
  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode each code into a vector and apply CNN on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector and then uses average/mean pooling to get one vector for each inner bracket; it then applies CNN on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the CNN layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...                 [[7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import CNN
>>> model = CNN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.8872, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.5008], [0.6614]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.], [0.]]),
    'logit': tensor([[0.0033], [0.6695]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.

  • y_prob: a tensor representing the predicted probabilities.

  • y_true: a tensor representing the true labels.

training: bool#

pyhealth.models.RNN#

The separate callable RNNLayer and the complete RNN model.

class pyhealth.models.RNNLayer(input_size, hidden_size, rnn_type='GRU', num_layers=1, dropout=0.5, bidirectional=False)[source]#

Bases: Module

Recurrent neural network layer.

This layer wraps the PyTorch RNN layer with masking and dropout support. It is used in the RNN model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • rnn_type (str) – type of rnn, one of “RNN”, “LSTM”, “GRU”. Default is “GRU”.

  • num_layers (int) – number of recurrent layers. Default is 1.

  • dropout (float) – dropout rate. If non-zero, introduces a Dropout layer before each RNN layer. Default is 0.5.

  • bidirectional (bool) – whether to use bidirectional recurrent layers. If True, a fully-connected layer is applied to the concatenation of the forward and backward hidden states to reduce the dimension to hidden_size. Default is False.

Examples

>>> from pyhealth.models import RNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = RNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns
  • outputs: a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.
  • last_outputs: a tensor of shape [batch size, hidden size], containing the output features for the last time step.
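The optional mask lets the layer ignore padded time steps when selecting the last valid output. A minimal sketch with hypothetical toy values (only the last_outputs shape is asserted, since the padded length of outputs may depend on the batch):

>>> import torch
>>> from pyhealth.models import RNNLayer
>>> x = torch.randn(2, 4, 5)  # [batch size, sequence len, input_size]
>>> mask = torch.tensor([[True, True, True, False],
...                      [True, True, False, False]])  # True = valid step, False = padding
>>> layer = RNNLayer(input_size=5, hidden_size=8)
>>> outputs, last_outputs = layer(x, mask)
>>> last_outputs.shape
torch.Size([2, 8])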

training: bool#
class pyhealth.models.RNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Recurrent neural network model.

This model applies a separate RNN layer for each feature, and then concatenates the final hidden states of each RNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate rnn layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the rnn model (one concrete value per case is sketched right after this list):
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the given order; our model encodes each code into a vector and applies the RNN at the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the codes inside each inner bracket follow the given order; our model first encodes each code with the embedding table, then average/mean-pools the code vectors into one vector per inner bracket, and finally applies the RNN at the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run the RNN directly at the inner-bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run the RNN directly at the inner-bracket level, similar to case 2 after the embedding table
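To make the four cases concrete, here is one hypothetical feature value per case (these mirror the list_codes / list_list_codes / list_vectors / list_list_vectors keys used in the Examples below):

>>> case1 = ["505800458", "50580045810", "50580045811"]        # case 1: flat code sequence
>>> case2 = [["A05B", "A05C"], ["A11D", "A11E"]]               # case 2: codes nested per visit
>>> case3 = [[1.5, 2.0, 0.0], [8.0, 1.2, 4.5]]                 # case 3: flat sequence of equal-length float vectors
>>> case4 = [[[1.5, 2.0, 0.0], [8.0, 1.2, 4.5]], [[0.5, 0.1, 2.0]]]  # case 4: float vectors nested per visit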

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the RNN layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RNN
>>> model = RNN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.8056, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.5906],
                    [0.6620]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                    [0.]]),
    'logit': tensor([[0.3666],
                    [0.6721]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor representing the predicted probabilities.
  • y_true: a tensor representing the true labels.

training: bool#

pyhealth.models.Transformer#

The separate callable TransformerLayer and the complete Transformer model.

class pyhealth.models.TransformerLayer(feature_size, heads=1, dropout=0.5, num_layers=1)[source]#

Bases: Module

Transformer layer.

Paper: Ashish Vaswani et al. Attention is all you need. NIPS 2017.

This layer is used in the Transformer model. But it can also be used as a standalone layer.

Parameters
  • feature_size – the hidden feature size.

  • heads – the number of attention heads. Default is 1.

  • dropout – dropout rate. Default is 0.5.

  • num_layers – number of transformer layers. Default is 1.

Examples

>>> from pyhealth.models import TransformerLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = TransformerLayer(64)
>>> emb, cls_emb = layer(input)
>>> emb.shape
torch.Size([3, 128, 64])
>>> cls_emb.shape
torch.Size([3, 64])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, feature_size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns
  • emb: a tensor of shape [batch size, sequence len, feature_size], containing the output features for each time step.
  • cls_emb: a tensor of shape [batch size, feature_size], containing the output features for the first time step.

training: bool#
class pyhealth.models.Transformer(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#

Bases: BaseModel

Transformer model.

This model applies a separate Transformer layer for each feature, and then concatenates the final hidden states of each Transformer layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate Transformer layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the transformer model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the given order; our model encodes each code into a vector and applies the Transformer at the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the codes inside each inner bracket follow the given order; our model first encodes each code with the embedding table, then average/mean-pools the code vectors into one vector per inner bracket, and finally applies the Transformer at the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run the Transformer directly at the inner-bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run the Transformer directly at the inner-bracket level, similar to case 2 after the embedding table

Parameters
  • dataset – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key – key in samples to use as label (e.g., “drugs”).

  • mode – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim – the embedding dimension. Default is 128.

  • **kwargs – other parameters for the Transformer layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Transformer
>>> model = Transformer(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="multiclass",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(4.0555, grad_fn=<NllLossBackward0>),
    'y_prob': tensor([[1.0000e+00, 1.8206e-06],
                [9.9970e-01, 3.0020e-04]], grad_fn=<SoftmaxBackward0>),
    'y_true': tensor([0, 1]),
    'logit': tensor([[ 7.6283, -5.5881],
                [ 1.0898, -7.0210]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor representing the predicted probabilities.
  • y_true: a tensor representing the true labels.
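In practice, these models are usually trained with pyhealth.trainer rather than a hand-written loop. A hedged sketch, reusing model and train_loader from the example above; the argument names follow the PyHealth tutorials and should be treated as assumptions here (check the trainer documentation for the exact signature):

>>> from pyhealth.trainer import Trainer
>>> trainer = Trainer(model=model)
>>> trainer.train(
...     train_dataloader=train_loader,
...     val_dataloader=train_loader,  # reuse the toy loader here; use a real validation split in practice
...     epochs=1,
...     monitor="accuracy",           # assumed metric name for the multiclass mode
... )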

training: bool#

pyhealth.models.RETAIN#

The separate callable RETAINLayer and the complete RETAIN model.

class pyhealth.models.RETAINLayer(feature_size, dropout=0.5)[source]#

Bases: Module

RETAIN layer.

Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.

This layer is used in the RETAIN model. But it can also be used as a standalone layer.

Parameters
  • feature_size (int) – the hidden feature size.

  • dropout (float) – dropout rate. Default is 0.5.

Examples

>>> from pyhealth.models import RETAINLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = RETAINLayer(64)
>>> c = layer(input)
>>> c.shape
torch.Size([3, 64])
static reverse_x(input, lengths)[source]#

Reverses the input.

compute_alpha(rx, lengths)[source]#

Computes alpha attention.

compute_beta(rx, lengths)[source]#

Computes beta attention.

forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, feature_size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns
  • c: a tensor of shape [batch size, feature_size] representing the context vector.

training: bool#
class pyhealth.models.RETAIN(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#

Bases: BaseModel

RETAIN model.

Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.

Note

We use separate RETAIN layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the RETAIN model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the given order; our model encodes each code into a vector and applies RETAIN at the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the codes inside each inner bracket follow the given order; our model first encodes each code with the embedding table, then average/mean-pools the code vectors into one vector per inner bracket, and finally applies RETAIN at the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run RETAIN directly at the inner-bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run RETAIN directly at the inner-bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • **kwargs – other parameters for the RETAIN layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RETAIN
>>> model = RETAIN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.5640, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.5325],
                    [0.3922]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                    [0.]]),
    'logit': tensor([[ 0.1303],
                    [-0.4382]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor representing the predicted probabilities.
  • y_true: a tensor representing the true labels.

training: bool#

pyhealth.models.GAMENet#

The separate callable GAMENetLayer and the complete GAMENet model.

class pyhealth.models.GAMENetLayer(hidden_size, ehr_adj, ddi_adj, dropout=0.5)[source]#

Bases: Module

GAMENet layer.

Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination AAAI 2019.

This layer is used in the GAMENet model. But it can also be used as a standalone layer.

Parameters
  • hidden_size (int) – hidden feature size.

  • ehr_adj (tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • ddi_adj (tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • dropout (float) – the dropout rate. Default is 0.5.

Examples

>>> from pyhealth.models import GAMENetLayer
>>> queries = torch.randn(3, 5, 32) # [patient, visit, hidden_size]
>>> prev_drugs = torch.randint(0, 2, (3, 4, 50)).float()
>>> curr_drugs = torch.randint(0, 2, (3, 50)).float()
>>> ehr_adj = torch.randint(0, 2, (50, 50)).float()
>>> ddi_adj = torch.randint(0, 2, (50, 50)).float()
>>> layer = GAMENetLayer(32, ehr_adj, ddi_adj)
>>> loss, y_prob = layer(queries, prev_drugs, curr_drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
forward(queries, prev_drugs, curr_drugs, mask=None)[source]#

Forward propagation.

Parameters
  • queries (tensor) – query tensor of shape [patient, visit, hidden_size].

  • prev_drugs (tensor) – multihot tensor indicating drug usage in all previous visits of shape [patient, visit - 1, num_drugs].

  • curr_drugs (tensor) – multihot tensor indicating drug usage in the current visit of shape [patient, num_drugs].

  • mask (Optional[tensor]) – an optional mask tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

training: bool#
class pyhealth.models.GAMENet(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#

Bases: BaseModel

GAMENet model.

Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination AAAI 2019.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Note

This model only accepts ATC level 3 as medication codes.

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • num_layers (int) – the number of layers used in RNN. Default is 1.

  • dropout (float) – the dropout rate. Default is 0.5.

  • **kwargs – other parameters for the GAMENet layer.
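Unlike the models above, GAMENet takes no feature_keys/label_key arguments: it always reads conditions, procedures, drugs_hist, and drugs from a drug-recommendation sample dataset. A minimal, hedged sketch of the intended pipeline (the root path is a placeholder for local MIMIC-III files, and it assumes the drug recommendation task produces the drugs_hist key used by forward below); the same construction pattern applies to MICRON, SafeDrug, and MoleRec further down:

>>> from pyhealth.datasets import MIMIC3Dataset, get_dataloader
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> from pyhealth.models import GAMENet
>>> base = MIMIC3Dataset(
...     root="/path/to/mimic-iii/csv",  # placeholder path
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     # GAMENet only accepts ATC level 3 medication codes (see the Note above)
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> sample_dataset = base.set_task(task_fn=drug_recommendation_mimic3_fn)
>>> model = GAMENet(dataset=sample_dataset)  # default embedding_dim/hidden_dim
>>> train_loader = get_dataloader(sample_dataset, batch_size=32, shuffle=True)
>>> ret = model(**next(iter(train_loader)))  # dictionary with loss / y_prob / y_true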

generate_ehr_adj()[source]#

Generates the EHR graph adjacency matrix.

Return type

tensor

generate_ddi_adj()[source]#

Generates the DDI graph adjacency matrix.

Return type

tensor

forward(conditions, procedures, drugs_hist, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs_hist (List[List[List[str]]]) – a nested list in three levels [patient, visit, drug], up to visit (N-1)

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug], at visit N

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
  • y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

training: bool#

pyhealth.models.MICRON#

The separate callable MICRONLayer and the complete MICRON model.

class pyhealth.models.MICRONLayer(input_size, hidden_size, num_drugs, lam=0.1)[source]#

Bases: Module

MICRON layer.

Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.

This layer is used in the MICRON model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • num_drugs (int) – total number of drugs to recommend.

  • lam (float) – regularization parameter for the reconstruction loss. Default is 0.1.

Examples

>>> from pyhealth.models import MICRONLayer
>>> patient_emb = torch.randn(3, 5, 32) # [patient, visit, input_size]
>>> drugs = torch.randint(0, 2, (3, 50)).float()
>>> layer = MICRONLayer(32, 64, 50)
>>> loss, y_prob = layer(patient_emb, drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
static compute_reconstruction_loss(logits, logits_residual, mask)[source]#
Return type

tensor

forward(patient_emb, drugs, mask=None)[source]#

Forward propagation.

Parameters
  • patient_emb (tensor) – a tensor of shape [patient, visit, input_size].

  • drugs (tensor) – a multihot tensor of shape [patient, num_labels].

  • mask (Optional[tensor]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

training: bool#
class pyhealth.models.MICRON(dataset, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

MICRON model.

Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the MICRON layer.

forward(conditions, procedures, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug].

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
  • y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

training: bool#

pyhealth.models.SafeDrug#

The separate callable SafeDrugLayer and the complete SafeDrug model.

class pyhealth.models.SafeDrugLayer(hidden_size, mask_H, ddi_adj, num_fingerprints, molecule_set, average_projection, kp=0.05, target_ddi=0.08)[source]#

Bases: Module

SafeDrug layer.

Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.

This layer is used in the SafeDrug model. But it can also be used as a standalone layer.

Parameters
  • hidden_size (int) – hidden feature size.

  • mask_H (Tensor) – the mask matrix H of shape [num_drugs, num_substructures].

  • ddi_adj (Tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • num_fingerprints (int) – total number of different fingerprints.

  • molecule_set (List[Tuple]) – a list of molecule tuples (A, B, C) of length num_molecules:
    – A <torch.tensor>: fingerprints of atoms in the molecule
    – B <torch.tensor>: adjacency matrix of the molecule
    – C <int>: molecular size

  • average_projection (Tensor) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.

  • kp (float) – correcting factor for the proportional signal. Default is 0.05.

  • target_ddi (float) – DDI acceptance rate. Default is 0.08.

pad(matrices, pad_value)[source]#

Pads the list of matrices.

Padding with a pad_value (e.g., 0) for batch processing. For example, given a list of matrices [A, B, C], we obtain a new matrix [A00, 0B0, 00C], where 0 is the zero (i.e., pad value) matrix.
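The arrangement described here is the usual block-diagonal batching of variable-size matrices; a tiny illustration of the idea using torch.block_diag (not the pad method itself):

>>> import torch
>>> A, B = torch.ones(2, 2), torch.ones(3, 3)
>>> torch.block_diag(A, B)
tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [0., 0., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 0., 1., 1., 1.]])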

calculate_loss(logits, y_prob, labels)[source]#
Return type

Tensor

forward(patient_emb, drugs, mask=None)[source]#

Forward propagation.

Parameters
  • patient_emb (tensor) – a tensor of shape [patient, visit, input_size].

  • drugs (tensor) – a multihot tensor of shape [patient, num_labels].

  • mask (Optional[tensor]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

training: bool#
class pyhealth.models.SafeDrug(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#

Bases: BaseModel

SafeDrug model.

Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Note

This model only accepts ATC level 3 as medication codes.

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • num_layers (int) – the number of layers used in RNN. Default is 1.

  • dropout (float) – the dropout rate. Default is 0.5.

  • **kwargs – other parameters for the SafeDrug layer.

generate_ddi_adj()[source]#

Generates the DDI graph adjacency matrix.

Return type

tensor

generate_smiles_list()[source]#

Generates the list of SMILES strings.

Return type

List[List[str]]

generate_mask_H()[source]#

Generates the molecular segmentation mask H.

Return type

tensor

generate_molecule_info(radius=1)[source]#

Generates the molecule information.

forward(conditions, procedures, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug].

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
  • y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

training: bool#

pyhealth.models.MoleRec#

The separate callable MoleRecLayer and the complete MoleRec model.

class pyhealth.models.MoleRecLayer(hidden_size, coef=2.5, target_ddi=0.08, GNN_layers=4, dropout=0.5, multiloss_weight=0.05, **kwargs)[source]#

Bases: Module

MoleRec layer.

Paper: Nianzu Yang et al. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. WWW 2023.

This layer is used in the MoleRec model. But it can also be used as a standalone layer.

Parameters
  • hidden_size (int) – hidden feature size.

  • coef (float) – coefficient of the DDI loss weight annealing. A larger coefficient means a higher penalty on drug-drug interactions. Default is 2.5.

  • target_ddi (float) – DDI acceptance rate. Default is 0.08.

  • GNN_layers (int) – the number of layers of the GNNs encoding molecules and substructures. Default is 4.

  • dropout (float) – the dropout rate of the model. Default is 0.5.

  • multiloss_weight (float) – the weight of multilabel_margin_loss for multilabel classification. The value should be in [0, 1]. Default is 0.05.

calc_loss(logits, y_prob, ddi_adj, labels, label_index=None)[source]#
Return type

Tensor

forward(patient_emb, drugs, average_projection, ddi_adj, substructure_mask, substructure_graph, molecule_graph, mask=None, drug_indexes=None)[source]#

Forward propagation.

Parameters
  • patient_emb (Tensor) – a tensor of shape [patient, visit, num_substructures], representing the relation between each patient visit and each substructure.

  • drugs (Tensor) – a multihot tensor of shape [patient, num_labels].

  • mask (Optional[tensor]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

  • substructure_mask (Tensor) – tensor of shape [num_drugs, num_substructures], representing whether a substructure shows up in any of the molecules of each drug.

  • average_projection (Tensor) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.

  • substructure_graph (Union[StaticParaDict, Dict[str, Union[int, Tensor]]]) – a dictionary representing a graph batch of all substructures, where each graph is extracted via the ‘smiles2graph’ API of the ogb library.

  • molecule_graph (Union[StaticParaDict, Dict[str, Union[int, Tensor]]]) – a dictionary with the same form as substructure_graph, representing the graph batch of all molecules.

  • ddi_adj (Tensor) – an adjacency tensor for drug drug interaction of shape [num_drugs, num_drugs].

  • drug_indexes (Optional[Tensor]) – the index version of drugs (ground truth) of shape [patient, num_labels], padded with -1

Returns
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

training: bool#
class pyhealth.models.MoleRec(dataset, embedding_dim=64, hidden_dim=64, num_rnn_layers=1, num_gnn_layers=4, dropout=0.5, **kwargs)[source]#

Bases: BaseModel

MoleRec model.

Paper: Nianzu Yang et al. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. WWW 2023.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Note

This model only accepts ATC level 3 as medication codes.

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 64.

  • hidden_dim (int) – the hidden dimension. Default is 64.

  • num_rnn_layers (int) – the number of layers used in RNN. Default is 1.

  • num_gnn_layers (int) – the number of layers used in GNN. Default is 4.

  • dropout (float) – the dropout rate. Default is 0.5.

  • **kwargs – other parameters for the MoleRec layer.

generate_ddi_adj()[source]#

Generates the DDI graph adjacency matrix.

Return type

FloatTensor

generate_substructure_mask()[source]#
Return type

Tuple[Tensor, List[str]]

generate_smiles_list()[source]#

Generates the list of SMILES strings.

Return type

List[List[str]]

generate_average_projection()[source]#
Return type

Tuple[Tensor, List[str]]

encode_patient(feature_key, raw_values)[source]#
Return type

Tensor

forward(conditions, procedures, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels with shape [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels with shape [patient, visit, procedure].

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug].

Returns

A dictionary with the following keys:
  • loss: a scalar tensor representing the loss.
  • y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
  • y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

training: bool#

pyhealth.models.Deepr#

The separate callable DeeprLayer and the complete Deepr model.

class pyhealth.models.DeeprLayer(feature_size=100, window=1, hidden_size=3)[source]#

Bases: Module

Deepr layer.

Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.

This layer is used in the Deepr model.

Parameters
  • feature_size (int) – embedding dim of codes (m in the original paper).

  • window (int) – sliding window (d in the original paper)

  • hidden_size (int) – number of conv filters (motif size, p, in the original paper)

Examples

>>> from pyhealth.models import DeeprLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = DeeprLayer(5, window=4, hidden_size=7) # window does not impact the output shape
>>> outputs = layer(input)
>>> outputs.shape
torch.Size([3, 7])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (Tensor) – a Tensor of shape [batch size, sequence len, input size].

  • mask (Optional[Tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns
  • c: a Tensor of shape [batch size, hidden_size] representing the summarized vector.

training: bool#
class pyhealth.models.Deepr(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Deepr model.

Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.

Note

We use separate Deepr layers for different feature_keys.

Parameters
  • dataset (BaseEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the Deepr layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Deepr
>>> model = Deepr(
...         dataset=dataset,
...         feature_keys=[
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.8908, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.2295],
                [0.2665]], device='cuda:0', grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                [0.]], device='cuda:0'),
    'logit': tensor([[-1.2110],
                [-1.0126]], device='cuda:0', grad_fn=<AddmmBackward0>)
}
forward(**kwargs)[source]#

Forward propagation.

Return type

Dict[str, Tensor]

training: bool#

pyhealth.models.ContraWR#

The separate callable ResBlock2D and the complete ContraWR model.

class pyhealth.models.ResBlock2D(in_channels, out_channels, stride=2, downsample=True, pooling=True)[source]#

Bases: Module

Convolutional Residual Block 2D

This block stacks two convolutional layers with batch normalization, max pooling, dropout, and residual connection.

Parameters
  • in_channels (int) – number of input channels.

  • out_channels (int) – number of output channels.

  • stride (int) – stride of the convolutional layers.

  • downsample (bool) – whether to use a downsampling residual connection.

  • pooling (bool) – whether to use max pooling.

Example

>>> import torch
>>> from pyhealth.models import ResBlock2D
>>>
>>> model = ResBlock2D(6, 16, 1, True, True)
>>> input_ = torch.randn((16, 6, 28, 150))  # (batch, channel, height, width)
>>> output = model(input_)
>>> output.shape
torch.Size([16, 16, 14, 75])
forward(x)[source]#

Forward propagation.

Parameters

x – input tensor of shape (batch_size, in_channels, height, width).

Returns
  • out: output tensor of shape (batch_size, out_channels, *, *).

training: bool#
class pyhealth.models.ContraWR(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, n_fft=128, **kwargs)[source]#

Bases: BaseModel

The encoder model of ContraWR (a supervised model, STFT + 2D CNN layers)

Paper: Yang, Chaoqi, Danica Xiao, M. Brandon Westover, and Jimeng Sun. “Self-supervised eeg representation learning for automatic sleep staging.” arXiv preprint arXiv:2110.15278 (2021).

Note

We use one encoder to handle multiple channels together.

Parameters
  • dataset (BaseSignalDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the Deepr layer.

Examples

>>> from pyhealth.datasets import SampleSignalDataset
>>> samples = [
...         {
...             "record_id": "SC4001-0",
...             "patient_id": "SC4001",
...             "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-0.pkl",
...             "label": "W",
...         },
...         {
...             "record_id": "SC4001-1",
...             "patient_id": "SC4001",
...             "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-1.pkl",
...             "label": "R",
...         }
...     ]
>>> dataset = SampleSignalDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import ContraWR
>>> model = ContraWR(
...         dataset=dataset,
...         feature_keys=["signal"], # dataloader will load the signal from "epoch_path" and put it in "signal"
...         label_key="label",
...         mode="multiclass",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(2.8425, device='cuda:0', grad_fn=<NllLossBackward0>),
    'y_prob': tensor([[0.9345, 0.0655],
                    [0.9482, 0.0518]], device='cuda:0', grad_fn=<SoftmaxBackward0>),
    'y_true': tensor([1, 1], device='cuda:0'),
    'logit': tensor([[ 0.1472, -2.5104],
                    [2.1584, -0.7481]], device='cuda:0', grad_fn=<AddmmBackward0>)
}
>>>
cal_encoder_stat()[source]#

Obtains the convolutional encoder initialization statistics.

Note

We show an example to illustrate the encoder statistics. Input x:

  • torch.Size([5, 7, 3000])

after the STFT transform
  • torch.Size([5, 7, 65, 90])

we design the first CNN (out_channels = 8)
  • torch.Size([5, 8, 16, 22])

  • here: 8 * 16 * 22 > 256, so we continue the convolution

we design the second CNN (out_channels = 16)
  • torch.Size([5, 16, 4, 5])

  • here: 16 * 4 * 5 > 256, so we continue the convolution

we design the third CNN (out_channels = 32)
  • torch.Size([5, 32, 1, 1])

  • here: 32 * 1 * 1 ≤ 256, so we stop the convolution

output:
  • channels = [7, 8, 16, 32]

  • emb_size = 32 * 1 * 1 = 32

torch_stft(X)[source]#

Torch short-time Fourier transform (STFT).

Parameters

X – (batch, n_channels, length)

Returns

(batch, n_channels, freq, time_steps)

Return type

signal

forward(**kwargs)[source]#

Forward propagation.

Return type

Dict[str, Tensor]

training: bool#

pyhealth.models.SparcNet#

The SparcNet Model: Jin Jing, et al. Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation. Neurology 2023.

class pyhealth.models.DenseLayer(input_channels, growth_rate, bn_size, drop_rate=0.5, conv_bias=True, batch_norm=True)[source]#

Bases: Sequential

Densely connected layer.

Parameters
  • input_channels – number of input channels

  • growth_rate – rate of growth of channels in this layer

  • bn_size – multiplicative factor for the bottleneck layer (does not affect the output size)

  • drop_rate – dropout rate

  • conv_bias – whether to use bias in convolutional layers

  • batch_norm – whether to use batch normalization

Example

>>> x = torch.randn(128, 5, 1000)
>>> batch, channels, length = x.shape
>>> model = DenseLayer(channels, 5, 2)
>>> y = model(x)
>>> y.shape
torch.Size([128, 10, 1000])
forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pyhealth.models.DenseBlock(num_layers, input_channels, growth_rate, bn_size, drop_rate=0.5, conv_bias=True, batch_norm=True)[source]#

Bases: Sequential

Densely connected block.

Parameters
  • num_layers – number of layers in this block

  • input_channels – number of input channels

  • growth_rate – rate of growth of channels in this layer

  • bn_size – multiplicative factor for the bottleneck layer (does not affect the output size)

  • drop_rate – dropout rate

  • conv_bias – whether to use bias in convolutional layers

  • batch_norm – whether to use batch normalization

Example

>>> x = torch.randn(128, 5, 1000)
>>> batch, channels, length = x.shape
>>> model = DenseBlock(3, channels, 5, 2)
>>> y = model(x)
>>> y.shape
torch.Size([128, 20, 1000])
class pyhealth.models.TransitionLayer(input_channels, output_channels, conv_bias=True, batch_norm=True)[source]#

Bases: Sequential

Pooling transition layer.

Parameters
  • input_channels – number of input channels

  • output_channels – number of output channels

  • conv_bias – whether to use bias in convolutional layers

  • batch_norm – whether to use batch normalization

Example

>>> x = torch.randn(128, 5, 1000)
>>> model = TransitionLayer(5, 18)
>>> y = model(x)
>>> y.shape
torch.Size([128, 18, 500])
class pyhealth.models.SparcNet(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, block_layers=4, growth_rate=16, bn_size=16, drop_rate=0.5, conv_bias=True, batch_norm=True, **kwargs)[source]#

Bases: BaseModel

The SparcNet model for sleep staging.

Paper: Jin Jing, et al. Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation. Neurology 2023.

Note

We use one encoder to handle multiple channels together.

Parameters
  • dataset (BaseSignalDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – (not used now) the embedding dimension. Default is 128.

  • hidden_dim (int) – (not used now) the hidden dimension. Default is 128.

  • block_layers – the number of layers in each dense block. Default is 4.

  • growth_rate – the growth rate of each dense layer. Default is 16.

  • bn_size – the bottleneck size of each dense layer. Default is 16.

  • conv_bias – whether to use bias in convolutional layers. Default is True.

  • batch_norm – whether to use batch normalization. Default is True.

  • **kwargs – other parameters for the Deepr layer.

Examples

>>> from pyhealth.datasets import SampleSignalDataset
>>> samples = [
...         {
...             "record_id": "SC4001-0",
...             "patient_id": "SC4001",
...             "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-0.pkl",
...             "label": "W",
...         },
...         {
...             "record_id": "SC4001-1",
...             "patient_id": "SC4001",
...             "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-1.pkl",
...             "label": "R",
...         }
...     ]
>>> dataset = SampleSignalDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import SparcNet
>>> model = SparcNet(
...         dataset=dataset,
...         feature_keys=["signal"], # dataloader will load the signal from "epoch_path" and put it in "signal"
...         label_key="label",
...         mode="multiclass",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.6530, device='cuda:0', grad_fn=<NllLossBackward0>),
    'y_prob': tensor([[0.4459, 0.5541],
                    [0.5111, 0.4889]], device='cuda:0', grad_fn=<SoftmaxBackward0>),
    'y_true': tensor([1, 1], device='cuda:0'),
    'logit': tensor([[-0.2750, -0.0577],
                    [-0.1319, -0.1763]], device='cuda:0', grad_fn=<AddmmBackward0>)
}
label_tokenizer#

input statistics

forward(**kwargs)[source]#

Forward propagation.

Return type

Dict[str, Tensor]

training: bool#

pyhealth.models.StageNet#

The separate callable StageNetLayer and the complete StageNet model.

class pyhealth.models.StageNetLayer(input_dim, chunk_size=128, conv_size=10, levels=3, dropconnect=0.3, dropout=0.3, dropres=0.3)[source]#

Bases: Module

StageNet layer.

Paper: Stagenet: Stage-aware neural networks for health risk prediction. WWW 2020.

This layer is used in the StageNet model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – dynamic feature size.

  • chunk_size (int) – the chunk size for the StageNet layer. Default is 128.

  • levels (int) – the number of levels for the StageNet layer. levels * chunk_size = hidden_dim in the RNN. Smaller chunk size and more levels can capture more detailed patient status variations. Default is 3.

  • conv_size (int) – the size of the convolutional kernel. Default is 10.

  • dropconnect (float) – the dropout rate for the dropconnect. Default is 0.3.

  • dropout (float) – the dropout rate for the dropout. Default is 0.3.

  • dropres (float) – the dropout rate for the residual connection. Default is 0.3.

Examples

>>> from pyhealth.models import StageNetLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = StageNetLayer(64)
>>> c, _, _ = layer(input)
>>> c.shape
torch.Size([3, 384])
cumax(x, mode='l2r')[source]#
step(inputs, c_last, h_last, interval, device)[source]#
forward(x, time=None, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input_dim].

  • time (Optional[tensor]) – an optional tensor of time intervals between steps, used for the stage-aware memory decay (see the StageNet model below).

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns
  • last_output: a tensor of shape [batch size, chunk_size*levels] representing the patient embedding.
  • outputs: a tensor of shape [batch size, sequence len, chunk_size*levels] representing the patient at each time step.

training: bool#
class pyhealth.models.StageNet(dataset, feature_keys, label_key, mode, time_keys=None, embedding_dim=128, chunk_size=128, levels=3, **kwargs)[source]#

Bases: BaseModel

StageNet model.

Paper: Junyi Gao et al. Stagenet: Stage-aware neural networks for health risk prediction. WWW 2020.

Note

We use separate StageNet layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the StageNet model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the given order; our model encodes each code into a vector and applies StageNet at the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the codes inside each inner bracket follow the given order; our model first encodes each code with the embedding table, then average/mean-pools the code vectors into one vector per inner bracket, and finally applies StageNet at the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run StageNet directly at the inner-bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when every inner bracket has the same length; we assume each dimension has the same meaning and run StageNet directly at the inner-bracket level, similar to case 2 after the embedding table

The time interval information specified by time_keys will be used to calculate the memory decay between each visit. If time_keys is None, all visits are treated as the same time interval. For each feature, the time interval should be a two-dimensional float array with shape (time_step, 1).

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • time_keys (Optional[List[str]]) – list of keys in samples to use as time interval information for each feature, Default is None. If none, all visits are treated as the same time interval.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • chunk_size (int) – the chunk size for the StageNet layer. Default is 128.

  • levels (int) – the number of levels for the StageNet layer. levels * chunk_size = hidden_dim in the RNN. Smaller chunk size and more levels can capture more detailed patient status variations. Default is 3.

  • **kwargs – other parameters for the StageNet layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         # "single_vector": [1, 2, 3],
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...         "list_vectors_time": [[0.0], [1.3]],
...         "list_codes_time": [[0.0], [2.0], [1.3]],
...         "list_list_codes_time": [[0.0], [1.5]],
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         # "single_vector": [1, 5, 8],
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...         "list_vectors_time": [[0.0], [2.0], [1.0]],
...         "list_codes_time": [[0.0], [2.0], [1.3], [1.0], [2.0]],
...         "list_list_codes_time": [[0.0]],
...     },
... ]
>>>
>>> # dataset
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> # data loader
>>> from pyhealth.datasets import get_dataloader
>>>
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>>
>>> # model
>>> model = StageNet(
...     dataset=dataset,
...     feature_keys=[
...         "list_codes",
...         "list_vectors",
...         "list_list_codes",
...         # "list_list_vectors",
...     ],
...     time_keys=["list_codes_time", "list_vectors_time", "list_list_codes_time"],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.7111, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.4815],
                [0.4991]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                [0.]]),
    'logit': tensor([[-0.0742],
                [-0.0038]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the final loss. distance: list of tensors representing the stage variation of the patient. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.AdaCare#

The separate callable AdaCareLayer and the complete AdaCare model.

class pyhealth.models.AdaCareLayer(input_dim, hidden_dim=128, kernel_size=2, kernel_num=64, r_v=4, r_c=4, activation='sigmoid', rnn_type='gru', dropout=0.5)[source]#

Bases: Module

AdaCare layer.

Paper: Liantao Ma et al. Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. AAAI 2020.

This layer is used in the AdaCare model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – the input feature size.

  • hidden_dim (int) – the hidden dimension of the GRU layer. Default is 128.

  • kernel_size (int) – the kernel size of the causal convolution layer. Default is 2.

  • kernel_num (int) – the kernel number of the causal convolution layer. Default is 64.

  • r_v (int) – the reduction rate for the original feature calibration. Default is 4.

  • r_c (int) – the reduction rate for the convolutional feature recalibration. Default is 4.

  • activation (str) – the activation function for the recalibration layer (sigmoid, sparsemax, softmax). Default is “sigmoid”.

  • dropout (float) – dropout rate. Default is 0.5.

Examples

>>> from pyhealth.models import AdaCareLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = AdaCareLayer(64)
>>> c, _, inputatt, convatt = layer(input)
>>> c.shape
torch.Size([3, 64])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input_dim].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

last_output: a tensor of shape [batch size, input_dim] representing the patient embedding. output: a tensor of shape [batch size, sequence_len, input_dim] representing the patient embedding at each time step. inputatt: a tensor of shape [batch size, sequence_len, input_dim] representing the feature importance of the input. convatt: a tensor of shape [batch size, sequence_len, 3 * kernel_num] representing the feature importance of the convolutional features.

training: bool#
class pyhealth.models.AdaCare(dataset, feature_keys, label_key, mode, use_embedding, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

AdaCare model.

Paper: Liantao Ma et al. Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. AAAI 2020.

Note

We use separate AdaCare layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

Since the AdaCare model calibrates the original features to provide interpretability, we do not recommend using embeddings for the input features. We follow the current convention for the AdaCare model (see the key-to-case mapping after this list):

  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode

    each code into a vector and apply AdaCare on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector per inner bracket, and finally applies AdaCare on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run AdaCare directly on the inner bracket level, similar to case 1 after embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run AdaCare directly on the inner bracket level, similar to case 2 after embedding table
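To make the four cases concrete, the sample keys used in the Examples below line up with them as follows (a reading aid only; the key names themselves are arbitrary):

"list_codes":        ["505800458", "50580045810", ...]             # case 1: a flat list of codes
"list_list_codes":   [["A05B", "A05C", "A06A"], ["A11D", "A11E"]]  # case 2: codes grouped per visit
"list_vectors":      [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]]           # case 3: one numeric vector per step
"list_list_vectors": [[[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ...]  # case 4: numeric vectors grouped per visit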

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • use_embedding (List[bool]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the AdaCare layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import AdaCare
>>> model = AdaCare(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         use_embedding=[True, False, True, False],
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.7167, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.5009], [0.4779]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[0.], [1.]]),
    'logit': tensor([[ 0.0036], [-0.0886]], grad_fn=<AddmmBackward0>)
}
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. feature_importance: a list of tensors with shape (feature_type, batch_size, time_step, features) representing the feature importance. conv_feature_importance: a list of tensors with shape (feature_type, batch_size, time_step, 3 * kernel_num) representing the convolutional feature importance. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.ConCare#

The separate callable ConCareLayer and the complete ConCare model.

class pyhealth.models.ConCareLayer(input_dim, static_dim=0, hidden_dim=128, num_head=4, pe_hidden=64, dropout=0.5)[source]#

Bases: Module

ConCare layer.

Paper: Liantao Ma et al. Concare: Personalized clinical feature embedding via capturing the healthcare context. AAAI 2020.

This layer is used in the ConCare model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – dynamic feature size.

  • static_dim (int) – static feature size, if 0, then no static feature is used.

  • hidden_dim (int) – hidden dimension of the channel-wise GRU, default 128.

  • transformer_hidden – hidden dimension of the transformer, default 128.

  • num_head (int) – number of heads in the transformer, default 4.

  • pe_hidden (int) – hidden dimension of the positional encoding, default 64.

  • dropout (float) – dropout rate, default 0.5.

Examples

>>> from pyhealth.models import ConCareLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = ConCareLayer(64)
>>> c, _ = layer(input)
>>> c.shape
torch.Size([3, 128])
concare_encoder(input, static=None, mask=None)[source]#
forward(x, static=None, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input_dim].

  • static (Optional[tensor]) – a tensor of shape [batch size, static_dim].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

output: a tensor of shape [batch size, fusion_dim] representing the patient embedding. decov: the decov loss value.

training: bool#
class pyhealth.models.ConCare(dataset, feature_keys, label_key, mode, use_embedding, static_key=None, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

ConCare model.

Paper: Liantao Ma et al. Concare: Personalized clinical feature embedding via capturing the healthcare context. AAAI 2020.

Note

We use separate ConCare layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

If you need the interpretable feature correlations computed by the ConCare model, we do not recommend using embeddings for the input features. We follow the current convention for the ConCare model:

  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode

    each code into a vector and apply ConCare on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector per inner bracket, and finally applies ConCare on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run ConCare directly on the inner bracket level, similar to case 1 after embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run ConCare directly on the inner bracket level, similar to case 2 after embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • static_key (Optional[str]) – the key in samples to use as static features, e.g., “demographic”. Default is None. We only support numerical static features.

  • use_embedding (List[bool]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the ConCare layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import ConCare
>>> model = ConCare(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         static_key="demographic",
...         use_embedding=[True, False, True, False],
...         mode="binary"
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(9.5541, grad_fn=<AddBackward0>),
    'y_prob': tensor([[0.5323], [0.5363]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.], [0.]]),
    'logit': tensor([[0.1293], [0.1454]], grad_fn=<AddmmBackward0>)
}
>>>
training: bool#
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the final loss. loss_task: a scalar tensor representing the task loss. loss_decov: a scalar tensor representing the decov loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

pyhealth.models.Agent#

The separate callable AgentLayer and the complete Agent model.

class pyhealth.models.AgentLayer(input_dim, static_dim=0, cell='gru', use_baseline=True, n_actions=10, n_units=64, n_hidden=128, dropout=0.5, lamda=0.5)[source]#

Bases: Module

Dr. Agent layer.

Paper: Junyi Gao et al. Dr. Agent: Clinical predictive model via mimicked second opinions. JAMIA.

This layer is used in the Dr. Agent model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – dynamic feature size.

  • static_dim (int) – static feature size, if 0, then no static feature is used.

  • cell (str) – rnn cell type. Default is “gru”.

  • use_baseline (bool) – whether to use baseline for the RL agent. Default is True.

  • n_actions (int) – number of historical visits to choose. Default is 10.

  • n_units (int) – number of hidden units in each agent. Default is 64.

  • fusion_dim – number of hidden units in the final representation. Default is 128.

  • n_hidden (int) – number of hidden units in the rnn. Default is 128.

  • dropout (float) – dropout rate. Default is 0.5.

  • lamda (float) – weight for the agent selected hidden state and the current hidden state. Default is 0.5.

Examples

>>> from pyhealth.models import AgentLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = AgentLayer(64)
>>> c, _ = layer(input)
>>> c.shape
torch.Size([3, 128])
choose_action(observation, agent=1)[source]#
forward(x, static=None, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input_dim].

  • static (Optional[tensor]) – a tensor of shape [batch size, static_dim].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

last_output: a tensor of shape [batch size, n_hidden] representing the patient embedding. output: a tensor of shape [batch size, sequence len, n_hidden] representing the patient embedding at each time step.

training: bool#
class pyhealth.models.Agent(dataset, feature_keys, label_key, mode, static_key=None, embedding_dim=128, hidden_dim=128, use_baseline=True, **kwargs)[source]#

Bases: BaseModel

Dr. Agent model.

Paper: Junyi Gao et al. Dr. Agent: Clinical predictive model via mimicked second opinions. JAMIA.

Note

We use separate Dr. Agent layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the Dr. Agent model:
  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode

    each code into a vector and apply Dr. Agent on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector per inner bracket, and finally applies Dr. Agent on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run Dr. Agent directly on the inner bracket level, similar to case 1 after embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run Dr. Agent directly on the inner bracket level, similar to case 2 after embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • static_key (Optional[str]) – the key in samples to use as static features, e.g., “demographic”. Default is None. We only support numerical static features.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension of the RNN in the Dr. Agent layer. Default is 128.

  • use_baseline (bool) – whether to use the baseline value to calculate the RL loss. Default is True.

  • **kwargs – other parameters for the Dr. Agent layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Agent
>>> model = Agent(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         static_key="demographic",
...         mode="binary"
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(1.4059, grad_fn=<AddBackward0>),
    'y_prob': tensor([[0.4861], [0.5348]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[0.], [1.]]),
    'logit': tensor([[-0.0556], [0.1392]], grad_fn=<AddmmBackward0>)
}
>>>
get_loss(model, pred, true, mask, gamma=0.9, entropy_term=0.01)[source]#
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the final loss. loss_task: a scalar tensor representing the task loss. loss_RL: a scalar tensor representing the RL loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.GRASP#

The separate callable GRASPLayer and the complete GRASP model.

class pyhealth.models.GRASPLayer(input_dim, static_dim=0, hidden_dim=128, cluster_num=2, dropout=0.5, block='ConCare')[source]#

Bases: Module

GRASPLayer layer.

Paper: Liantao Ma et al. GRASP: generic framework for health status representation learning based on incorporating knowledge from similar patients. AAAI 2021.

This layer is used in the GRASP model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – dynamic feature size.

  • static_dim (int) – static feature size, if 0, then no static feature is used.

  • hidden_dim (int) – hidden dimension of the GRASP layer, default 128.

  • cluster_num (int) – number of clusters, default 2 (as in the signature above). The cluster_num should be no more than the number of samples.

  • dropout (float) – dropout rate, default 0.5.

  • block (str) – the backbone model used in the GRASP layer (‘ConCare’, ‘LSTM’ or ‘GRU’), default ‘ConCare’.

Examples

>>> from pyhealth.models import GRASPLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = GRASPLayer(64, cluster_num=2)
>>> c = layer(input)
>>> c.shape
torch.Size([3, 128])
sample_gumbel(shape, eps=1e-20)[source]#
gumbel_softmax_sample(logits, temperature, device)[source]#
gumbel_softmax(logits, temperature, device, hard=False)[source]#

Straight-through (ST) Gumbel-Softmax. Input: logits of shape [*, n_class]; returns a [*, n_class] one-hot vector.

grasp_encoder(input, static=None, mask=None)[source]#
forward(x, static=None, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input_dim].

  • static (Optional[tensor]) – a tensor of shape [batch size, static_dim].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

output: a tensor of shape [batch size, fusion_dim] representing the patient embedding.

training: bool#
class pyhealth.models.GRASP(dataset, feature_keys, label_key, mode, use_embedding, static_key=None, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

GRASP model.

Paper: Liantao Ma et al. GRASP: generic framework for health status representation learning based on incorporating knowledge from similar patients. AAAI 2021.

Note

We use separate GRASP layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the GRASP model:
  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode

    each code into a vector and apply GRASP on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector per inner bracket, and finally applies GRASP on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run GRASP directly on the inner bracket level, similar to case 1 after embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run GRASP directly on the inner bracket level, similar to case 2 after embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • static_key (Optional[str]) – the key in samples to use as static features, e.g., “demographic”. Default is None. We only support numerical static features.

  • use_embedding (List[bool]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension of the GRASP layer. Default is 128.

  • cluster_num – the number of clusters. Default is 10. Note that batch size should be greater than cluster_num.

  • **kwargs – other parameters for the GRASP layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "demographic": [0.0, 2.0, 1.5],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import GRASP
>>> model = GRASP(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         static_key="demographic",
...         use_embedding=[True, False, True, False],
...         mode="binary"
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(0.6896, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.4983],
                [0.4947]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[1.],
                [0.]]),
    'logit': tensor([[-0.0070],
                [-0.0213]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the final loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.TCN#

The separate callable TCNLayer and the complete TCN model.

class pyhealth.models.TCNLayer(input_dim, num_channels=128, max_seq_length=20, kernel_size=2, dropout=0.5)[source]#

Bases: Module

Temporal Convolutional Networks layer.

Shaojie Bai et al. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.

This layer wraps the PyTorch TCN layer with masking and dropout support. It is used in the TCN model. But it can also be used as a standalone layer.

Parameters
  • input_dim (int) – input feature size.

  • num_channels (int) – int or list of ints. If an int, the network depth is decided automatically from max_seq_length; if a list, it gives the number of channels in each layer.

  • max_seq_length (int) – max sequence length. Used to compute the depth of the TCN.

  • kernel_size (int) – kernel size of the TCN.

  • dropout (float) – dropout rate. If non-zero, introduces a Dropout layer before each TCN block. Default is 0.5.

Examples

>>> from pyhealth.models import TCNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = TCNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

last_out: a tensor of shape [batch size, hidden size], containing the output features for the last time step. out: a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.

training: bool#
class pyhealth.models.TCN(dataset, feature_keys, label_key, mode, embedding_dim=128, num_channels=128, **kwargs)[source]#

Bases: BaseModel

Temporal Convolutional Networks model.

This model applies a separate TCN layer for each feature, and then concatenates the final hidden states of each TCN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate TCN layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the TCN model:
  • case 1. [code1, code2, code3, …]
    • we will assume the code follows the order; our model will encode

    each code into a vector and apply TCN on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector per inner bracket, and finally applies TCN on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run TCN directly on the inner bracket level, similar to case 1 after embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length;

    we assume each dimension has the same meaning; we run TCN directly on the inner bracket level, similar to case 2 after embedding table

Parameters
  • dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • num_channels (int) – the number of channels in the TCN layer. Default is 128.

  • **kwargs – other parameters for the TCN layer.

Examples

>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleEHRDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import TCN
>>> model = TCN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{
    'loss': tensor(1.1641, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>),
    'y_prob': tensor([[0.6837],
                    [0.3081]], grad_fn=<SigmoidBackward0>),
    'y_true': tensor([[0.],
                    [1.]]),
    'logit': tensor([[ 0.7706],
                    [-0.8091]], grad_fn=<AddmmBackward0>)
}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

Trainer#

class pyhealth.trainer.Trainer(model, checkpoint_path=None, metrics=None, device=None, enable_logging=True, output_path=None, exp_name=None)[source]#

Bases: object

Trainer for PyTorch models.

Parameters
  • model (Module) – PyTorch model.

  • checkpoint_path (Optional[str]) – Path to the checkpoint. Default is None, which means the model will be randomly initialized.

  • metrics (Optional[List[str]]) – List of metric names to be calculated. Default is None, which means the default metrics in each metrics_fn will be used.

  • device (Optional[str]) – Device to be used for training. Default is None, which means the device will be GPU if available, otherwise CPU.

  • enable_logging (bool) – Whether to enable logging. Default is True.

  • output_path (Optional[str]) – Path to save the output. Default is “./output”.

  • exp_name (Optional[str]) – Name of the experiment. Default is current datetime.

train(train_dataloader, val_dataloader=None, test_dataloader=None, epochs=5, optimizer_class=<class 'torch.optim.adam.Adam'>, optimizer_params=None, weight_decay=0.0, max_grad_norm=None, monitor=None, monitor_criterion='max', load_best_model_at_last=True)[source]#

Trains the model.

Parameters
  • train_dataloader (DataLoader) – Dataloader for training.

  • val_dataloader (Optional[DataLoader]) – Dataloader for validation. Default is None.

  • test_dataloader (Optional[DataLoader]) – Dataloader for testing. Default is None.

  • epochs (int) – Number of epochs. Default is 5.

  • optimizer_class (Type[Optimizer]) – Optimizer class. Default is torch.optim.Adam.

  • optimizer_params (Optional[Dict[str, object]]) – Parameters for the optimizer. Default is {“lr”: 1e-3}.

  • weight_decay (float) – Weight decay. Default is 0.0.

  • max_grad_norm (Optional[float]) – Maximum gradient norm. Default is None.

  • monitor (Optional[str]) – Metric name to monitor. Default is None.

  • monitor_criterion (str) – Criterion to monitor. Default is “max”.

  • load_best_model_at_last (bool) – Whether to load the best model at the end of training. Default is True.

inference(dataloader, additional_outputs=None)[source]#

Model inference.

Parameters
  • dataloader – Dataloader for evaluation.

  • additional_outputs – List of additional outputs to collect. Defaults to None (treated as an empty list).

Returns

y_true_all: List of true labels. y_prob_all: List of predicted probabilities. loss_mean: Mean loss over batches. additional_outputs (only if requested): Dict of additional results.

evaluate(dataloader)[source]#

Evaluates the model.

Parameters

dataloader – Dataloader for evaluation.

Returns

a dictionary of scores.

Return type

scores

save_ckpt(ckpt_path)[source]#

Saves the model checkpoint.

Return type

None

load_ckpt(ckpt_path)[source]#

Loads the model checkpoint.

Return type

None
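A minimal usage sketch that ties the methods above together (assuming a model and dataloaders have already been built with pyhealth.models and pyhealth.datasets; the metric and monitor names below are only examples):

>>> from pyhealth.trainer import Trainer
>>> trainer = Trainer(model=model, metrics=["pr_auc", "roc_auc", "f1"])
>>> trainer.train(
...     train_dataloader=train_loader,
...     val_dataloader=val_loader,
...     epochs=5,
...     monitor="pr_auc",
...     monitor_criterion="max",
... )
>>> scores = trainer.evaluate(test_loader)  # dictionary of metric scores on the test set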

Tokenizer#

The tokenizer functionality supports token-to-index and index-to-token mapping in general ML settings.

class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#

Bases: object

Vocabulary class for mapping between tokens and indices.

add_token(token)[source]#

Adds a token to the vocabulary.
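A minimal usage sketch based only on the constructor and add_token documented here (the token values are arbitrary):

>>> from pyhealth.tokenizer import Vocabulary
>>> vocab = Vocabulary(tokens=["A03C", "A03D"], special_tokens=["<pad>", "<unk>"])
>>> vocab.add_token("B035")  # extend the vocabulary with a new token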

class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#

Bases: object

Tokenizer class for converting tokens to indices and vice versa.

This class will build a vocabulary from the provided tokens and provide the functionality to convert tokens to indices and vice versa. This class also provides the functionality to tokenize a batch of data.

Examples

>>> from pyhealth.tokenizer import Tokenizer
>>> token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
...                'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
...                'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
...                'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
...                'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
>>> tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
get_padding_index()[source]#

Returns the index of the padding token.
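For the tokenizer built above, the special tokens come first in the vocabulary, so '<pad>' occupies index 0 (consistent with the convert_indices_to_tokens example below):

>>> tokenizer.get_padding_index()
0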

get_vocabulary_size()[source]#

Returns the size of the vocabulary.

Examples

>>> tokenizer.get_vocabulary_size()
44
convert_tokens_to_indices(tokens)[source]#

Converts a list of tokens to indices.

Examples

>>> tokens = ['A03C', 'A03D', 'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'B035', 'C129']
>>> indices = tokenizer.convert_tokens_to_indices(tokens)
>>> print(indices)
[8, 9, 10, 11, 12, 13, 14, 1, 1]
Return type

List[int]

convert_indices_to_tokens(indices)[source]#

Converts a list of indices to tokens.

Examples

>>> indices = [0, 1, 2, 3, 4, 5]
>>> tokens = tokenizer.convert_indices_to_tokens(indices)
>>> print(tokens)
['<pad>', '<unk>', 'A01A', 'A02A', 'A02B', 'A02X']
Return type

List[str]

batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#

Converts a list of lists of tokens (2D) to indices.

Parameters
  • batch (List[List[str]]) – List of lists of tokens to convert to indices.

  • padding (bool) – whether to pad the tokens to the max number of tokens in the batch (smart padding).

  • truncation (bool) – whether to truncate the tokens to max_length.

  • max_length (int) – maximum length of the tokens. This argument is ignored if truncation is False.

Examples

>>> tokens = [
...     ['A03C', 'A03D', 'A03E', 'A03F'],
...     ['A04A', 'B035', 'C129']
... ]
>>> indices = tokenizer.batch_encode_2d(tokens)
>>> print ('case 1:', indices)
case 1: [[8, 9, 10, 11], [12, 1, 1, 0]]
>>> indices = tokenizer.batch_encode_2d(tokens, padding=False)
>>> print ('case 2:', indices)
case 2: [[8, 9, 10, 11], [12, 1, 1]]
>>> indices = tokenizer.batch_encode_2d(tokens, max_length=3)
>>> print ('case 3:', indices)
case 3: [[9, 10, 11], [12, 1, 1]]
batch_decode_2d(batch, padding=False)[source]#

Converts a list of lists of indices (2D) to tokens.

Parameters
  • batch (List[List[int]]) – List of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens from the tokens.

Examples

>>> indices = [
...     [8, 9, 10, 11],
...     [12, 1, 1, 0]
... ]
>>> tokens = tokenizer.batch_decode_2d(indices)
>>> print ('case 1:', tokens)
case 1: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
>>> tokens = tokenizer.batch_decode_2d(indices, padding=True)
>>> print ('case 2:', tokens)
case 2: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]
batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#

Converts a list of lists of lists of tokens (3D) to indices.

Parameters
  • batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.

  • padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).

  • truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_length

  • max_length (Tuple[int, int]) – a tuple of two integers indicating the maximum length of the tokens along the first and second dimension. This argument is ignored if truncation is False.

Examples

>>> tokens = [
...     [
...         ['A03C', 'A03D', 'A03E', 'A03F'],
...         ['A08A', 'A09A'],
...     ],
...     [
...         ['A04A', 'B035', 'C129'],
...     ]
... ]
>>> indices = tokenizer.batch_encode_3d(tokens)
>>> print ('case 1:', indices)
case 1: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, True))
>>> print ('case 2:', indices)
case 2: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(True, False))
>>> print ('case 3:', indices)
case 3: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1], [0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, False))
>>> print ('case 4:', indices)
case 4: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
>>> indices = tokenizer.batch_encode_3d(tokens, max_length=(2,2))
>>> print ('case 5:', indices)
case 5: [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]
batch_decode_3d(batch, padding=False)[source]#

Converts a list of lists of lists of indices (3D) to tokens.

Parameters
  • batch (List[List[List[int]]]) – List of lists of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens from the tokens.

Examples

>>> indices = [
...     [
...         [8, 9, 10, 11],
...         [24, 25, 0, 0]
...     ],
...     [
...         [12, 1, 1, 0],
...         [0, 0, 0, 0]
...     ]
... ]
>>> tokens = tokenizer.batch_decode_3d(indices)
>>> print ('case 1:', tokens)
case 1: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
>>> tokens = tokenizer.batch_decode_3d(indices, padding=True)
>>> print ('case 2:', tokens)
case 2: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']], [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']]]

Metrics#

We provide easy-to-use metrics (with the same style and arguments as sklearn.metrics) for binary, multiclass, and multilabel classification. For applicable tasks, we also provide uncertainty-quantification metrics covering model calibration and the evaluation of prediction sets, as well as metrics specific to healthcare, such as the drug-drug interaction (DDI) rate.

pyhealth.metrics.multiclass#

pyhealth.metrics.multiclass.multiclass_metrics_fn(y_true, y_prob, metrics=None, y_predset=None)[source]#

Computes metrics for multiclass classification.

Users can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • roc_auc_macro_ovo: area under the receiver operating characteristic curve, macro averaged over one-vs-one multiclass classification

  • roc_auc_macro_ovr: area under the receiver operating characteristic curve, macro averaged over one-vs-rest multiclass classification

  • roc_auc_weighted_ovo: area under the receiver operating characteristic curve, weighted averaged over one-vs-one multiclass classification

  • roc_auc_weighted_ovr: area under the receiver operating characteristic curve, weighted averaged over one-vs-rest multiclass classification

  • accuracy: accuracy score

  • balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)

  • f1_micro: f1 score, micro averaged

  • f1_macro: f1 score, macro averaged

  • f1_weighted: f1 score, weighted averaged

  • jaccard_micro: Jaccard similarity coefficient score, micro averaged

  • jaccard_macro: Jaccard similarity coefficient score, macro averaged

  • jaccard_weighted: Jaccard similarity coefficient score, weighted averaged

  • cohen_kappa: Cohen’s kappa score

  • brier_top1: brier score between the top prediction and the true label

  • ECE: Expected Calibration Error (with 20 equal-width bins). Check pyhealth.metrics.calibration.ece_confidence_multiclass().

  • ECE_adapt: adaptive ECE (with 20 equal-size bins). Check pyhealth.metrics.calibration.ece_confidence_multiclass().

  • cwECEt: classwise ECE with threshold=min(0.01,1/K). Check pyhealth.metrics.calibration.ece_classwise().

  • cwECEt_adapt: classwise adaptive ECE with threshold=min(0.01,1/K). Check pyhealth.metrics.calibration.ece_classwise().

The following metrics related to the prediction sets are accepted as well, but will be ignored if y_predset is None:

If no metrics are specified, accuracy, f1_macro, and f1_micro are computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples,).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_classes).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“accuracy”, “f1_macro”, “f1_micro”].

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> from pyhealth.metrics import multiclass_metrics_fn
>>> y_true = np.array([0, 1, 2, 2])
>>> y_prob = np.array([[0.9,  0.05, 0.05],
...                    [0.05, 0.9,  0.05],
...                    [0.05, 0.05, 0.9],
...                    [0.6,  0.2,  0.2]])
>>> multiclass_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}

pyhealth.metrics.multilabel#

pyhealth.metrics.multilabel.multilabel_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5, y_predset=None)[source]#

Computes metrics for multilabel classification.

Users can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • roc_auc_micro: area under the receiver operating characteristic curve, micro averaged

  • roc_auc_macro: area under the receiver operating characteristic curve, macro averaged

  • roc_auc_weighted: area under the receiver operating characteristic curve, weighted averaged

  • roc_auc_samples: area under the receiver operating characteristic curve, samples averaged

  • pr_auc_micro: area under the precision recall curve, micro averaged

  • pr_auc_macro: area under the precision recall curve, macro averaged

  • pr_auc_weighted: area under the precision recall curve, weighted averaged

  • pr_auc_samples: area under the precision recall curve, samples averaged

  • accuracy: accuracy score

  • f1_micro: f1 score, micro averaged

  • f1_macro: f1 score, macro averaged

  • f1_weighted: f1 score, weighted averaged

  • f1_samples: f1 score, samples averaged

  • precision_micro: precision score, micro averaged

  • precision_macro: precision score, macro averaged

  • precision_weighted: precision score, weighted averaged

  • precision_samples: precision score, samples averaged

  • recall_micro: recall score, micro averaged

  • recall_macro: recall score, macro averaged

  • recall_weighted: recall score, weighted averaged

  • recall_samples: recall score, samples averaged

  • jaccard_micro: Jaccard similarity coefficient score, micro averaged

  • jaccard_macro: Jaccard similarity coefficient score, macro averaged

  • jaccard_weighted: Jaccard similarity coefficient score, weighted averaged

  • jaccard_samples: Jaccard similarity coefficient score, samples averaged

  • hamming_loss: Hamming loss

  • cwECE: classwise ECE (with 20 equal-width bins). Check pyhealth.metrics.calibration.ece_classwise().

  • cwECE_adapt: classwise adaptive ECE (with 20 equal-size bins). Check pyhealth.metrics.calibration.ece_classwise().

The following metrics related to the prediction sets are accepted as well, but will be ignored if y_predset is None:
  • fp: Number of false positives.

  • tp: Number of true positives.

If no metrics are specified, pr_auc_samples is computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples, n_labels).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_labels).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“pr_auc_samples”].

  • threshold (float) – Threshold to binarize the predicted probabilities. Default is 0.5.

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> from pyhealth.metrics import multilabel_metrics_fn
>>> y_true = np.array([[0, 1, 1], [1, 0, 1]])
>>> y_prob = np.array([[0.1, 0.9, 0.8], [0.05, 0.95, 0.6]])
>>> multilabel_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.5}

pyhealth.metrics.binary#

pyhealth.metrics.binary.binary_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#

Computes metrics for binary classification.

Users can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • pr_auc: area under the precision-recall curve

  • roc_auc: area under the receiver operating characteristic curve

  • accuracy: accuracy score

  • balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)

  • f1: f1 score

  • precision: precision score

  • recall: recall score

  • cohen_kappa: Cohen’s kappa score

  • jaccard: Jaccard similarity coefficient score

  • ECE: Expected Calibration Error (with 20 equal-width bins). Check pyhealth.metrics.calibration.ece_confidence_binary().

  • ECE_adapt: adaptive ECE (with 20 equal-size bins). Check pyhealth.metrics.calibration.ece_confidence_binary().

If no metrics are specified, pr_auc, roc_auc and f1 are computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples,).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples,).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“pr_auc”, “roc_auc”, “f1”].

  • threshold (float) – Threshold for binary classification. Default is 0.5.

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> from pyhealth.metrics import binary_metrics_fn
>>> y_true = np.array([0, 0, 1, 1])
>>> y_prob = np.array([0.1, 0.4, 0.35, 0.8])
>>> binary_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}

[core] calibration#

Metrics that measure model calibration.

Reference Papers:

[1] Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” ICLR 2023.

[2] Nixon, Jeremy, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. “Measuring Calibration in Deep Learning.” In CVPR workshops, vol. 2, no. 7. 2019.

[3] Patel, Kanil, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. “Multi-class uncertainty calibration via mutual information maximization-based binning.” ICLR 2021.

[4] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On calibration of modern neural networks.” ICML 2017.

[5] Kull, Meelis, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.” Advances in neural information processing systems 32 (2019).

[6] Brier, Glenn W. “Verification of forecasts expressed in terms of probability.” Monthly weather review 78, no. 1 (1950): 1-3.

pyhealth.metrics.calibration.ece_confidence_multiclass(prob, label, bins=20, adaptive=False)[source]#

Expected Calibration Error (ECE).

We group samples into ‘bins’ based on the top-class prediction. Then, we compute the absolute difference between the average top-class prediction and the frequency of the top class being correct (i.e., the accuracy) for each bin. ECE is the average (weighted by the number of points in each bin) of these absolute differences. It can be expressed by the following formula, with \(B_m\) denoting the m-th bin:

\[ECE = \sum_{m=1}^M \frac{|B_m|}{N} |acc(B_m) - conf(B_m)|\]

Example

>>> import numpy as np
>>> pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
>>> label = np.asarray([2,1,2])
>>> ece_confidence_multiclass(pred, label, bins=2)
0.36333333333333334

Explanation of the example: The bins are [0, 0.5] and (0.5, 1]. In the first bin, we have one sample with top-class prediction of 0.49, and its accuracy is 0. In the second bin, we have average confidence of 0.7 and average accuracy of 1. Thus, the ECE is \(\frac{1}{3} \cdot 0.49 + \frac{2}{3}\cdot 0.3=0.3633\).
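
To make the binning arithmetic above concrete, here is a small self-contained numpy sketch of the same equal-width computation (it mirrors the formula, not the library's internal code):

import numpy as np

pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
label = np.asarray([2, 1, 2])

conf = pred.max(axis=1)                     # top-class confidence per sample
correct = pred.argmax(axis=1) == label      # whether the top class is the true label

ece, edges = 0.0, np.linspace(0, 1, 2 + 1)  # 2 equal-width bins: [0, 0.5] and (0.5, 1]
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (conf > lo) & (conf <= hi) if lo > 0 else conf <= hi
    if in_bin.any():
        ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())

print(ece)  # ~0.3633, matching the example above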

Parameters
  • prob (np.ndarray) – (N, C)

  • label (np.ndarray) – (N,)

  • bins (int, optional) – Number of bins. Defaults to 20.

  • adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]) If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.

pyhealth.metrics.calibration.ece_confidence_binary(prob, label, bins=20, adaptive=False)[source]#

Expected Calibration Error (ECE) for binary classification.

Similar to ece_confidence_multiclass(), but on class 1 instead of the top-prediction.

Parameters
  • prob (np.ndarray) – (N, C)

  • label (np.ndarray) – (N,)

  • bins (int, optional) – Number of bins. Defaults to 20.

  • adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]) If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.

pyhealth.metrics.calibration.ece_classwise(prob, label, bins=20, threshold=0.0, adaptive=False)[source]#

Classwise Expected Calibration Error (ECE).

This is equivalent to applying ece_confidence_binary() to each class and taking the average.

Parameters
  • prob (np.ndarray) – (N, C)

  • label (np.ndarray) – (N,)

  • bins (int, optional) – Number of bins. Defaults to 20.

  • threshold (float) – threshold to filter out samples. If the number of classes C is very large, many classes receive close to 0 prediction. Any prediction below threshold is considered noise and ignored. In recent papers, this is typically set to a small number (such as 1/C).

  • adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]) If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.

pyhealth.metrics.calibration.brier_top1(prob, label)[source]#

Brier score (i.e. mean squared error between prediction and 0-1 label) of the top prediction.
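
As a rough illustration of this definition (not the library's implementation), the quantity can be computed as follows:

import numpy as np

def brier_top1_sketch(prob, label):
    """Mean squared error between the top-class confidence and the 0/1 correctness."""
    conf = prob.max(axis=1)
    correct = (prob.argmax(axis=1) == label).astype(float)
    return np.mean((conf - correct) ** 2)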

[core] prediction_set#

pyhealth.metrics.prediction_set.size(y_pred)[source]#

Average size of the prediction set.

pyhealth.metrics.prediction_set.rejection_rate(y_pred)[source]#

Rejection rate, defined as the proportion of samples with prediction set size != 1
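
Both quantities can be computed by hand from a boolean prediction-set matrix; the data below are illustrative and the helper code is a sketch rather than the library internals:

import numpy as np

y_pred = np.asarray([[1, 0, 0], [1, 1, 0], [0, 1, 1]])  # one prediction set per row

set_sizes = y_pred.sum(axis=1)       # [1, 2, 2]
avg_size = set_sizes.mean()          # 5/3, the documented definition of size()
rejection = (set_sizes != 1).mean()  # 2/3, the documented definition of rejection_rate()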

pyhealth.metrics.prediction_set.miscoverage_ps(y_pred, y_true)[source]#

Miscoverage rates for all samples (similar to recall).

Example

>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0],[0, 1, 0]])
>>> y_true = np.asarray([1,0,1,2])
>>> miscoverage_ps(y_pred, y_true)
array([0. , 0.5, 1. ])

Explanation: For class 0, sample 1's prediction set ({0}) contains the label, so the miscoverage is 0/1=0. For class 1, sample 0's prediction set ({0}) does not contain the label, while sample 2's prediction set ({0,1}) does. Thus, the miscoverage is 1/2=0.5. For class 2, the last prediction set is {1} and the label is 2, so the miscoverage is 1/1=1.

pyhealth.metrics.prediction_set.error_ps(y_pred, y_true)[source]#

Miscoverage rates for unrejected samples, where a rejection is defined as a prediction set of size != 1.

Example

>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0],[0, 1, 0]])
>>> y_true = np.asarray([1,0,1,2])
>>> error_ps(y_pred, y_true)
array([0., 1., 1.])

Explanation: For class 0, sample 1 is correct and not rejected, so the error is 0/1=0. For class 1, sample 0 is incorrect and not rejected, while sample 2 is rejected. Thus, the error is 1/1=1. For class 2, the last sample is not rejected but its prediction set is {1}, so the error is 1/1=1.

pyhealth.metrics.prediction_set.miscoverage_overall_ps(y_pred, y_true)[source]#

Miscoverage rate for the true label. Only for multiclass.

Example

>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0]])
>>> y_true = np.asarray([1,0,1])
>>> miscoverage_overall_ps(y_pred, y_true)
0.333333

Explanation: The prediction set for sample 0 is {0} and the label is 1 (not covered). The prediction set for sample 1 is {0} and the label is 0 (covered). The prediction set for sample 2 is {0,1} and the label is 1 (covered). Thus the miscoverage rate is 1/3.

pyhealth.metrics.prediction_set.error_overall_ps(y_pred, y_true)[source]#

Overall error rate for the un-rejected samples.

Example

>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0]])
>>> y_true = np.asarray([1,0,1])
>>> error_overall_ps(y_pred, y_true)
0.5

Explanation: The prediction set for sample 0 is {0} and the label is 1, so it is an error (there is no rejection, as the prediction set contains only one class). Sample 1 is not rejected and incurs no error. Sample 2 is rejected and thus excluded from the computation.

MedCode#

We provide medical code mapping tools for (i) ontology mapping within one coding system and (ii) mapping the same concept across different coding systems.

class pyhealth.medcode.InnerMap(vocabulary, refresh_cache=False)[source]#

Bases: ABC

Contains information for a specific medical code system.

InnerMap is a base abstract class for all medical code systems. It will be instantiated as a specific medical code system with InnerMap.load(vocabulary).

Note

This class cannot be instantiated using __init__() (throws an error).

classmethod load(vocabulary, refresh_cache=False)[source]#

Initializes a specific medical code system inheriting from InnerMap.

Parameters
  • vocabulary (str) – vocabulary name. E.g., “ICD9CM”, “ICD9PROC”.

  • refresh_cache (bool) – whether to refresh the cache. Default is False.

Examples

>>> from pyhealth.medcode import InnerMap
>>> icd9cm = InnerMap.load("ICD9CM")
>>> icd9cm.lookup("428.0")
'Congestive heart failure, unspecified'
>>> icd9cm.get_ancestors("428.0")
['428', '420-429.99', '390-459.99', '001-999.99']
property available_attributes: List[str]#

Returns a list of available attributes.

Return type

List[str]

Returns

List of available attributes.

stat()[source]#

Prints statistics of the code system.

static standardize(code)[source]#

Standardizes a given code.

Subclass will override this method based on different medical code systems.

Return type

str

static convert(code, **kwargs)[source]#

Converts a given code.

Subclass will override this method based on different medical code systems.

Return type

str

lookup(code, attribute='name')[source]#

Looks up the code.

Parameters
  • code (str) – code to look up.

  • attribute (str) – attribute to look up. One of self.available_attributes. Default is “name”.

Returns

The attribute value of the code.

get_ancestors(code)[source]#

Gets the ancestors of the code.

Parameters

code (str) – code to look up.

Return type

List[str]

Returns

List of ancestors ordered from the closest to the farthest.

get_descendants(code)[source]#

Gets the descendants of the code.

Parameters

code (str) – code to look up.

Return type

List[str]

Returns

List of descendants ordered from the closest to the farthest.
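
For instance, descendants can be queried in the same way as ancestors in the InnerMap example above; the exact codes returned depend on the loaded vocabulary, so no output is asserted here:

>>> from pyhealth.medcode import InnerMap
>>> icd9cm = InnerMap.load("ICD9CM")
>>> icd9cm.get_descendants("428")  # e.g., "428.0", "428.1", ... (illustrative)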

class pyhealth.medcode.CrossMap(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#

Bases: object

Contains mapping between two medical code systems.

CrossMap is a base class for all possible mappings. It will be initialized with two specific medical code systems with CrossMap.load(source_vocabulary, target_vocabulary).

classmethod load(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#

Initializes the mapping between two medical code systems.

Parameters
  • source_vocabulary (str) – source medical code system.

  • target_vocabulary (str) – target medical code system.

  • refresh_cache (bool) – whether to refresh the cache. Default is False.

Examples

>>> from pyhealth.medcode import CrossMap
>>> mapping = CrossMap("ICD9CM", "CCSCM")
>>> mapping.map("428.0")
['108']
>>> mapping = CrossMap.load("NDC", "ATC")
>>> mapping.map("00527051210", target_kwargs={"level": 3})
['A11C']
map(source_code, source_kwargs=None, target_kwargs=None)[source]#

Maps a source code to a list of target codes.

Parameters
  • source_code (str) – source code.

  • source_kwargs (Optional[Dict]) – additional arguments for the source code. Will be passed to self.s_class.convert(). Default is empty dict.

  • target_kwargs (Optional[Dict]) – additional arguments for the target code. Will be passed to self.t_class.convert(). Default is empty dict.

Return type

List[str]

Returns

A list of target codes.

Diagnosis codes:#

class pyhealth.medcode.ICD9CM(**kwargs)[source]#

Bases: InnerMap

9-th International Classification of Diseases, Clinical Modification.

static standardize(code)[source]#

Standardizes ICD9CM code.

class pyhealth.medcode.ICD10CM(**kwargs)[source]#

Bases: InnerMap

10-th International Classification of Diseases, Clinical Modification.

static standardize(code)[source]#

Standardizes ICD10CM code.

class pyhealth.medcode.CCSCM(**kwargs)[source]#

Bases: InnerMap

Classification of Diseases, Clinical Modification.

Procedure codes:#

class pyhealth.medcode.ICD9PROC(**kwargs)[source]#

Bases: InnerMap

9-th International Classification of Diseases, Procedure.

static standardize(code)[source]#

Standardizes ICD9PROC code.

class pyhealth.medcode.ICD10PROC(**kwargs)[source]#

Bases: InnerMap

10-th International Classification of Diseases, Procedure.

class pyhealth.medcode.CCSPROC(**kwargs)[source]#

Bases: InnerMap

Classification of Diseases, Procedure.

Medication codes:#

class pyhealth.medcode.NDC(**kwargs)[source]#

Bases: InnerMap

National Drug Code.

class pyhealth.medcode.RxNorm(**kwargs)[source]#

Bases: InnerMap

RxNorm.

class pyhealth.medcode.ATC(**kwargs)[source]#

Bases: InnerMap

Anatomical Therapeutic Chemical.

static convert(code, level=5)[source]#

Convert ATC code to a specific level.
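
For example, assuming the standard ATC hierarchy (level 3 corresponds to the first four characters of a full level-5 code), the following call should truncate accordingly; the expected result is shown as a comment rather than asserted output:

>>> from pyhealth.medcode import ATC
>>> ATC.convert("A11CC05", level=3)  # expected: 'A11C'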

get_ddi(gamenet_ddi=False, refresh_cache=False)[source]#

Gets the drug-drug interactions (DDI).

Parameters
  • gamenet_ddi (bool) – Whether to use the DDI from the GAMENet paper, which is a subset of the DDI from the ATC.

  • refresh_cache (bool) – Whether to refresh the cache. Default is False.

Return type

List[str]

Calibration and Uncertainty Quantification#

This module implements the following prediction-set constructors and model calibration methods, which can be combined with any PyHealth model.

pyhealth.calib.calibration#

Model calibration methods

class pyhealth.calib.calibration.DirichletCalibration(model, debug=False, **kwargs)[source]#

Bases: PostHocCalibrator

Dirichlet Calibration

Dirichlet calibration amounts to retraining a regularized linear layer that maps the old logits to new logits. This is a calibration method for multiclass classification only.

Paper:

[1] Kull, Meelis, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.” Advances in neural information processing systems 32 (2019).

Parameters

model (BaseModel) – A trained base model.

Examples

>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import DirichletCalibration
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = DirichletCalibration(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7096615988229524, 'cwECEt_adapt': 0.05336195546573208}
calibrate(cal_dataset, lr=0.01, max_iter=128, reg_lambda=0.001)[source]#

Calibrate the base model using a calibration dataset.

Parameters
  • cal_dataset (Subset) – Calibration set.

  • lr (float, optional) – learning rate, defaults to 0.01

  • max_iter (int, optional) – maximum iterations, defaults to 128

  • reg_lambda (float, optional) – regularization coefficient on the deviation from identity matrix. defaults to 1e-3

Returns

None

Return type

None

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Parameters

**kwargs

Additional arguments to the base model.

Returns

A dictionary with all results from the base model, with the following modified:

y_prob: calibrated predicted probabilities. loss: Cross entropy loss with the new y_prob. logit: temperature-scaled logits.

Return type

Dict[str, torch.Tensor]

class pyhealth.calib.calibration.HistogramBinning(model, debug=False, **kwargs)[source]#

Bases: PostHocCalibrator

Histogram Binning

Histogram binning amounts to creating bins, computing the accuracy within each bin on the calibration dataset, and then predicting these bin-wise accuracies at test time. For multilabel/binary/multiclass classification tasks, we calibrate each class independently following [1]. Users can choose to renormalize the probability scores for multiclass tasks so that they sum to 1.
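
As a conceptual sketch of the per-class procedure (equal-width bins; hypothetical arrays, not the class's actual implementation):

import numpy as np

def histogram_binning_sketch(prob_cal, label_cal, prob_test, nbins=15):
    """Illustrative single-class histogram binning."""
    edges = np.linspace(0, 1, nbins + 1)
    # accuracy of the calibration set within each bin
    cal_bins = np.clip(np.digitize(prob_cal, edges) - 1, 0, nbins - 1)
    bin_acc = np.array([
        label_cal[cal_bins == b].mean() if np.any(cal_bins == b)
        else (edges[b] + edges[b + 1]) / 2          # empty bin: fall back to the bin center
        for b in range(nbins)
    ])
    # at test time, predict the accuracy of the bin each score falls into
    test_bins = np.clip(np.digitize(prob_test, edges) - 1, 0, nbins - 1)
    return bin_acc[test_bins]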

Paper:

[1] Gupta, Chirag, and Aaditya Ramdas. “Top-label calibration and multiclass-to-binary reductions.” ICLR 2022.

[2] Zadrozny, Bianca, and Charles Elkan. “Learning and making decisions when costs and probabilities are both unknown.” In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 204-213. 2001.

Parameters

model (BaseModel) – A trained base model.

Examples

>>> from pyhealth.datasets import ISRUCDataset, get_dataloader, split_by_patient
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import HistogramBinning
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = HistogramBinning(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7189072348464207, 'cwECEt_adapt': 0.04455814993598299}
calibrate(cal_dataset, nbins=15)[source]#

Calibrate the base model using a calibration dataset.

Parameters
  • cal_dataset (Subset) – Calibration set.

  • nbins (int, optional) – number of bins to use, defaults to 15

forward(normalization='sum', **kwargs)[source]#

Forward propagation (just like the original model).

Parameters
  • normalization (str, optional) – how to normalize the calibrated probability. Defaults to ‘sum’ (and only ‘sum’ is supported for now).

  • **kwargs

    Additional arguments to the base model.

Returns

A dictionary with all results from the base model, with the following modified:

y_prob: calibrated predicted probabilities. loss: Cross entropy loss with the new y_prob.

Return type

Dict[str, torch.Tensor]

class pyhealth.calib.calibration.KCal(model, debug=False, **kwargs)[source]#

Bases: PostHocCalibrator

Kernel-based Calibration. This is a full calibration method for multiclass classification. It tries to calibrate the predicted probabilities for all classes, by using KDE classifiers estimated from the calibration set.

Paper:

Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” ICLR 2023.

Parameters

model (BaseModel) – A trained model.

Examples

>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import KCal
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = KCal(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Alternatively, you could re-fit the reprojection:
>>> # cal_model.calibrate(cal_dataset=val_data, train_dataset=train_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7303689172252193, 'cwECEt_adapt': 0.03324275630220515}
fit(train_dataset, val_dataset=None, split_by_patient=False, dim=32, bs_pred=64, bs_supp=20, epoch_len=5000, epochs=10, load_best_model_at_last=False, **train_kwargs)[source]#

Fit the reprojection module. You don’t need to call this function - it is called in KCal.calibrate(). For training details, please refer to the paper.

Parameters
  • train_dataset (Dataset) – The training dataset.

  • val_dataset (Dataset, optional) – The validation dataset. Defaults to None.

  • split_by_patient (bool, optional) – Whether to split the dataset by patient during training. Defaults to False.

  • dim (int, optional) – The dimension of the embedding. Defaults to 32.

  • bs_pred (int, optional) – The batch size for the prediction set. Defaults to 64.

  • bs_supp (int, optional) – The batch size for the support set. Defaults to 20.

  • epoch_len (int, optional) – The number of batches in an epoch. Defaults to 5000.

  • epochs (int, optional) – The number of epochs. Defaults to 10.

  • load_best_model_at_last (bool, optional) – Whether to load the best model (or the last model). Defaults to False.

  • **train_kwargs – Other keyword arguments for pyhealth.trainer.Trainer.train().

calibrate(cal_dataset, num_fold=20, record_id_name=None, train_dataset=None, train_split_by_patient=False, load_best_model_at_last=True, **train_kwargs)[source]#

Calibrate using a calibration dataset. If train_dataset is not None, it will be used to fit a re-projection from the base model embeddings. In either case, the calibration set will be used to construct the KDE classifier.

Parameters
  • cal_dataset (Subset) – Calibration set.

  • record_id_name (str, optional) – the key/name of the unique index for records. Defaults to None.

  • train_dataset (Subset, optional) – Dataset to train the reprojection. Defaults to None (no training).

  • train_split_by_patient (bool, optional) – Whether to split by patient when training the embeddings. That is, do we use samples from the same patient in KDE during training. Defaults to False.

  • load_best_model_at_last (bool, optional) – Whether to load the best reprojection basing on the calibration set. Defaults to True.

  • train_kwargs (dict, optional) – Additional arguments for training the reprojection. Passed to KCal.fit()

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Parameters

**kwargs

Additional arguments to the base model.

Returns

A dictionary with all results from the base model, with the following modified:

y_prob: calibrated predicted probabilities. loss: Cross entropy loss with the new y_prob.

Return type

Dict[str, torch.Tensor]

class pyhealth.calib.calibration.TemperatureScaling(model, debug=False, **kwargs)[source]#

Bases: PostHocCalibrator

Temperature Scaling

Temperature scaling refers to scaling the logits by a “temperature” tuned on the calibration set. For binary classification tasks, this amounts to Platt scaling. For multilabel classification, users can use one temperature for all classes, or one for each. For multiclass classification, this is a confidence calibration method: it tries to calibrate the predicted class's predicted probability.
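
The core idea can be sketched in a few lines of PyTorch; this is a simplified illustration on hypothetical logits/labels tensors, not the class's actual implementation:

import torch
import torch.nn.functional as F

def fit_temperature_sketch(logits, labels, lr=0.01, max_iter=50):
    """Tune a single scalar temperature T on a calibration set."""
    log_t = torch.zeros(1, requires_grad=True)           # T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()  # calibrated probabilities: softmax(logits / T)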

Paper:

[1] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On calibration of modern neural networks.” ICML 2017.

[2] Platt, John. “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.” Advances in large margin classifiers 10, no. 3 (1999): 61-74.

Parameters

model (BaseModel) – A trained base model.

Examples

>>> from pyhealth.datasets import ISRUCDataset, get_dataloader, split_by_patient
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import TemperatureScaling
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = TemperatureScaling(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.709843241966832, 'cwECEt_adapt': 0.051673596521491505}
calibrate(cal_dataset, lr=0.01, max_iter=50, mult_temp=False)[source]#

Calibrate the base model using a calibration dataset.

Parameters
  • cal_dataset (Subset) – Calibration set.

  • lr (float, optional) – learning rate, defaults to 0.01

  • max_iter (int, optional) – maximum iterations, defaults to 50

  • mult_temp (bool, optional) – if True and mode=’multilabel’, use one temperature per class instead of a single shared temperature. Defaults to False.

Returns

None

Return type

None

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Parameters

**kwargs

Additional arguments to the base model.

Returns

A dictionary with all results from the base model, with the following modified:

y_prob: calibrated predicted probabilities. loss: Cross entropy loss with the new y_prob. logit: temperature-scaled logits.

Return type

Dict[str, torch.Tensor]

pyhealth.calib.predictionset#

Prediction set construction methods

class pyhealth.calib.predictionset.LABEL(model, alpha, debug=False, **kwargs)[source]#

Bases: SetPredictor

LABEL: Least ambiguous set-valued classifiers with bounded error levels.

This is a prediction-set constructor for multi-class classification problems. It controls either \(\mathbb{P}\{Y \not \in C(X) | Y=k\}\leq \alpha_k\) (when alpha is an array), or \(\mathbb{P}\{Y \not \in C(X)\}\leq \alpha\) (when alpha is a float). Here, \(C(X)\) denotes the final prediction set. This is essentially a split conformal prediction method using the predicted scores.
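
For a single overall alpha, the underlying split-conformal step can be sketched as follows (an illustration with hypothetical calibration arrays, not the class's exact code):

import numpy as np

def label_threshold_sketch(cal_prob, cal_label, alpha=0.1):
    """Pick a score threshold t so that P(true-class score < t) <= alpha."""
    n = len(cal_label)
    true_scores = np.sort(cal_prob[np.arange(n), cal_label])
    k = int(np.floor(alpha * (n + 1)))   # conservative rank of the threshold
    return 0.0 if k == 0 else true_scores[k - 1]

# prediction set for a test sample x: {k : prob_test[k] >= t}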

Paper:

Sadinle, Mauricio, Jing Lei, and Larry Wasserman. “Least ambiguous set-valued classifiers with bounded error levels.” Journal of the American Statistical Association 114, no. 525 (2019): 223-234.

Parameters
  • model (BaseModel) – A trained base model.

  • alpha (Union[float, np.ndarray]) – Target mis-coverage rate(s).

Examples

>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.predictionset import LABEL
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate the set classifier, with different class-specific mis-coverage rates
>>> cal_model = LABEL(model, [0.15, 0.3, 0.15, 0.15, 0.15])
>>> # Note that we used the test set here because ISRUCDataset has relatively few
>>> # patients, and calibration set should be different from the validation set
>>> # if the latter is used to pick checkpoint. In general, the calibration set
>>> # should be something exchangeable with the test set. Please refer to the paper.
>>> cal_model.calibrate(cal_dataset=test_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer, get_metrics_fn
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(test_dl, additional_outputs=['y_predset'])
>>> print(get_metrics_fn(cal_model.mode)(
... y_true_all, y_prob_all, metrics=['accuracy', 'miscoverage_ps'],
... y_predset=extra_output['y_predset'])
... )
{'accuracy': 0.709843241966832, 'miscoverage_ps': array([0.1499847 , 0.29997638, 0.14993964, 0.14994704, 0.14999252])}
calibrate(cal_dataset)[source]#

Calibrate the thresholds used to construct the prediction set.

Parameters

cal_dataset (Subset) – Calibration set.

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Returns

A dictionary with all results from the base model, with the following updates:

y_predset: a bool tensor representing the prediction for each class.

Return type

Dict[str, torch.Tensor]

class pyhealth.calib.predictionset.SCRIB(model, risk, loss_kwargs=None, debug=False, fill_max=True, **kwargs)[source]#

Bases: SetPredictor

SCRIB: Set-classifier with Class-specific Risk Bounds

This is a prediction-set constructor for multi-class classification problems. SCRIB tries to control the class-specific risk while minimizing ambiguity. To do this, it selects class-specific thresholds for the predictions on a calibration set.

If risk is a float (say 0.1), SCRIB controls the overall risk: \(\mathbb{P}\{Y \not \in C(X) | |C(X)| = 1\}\leq risk\). If risk is an array (say np.asarray([0.1] * 5)), SCRIB controls the class-specific risks: \(\mathbb{P}\{Y \not \in C(X) | Y=k \land |C(X)| = 1\}\leq risk_k\). Here, \(C(X)\) denotes the final prediction set.

Paper:

Lin, Zhen, Lucas Glass, M. Brandon Westover, Cao Xiao, and Jimeng Sun. “SCRIB: Set-classifier with Class-specific Risk Bounds for Blackbox Models.” AAAI 2022.

Parameters
  • model (BaseModel) – A trained model.

  • risk (Union[float, np.ndarray]) – risk targets.

  • loss_kwargs (dict, optional) –

    Additional loss parameters (including hyperparameters). It could contain the following float/int hyperparameters:

    lk: The coefficient for the loss term associated with the risk violation penalty. The higher the lk, the more penalty on risk violation (likely higher ambiguity).

    fill_max: Whether to fill the class with the max predicted score when no class exceeds the threshold. In other words, if fill_max, the null region will be filled with the max-prediction class.

    Defaults to {‘lk’: 1e4, ‘fill_max’: False}

  • fill_max (bool, optional) – Whether to fill the empty prediction set with the max-predicted class. Defaults to True.

Examples

>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.predictionset import SCRIB
>>> from pyhealth.trainer import get_metrics_fn
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate the set classifier, with different class-specific risk targets
>>> cal_model = SCRIB(model, [0.2, 0.3, 0.1, 0.2, 0.1])
>>> # Note that we used the test set here because ISRUCDataset has relatively few
>>> # patients, and calibration set should be different from the validation set
>>> # if the latter is used to pick checkpoint. In general, the calibration set
>>> # should be something exchangeable with the test set. Please refer to the paper.
>>> cal_model.calibrate(cal_dataset=test_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(test_dl, additional_outputs=['y_predset'])
>>> print(get_metrics_fn(cal_model.mode)(
... y_true_all, y_prob_all, metrics=['accuracy', 'error_ps', 'rejection_rate'],
... y_predset=extra_output['y_predset'])
... )
{'accuracy': 0.709843241966832, 'rejection_rate': 0.6381305287631919,
'error_ps': array([0.32161874, 0.36654135, 0.11461734, 0.23728814, 0.14993925])}
calibrate(cal_dataset)[source]#

Calibrate/Search for the thresholds used to construct the prediction set.

Parameters

cal_dataset (Subset) – Calibration set.

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Returns

A dictionary with all results from the base model, with the following updates:

y_predset: a bool tensor representing the prediction for each class.

Return type

Dict[str, torch.Tensor]

class pyhealth.calib.predictionset.FavMac(model, value_weights=1.0, cost_weights=1.0, target_cost=1.0, delta=None, debug=False, **kwargs)[source]#

Bases: SetPredictor

Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control (FavMac)

This is a prediction-set constructor for multi-label classification problems. FavMac could control the cost/risk while realizing high value on the prediction set.

Value and cost functions are functions in the form of \(V(S;Y)\) or \(C(S;Y)\), with S being the prediction set and Y being the label. For example, a classical cost function would be the “number of false positives”. Denoting the target_cost as \(c\): if delta=None, FavMac controls the expected cost in the following sense:

\(\mathbb{E}[C(S_{N+1};Y_{N+1})] \leq c\).

Otherwise, FavMac controls the violation probability in the following sense:

\(\mathbb{P}\{C(S_{N+1};Y_{N+1})>c\}\leq \delta\).

Right now, this FavMac implementation only supports additive value and cost functions (unlike the implementation associated with [1]). That is, the value function is specified by the weights value_weights and the cost function is specified by cost_weights. With \(k\) denoting classes, the cost function is then computed as

\(C(S;Y,w) = \sum_{k} (1-Y_k)S_k w_k\)

Similarly, the value function is computed as

\(V(S;Y,w) = \sum_{k} Y_k S_k w_k\).
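
With these additive definitions, the cost and value of a candidate prediction set are straightforward to compute; the arrays below are illustrative:

import numpy as np

S = np.array([1, 1, 0, 1])           # prediction set (multi-label indicator)
Y = np.array([1, 0, 0, 1])           # true label vector
w = np.ones_like(Y, dtype=float)     # per-class weights (cost_weights / value_weights)

cost = np.sum((1 - Y) * S * w)       # C(S;Y,w): weighted false positives -> 1.0
value = np.sum(Y * S * w)            # V(S;Y,w): weighted true positives  -> 2.0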

Papers:

[1] Lin, Zhen, Shubhendu Trivedi, Cao Xiao, and Jimeng Sun. “Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control (FavMac).” ICML 2023.

[2] Fisch, Adam, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. “Conformal prediction sets with limited false positives.” ICML 2022.

Parameters
  • model (BaseModel) – A trained model.

  • value_weights (Union[float, np.ndarray]) – weights for the value function. See description above. Defaults to 1.

  • cost_weights (Union[float, np.ndarray]) – weights for the cost function. See description above. Defaults to 1.

  • target_cost (float) – Target cost. When cost_weights is set to 1, this is essentially the number of false positive. Defaults to 1.

  • delta (float) – Violation target (in violation control). Defaults to None (which means expectation control instead of violation control).

Examples

>>> from pyhealth.calib.predictionset import FavMac
>>> from pyhealth.datasets import (MIMIC3Dataset, get_dataloader,split_by_patient)
>>> from pyhealth.models import Transformer
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> from pyhealth.trainer import get_metrics_fn
>>> base_dataset = MIMIC3Dataset(
...     root="/srv/scratch1/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     refresh_cache=False)
>>> sample_dataset = base_dataset.set_task(drug_recommendation_mimic3_fn)
>>> train_data, val_data, test_data = split_by_patient(sample_dataset, [0.6, 0.2, 0.2])
>>> model = Transformer(dataset=sample_dataset, feature_keys=["conditions", "procedures"],
...             label_key="drugs", mode="multilabel")
>>> # ... Train the model here ...
>>> # Try to control false positive to <=3
>>> cal_model = FavMac(model, target_cost=3, delta=None)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(
... test_dl, additional_outputs=["y_predset"])
>>> print(get_metrics_fn(cal_model.mode)(
...     y_true_all, y_prob_all, metrics=['tp', 'fp'],
...     y_predset=extra_output["y_predset"])) # We get FP~=3
{'tp': 0.5049893086243763, 'fp': 2.8442622950819674}
calibrate(cal_dataset)[source]#

Calibrate the cost-control procedure.

Parameters

cal_dataset (Subset) – Calibration set.

forward(**kwargs)[source]#

Forward propagation (just like the original model).

Returns

A dictionary with all results from the base model, with the following updates:

y_predset: a bool tensor representing the prediction for each class.

Return type

Dict[str, torch.Tensor]

PyHealth live#

Start Time: 8 PM Central Time, Wednesday

Recurrence: There is no live session for now.

Zoom: Join Link

Add to Google Calendar: Invitation

Add to Microsoft Outlook (.ics): Invitation

YouTube: Recorded Live Sessions

User/Developer Slack: Click to join

Schedules#

(Dec 21, 2022) Live 01 - What is PyHealth and How to Get Started? [Recap]

(Dec 28, 2022) Live 02 - Data & Datasets & Tasks: store unstructured data in a structured way. [Recap I] [II] [III] [IV]

(Jan 4, 2023) Live 03 - Models & Trainer & Metrics: initialize and train a deep learning model. [Recap I] [II] [III]

(Jan 11, 2023) Live 04 - Tokenizer & Medcode: master the medical code lookup and mapping [Recap I] [II]

(Jan 18, 2023) Live 05 - PyHealth can support a complete healthcare ML pipeline [Recap I] [II]

(Jan 25, 2023) Live 06 - Fit your own dataset into pipeline and use our model [Recap]

(Feb 1, 2023) Live 07 - Adopt your customized model and quickly try it on our data [Recap]

(Feb 8, 2023) Live 08 - New feature: support for biosignal data (EEG, ECG, etc.) classification [Recap I] [II]

(Feb 15, 2023) Live 09 - New feature: parallel and faster data loading architecture

(Feb 22, 2023) Live 10 - Add a covid prediction benchmark (new datasets, new models)

Development logs#

We track the new development here:

May 25, 2023

1. add dirichlet calibration `PR #159`

May 9, 2023

1. add MIMIC-Extract dataset  `#136`
2. add new maintainer members for pyhealth: Junyi Gao and Benjamin Danek

May 6, 2023

1. add new parser functions (admissionDx, diagnosisStrings) and prediction tasks for eICU dataset `#148`

Apr 27, 2023

1. add MoleRec model (WWW'23) for drug recommendation `#122`

Apr 26, 2023

1. fix bugs in GRASP model `#141`
2. add pandas install <2 constraints `#135`
3. add hcpcsevents table process in MIMIC4 dataset `#134`

Apr 10, 2023

1. fix Ambiguous datetime usage in eICU (https://github.com/sunlabuiuc/PyHealth/pull/132)

Mar 26, 2023

1. add the entire uncertainty quantification module (https://github.com/sunlabuiuc/PyHealth/pull/111)

Feb 26, 2023

1. add 6 EHR prediction models: Adacare, Concare, Stagenet, TCN, Grasp, Agent

Feb 24, 2023

1. add unittest for omop dataset
2. add github action triggered manually, check `#104`

Feb 19, 2023

1. add unittest for eicu dataset
2. add ISRUC dataset (and task function) for signal learning

Feb 12, 2023

1. add unittest for mimiciii, mimiciv
2. add SHHS datasets for sleep staging task
3. add SparcNet model for signal classification task

Feb 08, 2023

1. complete the biosignal data support, add the ContraWR [1] model for general-purpose biosignal classification ([1] Yang, Chaoqi, Danica Xiao, M. Brandon Westover, and Jimeng Sun. "Self-supervised EEG representation learning for automatic sleep staging." arXiv preprint arXiv:2110.15278 (2021).)

Feb 07, 2023

1. Support signal dataset processing and split: add SampleSignalDataset, BaseSignalDataset. Use SleepEDFcassette dataset as the first signal dataset. Use example/sleep_staging_sleepEDF_contrawr.py
2. rename the dataset/ parts: the previous BaseDataset becomes BaseEHRDataset and SampleDataset becomes SampleEHRDataset. Right now, BaseDataset will be inherited by BaseEHRDataset and BaseSignalDataset. SampleBaseDataset will be inherited by SampleEHRDataset and SampleSignalDataset.

Feb 06, 2023

1. improve readme style
2. add the pyhealth live 06 and 07 link to pyhealth live

Feb 01, 2023

1. add unittest of PyHealth MedCode and Tokenizer

Jan 26, 2023

1. accelerate MIMIC-IV, eICU and OMOP data loading by using multiprocessing (pandarallel)

Jan 25, 2023

1. accelerate the MIMIC-III data loading process by using multiprocessing (pandarallel)

Jan 24, 2023

1. Fix the code typo in pyhealth/tasks/drug_recommendation.py for issue `#71`.
2. update the pyhealth live schedule

Jan 22, 2023

1. Fix the list of list of vector problem in RNN, Transformer, RETAIN, and CNN
2. Add initialization examples for RNN, Transformer, RETAIN, CNN, and Deepr
3. (minor) change the parameters from "Type" and "level" to "type_" and "dim_"
4. BPDanek adds the "__repr__" function to medcode for better print understanding
5. add unittest for pyhealth.data

Jan 21, 2023

1. Added a new model, Deepr (models.Deepr)

Jan 20, 2023

1. add the pyhealth live 05
2. add slack channel invitation in pyhealth live page

Jan 13, 2023

1. add the pyhealth live 03 and 04 video links to the navigation
2. add future pyhealth live schedule

Jan 8, 2023

1. Changed BaseModel.add_feature_transform_layer in models/base_model.py so that it accepts special_tokens if necessary
2. fix an int/float bug in dataset checking (transform int to float and then process them uniformly)

Dec 26, 2022

1. add examples to pyhealth.data, pyhealth.datasets
2. improve jupyter notebook tutorials 0, 1, 2

Dec 21, 2022

1. add the development logs to the navigation
2. add the pyhealth live schedule to the navigation

About us#

We are the SunLab healthcare research team at UIUC.

*Zhenbang Wu (Ph.D. Student @ University of Illinois Urbana-Champaign)

*Chaoqi Yang (Ph.D. Student @ University of Illinois Urbana-Champaign)

Patrick Jiang (M.S. Student @ University of Illinois Urbana-Champaign)

Zhen Lin (Ph.D. Student @ University of Illinois Urbana-Champaign)

Junyi Gao (M.S. Student @ UIUC, Ph.D. Student @ University of Edinburgh)

Benjamin Danek (M.S. Student @ University of Illinois Urbana-Champaign)

Jimeng Sun (Professor @ University of Illinois Urbana-Champaign)

(* indicates equal contribution)


Acknowledgement#

Yue Zhao (Ph.D. Student @ Carnegie Mellon University)

Dr. Zhi Qiao (Associate ML Director, ACOE @ IQVIA)

Dr. Xiao Cao (VP of Machine Learning and NLP, Relativity)

Xiyang Hu (Ph.D. Student @ Carnegie Mellon University)