Welcome to PyHealth!#


PyHealth is designed for both ML researchers and medical practitioners. It makes healthcare AI applications easier to develop, test, and validate, and keeps your development process flexible and customizable. [GitHub]


[News!] We are running the “PyHealth Live” gathering at 8 PM Central Time every Wednesday night! You are welcome to join the live discussion over Zoom. You may also add the schedule to your Google Calendar or Microsoft Outlook (.ics).

FYI, the PyHealth Weekly Live sessions introduce the basic pyhealth modules sequentially and showcase newly developed functions as well as different use cases. For efficiency, each live session lasts around half an hour, and the video collection can be found on YouTube. The upcoming topics are announced here. Hope to see you all every Wednesday night!

Introduction [Video]#

PyHealth supports diverse electronic health records (EHRs), such as MIMIC, eICU, and any OMOP-CDM-based database, and provides various advanced deep learning algorithms for important healthcare tasks such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.

Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.

Modules#

All healthcare tasks in our package follow a five-stage pipeline:

load dataset -> define task function -> build ML/DL model -> model training -> inference

We try hard to keep each stage as separate as possible, so that users can customize their own pipeline by using only our data processing steps or only our ML models. Each stage calls one module; we introduce them below with an example.

An ML Pipeline Example#

  • STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.

from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to ATC 3-rd level codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)

Users can also load their own dataset into our <pyhealth.datasets.SampleDataset> structure and then follow the same pipeline below; see the Tutorial and the sketch right after this paragraph.
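For example, here is a minimal sketch of wrapping custom samples (the feature and label keys below are illustrative; per the SampleDataset docs, only patient_id and visit_id are required keys):

from pyhealth.datasets import SampleDataset

# each sample is a dict with patient_id, visit_id, and task-specific keys
samples = [
    {
        "patient_id": "p001",
        "visit_id": "v001",
        "conditions": ["428.0", "401.9"],  # a list of codes
        "label": 1,
    },
]
custom_dataset = SampleDataset(samples=samples, dataset_name="my-ehr", task_name="demo")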

  • STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object as input and defines how to process each patient’s data into a set of samples for the task. In the package, we provide several task examples, such as drug recommendation and length of stay prediction.

from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader

mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])

# create dataloaders (torch.utils.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
  • STEP 3: <pyhealth.models> provides the healthcare ML models. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized ML architectures. Our model layers can be used as easily as torch.nn.Linear.

from pyhealth.models import Transformer

model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
  • STEP 4: <pyhealth.trainer> is the training manager. It takes the train_loader, the val_loader, and a validation metric, and lets you specify other arguments, such as epochs, optimizer, and learning rate. The trainer automatically saves the best model and outputs its path at the end.

from pyhealth.trainer import Trainer

trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
  • STEP 5: <pyhealth.metrics> provides several common evaluation metrics (refer to Doc to see what is available) and healthcare-specific metrics, such as the drug-drug interaction (DDI) rate.

trainer.evaluate(test_loader)
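If you need raw predictions rather than aggregated scores, here is a hedged sketch. It assumes Trainer.inference returns the stacked labels, predicted probabilities, and average loss for a dataloader, and that multilabel_metrics_fn is exposed by pyhealth.metrics:

from pyhealth.metrics import multilabel_metrics_fn

# assumption: Trainer.inference returns (y_true, y_prob, avg_loss)
y_true, y_prob, _ = trainer.inference(test_loader)
print(multilabel_metrics_fn(y_true, y_prob, metrics=["pr_auc_samples", "f1_samples"]))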

Medical Code Map#

  • <pyhealth.codemap> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be independently applied to your research.

  • For code mapping between two coding systems

from pyhealth.medcode import CrossMap

codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict

codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
  • For code ontology lookup within one system

from pyhealth.medcode import InnerMap

icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get parents
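A few more lookups in the same spirit (hedged examples; the exact outputs depend on the bundled vocabulary files):

icd9cm.get_descendants("428.0") # get child codes

atc = InnerMap.load("ATC")
atc.lookup("M01AE51") # look up an ATC code, e.g., returns a drug name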

Medical Code Tokenizer#

  • <pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.

from pyhealth.tokenizer import Tokenizer

# Example: we use a list of ATC-3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', \
        'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C', \
        'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])

# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]

# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
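The 1D and 3D helpers follow the same pattern. A short sketch with illustrative inputs (assuming the Tokenizer exposes convert_tokens_to_indices / convert_indices_to_tokens and batch_encode_3d):

# 1d encode / decode
indices = tokenizer.convert_tokens_to_indices(['A03C', 'B035'])
tokens = tokenizer.convert_indices_to_tokens(indices)

# 3d encode, e.g., patients -> visits -> codes
tokens = [[['A03C', 'A03D'], ['A04A']], [['A05A']]]
indices = tokenizer.batch_encode_3d(tokens)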

Users can customize their healthcare AI pipeline as simply as calling one module:

  • process your OMOP data via pyhealth.datasets

  • process the open EHR data (e.g., MIMIC, eICU) via pyhealth.datasets

  • define your own task on existing databases via pyhealth.tasks

  • use existing healthcare models or build upon them (e.g., RETAIN) via pyhealth.models.

  • map codes between coding systems for conditions and medications via pyhealth.codemap.


Datasets#

We provide the following datasets for general purpose healthcare AI research:

| Dataset | Module | Year | Information |
|---|---|---|---|
| MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | MIMIC-III Clinical Database |
| MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | MIMIC-IV Clinical Database |
| eICU | pyhealth.datasets.eICUDataset | 2018 | eICU Collaborative Research Database |
| OMOP | pyhealth.datasets.OMOPDataset | – | OMOP-CDM schema based dataset |

Machine/Deep Learning Models#

| Model Name | Type | Module | Year | Reference |
|---|---|---|---|---|
| Convolutional Neural Network (CNN) | deep learning | pyhealth.models.CNN | 1989 | Handwritten Digit Recognition with a Back-Propagation Network |
| Recurrent Neural Nets (RNN) | deep learning | pyhealth.models.RNN | 2011 | Recurrent neural network based language model |
| Transformer | deep learning | pyhealth.models.Transformer | 2017 | Attention Is All You Need |
| RETAIN | deep learning | pyhealth.models.RETAIN | 2016 | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism |
| GAMENet | deep learning | pyhealth.models.GAMENet | 2019 | GAMENet: Graph Augmented Memory Networks for Recommending Medication Combination |
| MICRON | deep learning | pyhealth.models.MICRON | 2021 | Change Matters: Medication Change Prediction with Recurrent Residual Networks |
| SafeDrug | deep learning | pyhealth.models.SafeDrug | 2021 | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations |

Benchmark on Healthcare Tasks#

  • Here is our benchmark doc on healthcare tasks. You can also check the results below.

We also provide a function for leaderboard generation; check it out in our GitHub repo.

Here are the dynamic visualizations of the leaderboard. You can click the checkboxes and easily compare the performance of different models on different tasks and datasets!

import sys
sys.path.append('../..')  # make the repo-level leaderboard package importable

from leaderboard import leaderboard_gen, utils

# construct the default arguments and generate the leaderboard plots
args = leaderboard_gen.construct_args()
leaderboard_gen.plots_generation(args)

Installation#

You can install from PyPI:

pip install pyhealth

or from the GitHub source:

git clone https://github.com/sunlabuiuc/PyHealth.git
cd PyHealth
pip install .

Required Dependencies:

python>=3.8
torch>=1.8.0
rdkit>=2022.03.4
scikit-learn>=0.24.2
networkx>=2.6.3
pandas>=1.3.2
tqdm

Warning 1:

PyHealth has multiple neural-network-based models, e.g., LSTM, which are implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you; this reduces the risk of interfering with your local copies. If you want to use neural-network-based models, please make sure PyTorch is installed. Similarly, models depending on xgboost will NOT enforce an xgboost installation by default.

CUDA Setting:

To run PyHealth, you also need CUDA and a cudatoolkit version that supports your GPU. More info

For example, if you use NVIDIA RTX A6000 as your GPU for training, you should install a compatible cudatoolkit using:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

Tutorials#

We provide the following tutorials to help users get started with PyHealth.

Tutorial 0: Introduction to pyhealth.data [Video]

Tutorial 1: Introduction to pyhealth.datasets [Video]

Tutorial 2: Introduction to pyhealth.tasks [Video]

Tutorial 3: Introduction to pyhealth.models [Video]

Tutorial 4: Introduction to pyhealth.trainer [Video]

Tutorial 5: Introduction to pyhealth.metrics [Video]

Tutorial 6: Introduction to pyhealth.tokenizer [Video]

Tutorial 7: Introduction to pyhealth.medcode [Video]

The following tutorials will help users build their own task pipelines. [Video]

Pipeline 1: Drug Recommendation

Pipeline 2: Length of Stay Prediction

Pipeline 3: Readmission Prediction

Pipeline 4: Mortality Prediction


Advanced Tutorials#

We provide the following advanced tutorials to support various needs.

Advanced Tutorial 1: Fit your dataset into our pipeline

Advanced Tutorial 2: Define your own healthcare task

Advanced Tutorial 3: Adopt customized model into pyhealth

Advanced Tutorial 4: Load your own processed data into pyhealth and try out our ML models


Data#

pyhealth.data defines the atomic data structures of this package.

pyhealth.data.Event#

One basic data structure in the package. It is a simple container for a single event. It contains all necessary attributes for supporting various healthcare tasks.

class pyhealth.data.Event(code, table, vocabulary, visit_id, patient_id, timestamp=None, **attr)[source]#

Bases: object

Contains information about a single event.

An event can be anything from a diagnosis to a prescription or a lab test that happened in a visit of a patient at a specific time.

Parameters
  • code (str) – code of the event. E.g., “428.0” for congestive heart failure.

  • table (str) – name of the table where the event is recorded. This corresponds to the raw csv file name in the dataset. E.g., “DIAGNOSES_ICD”.

  • vocabulary (str) – vocabulary of the code. E.g., “ICD9CM” for ICD-9 diagnosis codes.

  • visit_id (str) – unique identifier of the visit.

  • patient_id (str) – unique identifier of the patient.

  • timestamp (Optional[datetime]) – timestamp of the event. Default is None.

  • **attr – optional attributes to add to the event as key=value pairs.

attr_dict#

Dict, dictionary of event attributes. Each key is an attribute name and each value is the attribute’s value.

Examples

>>> from pyhealth.data import Event
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> event
Event with NDC code 00069153041 from table PRESCRIPTIONS
>>> event.attr_dict
{'dosage': '250mg'}

pyhealth.data.Visit#

Another basic data structure in the package. A Visit is a single encounter in a hospital. It is a container for a sequence of Event objects for each information aspect, such as diagnoses or medications. It also contains other necessary attributes for supporting healthcare tasks, such as the date of the visit.

class pyhealth.data.Visit(visit_id, patient_id, encounter_time=None, discharge_time=None, discharge_status=None, **attr)[source]#

Bases: object

Contains information about a single visit.

A visit is a period of time in which a patient is admitted to a hospital or a specific department. Each visit is associated with a patient and contains a list of different events.

Parameters
  • visit_id (str) – unique identifier of the visit.

  • patient_id (str) – unique identifier of the patient.

  • encounter_time (Optional[datetime]) – timestamp of visit’s encounter. Default is None.

  • discharge_time (Optional[datetime]) – timestamp of visit’s discharge. Default is None.

  • discharge_status – patient’s status upon discharge. Default is None.

  • **attr – optional attributes to add to the visit as key=value pairs.

attr_dict#

Dict, dictionary of visit attributes. Each key is an attribute name and each value is the attribute’s value.

event_list_dict#

Dict[str, List[Event]], dictionary of event lists. Each key is a table name and each value is a list of events from that table ordered by timestamp.

Examples

>>> from pyhealth.data import Event, Visit
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> visit
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> visit.available_tables
['PRESCRIPTIONS']
>>> visit.num_events
1
>>> visit.get_event_list('PRESCRIPTIONS')
[Event with NDC code 00069153041 from table PRESCRIPTIONS]
>>> visit.get_code_list('PRESCRIPTIONS')
['00069153041']
add_event(event)[source]#

Adds an event to the visit.

If the event’s table is not in the visit’s event list dictionary, it is added as a new key. The event is then added to the list of events of that table.

Parameters

event (Event) – event to add.

Note

As for now, there is no check on the order of the events. The new event is simply appended to the end of the list.

Return type

None

get_event_list(table)[source]#

Returns a list of events from a specific table.

If the table is not in the visit’s event list dictionary, an empty list is returned.

Parameters

table (str) – name of the table.

Return type

List[Event]

Returns

List of events from the specified table.

Note

As for now, there is no check on the order of the events. The list of events is simply returned as is.

get_code_list(table, remove_duplicate=True)[source]#

Returns a list of codes from a specific table.

If the table is not in the visit’s event list dictionary, an empty list is returned.

Parameters
  • table (str) – name of the table.

  • remove_duplicate (Optional[bool]) – whether to remove duplicate codes (but keep the relative order). Default is True.

Return type

List[str]

Returns

List of codes from the specified table.

Note

As for now, there is no check on the order of the codes. The list of codes is simply returned as is.

set_event_list(table, event_list)[source]#

Sets the list of events from a specific table.

This function will overwrite any existing list of events from the specified table.

Parameters
  • table (str) – name of the table.

  • event_list (List[Event]) – list of events to set.

Note

As for now, there is no check on the order of the events. The list of events is simply set as is.

Return type

None

property available_tables: List[str]#

Returns a list of available tables for the visit.

Return type

List[str]

Returns

List of available tables.

property num_events: int#

Returns the total number of events in the visit.

Return type

int

Returns

Total number of events.

pyhealth.data.Patient#

Another basic data structure in the package. A Patient is a collection of Visit objects for the same patient. It contains all necessary attributes of a patient, such as ethnicity, mortality status, gender, etc., and can support various healthcare tasks.

class pyhealth.data.Patient(patient_id, birth_datetime=None, death_datetime=None, gender=None, ethnicity=None, **attr)[source]#

Bases: object

Contains information about a single patient.

A patient is a person who is admitted at least once to a hospital or a specific department. Each patient is associated with a list of visits.

Parameters
  • patient_id (str) – unique identifier of the patient.

  • birth_datetime (Optional[datetime]) – timestamp of patient’s birth. Default is None.

  • death_datetime (Optional[datetime]) – timestamp of patient’s death. Default is None.

  • gender – gender of the patient. Default is None.

  • ethnicity – ethnicity of the patient. Default is None.

  • **attr – optional attributes to add to the patient as key=value pairs.

attr_dict#

Dict, dictionary of patient attributes. Each key is an attribute name and each value is the attribute’s value.

visits#

OrderedDict[str, Visit], an ordered dictionary of visits. Each key is a visit_id and each value is a visit.

index_to_visit_id#

Dict[int, str], dictionary that maps the index of a visit in the visits list to the corresponding visit_id.

Examples

>>> from pyhealth.data import Event, Visit, Patient
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> patient = Patient(
...     patient_id="p001",
... )
>>> patient.add_visit(visit)
>>> patient
Patient p001 with 1 visits
add_visit(visit)[source]#

Adds a visit to the patient.

If the visit’s visit_id is already in the patient’s visits dictionary, it will be overwritten by the new visit.

Parameters

visit (Visit) – visit to add.

Note

As for now, there is no check on the order of the visits. The new visit is simply added to the end of the ordered dictionary of visits.

Return type

None

add_event(event)[source]#

Adds an event to the patient.

If the event’s visit_id is not in the patient’s visits dictionary, this function will raise KeyError.

Parameters

event (Event) – event to add.

Note

As for now, there is no check on the order of the events. The new event is simply appended to the end of the list of events of the corresponding visit.

Return type

None

get_visit_by_id(visit_id)[source]#

Returns a visit by visit_id.

Parameters

visit_id (str) – unique identifier of the visit.

Return type

Visit

Returns

Visit with the given visit_id.

get_visit_by_index(index)[source]#

Returns a visit by its index.

Parameters

index (int) – index of the visit to return.

Return type

Visit

Returns

Visit with the given index.

property available_tables: List[str]#

Returns a list of available tables for the patient.

Return type

List[str]

Returns

List of available tables.

Datasets#

pyhealth.datasets.BaseDataset#

This is the basic dataset class. Any specific dataset will inherit from this class.

class pyhealth.datasets.BaseDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: ABC

Abstract base dataset class.

This abstract class defines a uniform interface for all datasets (e.g., MIMIC-III, MIMIC-IV, eICU, OMOP).

Each specific dataset will be a subclass of this abstract class, which can then be converted to samples dataset for different tasks by calling self.set_task().

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value takes one of two formats:

    • a str of the target code vocabulary, e.g., {“NDC”: “ATC”}.

    • a tuple with two elements: the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method, e.g., {“NDC”: (“ATC”, {“target_kwargs”: {“level”: 3}})}.

    Default is an empty dict, which means the original codes will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

parse_tables()[source]#

Parses the tables in self.tables and returns a dict of patients.

Will be called in self.__init__() if the cache file does not exist or refresh_cache is True.

This function will first call self.parse_basic_info() to parse the basic patient information, and then call self.parse_[table_name]() to parse the table with name table_name. Both self.parse_basic_info() and self.parse_[table_name]() should be implemented in the subclass.

Return type

Dict[str, Patient]

Returns

A dict mapping patient_id to Patient object.

property available_tables: List[str]#

Returns a list of available tables for the dataset.

Return type

List[str]

Returns

List of available tables.

stat()[source]#

Returns some statistics of the base dataset.

Return type

str

static info()[source]#

Prints the output format.

set_task(task_fn, task_name=None)[source]#

Processes the base dataset to generate the task-specific sample dataset.

This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.

Parameters
  • task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as key). The samples will be concatenated to form the sample dataset.

  • task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.

Returns

the task-specific sample dataset.

Return type

sample_dataset

Note

In task_fn, a patient may be converted to multiple samples, e.g., a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.
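As an illustration, here is a minimal sketch of a custom task function, reusing the mimic3base object from the pipeline example above; the “mortality”-style label derived from discharge_status is hypothetical, so adapt the table and attribute names to your data:

def my_mortality_fn(patient):
    samples = []
    # patient.visits is the documented OrderedDict of Visit objects
    for visit in patient.visits.values():
        conditions = visit.get_code_list(table="DIAGNOSES_ICD")
        if len(conditions) == 0:
            continue
        samples.append({
            "patient_id": patient.patient_id,
            "visit_id": visit.visit_id,
            "conditions": conditions,
            # hypothetical binary label from the visit's discharge_status
            "label": int(visit.discharge_status == 1),
        })
    return samples  # returning [] excludes the patient from the task dataset

sample_dataset = mimic3base.set_task(task_fn=my_mortality_fn)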

pyhealth.datasets.SampleDataset#

This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.

class pyhealth.datasets.SampleDataset(samples, dataset_name='', task_name='')[source]#

Bases: Dataset

Sample dataset class.

This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided input) and provides a uniform interface for accessing the samples.

Parameters
  • samples (List[Dict]) – a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key.

  • dataset_name – the name of the dataset. Default is None.

  • task_name – the name of the task. Default is None.

Currently, the following types of attributes are supported:
  • a single value. Type: int/float/str. Dim: 0.

  • a single vector. Type: int/float. Dim: 1.

  • a list of codes. Type: str. Dim: 2.

  • a list of vectors. Type: int/float. Dim: 2.

  • a list of list of codes. Type: str. Dim: 3.

  • a list of list of vectors. Type: int/float. Dim: 3.

input_info#

Dict, a dict whose keys are the same as the keys in the samples, and whose values are the corresponding input information:
  • “type”: the element type of each key attribute, one of float, int, str.

  • “dim”: the list dimension of each key attribute, one of 0, 1, 2, 3.

  • “len”: the length of the vector, only valid for vector-based attributes.

patient_to_index#

Dict[str, List[int]], a dict mapping patient_id to a list of sample indices.

visit_to_index#

Dict[str, List[int]], a dict mapping visit_id to a list of sample indices.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "single_vector": [1, 2, 3],
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "single_vector": [1, 5, 8],
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...                 [[7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples)
>>> dataset.input_info
{'patient_id': {'type': <class 'str'>, 'dim': 0}, 'visit_id': {'type': <class 'str'>, 'dim': 0}, 'single_vector': {'type': <class 'int'>, 'dim': 1, 'len': 3}, 'list_codes': {'type': <class 'str'>, 'dim': 2}, 'list_vectors': {'type': <class 'float'>, 'dim': 2, 'len': 3}, 'list_list_codes': {'type': <class 'str'>, 'dim': 3}, 'list_list_vectors': {'type': <class 'float'>, 'dim': 3, 'len': 3}, 'label': {'type': <class 'int'>, 'dim': 0}}
>>> dataset.patient_to_index
{'patient-0': [0, 1]}
>>> dataset.visit_to_index
{'visit-0': [0], 'visit-1': [1]}
property available_keys: List[str]#

Returns a list of available keys for the dataset.

Return type

List[str]

Returns

List of available keys.

get_all_tokens(key, remove_duplicates=True, sort=True)[source]#

Gets all tokens with a specific key in the samples.

Parameters
  • key (str) – the key of the tokens in the samples.

  • remove_duplicates (bool) – whether to remove duplicates. Default is True.

  • sort (bool) – whether to sort the tokens by alphabet order. Default is True.

Returns

a list of tokens.

Return type

tokens

get_distribution_tokens(key)[source]#

Gets the distribution of tokens with a specific key in the samples.

Parameters

key (str) – the key of the tokens in the samples.

Returns

a dict mapping token to count.

Return type

distribution

stat()[source]#

Returns some statistics of the task-specific dataset.

Return type

str

pyhealth.datasets.MIMIC3Dataset#

The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC3Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseDataset

Base dataset for MIMIC-III dataset.

The MIMIC-III dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • PATIENTS: defines a patient in the database, SUBJECT_ID.

  • ADMISSIONS: defines a patient’s hospital admission, HADM_ID.

We further support the following tables:
  • DIAGNOSES_ICD: contains ICD-9 diagnoses (ICD9CM code) for patients.

  • PROCEDURES_ICD: contains ICD-9 procedures (ICD9PROC code) for patients.

  • PRESCRIPTIONS: contains medication-related order entries (NDC code) for patients.

  • LABEVENTS: contains laboratory measurements (MIMIC3_ITEMID code) for patients.

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> dataset = MIMIC3Dataset(
...         root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...         tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses PATIENTS and ADMISSIONS tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses DIAGNOSES_ICD table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set it to None.

parse_procedures_icd(patients)[source]#

Helper function which parses PROCEDURES_ICD table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the PROCEDURES_ICD table, so we set it to None.

parse_prescriptions(patients)[source]#

Helper function which parses PRESCRIPTIONS table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_labevents(patients)[source]#

Helper function which parses LABEVENTS table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.MIMIC4Dataset#

The open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC4Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseDataset

Base dataset for MIMIC-IV dataset.

The MIMIC-IV dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • patients: defines a patient in the database, subject_id.

  • admissions: defines a patient’s hospital admission, hadm_id.

We further support the following tables:
  • diagnoses_icd: contains ICD diagnoses (ICD9CM and ICD10CM code) for patients.

  • procedures_icd: contains ICD procedures (ICD9PROC and ICD10PROC code) for patients.

  • prescriptions: contains medication-related order entries (NDC code) for patients.

  • labevents: contains laboratory measurements (MIMIC4_ITEMID code) for patients.

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> dataset = MIMIC4Dataset(
...         root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...         tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses the patients and admissions tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses diagnosis_icd table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the diagnoses_icd table, so we set it to None.

parse_procedures_icd(patients)[source]#

Helper function which parses procedures_icd table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the procedures_icd table, so we set it to None.

parse_prescriptions(patients)[source]#

Helper function which parses prescriptions table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_labevents(patients)[source]#

Helper function which parses labevents table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.eICUDataset#

The open eICU Collaborative Research Database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.eICUDataset(**kwargs)[source]#

Bases: BaseDataset

Base dataset for eICU dataset.

The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.

The basic information is stored in the following tables:
  • patient: defines a patient (uniquepid), a hospital admission (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.

  • hospital: contains information about a hospital (e.g., region).

Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time. Thus, we only know the order of ICU stays within a hospital admission, but not the order of hospital admissions within a patient. As a result, we use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.

We further support the following tables:
  • diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM code) for patients.

  • treatment: contains treatment information (eICU_TREATMENTSTRING code) for patients.

  • medication: contains medication-related order entries (eICU_DRUGNAME code) for patients.

  • lab: contains laboratory measurements (eICU_LABNAME code) for patients.

  • physicalExam: contains all physical exams (eICU_PHYSICALEXAMPATH code) conducted for patients.

Parameters
  • dataset_name – name of the dataset.

  • root – root directory of the raw data (should contain many csv files).

  • tables – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...         root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...         tables=["diagnosis", "medication", "lab", "treatment", "physicalExam"],
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses the patient and hospital tables.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

We use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.

parse_diagnosis(patients)[source]#

Helper function which parses diagnosis table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

Note

This table contains both ICD9CM and ICD10CM codes in one single cell. We need to use medcode to distinguish them.

parse_treatment(patients)[source]#

Helper function which parses treatment table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_medication(patients)[source]#

Helper function which parses medication table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_lab(patients)[source]#

Helper function which parses lab table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_physicalexam(patients)[source]#

Helper function which parses physicalExam table.

Will be called in self.parse_tables().

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.OMOPDataset#

We can process any OMOP-CDM formatted database; refer to the doc for more information. We process it into a well-structured dataset object, giving users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.OMOPDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#

Bases: BaseDataset

Base dataset for OMOP dataset.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence.

See: https://www.ohdsi.org/data-standardization/the-common-data-model/.

The basic information is stored in the following tables:
  • person: contains records that uniquely identify each person or patient, and some demographic information.

  • visit_occurrence: contains info for how a patient engages with the healthcare system for a duration of time.

  • death: contains info for how and when a patient dies.

We further support the following tables:
  • condition_occurrence.csv: contains the condition information (CONDITION_CONCEPT_ID code) of patients’ visits.

  • procedure_occurrence.csv: contains the procedure information (PROCEDURE_CONCEPT_ID code) of patients’ visits.

  • drug_exposure.csv: contains the drug information (DRUG_CONCEPT_ID code) of patients’ visits.

  • measurement.csv: contains all laboratory measurements (MEASUREMENT_CONCEPT_ID code) of patients’ visits.

Parameters
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> dataset = OMOPDataset(
...         root="/srv/local/data/zw12/pyhealth/raw_data/synpuf1k_omop_cdm_5.2.2",
...         tables=["condition_occurrence", "procedure_occurrence", "drug_exposure", "measurement",],
...     )
>>> dataset.stat()
>>> dataset.info()
parse_basic_info(patients)[source]#

Helper function which parses the person, visit_occurrence, and death tables.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_condition_occurrence(patients)[source]#

Helper function which parses condition_occurrence table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_procedure_occurrence(patients)[source]#

Helper function which parses procedure_occurrence table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_drug_exposure(patients)[source]#

Helper function which parses drug_exposure table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

parse_measurement(patients)[source]#

Helper function which parses measurement table.

Will be called in self.parse_tables()

Docs:
Parameters

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type

Dict[str, Patient]

Returns

The updated patients dict.

pyhealth.datasets.splitter#

Several data splitting functions for the pyhealth.datasets module to obtain training / validation / test sets.

pyhealth.datasets.splitter.split_by_visit(dataset, ratios, seed=None)[source]#

Splits the dataset by visit (i.e., samples).

Parameters
  • dataset – the dataset to be split.

  • ratios – a list of three ratios for the train / val / test sets.

  • seed – random seed for shuffling the dataset. Default is None.

Returns

three subsets of the dataset of type torch.utils.data.Subset.

Return type

train_dataset, val_dataset, test_dataset

Note

The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.

pyhealth.datasets.splitter.split_by_patient(dataset, ratios, seed=None)[source]#

Splits the dataset by patient.

Parameters
  • dataset – the dataset to be split.

  • ratios – a list of three ratios for the train / val / test sets.

  • seed – random seed for shuffling the dataset. Default is None.

Returns

three subsets of the dataset of type torch.utils.data.Subset.

Return type

train_dataset, val_dataset, test_dataset

Note

The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.
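A short usage sketch, assuming both splitters are exported from pyhealth.datasets and mimic3sample is the SampleDataset from the pipeline example above:

from pyhealth.datasets import split_by_patient, split_by_visit

# split so that no patient appears in more than one subset
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1], seed=42)
# or split at the sample (visit) level instead
train_ds, val_ds, test_ds = split_by_visit(mimic3sample, [0.8, 0.1, 0.1], seed=42)
# the original SampleDataset is still reachable via train_ds.dataset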

pyhealth.datasets.utils#

Several utility functions.

pyhealth.datasets.utils.hash_str(s)[source]#
pyhealth.datasets.utils.strptime(s)[source]#

Helper function which parses a string into a datetime object.

Parameters

s (str) – string to be parsed.

Return type

Optional[datetime]

Returns

Optional[datetime], parsed datetime object. If s is nan, return None.

pyhealth.datasets.utils.flatten_list(l)[source]#

Flattens a list of lists.

Parameters

l (List) – the list of lists to be flattened.

Return type

List

Returns

List, the flattened list.

Examples

>>> flatten_list([[1], [2, 3], [4]])
[1, 2, 3, 4]
>>> flatten_list([[1], [[2], 3], [4]])
[1, [2], 3, 4]
pyhealth.datasets.utils.list_nested_levels(l)[source]#

Gets all the different nested levels of a list.

Parameters

l (List) – the list to be checked.

Return type

Tuple[int]

Returns

All the different nested levels of the list.

Examples

>>> list_nested_levels([])
(1,)
>>> list_nested_levels([1, 2, 3])
(1,)
>>> list_nested_levels([[]])
(2,)
>>> list_nested_levels([[1, 2, 3], [4, 5, 6]])
(2,)
>>> list_nested_levels([1, [2, 3], 4])
(1, 2)
>>> list_nested_levels([[1, [2, 3], 4]])
(2, 3)
pyhealth.datasets.utils.is_homo_list(l)[source]#

Checks if a list is homogeneous.

Parameters

l (List) – the list to be checked.

Return type

bool

Returns

bool, True if the list is homogeneous, False otherwise.

Examples

>>> is_homo_list([1, 2, 3])
True
>>> is_homo_list([])
True
>>> is_homo_list([1, 2, "3"])
False
>>> is_homo_list([1, 2, 3, [4, 5, 6]])
False
pyhealth.datasets.utils.collate_fn_dict(batch)[source]#
pyhealth.datasets.utils.get_dataloader(dataset, batch_size, shuffle=False)[source]#
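These two have no docstrings here; below is a hedged sketch of how they fit together, assuming collate_fn_dict batches a list of sample dicts into one list per key, and get_dataloader wraps a torch.utils.data.DataLoader around it with that collate function:

from pyhealth.datasets import get_dataloader

loader = get_dataloader(mimic3sample, batch_size=32, shuffle=True)
batch = next(iter(loader))  # e.g., {"patient_id": [...], "conditions": [...], ...}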

Tasks#

We support various real-world healthcare predictive tasks defined by function calls. The following example tasks are collected from top AI/Medical venues:

  1. Drug Recommendation [Yang et al. IJCAI 2021a, Yang et al. IJCAI 2021b, Shang et al. AAAI 2020]

  2. Readmission Prediction [Choi et al. AAAI 2021]

  3. Mortality Prediction [Choi et al. AAAI 2021]

pyhealth.tasks.drug_recommendation#

pyhealth.tasks.drug_recommendation.drug_recommendation_mimic3_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(drug_recommendation_mimic3_fn)
>>> mimic3_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}
pyhealth.tasks.drug_recommendation.drug_recommendation_mimic4_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(drug_recommendation_mimic4_fn)
>>> mimic4_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}
pyhealth.tasks.drug_recommendation.drug_recommendation_eicu_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import drug_recommendation_eicu_fn
>>> eicu_sample = eicu_base.set_task(drug_recommendation_eicu_fn)
>>> eicu_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}
pyhealth.tasks.drug_recommendation.drug_recommendation_omop_fn(patient)[source]#

Processes a single patient for the drug recommendation task.

Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import drug_recommendation_omop_fn
>>> omop_sample = omop_base.set_task(drug_recommendation_omop_fn)
>>> omop_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51'], ['98', '663', '58', '51']], 'procedures': [['1'], ['2', '3']], 'label': [['2', '3', '4'], ['0', '1', '4', '5']]}

pyhealth.tasks.readmission_prediction#

pyhealth.tasks.readmission_prediction.readmission_prediction_mimic3_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples, each sample is a dict with patient_id, visit_id,

and other task-specific attributes as key

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(readmission_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}
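To change the readmission threshold, you can bind time_window before passing the function to set_task. A hedged sketch using functools.partial (we pass task_name explicitly, since a partial object has no __name__ for set_task to fall back on):

from functools import partial

fn = partial(readmission_prediction_mimic3_fn, time_window=30)
mimic3_sample = mimic3_base.set_task(fn, task_name="readmission_prediction_30d")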
pyhealth.tasks.readmission_prediction.readmission_prediction_mimic4_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(readmission_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn(patient, time_window=5)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import readmission_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.readmission_prediction.readmission_prediction_omop_fn(patient, time_window=15)[source]#

Processes a single patient for the readmission prediction task.

Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters
  • patient (Patient) – a Patient object

  • time_window – the time window threshold (gap < time_window means label=1 for the task)

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import readmission_prediction_omop_fn
>>> omop_sample = omop_base.set_task(readmission_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]

pyhealth.tasks.mortality_prediction#

pyhealth.tasks.mortality_prediction.mortality_prediction_mimic3_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(mortality_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.mortality_prediction.mortality_prediction_mimic4_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(mortality_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import mortality_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
pyhealth.tasks.mortality_prediction.mortality_prediction_omop_fn(patient)[source]#

Processes a single patient for the mortality prediction task.

Mortality prediction aims at predicting whether the patient will die during the next hospital visit based on the clinical information from the current visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys

Return type

samples

Note that we define the task as a binary classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import mortality_prediction_omop_fn
>>> omop_sample = omop_base.set_task(mortality_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]

pyhealth.tasks.length_of_stay_prediction#

pyhealth.tasks.length_of_stay_prediction.categorize_los(days)[source]#

Categorizes length of stay into 10 categories.

One category for ICU stays shorter than a day, seven categories for each day of the first week, one for stays of more than one week but less than two, and one for stays of more than two weeks.

Parameters

days (int) – length of stay in days

Returns

int, category of length of stay

Return type

category
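
A standalone sketch of this categorization rule (an illustration consistent with the description above; exact boundary handling is an assumption, not necessarily the library's code):

def categorize_los(days):
    # 0: shorter than a day; 1-7: each day of the first week;
    # 8: more than one week but at most two; 9: more than two weeks
    if days < 1:
        return 0
    elif days <= 7:
        return days
    elif days <= 14:
        return 8
    else:
        return 9

print(categorize_los(0), categorize_los(5), categorize_los(10))  # 0 5 8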

pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic3_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...    root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...    code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(length_of_stay_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 4}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic4_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(length_of_stay_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 2}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_eicu_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import length_of_stay_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(length_of_stay_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 5}]
pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_omop_fn(patient)[source]#

Processes a single patient for the length-of-stay prediction task.

Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).

Parameters

patient (Patient) – a Patient object.

Returns

a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.

Return type

samples

Note that we define the task as a multi-class classification task.

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_omop_fn
>>> omop_sample = omop_base.set_task(length_of_stay_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 7}]

Models#

We implement the following models for supporting multiple healthcare predictive tasks.

pyhealth.models.MLP#

The complete MLP model.

class pyhealth.models.MLP(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, n_layers=2, activation='relu', **kwargs)[source]#

Bases: BaseModel

Multi-layer perceptron model.

This model applies a separate MLP layer for each feature, and then concatenates the final hidden states of each MLP layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate MLP layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the MLP model (a sketch of case 2 follows this list):
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the order; our model encodes each code into a vector; we use mean/sum pooling and then the MLP

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we first use the embedding table to encode each code into a vector, then use mean/sum pooling to get one vector for each sample; we then apply the MLP layers

  • case 3. [1.5, 2.0, 0.0]
    • we run the MLP directly

  • case 4. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and then the MLP, similar to case 1 after the embedding table

  • case 5. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and then the MLP, similar to case 2 after the embedding table
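
As referenced above, a minimal standalone sketch of the case 2 path (hypothetical vocabulary and sizes; not the library's internal code):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)  # hypothetical code vocabulary
codes = torch.tensor([[3, 17, 42]])                            # one sample: [[code1, code2, code3]]
vectors = embedding(codes)                                     # [1, 3, 8]
pooled = vectors.mean(dim=1)                                   # mean pooling over codes -> [1, 8]
mlp = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
logits = mlp(pooled)                                           # [1, 1] binary logit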

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • n_layers (int) – the number of layers. Default is 2.

  • activation (str) – the activation function. Default is “relu”.

  • **kwargs – other parameters for the MLP layers.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "conditions": ["cond-33", "cond-86", "cond-80"],
...             "procedures": [1.0, 2.0, 3.5, 4],
...             "label": 0,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "conditions": ["cond-33", "cond-86", "cond-80"],
...             "procedures": [5.0, 2.0, 3.5, 4],
...             "label": 1,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import MLP
>>> model = MLP(
...         dataset=dataset,
...         feature_keys=["conditions", "procedures"],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.6816, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5418],
        [0.5584]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.],
        [1.]])}
>>>
static mean_pooling(x, mask)[source]#

Mean pooling over the middle dimension of the tensor.

Parameters
  • x – tensor of shape (batch_size, seq_len, embedding_dim)

  • mask – tensor of shape (batch_size, seq_len)

Returns

tensor of shape (batch_size, embedding_dim)

Return type

x

Examples

>>> x.shape
[128, 5, 32]
>>> mean_pooling(x, mask).shape
[128, 32]
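
For reference, masked mean pooling can be sketched as follows (a standalone illustration, not the library implementation):

import torch

def mean_pooling(x, mask):
    # x: (batch_size, seq_len, embedding_dim); mask: (batch_size, seq_len), 1 = valid
    mask = mask.unsqueeze(-1).float()         # (batch_size, seq_len, 1)
    summed = (x * mask).sum(dim=1)            # zero out padded steps, sum over seq_len
    counts = mask.sum(dim=1).clamp(min=1e-9)  # number of valid steps per sample
    return summed / counts                    # (batch_size, embedding_dim)

x = torch.randn(128, 5, 32)
mask = torch.ones(128, 5)
print(mean_pooling(x, mask).shape)  # torch.Size([128, 32])
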
static sum_pooling(x)[source]#

Sum pooling over the middle dimension of the tensor.

Parameters
  • x – tensor of shape (batch_size, seq_len, embedding_dim)

  • mask – tensor of shape (batch_size, seq_len)

Returns

tensor of shape (batch_size, embedding_dim)

Return type

x

Examples

>>> x.shape
[128, 5, 32]
>>> sum_pooling(x).shape
[128, 32]
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.CNN#

The separate callable CNNLayer and the complete CNN model.

class pyhealth.models.CNNLayer(input_size, hidden_size, num_layers=1)[source]#

Bases: Module

Convolutional neural network layer.

This layer stacks multiple CNN blocks and applies adaptive average pooling at the end. It is used in the CNN model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • num_layers (int) – number of convolutional layers. Default is 1.

Examples

>>> import torch
>>> from pyhealth.models import CNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = CNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
forward(x)[source]#

Forward propagation.

Parameters

x (tensor) – a tensor of shape [batch size, sequence len, input size].

Returns

a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.

pooled_outputs: a tensor of shape [batch size, hidden size], containing the pooled output features.

Return type

outputs

training: bool#
class pyhealth.models.CNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Convolutional neural network model.

This model applies a separate CNN layer for each feature, and then concatenates the final hidden states of each CNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate CNN layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the CNN model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the order; our model encodes each code into a vector and applies the CNN on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector for each inner bracket; the CNN is then applied on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the CNN directly on the inner bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the CNN directly on the inner bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the CNN layer.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...                 [[7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import CNN
>>> model = CNN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.8725, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.7620],
        [0.7339]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.],
        [1.]])}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.RNN#

The separate callable RNNLayer and the complete RNN model.

class pyhealth.models.RNNLayer(input_size, hidden_size, rnn_type='GRU', num_layers=1, dropout=0.5, bidirectional=False)[source]#

Bases: Module

Recurrent neural network layer.

This layer wraps the PyTorch RNN layer with masking and dropout support. It is used in the RNN model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • rnn_type (str) – type of rnn, one of “RNN”, “LSTM”, “GRU”. Default is “GRU”.

  • num_layers (int) – number of recurrent layers. Default is 1.

  • dropout (float) – dropout rate. If non-zero, introduces a Dropout layer before each RNN layer. Default is 0.5.

  • bidirectional (bool) – whether to use bidirectional recurrent layers. If True, a fully-connected layer is applied to the concatenation of the forward and backward hidden states to reduce the dimension to hidden_size. Default is False.

Examples

>>> import torch
>>> from pyhealth.models import RNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = RNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
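
The optional mask lets the layer ignore padded time steps; continuing the example above (a usage sketch):

>>> mask = torch.ones(3, 128)  # 1 = valid step, 0 = padding
>>> outputs, last_outputs = layer(input, mask=mask)
>>> last_outputs.shape
torch.Size([3, 64])
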
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, input size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.

last_outputs: a tensor of shape [batch size, hidden size], containing the output features for the last time step.

Return type

outputs

training: bool#
class pyhealth.models.RNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Recurrent neural network model.

This model applies a separate RNN layer for each feature, and then concatenates the final hidden states of each RNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate rnn layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the rnn model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the order; our model encodes each code into a vector and applies the rnn on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector for each inner bracket; the rnn is then applied on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the rnn directly on the inner bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the rnn directly on the inner bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the RNN layer.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RNN
>>> model = RNN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.7664, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.4714],
        [0.4085]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.],
        [1.]])}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.Transformer#

The separate callable TransformerLayer and the complete Transformer model.

class pyhealth.models.TransformerLayer(feature_size, heads=1, dropout=0.5, num_layers=1)[source]#

Bases: Module

Transformer layer.

Paper: Ashish Vaswani et al. Attention is all you need. NIPS 2017.

This layer is used in the Transformer model. But it can also be used as a standalone layer.

Parameters
  • feature_size – the hidden feature size.

  • heads – the number of attention heads. Default is 1.

  • dropout – dropout rate. Default is 0.5.

  • num_layers – number of transformer layers. Default is 1.

Examples

>>> import torch
>>> from pyhealth.models import TransformerLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = TransformerLayer(64)
>>> emb, cls_emb = layer(input)
>>> emb.shape
torch.Size([3, 128, 64])
>>> cls_emb.shape
torch.Size([3, 64])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, feature_size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

a tensor of shape [batch size, sequence len, feature_size], containing the output features for each time step.

cls_emb: a tensor of shape [batch size, feature_size], containing the output features for the first time step.

Return type

emb

training: bool#
class pyhealth.models.Transformer(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#

Bases: BaseModel

Transformer model.

This model applies a separate Transformer layer for each feature, and then concatenates the final hidden states of each Transformer layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.

Note

We use separate Transformer layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the transformer model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the order; our model encodes each code into a vector and applies the transformer on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector for each inner bracket; the transformer is then applied on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the transformer directly on the inner bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run the transformer directly on the inner bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • **kwargs – other parameters for the Transformer layer.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Transformer
>>> model = Transformer(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.4234, grad_fn=<NllLossBackward0>), 'y_prob': tensor([[9.9998e-01, 2.2920e-05],
        [5.7120e-01, 4.2880e-01]], grad_fn=<SoftmaxBackward0>), 'y_true': tensor([0, 1])}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.RETAIN#

The separate callable RETAINLayer and the complete RETAIN model.

class pyhealth.models.RETAINLayer(feature_size, dropout=0.5)[source]#

Bases: Module

RETAIN layer.

Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.

This layer is used in the RETAIN model. But it can also be used as a standalone layer.

Parameters
  • feature_size (int) – the hidden feature size.

  • dropout (float) – dropout rate. Default is 0.5.

Examples

>>> import torch
>>> from pyhealth.models import RETAINLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = RETAINLayer(64)
>>> c = layer(input)
>>> c.shape
torch.Size([3, 64])
static reverse_x(input, lengths)[source]#

Reverses the input.

compute_alpha(rx, lengths)[source]#

Computes alpha attention.

compute_beta(rx, lengths)[source]#

Computes beta attention.
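
Conceptually, the visit-level attention alpha and the feature-level attention beta combine the visit embeddings v into a single context vector, c = sum_t alpha_t * (beta_t ⊙ v_t). A standalone sketch of that combination step (random attention values for illustration; not the library code):

import torch

alpha = torch.softmax(torch.randn(3, 128, 1), dim=1)  # visit-level attention weights
beta = torch.tanh(torch.randn(3, 128, 64))            # feature-level attention weights
v = torch.randn(3, 128, 64)                           # visit embeddings
c = (alpha * beta * v).sum(dim=1)                     # context vector, shape [3, 64]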

forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (tensor) – a tensor of shape [batch size, sequence len, feature_size].

  • mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

a tensor of shape [batch size, feature_size] representing the context vector.

Return type

c

training: bool#
class pyhealth.models.RETAIN(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#

Bases: BaseModel

RETAIN model.

Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.

Note

We use separate RETAIN layers for different feature_keys. Currently, we automatically support different input formats:

  • code based input (need to use the embedding table later)

  • float/int based value input

We follow the current convention for the RETAIN model:
  • case 1. [code1, code2, code3, …]
    • we assume the codes follow the order; our model encodes each code into a vector and applies RETAIN on the code level

  • case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
    • we assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses average/mean pooling to get one vector for each inner bracket; RETAIN is then applied on the bracket level

  • case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RETAIN directly on the inner bracket level, similar to case 1 after the embedding table

  • case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
    • this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RETAIN directly on the inner bracket level, similar to case 2 after the embedding table

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • **kwargs – other parameters for the RETAIN layer.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RETAIN
>>> model = RETAIN(
...         dataset=dataset,
...         feature_keys=[
...             "list_codes",
...             "list_vectors",
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.7234, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5423],
        [0.5142]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.],
        [1.]])}
>>>
forward(**kwargs)[source]#

Forward propagation.

The label kwargs[self.label_key] is a list of labels for each patient.

Parameters

**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.GAMENet#

The separate callable GAMENetLayer and the complete GAMENet model.

class pyhealth.models.GAMENetLayer(hidden_size, ehr_adj, ddi_adj, dropout=0.5)[source]#

Bases: Module

GAMENet layer.

Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination. AAAI 2019.

This layer is used in the GAMENet model. But it can also be used as a standalone layer.

Parameters
  • hidden_size (int) – hidden feature size.

  • ehr_adj (tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • ddi_adj (tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • dropout (float) – the dropout rate. Default is 0.5.

Examples

>>> import torch
>>> from pyhealth.models import GAMENetLayer
>>> queries = torch.randn(3, 5, 32) # [patient, visit, hidden_size]
>>> prev_drugs = torch.randint(0, 2, (3, 4, 50)).float()
>>> curr_drugs = torch.randint(0, 2, (3, 50)).float()
>>> ehr_adj = torch.randint(0, 2, (50, 50)).float()
>>> ddi_adj = torch.randint(0, 2, (50, 50)).float()
>>> layer = GAMENetLayer(32, ehr_adj, ddi_adj)
>>> loss, y_prob = layer(queries, prev_drugs, curr_drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
forward(queries, prev_drugs, curr_drugs, mask=None)[source]#

Forward propagation.

Parameters
  • queries (tensor) – query tensor of shape [patient, visit, hidden_size].

  • prev_drugs (tensor) – multihot tensor indicating drug usage in all previous visits of shape [patient, visit - 1, num_drugs].

  • curr_drugs (tensor) – multihot tensor indicating drug usage in the current visit of shape [patient, num_drugs].

  • mask (Optional[tensor]) – an optional mask tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns

a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

Return type

loss

training: bool#
class pyhealth.models.GAMENet(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#

Bases: BaseModel

GAMENet model.

Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination. AAAI 2019.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs_all as label_key (i.e., both current and previous drugs). It only operates on the visit level.

Note

This model only accepts ATC level 3 as medication codes.

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • num_layers (int) – the number of layers used in RNN. Default is 1.

  • dropout (float) – the dropout rate. Default is 0.5.

  • **kwargs – other parameters for the GAMENet layer.
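
A minimal usage sketch, assuming a sample dataset whose samples carry conditions, procedures, and drugs_all keys as the notes above require (the adjacency matrices are generated from the dataset by the methods below):

from pyhealth.models import GAMENet
model = GAMENet(dataset=sample_dataset)  # sample_dataset is a placeholder name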

generate_ehr_adj()[source]#

Generates the EHR graph adjacency matrix.

Return type

tensor

generate_ddi_adj()[source]#

Generates the DDI graph adjacency matrix.

Return type

tensor
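
For intuition, a DDI adjacency matrix of this kind can be sketched from a list of interacting drug pairs (hypothetical vocabulary and pairs; the library derives the real pairs from its own DDI resource):

import torch

drug_vocab = {"A01A": 0, "B01A": 1, "C01A": 2}  # hypothetical ATC level 3 vocabulary
ddi_pairs = [("A01A", "B01A")]                  # hypothetical interacting pairs
ddi_adj = torch.zeros(len(drug_vocab), len(drug_vocab))
for a, b in ddi_pairs:
    i, j = drug_vocab[a], drug_vocab[b]
    ddi_adj[i, j] = ddi_adj[j, i] = 1.0         # symmetric 0/1 adjacency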

forward(conditions, procedures, drugs_all, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs_all (List[List[List[str]]]) – a nested list in three levels [patient, visit, drug].

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.

y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.MICRON#

The separate callable MICRONLayer and the complete MICRON model.

class pyhealth.models.MICRONLayer(input_size, hidden_size, num_drugs, lam=0.1)[source]#

Bases: Module

MICRON layer.

Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.

This layer is used in the MICRON model. But it can also be used as a standalone layer.

Parameters
  • input_size (int) – input feature size.

  • hidden_size (int) – hidden feature size.

  • num_drugs (int) – total number of drugs to recommend.

  • lam (float) – regularization parameter for the reconstruction loss. Default is 0.1.

Examples

>>> import torch
>>> from pyhealth.models import MICRONLayer
>>> patient_emb = torch.randn(3, 5, 32) # [patient, visit, input_size]
>>> drugs = torch.randint(0, 2, (3, 50)).float()
>>> layer = MICRONLayer(32, 64, 50)
>>> loss, y_prob = layer(patient_emb, drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
static compute_reconstruction_loss(logits, logits_residual, mask)[source]#
Return type

tensor

forward(patient_emb, drugs, mask=None)[source]#

Forward propagation.

Parameters
  • patient_emb (tensor) – a tensor of shape [patient, visit, input_size].

  • drugs (tensor) – a multihot tensor of shape [patient, num_labels].

  • mask (Optional[tensor]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns

a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

Return type

loss

training: bool#
class pyhealth.models.MICRON(dataset, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

MICRON model.

Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the MICRON layer.

forward(conditions, procedures, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug].

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.

y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.SafeDrug#

The separate callable SafeDrugLayer and the complete SafeDrug model.

class pyhealth.models.SafeDrugLayer(hidden_size, mask_H, ddi_adj, num_fingerprints, molecule_set, average_projection, kp=0.05, target_ddi=0.08)[source]#

Bases: Module

SafeDrug layer.

Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.

This layer is used in the SafeDrug model. But it can also be used as a standalone layer.

Parameters
  • hidden_size (int) – hidden feature size.

  • mask_H (Tensor) – the mask matrix H of shape [num_drugs, num_substructures].

  • ddi_adj (Tensor) – an adjacency tensor of shape [num_drugs, num_drugs].

  • num_fingerprints (int) – total number of different fingerprints.

  • molecule_set (List[Tuple]) – a list of molecule tuples (A, B, C) of length num_molecules, where A <torch.tensor> holds the fingerprints of the atoms in the molecule, B <torch.tensor> is the adjacency matrix of the molecule, and C <int> is the molecular_size.

  • average_projection (Tensor) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.

  • kp (float) – correcting factor for the proportional signal. Default is 0.05.

  • target_ddi (float) – DDI acceptance rate. Default is 0.08.

pad(matrices, pad_value)[source]#

Pads the list of matrices.

Padding with a pad_value (e.g., 0) for batch processing. For example, given a list of matrices [A, B, C], we obtain a new matrix [A00, 0B0, 00C], where 0 is the zero (i.e., pad value) matrix.
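
A standalone sketch of this block-diagonal padding (assuming square matrices; not necessarily the library's exact implementation):

import torch

def pad(matrices, pad_value=0.0):
    # [A, B, C] -> [[A,0,0],[0,B,0],[0,0,C]], with pad_value filling the off-blocks
    sizes = [m.shape[0] for m in matrices]
    out = torch.full((sum(sizes), sum(sizes)), pad_value)
    offset = 0
    for m, n in zip(matrices, sizes):
        out[offset:offset + n, offset:offset + n] = m
        offset += n
    return out

print(pad([torch.ones(2, 2), torch.zeros(3, 3)]).shape)  # torch.Size([5, 5])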

calculate_loss(logits, y_prob, labels)[source]#
Return type

Tensor

forward(patient_emb, drugs, mask=None)[source]#

Forward propagation.

Parameters
  • patient_emb (tensor) – a tensor of shape [patient, visit, input_size].

  • drugs (tensor) – a multihot tensor of shape [patient, num_labels].

  • mask (Optional[tensor]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.

Returns

a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.

Return type

loss

training: bool#
class pyhealth.models.SafeDrug(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#

Bases: BaseModel

SafeDrug model.

Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.

Note

This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.

Note

This model only accepts ATC level 3 as medication codes.

Parameters
  • dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • num_layers (int) – the number of layers used in RNN. Default is 1.

  • dropout (float) – the dropout rate. Default is 0.5.

  • **kwargs – other parameters for the SafeDrug layer.

generate_ddi_adj()[source]#

Generates the DDI graph adjacency matrix.

Return type

tensor

generate_smiles_list()[source]#

Generates the list of SMILES strings.

Return type

List[List[str]]

generate_mask_H()[source]#

Generates the molecular segmentation mask H.

Return type

tensor

generate_molecule_info(radius=1)[source]#

Generates the molecule information.

forward(conditions, procedures, drugs, **kwargs)[source]#

Forward propagation.

Parameters
  • conditions (List[List[List[str]]]) – a nested list in three levels [patient, visit, condition].

  • procedures (List[List[List[str]]]) – a nested list in three levels [patient, visit, procedure].

  • drugs (List[List[str]]) – a nested list in two levels [patient, drug].

Returns

loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.

y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.

Return type

A dictionary with the following keys

training: bool#

pyhealth.models.Deepr#

The separate callable DeeprLayer and the complete Deepr model.

class pyhealth.models.DeeprLayer(feature_size=100, window=1, hidden_size=3)[source]#

Bases: Module

Deepr layer.

Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.

This layer is used in the Deepr model.

Parameters
  • feature_size (int) – embedding dim of codes (m in the original paper).

  • window (int) – sliding window size (d in the original paper).

  • hidden_size (int) – number of conv filters (motif size, p, in the original paper).

Examples

>>> import torch
>>> from pyhealth.models import DeeprLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = DeeprLayer(5, window=4, hidden_size=7) # window does not impact the output shape
>>> outputs = layer(input)
>>> outputs.shape
torch.Size([3, 7])
forward(x, mask=None)[source]#

Forward propagation.

Parameters
  • x (Tensor) – a Tensor of shape [batch size, sequence len, input size].

  • mask (Optional[Tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.

Returns

c: a Tensor of shape [batch size, hidden_size] representing the summarized vector.

training: bool#
class pyhealth.models.Deepr(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#

Bases: BaseModel

Deepr model.

Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.

Note

We use separate Deepr layers for different feature_keys.

Parameters
  • dataset (BaseDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.

  • feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].

  • label_key (str) – key in samples to use as label (e.g., “drugs”).

  • mode (str) – one of “binary”, “multiclass”, or “multilabel”.

  • embedding_dim (int) – the embedding dimension. Default is 128.

  • hidden_dim (int) – the hidden dimension. Default is 128.

  • **kwargs – other parameters for the Deepr layer.

Examples

>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-0",
...             "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...             "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...             "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...             "list_list_vectors": [
...                 [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...                 [[7.7, 8.5, 9.4]],
...             ],
...             "label": 1,
...         },
...         {
...             "patient_id": "patient-0",
...             "visit_id": "visit-1",
...             "list_codes": [
...                 "55154191800",
...                 "551541928",
...                 "55154192800",
...                 "705182798",
...                 "70518279800",
...             ],
...             "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...             "list_list_codes": [["A04A", "B035", "C129"]],
...             "list_list_vectors": [
...                 [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...             ],
...             "label": 0,
...         },
...     ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Deepr
>>> model = Deepr(
...         dataset=dataset,
...         feature_keys=[
...             "list_list_codes",
...             "list_list_vectors",
...         ],
...         label_key="label",
...         mode="binary",
...     )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.9139, device='cuda:0',
    grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.7530],
        [0.6510]], device='cuda:0', grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.],
        [1.]], device='cuda:0')}
>>>
forward(**kwargs)[source]#

Forward propagation.

Return type

Dict[str, Tensor]

training: bool#

Trainer#

class pyhealth.trainer.Trainer(model, checkpoint_path=None, metrics=None, device=None, enable_logging=True, output_path=None, exp_name=None)[source]#

Bases: object

Trainer for PyTorch models.

Parameters
  • model (Module) – PyTorch model.

  • checkpoint_path (Optional[str]) – Path to the checkpoint. Default is None, which means the model will be randomly initialized.

  • metrics (Optional[List[str]]) – List of metric names to be calculated. Default is None, which means the default metrics in each metrics_fn will be used.

  • device (Optional[str]) – Device to be used for training. Default is None, which means the device will be GPU if available, otherwise CPU.

  • enable_logging (bool) – Whether to enable logging. Default is True.

  • output_path (Optional[str]) – Path to save the output. Default is “./output”.

  • exp_name (Optional[str]) – Name of the experiment. Default is current datetime.

train(train_dataloader, val_dataloader=None, test_dataloader=None, epochs=5, optimizer_class=torch.optim.Adam, optimizer_params=None, weight_decay=0.0, max_grad_norm=None, monitor=None, monitor_criterion='max', load_best_model_at_last=True)[source]#

Trains the model.

Parameters
  • train_dataloader (DataLoader) – Dataloader for training.

  • val_dataloader (Optional[DataLoader]) – Dataloader for validation. Default is None.

  • test_dataloader (Optional[DataLoader]) – Dataloader for testing. Default is None.

  • epochs (int) – Number of epochs. Default is 5.

  • optimizer_class (Type[Optimizer]) – Optimizer class. Default is torch.optim.Adam.

  • optimizer_params (Optional[Dict[str, object]]) – Parameters for the optimizer. Default is {“lr”: 1e-3}.

  • weight_decay (float) – Weight decay. Default is 0.0.

  • max_grad_norm (Optional[float]) – Maximum gradient norm. Default is None.

  • monitor (Optional[str]) – Metric name to monitor. Default is None.

  • monitor_criterion (str) – Criterion to monitor. Default is “max”.

  • load_best_model_at_last (bool) – Whether to load the best model at the last. Default is True.
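A minimal training sketch, assuming model, train_loader, and val_loader already exist from an earlier pipeline; the monitored metric name ("pr_auc_samples") is an assumption that fits a multilabel task:

>>> from pyhealth.trainer import Trainer
>>> trainer = Trainer(model=model, metrics=["pr_auc_samples"])
>>> trainer.train(
...     train_dataloader=train_loader,
...     val_dataloader=val_loader,
...     epochs=5,
...     monitor="pr_auc_samples",  # pick a metric that matches your task
... )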

inference(dataloader)[source]#

Model inference.

Parameters

dataloader – Dataloader for evaluation.

Returns
  • y_true_all: List of true labels.

  • y_prob_all: List of predicted probabilities.

  • loss_mean: Mean loss over batches.

evaluate(dataloader)[source]#

Evaluates the model.

Parameters

dataloader – Dataloader for evaluation.

Returns

a dictionary of scores.

Return type

scores
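A usage sketch for the two evaluation entry points, assuming the trainer and a test_loader from the training sketch above:

>>> scores = trainer.evaluate(test_loader)  # dictionary of metric scores
>>> y_true_all, y_prob_all, loss_mean = trainer.inference(test_loader)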

save_ckpt(ckpt_path)[source]#

Saves the model checkpoint.

Return type

None

load_ckpt(ckpt_path)[source]#

Loads the model checkpoint.

Return type

None

Tokenizer#

The tokenizer functionality supports token-to-index and index-to-token mapping in general ML settings.

class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#

Bases: object

Vocabulary class for mapping between tokens and indices.

add_token(token)[source]#

Adds a token to the vocabulary.

class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#

Bases: object

Tokenizer class for converting tokens to indices and vice versa.

This class builds a vocabulary from the provided tokens and converts tokens to indices and vice versa. It can also tokenize a batch of data.
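A minimal round-trip sketch; the token set is made up for illustration:

>>> from pyhealth.tokenizer import Tokenizer
>>> tokenizer = Tokenizer(tokens=["A05B", "A05C", "A06A"], special_tokens=["<pad>", "<unk>"])
>>> indices = tokenizer.convert_tokens_to_indices(["A05B", "A06A"])
>>> tokenizer.convert_indices_to_tokens(indices)
['A05B', 'A06A']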

get_padding_index()[source]#

Returns the index of the padding token.

get_vocabulary_size()[source]#

Returns the size of the vocabulary.

convert_tokens_to_indices(tokens)[source]#

Converts a list of tokens to indices.

Return type

List[int]

convert_indices_to_tokens(indices)[source]#

Converts a list of indices to tokens.

Return type

List[str]

batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#

Converts a list of lists of tokens (2D) to indices.

Parameters
  • batch (List[List[str]]) – List of lists of tokens to convert to indices.

  • padding (bool) – whether to pad the tokens to the max number of tokens in the batch (smart padding).

  • truncation (bool) – whether to truncate the tokens to max_length.

  • max_length (int) – maximum length of the tokens. This argument is ignored if truncation is False.
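A sketch using the toy tokenizer from above; with the default padding=True, the shorter row is padded to the longest row in the batch:

>>> batch = [["A05B", "A05C"], ["A06A"]]
>>> encoded = tokenizer.batch_encode_2d(batch)
>>> len(encoded[0]) == len(encoded[1])  # smart padding equalizes row lengths
True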

batch_decode_2d(batch, padding=False)[source]#

Converts a list of lists of indices (2D) to tokens.

Parameters
  • batch (List[List[int]]) – List of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the output.

batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#

Converts a list of lists of lists of tokens (3D) to indices.

Parameters
  • batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.

  • padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).

  • truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_length.

  • max_length (Tuple[int, int]) – a tuple of two integers indicating the maximum length of the tokens along the first and second dimension. This argument is ignored if truncation is False.
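A sketch using the same toy tokenizer; both the number of visits and the number of tokens per visit are padded:

>>> batch = [[["A05B", "A05C"], ["A06A"]], [["A05C"]]]
>>> encoded = tokenizer.batch_encode_3d(batch)
>>> len(encoded[0]) == len(encoded[1])  # padded to the same number of visits
True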

batch_decode_3d(batch, padding=False)[source]#

Converts a list of lists of lists of indices (3D) to tokens.

Parameters
  • batch (List[List[List[int]]]) – List of lists of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the output.

Metrics#

We provide easy-to-use metrics (the same style and args as sklearn.metrics) for binary, multiclass, and multilabel classification. We also provide other metrics specifically for healthcare tasks, such as drug-drug interaction (DDI) rate.

pyhealth.metrics.multiclass#

pyhealth.metrics.multiclass.multiclass_metrics_fn(y_true, y_prob, metrics=None)[source]#

Computes metrics for multiclass classification.

User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • roc_auc_macro_ovo: area under the receiver operating characteristic curve, macro averaged over one-vs-one multiclass classification

  • roc_auc_macro_ovr: area under the receiver operating characteristic curve, macro averaged over one-vs-rest multiclass classification

  • roc_auc_weighted_ovo: area under the receiver operating characteristic curve, weighted averaged over one-vs-one multiclass classification

  • roc_auc_weighted_ovr: area under the receiver operating characteristic curve, weighted averaged over one-vs-rest multiclass classification

  • accuracy: accuracy score

  • balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)

  • f1_micro: f1 score, micro averaged

  • f1_macro: f1 score, macro averaged

  • f1_weighted: f1 score, weighted averaged

  • jaccard_micro: Jaccard similarity coefficient score, micro averaged

  • jaccard_macro: Jaccard similarity coefficient score, macro averaged

  • jaccard_weighted: Jaccard similarity coefficient score, weighted averaged

  • cohen_kappa: Cohen’s kappa score

If no metrics are specified, accuracy, f1_macro, and f1_micro are computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples,).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_classes).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“accuracy”, “f1_macro”, “f1_micro”].

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> import numpy as np
>>> from pyhealth.metrics import multiclass_metrics_fn
>>> y_true = np.array([0, 1, 2, 2])
>>> y_prob = np.array([[0.9,  0.05, 0.05],
...                    [0.05, 0.9,  0.05],
...                    [0.05, 0.05, 0.9],
...                    [0.6,  0.2,  0.2]])
>>> multiclass_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}

pyhealth.metrics.multilabel#

pyhealth.metrics.multilabel.multilabel_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#

Computes metrics for multilabel classification.

User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • roc_auc_micro: area under the receiver operating characteristic curve, micro averaged

  • roc_auc_macro: area under the receiver operating characteristic curve, macro averaged

  • roc_auc_weighted: area under the receiver operating characteristic curve, weighted averaged

  • roc_auc_samples: area under the receiver operating characteristic curve, samples averaged

  • pr_auc_micro: area under the precision recall curve, micro averaged

  • pr_auc_macro: area under the precision recall curve, macro averaged

  • pr_auc_weighted: area under the precision recall curve, weighted averaged

  • pr_auc_samples: area under the precision recall curve, samples averaged

  • accuracy: accuracy score

  • f1_micro: f1 score, micro averaged

  • f1_macro: f1 score, macro averaged

  • f1_weighted: f1 score, weighted averaged

  • f1_samples: f1 score, samples averaged

  • precision_micro: precision score, micro averaged

  • precision_macro: precision score, macro averaged

  • precision_weighted: precision score, weighted averaged

  • precision_samples: precision score, samples averaged

  • recall_micro: recall score, micro averaged

  • recall_macro: recall score, macro averaged

  • recall_weighted: recall score, weighted averaged

  • recall_samples: recall score, samples averaged

  • jaccard_micro: Jaccard similarity coefficient score, micro averaged

  • jaccard_macro: Jaccard similarity coefficient score, macro averaged

  • jaccard_weighted: Jaccard similarity coefficient score, weighted averaged

  • jaccard_samples: Jaccard similarity coefficient score, samples averaged

  • hamming_loss: Hamming loss

If no metrics are specified, pr_auc_samples is computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples, n_labels).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_labels).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“pr_auc_samples”].

  • threshold (float) – Threshold to binarize the predicted probabilities. Default is 0.5.

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> import numpy as np
>>> from pyhealth.metrics import multilabel_metrics_fn
>>> y_true = np.array([[0, 1, 1], [1, 0, 1]])
>>> y_prob = np.array([[0.1, 0.9, 0.8], [0.05, 0.95, 0.6]])
>>> multilabel_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.5}

pyhealth.metrics.binary#

pyhealth.metrics.binary.binary_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#

Computes metrics for binary classification.

User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:

  • pr_auc: area under the precision-recall curve

  • roc_auc: area under the receiver operating characteristic curve

  • accuracy: accuracy score

  • balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)

  • f1: f1 score

  • precision: precision score

  • recall: recall score

  • cohen_kappa: Cohen’s kappa score

  • jaccard: Jaccard similarity coefficient score

If no metrics are specified, pr_auc, roc_auc and f1 are computed by default.

This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.

Parameters
  • y_true (ndarray) – True target values of shape (n_samples,).

  • y_prob (ndarray) – Predicted probabilities of shape (n_samples,).

  • metrics (Optional[List[str]]) – List of metrics to compute. Default is [“pr_auc”, “roc_auc”, “f1”].

  • threshold (float) – Threshold for binary classification. Default is 0.5.

Return type

Dict[str, float]

Returns

Dictionary of metrics whose keys are the metric names and values are the metric values.

Examples

>>> import numpy as np
>>> from pyhealth.metrics import binary_metrics_fn
>>> y_true = np.array([0, 0, 1, 1])
>>> y_prob = np.array([0.1, 0.4, 0.35, 0.8])
>>> binary_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}

MedCode#

We provide medical code mapping tools for (i) ontology mapping within one coding system and (ii) mapping the same concept across different coding systems.

class pyhealth.medcode.InnerMap(vocabulary, refresh_cache=False)[source]#

Bases: ABC

Contains information for a specific medical code system.

InnerMap is a base abstract class for all medical code systems. It will be instantiated as a specific medical code system with InnerMap.load(vocabulary).

Note

This class cannot be instantiated using __init__() (throws an error).

classmethod load(vocabulary, refresh_cache=False)[source]#

Initializes a specific medical code system inheriting from InnerMap.

Parameters
  • vocabulary (str) – vocabulary name. E.g., “ICD9CM”, “ICD9PROC”.

  • refresh_cache (bool) – whether to refresh the cache. Default is False.

Examples

>>> from pyhealth.medcode import InnerMap
>>> icd9cm = InnerMap.load("ICD9CM")
>>> icd9cm.lookup("428.0")
'Congestive heart failure, unspecified'
>>> icd9cm.get_ancestors("428.0")
['428', '420-429.99', '390-459.99', '001-999.99']
property available_attributes: List[str]#

Returns a list of available attributes.

Return type

List[str]

Returns

List of available attributes.

stat()[source]#

Prints statistics of the code system.

static standardize(code)[source]#

Standardizes a given code.

Subclasses override this method according to their specific medical code system.

Return type

str

static convert(code, **kwargs)[source]#

Converts a given code.

Subclasses override this method according to their specific medical code system.

Return type

str

lookup(code, attribute='name')[source]#

Looks up the code.

Parameters
  • code (str) – code to look up.

  • attribute (str) – attribute to look up. One of self.available_attributes. Default is “name”.

Returns

The attribute value of the code.

get_ancestors(code)[source]#

Gets the ancestors of the code.

Parameters

code (str) – code to look up.

Return type

List[str]

Returns

List of ancestors ordered from the closest to the farthest.

get_descendants(code)[source]#

Gets the descendants of the code.

Parameters

code (str) – code to look up.

Return type

List[str]

Returns

List of descendants ordered from the closest to the farthest.

class pyhealth.medcode.CrossMap(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#

Bases: object

Contains mapping between two medical code systems.

CrossMap is a base class for all possible mappings. It is initialized with two specific medical code systems via CrossMap.load(source_vocabulary, target_vocabulary).

classmethod load(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#

Initializes the mapping between two medical code systems.

Parameters
  • source_vocabulary (str) – source medical code system.

  • target_vocabulary (str) – target medical code system.

  • refresh_cache (bool) – whether to refresh the cache. Default is False.

Examples

>>> from pyhealth.medcode import CrossMap
>>> mapping = CrossMap("ICD9CM", "CCSCM")
>>> mapping.map("428.0")
['108']
>>> mapping = CrossMap.load("NDC", "ATC")
>>> mapping.map("00527051210", target_kwargs={"level": 3})
['A11C']
map(source_code, source_kwargs=None, target_kwargs=None)[source]#

Maps a source code to a list of target codes.

Parameters
  • source_code (str) – source code.

  • source_kwargs (Optional[Dict]) – additional arguments for the source code. Will be passed to self.s_class.convert(). Default is an empty dict.

  • target_kwargs (Optional[Dict]) – additional arguments for the target code. Will be passed to self.t_class.convert(). Default is an empty dict.

Return type

List[str]

Returns

A list of target codes.

Diagnosis codes:#

class pyhealth.medcode.ICD9CM(**kwargs)[source]#

Bases: InnerMap

9th International Classification of Diseases, Clinical Modification.

static standardize(code)[source]#

Standardizes ICD9CM code.

class pyhealth.medcode.ICD10CM(**kwargs)[source]#

Bases: InnerMap

10th International Classification of Diseases, Clinical Modification.

static standardize(code)[source]#

Standardizes ICD10CM code.

class pyhealth.medcode.CCSCM(**kwargs)[source]#

Bases: InnerMap

Classification of Diseases, Clinical Modification.

Procedure codes:#

class pyhealth.medcode.ICD9PROC(**kwargs)[source]#

Bases: InnerMap

9th International Classification of Diseases, Procedure.

static standardize(code)[source]#

Standardizes ICD9PROC code.

class pyhealth.medcode.ICD10PROC(**kwargs)[source]#

Bases: InnerMap

10th International Classification of Diseases, Procedure.

class pyhealth.medcode.CCSPROC(**kwargs)[source]#

Bases: InnerMap

Classification of Diseases, Procedure.

Medication codes:#

class pyhealth.medcode.NDC(**kwargs)[source]#

Bases: InnerMap

National Drug Code.

class pyhealth.medcode.RxNorm(**kwargs)[source]#

Bases: InnerMap

RxNorm.

class pyhealth.medcode.ATC(**kwargs)[source]#

Bases: InnerMap

Anatomical Therapeutic Chemical.

static convert(code, level=5)[source]#

Converts an ATC code to a specific level.
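A conversion sketch; an ATC level-3 code is the 4-character prefix, consistent with the CrossMap example above (the specific code is for illustration):

>>> from pyhealth.medcode import ATC
>>> ATC.convert("A11CA01", level=3)
'A11C'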

get_ddi(gamenet_ddi=False, refresh_cache=False)[source]#

Gets the drug-drug interactions (DDI).

Parameters
  • gamenet_ddi (bool) – Whether to use the DDI from the GAMENet paper, which is a subset of the DDI from the ATC.

  • refresh_cache (bool) – Whether to refresh the cache. Default is False.

Return type

List[str]
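A usage sketch, assuming the ATC table is loaded via InnerMap.load as shown earlier:

>>> from pyhealth.medcode import InnerMap
>>> atc = InnerMap.load("ATC")
>>> ddi = atc.get_ddi(gamenet_ddi=True)  # List[str] of interacting ATC code pairs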

PyHealth live#

Start Time: 8 PM Central Time, Wednesday

Recurrence: Weekly (starting from Dec 21, 2022)

Zoom: Join Link

Add to Google Calendar: Invitation

Add to Microsoft Outlook (.ics): Invitation

YouTube: Recorded Live Sessions

User/Developer Slack: Click to join

Schedules#

(Dec 21, Wed) Live 01 - What is PyHealth and How to Get Started? [Recap]

(Dec 28, Wed) Live 02 - Data & Datasets & Tasks: store unstructured data in a structured way. [Recap I] [II] [III] [IV]

(Jan 4, Wed) Live 03 - Models & Trainer & Metrics: initialize and train a deep learning model. [Recap I] [II] [III]

(Jan 11, Wed) Live 04 - Tokenizer & Medcode: master the medical code lookup and mapping [Recap I] [II]

(Jan 18, Wed) Live 05 - PyHealth can support a complete healthcare ML pipeline [Recap I] [II]

(Jan 25, Wed) Live 06 - Fit your own dataset into pipeline and use our model

(Feb 1, Wed) Live 07 - Adopt your customized model and quickly try it on our data

(Feb 8, Wed) Live 08 - Define your own healthcare task on MIMIC data

Development logs#

We track new developments here:

Jan 24, 2023

1. Fix the code typo in pyhealth/tasks/drug_recommendation.py for issue #71.
2. Update the PyHealth Live schedule.

Jan 22, 2023

1. Fix the list-of-list-of-vectors problem in RNN, Transformer, RETAIN, and CNN.
2. Add initialization examples for RNN, Transformer, RETAIN, CNN, and Deepr.
3. (minor) Change the parameters from "Type" and "level" to "type_" and "dim_".
4. BPDanek added the __repr__ function to medcode for more readable printing.
5. Add unit tests for pyhealth.data.

Jan 21, 2023

1. Add a new model, Deepr (models.Deepr).

Jan 20, 2023

1. Add PyHealth Live 05.
2. Add the Slack channel invitation to the PyHealth Live page.

Jan 13, 2023

1. Add the PyHealth Live 03 and 04 video links to the navigation.
2. Add the future PyHealth Live schedule.

Jan 8, 2023

1. Change BaseModel.add_feature_transform_layer in models/base_model.py so that it accepts special_tokens if necessary.
2. Fix an int/float bug in dataset checking (convert ints to floats, then process them uniformly).

Dec 26, 2022

1. Add examples to pyhealth.data and pyhealth.datasets.
2. Improve Jupyter notebook tutorials 0, 1, and 2.

Dec 21, 2022

1. Add the development logs to the navigation.
2. Add the PyHealth Live schedule to the navigation.

About us#

We are the SunLab healthcare research team at UIUC.

*Zhenbang Wu (Ph.D. Student @ University of Illinois Urbana-Champaign)

*Chaoqi Yang (Ph.D. Student @ University of Illinois Urbana-Champaign)

Patrick Jiang (M.S. Student @ University of Illinois Urbana-Champaign)

Jimeng Sun (Professor @ University of Illinois Urbana-Champaign)

(* indicates equal contribution)


Acknowledgement#

Yue Zhao (Ph.D. Student @ Carnegie Mellon University)

Dr. Zhi Qiao (Associate ML Director, ACOE @ IQVIA)

Dr. Xiao Cao (VP of Machine Learning and NLP, Relativity)

Xiyang Hu (Ph.D. Student @ Carnegie Mellon University)