Welcome to PyHealth!#
PyHealth is designed for both ML researchers and medical practitioners. It makes healthcare AI applications easier to develop, test, and validate, and keeps your development process flexible and customizable. [GitHub]
[News!] We are running the “PyHealth Live” gathering at 8 PM Central Time every Wednesday night! You are welcome to join the live discussion over Zoom. You may also add the schedule to your Google Calendar or Microsoft Outlook (.ics).
FYI, the PyHealth weekly live sessions introduce the basic pyhealth modules sequentially and showcase newly developed functions as well as different use cases. For efficiency, each live session lasts around half an hour, and the video collection can be found on YouTube. Future scheduled topics are announced here. Hope to see you all every Wednesday night!
Introduction [Video]#
PyHealth supports diverse electronic health records (EHRs) such as MIMIC, eICU, and all OMOP-CDM based databases, and provides advanced deep learning algorithms for important healthcare tasks such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.
Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.
Modules#
All healthcare tasks in our package follow a five-stage pipeline:
load dataset -> define task function -> build ML/DL model -> model training -> inference
We try hard to make each stage as independent as possible, so that you can customize your own pipeline using only our data processing steps or only our ML models. Each stage calls one module; we introduce them below with an example.
An ML Pipeline Example#
STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV, and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.
from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to ATC 3rd-level codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)
Users can also store their own dataset in our <pyhealth.datasets.SampleDataset> structure and then follow the same pipeline below; see the Tutorial.
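As a minimal sketch (the feature key "conditions" below is illustrative, not a required name):

from pyhealth.datasets import SampleDataset
# each sample is a dict keyed by patient_id, visit_id, and task-specific fields
samples = [
    {
        "patient_id": "p001",
        "visit_id": "v001",
        "conditions": [["428.0", "401.9"]],  # hypothetical feature key
        "label": 1,
    },
]
dataset = SampleDataset(samples=samples)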
STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object and defines how to process each patient’s data into a set of samples for the task. The package provides several task examples, such as drug recommendation and length-of-stay prediction.
from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader
mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])
# create dataloaders (torch.utils.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
STEP 3: <pyhealth.models> provides the healthcare ML models. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized ML architectures. Our model layers can be used as easily as torch.nn.Linear.
from pyhealth.models import Transformer
model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
STEP 4: <pyhealth.trainer> is the training manager. It takes the train_loader, the val_loader, a val_metric, and other arguments such as epochs, optimizer, and learning rate. The trainer automatically saves the best model and outputs its path at the end.
from pyhealth.trainer import Trainer
trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
STEP 5: <pyhealth.metrics> provides several common evaluation metrics (refer to Doc to see what is available) and healthcare-specific metrics, such as the drug-drug interaction (DDI) rate.
trainer.evaluate(test_loader)
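The underlying metric functions can also be called directly on predictions. A minimal sketch (assuming the multilabel helper multilabel_metrics_fn is importable from pyhealth.metrics, and that metric names follow the monitor key used above):

import numpy as np
from pyhealth.metrics import multilabel_metrics_fn
y_true = np.array([[0, 1, 1], [1, 0, 1]])               # ground-truth multi-hot labels
y_prob = np.array([[0.1, 0.9, 0.8], [0.7, 0.2, 0.6]])   # predicted probabilities
multilabel_metrics_fn(y_true, y_prob, metrics=["pr_auc_samples"])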
Medical Code Map#
<pyhealth.medcode> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be applied independently in your own research.
For code mapping between two coding systems:
from pyhealth.medcode import CrossMap
codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict
codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
For code ontology lookup within one system:
from pyhealth.medcode import InnerMap
icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get parents
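Descendant lookup follows the same pattern (a small sketch; get_descendants is assumed to mirror get_ancestors above):

icd9cm.get_descendants("428.0") # get children of the code in the ontology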
Medical Code Tokenizer#
<pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.
from pyhealth.tokenizer import Tokenizer
# Example: we use a list of ATC3 code as the token
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', \
'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C', \
'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]
# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
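The 1D and 3D helpers follow the same pattern. A short sketch (the method names convert_tokens_to_indices, convert_indices_to_tokens, and batch_encode_3d are taken from the tokenizer's 1D/3D API; exact padding defaults may differ):

# 1d encode / decode
indices = tokenizer.convert_tokens_to_indices(['A03C', 'A04A', 'B035'])
tokens = tokenizer.convert_indices_to_tokens(indices)
# 3d encode: one patient = a list of visits, each visit = a list of codes
indices = tokenizer.batch_encode_3d([[['A03C', 'A03D'], ['A04A']]])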
Users can customize their healthcare AI pipeline as simply as calling one module:
process your OMOP data via pyhealth.datasets
process open EHR data (e.g., MIMIC and eICU) via pyhealth.datasets
define your own task on existing databases via pyhealth.tasks
use existing healthcare models or build upon them (e.g., RETAIN) via pyhealth.models
map codes between coding systems for conditions and medications via pyhealth.medcode
Datasets#
We provide the following datasets for general purpose healthcare AI research:
Dataset | Module | Year | Information |
---|---|---|---|
MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | |
MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | |
eICU | pyhealth.datasets.eICUDataset | 2018 | |
OMOP | pyhealth.datasets.OMOPDataset | | |
Machine/Deep Learning Models#
Model Name | Type | Module | Year | Reference |
---|---|---|---|---|
Convolutional Neural Network (CNN) | deep learning | pyhealth.models.CNN | 1989 | Handwritten Digit Recognition with a Back-Propagation Network |
Recurrent Neural Nets (RNN) | deep learning | pyhealth.models.RNN | 2011 | |
Transformer | deep learning | pyhealth.models.Transformer | 2017 | Attention Is All You Need |
RETAIN | deep learning | pyhealth.models.RETAIN | 2016 | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism |
GAMENet | deep learning | pyhealth.models.GAMENet | 2019 | GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination |
MICRON | deep learning | pyhealth.models.MICRON | 2021 | Change Matters: Medication Change Prediction with Recurrent Residual Networks |
SafeDrug | deep learning | pyhealth.models.SafeDrug | 2021 | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations |
Benchmark on Healthcare Tasks#
Here is our benchmark doc on healthcare tasks; you can also check it below.
We also provide a function for leaderboard generation; check it out in our GitHub repo.
Here are dynamic visualizations of the leaderboard. You can click the checkboxes to easily compare the performance of different models on different tasks and datasets!
import sys
sys.path.append('../..')
from leaderboard import leaderboard_gen, utils
args = leaderboard_gen.construct_args()
leaderboard_gen.plots_generation(args)
Installation#
You can install it from PyPI:
pip install pyhealth
or from the GitHub source:
git clone https://github.com/sunlabuiuc/PyHealth.git
cd PyHealth
pip install .
Required Dependencies:
python>=3.8
torch>=1.8.0
rdkit>=2022.03.4
scikit-learn>=0.24.2
networkx>=2.6.3
pandas>=1.3.2
tqdm
Warning 1:
PyHealth includes multiple neural-network-based models (e.g., LSTM) that are implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you, which reduces the risk of interfering with your local copies. If you want to use neural-network-based models, please make sure PyTorch is installed. Similarly, models depending on xgboost will NOT enforce xgboost installation by default.
CUDA Setting:
To run PyHealth, you also need CUDA and a cudatoolkit version that supports your GPU. More info
For example, if you use NVIDIA RTX A6000 as your GPU for training, you should install a compatible cudatoolkit using:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
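After installation, a quick sanity check with standard PyTorch calls confirms that the GPU is visible:

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if CUDA is set up correctly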
Tutorials#
We provide the following tutorials to help users get started with PyHealth.
Tutorial 0: Introduction to pyhealth.data [Video]
Tutorial 1: Introduction to pyhealth.datasets [Video]
Tutorial 2: Introduction to pyhealth.tasks [Video]
Tutorial 3: Introduction to pyhealth.models [Video]
Tutorial 4: Introduction to pyhealth.trainer [Video]
Tutorial 5: Introduction to pyhealth.metrics [Video]
Tutorial 6: Introduction to pyhealth.tokenizer [Video]
Tutorial 7: Introduction to pyhealth.medcode [Video]
The following tutorials will help users build their own task pipelines. [Video]
Pipeline 1: Drug Recommendation
Pipeline 2: Length of Stay Prediction
Pipeline 3: Readmission Prediction
Pipeline 4: Mortality Prediction
Advanced Tutorials#
We provide the following advanced tutorials to support various needs.
Advanced Tutorial 1: Fit your dataset into our pipeline
Advanced Tutorial 2: Define your own healthcare task
Advanced Tutorial 3: Adopt customized model into pyhealth
Advanced Tutorial 4: Load your own processed data into pyhealth and try out our ML models
Data#
pyhealth.data defines the atomic data structures of this package.
pyhealth.data.Event#
One basic data structure in the package. It is a simple container for a single event. It contains all necessary attributes for supporting various healthcare tasks.
- class pyhealth.data.Event(code, table, vocabulary, visit_id, patient_id, timestamp=None, **attr)[source]#
Bases:
object
Contains information about a single event.
An event can be anything from a diagnosis to a prescription or a lab test that happened in a visit of a patient at a specific time.
- Parameters
code (str) – code of the event, e.g., “428.0” for congestive heart failure.
table (str) – name of the table where the event is recorded. This corresponds to the raw csv file name in the dataset, e.g., “DIAGNOSES_ICD”.
vocabulary (str) – vocabulary of the code, e.g., “ICD9CM” for ICD-9 diagnosis codes.
visit_id (str) – unique identifier of the visit.
patient_id (str) – unique identifier of the patient.
timestamp (Optional[datetime]) – timestamp of the event. Default is None.
**attr – optional attributes to add to the event as key=value pairs.
- attr_dict#
Dict, dictionary of event attributes. Each key is an attribute name and each value is the attribute’s value.
Examples
>>> from pyhealth.data import Event
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> event
Event with NDC code 00069153041 from table PRESCRIPTIONS
>>> event.attr_dict
{'dosage': '250mg'}
pyhealth.data.Visit#
Another basic data structure in the package. A Visit is a single encounter in the hospital. It is a container for a sequence of Event objects for each information aspect, such as diagnoses or medications. It also contains other necessary attributes for supporting healthcare tasks, such as the date of the visit.
- class pyhealth.data.Visit(visit_id, patient_id, encounter_time=None, discharge_time=None, discharge_status=None, **attr)[source]#
Bases:
object
Contains information about a single visit.
A visit is a period of time in which a patient is admitted to a hospital or a specific department. Each visit is associated with a patient and contains a list of different events.
- Parameters
visit_id (str) – unique identifier of the visit.
patient_id (str) – unique identifier of the patient.
encounter_time (Optional[datetime]) – timestamp of the visit’s encounter. Default is None.
discharge_time (Optional[datetime]) – timestamp of the visit’s discharge. Default is None.
discharge_status – patient’s status upon discharge. Default is None.
**attr – optional attributes to add to the visit as key=value pairs.
- attr_dict#
Dict, dictionary of visit attributes. Each key is an attribute name and each value is the attribute’s value.
- event_list_dict#
Dict[str, List[Event]], dictionary of event lists. Each key is a table name and each value is a list of events from that table ordered by timestamp.
Examples
>>> from pyhealth.data import Event, Visit
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> visit
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> visit.available_tables
['PRESCRIPTIONS']
>>> visit.num_events
1
>>> visit.get_event_list('PRESCRIPTIONS')
[Event with NDC code 00069153041 from table PRESCRIPTIONS]
>>> visit.get_code_list('PRESCRIPTIONS')
['00069153041']
- add_event(event)[source]#
Adds an event to the visit.
If the event’s table is not in the visit’s event list dictionary, it is added as a new key. The event is then added to the list of events of that table.
- Parameters
event (
Event
) – event to add.
Note
As of now, there is no check on the order of the events; the new event is simply appended to the end of the list.
- Return type
None
- get_event_list(table)[source]#
Returns a list of events from a specific table.
If the table is not in the visit’s event list dictionary, an empty list is returned.
- Parameters
table (str) – name of the table.
- Return type
List[Event]
- Returns
List of events from the specified table.
Note
As of now, there is no check on the order of the events; the list of events is simply returned as is.
- get_code_list(table, remove_duplicate=True)[source]#
Returns a list of codes from a specific table.
If the table is not in the visit’s event list dictionary, an empty list is returned.
- Parameters
table (str) – name of the table.
remove_duplicate (bool) – whether to remove duplicate codes. Default is True.
- Return type
List[str]
- Returns
List of codes from the specified table.
Note
As of now, there is no check on the order of the codes; the list of codes is simply returned as is.
- set_event_list(table, event_list)[source]#
Sets the list of events from a specific table.
This function will overwrite any existing list of events from the specified table.
Note
As of now, there is no check on the order of the events; the list of events is simply set as is.
- Return type
None
pyhealth.data.Patient#
Another basic data structure in the package. A Patient is a collection of Visit objects for the current patient. It contains all necessary attributes of a patient, such as ethnicity, mortality status, gender, etc., and can support various healthcare tasks.
- class pyhealth.data.Patient(patient_id, birth_datetime=None, death_datetime=None, gender=None, ethnicity=None, **attr)[source]#
Bases:
object
Contains information about a single patient.
A patient is a person who is admitted at least once to a hospital or a specific department. Each patient is associated with a list of visits.
- Parameters
patient_id (str) – unique identifier of the patient.
birth_datetime (Optional[datetime]) – timestamp of the patient’s birth. Default is None.
death_datetime (Optional[datetime]) – timestamp of the patient’s death. Default is None.
gender – gender of the patient. Default is None.
ethnicity – ethnicity of the patient. Default is None.
**attr – optional attributes to add to the patient as key=value pairs.
- attr_dict#
Dict, dictionary of patient attributes. Each key is an attribute name and each value is the attribute’s value.
- visits#
OrderedDict[str, Visit], an ordered dictionary of visits. Each key is a visit_id and each value is a visit.
- index_to_visit_id#
Dict[int, str], dictionary that maps the index of a visit in the visits list to the corresponding visit_id.
Examples
>>> from pyhealth.data import Event, Visit, Patient
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> patient = Patient(
...     patient_id="p001",
... )
>>> patient.add_visit(visit)
>>> patient
Patient p001 with 1 visits
- add_visit(visit)[source]#
Adds a visit to the patient.
If the visit’s visit_id is already in the patient’s visits dictionary, it will be overwritten by the new visit.
- Parameters
visit (
Visit
) – visit to add.
Note
As of now, there is no check on the order of the visits; the new visit is simply added to the end of the ordered dictionary of visits.
- Return type
None
- add_event(event)[source]#
Adds an event to the patient.
If the event’s visit_id is not in the patient’s visits dictionary, this function will raise KeyError.
- Parameters
event (
Event
) – event to add.
Note
As of now, there is no check on the order of the events; the new event is simply appended to the end of the list of events of the corresponding visit.
- Return type
None
Datasets#
pyhealth.datasets.BaseDataset#
This is the basic dataset class. Any specific dataset will inherit from this class.
- class pyhealth.datasets.BaseDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
ABC
Abstract base dataset class.
This abstract class defines a uniform interface for all datasets (e.g., MIMIC-III, MIMIC-IV, eICU, OMOP).
Each specific dataset will be a subclass of this abstract class, which can then be converted to a sample dataset for different tasks by calling self.set_task().
- Parameters
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., ["DIAGNOSES_ICD", "PROCEDURES_ICD"]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary, e.g., {"NDC": "ATC"};
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys "source_kwargs" or "target_kwargs" and values of the corresponding kwargs for the CrossMap.map() method, e.g., {"NDC": ("ATC", {"target_kwargs": {"level": 3}})}.
Default is empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- parse_tables()[source]#
Parses the tables in self.tables and returns a dict of patients.
Will be called in self.__init__() if the cache file does not exist or refresh_cache is True.
This function will first call self.parse_basic_info() to parse the basic patient information, and then call self.parse_[table_name]() to parse the table with name table_name. Both self.parse_basic_info() and self.parse_[table_name]() should be implemented in the subclass.
- set_task(task_fn, task_name=None)[source]#
Processes the base dataset to generate the task-specific sample dataset.
This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.
- Parameters
task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as key). The samples will be concatenated to form the sample dataset.
task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.
- Returns
the task-specific sample dataset.
- Return type
sample_dataset
Note
In task_fn, a patient may be converted to multiple samples, e.g., a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.
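As a minimal sketch of the task_fn contract (the task logic and label below are illustrative, not a pyhealth built-in; the Patient/Visit accessors used are the ones documented in pyhealth.data above):

def toy_task_fn(patient):
    # one sample per visit; a visit with no diagnosis codes is skipped,
    # and a patient yielding zero samples is excluded from the task dataset
    samples = []
    for i in range(len(patient.visits)):
        visit = patient.get_visit_by_index(i)
        conditions = visit.get_code_list(table="DIAGNOSES_ICD")
        if len(conditions) == 0:
            continue
        samples.append({
            "patient_id": patient.patient_id,
            "visit_id": visit.visit_id,
            "conditions": conditions,
            "label": 0,  # placeholder; real tasks derive the label from the data
        })
    return samples

sample_dataset = mimic3base.set_task(task_fn=toy_task_fn)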
pyhealth.datasets.SampleDataset#
This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.
- class pyhealth.datasets.SampleDataset(samples, dataset_name='', task_name='')[source]#
Bases:
Dataset
Sample dataset class.
This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided input) and provides a uniform interface for accessing the samples.
- Parameters
samples (List[Dict]) – a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key.
dataset_name (str) – name of the dataset. Default is “”.
task_name (str) – name of the task. Default is “”.
- Currently, the following types of attributes are supported:
a single value. Type: int/float/str. Dim: 0.
a single vector. Type: int/float. Dim: 1.
a list of codes. Type: str. Dim: 2.
a list of vectors. Type: int/float. Dim: 2.
a list of list of codes. Type: str. Dim: 3.
a list of list of vectors. Type: int/float. Dim: 3.
- input_info#
Dict, a dict whose keys are the same as the keys in the samples, and values are the corresponding input information:
“type”: the element type of each key attribute, one of float, int, str.
“dim”: the list dimension of each key attribute, one of 0, 1, 2, 3.
“len”: the length of the vector, only valid for vector-based attributes.
- patient_to_index#
Dict[str, List[int]], a dict mapping patient_id to a list of sample indices.
- visit_to_index#
Dict[str, List[int]], a dict mapping visit_id to a list of sample indices.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "single_vector": [1, 2, 3],
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "single_vector": [1, 5, 8],
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...             [[7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples)
>>> dataset.input_info
{'patient_id': {'type': <class 'str'>, 'dim': 0}, 'visit_id': {'type': <class 'str'>, 'dim': 0}, 'single_vector': {'type': <class 'int'>, 'dim': 1, 'len': 3}, 'list_codes': {'type': <class 'str'>, 'dim': 2}, 'list_vectors': {'type': <class 'float'>, 'dim': 2, 'len': 3}, 'list_list_codes': {'type': <class 'str'>, 'dim': 3}, 'list_list_vectors': {'type': <class 'float'>, 'dim': 3, 'len': 3}, 'label': {'type': <class 'int'>, 'dim': 0}}
>>> dataset.patient_to_index
{'patient-0': [0, 1]}
>>> dataset.visit_to_index
{'visit-0': [0], 'visit-1': [1]}
- get_all_tokens(key, remove_duplicates=True, sort=True)[source]#
Gets all tokens with a specific key in the samples.
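For instance, with the drug-recommendation sample dataset from the pipeline example above (which has a "drugs" key):

all_drugs = mimic3sample.get_all_tokens(key="drugs")  # de-duplicated, sorted token list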
pyhealth.datasets.MIMIC3Dataset#
The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMIC3Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseDataset
Base dataset for MIMIC-III dataset.
The MIMIC-III dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.
- The basic information is stored in the following tables:
PATIENTS: defines a patient in the database, SUBJECT_ID.
ADMISSIONS: defines a patient’s hospital admission, HADM_ID.
- We further support the following tables:
DIAGNOSES_ICD: contains ICD-9 diagnoses (ICD9CM code) for patients.
PROCEDURES_ICD: contains ICD-9 procedures (ICD9PROC code) for patients.
PRESCRIPTIONS: contains medication-related order entries (NDC code) for patients.
LABEVENTS: contains laboratory measurements (MIMIC3_ITEMID code) for patients.
- Parameters
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., ["DIAGNOSES_ICD", "PROCEDURES_ICD"]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys "source_kwargs" or "target_kwargs" and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> dataset = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses PATIENTS and ADMISSIONS tables.
Will be called in self.parse_tables()
- Docs:
PATIENTS: https://mimic.mit.edu/docs/iii/tables/patients/
ADMISSIONS: https://mimic.mit.edu/docs/iii/tables/admissions/
- parse_diagnoses_icd(patients)[source]#
Helper function which parses DIAGNOSES_ICD table.
Will be called in self.parse_tables()
- Docs:
DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set it to None.
- parse_procedures_icd(patients)[source]#
Helper function which parses PROCEDURES_ICD table.
Will be called in self.parse_tables()
- Docs:
PROCEDURES_ICD: https://mimic.mit.edu/docs/iii/tables/procedures_icd/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
MIMIC-III does not provide specific timestamps in the PROCEDURES_ICD table, so we set it to None.
- parse_prescriptions(patients)[source]#
Helper function which parses PRESCRIPTIONS table.
Will be called in self.parse_tables()
- Docs:
PRESCRIPTIONS: https://mimic.mit.edu/docs/iii/tables/prescriptions/
pyhealth.datasets.MIMIC4Dataset#
The open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMIC4Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseDataset
Base dataset for MIMIC-IV dataset.
The MIMIC-IV dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.
- The basic information is stored in the following tables:
patients: defines a patient in the database, subject_id.
admissions: defines a patient’s hospital admission, hadm_id.
- We further support the following tables:
diagnoses_icd: contains ICD diagnoses (ICD9CM and ICD10CM code) for patients.
procedures_icd: contains ICD procedures (ICD9PROC and ICD10PROC code) for patients.
prescriptions: contains medication-related order entries (NDC code) for patients.
labevents: contains laboratory measurements (MIMIC4_ITEMID code) for patients.
- Parameters
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., ["DIAGNOSES_ICD", "PROCEDURES_ICD"]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys "source_kwargs" or "target_kwargs" and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> dataset = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses the patients and admissions tables.
Will be called in self.parse_tables()
- Docs:
patients: https://mimic.mit.edu/docs/iv/modules/hosp/patients/
admissions: https://mimic.mit.edu/docs/iv/modules/hosp/admissions/
- parse_diagnoses_icd(patients)[source]#
Helper function which parses the diagnoses_icd table.
Will be called in self.parse_tables()
- Docs:
diagnoses_icd: https://mimic.mit.edu/docs/iv/modules/hosp/diagnoses_icd/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
MIMIC-IV does not provide specific timestamps in the diagnoses_icd table, so we set it to None.
- parse_procedures_icd(patients)[source]#
Helper function which parses procedures_icd table.
Will be called in self.parse_tables()
- Docs:
procedures_icd: https://mimic.mit.edu/docs/iv/modules/hosp/procedures_icd/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
MIMIC-IV does not provide specific timestamps in the procedures_icd table, so we set it to None.
- parse_prescriptions(patients)[source]#
Helper function which parses prescriptions table.
Will be called in self.parse_tables()
- Docs:
prescriptions: https://mimic.mit.edu/docs/iv/modules/hosp/prescriptions/
pyhealth.datasets.eICUDataset#
The open eICU Collaborative Research Database; refer to doc for more information. We process this database into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.eICUDataset(**kwargs)[source]#
Bases:
BaseDataset
Base dataset for eICU dataset.
The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.
- The basic information is stored in the following tables:
patient: defines a patient (uniquepid), a hospital admission (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.
hospital: contains information about a hospital (e.g., region).
Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time. Thus, we only know the order of ICU stays within a hospital admission, but not the order of hospital admissions within a patient. As a result, we use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.
- We further support the following tables:
diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM code) for patients.
treatment: contains treatment information (eICU_TREATMENTSTRING code) for patients.
medication: contains medication-related order entries (eICU_DRUGNAME code) for patients.
lab: contains laboratory measurements (eICU_LABNAME code) for patients.
physicalExam: contains all physical exams (eICU_PHYSICALEXAMPATH) conducted for patients.
- Parameters
dataset_name – name of the dataset.
root – root directory of the raw data (should contain many csv files).
tables – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).
code_mapping –
a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary;
a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "lab", "treatment", "physicalExam"],
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses the patient and hospital tables.
Will be called in self.parse_tables().
- Docs:
patient: https://eicu-crd.mit.edu/eicutables/patient/
hospital: https://eicu-crd.mit.edu/eicutables/hospital/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
We use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.
- parse_diagnosis(patients)[source]#
Helper function which parses diagnosis table.
Will be called in self.parse_tables().
- Docs:
diagnosis: https://eicu-crd.mit.edu/eicutables/diagnosis/
- Parameters
patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
- Return type
Dict[str, Patient]
- Returns
The updated patients dict.
Note
This table contains both ICD9CM and ICD10CM codes in one single cell; we need to use medcode to distinguish them.
- parse_treatment(patients)[source]#
Helper function which parses treatment table.
Will be called in self.parse_tables().
- Docs:
treatment: https://eicu-crd.mit.edu/eicutables/treatment/
- parse_medication(patients)[source]#
Helper function which parses medication table.
Will be called in self.parse_tables().
- Docs:
medication: https://eicu-crd.mit.edu/eicutables/medication/
- parse_lab(patients)[source]#
Helper function which parses lab table.
Will be called in self.parse_tables().
- Docs:
lab: https://eicu-crd.mit.edu/eicutables/lab/
- parse_physicalexam(patients)[source]#
Helper function which parses physicalExam table.
Will be called in self.parse_tables().
- Docs:
physicalExam: https://eicu-crd.mit.edu/eicutables/physicalexam/
pyhealth.datasets.OMOPDataset#
We can process any OMOP-CDM formatted database; refer to doc for more information. We process it into a well-structured dataset object and give users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.OMOPDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseDataset
Base dataset for OMOP dataset.
The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence.
See: https://www.ohdsi.org/data-standardization/the-common-data-model/.
- The basic information is stored in the following tables:
person: contains records that uniquely identify each person or patient, and some demographic information.
visit_occurrence: contains info on how a patient engages with the healthcare system for a duration of time.
death: contains info on how and when a patient dies.
- We further support the following tables:
condition_occurrence.csv: contains the condition information (CONDITION_CONCEPT_ID code) of patients’ visits.
procedure_occurrence.csv: contains the procedure information (PROCEDURE_CONCEPT_ID code) of patients’ visits.
drug_exposure.csv: contains the drug information (DRUG_CONCEPT_ID code) of patients’ visits.
measurement.csv: contains all laboratory measurements (MEASUREMENT_CONCEPT_ID code) of patients’ visits.
- Parameters
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., ["DIAGNOSES_ICD", "PROCEDURES_ICD"]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys "source_kwargs" or "target_kwargs" and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> dataset = OMOPDataset(
...     root="/srv/local/data/zw12/pyhealth/raw_data/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence", "drug_exposure", "measurement",],
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses the person, visit_occurrence, and death tables.
Will be called in self.parse_tables()
- Docs:
person: http://ohdsi.github.io/CommonDataModel/cdm53.html#PERSON
visit_occurrence: http://ohdsi.github.io/CommonDataModel/cdm53.html#VISIT_OCCURRENCE
death: http://ohdsi.github.io/CommonDataModel/cdm53.html#DEATH
- parse_condition_occurrence(patients)[source]#
Helper function which parses condition_occurrence table.
Will be called in self.parse_tables()
- Docs:
condition_occurrence: http://ohdsi.github.io/CommonDataModel/cdm53.html#CONDITION_OCCURRENCE
- parse_procedure_occurrence(patients)[source]#
Helper function which parses procedure_occurrence table.
Will be called in self.parse_tables()
- Docs:
procedure_occurrence: http://ohdsi.github.io/CommonDataModel/cdm53.html#PROCEDURE_OCCURRENCE
- parse_drug_exposure(patients)[source]#
Helper function which parses drug_exposure table.
Will be called in self.parse_tables()
- Docs:
drug_exposure: http://ohdsi.github.io/CommonDataModel/cdm53.html#DRUG_EXPOSURE
pyhealth.datasets.splitter#
Several data splitting functions for the pyhealth.datasets module to obtain training / validation / test sets.
- pyhealth.datasets.splitter.split_by_visit(dataset, ratios, seed=None)[source]#
Splits the dataset by visit (i.e., samples).
- Parameters
dataset – the SampleDataset to be split.
ratios – a list/tuple of three ratios for the train / validation / test split.
seed – random seed for the shuffle. Default is None.
- Returns
three subsets of the dataset, of type torch.utils.data.Subset.
- Return type
train_dataset, val_dataset, test_dataset
Note
The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.
- pyhealth.datasets.splitter.split_by_patient(dataset, ratios, seed=None)[source]#
Splits the dataset by patient.
- Parameters
dataset – the SampleDataset to be split.
ratios – a list/tuple of three ratios for the train / validation / test split.
seed – random seed for the shuffle. Default is None.
- Returns
three subsets of the dataset, of type torch.utils.data.Subset.
- Return type
train_dataset, val_dataset, test_dataset
Note
The original dataset can be accessed by train_dataset.dataset, val_dataset.dataset, and test_dataset.dataset.
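Usage mirrors the pipeline example at the top of this page; each returned Subset keeps a reference to the original dataset:

from pyhealth.datasets import split_by_patient
# no patient appears in more than one of the three subsets
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])
assert train_ds.dataset is mimic3sample  # access the original SampleDataset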
pyhealth.datasets.utils#
Several utility functions.
- pyhealth.datasets.utils.strptime(s)[source]#
Helper function which parses a string into a datetime object.
- pyhealth.datasets.utils.flatten_list(l)[source]#
Flattens a list of lists.
- Parameters
l (List) – the list of lists to be flattened.
- Return type
List
- Returns
List, the flattened list.
Examples
>>> flatten_list([[1], [2, 3], [4]])
[1, 2, 3, 4]
>>> flatten_list([[1], [[2], 3], [4]])
[1, [2], 3, 4]
- pyhealth.datasets.utils.list_nested_levels(l)[source]#
Gets all the different nested levels of a list.
- Parameters
l (List) – the list to be checked.
- Return type
- Returns
All the different nested levels of the list.
Examples
>>> list_nested_levels([])
(1,)
>>> list_nested_levels([1, 2, 3])
(1,)
>>> list_nested_levels([[]])
(2,)
>>> list_nested_levels([[1, 2, 3], [4, 5, 6]])
(2,)
>>> list_nested_levels([1, [2, 3], 4])
(1, 2)
>>> list_nested_levels([[1, [2, 3], 4]])
(2, 3)
- pyhealth.datasets.utils.is_homo_list(l)[source]#
Checks if a list is homogeneous.
- Parameters
l (List) – the list to be checked.
- Return type
bool
- Returns
bool, True if the list is homogeneous, False otherwise.
Examples
>>> is_homo_list([1, 2, 3])
True
>>> is_homo_list([])
True
>>> is_homo_list([1, 2, "3"])
False
>>> is_homo_list([1, 2, 3, [4, 5, 6]])
False
Tasks#
We support various real-world healthcare predictive tasks defined by function calls. The following example tasks are collected from top AI/Medical venues:
Drug Recommendation [Yang et al. IJCAI 2021a, Yang et al. IJCAI 2021b, Shang et al. AAAI 2020]
Readmission Prediction [Choi et al. AAAI 2021]
Mortality Prediction [Choi et al. AAAI 2021]
pyhealth.tasks.drug_recommendation#
- pyhealth.tasks.drug_recommendation.drug_recommendation_mimic3_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(drug_recommendation_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
- pyhealth.tasks.drug_recommendation.drug_recommendation_mimic4_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import drug_recommendation_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(drug_recommendation_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
- pyhealth.tasks.drug_recommendation.drug_recommendation_eicu_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Examples
>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import drug_recommendation_eicu_fn
>>> eicu_sample = eicu_base.set_task(drug_recommendation_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
- pyhealth.tasks.drug_recommendation.drug_recommendation_omop_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import drug_recommendation_omop_fn
>>> omop_sample = omop_base.set_task(drug_recommendation_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51'], ['98', '663', '58', '51']], 'procedures': [['1'], ['2', '3']], 'label': [['2', '3', '4'], ['0', '1', '4', '5']]}]
pyhealth.tasks.readmission_prediction#
- pyhealth.tasks.readmission_prediction.readmission_prediction_mimic3_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(readmission_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_mimic4_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import readmission_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(readmission_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn(patient, time_window=5)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import readmission_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_omop_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import readmission_prediction_omop_fn
>>> omop_sample = omop_base.set_task(readmission_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.mortality_prediction#
- pyhealth.tasks.mortality_prediction.mortality_prediction_mimic3_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(mortality_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_mimic4_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import mortality_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(mortality_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import mortality_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_omop_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will die during the next hospital visit, based on the clinical information from the current visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object
- Returns
a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys
- Return type
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import mortality_prediction_omop_fn
>>> omop_sample = omop_base.set_task(mortality_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
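All four functions above follow the same contract: they take one Patient and return a list of visit-level sample dicts. For users who want a custom label definition, a minimal sketch of a mortality-style task function is shown below; the Patient/Visit attribute names used here (get_code_list, discharge_status) are assumptions for illustration and may differ from the actual PyHealth API.

def custom_mortality_prediction_fn(patient):
    # Sketch only: pair each visit with the next one and use the next
    # visit's (assumed) discharge status as the binary label.
    samples = []
    for i in range(len(patient) - 1):
        visit, next_visit = patient[i], patient[i + 1]
        conditions = visit.get_code_list(table="DIAGNOSES_ICD")   # assumed helper
        procedures = visit.get_code_list(table="PROCEDURES_ICD")  # assumed helper
        if len(conditions) == 0 or len(procedures) == 0:
            continue  # skip visits without usable features
        samples.append({
            "visit_id": visit.visit_id,
            "patient_id": patient.patient_id,
            "conditions": [conditions],
            "procedures": [procedures],
            "label": int(next_visit.discharge_status == 1),  # assumed field
        })
    return samples

The resulting function can then be passed to set_task() exactly like the built-in ones.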
pyhealth.tasks.length_of_stay_prediction#
- pyhealth.tasks.length_of_stay_prediction.categorize_los(days)[source]#
Categorizes length of stay into 10 categories.
One for ICU stays shorter than a day, seven day-long categories for each day of the first week, one for stays of over one week but less than two, and one for stays of over two weeks.
- Parameters
days (int) – length of stay in days
- Returns
int, the category of the length of stay
- Return type
category
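For concreteness, the bucketing described above can be sketched as follows; this re-implementation is illustrative only, and the packaged function may differ in edge-case handling.

def categorize_los_sketch(days):
    # 10 buckets: 0 for stays under a day, 1-7 for each day of the first
    # week, 8 for more than a week but at most two, 9 for longer stays.
    if days < 1:
        return 0
    elif days <= 7:
        return days
    elif days <= 14:
        return 8
    else:
        return 9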
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic3_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object.
- Returns
a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.
- Return type
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> mimic3_base = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"ICD9CM": "CCSCM"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic3_fn
>>> mimic3_sample = mimic3_base.set_task(length_of_stay_prediction_mimic3_fn)
>>> mimic3_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 4}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic4_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object.
- Returns
a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.
- Return type
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> mimic4_base = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd"],
...     code_mapping={"ICD10PROC": "CCSPROC"},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_mimic4_fn
>>> mimic4_sample = mimic4_base.set_task(length_of_stay_prediction_mimic4_fn)
>>> mimic4_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 2}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_eicu_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object.
- Returns
a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.
- Return type
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> eicu_base = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication"],
...     code_mapping={},
...     dev=True
... )
>>> from pyhealth.tasks import length_of_stay_prediction_eicu_fn
>>> eicu_sample = eicu_base.set_task(length_of_stay_prediction_eicu_fn)
>>> eicu_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 5}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_omop_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters
patient (Patient) – a Patient object.
- Returns
a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys.
- Return type
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> omop_base = OMOPDataset(
...     root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence"],
...     code_mapping={},
... )
>>> from pyhealth.tasks import length_of_stay_prediction_omop_fn
>>> omop_sample = omop_base.set_task(length_of_stay_prediction_omop_fn)
>>> omop_sample.samples[0]
[{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 7}]
Models#
We implement the following models for supporting multiple healthcare predictive tasks.
pyhealth.models.MLP#
The complete MLP model.
- class pyhealth.models.MLP(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, n_layers=2, activation='relu', **kwargs)[source]#
Bases:
BaseModel
Multi-layer perceptron model.
This model applies a separate MLP layer for each feature, and then concatenates the final hidden states of each MLP layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate MLP layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the MLP model:
- case 1. [code1, code2, code3, …]
we will assume the codes follow the order; our model will encode each code into a vector; we use mean/sum pooling and then MLP
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we first use the embedding table to encode each code into a vector and then use mean/sum pooling to get one vector for each sample; we then use MLP layers
- case 3. [1.5, 2.0, 0.0]
we run MLP directly
- case 4. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and then MLP, similar to case 1 after the embedding table
- case 5. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and then MLP, similar to case 2 after the embedding table
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
n_layers (int) – the number of layers. Default is 2.
activation (str) – the activation function. Default is "relu".
**kwargs – other parameters for the MLP layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "conditions": ["cond-33", "cond-86", "cond-80"],
...         "procedures": [1.0, 2.0, 3.5, 4],
...         "label": 0,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "conditions": ["cond-33", "cond-86", "cond-80"],
...         "procedures": [5.0, 2.0, 3.5, 4],
...         "label": 1,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import MLP
>>> model = MLP(
...     dataset=dataset,
...     feature_keys=["conditions", "procedures"],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.6816, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5418], [0.5584]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]])}
- static mean_pooling(x, mask)[source]#
Mean pooling over the middle dimension of the tensor.
- Parameters
x – tensor of shape (batch_size, seq_len, embedding_dim)
mask – tensor of shape (batch_size, seq_len)
- Returns
tensor of shape (batch_size, embedding_dim)
- Return type
x
Examples
>>> x.shape
[128, 5, 32]
>>> mean_pooling(x, mask).shape
[128, 32]
- static sum_pooling(x)[source]#
Sum pooling over the middle dimension of the tensor.
- Parameters
x – tensor of shape (batch_size, seq_len, embedding_dim)
- Returns
tensor of shape (batch_size, embedding_dim)
- Return type
x
Examples
>>> x.shape
[128, 5, 32]
>>> sum_pooling(x).shape
[128, 32]
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor representing the predicted probabilities.
y_true: a tensor representing the true labels.
- Return type
A dictionary with the following keys
pyhealth.models.CNN#
The separate callable CNNLayer and the complete CNN model.
- class pyhealth.models.CNNLayer(input_size, hidden_size, num_layers=1)[source]#
Bases:
Module
Convolutional neural network layer.
This layer stacks multiple CNN blocks and applies adaptive average pooling at the end. It is used in the CNN model, but it can also be used as a standalone layer.
- Parameters
input_size (int) – input feature size.
hidden_size (int) – hidden feature size.
num_layers (int) – number of stacked CNN blocks. Default is 1.
Examples
>>> from pyhealth.models import CNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = CNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
- forward(x)[source]#
Forward propagation.
- Parameters
x (tensor) – a tensor of shape [batch size, sequence len, input size].
- Returns
outputs: a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.
pooled_outputs: a tensor of shape [batch size, hidden size], containing the pooled output features.
- Return type
outputs
- class pyhealth.models.CNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Convolutional neural network model.
This model applies a separate CNN layer for each feature, and then concatenates the final hidden states of each CNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate CNN layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the CNN model:
- case 1. [code1, code2, code3, …]
we will assume the codes follow the order; our model will encode each code into a vector and apply CNN on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses mean pooling to get one vector for each inner bracket, and then applies CNN on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 1 after the embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 2 after the embedding table
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
**kwargs – other parameters for the CNN layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...             [[7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import CNN
>>> model = CNN(
...     dataset=dataset,
...     feature_keys=[
...         "list_codes",
...         "list_vectors",
...         "list_list_codes",
...         "list_list_vectors",
...     ],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.8725, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.7620], [0.7339]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]])}
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor representing the predicted probabilities.
y_true: a tensor representing the true labels.
- Return type
A dictionary with the following keys
pyhealth.models.RNN#
The separate callable RNNLayer and the complete RNN model.
- class pyhealth.models.RNNLayer(input_size, hidden_size, rnn_type='GRU', num_layers=1, dropout=0.5, bidirectional=False)[source]#
Bases:
Module
Recurrent neural network layer.
This layer wraps the PyTorch RNN layer with masking and dropout support. It is used in the RNN model, but it can also be used as a standalone layer.
- Parameters
input_size (int) – input feature size.
hidden_size (int) – hidden feature size.
rnn_type (str) – type of rnn, one of "RNN", "LSTM", "GRU". Default is "GRU".
num_layers (int) – number of recurrent layers. Default is 1.
dropout (float) – dropout rate. If non-zero, introduces a Dropout layer before each RNN layer. Default is 0.5.
bidirectional (bool) – whether to use bidirectional recurrent layers. If True, a fully-connected layer is applied to the concatenation of the forward and backward hidden states to reduce the dimension to hidden_size. Default is False.
Examples
>>> from pyhealth.models import RNNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = RNNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters
x (tensor) – a tensor of shape [batch size, sequence len, input size].
mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns
outputs: a tensor of shape [batch size, sequence len, hidden size], containing the output features for each time step.
last_outputs: a tensor of shape [batch size, hidden size], containing the output features for the last time step.
- Return type
outputs
- class pyhealth.models.RNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Recurrent neural network model.
This model applies a separate RNN layer for each feature, and then concatenates the final hidden states of each RNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate rnn layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the RNN model:
- case 1. [code1, code2, code3, …]
we will assume the codes follow the order; our model will encode each code into a vector and apply RNN on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses mean pooling to get one vector for each inner bracket, and then applies RNN on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RNN directly on the inner bracket level, similar to case 1 after the embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RNN directly on the inner bracket level, similar to case 2 after the embedding table
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
**kwargs – other parameters for the RNN layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RNN
>>> model = RNN(
...     dataset=dataset,
...     feature_keys=[
...         "list_codes",
...         "list_vectors",
...         "list_list_codes",
...         "list_list_vectors",
...     ],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.7664, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.4714], [0.4085]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]])}
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor representing the predicted probabilities.
y_true: a tensor representing the true labels.
- Return type
A dictionary with the following keys
pyhealth.models.Transformer#
The separate callable TransformerLayer and the complete Transformer model.
- class pyhealth.models.TransformerLayer(feature_size, heads=1, dropout=0.5, num_layers=1)[source]#
Bases:
Module
Transformer layer.
Paper: Ashish Vaswani et al. Attention is all you need. NIPS 2017.
This layer is used in the Transformer model, but it can also be used as a standalone layer.
- Parameters
feature_size – the hidden feature size.
heads – the number of attention heads. Default is 1.
dropout – dropout rate. Default is 0.5.
num_layers – number of transformer layers. Default is 1.
Examples
>>> from pyhealth.models import TransformerLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = TransformerLayer(64)
>>> emb, cls_emb = layer(input)
>>> emb.shape
torch.Size([3, 128, 64])
>>> cls_emb.shape
torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters
x (tensor) – a tensor of shape [batch size, sequence len, feature_size].
mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns
emb: a tensor of shape [batch size, sequence len, feature_size], containing the output features for each time step.
cls_emb: a tensor of shape [batch size, feature_size], containing the output features for the first time step.
- Return type
emb
- class pyhealth.models.Transformer(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#
Bases:
BaseModel
Transformer model.
This model applies a separate Transformer layer for each feature, and then concatenates the final hidden states of each Transformer layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate Transformer layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the Transformer model:
- case 1. [code1, code2, code3, …]
we will assume the codes follow the order; our model will encode each code into a vector and apply Transformer on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses mean pooling to get one vector for each inner bracket, and then applies Transformer on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run Transformer directly on the inner bracket level, similar to case 1 after the embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run Transformer directly on the inner bracket level, similar to case 2 after the embedding table
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
**kwargs – other parameters for the Transformer layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Transformer
>>> model = Transformer(
...     dataset=dataset,
...     feature_keys=[
...         "list_codes",
...         "list_vectors",
...         "list_list_codes",
...         "list_list_vectors",
...     ],
...     label_key="label",
...     mode="multiclass",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.4234, grad_fn=<NllLossBackward0>), 'y_prob': tensor([[9.9998e-01, 2.2920e-05], [5.7120e-01, 4.2880e-01]], grad_fn=<SoftmaxBackward0>), 'y_true': tensor([0, 1])}
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor representing the predicted probabilities.
y_true: a tensor representing the true labels.
- Return type
A dictionary with the following keys
pyhealth.models.RETAIN#
The separate callable RETAINLayer and the complete RETAIN model.
- class pyhealth.models.RETAINLayer(feature_size, dropout=0.5)[source]#
Bases:
Module
RETAIN layer.
Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.
This layer is used in the RETAIN model, but it can also be used as a standalone layer.
- Parameters
feature_size – the hidden feature size.
dropout – dropout rate. Default is 0.5.
Examples
>>> from pyhealth.models import RETAINLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = RETAINLayer(64)
>>> c = layer(input)
>>> c.shape
torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters
x (tensor) – a tensor of shape [batch size, sequence len, feature_size].
mask (Optional[tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns
c: a tensor of shape [batch size, feature_size] representing the context vector.
- Return type
c
- class pyhealth.models.RETAIN(dataset, feature_keys, label_key, mode, embedding_dim=128, **kwargs)[source]#
Bases:
BaseModel
RETAIN model.
Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.
Note
We use separate RETAIN layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the RETAIN model:
- case 1. [code1, code2, code3, …]
we will assume the codes follow the order; our model will encode each code into a vector and apply RETAIN on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first uses the embedding table to encode each code into a vector, then uses mean pooling to get one vector for each inner bracket, and then applies RETAIN on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RETAIN directly on the inner bracket level, similar to case 1 after the embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length; we assume each dimension has the same meaning; we run RETAIN directly on the inner bracket level, similar to case 2 after the embedding table
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
**kwargs – other parameters for the RETAIN layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import RETAIN
>>> model = RETAIN(
...     dataset=dataset,
...     feature_keys=[
...         "list_codes",
...         "list_vectors",
...         "list_list_codes",
...         "list_list_vectors",
...     ],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.7234, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5423], [0.5142]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]])}
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor representing the predicted probabilities.
y_true: a tensor representing the true labels.
- Return type
A dictionary with the following keys
pyhealth.models.GAMENet#
The separate callable GAMENetLayer and the complete GAMENet model.
- class pyhealth.models.GAMENetLayer(hidden_size, ehr_adj, ddi_adj, dropout=0.5)[source]#
Bases:
Module
GAMENet layer.
Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination. AAAI 2019.
This layer is used in the GAMENet model, but it can also be used as a standalone layer.
- Parameters
hidden_size (int) – hidden feature size.
ehr_adj (Tensor) – an adjacency tensor of shape [num_drugs, num_drugs].
ddi_adj (Tensor) – an adjacency tensor of shape [num_drugs, num_drugs].
dropout (float) – the dropout rate. Default is 0.5.
Examples
>>> from pyhealth.models import GAMENetLayer
>>> queries = torch.randn(3, 5, 32)  # [patient, visit, hidden_size]
>>> prev_drugs = torch.randint(0, 2, (3, 4, 50)).float()
>>> curr_drugs = torch.randint(0, 2, (3, 50)).float()
>>> ehr_adj = torch.randint(0, 2, (50, 50)).float()
>>> ddi_adj = torch.randint(0, 2, (50, 50)).float()
>>> layer = GAMENetLayer(32, ehr_adj, ddi_adj)
>>> loss, y_prob = layer(queries, prev_drugs, curr_drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
- forward(queries, prev_drugs, curr_drugs, mask=None)[source]#
Forward propagation.
- Parameters
queries (tensor) – query tensor of shape [patient, visit, hidden_size].
prev_drugs (tensor) – multihot tensor indicating drug usage in all previous visits, of shape [patient, visit - 1, num_drugs].
curr_drugs (tensor) – multihot tensor indicating drug usage in the current visit, of shape [patient, num_drugs].
mask (Optional[tensor]) – an optional mask tensor of shape [patient, visit], where 1 indicates valid visits and 0 indicates invalid visits.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.
- Return type
loss
- class pyhealth.models.GAMENet(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#
Bases:
BaseModel
GAMENet model.
Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination. AAAI 2019.
Note
This model is only for medication prediction, taking conditions and procedures as feature_keys and drugs_all (i.e., both current and previous drugs) as the label_key. It operates on the visit level only.
Note
This model only accepts ATC level 3 as medication codes.
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
num_layers (int) – the number of layers used in the RNN. Default is 1.
dropout (float) – the dropout rate. Default is 0.5.
**kwargs – other parameters for the GAMENet layer.
- forward(conditions, procedures, drugs_all, **kwargs)[source]#
Forward propagation.
- Parameters
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.
- Return type
A dictionary with the following keys
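Because GAMENet builds its graph and memory components from the dataset itself, wiring it into a pipeline only requires the sample dataset. A minimal construction sketch, assuming mimic3sample was produced by a drug recommendation task that records current and previous drugs under drugs_all:

from pyhealth.models import GAMENet

# Sketch: only the sample dataset (plus optional hyperparameters) is
# passed in; the EHR/DDI graphs are derived from the dataset internally.
model = GAMENet(dataset=mimic3sample, embedding_dim=128, hidden_dim=128)

MICRON and SafeDrug below are constructed the same way (dataset plus optional hyperparameters).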
pyhealth.models.MICRON#
The separate callable MICRONLayer and the complete MICRON model.
- class pyhealth.models.MICRONLayer(input_size, hidden_size, num_drugs, lam=0.1)[source]#
Bases:
Module
MICRON layer.
Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.
This layer is used in the MICRON model, but it can also be used as a standalone layer.
- Parameters
input_size (int) – input feature size.
hidden_size (int) – hidden feature size.
num_drugs (int) – total number of drugs to recommend.
lam (float) – regularization coefficient. Default is 0.1.
Examples
>>> from pyhealth.models import MICRONLayer
>>> patient_emb = torch.randn(3, 5, 32)  # [patient, visit, input_size]
>>> drugs = torch.randint(0, 2, (3, 50)).float()
>>> layer = MICRONLayer(32, 64, 50)
>>> loss, y_prob = layer(patient_emb, drugs)
>>> loss.shape
torch.Size([])
>>> y_prob.shape
torch.Size([3, 50])
- forward(patient_emb, drugs, mask=None)[source]#
Forward propagation.
- Parameters
patient_emb (tensor) – a tensor of shape [patient, visit, input_size].
drugs (tensor) – a multihot tensor of shape [patient, num_labels].
mask (Optional[tensor]) – an optional tensor of shape [patient, visit], where 1 indicates valid visits and 0 indicates invalid visits.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.
- Return type
loss
- class pyhealth.models.MICRON(dataset, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
MICRON model.
Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.
Note
This model is only for medication prediction, taking conditions and procedures as feature_keys and drugs as the label_key. It operates on the visit level only.
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
**kwargs – other parameters for the MICRON layer.
- forward(conditions, procedures, drugs, **kwargs)[source]#
Forward propagation.
- Parameters
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.
- Return type
A dictionary with the following keys
pyhealth.models.SafeDrug#
The separate callable SafeDrugLayer and the complete SafeDrug model.
- class pyhealth.models.SafeDrugLayer(hidden_size, mask_H, ddi_adj, num_fingerprints, molecule_set, average_projection, kp=0.05, target_ddi=0.08)[source]#
Bases:
Module
SafeDrug layer.
Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.
This layer is used in the SafeDrug model, but it can also be used as a standalone layer.
- Parameters
hidden_size (int) – hidden feature size.
mask_H (Tensor) – the mask matrix H of shape [num_drugs, num_substructures].
ddi_adj (Tensor) – an adjacency tensor of shape [num_drugs, num_drugs].
num_fingerprints (int) – total number of different fingerprints.
molecule_set (List[Tuple]) – a list of molecule tuples (A, B, C) of length num_molecules, where A <torch.tensor> holds the fingerprints of the atoms in the molecule, B <torch.tensor> is the adjacency matrix of the molecule, and C <int> is the molecular size.
average_projection (Tensor) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.
kp (float) – correcting factor for the proportional signal. Default is 0.05.
target_ddi (float) – DDI acceptance rate. Default is 0.08.
- pad(matrices, pad_value)[source]#
Pads the list of matrices.
Padding with a pad_value (e.g., 0) for batch processing. For example, given a list of matrices [A, B, C], we obtain a single block-diagonal matrix [A00, 0B0, 00C], where 0 is the zero (i.e., pad value) matrix.
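A small illustration of this block-diagonal layout (hypothetical shapes, not the library's internal code):

import torch

# Two matrices A and B placed block-diagonally inside one zero-padded
# matrix, matching the [A00, 0B0, 00C] layout described above.
A = torch.ones(2, 2)
B = 2 * torch.ones(3, 3)
padded = torch.zeros(5, 5)
padded[:2, :2] = A
padded[2:, 2:] = B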
- forward(patient_emb, drugs, mask=None)[source]#
Forward propagation.
- Parameters
patient_emb (tensor) – a tensor of shape [patient, visit, input_size].
drugs (tensor) – a multihot tensor of shape [patient, num_labels].
mask (Optional[tensor]) – an optional tensor of shape [patient, visit], where 1 indicates valid visits and 0 indicates invalid visits.
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, num_labels] representing the probability of each drug.
- Return type
loss
- class pyhealth.models.SafeDrug(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#
Bases:
BaseModel
SafeDrug model.
Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.
Note
This model is only for medication prediction, taking conditions and procedures as feature_keys and drugs as the label_key. It operates on the visit level only.
Note
This model only accepts ATC level 3 as medication codes.
- Parameters
dataset (SampleDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
num_layers (int) – the number of layers used in the RNN. Default is 1.
dropout (float) – the dropout rate. Default is 0.5.
**kwargs – other parameters for the SafeDrug layer.
- forward(conditions, procedures, drugs, **kwargs)[source]#
Forward propagation.
- Parameters
- Returns
loss: a scalar tensor representing the loss.
y_prob: a tensor of shape [patient, visit, num_labels] representing the probability of each drug.
y_true: a tensor of shape [patient, visit, num_labels] representing the ground truth of each drug.
- Return type
A dictionary with the following keys
pyhealth.models.Deepr#
The separate callable DeeprLayer and the complete Deepr model.
- class pyhealth.models.DeeprLayer(feature_size=100, window=1, hidden_size=3)[source]#
Bases:
Module
Deepr layer.
Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, "Deepr: A Convolutional Net for Medical Records," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.
This layer is used in the Deepr model.
- Parameters
feature_size (int) – embedding feature size. Default is 100.
window (int) – sliding window size. Default is 1.
hidden_size (int) – hidden feature size. Default is 3.
Examples
>>> from pyhealth.models import DeeprLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = DeeprLayer(5, window=4, hidden_size=7)  # window does not impact the output shape
>>> outputs = layer(input)
>>> outputs.shape
torch.Size([3, 7])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters
x (Tensor) – a Tensor of shape [batch size, sequence len, input size].
mask (Optional[Tensor]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns
c: a Tensor of shape [batch size, hidden_size] representing the summarized vector.
- Return type
c
- class pyhealth.models.Deepr(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Deepr model.
Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, "Deepr: A Convolutional Net for Medical Records," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.
Note
We use separate Deepr layers for different feature_keys.
- Parameters
dataset (BaseDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. ["conditions", "procedures"].
label_key (str) – key in samples to use as label (e.g., "drugs").
mode (str) – one of "binary", "multiclass", or "multilabel".
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
**kwargs – other parameters for the Deepr layer.
Examples
>>> from pyhealth.datasets import SampleDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleDataset(samples=samples, dataset_name="test")
>>>
>>> from pyhealth.models import Deepr
>>> model = Deepr(
...     dataset=dataset,
...     feature_keys=[
...         "list_list_codes",
...         "list_list_vectors",
...     ],
...     label_key="label",
...     mode="binary",
... )
>>>
>>> from pyhealth.datasets import get_dataloader
>>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True)
>>> data_batch = next(iter(train_loader))
>>>
>>> ret = model(**data_batch)
>>> print(ret)
{'loss': tensor(0.9139, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.7530], [0.6510]], device='cuda:0', grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]], device='cuda:0')}
Trainer#
- class pyhealth.trainer.Trainer(model, checkpoint_path=None, metrics=None, device=None, enable_logging=True, output_path=None, exp_name=None)[source]#
Bases:
object
Trainer for PyTorch models.
- Parameters
model (Module) – PyTorch model.
checkpoint_path (Optional[str]) – Path to the checkpoint. Default is None, which means the model will be randomly initialized.
metrics (Optional[List[str]]) – List of metric names to be calculated. Default is None, which means the default metrics in each metrics_fn will be used.
device (Optional[str]) – Device to be used for training. Default is None, which means the device will be GPU if available, otherwise CPU.
enable_logging (bool) – Whether to enable logging. Default is True.
output_path (Optional[str]) – Path to save the output. Default is "./output".
exp_name (Optional[str]) – Name of the experiment. Default is the current datetime.
- train(train_dataloader, val_dataloader=None, test_dataloader=None, epochs=5, optimizer_class=torch.optim.Adam, optimizer_params=None, weight_decay=0.0, max_grad_norm=None, monitor=None, monitor_criterion='max', load_best_model_at_last=True)[source]#
Trains the model.
- Parameters
train_dataloader (DataLoader) – Dataloader for training.
val_dataloader (Optional[DataLoader]) – Dataloader for validation. Default is None.
test_dataloader (Optional[DataLoader]) – Dataloader for testing. Default is None.
epochs (int) – Number of epochs. Default is 5.
optimizer_class (Type[Optimizer]) – Optimizer class. Default is torch.optim.Adam.
optimizer_params (Optional[Dict[str, object]]) – Parameters for the optimizer. Default is {"lr": 1e-3}.
weight_decay (float) – Weight decay. Default is 0.0.
max_grad_norm (Optional[float]) – Maximum gradient norm. Default is None.
monitor (Optional[str]) – Metric name to monitor. Default is None.
monitor_criterion (str) – Criterion to monitor, one of "max" or "min". Default is "max".
load_best_model_at_last (bool) – Whether to load the best model at the end of training. Default is True.
- inference(dataloader)[source]#
Model inference.
- Parameters
dataloader – Dataloader for evaluation.
- Returns
y_true_all: List of true labels.
y_prob_all: List of predicted probabilities.
loss_mean: Mean loss over batches.
- Return type
y_true_all
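Putting the two methods together, a minimal training-and-evaluation sketch (assuming model and the dataloaders from the pipeline example; the monitored metric name should be one that your task actually reports):

from pyhealth.trainer import Trainer

trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",  # pick a metric reported for your task
)
y_true, y_prob, loss = trainer.inference(test_loader)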
Tokenizer#
The tokenizer functionality supports tokens-to-index and index-to-token mapping in general ML settings.
- class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#
Bases:
object
Vocabulary class for mapping between tokens and indices.
- class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#
Bases:
object
Tokenizer class for converting tokens to indices and vice versa.
This class will build a vocabulary from the provided tokens and provide the functionality to convert tokens to indices and vice versa. This class also provides the functionality to tokenize a batch of data.
- batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#
Converts a list of lists of tokens (2D) to indices.
- Parameters
batch (List[List[str]]) – List of lists of tokens to convert to indices.
padding (bool) – whether to pad the tokens to the max number of tokens in the batch (smart padding).
truncation (bool) – whether to truncate the tokens to max_length.
max_length (int) – maximum length of the tokens. This argument is ignored if truncation is False.
- batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#
Converts a list of lists of lists of tokens (3D) to indices.
- Parameters
batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.
padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).
truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_length.
max_length (Tuple[int, int]) – a tuple of two integers indicating the maximum length of the tokens along the first and second dimensions. This argument is ignored if truncation is False.
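A minimal usage sketch (the token values are made up for illustration):

from pyhealth.tokenizer import Tokenizer

# Build a tokenizer over a small vocabulary with special tokens.
tokens = ["A05B", "A05C", "A06A", "A11D", "A11E"]
tokenizer = Tokenizer(tokens=tokens, special_tokens=["<pad>", "<unk>"])

# 2D batch: a list of token lists, padded to the longest row.
batch_2d = [["A05B", "A05C"], ["A11D", "A11E", "A06A"]]
indices_2d = tokenizer.batch_encode_2d(batch_2d)

# 3D batch: a list of visit sequences, each a list of token lists.
batch_3d = [[["A05B"], ["A05C", "A06A"]], [["A11D", "A11E"]]]
indices_3d = tokenizer.batch_encode_3d(batch_3d)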
Metrics#
We provide easy-to-use metrics (the same style and args as sklearn.metrics) for binary, multiclass, and multilabel classification. We also provide other metrics specifically for healthcare tasks, such as drug-drug interaction (DDI) rate.
pyhealth.metrics.multiclass#
- pyhealth.metrics.multiclass.multiclass_metrics_fn(y_true, y_prob, metrics=None)[source]#
Computes metrics for multiclass classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
- roc_auc_macro_ovo: area under the receiver operating characteristic curve, macro averaged over one-vs-one multiclass classification
- roc_auc_macro_ovr: area under the receiver operating characteristic curve, macro averaged over one-vs-rest multiclass classification
- roc_auc_weighted_ovo: area under the receiver operating characteristic curve, weighted averaged over one-vs-one multiclass classification
- roc_auc_weighted_ovr: area under the receiver operating characteristic curve, weighted averaged over one-vs-rest multiclass classification
- accuracy: accuracy score
- balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)
- f1_micro: f1 score, micro averaged
- f1_macro: f1 score, macro averaged
- f1_weighted: f1 score, weighted averaged
- jaccard_micro: Jaccard similarity coefficient score, micro averaged
- jaccard_macro: Jaccard similarity coefficient score, macro averaged
- jaccard_weighted: Jaccard similarity coefficient score, weighted averaged
- cohen_kappa: Cohen's kappa score
If no metrics are specified, accuracy, f1_macro, and f1_micro are computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters
y_true (ndarray) – True target values of shape (n_samples,).
y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_classes).
metrics (Optional[List[str]]) – List of metrics to compute. Default is ["accuracy", "f1_macro", "f1_micro"].
- Return type
Dict[str, float]
- Returns
Dictionary of metrics whose keys are the metric names and values are the metric values.
Examples
>>> from pyhealth.metrics import multiclass_metrics_fn
>>> y_true = np.array([0, 1, 2, 2])
>>> y_prob = np.array([[0.9, 0.05, 0.05],
...                    [0.05, 0.9, 0.05],
...                    [0.05, 0.05, 0.9],
...                    [0.6, 0.2, 0.2]])
>>> multiclass_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}
pyhealth.metrics.multilabel#
- pyhealth.metrics.multilabel.multilabel_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#
Computes metrics for multilabel classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
- roc_auc_micro: area under the receiver operating characteristic curve, micro averaged
- roc_auc_macro: area under the receiver operating characteristic curve, macro averaged
- roc_auc_weighted: area under the receiver operating characteristic curve, weighted averaged
- roc_auc_samples: area under the receiver operating characteristic curve, samples averaged
- pr_auc_micro: area under the precision-recall curve, micro averaged
- pr_auc_macro: area under the precision-recall curve, macro averaged
- pr_auc_weighted: area under the precision-recall curve, weighted averaged
- pr_auc_samples: area under the precision-recall curve, samples averaged
- accuracy: accuracy score
- f1_micro: f1 score, micro averaged
- f1_macro: f1 score, macro averaged
- f1_weighted: f1 score, weighted averaged
- f1_samples: f1 score, samples averaged
- precision_micro: precision score, micro averaged
- precision_macro: precision score, macro averaged
- precision_weighted: precision score, weighted averaged
- precision_samples: precision score, samples averaged
- recall_micro: recall score, micro averaged
- recall_macro: recall score, macro averaged
- recall_weighted: recall score, weighted averaged
- recall_samples: recall score, samples averaged
- jaccard_micro: Jaccard similarity coefficient score, micro averaged
- jaccard_macro: Jaccard similarity coefficient score, macro averaged
- jaccard_weighted: Jaccard similarity coefficient score, weighted averaged
- jaccard_samples: Jaccard similarity coefficient score, samples averaged
- hamming_loss: Hamming loss
If no metrics are specified, pr_auc_samples is computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters
y_true (ndarray) – True target values of shape (n_samples, n_labels).
y_prob (ndarray) – Predicted probabilities of shape (n_samples, n_labels).
metrics (Optional[List[str]]) – List of metrics to compute. Default is ["pr_auc_samples"].
threshold (float) – Threshold to binarize the predicted probabilities. Default is 0.5.
- Return type
Dict[str, float]
- Returns
Dictionary of metrics whose keys are the metric names and values are the metric values.
Examples
>>> from pyhealth.metrics import multilabel_metrics_fn
>>> y_true = np.array([[0, 1, 1], [1, 0, 1]])
>>> y_prob = np.array([[0.1, 0.9, 0.8], [0.05, 0.95, 0.6]])
>>> multilabel_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.5}
pyhealth.metrics.binary#
- pyhealth.metrics.binary.binary_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#
Computes metrics for binary classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
- pr_auc: area under the precision-recall curve
- roc_auc: area under the receiver operating characteristic curve
- accuracy: accuracy score
- balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)
- f1: f1 score
- precision: precision score
- recall: recall score
- cohen_kappa: Cohen's kappa score
- jaccard: Jaccard similarity coefficient score
If no metrics are specified, pr_auc, roc_auc and f1 are computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters
  - y_true (ndarray) – True target values of shape (n_samples,).
  - y_prob (ndarray) – Predicted probabilities of shape (n_samples,).
  - metrics (Optional[List[str]]) – List of metrics to compute. Default is ["pr_auc", "roc_auc", "f1"].
  - threshold (float) – Threshold to binarize the predicted probabilities. Default is 0.5.
- Return type
  - Dict[str, float]
- Returns
  - Dictionary of metrics whose keys are the metric names and values are the metric values.
Examples
>>> from pyhealth.metrics import binary_metrics_fn
>>> y_true = np.array([0, 0, 1, 1])
>>> y_prob = np.array([0.1, 0.4, 0.35, 0.8])
>>> binary_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}
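As noted above, calling binary_metrics_fn without a metrics argument returns pr_auc, roc_auc, and f1. The sketch below approximates those defaults with sklearn directly; whether the library computes pr_auc via average_precision_score or the trapezoidal PR curve is an implementation detail not confirmed here.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, f1_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

pr_auc = average_precision_score(y_true, y_prob)     # PR-curve summary
roc_auc = roc_auc_score(y_true, y_prob)              # 0.75 for this example
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))   # f1 after thresholding at 0.5
print({"pr_auc": pr_auc, "roc_auc": roc_auc, "f1": f1})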
MedCode#
We provide medical code mapping tools for (i) ontology mapping within one coding system and (ii) mapping the same concept across different coding systems.
- class pyhealth.medcode.InnerMap(vocabulary, refresh_cache=False)[source]#
Bases:
ABC
Contains information for a specific medical code system.
InnerMap is a base abstract class for all medical code systems. It will be instantiated as a specific medical code system with InnerMap.load(vocabulary).
Note
This class cannot be instantiated using __init__() (throws an error).
- classmethod load(vocabulary, refresh_cache=False)[source]#
Initializes a specific medical code system inheriting from InnerMap.
- Parameters
  - vocabulary (str) – vocabulary name, e.g., "ICD9CM".
  - refresh_cache (bool) – whether to refresh the locally cached vocabulary files. Default is False.
Examples
>>> from pyhealth.medcode import InnerMap
>>> icd9cm = InnerMap.load("ICD9CM")
>>> icd9cm.lookup("428.0")
'Congestive heart failure, unspecified'
>>> icd9cm.get_ancestors("428.0")
['428', '420-429.99', '390-459.99', '001-999.99']
- static standardize(code)[source]#
Standardizes a given code.
Subclass will override this method based on different medical code systems.
- Return type
  - str
- static convert(code, **kwargs)[source]#
Converts a given code.
Subclass will override this method based on different medical code systems.
- Return type
  - str
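To make the role of these hooks concrete: raw datasets often store codes in a non-canonical form, and each subclass normalizes them. A hypothetical example, assuming ICD-9-CM standardization re-inserts the decimal point that MIMIC-III omits (the exact output is not verified here):

from pyhealth.medcode import InnerMap

icd9cm = InnerMap.load("ICD9CM")
# MIMIC-III stores "428.0" as "4280"; standardize() is the subclass hook
# expected to restore the canonical dotted form (assumed behavior).
print(icd9cm.standardize("4280"))  # expected: '428.0'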
- class pyhealth.medcode.CrossMap(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#
Bases:
object
Contains mapping between two medical code systems.
CrossMap is a base class for all possible mappings. It will be initialized with two specific medical code systems with CrossMap.load(source_vocabulary, target_vocabulary).
- classmethod load(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#
Initializes the mapping between two medical code systems.
- Parameters
  - source_vocabulary (str) – source vocabulary name, e.g., "ICD9CM".
  - target_vocabulary (str) – target vocabulary name, e.g., "CCSCM".
  - refresh_cache (bool) – whether to refresh the locally cached mapping files. Default is False.
Examples
>>> from pyhealth.medcode import CrossMap
>>> mapping = CrossMap.load("ICD9CM", "CCSCM")
>>> mapping.map("428.0")
['108']
>>> mapping = CrossMap.load("NDC", "ATC")
>>> mapping.map("00527051210", target_kwargs={"level": 3})
['A11C']
- map(source_code, source_kwargs=None, target_kwargs=None)[source]#
Maps a source code to a list of target codes.
- Parameters
  - source_code (str) – source code.
  - source_kwargs (Optional[Dict]) – additional arguments for the source code. Will be passed to self.s_class.convert(). Default is empty dict.
  - target_kwargs (Optional[Dict]) – additional arguments for the target code. Will be passed to self.t_class.convert(). Default is empty dict.
- Return type
  - List[str]
- Returns
  - A list of target codes.
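CrossMap and InnerMap compose naturally: CrossMap translates codes between systems, and an InnerMap on the target system can then look up human-readable names. A sketch under the assumption that "ATC" is also an accepted InnerMap vocabulary (consistent with the NDC-to-ATC example above; the printed names are illustrative):

from pyhealth.medcode import CrossMap, InnerMap

ndc_to_atc = CrossMap.load("NDC", "ATC")
atc = InnerMap.load("ATC")  # assumes ATC is a supported InnerMap vocabulary

# Map an NDC drug code to ATC level 3, then look up each target code's name.
for code in ndc_to_atc.map("00527051210", target_kwargs={"level": 3}):
    print(code, atc.lookup(code))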
Diagnosis codes:#
- class pyhealth.medcode.ICD9CM(**kwargs)[source]#
Bases:
InnerMap
International Classification of Diseases, 9th Revision, Clinical Modification.
Procedure codes:#
- class pyhealth.medcode.ICD9PROC(**kwargs)[source]#
Bases:
InnerMap
International Classification of Diseases, 9th Revision, Procedure Codes.
Medication codes:#
PyHealth live#
Start Time: 8 PM Central Time, Wednesday
Recurrence: Weekly (starting from Dec 21, 2022)
Zoom: Join Link
Add to Google Calendar: Invitation
Add to Microsoft Outlook (.ics): Invitation
YouTube: Recorded Live Sessions
User/Developer Slack: Click to join
Schedules#
(Dec 21, Wed) Live 01 - What is PyHealth and How to Get Started? [Recap]
(Dec 28, Wed) Live 02 - Data & Datasets & Tasks: store unstructured data in a structured way. [Recap I] [II] [III] [IV]
(Jan 4, Wed) Live 03 - Models & Trainer & Metrics: initialize and train a deep learning model. [Recap I] [II] [III]
(Jan 11, Wed) Live 04 - Tokenizer & Medcode: master the medical code lookup and mapping [Recap I] [II]
(Jan 18, Wed) Live 05 - PyHealth can support a complete healthcare ML pipeline [Recap I] [II]
(Jan 25, Wed) Live 06 - Fit your own dataset into pipeline and use our model
(Feb 1, Wed) Live 07 - Adopt your customized model and quickly try it on our data
(Feb 8, Wed) Live 08 - Define your own healthcare task on MIMIC data
Development logs#
We track new development here:
Jan 24, 2023
1. Fix the code typo in pyhealth/tasks/drug_recommendation.py for issue #71.
2. Update the PyHealth Live schedule
Jan 22, 2023
1. Fix the list-of-list-of-vectors input problem in RNN, Transformer, RETAIN, and CNN
2. Add initialization examples for RNN, Transformer, RETAIN, CNN, and Deepr
3. (minor) Change the parameters from "Type" and "level" to "type_" and "dim_"
4. BPDanek added the __repr__ function to medcode for more readable printing
5. Add unit tests for pyhealth.data
Jan 21, 2023
1. Added a new model, Deepr (models.Deepr)
Jan 20, 2023
1. Add PyHealth Live 05
2. Add the Slack channel invitation to the PyHealth Live page
Jan 13, 2023
1. Add the PyHealth Live 03 and 04 video links to the navigation
2. Add the future PyHealth Live schedule
Jan 8, 2023
1. Changed BaseModel.add_feature_transform_layer in models/base_model.py so that it accepts special_tokens if necessary
2. Fix an int/float bug in dataset checking (cast ints to floats so all numeric values are processed uniformly)
Dec 26, 2022
1. Add examples to pyhealth.data and pyhealth.datasets
2. Improve Jupyter notebook tutorials 0, 1, and 2
Dec 21, 2022
1. Add the development logs to the navigation
2. Add the PyHealth Live schedule to the navigation
About us#
We are the SunLab healthcare research team at UIUC.
*Zhenbang Wu (Ph.D. Student @ University of Illinois Urbana-Champaign)
*Chaoqi Yang (Ph.D. Student @ University of Illinois Urbana-Champaign)
Patrick Jiang (M.S. Student @ University of Illinois Urbana-Champaign)
Jimeng Sun (Professor @ University of Illinois Urbana-Champaign)
(* indicates equal contribution)
Acknowledgement#
Yue Zhao (Ph.D. Student @ Carnegie Mellon University)
Dr. Zhi Qiao (Associate ML Director, ACOE @ IQVIA)
Dr. Cao Xiao (VP of Machine Learning and NLP, Relativity)
Xiyang Hu (Ph.D. Student @ Carnegie Mellon University)