Welcome to PyHealth!#
Go to the KDD’23 Tutorial: https://sunlabuiuc.github.io/PyHealth/
PyHealth is designed for both ML researchers and medical practitioners. We can make your healthcare AI applications easier to develop, test and validate. Your development process becomes more flexible and more customizable. [GitHub]
[News!] We are continuously implementing good papers and benchmarks in PyHealth; check out the [Planned List]. You are welcome to pick one from the list and send us a PR, or to add more influential and new papers to the planned list.
Introduction [Video]#
PyHealth supports diverse electronic health records (EHRs) such as MIMIC, eICU, and any OMOP-CDM-based database, and provides advanced deep learning algorithms for important healthcare tasks such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.
Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.
Modules#
All healthcare tasks in our package follow a five-stage pipeline:
load dataset -> define task function -> build ML/DL model -> model training -> inference
We try hard to keep each stage as decoupled as possible, so that you can customize your own pipeline by using only our data processing steps or only our ML models. Each step calls one module, and we introduce them below with an example.
An ML Pipeline Example#
STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV, and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.
from pyhealth.datasets import MIMIC3Dataset

mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to ATC 3rd-level codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)
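Once loaded, you can inspect the dataset with its summary helpers (the same stat()/info() calls shown in the dataset class examples later on this page):

# quick sanity checks on the loaded dataset
mimic3base.stat()   # print patient/visit/event statistics
mimic3base.info()   # print the dataset schema and available tables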
Users can also store their own dataset in our <pyhealth.datasets.SampleBaseDataset> structure and then follow the same pipeline below; see the Tutorial.
STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object and defines how to process each patient’s data into a set of samples for the task. In the package, we provide several task examples, such as drug recommendation and length of stay prediction.
from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader
mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])
# create dataloaders (torch.utils.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
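Each dataloader yields batches as dictionaries keyed by the sample attributes; a minimal way to sanity-check one batch (standard PyTorch iteration, nothing PyHealth-specific):

# peek at one training batch; keys include the feature and label names
batch = next(iter(train_loader))
print(batch.keys())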
STEP 3: <pyhealth.models> provides the healthcare ML models. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized ML architectures. Our model layers can be used as easily as torch.nn.Linear.
from pyhealth.models import Transformer

model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
STEP 4: <pyhealth.trainer> is the training manager. Provide it with the train_loader, the val_loader, a val_metric, and other arguments such as epochs, optimizer, and learning rate. The trainer will automatically save the best model and output its path at the end.
from pyhealth.trainer import Trainer

trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
STEP 5: <pyhealth.metrics> provides several common evaluation metrics (refer to the Doc to see what is available) and healthcare-specific metrics, such as the drug-drug interaction (DDI) rate.
trainer.evaluate(test_loader)
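If you prefer to compute metrics yourself, here is a minimal sketch (assuming Trainer.inference returns labels, predicted probabilities, and loss, and that multilabel_metrics_fn is exposed by pyhealth.metrics; check the Doc for the exact metric names):

from pyhealth.metrics import multilabel_metrics_fn

# run inference on the test set to collect labels and predicted probabilities
y_true, y_prob, loss = trainer.inference(test_loader)
# compute a chosen subset of multilabel metrics (metric names here are assumptions)
print(multilabel_metrics_fn(y_true, y_prob, metrics=["pr_auc_samples", "f1_samples"]))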
Medical Code Map#
<pyhealth.medcode> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be applied independently in your research.
For code mapping between two coding systems
from pyhealth.medcode import CrossMap
codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict
codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
For code ontology lookup within one system
from pyhealth.medcode import InnerMap
icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get parents
Medical Code Tokenizer#
<pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.
from pyhealth.tokenizer import Tokenizer
# Example: we use a list of ATC-3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', \
               'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', \
               'A07B', 'A07C', 'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]
# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
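The same tokenizer also handles 1D lists; a minimal sketch (assuming the convert_tokens_to_indices / convert_indices_to_tokens helpers and get_vocabulary_size are available in the Tokenizer API):

# 1d encode / decode
tokens_1d = ['A03C', 'A03D', 'B035']
indices_1d = tokenizer.convert_tokens_to_indices(tokens_1d)   # [8, 9, 1] ('B035' -> <unk>)
tokenizer.convert_indices_to_tokens(indices_1d)               # ['A03C', 'A03D', '<unk>']

# total vocabulary size, including the special tokens
tokenizer.get_vocabulary_size()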
Users can customize their healthcare AI pipeline as simply as calling one module:
- process your OMOP data via pyhealth.datasets
- process open EHR data (e.g., MIMIC, eICU) via pyhealth.datasets
- define your own task on existing databases via pyhealth.tasks
- use existing healthcare models or build upon them (e.g., RETAIN) via pyhealth.models
- map codes between coding systems for conditions and medications via pyhealth.medcode
Datasets#
We provide the following datasets for general purpose healthcare AI research:
Dataset | Module | Year | Information
---|---|---|---
MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 |
MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 |
eICU | pyhealth.datasets.eICUDataset | 2018 |
OMOP | pyhealth.datasets.OMOPDataset | |
SleepEDF | pyhealth.datasets.SleepEDFDataset | 2018 |
SHHS | pyhealth.datasets.SHHSDataset | 2016 |
ISRUC | pyhealth.datasets.ISRUCDataset | 2016 |
Machine/Deep Learning Models#
Model Name | Type | Module | Year | Summary | Reference
---|---|---|---|---|---
Multi-layer Perceptron (MLP) | deep learning | | 1986 | MLP treats each feature as static |
Convolutional Neural Network (CNN) | deep learning | | 1989 | CNN runs on the conceptual patient-by-visit grids | Handwritten Digit Recognition with a Back-Propagation Network
Recurrent Neural Nets (RNN) | deep learning | | 2011 | RNN (includes LSTM and GRU) can run on any sequential level (e.g., visit-by-visit sequences) |
Transformer | deep learning | | 2017 | Transformer can run on any sequential level (e.g., visit-by-visit sequences) |
RETAIN | deep learning | | 2016 | RETAIN uses two RNNs to learn patient embeddings while providing feature-level and visit-level importance | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism
GAMENet | deep learning | | 2019 | GAMENet uses memory networks; used only for the drug recommendation task | GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination
MICRON | deep learning | | 2021 | MICRON predicts the future drug combination by predicting the changes w.r.t. the current combination; used only for the drug recommendation task | Change Matters: Medication Change Prediction with Recurrent Residual Networks
SafeDrug | deep learning | | 2021 | SafeDrug encodes drug molecule structures with graph neural networks; used only for the drug recommendation task | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations
MoleRec | deep learning | | 2023 | MoleRec encodes drug molecules at the substructure level, together with the patient’s information, into a drug combination representation; used only for the drug recommendation task | MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning
Deepr | deep learning | | 2017 | Deepr is based on 1D CNN; general purpose |
ContraWR Encoder (STFT+CNN) | deep learning | | 2021 | The ContraWR encoder uses a short-time Fourier transform (STFT) + 2D CNN; used for biosignal learning | Self-supervised EEG Representation Learning for Automatic Sleep Staging
SparcNet (1D CNN) | deep learning | | 2023 | SparcNet is based on 1D CNN; used for biosignal learning |
TCN | deep learning | | 2018 | TCN is based on dilated 1D CNN; general purpose |
AdaCare | deep learning | | 2020 | AdaCare uses CNNs with dilated filters to learn enriched patient embeddings; its feature calibration module provides feature-level and visit-level interpretability |
ConCare | deep learning | | 2020 | ConCare uses transformers to learn patient embeddings and calculate inter-feature correlations | ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context
StageNet | deep learning | | 2020 | StageNet uses a stage-aware LSTM to conduct clinical predictive tasks while learning patient disease progression stages unsupervisedly | StageNet: Stage-Aware Neural Networks for Health Risk Prediction
Dr. Agent | deep learning | | 2020 | Dr. Agent uses two reinforcement learning agents to learn patient embeddings by mimicking clinical second opinions | Dr. Agent: Clinical predictive model via mimicked second opinions
GRASP | deep learning | | 2021 | GRASP uses a graph neural network to identify latent patient clusters and uses the cluster information to learn patient representations |
Benchmark on Healthcare Tasks#
Here is our benchmark doc on healthcare tasks. You can also check it below.
We also provide functions for leaderboard generation; check them out in our GitHub repo.
Below are dynamic visualizations of the leaderboard. You can click the checkboxes to easily compare the performance of different models on different tasks and datasets.
import sys
sys.path.append('../..')
from leaderboard import leaderboard_gen, utils
args = leaderboard_gen.construct_args()
leaderboard_gen.plots_generation(args)
Installation#
You can install from PyPI:
pip install pyhealth
or from the GitHub source:
git clone https://github.com/sunlabuiuc/PyHealth.git
cd PyHealth
pip install .
Required Dependencies:
python>=3.8
torch>=1.8.0
rdkit>=2022.03.4
scikit-learn>=0.24.2
networkx>=2.6.3
pandas>=1.3.2
tqdm
Warning 1:
PyHealth has multiple neural-network-based models, e.g., LSTM, which are implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you; this reduces the risk of interfering with your local copies. If you want to use neural-network-based models, please make sure PyTorch is installed. Similarly, models that depend on xgboost do NOT trigger an xgboost installation by default.
CUDA Setting:
To run PyHealth on GPU, you also need CUDA and a cudatoolkit version that supports your GPU. More info
For example, if you use an NVIDIA RTX A6000 as your GPU for training, you should install a compatible cudatoolkit using:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Tutorials#
We provide the following tutorials to help users get started with PyHealth.
Tutorial 0: Introduction to pyhealth.data [Video]
Tutorial 1: Introduction to pyhealth.datasets [Video]
Tutorial 2: Introduction to pyhealth.tasks [Video]
Tutorial 3: Introduction to pyhealth.models [Video]
Tutorial 4: Introduction to pyhealth.trainer [Video]
Tutorial 5: Introduction to pyhealth.metrics [Video]
Tutorial 6: Introduction to pyhealth.tokenizer [Video]
Tutorial 7: Introduction to pyhealth.medcode [Video]
The following tutorials will help users build their own task pipelines. [Video]
Pipeline 1: Drug Recommendation
Pipeline 2: Length of Stay Prediction
Pipeline 3: Readmission Prediction
Pipeline 4: Mortality Prediction
Advanced Tutorials#
We provide the following advanced tutorials to support various needs.
Advanced Tutorial 1: Fit your dataset into our pipeline [Video]
Advanced Tutorial 2: Define your own healthcare task
Advanced Tutorial 3: Adopt customized model into pyhealth [Video]
Advanced Tutorial 4: Load your own processed data into pyhealth and try out our ML models [Video]
Data#
pyhealth.data defines the atomic data structures of this package.
pyhealth.data.Event#
One basic data structure in the package. It is a simple container for a single event. It contains all necessary attributes for supporting various healthcare tasks.
- class pyhealth.data.Event(code=None, table=None, vocabulary=None, visit_id=None, patient_id=None, timestamp=None, **attr)[source]#
Bases:
object
Contains information about a single event.
An event can be anything from a diagnosis to a prescription or a lab test that happened in a visit of a patient at a specific time.
- Parameters:
code (Optional[str]) – code of the event. E.g., “428.0” for congestive heart failure.
table (Optional[str]) – name of the table where the event is recorded. This corresponds to the raw csv file name in the dataset. E.g., “DIAGNOSES_ICD”.
vocabulary (Optional[str]) – vocabulary of the code. E.g., “ICD9CM” for ICD-9 diagnosis codes.
visit_id (Optional[str]) – unique identifier of the visit.
patient_id (Optional[str]) – unique identifier of the patient.
timestamp (Optional[datetime]) – timestamp of the event. Default is None.
**attr – optional attributes to add to the event as key=value pairs.
- attr_dict#
Dict, dictionary of event attributes. Each key is an attribute name and each value is the attribute’s value.
Examples
>>> from pyhealth.data import Event
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> event
Event with NDC code 00069153041 from table PRESCRIPTIONS
>>> event.attr_dict
{'dosage': '250mg'}
pyhealth.data.Visit#
Another basic data structure in the package. A Visit is a single encounter in a hospital. It is a container for a sequence of Event objects for each information aspect, such as diagnoses or medications. It also contains other attributes necessary for supporting healthcare tasks, such as the date of the visit.
- class pyhealth.data.Visit(visit_id, patient_id, encounter_time=None, discharge_time=None, discharge_status=None, **attr)[source]#
Bases:
object
Contains information about a single visit.
A visit is a period of time in which a patient is admitted to a hospital or a specific department. Each visit is associated with a patient and contains a list of different events.
- Parameters:
visit_id (str) – unique identifier of the visit.
patient_id (str) – unique identifier of the patient.
encounter_time (Optional[datetime]) – timestamp of the visit’s encounter. Default is None.
discharge_time (Optional[datetime]) – timestamp of the visit’s discharge. Default is None.
discharge_status – patient’s status upon discharge. Default is None.
**attr – optional attributes to add to the visit as key=value pairs.
- attr_dict#
Dict, dictionary of visit attributes. Each key is an attribute name and each value is the attribute’s value.
- event_list_dict#
Dict[str, List[Event]], dictionary of event lists. Each key is a table name and each value is a list of events from that table ordered by timestamp.
Examples
>>> from pyhealth.data import Event, Visit
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> visit
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> visit.available_tables
['PRESCRIPTIONS']
>>> visit.num_events
1
>>> visit.get_event_list('PRESCRIPTIONS')
[Event with NDC code 00069153041 from table PRESCRIPTIONS]
>>> visit.get_code_list('PRESCRIPTIONS')
['00069153041']
- add_event(event)[source]#
Adds an event to the visit.
If the event’s table is not in the visit’s event list dictionary, it is added as a new key. The event is then added to the list of events of that table.
- Parameters:
event (
Event
) – event to add.
Note
- As for now, there is no check on the order of the events. The new event
is simply appended to end of the list.
- Return type:
- get_event_list(table)[source]#
Returns a list of events from a specific table.
If the table is not in the visit’s event list dictionary, an empty list is returned.
- Parameters:
table (
str
) – name of the table.- Return type:
- Returns:
List of events from the specified table.
Note
- As for now, there is no check on the order of the events. The list of
events is simply returned as is.
- get_code_list(table, remove_duplicate=True)[source]#
Returns a list of codes from a specific table.
If the table is not in the visit’s event list dictionary, an empty list is returned.
- Parameters:
- Return type:
- Returns:
List of codes from the specified table.
Note
- As for now, there is no check on the order of the codes. The list of
codes is simply returned as is.
- set_event_list(table, event_list)[source]#
Sets the list of events from a specific table.
This function will overwrite any existing list of events from the specified table.
Note
- As for now, there is no check on the order of the events. The list of
events is simply set as is.
- Return type:
pyhealth.data.Patient#
Another basic data structure in the package. A Patient is a collection of Visit objects for a single patient. It contains all necessary attributes of a patient, such as ethnicity, mortality status, and gender, and can support various healthcare tasks.
- class pyhealth.data.Patient(patient_id, birth_datetime=None, death_datetime=None, gender=None, ethnicity=None, **attr)[source]#
Bases:
object
Contains information about a single patient.
A patient is a person who is admitted at least once to a hospital or a specific department. Each patient is associated with a list of visits.
- Parameters:
patient_id (str) – unique identifier of the patient.
birth_datetime (Optional[datetime]) – timestamp of the patient’s birth. Default is None.
death_datetime (Optional[datetime]) – timestamp of the patient’s death. Default is None.
gender – gender of the patient. Default is None.
ethnicity – ethnicity of the patient. Default is None.
**attr – optional attributes to add to the patient as key=value pairs.
- attr_dict#
Dict, dictionary of patient attributes. Each key is an attribute name and each value is the attribute’s value.
- visits#
OrderedDict[str, Visit], an ordered dictionary of visits. Each key is a visit_id and each value is a visit.
- index_to_visit_id#
Dict[int, str], dictionary that maps the index of a visit in the visits list to the corresponding visit_id.
Examples
>>> from pyhealth.data import Event, Visit, Patient
>>> event = Event(
...     code="00069153041",
...     table="PRESCRIPTIONS",
...     vocabulary="NDC",
...     visit_id="v001",
...     patient_id="p001",
...     dosage="250mg",
... )
>>> visit = Visit(
...     visit_id="v001",
...     patient_id="p001",
... )
>>> visit.add_event(event)
>>> patient = Patient(
...     patient_id="p001",
... )
>>> patient.add_visit(visit)
>>> patient
Patient p001 with 1 visits
>>> patient.available_tables
['PRESCRIPTIONS']
>>> patient.get_visit_by_index(0)
Visit v001 from patient p001 with 1 events from tables ['PRESCRIPTIONS']
>>> patient.get_visit_by_index(0).get_code_list(table="PRESCRIPTIONS")
['00069153041']
- add_visit(visit)[source]#
Adds a visit to the patient.
If the visit’s visit_id is already in the patient’s visits dictionary, it will be overwritten by the new visit.
- Parameters:
visit (
Visit
) – visit to add.
Note
- As for now, there is no check on the order of the visits. The new visit
is simply added to the end of the ordered dictionary of visits.
- Return type:
- add_event(event)[source]#
Adds an event to the patient.
If the event’s visit_id is not in the patient’s visits dictionary, this function will raise KeyError.
- Parameters:
event (
Event
) – event to add.
Note
- As for now, there is no check on the order of the events. The new event
is simply appended to the end of the list of events of the corresponding visit.
- Return type:
Datasets#
pyhealth.datasets.BaseEHRDataset#
This is the basic EHR dataset class. Any specific EHR dataset will inherit from this class.
- class pyhealth.datasets.BaseEHRDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
ABC
Abstract base dataset class.
This abstract class defines a uniform interface for all EHR datasets (e.g., MIMIC-III, MIMIC-IV, eICU, OMOP).
Each specific dataset will be a subclass of this abstract class, which can then be converted to samples dataset for different tasks by calling self.set_task().
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]). Basic tables will be loaded by default.
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is one of two formats:
a str of the target code vocabulary, e.g., {“NDC”: “ATC”};
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method, e.g., {“NDC”: (“ATC”, {“target_kwargs”: {“level”: 3}})}.
Default is an empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- parse_tables()[source]#
Parses the tables in self.tables and returns a dict of patients.
- Will be called in self.__init__() if cache file does not exist or
refresh_cache is True.
This function will first call self.parse_basic_info() to parse the basic patient information, and then call self.parse_[table_name]() to parse the table with name table_name. Both self.parse_basic_info() and self.parse_[table_name]() should be implemented in the subclass.
- set_task(task_fn, task_name=None)[source]#
Processes the base dataset to generate the task-specific sample dataset.
This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.
- Parameters:
task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys). The samples will be concatenated to form the sample dataset.
task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.
- Returns:
the task-specific sample dataset.
- Return type:
sample_dataset
Note
- In task_fn, a patient may be converted to multiple samples, e.g.,
a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.
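For illustration, here is a hypothetical task-function sketch (the function name, the feature key, and the label encoding are assumptions for this sketch, not a PyHealth-provided task); it only uses the Patient/Visit attributes documented above:

def mortality_prediction_fn(patient):
    # hypothetical sketch of a task_fn: predict next-visit mortality from the
    # current visit's diagnosis codes
    samples = []
    visits = list(patient.visits.values())  # ordered dict of visits (see pyhealth.data.Patient)
    for i in range(len(visits) - 1):
        visit, next_visit = visits[i], visits[i + 1]
        samples.append({
            "patient_id": patient.patient_id,
            "visit_id": visit.visit_id,
            "conditions": visit.get_code_list(table="DIAGNOSES_ICD"),
            # the label encoding is an assumption; adapt to how discharge_status is coded
            "label": 0 if next_visit.discharge_status in [None, 0, "0"] else 1,
        })
    return samples  # returning an empty list excludes the patient from the task dataset

# sample_dataset = base_dataset.set_task(task_fn=mortality_prediction_fn)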
pyhealth.datasets.BaseSignalDataset#
This is the basic Signal dataset class. Any specific Signal dataset will inherit from this class.
- class pyhealth.datasets.BaseSignalDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
ABC
Abstract base Signal dataset class.
This abstract class defines a uniform interface for all EEG datasets (e.g., SleepEDF, SHHS).
Each specific dataset will be a subclass of this abstract class, which can then be converted to samples dataset for different tasks by calling self.set_task().
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- set_task(task_fn, task_name=None)[source]#
Processes the base dataset to generate the task-specific sample dataset.
This function should be called by the user after the base dataset is initialized. It will iterate through all patients in the base dataset and call task_fn which should be implemented by the specific task.
- Parameters:
task_fn (Callable) – a function that takes a single patient and returns a list of samples (each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys). The samples will be concatenated to form the sample dataset.
task_name (Optional[str]) – the name of the task. If None, the name of the task function will be used.
- Returns:
the task-specific sample (Base) dataset.
- Return type:
sample_dataset
Note
- In task_fn, a patient may be converted to multiple samples, e.g.,
a patient with three visits may be converted to three samples ([visit 1], [visit 1, visit 2], [visit 1, visit 2, visit 3]). Patients can also be excluded from the task dataset by returning an empty list.
pyhealth.datasets.SampleEHRDataset#
This class takes a list of samples as input (either from BaseEHRDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.
- class pyhealth.datasets.SampleEHRDataset(samples, code_vocs=None, dataset_name='', task_name='')[source]#
Bases:
SampleBaseDataset
Sample EHR dataset class.
- This class inherits from SampleBaseDataset and is specifically designed
for EHR datasets.
- Parameters:
- Currently, the following types of attributes are supported:
a single value. Type: int/float/str. Dim: 0.
a single vector. Type: int/float. Dim: 1.
a list of codes. Type: str. Dim: 2.
a list of vectors. Type: int/float. Dim: 2.
a list of list of codes. Type: str. Dim: 3.
a list of list of vectors. Type: int/float. Dim: 3.
- input_info#
Dict, a dict whose keys are the same as the keys in the samples, and whose values are the corresponding input information:
“type”: the element type of each key attribute, one of float, int, str.
“dim”: the list dimension of each key attribute, one of 0, 1, 2, 3.
“len”: the length of the vector, only valid for vector-based attributes.
- patient_to_index#
Dict[str, List[int]], a dict mapping patient_id to a list of sample indices.
- visit_to_index#
Dict[str, List[int]], a dict mapping visit_id to a list of sample indices.
Examples
>>> from pyhealth.datasets import SampleEHRDataset
>>> samples = [
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-0",
...         "single_vector": [1, 2, 3],
...         "list_codes": ["505800458", "50580045810", "50580045811"],  # NDC
...         "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]],
...         "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]],  # ATC-4
...         "list_list_vectors": [
...             [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]],
...             [[7.7, 8.5, 9.4]],
...         ],
...         "label": 1,
...     },
...     {
...         "patient_id": "patient-0",
...         "visit_id": "visit-1",
...         "single_vector": [1, 5, 8],
...         "list_codes": [
...             "55154191800",
...             "551541928",
...             "55154192800",
...             "705182798",
...             "70518279800",
...         ],
...         "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]],
...         "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]],
...         "list_list_vectors": [
...             [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]],
...             [[7.7, 8.4, 1.3]],
...         ],
...         "label": 0,
...     },
... ]
>>> dataset = SampleEHRDataset(samples=samples)
>>> dataset.input_info
{'patient_id': {'type': <class 'str'>, 'dim': 0}, 'visit_id': {'type': <class 'str'>, 'dim': 0}, 'single_vector': {'type': <class 'int'>, 'dim': 1, 'len': 3}, 'list_codes': {'type': <class 'str'>, 'dim': 2}, 'list_vectors': {'type': <class 'float'>, 'dim': 2, 'len': 3}, 'list_list_codes': {'type': <class 'str'>, 'dim': 3}, 'list_list_vectors': {'type': <class 'float'>, 'dim': 3, 'len': 3}, 'label': {'type': <class 'int'>, 'dim': 0}}
>>> dataset.patient_to_index
{'patient-0': [0, 1]}
>>> dataset.visit_to_index
{'visit-0': [0], 'visit-1': [1]}
pyhealth.datasets.SampleSignalDataset#
This class takes a list of samples as input (either from BaseSignalDataset.set_task() or user-provided JSON input) and provides a uniform interface for accessing the samples.
- class pyhealth.datasets.SampleSignalDataset(samples, dataset_name='', task_name='')[source]#
Bases:
SampleBaseDataset
Sample signal dataset class.
This class takes a list of samples as input (either from BaseDataset.set_task() or user-provided input) and provides a uniform interface for accessing the samples.
- Parameters:
samples (List[Dict]) – a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as keys.
classes – a list of classes, e.g., [“W”, “1”, “2”, “3”, “R”].
dataset_name – the name of the dataset. Default is None.
task_name – the name of the task. Default is None.
pyhealth.datasets.MIMIC3Dataset#
The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMIC3Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseEHRDataset
Base dataset for MIMIC-III dataset.
The MIMIC-III dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.
- The basic information is stored in the following tables:
PATIENTS: defines a patient in the database, SUBJECT_ID.
ADMISSIONS: defines a patient’s hospital admission, HADM_ID.
- We further support the following tables:
DIAGNOSES_ICD: contains ICD-9 diagnoses (ICD9CM code) for patients.
PROCEDURES_ICD: contains ICD-9 procedures (ICD9PROC code) for patients.
- PRESCRIPTIONS: contains medication related order entries (NDC code)
for patients.
- LABEVENTS: contains laboratory measurements (MIMIC3_ITEMID code)
for patients
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is one of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is an empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset
>>> dataset = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses PATIENTS and ADMISSIONS tables.
Will be called in self.parse_tables()
- Docs:
- parse_diagnoses_icd(patients)[source]#
Helper function which parses DIAGNOSES_ICD table.
Will be called in self.parse_tables()
- Docs:
DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-III does not provide specific timestamps in DIAGNOSES_ICD
table, so we set it to None.
- parse_procedures_icd(patients)[source]#
Helper function which parses PROCEDURES_ICD table.
Will be called in self.parse_tables()
- Docs:
PROCEDURES_ICD: https://mimic.mit.edu/docs/iii/tables/procedures_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-III does not provide specific timestamps in PROCEDURES_ICD
table, so we set it to None.
- parse_prescriptions(patients)[source]#
Helper function which parses PRESCRIPTIONS table.
Will be called in self.parse_tables()
- Docs:
PRESCRIPTIONS: https://mimic.mit.edu/docs/iii/tables/prescriptions/
pyhealth.datasets.MIMICExtractDataset#
The MIMIC-Extract pipeline provides a cleaned, preprocessed subset of the open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to the doc for more information. We process this data into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMICExtractDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False, pop_size=None, itemid_to_variable_map=None)[source]#
Bases:
BaseEHRDataset
Base dataset for MIMIC-Extract dataset.
Reads the HDF5 data produced by [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract#step-4-set-cohort-selection-and-extraction-criteria). Works with files created with or without LEVEL2 grouping and with restricted cohort population sizes, other optional parameter values, and should work with many customized versions of the pipeline.
You can create or obtain a MIMIC-Extract dataset in several ways:
The default cohort dataset is [available on GCP](https://console.cloud.google.com/storage/browser/mimic_extract) (requires PhysioNet access provisioned in GCP).
Follow the [step-by-step instructions](https://github.com/MLforHealth/MIMIC_Extract#step-by-step-instructions) on the MIMIC_Extract github site, which includes setting up a PostgreSQL database and loading the MIMIC-III data files.
Use the instructions at [MIMICExtractEasy](https://github.com/SphtKr/MIMICExtractEasy) which uses DuckDB instead and should be a good bit simpler.
Any of these methods will provide you with a set of HDF5 files containing a cleaned subset of the MIMIC-III dataset. This class can be used to read that dataset (mainly the all_hourly_data.h5 file). Consult the MIMIC-Extract documentation for all the options available for dataset generation (cohort selection, aggregation level, etc.).
- Parameters:
root (str) – root directory of the raw data (should contain one or more HDF5 files).
tables (List[str]) – list of tables to be loaded (e.g., [“vitals_labs”, “interventions”]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is one of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is an empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
pop_size (Optional[int]) – If your MIMIC-Extract dataset was created with a pop_size parameter, include it here. This is used to find the correct filenames.
itemid_to_variable_map (Optional[str]) – Path to the CSV file used for aggregation mapping during your dataset’s creation. Probably the one located in the MIMIC-Extract repo at resources/itemid_to_variable_map.csv, or your own version if you have customized it.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import MIMICExtractDataset
>>> dataset = MIMICExtractDataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "NOTES"],  # TODO: What here?
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper function which parses patients dataset (within all_hourly_data.h5)
Will be called in self.parse_tables()
- Docs:
- parse_diagnoses_icd(patients)[source]#
- Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5) in
a way compatible with MIMIC3Dataset.
Will be called in self.parse_tables()
- Docs:
DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-III does not provide specific timestamps in DIAGNOSES_ICD
table, so we set it to None.
- parse_c(patients)[source]#
Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5).
Will be called in self.parse_tables()
- Docs:
DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-III does not provide specific timestamps in DIAGNOSES_ICD
table, so we set it to None.
- parse_labevents(patients)[source]#
Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.
Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to LABEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done by MIMIC-Extract.
See also self.parse_vitals_labs()
Will be called in self.parse_tables()
- Docs:
- parse_chartevents(patients)[source]#
Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.
Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to CHARTEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done in MIMIC-Extract.
Will be called in self.parse_tables()
- Docs:
CHARTEVENTS: https://mimic.mit.edu/docs/iii/tables/chartevents/
- parse_vitals_labs(patients)[source]#
Helper function which parses the vitals_labs dataset (within all_hourly_data.h5).
Events are added using the MIMIC3_ITEMID vocabulary, and the mapping is determined by the CSV file passed to the constructor in itemid_to_variable_map. Since MIMIC-Extract aggregates like events, only a single MIMIC-III ITEMID will be used to represent all like items in the MIMIC-Extract dataset, so the data here will likely not match raw MIMIC-III data. Which ITEMIDs are used depends on the aggregation level in your dataset (i.e., whether you used --no_group_by_level2).
Will be called in self.parse_tables()
See also self.parse_chartevents() and self.parse_labevents()
- Docs:
- parse_interventions(patients)[source]#
Helper function which parses the interventions dataset (within all_hourly_data.h5). Events are added using the MIMIC3_ITEMID vocabulary, using a manually derived mapping corresponding to general items descriptive of the intervention. Since the raw MIMIC-III data had multiple codes, and MIMIC-Extract aggregates like items, these will not match raw MIMIC-III data.
In particular, note that ITEMID 41491 (“fluid bolus”) is used for crystalloid_bolus and ITEMID 46729 (“Dextran”) is used for colloid_bolus because there is no existing general ITEMID for colloid boluses.
Will be called in self.parse_tables()
- Docs:
pyhealth.datasets.MIMIC4Dataset#
The open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMIC4Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseEHRDataset
Base dataset for MIMIC-IV dataset.
The MIMIC-IV dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.
- The basic information is stored in the following tables:
patients: defines a patient in the database, subject_id.
admissions: defines a patient’s hospital admission, hadm_id.
- We further support the following tables:
- diagnoses_icd: contains ICD diagnoses (ICD9CM and ICD10CM code)
for patients.
- procedures_icd: contains ICD procedures (ICD9PROC and ICD10PROC
code) for patients.
- prescriptions: contains medication related order entries (NDC code)
for patients.
- labevents: contains laboratory measurements (MIMIC4_ITEMID code)
for patients
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is one of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is an empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> dataset = MIMIC4Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...     tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper functions which parses patients and admissions tables.
Will be called in self.parse_tables()
- Docs:
patients: https://mimic.mit.edu/docs/iv/modules/hosp/patients/
admissions: https://mimic.mit.edu/docs/iv/modules/hosp/admissions/
- parse_diagnoses_icd(patients)[source]#
Helper function which parses diagnosis_icd table.
Will be called in self.parse_tables()
- Docs:
diagnosis_icd: https://mimic.mit.edu/docs/iv/modules/hosp/diagnoses_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-IV does not provide specific timestamps in diagnoses_icd
table, so we set it to None.
- parse_procedures_icd(patients)[source]#
Helper function which parses procedures_icd table.
Will be called in self.parse_tables()
- Docs:
procedures_icd: https://mimic.mit.edu/docs/iv/modules/hosp/procedures_icd/
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-IV does not provide specific timestamps in procedures_icd
table, so we set it to None.
- parse_prescriptions(patients)[source]#
Helper function which parses prescriptions table.
Will be called in self.parse_tables()
- Docs:
prescriptions: https://mimic.mit.edu/docs/iv/modules/hosp/prescriptions/
- parse_labevents(patients)[source]#
Helper function which parses labevents table.
Will be called in self.parse_tables()
- Docs:
- parse_hcpcsevents(patients)[source]#
Helper function which parses hcpcsevents table.
Will be called in self.parse_tables()
- Docs:
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- MIMIC-IV does not provide specific timestamps in hcpcsevents
table, so we set it to None.
pyhealth.datasets.eICUDataset#
The open eICU Collaborative Research Database; refer to the doc for more information. We process this database into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.eICUDataset(**kwargs)[source]#
Bases:
BaseEHRDataset
Base dataset for eICU dataset.
The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.
- The basic information is stored in the following tables:
- patient: defines a patient (uniquepid), a hospital admission
(patienthealthsystemstayid), and a ICU stay (patientunitstayid) in the database.
hospital: contains information about a hospital (e.g., region).
Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time. Thus, we only know the order of ICU stays within a hospital admission, but not the order of hospital admissions within a patient. As a result, we use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.
- We further support the following tables:
- diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM code)
and diagnosis information (under attr_dict) for patients
- treatment: contains treatment information (eICU_TREATMENTSTRING code)
for patients.
- medication: contains medication related order entries (eICU_DRUGNAME
code) for patients.
- lab: contains laboratory measurements (eICU_LABNAME code)
for patients
- physicalExam: contains all physical exam (eICU_PHYSICALEXAMPATH)
conducted for patients.
- admissionDx: table contains the primary diagnosis for admission to
the ICU per the APACHE scoring criteria. (eICU_ADMITDXPATH)
- Parameters:
dataset_name – name of the dataset.
root – root directory of the raw data (should contain many csv files).
tables – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).
code_mapping –
a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
a str of the target code vocabulary;
a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...     root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "lab", "treatment", "physicalExam", "admissionDx"],
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper functions which parses patient and hospital tables.
Will be called in self.parse_tables().
- Docs:
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
We use Patient object to represent a hospital admission of a patient, and use Visit object to store the ICU stays within that hospital admission.
- parse_diagnosis(patients)[source]#
Helper function which parses diagnosis table.
Will be called in self.parse_tables().
- Docs:
- Parameters:
patients (
Dict
[str
,Patient
]) – a dict of Patient objects indexed by patient_id.- Return type:
- Returns:
The updated patients dict.
Note
- This table contains both ICD9CM and ICD10CM codes in one single
cell. We need to use medcode to distinguish them.
- parse_treatment(patients)[source]#
Helper function which parses treatment table.
Will be called in self.parse_tables().
- Docs:
- parse_medication(patients)[source]#
Helper function which parses medication table.
Will be called in self.parse_tables().
- Docs:
medication: https://eicu-crd.mit.edu/eicutables/medication/
- parse_lab(patients)[source]#
Helper function which parses lab table.
Will be called in self.parse_tables().
- Docs:
- parse_physicalexam(patients)[source]#
Helper function which parses physicalExam table.
Will be called in self.parse_tables().
- Docs:
physicalExam: https://eicu-crd.mit.edu/eicutables/physicalexam/
- parse_admissiondx(patients)[source]#
Helper function which parses admissionDx (admission diagnosis) table.
Will be called in self.parse_tables().
- Docs:
admissionDx: https://eicu-crd.mit.edu/eicutables/admissiondx/
pyhealth.datasets.OMOPDataset#
We can process any OMOP-CDM formatted database; refer to the doc for more information. The raw data is processed into a well-structured dataset object, giving users the best flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.OMOPDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseEHRDataset
Base dataset for OMOP dataset.
The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence.
See: https://www.ohdsi.org/data-standardization/the-common-data-model/.
- The basic information is stored in the following tables:
- person: contains records that uniquely identify each person or patient,
and some demographic information.
- visit_occurrence: contains info for how a patient engages with the
healthcare system for a duration of time.
death: contains info for how and when a patient dies.
- We further support the following tables:
- condition_occurrence.csv: contains the condition information
(CONDITION_CONCEPT_ID code) of patients’ visits.
- procedure_occurrence.csv: contains the procedure information
(PROCEDURE_CONCEPT_ID code) of patients’ visits.
- drug_exposure.csv: contains the drug information (DRUG_CONCEPT_ID code)
of patients’ visits.
- measurement.csv: contains all laboratory measurements
(MEASUREMENT_CONCEPT_ID code) of patients’ visits.
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
tables (List[str]) – list of tables to be loaded (e.g., [“DIAGNOSES_ICD”, “PROCEDURES_ICD”]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is one of two formats:
a str of the target code vocabulary;
a tuple with two elements, where the first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is an empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “mortality prediction”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import OMOPDataset
>>> dataset = OMOPDataset(
...     root="/srv/local/data/zw12/pyhealth/raw_data/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence", "drug_exposure", "measurement"],
... )
>>> dataset.stat()
>>> dataset.info()
- parse_basic_info(patients)[source]#
Helper functions which parses person, visit_occurrence, and death tables.
Will be called in self.parse_tables()
- Docs:
- parse_condition_occurrence(patients)[source]#
Helper function which parses condition_occurrence table.
Will be called in self.parse_tables()
- Docs:
condition_occurrence: http://ohdsi.github.io/CommonDataModel/cdm53.html#CONDITION_OCCURRENCE
- parse_procedure_occurrence(patients)[source]#
Helper function which parses procedure_occurrence table.
Will be called in self.parse_tables()
- Docs:
procedure_occurrence: http://ohdsi.github.io/CommonDataModel/cdm53.html#PROCEDURE_OCCURRENCE
- parse_drug_exposure(patients)[source]#
Helper function which parses drug_exposure table.
Will be called in self.parse_tables()
- Docs:
drug_exposure: http://ohdsi.github.io/CommonDataModel/cdm53.html#DRUG_EXPOSURE
pyhealth.datasets.SleepEDFDataset#
The open Sleep-EDF Database Expanded; refer to the doc for more information.
- class pyhealth.datasets.SleepEDFDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
BaseSignalDataset
Base EEG dataset for SleepEDF
Dataset is available at https://www.physionet.org/content/sleep-edfx/1.0.0/
- For the Sleep Cassette Study portion:
The 153 SC* files (SC = Sleep Cassette) were obtained in a 1987-1991 study of age effects on sleep in healthy Caucasians aged 25-101, without any sleep-related medication [2]. Two PSGs of about 20 hours each were recorded during two subsequent day-night periods at the subjects homes. Subjects continued their normal activities but wore a modified Walkman-like cassette-tape recorder described in chapter VI.4 (page 92) of Bob’s 1987 thesis [7].
Files are named in the form SC4ssNEO-PSG.edf where ss is the subject number, and N is the night. The first nights of subjects 36 and 52, and the second night of subject 13, were lost due to a failing cassette or laserdisk.
The EOG and EEG signals were each sampled at 100 Hz. The submental-EMG signal was electronically highpass filtered, rectified and low-pass filtered after which the resulting EMG envelope expressed in uV rms (root-mean-square) was sampled at 1Hz. Oro-nasal airflow, rectal body temperature and the event marker were also sampled at 1Hz.
Subjects and recordings are further described in the file headers, the descriptive spreadsheet SC-subjects.xls, and in [2].
- For the Sleep Telemetry portion:
The 44 ST* files (ST = Sleep Telemetry) were obtained in a 1994 study of temazepam effects on sleep in 22 Caucasian males and females without other medication. Subjects had mild difficulty falling asleep but were otherwise healthy. The PSGs of about 9 hours were recorded in the hospital during two nights, one of which was after temazepam intake, and the other of which was after placebo intake. Subjects wore a miniature telemetry system with very good signal quality described in [8].
Files are named in the form ST7ssNJ0-PSG.edf where ss is the subject number, and N is the night.
EOG, EMG and EEG signals were sampled at 100 Hz, and the event marker at 1 Hz. The physical marker dimension ID+M-E relates to the fact that pressing the marker (M) button generated two-second deflections from a baseline value that either identifies the telemetry unit (ID = 1 or 2 if positive) or marks an error (E) in the telemetry link if negative. Subjects and recordings are further described in the file headers, the descriptive spreadsheet ST-subjects.xls, and in [1].
- Parameters:
root (str) – root directory of the raw data. You can choose to use the path to the Cassette portion or the Telemetry portion.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “sleep staging”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import SleepEDFDataset >>> dataset = SleepEDFDataset( ... root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/sleep-cassette", ... ) >>> dataset.stat() >>> dataset.info()
pyhealth.datasets.SHHSDataset#
The open Sleep Heart Health Study (SHHS) database; refer to the doc for more information.
- class pyhealth.datasets.SHHSDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
BaseSignalDataset
Base EEG dataset for Sleep Heart Health Study (SHHS)
Dataset is available at https://sleepdata.org/datasets/shhs
The Sleep Heart Health Study (SHHS) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing. It tests whether sleep-related breathing is associated with an increased risk of coronary heart disease, stroke, all cause mortality, and hypertension. In all, 6,441 men and women aged 40 years and older were enrolled between November 1, 1995 and January 31, 1998 to take part in SHHS Visit 1. During exam cycle 3 (January 2001- June 2003), a second polysomnogram (SHHS Visit 2) was obtained in 3,295 of the participants. CVD Outcomes data were monitored and adjudicated by parent cohorts between baseline and 2011. More than 130 manuscripts have been published investigating predictors and outcomes of sleep disorders.
- Parameters:
root (str) – root directory of the raw data (should contain many csv files).
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “sleep staging”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import SHHSDataset >>> dataset = SHHSDataset( ... root="/srv/local/data/SHHS/", ... ) >>> dataset.stat() >>> dataset.info()
pyhealth.datasets.ISRUCDataset#
The open ISRUC EEG database; refer to the doc for more information.
- class pyhealth.datasets.ISRUCDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
BaseSignalDataset
Base EEG dataset for ISRUC Group I.
Dataset is available at https://sleeptight.isr.uc.pt/
The EEG signals are sampled at 200 Hz.
There are 100 subjects in the original dataset.
Each subject’s data is about a night’s sleep.
- Parameters:
dataset_name (Optional[str]) – name of the dataset. Default is ‘ISRUCDataset’.
root (str) – root directory of the raw data. We expect root/raw to contain all extracted files (.txt, .rec, …). You can also download the data to a new directory by using download=True.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
download – whether to download the data automatically. Default is False.
Examples
>>> from pyhealth.datasets import ISRUCDataset >>> dataset = ISRUCDataset( ... root="/srv/local/data/data/ISRUC-I", ... download=True, ... ) >>> dataset.stat() >>> dataset.info()
pyhealth.datasets.CardiologyDataset#
The Cardiology dataset includes six portions “cpsc_2018”, “cpsc_2018_extra”, “georgia”, “ptb”, “ptb-xl”, “st_petersburg_incart”, refer to doc for more information.
- class pyhealth.datasets.CardiologyDataset(root, chosen_dataset=[1, 1, 1, 1, 1, 1], dataset_name=None, dev=False, refresh_cache=False)[source]#
Bases:
BaseSignalDataset
Base ECG dataset for Cardiology
Dataset is available at https://physionet.org/content/challenge-2020/1.0.2/
- Parameters:
root (str) – root directory of the raw data.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
chosen_dataset (List[int]) – a list of (0,1) of length 6 indicating which datasets will be used. Default: [1, 1, 1, 1, 1, 1]. The datasets contain “cpsc_2018”, “cpsc_2018_extra”, “georgia”, “ptb”, “ptb-xl”, “st_petersburg_incart”. E.g., [0,1,1,1,1,1] indicates that “cpsc_2018_extra”, “georgia”, “ptb”, “ptb-xl” and “st_petersburg_incart” will be used (a short usage sketch follows the example below).
- task#
Optional[str], name of the task (e.g., “sleep staging”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> dataset = CardiologyDataset( ... root="/srv/local/data/physionet.org/files/challenge-2020/1.0.2/training", ... ) >>> dataset.stat() >>> dataset.info()
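For instance, a minimal sketch of loading only a subset of the six portions via chosen_dataset (the root path below is a placeholder for a local copy of the challenge data):

from pyhealth.datasets import CardiologyDataset

cardiology_base = CardiologyDataset(
    root="physionet.org/files/challenge-2020/1.0.2/training",  # placeholder path
    chosen_dataset=[0, 1, 1, 1, 1, 1],  # skip "cpsc_2018", keep the other five portions
)
cardiology_base.stat()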
pyhealth.datasets.TUABDataset#
Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml
The TUAB dataset (or Temple University Hospital EEG Abnormal Corpus) is a collection of EEG data acquired at the Temple University Hospital.
The dataset contains both normal and abnormal EEG readings.
- class pyhealth.datasets.TUABDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
BaseSignalDataset
Base EEG dataset for the TUH Abnormal EEG Corpus
Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml
The TUAB dataset (or Temple University Hospital EEG Abnormal Corpus) is a collection of EEG data acquired at the Temple University Hospital.
The dataset contains both normal and abnormal EEG readings.
Files are named in the form aaaaamye_s001_t000.edf. This includes the subject identifier (“aaaaamye”), the session number (“s001”) and a token number (“t000”). EEGs are split into a series of files starting with *t000.edf, *t001.edf, …
- Parameters:
root (str) – root directory of the raw data.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “EEG_abnormal”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import TUABDataset >>> dataset = TUABDataset( ... root="/srv/local/data/TUH/tuh_eeg_abnormal/v3.0.0/edf/", ... ) >>> dataset.stat() >>> dataset.info()
pyhealth.datasets.TUEVDataset#
Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml
This corpus is a subset of TUEG that contains annotations of EEG segments as one of six classes: (1) spike and sharp wave (SPSW), (2) generalized periodic epileptiform discharges (GPED), (3) periodic lateralized epileptiform discharges (PLED), (4) eye movement (EYEM), (5) artifact (ARTF) and (6) background (BCKG).
- class pyhealth.datasets.TUEVDataset(root, dataset_name=None, dev=False, refresh_cache=False, **kwargs)[source]#
Bases:
BaseSignalDataset
Base EEG dataset for the TUH EEG Events Corpus
Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml
This corpus is a subset of TUEG that contains annotations of EEG segments as one of six classes: (1) spike and sharp wave (SPSW), (2) generalized periodic epileptiform discharges (GPED), (3) periodic lateralized epileptiform discharges (PLED), (4) eye movement (EYEM), (5) artifact (ARTF) and (6) background (BCKG).
- Files are named in the form of bckg_032_a_.edf in the eval partition:
- bckg: this file contains background annotations.
- 032: a reference to the eval index.
- a_.edf: EEG files are split into a series of files starting with a_.edf, a_1.edf, … These represent pruned EEGs, so the original EEG is split into these segments, and uninteresting parts of the original recording were deleted.
- or in the form of 00002275_00000001.edf in the train partition:
- 00002275: a reference to the train index.
- 00000001: indicating that this is the first file associated with this patient.
- Parameters:
root (str) – root directory of the raw data.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
- task#
Optional[str], name of the task (e.g., “EEG_events”). Default is None.
- samples#
Optional[List[Dict]], a list of samples, each sample is a dict with patient_id, record_id, and other task-specific attributes as key. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
Examples
>>> from pyhealth.datasets import TUEVDataset >>> dataset = TUEVDataset( ... root="/srv/local/data/TUH/tuh_eeg_events/v2.0.0/edf/", ... ) >>> dataset.stat() >>> dataset.info()
pyhealth.datasets.splitter#
Several data splitting functions for the pyhealth.datasets module to obtain training / validation / test sets.
- pyhealth.datasets.splitter.split_by_visit(dataset, ratios, seed=None)[source]#
Splits the dataset by visit (i.e., samples).
- Parameters:
- Returns:
- three subsets of the dataset of
type torch.utils.data.Subset.
- Return type:
train_dataset, val_dataset, test_dataset
Note
- The original dataset can be accessed by train_dataset.dataset,
val_dataset.dataset, and test_dataset.dataset.
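A minimal usage sketch (sample_dataset is assumed to be a sample dataset produced by an earlier set_task call):

from pyhealth.datasets import split_by_visit

train_ds, val_ds, test_ds = split_by_visit(sample_dataset, [0.8, 0.1, 0.1], seed=42)
# each split is a torch.utils.data.Subset; the full sample dataset stays reachable
full_dataset = train_ds.dataset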
- pyhealth.datasets.splitter.split_by_patient(dataset, ratios, seed=None)[source]#
Splits the dataset by patient.
- Parameters:
- Returns:
- three subsets of the dataset of
type torch.utils.data.Subset.
- Return type:
train_dataset, val_dataset, test_dataset
Note
- The original dataset can be accessed by train_dataset.dataset,
val_dataset.dataset, and test_dataset.dataset.
- pyhealth.datasets.splitter.split_by_sample(dataset, ratios, seed=None, get_index=False)[source]#
Splits the dataset by sample.
- Parameters:
- Returns:
- three subsets of the dataset of
type torch.utils.data.Subset.
- Return type:
train_dataset, val_dataset, test_dataset
Note
- The original dataset can be accessed by train_dataset.dataset,
val_dataset.dataset, and test_dataset.dataset.
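A sketch of the same call with get_index=True, which (based on the parameter name) is assumed to return the split indices rather than Subset objects, so a split can be saved and reproduced later:

from pyhealth.datasets import split_by_sample

# sample_dataset is assumed to come from an earlier set_task call
train_idx, val_idx, test_idx = split_by_sample(
    sample_dataset, [0.8, 0.1, 0.1], seed=42, get_index=True
)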
pyhealth.datasets.utils#
Several utility functions.
- pyhealth.datasets.utils.strptime(s)[source]#
Helper function which parses a string into a datetime object.
- pyhealth.datasets.utils.padyear(year, month='1', day='1')[source]#
Pads a datetime year of format ‘YYYY’ to format ‘YYYY-MM-DD’.
- Parameters:
year (str) – str, year to be padded. Must be a non-zero value.
month – str, month string to be used as padding. Must be in [1, 12].
day – str, day string to be used as padding. Must be in [1, 31].
- Returns:
str, padded year.
- Return type:
padded_date
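An illustrative call (the exact zero-padding of the output is an assumption based on the description above):

from pyhealth.datasets.utils import padyear

padded = padyear("2012")  # with the default month="1", day="1", expected to yield a "YYYY-MM-DD"-style string such as "2012-1-1"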
- pyhealth.datasets.utils.flatten_list(l)[source]#
Flattens a list of lists.
- Parameters:
l (List) – List, the list of lists to be flattened.
- Return type:
- Returns:
List, the flattened list.
Examples
>>> flatten_list([[1], [2, 3], [4]])
[1, 2, 3, 4]
>>> flatten_list([[1], [[2], 3], [4]])
[1, [2], 3, 4]
- pyhealth.datasets.utils.list_nested_levels(l)[source]#
Gets all the different nested levels of a list.
- Parameters:
l (List) – the list to be checked.
- Return type:
- Returns:
All the different nested levels of the list.
Examples
>>> list_nested_levels([])
(1,)
>>> list_nested_levels([1, 2, 3])
(1,)
>>> list_nested_levels([[]])
(2,)
>>> list_nested_levels([[1, 2, 3], [4, 5, 6]])
(2,)
>>> list_nested_levels([1, [2, 3], 4])
(1, 2)
>>> list_nested_levels([[1, [2, 3], 4]])
(2, 3)
- pyhealth.datasets.utils.is_homo_list(l)[source]#
Checks if a list is homogeneous.
- Parameters:
l (List) – the list to be checked.
- Return type:
- Returns:
bool, True if the list is homogeneous, False otherwise.
Examples
>>> is_homo_list([1, 2, 3])
True
>>> is_homo_list([])
True
>>> is_homo_list([1, 2, "3"])
False
>>> is_homo_list([1, 2, 3, [4, 5, 6]])
False
Tasks#
We support various real-world healthcare predictive tasks defined by function calls. The following example tasks are collected from top AI/Medical venues, such as:
Drug Recommendation [Yang et al. IJCAI 2021a, Yang et al. IJCAI 2021b, Shang et al. AAAI 2020]
Readmission Prediction [Choi et al. AAAI 2021]
Mortality Prediction [Choi et al. AAAI 2021]
Length of Stay Prediction
Sleep Staging [Yang et al. ArXiv 2021]
pyhealth.tasks.drug_recommendation#
- pyhealth.tasks.drug_recommendation.drug_recommendation_mimic3_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key, like this:
{
“patient_id”: xxx,
“visit_id”: xxx,
“conditions”: [list of diag in visit 1, list of diag in visit 2, …, list of diag in visit N],
“procedures”: [list of prod in visit 1, list of prod in visit 2, …, list of prod in visit N],
“drugs_hist”: [list of drug in visit 1, list of drug in visit 2, …, list of drug in visit (N-1)],
“drugs”: list of drug in visit N, # this is the predicted target
}
- Return type:
samples
Examples
>>> from pyhealth.datasets import MIMIC3Dataset >>> mimic3_base = MIMIC3Dataset( ... root="/srv/local/data/physionet.org/files/mimiciii/1.4", ... tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"], ... code_mapping={"ICD9CM": "CCSCM"}, ... ) >>> from pyhealth.tasks import drug_recommendation_mimic3_fn >>> mimic3_sample = mimic3_base.set_task(drug_recommendation_mimic3_fn) >>> mimic3_sample.samples[0] { 'visit_id': '174162', 'patient_id': '107', 'conditions': [['139', '158', '237', '99', '60', '101', '51', '54', '53', '133', '143', '140', '117', '138', '55']], 'procedures': [['4443', '4513', '3995']], 'drugs_hist': [[]], 'drugs': ['0000', '0033', '5817', '0057', '0090', '0053', '0', '0012', '6332', '1001', '6155', '1001', '6332', '0033', '5539', '6332', '5967', '0033', '0040', '5967', '5846', '0016', '5846', '5107', '5551', '6808', '5107', '0090', '5107', '5416', '0033', '1150', '0005', '6365', '0090', '6155', '0005', '0090', '0000', '6373'], }
- pyhealth.tasks.drug_recommendation.drug_recommendation_mimic4_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id, and other task-specific attributes as key, like this:
{
“patient_id”: xxx,
“visit_id”: xxx,
“conditions”: [list of diag in visit 1, list of diag in visit 2, …, list of diag in visit N],
“procedures”: [list of prod in visit 1, list of prod in visit 2, …, list of prod in visit N],
“drugs_hist”: [list of drug in visit 1, list of drug in visit 2, …, list of drug in visit (N-1)],
“drugs”: list of drug in visit N, # this is the predicted target
}
- Return type:
samples
Examples
>>> from pyhealth.datasets import MIMIC4Dataset >>> mimic4_base = MIMIC4Dataset( ... root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp", ... tables=["diagnoses_icd", "procedures_icd"], ... code_mapping={"ICD10PROC": "CCSPROC"}, ... ) >>> from pyhealth.tasks import drug_recommendation_mimic4_fn >>> mimic4_sample = mimic4_base.set_task(drug_recommendation_mimic4_fn) >>> mimic4_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
- pyhealth.tasks.drug_recommendation.drug_recommendation_eicu_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "medication"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import drug_recommendation_eicu_fn >>> eicu_sample = eicu_base.set_task(drug_recommendation_eicu_fn) >>> eicu_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': [['2', '3', '4']]}]
- pyhealth.tasks.drug_recommendation.drug_recommendation_omop_fn(patient)[source]#
Processes a single patient for the drug recommendation task.
Drug recommendation aims at recommending a set of drugs given the patient health history (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Examples
>>> from pyhealth.datasets import OMOPDataset >>> omop_base = OMOPDataset( ... root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2", ... tables=["condition_occurrence", "procedure_occurrence"], ... code_mapping={}, ... ) >>> from pyhealth.tasks import drug_recommendation_omop_fn >>> omop_sample = omop_base.set_task(drug_recommendation_omop_fn) >>> omop_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51'], ['98', '663', '58', '51']], 'procedures': [['1'], ['2', '3']], 'label': [['2', '3', '4'], ['0', '1', '4', '5']]}]
pyhealth.tasks.readmission_prediction#
- pyhealth.tasks.readmission_prediction.readmission_prediction_mimic3_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
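The labeling rule can be summarized by a small sketch (illustrative only, assuming the gap is measured from the current visit's discharge to the next visit's admission; this is not the library's exact implementation):

def readmission_label(curr_discharge_time, next_encounter_time, time_window=15):
    # label = 1 if the next admission starts within time_window days of discharge
    gap_days = (next_encounter_time - curr_discharge_time).days
    return 1 if gap_days < time_window else 0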
Examples
>>> from pyhealth.datasets import MIMIC3Dataset >>> mimic3_base = MIMIC3Dataset( ... root="/srv/local/data/physionet.org/files/mimiciii/1.4", ... tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"], ... code_mapping={"ICD9CM": "CCSCM"}, ... ) >>> from pyhealth.tasks import readmission_prediction_mimic3_fn >>> mimic3_sample = mimic3_base.set_task(readmission_prediction_mimic3_fn) >>> mimic3_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_mimic4_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset >>> mimic4_base = MIMIC4Dataset( ... root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp", ... tables=["diagnoses_icd", "procedures_icd"], ... code_mapping={"ICD10PROC": "CCSPROC"}, ... ) >>> from pyhealth.tasks import readmission_prediction_mimic4_fn >>> mimic4_sample = mimic4_base.set_task(readmission_prediction_mimic4_fn) >>> mimic4_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn(patient, time_window=5)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
Features key-value pairs:
- using diagnosis table (ICD9CM and ICD10CM) as condition codes
- using physicalExam table as procedure codes
- using medication table as drugs codes
- Parameters:
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "medication", "physicalExam"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import readmission_prediction_eicu_fn >>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn) >>> eicu_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_eicu_fn2(patient, time_window=5)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
Similar to readmission_prediction_eicu_fn, but with different code mapping:
- using admissionDx table and diagnosisString under diagnosis table as condition codes
- using treatment table as procedure codes
- Parameters:
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "treatment", "admissionDx"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import readmission_prediction_eicu_fn2 >>> eicu_sample = eicu_base.set_task(readmission_prediction_eicu_fn2) >>> eicu_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.readmission_prediction.readmission_prediction_omop_fn(patient, time_window=15)[source]#
Processes a single patient for the readmission prediction task.
Readmission prediction aims at predicting whether the patient will be readmitted into hospital within time_window days based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
time_window – the time window threshold (gap < time_window means label=1 for the task)
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset >>> omop_base = OMOPDataset( ... root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2", ... tables=["condition_occurrence", "procedure_occurrence"], ... code_mapping={}, ... ) >>> from pyhealth.tasks import readmission_prediction_omop_fn >>> omop_sample = omop_base.set_task(readmission_prediction_omop_fn) >>> omop_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.mortality_prediction#
- pyhealth.tasks.mortality_prediction.mortality_prediction_mimic3_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id,
visit_id, and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset >>> mimic3_base = MIMIC3Dataset( ... root="/srv/local/data/physionet.org/files/mimiciii/1.4", ... tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"], ... code_mapping={"ICD9CM": "CCSCM"}, ... ) >>> from pyhealth.tasks import mortality_prediction_mimic3_fn >>> mimic3_sample = mimic3_base.set_task(mortality_prediction_mimic3_fn) >>> mimic3_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_mimic4_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id,
visit_id, and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset >>> mimic4_base = MIMIC4Dataset( ... root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp", ... tables=["diagnoses_icd", "procedures_icd"], ... code_mapping={"ICD10PROC": "CCSPROC"}, ... ) >>> from pyhealth.tasks import mortality_prediction_mimic4_fn >>> mimic4_sample = mimic4_base.set_task(mortality_prediction_mimic4_fn) >>> mimic4_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
Features key-value pairs:
- using diagnosis table (ICD9CM and ICD10CM) as condition codes
- using physicalExam table as procedure codes
- using medication table as drugs codes
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id,
visit_id, and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "medication", "physicalExam"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import mortality_prediction_eicu_fn >>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn) >>> eicu_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}]
- pyhealth.tasks.mortality_prediction.mortality_prediction_eicu_fn2(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
Similar to mortality_prediction_eicu_fn, but with different code mapping:
- using admissionDx table and diagnosisString under diagnosis table as condition codes
- using treatment table as procedure codes
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id,
visit_id, and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "admissionDx", "treatment"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import mortality_prediction_eicu_fn2 >>> eicu_sample = eicu_base.set_task(mortality_prediction_eicu_fn2) >>> eicu_sample.samples[0] {'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 0}
- pyhealth.tasks.mortality_prediction.mortality_prediction_omop_fn(patient)[source]#
Processes a single patient for the mortality prediction task.
Mortality prediction aims at predicting whether the patient will decease in the next hospital visit based on the clinical information from current visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object
- Returns:
- a list of samples, each sample is a dict with patient_id,
visit_id, and other task-specific attributes as key
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset >>> omop_base = OMOPDataset( ... root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2", ... tables=["condition_occurrence", "procedure_occurrence"], ... code_mapping={}, ... ) >>> from pyhealth.tasks import mortality_prediction_omop_fn >>> omop_sample = omop_base.set_task(mortality_prediction_omop_fn) >>> omop_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 1}]
pyhealth.tasks.length_of_stay_prediction#
- pyhealth.tasks.length_of_stay_prediction.categorize_los(days)[source]#
Categorizes length of stay into 10 categories.
One for ICU stays shorter than a day, seven day-long categories for each day of the first week, one for stays of over one week but less than two, and one for stays of over two weeks.
- Parameters:
days (int) – int, length of stay in days
- Returns:
int, category of length of stay
- Return type:
category
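A sketch of the categorization described above (an illustrative re-implementation, not the library source; the exact boundary handling is an assumption):

def categorize_los_sketch(days: int) -> int:
    # 0: shorter than a day; 1-7: each day of the first week;
    # 8: over one week but less than two; 9: two weeks or more
    if days < 1:
        return 0
    elif days <= 7:
        return days
    elif days < 14:
        return 8
    else:
        return 9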
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic3_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object.
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import MIMIC3Dataset >>> mimic3_base = MIMIC3Dataset( ... root="/srv/local/data/physionet.org/files/mimiciii/1.4", ... tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"], ... code_mapping={"ICD9CM": "CCSCM"}, ... ) >>> from pyhealth.tasks import length_of_stay_prediction_mimic3_fn >>> mimic3_sample = mimic3_base.set_task(length_of_stay_prediction_mimic3_fn) >>> mimic3_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 4}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_mimic4_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object.
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import MIMIC4Dataset >>> mimic4_base = MIMIC4Dataset( ... root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp", ... tables=["diagnoses_icd", "procedures_icd"], ... code_mapping={"ICD10PROC": "CCSPROC"}, ... ) >>> from pyhealth.tasks import length_of_stay_prediction_mimic4_fn >>> mimic4_sample = mimic4_base.set_task(length_of_stay_prediction_mimic4_fn) >>> mimic4_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '19', '122', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 2}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_eicu_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object.
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import eICUDataset >>> eicu_base = eICUDataset( ... root="/srv/local/data/physionet.org/files/eicu-crd/2.0", ... tables=["diagnosis", "medication"], ... code_mapping={}, ... dev=True ... ) >>> from pyhealth.tasks import length_of_stay_prediction_eicu_fn >>> eicu_sample = eicu_base.set_task(length_of_stay_prediction_eicu_fn) >>> eicu_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 5}]
- pyhealth.tasks.length_of_stay_prediction.length_of_stay_prediction_omop_fn(patient)[source]#
Processes a single patient for the length-of-stay prediction task.
Length of stay prediction aims at predicting the length of stay (in days) of the current hospital visit based on the clinical information from the visit (e.g., conditions and procedures).
- Parameters:
patient (Patient) – a Patient object.
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id,
and other task-specific attributes as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import OMOPDataset >>> omop_base = OMOPDataset( ... root="https://storage.googleapis.com/pyhealth/synpuf1k_omop_cdm_5.2.2", ... tables=["condition_occurrence", "procedure_occurrence"], ... code_mapping={}, ... ) >>> from pyhealth.tasks import length_of_stay_prediction_omop_fn >>> omop_sample = omop_base.set_task(length_of_stay_prediction_omop_fn) >>> omop_sample.samples[0] [{'visit_id': '130744', 'patient_id': '103', 'conditions': [['42', '109', '98', '663', '58', '51']], 'procedures': [['1']], 'label': 7}]
pyhealth.tasks.sleep_staging#
- pyhealth.tasks.sleep_staging.sleep_staging_isruc_fn(record, epoch_seconds=10, label_id=1)[source]#
Processes a single patient for the sleep staging task on ISRUC.
Sleep staging aims at predicting the sleep stages (Awake, N1, N2, N3, REM) based on the multichannel EEG signals. The task is defined as a multi-class classification.
- Parameters:
record –
a singleton list of one subject from the ISRUCDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_seconds – how long will each epoch be (in seconds). It has to be a factor of 30 because the original data was labeled every 30 seconds.
label_id – which set of labels to use. ISRUC is labeled by two experts. By default we use the first set of labels (label_id=1).
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import ISRUCDataset >>> isruc = ISRUCDataset( ... root="/srv/local/data/data/ISRUC-I", download=True, ... ) >>> from pyhealth.tasks import sleep_staging_isruc_fn >>> sleepstage_ds = isruc.set_task(sleep_staging_isruc_fn) >>> sleepstage_ds.samples[0] { 'record_id': '1-0', 'patient_id': '1', 'epoch_path': '/home/zhenlin4/.cache/pyhealth/datasets/832afe6e6e8a5c9ea5505b47e7af8125/10-1/1/0.pkl', 'label': 'W' }
- pyhealth.tasks.sleep_staging.sleep_staging_sleepedf_fn(record, epoch_seconds=30)[source]#
Processes a single patient for the sleep staging task on Sleep EDF.
Sleep staging aims at predicting the sleep stages (Awake, REM, N1, N2, N3, N4) based on the multichannel EEG signals. The task is defined as a multi-class classification.
- Parameters:
patient – a list of (load_from_path, signal_file, label_file, save_to_path) tuples, where PSG is the signal file and the labels are in the label file
epoch_seconds – how long will each epoch be (in seconds)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import SleepEDFDataset >>> sleepedf = SleepEDFDataset( ... root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/sleep-cassette", ... ) >>> from pyhealth.tasks import sleep_staging_sleepedf_fn >>> sleepstage_ds = sleepedf.set_task(sleep_staging_sleepedf_fn) >>> sleepstage_ds.samples[0] { 'record_id': 'SC4001-0', 'patient_id': 'SC4001', 'epoch_path': '/home/chaoqiy2/.cache/pyhealth/datasets/70d6dbb28bd81bab27ae2f271b2cbb0f/SC4001-0.pkl', 'label': 'W' }
- pyhealth.tasks.sleep_staging.sleep_staging_shhs_fn(record, epoch_seconds=30)[source]#
Processes a single recording for the sleep staging task on SHHS.
Sleep staging aims at predicting the sleep stages (Awake, REM, N1, N2, N3) based on the multichannel EEG signals. The task is defined as a multi-class classification.
- Parameters:
patient – a list of (load_from_path, signal file, label file, save_to_path) tuples, where the signal is in the edf file and the labels are in the label file
epoch_seconds – how long will each epoch be (in seconds); 30 seconds as default, given by the label file
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a multi-class classification task.
Examples
>>> from pyhealth.datasets import SHHSDataset >>> shhs = SHHSDataset( ... root="/srv/local/data/SHHS/polysomnography", ... dev=True, ... ) >>> from pyhealth.tasks import sleep_staging_shhs_fn >>> shhs_ds = shhs.set_task(sleep_staging_shhs_fn) >>> shhs_ds.samples[0] { 'record_id': 'shhs1-200001-0', 'patient_id': 'shhs1-200001', 'epoch_path': '/home/chaoqiy2/.cache/pyhealth/datasets/76c1ce8195a2e1a654e061cb5df4671a/shhs1-200001-0.pkl', 'label': '0' }
pyhealth.tasks.cardiology_detect#
- pyhealth.tasks.cardiology_detect.cardiology_isAR_fn(record, epoch_sec=10, shift=5)[source]#
Processes a single patient for the Arrhythmias symptom in cardiology on the CardiologyDataset
Cardiology symptoms can be divided into six categories. The task focuses on Arrhythmias and is defined as a binary classification.
- Parameters:
record –
a singleton list of one subject from the CardiologyDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_sec – how long will each epoch be (in seconds).
shift – the step size for the sampling window (with a width of epoch_sec)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Sex”: gender, “Age”: age, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> isAR = CardiologyDataset( ... root="physionet.org/files/challenge-2020/1.0.2/training", chosen_dataset=[1,1,1,1,1,1], ... ) >>> from pyhealth.tasks import cardiology_isAR_fn >>> cardiology_ds = isAR.set_task(cardiology_isAR_fn) >>> cardiology_ds.samples[0] { 'patient_id': '0_0', 'visit_id': 'A0033', 'record_id': 1, 'Sex': ['Female'], 'Age': ['34'], 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/46c18f2a1a18803b4707a934a577a331/0_0-0.pkl', 'label': '0' }
- pyhealth.tasks.cardiology_detect.cardiology_isBBBFB_fn(record, epoch_sec=10, shift=5)[source]#
Processes a single patient for the Bundle branch blocks and fascicular blocks symptom in cardiology on the CardiologyDataset
Cardiology symptoms can be divided into six categories. The task focuses on Bundle branch blocks and fascicular blocks and is defined as a binary classification.
- Parameters:
record –
a singleton list of one subject from the CardiologyDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_sec – how long will each epoch be (in seconds).
shift – the step size for the sampling window (with a width of epoch_sec)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Sex”: gender, “Age”: age, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> isBBBFB = CardiologyDataset( ... root="physionet.org/files/challenge-2020/1.0.2/training", chosen_dataset=[1,1,1,1,1,1], ... ) >>> from pyhealth.tasks import cardiology_isBBBFB_fn >>> cardiology_ds = isBBBFB.set_task(cardiology_isBBBFB_fn) >>> cardiology_ds.samples[0] { 'patient_id': '0_0', 'visit_id': 'A0033', 'record_id': 1, 'Sex': ['Female'], 'Age': ['34'], 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/46c18f2a1a18803b4707a934a577a331/0_0-0.pkl', 'label': '0' }
- pyhealth.tasks.cardiology_detect.cardiology_isAD_fn(record, epoch_sec=10, shift=5)[source]#
Processes a single patient for the Axis deviations symptom in cardiology on the CardiologyDataset
Cardiology symptoms can be divided into six categories. The task focuses on Axis deviations and is defined as a binary classification.
- Parameters:
record –
a singleton list of one subject from the CardiologyDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_sec – how long will each epoch be (in seconds).
shift – the step size for the sampling window (with a width of epoch_sec)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Sex”: gender, “Age”: age, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> isAD = CardiologyDataset( ... root="physionet.org/files/challenge-2020/1.0.2/training", chosen_dataset=[1,1,1,1,1,1], ... ) >>> from pyhealth.tasks import cardiology_isAD_fn >>> cardiology_ds = isAD.set_task(cardiology_isAD_fn) >>> cardiology_ds.samples[0] { 'patient_id': '0_0', 'visit_id': 'A0033', 'record_id': 1, 'Sex': ['Female'], 'Age': ['34'], 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/46c18f2a1a18803b4707a934a577a331/0_0-0.pkl', 'label': '0' }
- pyhealth.tasks.cardiology_detect.cardiology_isCD_fn(record, epoch_sec=10, shift=5)[source]#
Processes a single patient for the Conduction delays symptom in cardiology on the CardiologyDataset
Cardiology symptoms can be divided into six categories. The task focuses on Conduction delays and is defined as a binary classification.
- Parameters:
record –
a singleton list of one subject from the CardiologyDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_sec – how long will each epoch be (in seconds).
shift – the step size for the sampling window (with a width of epoch_sec)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Sex”: gender, “Age”: age, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> isCD = CardiologyDataset( ... root="physionet.org/files/challenge-2020/1.0.2/training", chosen_dataset=[1,1,1,1,1,1], ... ) >>> from pyhealth.tasks import cardiology_isCD_fn >>> cardiology_ds = isCD.set_task(cardiology_isCD_fn) >>> cardiology_ds.samples[0] { 'patient_id': '0_0', 'visit_id': 'A0033', 'record_id': 1, 'Sex': ['Female'], 'Age': ['34'], 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/46c18f2a1a18803b4707a934a577a331/0_0-0.pkl', 'label': '0' }
- pyhealth.tasks.cardiology_detect.cardiology_isWA_fn(record, epoch_sec=10, shift=5)[source]#
Processes a single patient for the Wave abnormalities symptom in cardiology on the CardiologyDataset
Cardiology symptoms can be divided into six categories. The task focuses on Wave abnormalities and is defined as a binary classification.
- Parameters:
record –
a singleton list of one subject from the CardiologyDataset. The (single) record is a dictionary with the following keys:
load_from_path, signal_file, label1_file, label2_file, save_to_path, subject_id
epoch_sec – how long will each epoch be (in seconds).
shift – the step size for the sampling window (with a width of epoch_sec)
- Returns:
- a list of samples, each sample is a dict with patient_id, record_id,
and epoch_path (the path to the saved epoch {“X”: signal, “Sex”: gender, “Age”: age, “Y”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import CardiologyDataset >>> isWA = CardiologyDataset( ... root="physionet.org/files/challenge-2020/1.0.2/training", chosen_dataset=[1,1,1,1,1,1], ... ) >>> from pyhealth.tasks import cardiology_isWA_fn >>> cardiology_ds = isWA.set_task(cardiology_isWA_fn) >>> cardiology_ds.samples[0] { 'patient_id': '0_0', 'visit_id': 'A0033', 'record_id': 1, 'Sex': ['Female'], 'Age': ['34'], 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/46c18f2a1a18803b4707a934a577a331/0_0-0.pkl', 'label': '0' }
pyhealth.tasks.temple_university_EEG_tasks#
- pyhealth.tasks.temple_university_EEG_tasks.EEG_isAbnormal_fn(record)[source]#
Processes a single patient for the abnormal EEG detection task on TUAB.
Abnormal EEG detection aims at determining whether an EEG is abnormal.
- Parameters:
record –
a singleton list of one subject from the TUABDataset. The (single) record is a dictionary with the following keys:
load_from_path, patient_id, visit_id, signal_file, label_file, save_to_path
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id, record_id,
and epoch_path (the path to the saved epoch {“signal”: signal, “label”: label}) as key.
- Return type:
samples
Note that we define the task as a binary classification task.
Examples
>>> from pyhealth.datasets import TUABDataset >>> isabnormal = TUABDataset( ... root="/srv/local/data/TUH/tuh_eeg_abnormal/v3.0.0/edf/", download=True, ... ) >>> from pyhealth.tasks import EEG_isAbnormal_fn >>> EEG_abnormal_ds = isabnormal.set_task(EEG_isAbnormal_fn) >>> EEG_abnormal_ds.samples[0] { 'patient_id': 'aaaaamye', 'visit_id': 's001', 'record_id': '1', 'epoch_path': '/home/zhenlin4/.cache/pyhealth/datasets/832afe6e6e8a5c9ea5505b47e7af8125/10-1/1/0.pkl', 'label': 1 }
- pyhealth.tasks.temple_university_EEG_tasks.EEG_events_fn(record)[source]#
Processes a single patient for the EEG events task on TUEV.
This task aims at annotating EEG segments as one of six classes: (1) spike and sharp wave (SPSW), (2) generalized periodic epileptiform discharges (GPED), (3) periodic lateralized epileptiform discharges (PLED), (4) eye movement (EYEM), (5) artifact (ARTF) and (6) background (BCKG).
- Parameters:
record –
a singleton list of one subject from the TUEVDataset. The (single) record is a dictionary with the following keys:
load_from_path, patient_id, visit_id, signal_file, label_file, save_to_path
- Returns:
- a list of samples, each sample is a dict with patient_id, visit_id, record_id, label, offending_channel,
and epoch_path (the path to the saved epoch {“signal”: signal, “label”: label}) as key.
- Return type:
samples
Note that we define the task as a multiclass classification task.
Examples
>>> from pyhealth.datasets import TUEVDataset >>> EEGevents = TUEVDataset( ... root="/srv/local/data/TUH/tuh_eeg_events/v2.0.0/edf/", download=True, ... ) >>> from pyhealth.tasks import EEG_events_fn >>> EEG_events_ds = EEGevents.set_task(EEG_events_fn) >>> EEG_events_ds.samples[0] { 'patient_id': '0_00002265', 'visit_id': '00000001', 'record_id': 0, 'epoch_path': '/Users/liyanjing/.cache/pyhealth/datasets/d8f3cb92cc444d481444d3414fb5240c/0_00002265_00000001_0.pkl', 'label': 6, 'offending_channel': array([4.]) }
Models#
We implement the following models for supporting multiple healthcare predictive tasks.
pyhealth.models.MLP#
The separate callable MLP model.
- class pyhealth.models.MLP(dataset, feature_keys, label_key, mode, pretrained_emb=None, embedding_dim=128, hidden_dim=128, n_layers=2, activation='relu', **kwargs)[source]#
Bases:
BaseModel
Multi-layer perceptron model.
This model applies a separate MLP layer for each feature, and then concatenates the final hidden states of each MLP layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate MLP layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the RNN model (an illustrative snippet of the cases follows the list below):
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector; we use mean/sum pooling and then MLP
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we first use the embedding table to encode each code into a vector
and then use mean/sum pooling to get one vector for each sample; we then use MLP layers
- case 3. [1.5, 2.0, 0.0]
we run MLP directly
- case 4. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
This case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and use MLP, similar to case 1 after embedding table
- case 5. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
This case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we use mean/sum pooling within each outer bracket and use MLP, similar to case 2 after embedding table
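For concreteness, the five cases roughly correspond to feature values shaped like the following (illustrative values only):

# case 1: a flat sequence of codes
conditions_case1 = ["cond-33", "cond-86", "cond-80"]
# case 2: codes nested per visit
conditions_case2 = [["cond-33", "cond-86"], ["cond-80", "cond-12", "cond-9"]]
# case 3: a flat numeric vector
values_case3 = [1.5, 2.0, 0.0]
# case 4: equal-length numeric vectors, one per visit
values_case4 = [[1.5, 2.0, 0.0], [8.0, 1.2, 4.5]]
# case 5: one more nesting level, several equal-length vectors per visit
values_case5 = [[[1.5, 2.0, 0.0], [8.0, 1.2, 4.5]]]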
- Parameters:
dataset (SampleEHRDataset) – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys (List[str]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].
label_key (str) – key in samples to use as label (e.g., “drugs”).
mode (str) – one of “binary”, “multiclass”, or “multilabel”.
embedding_dim (int) – the embedding dimension. Default is 128.
hidden_dim (int) – the hidden dimension. Default is 128.
n_layers (int) – the number of layers. Default is 2.
activation (str) – the activation function. Default is “relu”.
**kwargs – other parameters for the MLP layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "conditions": ["cond-33", "cond-86", "cond-80"], ... "procedures": [1.0, 2.0, 3.5, 4], ... "label": 0, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "conditions": ["cond-33", "cond-86", "cond-80"], ... "procedures": [5.0, 2.0, 3.5, 4], ... "label": 1, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import MLP >>> model = MLP( ... dataset=dataset, ... feature_keys=["conditions", "procedures"], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.6659, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5680], [0.5352]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[0.2736], [0.1411]], grad_fn=<AddmmBackward0>) } >>>
- static mean_pooling(x, mask)[source]#
Mean pooling over the middle dimension of the tensor.
- Parameters:
x – tensor of shape (batch_size, seq_len, embedding_dim)
mask – tensor of shape (batch_size, seq_len)
- Returns:
tensor of shape (batch_size, embedding_dim)
- Return type:
x
Examples
>>> x.shape [128, 5, 32] >>> mean_pooling(x, mask).shape [128, 32]
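For reference, masked mean pooling with the documented shapes can be sketched as below; this is an illustrative stand-alone version, not the library implementation:
import torch

def masked_mean_pooling(x, mask):
    # x: (batch_size, seq_len, embedding_dim), mask: (batch_size, seq_len)
    mask = mask.unsqueeze(-1).float()            # (batch_size, seq_len, 1)
    summed = (x * mask).sum(dim=1)               # (batch_size, embedding_dim)
    counts = mask.sum(dim=1).clamp(min=1e-8)     # avoid division by zero for all-padding rows
    return summed / counts

x = torch.randn(128, 5, 32)
mask = torch.ones(128, 5)
print(masked_mean_pooling(x, mask).shape)        # torch.Size([128, 32])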
- static sum_pooling(x)[source]#
Sum pooling over the middle dimension of the tensor.
- Parameters:
x – tensor of shape (batch_size, seq_len, embedding_dim)
- Returns:
tensor of shape (batch_size, embedding_dim)
- Return type:
x
Examples
>>> x.shape [128, 5, 32] >>> sum_pooling(x).shape [128, 32]
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.CNN#
The separate callable CNNLayer and the complete CNN model.
- class pyhealth.models.CNNLayer(input_size, hidden_size, num_layers=1)[source]#
Bases:
Module
Convolutional neural network layer.
This layer stacks multiple CNN blocks and applies adaptive average pooling at the end. It is used in the CNN model. But it can also be used as a standalone layer.
- Parameters:
Examples
>>> from pyhealth.models import CNNLayer >>> input = torch.randn(3, 128, 5) # [batch size, sequence len, input_size] >>> layer = CNNLayer(5, 64) >>> outputs, last_outputs = layer(input) >>> outputs.shape torch.Size([3, 128, 64]) >>> last_outputs.shape torch.Size([3, 64])
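Because CNNLayer can be used as a standalone layer, a common pattern is to put a small prediction head on top of the pooled output. The sketch below follows the shapes in the example above; the linear head is hypothetical and not part of the package:
import torch
import torch.nn as nn
from pyhealth.models import CNNLayer

encoder = CNNLayer(input_size=5, hidden_size=64)
head = nn.Linear(64, 2)                      # hypothetical two-class head

x = torch.randn(3, 128, 5)                   # [batch size, sequence len, input_size]
_, pooled = encoder(x)                       # pooled output: [batch size, hidden size]
logits = head(pooled)
print(logits.shape)                          # torch.Size([3, 2])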
- forward(x)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, input size].- Returns:
- a tensor of shape [batch size, sequence len, hidden size],
containing the output features for each time step.
- pooled_outputs: a tensor of shape [batch size, hidden size], containing
the pooled output features.
- Return type:
outputs
- class pyhealth.models.CNN(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Convolutional neural network model.
This model applies a separate CNN layer for each feature, and then concatenates the final hidden states of each CNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate CNN layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the CNN model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply CNN on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector and then uses mean pooling to get one vector for each inner bracket; we then apply CNN on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run CNN directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the CNN layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"], ["A07B", "A07C"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6]], ... [[7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import CNN >>> model = CNN( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.8872, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5008], [0.6614]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[0.0033], [0.6695]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.RNN#
The separate callable RNNLayer and the complete RNN model.
- class pyhealth.models.RNNLayer(input_size, hidden_size, rnn_type='GRU', num_layers=1, dropout=0.5, bidirectional=False)[source]#
Bases:
Module
Recurrent neural network layer.
This layer wraps the PyTorch RNN layer with masking and dropout support. It is used in the RNN model. But it can also be used as a standalone layer.
- Parameters:
input_size (
int
) – input feature size.hidden_size (
int
) – hidden feature size.rnn_type (
str
) – type of rnn, one of “RNN”, “LSTM”, “GRU”. Default is “GRU”.num_layers (
int
) – number of recurrent layers. Default is 1.dropout (
float
) – dropout rate. If non-zero, introduces a Dropout layer before each RNN layer. Default is 0.5.bidirectional (
bool
) – whether to use bidirectional recurrent layers. If True, a fully-connected layer is applied to the concatenation of the forward and backward hidden states to reduce the dimension to hidden_size. Default is False.
Examples
>>> from pyhealth.models import RNNLayer >>> input = torch.randn(3, 128, 5) # [batch size, sequence len, input_size] >>> layer = RNNLayer(5, 64) >>> outputs, last_outputs = layer(input) >>> outputs.shape torch.Size([3, 128, 64]) >>> last_outputs.shape torch.Size([3, 64])
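Since RNNLayer supports masking, padded time steps can be excluded from the recurrence. The sketch below assumes a float mask with 1 for valid steps and 0 for padding, as described in forward() further down:
import torch
from pyhealth.models import RNNLayer

x = torch.randn(3, 128, 5)                   # [batch size, sequence len, input_size]
mask = torch.ones(3, 128)
mask[:, 100:] = 0                            # treat the last 28 steps as padding
layer = RNNLayer(5, 64)
outputs, last_outputs = layer(x, mask=mask)
print(outputs.shape, last_outputs.shape)     # torch.Size([3, 128, 64]) torch.Size([3, 64])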
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, input size].mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, sequence len, hidden size],
containing the output features for each time step.
- last_outputs: a tensor of shape [batch size, hidden size], containing
the output features for the last time step.
- Return type:
outputs
- class pyhealth.models.RNN(dataset, feature_keys, label_key, mode, pretrained_emb=None, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Recurrent neural network model.
This model applies a separate RNN layer for each feature, and then concatenates the final hidden states of each RNN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate rnn layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the rnn model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply rnn on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector and then uses mean pooling to get one vector for each inner bracket; we then apply rnn on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run rnn directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run rnn directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the RNN layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import RNN >>> model = RNN( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.8056, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5906], [0.6620]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[0.3666], [0.6721]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.Transformer#
The separate callable TransformerLayer and the complete Transformer model.
- class pyhealth.models.TransformerLayer(feature_size, heads=1, dropout=0.5, num_layers=1)[source]#
Bases:
Module
Transformer layer.
Paper: Ashish Vaswani et al. Attention is all you need. NIPS 2017.
This layer is used in the Transformer model. But it can also be used as a standalone layer.
- Parameters:
feature_size – the hidden feature size.
heads – the number of attention heads. Default is 1.
dropout – dropout rate. Default is 0.5.
num_layers – number of transformer layers. Default is 1.
register_hook – True to save gradients of attention layer, Default is False.
Examples
>>> from pyhealth.models import TransformerLayer >>> input = torch.randn(3, 128, 64) # [batch size, sequence len, feature_size] >>> layer = TransformerLayer(64) >>> emb, cls_emb = layer(input) >>> emb.shape torch.Size([3, 128, 64]) >>> cls_emb.shape torch.Size([3, 64])
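A padding mask can also be passed so that attention ignores padded steps; the sketch below uses a float mask with 1 for valid steps and 0 for padding, following the forward() documentation further down:
import torch
from pyhealth.models import TransformerLayer

x = torch.randn(3, 128, 64)                  # [batch size, sequence len, feature_size]
mask = torch.ones(3, 128)
mask[:, 100:] = 0                            # treat the last 28 steps as padding
layer = TransformerLayer(64)
emb, cls_emb = layer(x, mask=mask)
print(emb.shape, cls_emb.shape)              # torch.Size([3, 128, 64]) torch.Size([3, 64])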
- forward(x, mask=None, register_hook=False)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, feature_size].mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, sequence len, feature_size],
containing the output features for each time step.
- cls_emb: a tensor of shape [batch size, feature_size], containing
the output features for the first time step.
- Return type:
emb
- class pyhealth.models.Transformer(dataset, feature_keys, label_key, mode, pretrained_emb=None, embedding_dim=128, **kwargs)[source]#
Bases:
BaseModel
Transformer model.
This model applies a separate Transformer layer for each feature, and then concatenates the final hidden states of each Transformer layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate Transformer layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the transformer model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply transformer on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector and then uses mean pooling to get one vector for each inner bracket; we then apply the transformer on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run transformer directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run transformer directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset – the dataset to train the model. It is used to query certain information such as the set of all tokens.
feature_keys – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].
label_key – key in samples to use as label (e.g., “drugs”).
mode – one of “binary”, “multiclass”, or “multilabel”.
embedding_dim – the embedding dimension. Default is 128.
**kwargs – other parameters for the Transformer layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import Transformer >>> model = Transformer( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="multiclass", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(4.0555, grad_fn=<NllLossBackward0>), 'y_prob': tensor([[1.0000e+00, 1.8206e-06], [9.9970e-01, 3.0020e-04]], grad_fn=<SoftmaxBackward0>), 'y_true': tensor([0, 1]), 'logit': tensor([[ 7.6283, -5.5881], [ 1.0898, -7.0210]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.RETAIN#
The separate callable RETAINLayer and the complete RETAIN model.
- class pyhealth.models.RETAINLayer(feature_size, dropout=0.5)[source]#
Bases:
Module
RETAIN layer.
Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.
This layer is used in the RETAIN model. But it can also be used as a standalone layer.
- Parameters:
Examples
>>> from pyhealth.models import RETAINLayer >>> input = torch.randn(3, 128, 64) # [batch size, sequence len, feature_size] >>> layer = RETAINLayer(64) >>> c = layer(input) >>> c.shape torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, feature_size].mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, feature_size] representing the
context vector.
- Return type:
c
- class pyhealth.models.RETAIN(dataset, feature_keys, label_key, mode, pretrained_emb=None, embedding_dim=128, **kwargs)[source]#
Bases:
BaseModel
RETAIN model.
Paper: Edward Choi et al. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. NIPS 2016.
Note
We use separate Retain layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the Retain model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply Retain on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector and then uses mean pooling to get one vector for each inner bracket; we then apply Retain on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run Retain directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run Retain directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.**kwargs – other parameters for the RETAIN layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import RETAIN >>> model = RETAIN( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.5640, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5325], [0.3922]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[ 0.1303], [-0.4382]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.GAMENet#
The separate callable GAMENetLayer and the complete GAMENet model.
- class pyhealth.models.GAMENetLayer(hidden_size, ehr_adj, ddi_adj, dropout=0.5)[source]#
Bases:
Module
GAMENet layer.
Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination AAAI 2019.
This layer is used in the GAMENet model. But it can also be used as a standalone layer.
- Parameters:
Examples
>>> from pyhealth.models import GAMENetLayer >>> queries = torch.randn(3, 5, 32) # [patient, visit, hidden_size] >>> prev_drugs = torch.randint(0, 2, (3, 4, 50)).float() >>> curr_drugs = torch.randint(0, 2, (3, 50)).float() >>> ehr_adj = torch.randint(0, 2, (50, 50)).float() >>> ddi_adj = torch.randint(0, 2, (50, 50)).float() >>> layer = GAMENetLayer(32, ehr_adj, ddi_adj) >>> loss, y_prob = layer(queries, prev_drugs, curr_drugs) >>> loss.shape torch.Size([]) >>> y_prob.shape torch.Size([3, 50])
- forward(queries, prev_drugs, curr_drugs, mask=None)[source]#
Forward propagation.
- Parameters:
queries (
tensor
) – query tensor of shape [patient, visit, hidden_size].prev_drugs (
tensor
) – multihot tensor indicating drug usage in all previous visits of shape [patient, visit - 1, num_drugs].curr_drugs (
tensor
) – multihot tensor indicating drug usage in the current visit of shape [patient, num_drugs].mask (
Optional
[tensor
]) – an optional mask tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.
- Returns:
a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing
the probability of each drug.
- Return type:
loss
- class pyhealth.models.GAMENet(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#
Bases:
BaseModel
GAMENet model.
Paper: Junyuan Shang et al. GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination AAAI 2019.
Note
This model is only for medication prediction, which takes conditions and procedures as feature_keys and drugs as label_key. It only operates on the visit level. Thus, we have disabled the feature_keys, label_key, and mode arguments.
Note
This model only accepts ATC level 3 as medication codes.
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.num_layers (
int
) – the number of layers used in RNN. Default is 1.dropout (
float
) – the dropout rate. Default is 0.5.**kwargs – other parameters for the GAMENet layer.
- forward(conditions, procedures, drugs_hist, drugs, **kwargs)[source]#
Forward propagation.
- Parameters:
conditions (
List
[List
[List
[str
]]]) – a nested list in three levels [patient, visit, condition].procedures (
List
[List
[List
[str
]]]) – a nested list in three levels [patient, visit, procedure].drugs_hist (
List
[List
[List
[str
]]]) – a nested list in three levels [patient, visit, drug], up to visit (N-1)drugs (
List
[List
[str
]]) – a nested list in two levels [patient, drug], at visit N
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing
the probability of each drug.
- y_true: a tensor of shape [patient, visit, num_labels] representing
the ground truth of each drug.
- Return type:
A dictionary with the following keys
pyhealth.models.MICRON#
The separate callable MICRONLayer and the complete MICRON model.
- class pyhealth.models.MICRONLayer(input_size, hidden_size, num_drugs, lam=0.1)[source]#
Bases:
Module
MICRON layer.
Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.
This layer is used in the MICRON model. But it can also be used as a standalone layer.
- Parameters:
Examples
>>> from pyhealth.models import MICRONLayer >>> patient_emb = torch.randn(3, 5, 32) # [patient, visit, input_size] >>> drugs = torch.randint(0, 2, (3, 50)).float() >>> layer = MICRONLayer(32, 64, 50) >>> loss, y_prob = layer(patient_emb, drugs) >>> loss.shape torch.Size([]) >>> y_prob.shape torch.Size([3, 50])
- forward(patient_emb, drugs, mask=None)[source]#
Forward propagation.
- Parameters:
patient_emb (
tensor
) – a tensor of shape [patient, visit, input_size].drugs (
tensor
) – a multihot tensor of shape [patient, num_labels].mask (
Optional
[tensor
]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.
- Returns:
a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing
the probability of each drug.
- Return type:
loss
- class pyhealth.models.MICRON(dataset, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
MICRON model.
Paper: Chaoqi Yang et al. Change Matters: Medication Change Prediction with Recurrent Residual Networks. IJCAI 2021.
Note
This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the MICRON layer.
- forward(conditions, procedures, drugs, **kwargs)[source]#
Forward propagation.
- Parameters:
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing
the probability of each drug.
- y_true: a tensor of shape [patient, visit, num_labels] representing
the ground truth of each drug.
- Return type:
A dictionary with the following keys
pyhealth.models.SafeDrug#
The separate callable SafeDrugLayer and the complete SafeDrug model.
- class pyhealth.models.SafeDrugLayer(hidden_size, mask_H, ddi_adj, num_fingerprints, molecule_set, average_projection, kp=0.05, target_ddi=0.08)[source]#
Bases:
Module
SafeDrug model.
Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.
This layer is used in the SafeDrug model. But it can also be used as a standalone layer. Note that we improve the layer a little bit to make it compatible with the package. Original code can be found at https://github.com/ycq091044/SafeDrug/blob/main/src/models.py.
- Parameters:
hidden_size (
int
) – hidden feature size.mask_H (
Tensor
) – the mask matrix H of shape [num_drugs, num_substructures].ddi_adj (
Tensor
) – an adjacency tensor of shape [num_drugs, num_drugs].num_fingerprints (
int
) – total number of different fingerprints.molecule_set (
List
[Tuple
]) – a list of molecule tuples (A, B, C) of length num_molecules. - A <torch.tensor>: fingerprints of atoms in the molecule - B <torch.tensor>: adjacency matrix of the molecule - C <int>: molecular_sizeaverage_projection (
Tensor
) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.kp (
float
) – correcting factor for the proportional signal. Default is 0.05.target_ddi (
float
) – DDI acceptance rate. Default is 0.08.
- pad(matrices, pad_value)[source]#
Pads the list of matrices.
Padding with a pad_value (e.g., 0) for batch processing. For example, given a list of matrices [A, B, C], we obtain a new matrix [A00, 0B0, 00C], where 0 is the zero (i.e., pad value) matrix.
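The block-diagonal construction described above can be sketched as follows; this is an illustrative stand-alone version that assumes square matrices, not the library implementation:
import torch

def block_diagonal_pad(matrices, pad_value=0.0):
    sizes = [m.shape[0] for m in matrices]             # assume each matrix is square
    total = sum(sizes)
    out = torch.full((total, total), pad_value)
    offset = 0
    for m, n in zip(matrices, sizes):
        out[offset:offset + n, offset:offset + n] = m  # place each matrix on the diagonal
        offset += n
    return out

A, B, C = torch.ones(2, 2), 2 * torch.ones(3, 3), 3 * torch.ones(1, 1)
print(block_diagonal_pad([A, B, C]).shape)             # torch.Size([6, 6])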
- forward(patient_emb, drugs, mask=None)[source]#
Forward propagation.
- Parameters:
patient_emb (
tensor
) – a tensor of shape [patient, visit, input_size].drugs (
tensor
) – a multihot tensor of shape [patient, num_labels].mask (
Optional
[tensor
]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.
- Returns:
a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing
the probability of each drug.
- Return type:
loss
- class pyhealth.models.SafeDrug(dataset, embedding_dim=128, hidden_dim=128, num_layers=1, dropout=0.5, **kwargs)[source]#
Bases:
BaseModel
SafeDrug model.
Paper: Chaoqi Yang et al. SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations. IJCAI 2021.
Note
This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.
Note
This model only accepts ATC level 3 as medication codes.
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.num_layers (
int
) – the number of layers used in RNN. Default is 1.dropout (
float
) – the dropout rate. Default is 0.5.**kwargs – other parameters for the SafeDrug layer.
- forward(conditions, procedures, drugs, **kwargs)[source]#
Forward propagation.
- Parameters:
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels] representing
the probability of each drug.
- y_true: a tensor of shape [patient, visit, num_labels] representing
the ground truth of each drug.
- Return type:
A dictionary with the following keys
pyhealth.models.MoleRec#
The separate callable MoleRecLayer and the complete MoleRec model.
- class pyhealth.models.MoleRecLayer(hidden_size, coef=2.5, target_ddi=0.08, GNN_layers=4, dropout=0.5, multiloss_weight=0.05, **kwargs)[source]#
Bases:
Module
MoleRec model.
Paper: Nianzu Yang et al. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. WWW 2023.
This layer is used in the MoleRec model. But it can also be used as a standalone layer.
- Parameters:
hidden_size (
int
) – hidden feature size.coef (
float
) – coefficient of the DDI loss weight annealing; a larger coefficient means a higher penalty on drug-drug interactions. Default is 2.5.target_ddi (
float
) – DDI acceptance rate. Default is 0.08.GNN_layers (
int
) – the number of layers of GNNs encoding molecule and substructures. Default is 4.dropout (
float
) – the dropout ratio of the model. Default is 0.5.multiloss_weight (
float
) – the weight of multilabel_margin_loss for multilabel classification. Value should be set between [0, 1]. Default is 0.05
- forward(patient_emb, drugs, average_projection, ddi_adj, substructure_mask, substructure_graph, molecule_graph, mask=None, drug_indexes=None)[source]#
Forward propagation.
- Parameters:
patient_emb (
Tensor
) – a tensor of shape [patient, visit, num_substructures], representating the relation between each patient visit and each substructures.drugs (
Tensor
) – a multihot tensor of shape [patient, num_labels].mask (
Optional
[tensor
]) – an optional tensor of shape [patient, visit] where 1 indicates valid visits and 0 indicates invalid visits.substructure_mask (
Tensor
) – tensor of shape [num_drugs, num_substructures], representing whether a substructure shows up in one of the molecule of each drug.average_projection (
Tensor
) – a tensor of shape [num_drugs, num_molecules] representing the average projection for aggregating multiple molecules of the same drug into one vector.substructure_graph (
Union
[StaticParaDict
,Dict
[str
,Union
[int
,Tensor
]]]) – a dictionary representating a graph batch of all substructures, where each graph is extracted via ‘smiles2graph’ api of ogb library.molecule_graph (
Union
[StaticParaDict
,Dict
[str
,Union
[int
,Tensor
]]]) – dictionary with same form of substructure_graph, representing the graph batch of all molecules.ddi_adj (
Tensor
) – an adjacency tensor for drug drug interaction of shape [num_drugs, num_drugs].drug_indexes (
Optional
[Tensor
]) – the index version of drugs (ground truth) of shape [patient, num_labels], padded with -1
- Returns:
a scalar tensor representing the loss. y_prob: a tensor of shape [patient, num_labels] representing
the probability of each drug.
- Return type:
loss
- class pyhealth.models.MoleRec(dataset, embedding_dim=64, hidden_dim=64, num_rnn_layers=1, num_gnn_layers=4, dropout=0.5, **kwargs)[source]#
Bases:
BaseModel
MoleRec model.
Paper: Nianzu Yang et al. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. WWW 2023.
Note
This model is only for medication prediction which takes conditions and procedures as feature_keys, and drugs as label_key. It only operates on the visit level.
Note
This model only accepts ATC level 3 as medication codes.
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.embedding_dim (
int
) – the embedding dimension. Default is 64.hidden_dim (
int
) – the hidden dimension. Default is 64.num_rnn_layers (
int
) – the number of layers used in RNN. Default is 1.num_gnn_layers (
int
) – the number of layers used in GNN. Default is 4.dropout (
float
) – the dropout rate. Default is 0.5. **kwargs – other parameters for the MoleRec layer.
- forward(conditions, procedures, drugs, **kwargs)[source]#
Forward propagation.
- Parameters:
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor of shape [patient, visit, num_labels]
representing the probability of each drug.
- y_true: a tensor of shape [patient, visit, num_labels]
representing the ground truth of each drug.
- Return type:
A dictionary with the following keys
pyhealth.models.Deepr#
The separate callable DeeprLayer and the complete Deepr model.
- class pyhealth.models.DeeprLayer(feature_size=100, window=1, hidden_size=3)[source]#
Bases:
Module
Deepr layer.
Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” in IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.
This layer is used in the Deepr model.
- Parameters:
Examples
>>> from pyhealth.models import DeeprLayer >>> input = torch.randn(3, 128, 5) # [batch size, sequence len, input_size] >>> layer = DeeprLayer(5, window=4, hidden_size=7) # window does not impact the output shape >>> outputs = layer(input) >>> outputs.shape torch.Size([3, 7])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters:
x (
Tensor
) – a Tensor of shape [batch size, sequence len, input size].mask (
Optional
[Tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a Tensor of shape [batch size, hidden_size] representing the
summarized vector.
- Return type:
c
- class pyhealth.models.Deepr(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
Deepr model.
Paper: P. Nguyen, T. Tran, N. Wickramasinghe and S. Venkatesh, “Deepr: A Convolutional Net for Medical Records,” in IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22-30, Jan. 2017, doi: 10.1109/JBHI.2016.2633963.
Note
We use separate Deepr layers for different feature_keys.
- Parameters:
dataset (
BaseEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the Deepr layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import Deepr >>> model = Deepr( ... dataset=dataset, ... feature_keys=[ ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.8908, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.2295], [0.2665]], device='cuda:0', grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]], device='cuda:0'), 'logit': tensor([[-1.2110], [-1.0126]], device='cuda:0', grad_fn=<AddmmBackward0>) }
pyhealth.models.ContraWR#
The separate callable ResBlock2D and the complete ContraWR model.
- class pyhealth.models.ResBlock2D(in_channels, out_channels, stride=2, downsample=True, pooling=True)[source]#
Bases:
Module
Convolutional Residual Block 2D
This block stacks two convolutional layers with batch normalization, max pooling, dropout, and residual connection.
- Parameters:
Example
>>> import torch >>> from pyhealth.models import ResBlock2D >>> >>> model = ResBlock2D(6, 16, 1, True, True) >>> input_ = torch.randn((16, 6, 28, 150)) # (batch, channel, height, width) >>> output = model(input_) >>> output.shape torch.Size([16, 16, 14, 75])
- class pyhealth.models.ContraWR(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, n_fft=128, **kwargs)[source]#
Bases:
BaseModel
The encoder model of ContraWR (a supervised model, STFT + 2D CNN layers)
Paper: Yang, Chaoqi, Danica Xiao, M. Brandon Westover, and Jimeng Sun. “Self-supervised eeg representation learning for automatic sleep staging.” arXiv preprint arXiv:2110.15278 (2021).
Note
We use one encoder to handle multiple channels together.
- Parameters:
dataset (
BaseSignalDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128. **kwargs – other parameters for the ContraWR encoder.
Examples
>>> from pyhealth.datasets import SampleSignalDataset >>> samples = [ ... { ... "record_id": "SC4001-0", ... "patient_id": "SC4001", ... "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-0.pkl", ... "label": "W", ... }, ... { ... "record_id": "SC4001-1", ... "patient_id": "SC4001", ... "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-1.pkl", ... "label": "R", ... } ... ] >>> dataset = SampleSignalDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import ContraWR >>> model = ContraWR( ... dataset=dataset, ... feature_keys=["signal"], # dataloader will load the signal from "epoch_path" and put it in "signal" ... label_key="label", ... mode="multiclass", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(2.8425, device='cuda:0', grad_fn=<NllLossBackward0>), 'y_prob': tensor([[0.9345, 0.0655], [0.9482, 0.0518]], device='cuda:0', grad_fn=<SoftmaxBackward0>), 'y_true': tensor([1, 1], device='cuda:0'), 'logit': tensor([[ 0.1472, -2.5104], [2.1584, -0.7481]], device='cuda:0', grad_fn=<AddmmBackward0>) } >>>
- cal_encoder_stat()[source]#
obtain the convolution encoder initialization statistics
Note
We show an example to illustrate the encoder statistics. input x:
torch.Size([5, 7, 3000])
- after stft transform
torch.Size([5, 7, 65, 90])
- we design the first CNN (out_channels = 8)
torch.Size([5, 8, 16, 22])
here: 8 * 16 * 22 > 256, so we continue the convolution
- we design the second CNN (out_channels = 16)
torch.Size([5, 16, 4, 5])
here: 16 * 4 * 5 > 256, so we continue the convolution
- we design the third CNN (out_channels = 32)
torch.Size([5, 32, 1, 1])
here: 32 * 1 * 1 = 32 <= 256, so we stop the convolution
- output:
channels = [7, 8, 16, 32]
emb_size = 32 * 1 * 1 = 32
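The stopping rule in this walkthrough can be reproduced with a small sketch; the factor-of-4 spatial downsampling per block is an assumption inferred from the shapes above, not a statement about the exact encoder implementation:
def plan_encoder(in_channels, height, width, out_channels_list=(8, 16, 32), limit=256):
    channels = [in_channels]
    emb_size = None
    for out_c in out_channels_list:
        height, width = max(height // 4, 1), max(width // 4, 1)   # assumed downsampling per block
        channels.append(out_c)
        emb_size = out_c * height * width
        if emb_size <= limit:                                     # stop once the flattened size is small enough
            break
    return channels, emb_size

print(plan_encoder(7, 65, 90))    # ([7, 8, 16, 32], 32), matching the walkthrough above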
pyhealth.models.SparcNet#
The SparcNet Model: Jin Jing, et al. Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation. Neurology 2023.
- class pyhealth.models.DenseLayer(input_channels, growth_rate, bn_size, drop_rate=0.5, conv_bias=True, batch_norm=True)[source]#
Bases:
Sequential
Densely connected layer.
- Parameters:
input_channels – number of input channels
growth_rate – rate of growth of channels in this layer
bn_size – multiplicative factor for the bottleneck layer (does not affect the output size)
drop_rate – dropout rate
conv_bias – whether to use bias in convolutional layers
batch_norm – whether to use batch normalization
Example
>>> x = torch.randn(128, 5, 1000) >>> batch, channels, length = x.shape >>> model = DenseLayer(channels, 5, 2) >>> y = model(x) >>> y.shape torch.Size([128, 10, 1000])
- forward(x)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pyhealth.models.DenseBlock(num_layers, input_channels, growth_rate, bn_size, drop_rate=0.5, conv_bias=True, batch_norm=True)[source]#
Bases:
Sequential
Densely connected block.
- Parameters:
num_layers – number of layers in this block
input_channels – number of input channels
growth_rate – rate of growth of channels in this layer
bn_size – multiplicative factor for the bottleneck layer (does not affect the output size)
drop_rate – dropout rate
conv_bias – whether to use bias in convolutional layers
batch_norm – whether to use batch normalization
Example
>>> x = torch.randn(128, 5, 1000) >>> batch, channels, length = x.shape >>> model = DenseBlock(3, channels, 5, 2) >>> y = model(x) >>> y.shape torch.Size([128, 20, 1000])
- class pyhealth.models.TransitionLayer(input_channels, output_channels, conv_bias=True, batch_norm=True)[source]#
Bases:
Sequential
pooling transition layer
- Parameters:
input_channels – number of input channels
output_channels – number of output channels
conv_bias – whether to use bias in convolutional layers
batch_norm – whether to use batch normalization
Example
>>> x = torch.randn(128, 5, 1000) >>> model = TransitionLayer(5, 18) >>> y = model(x) >>> y.shape torch.Size([128, 18, 500])
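DenseBlock and TransitionLayer are typically chained in a DenseNet-style stack. The sketch below only composes the two documented examples (it is not the SparcNet architecture itself), assuming both classes are importable from pyhealth.models as listed here:
import torch
from pyhealth.models import DenseBlock, TransitionLayer

x = torch.randn(128, 5, 1000)                                                  # (batch, channels, length)
block = DenseBlock(num_layers=3, input_channels=5, growth_rate=5, bn_size=2)   # 5 -> 5 + 3*5 = 20 channels
trans = TransitionLayer(input_channels=20, output_channels=10)                 # halves the temporal length
y = trans(block(x))
print(y.shape)                                                                 # torch.Size([128, 10, 500])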
- class pyhealth.models.SparcNet(dataset, feature_keys, label_key, mode, embedding_dim=128, hidden_dim=128, block_layers=4, growth_rate=16, bn_size=16, drop_rate=0.5, conv_bias=True, batch_norm=True, **kwargs)[source]#
Bases:
BaseModel
The SparcNet model for sleep staging.
Paper: Jin Jing, et al. Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation. Neurology 2023.
Note
We use one encoder to handle multiple channels together.
- Parameters:
dataset (
BaseSignalDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – (not used now) the embedding dimension. Default is 128.hidden_dim (
int
) – (not used now) the hidden dimension. Default is 128.block_layers – the number of layers in each dense block. Default is 4.
growth_rate – the growth rate of each dense layer. Default is 16.
bn_size – the bottleneck size of each dense layer. Default is 16.
conv_bias – whether to use bias in convolutional layers. Default is True.
batch_norm – whether to use batch normalization. Default is True.
**kwargs – other parameters for the SparcNet layers.
Examples
>>> from pyhealth.datasets import SampleSignalDataset >>> samples = [ ... { ... "record_id": "SC4001-0", ... "patient_id": "SC4001", ... "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-0.pkl", ... "label": "W", ... }, ... { ... "record_id": "SC4001-1", ... "patient_id": "SC4001", ... "epoch_path": "/home/chaoqiy2/.cache/pyhealth/datasets/2f06a9232e54254cbcb4b62624294d71/SC4001-1.pkl", ... "label": "R", ... } ... ] >>> dataset = SampleSignalDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import SparcNet >>> model = SparcNet( ... dataset=dataset, ... feature_keys=["signal"], # dataloader will load the signal from "epoch_path" and put it in "signal" ... label_key="label", ... mode="multiclass", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.6530, device='cuda:0', grad_fn=<NllLossBackward0>), 'y_prob': tensor([[0.4459, 0.5541], [0.5111, 0.4889]], device='cuda:0', grad_fn=<SoftmaxBackward0>), 'y_true': tensor([1, 1], device='cuda:0'), 'logit': tensor([[-0.2750, -0.0577], [-0.1319, -0.1763]], device='cuda:0', grad_fn=<AddmmBackward0>) }
- label_tokenizer#
input statistics
pyhealth.models.StageNet#
The separate callable StageNetLayer and the complete StageNet model.
- class pyhealth.models.StageNetLayer(input_dim, chunk_size=128, conv_size=10, levels=3, dropconnect=0.3, dropout=0.3, dropres=0.3)[source]#
Bases:
Module
StageNet layer.
Paper: Stagenet: Stage-aware neural networks for health risk prediction. WWW 2020.
This layer is used in the StageNet model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – dynamic feature size.chunk_size (
int
) – the chunk size for the StageNet layer. Default is 128.levels (
int
) – the number of levels for the StageNet layer. levels * chunk_size = hidden_dim in the RNN. Smaller chunk size and more levels can capture more detailed patient status variations. Default is 3.conv_size (
int
) – the size of the convolutional kernel. Default is 10.dropconnect (
int
) – the dropout rate for the dropconnect. Default is 0.3.dropout (
int
) – the dropout rate for the dropout. Default is 0.3.dropres (
int
) – the dropout rate for the residual connection. Default is 0.3.
Examples
>>> from pyhealth.models import StageNetLayer >>> input = torch.randn(3, 128, 64) # [batch size, sequence len, feature_size] >>> layer = StageNetLayer(64) >>> c, _, _ = layer(input) >>> c.shape torch.Size([3, 384])
- forward(x, time=None, mask=None)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, input_dim].time – an optional tensor of time intervals used to compute the memory decay between steps (see the time_keys note in the StageNet model).
mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, chunk_size*levels] representing the
patient embedding.
outputs: a tensor of shape [batch size, sequence len, chunk_size*levels] representing the patient at each time step.
- Return type:
last_output
- class pyhealth.models.StageNet(dataset, feature_keys, label_key, mode, time_keys=None, embedding_dim=128, chunk_size=128, levels=3, **kwargs)[source]#
Bases:
BaseModel
StageNet model.
Paper: Junyi Gao et al. Stagenet: Stage-aware neural networks for health risk prediction. WWW 2020.
Note
We use separate StageNet layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the StageNet model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply StageNet on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector and then uses mean pooling to get one vector for each inner bracket; we then apply StageNet on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run StageNet directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run StageNet directly on the inner bracket level, similar to case 2 after embedding table
The time interval information specified by time_keys will be used to calculate the memory decay between each visit. If time_keys is None, all visits are treated as the same time interval. For each feature, the time interval should be a two-dimensional float array with shape (time_step, 1).
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.time_keys (
Optional
[List
[str
]]) – list of keys in samples to use as time interval information for each feature, Default is None. If none, all visits are treated as the same time interval.embedding_dim (
int
) – the embedding dimension. Default is 128.chunk_size (
int
) – the chunk size for the StageNet layer. Default is 128.levels (
int
) – the number of levels for the StageNet layer. levels * chunk_size = hidden_dim in the RNN. Smaller chunk size and more levels can capture more detailed patient status variations. Default is 3.**kwargs – other parameters for the StageNet layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... # "single_vector": [1, 2, 3], ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... "list_vectors_time": [[0.0], [1.3]], ... "list_codes_time": [[0.0], [2.0], [1.3]], ... "list_list_codes_time": [[0.0], [1.5]], ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... # "single_vector": [1, 5, 8], ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... "list_vectors_time": [[0.0], [2.0], [1.0]], ... "list_codes_time": [[0.0], [2.0], [1.3], [1.0], [2.0]], ... "list_list_codes_time": [[0.0]], ... }, ... ] >>> >>> # dataset >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> # data loader >>> from pyhealth.datasets import get_dataloader >>> >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> >>> # model >>> model = StageNet( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... # "list_list_vectors", ... ], ... time_keys=["list_codes_time", "list_vectors_time", "list_list_codes_time"], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.7111, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.4815], [0.4991]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[-0.0742], [-0.0038]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the final loss. distance: list of tensors representing the stage variation of the patient. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.AdaCare#
The separate callable AdaCareLayer and the complete AdaCare model.
- class pyhealth.models.AdaCareLayer(input_dim, hidden_dim=128, kernel_size=2, kernel_num=64, r_v=4, r_c=4, activation='sigmoid', rnn_type='gru', dropout=0.5)[source]#
Bases:
Module
AdaCare layer.
Paper: Liantao Ma et al. Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. AAAI 2020.
This layer is used in the AdaCare model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – the input feature size.hidden_dim (
int
) – the hidden dimension of the GRU layer. Default is 128.kernel_size (
int
) – the kernel size of the causal convolution layer. Default is 2.kernel_num (
int
) – the kernel number of the causal convolution layer. Default is 64.r_v (
int
) – the reduction rate for the original feature calibration. Default is 4.r_c (
int
) – the reduction rate for the convolutional feature recalibration. Default is 4.activation (
str
) – the activation function for the recalibration layer (sigmoid, sparsemax, softmax). Default is “sigmoid”.dropout (
float
) – dropout rate. Default is 0.5.
Examples
>>> import torch
>>> from pyhealth.models import AdaCareLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = AdaCareLayer(64)
>>> c, _, inputatt, convatt = layer(input)
>>> c.shape
torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, input_dim].mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, input_dim] representing the
patient embedding.
output: a tensor of shape [batch size, sequence_len, input_dim] representing the patient embedding at each time step. inputatt: a tensor of shape [batch size, sequence_len, input_dim] representing the feature importance of the input. convatt: a tensor of shape [batch size, sequence_len, 3 * kernel_num] representing the feature importance of the convolutional features.
- Return type:
last_output
- class pyhealth.models.AdaCare(dataset, feature_keys, label_key, mode, use_embedding, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
AdaCare model.
Paper: Liantao Ma et al. Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. AAAI 2020.
Note
We use separate AdaCare layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
Since the AdaCare model calibrates the original features to provide interpretability, we do not recommend using embeddings for the input features. We follow the current convention for the AdaCare model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply AdaCare on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector, applies mean pooling to get one vector per inner bracket, and then applies AdaCare on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run AdaCare directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run AdaCare directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.use_embedding (
List
[bool
]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the AdaCare layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import AdaCare >>> model = AdaCare( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... use_embedding=[True, False, True, False], ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.7167, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.5009], [0.4779]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]]), 'logit': tensor([[ 0.0036], [-0.0886]], grad_fn=<AddmmBackward0>) }
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. feature_importance: a list of tensors with shape (feature_type, batch_size, time_step, features)
representing the feature importance.
- conv_feature_importance: a list of tensors with shape (feature_type, batch_size, time_step, 3 * kernel_num)
representing the convolutional feature importance.
y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.ConCare#
The separate callable ConCareLayer and the complete ConCare model.
- class pyhealth.models.ConCareLayer(input_dim, static_dim=0, hidden_dim=128, num_head=4, pe_hidden=64, dropout=0.5)[source]#
Bases:
Module
ConCare layer.
Paper: Liantao Ma et al. Concare: Personalized clinical feature embedding via capturing the healthcare context. AAAI 2020.
This layer is used in the ConCare model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – dynamic feature size.static_dim (
int
) – static feature size, if 0, then no static feature is used.hidden_dim (
int
) – hidden dimension of the channel-wise GRU, default 128.transformer_hidden – hidden dimension of the transformer, default 128.
num_head (
int
) – number of heads in the transformer, default 4.pe_hidden (
int
) – hidden dimension of the positional encoding, default 64.dropout (
int
) – dropout rate, default 0.5.
Examples
>>> import torch
>>> from pyhealth.models import ConCareLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = ConCareLayer(64)
>>> c, _ = layer(input)
>>> c.shape
torch.Size([3, 128])
- class pyhealth.models.ConCare(dataset, feature_keys, label_key, mode, use_embedding, static_key=None, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
ConCare model.
Paper: Liantao Ma et al. Concare: Personalized clinical feature embedding via capturing the healthcare context. AAAI 2020.
Note
We use separate ConCare layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
If you need the interpretable feature correlations computed by the ConCare model, we do not recommend using embeddings for the input features. We follow the current convention for the ConCare model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply ConCare on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector, applies mean pooling to get one vector per inner bracket, and then applies ConCare on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run ConCare directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run ConCare directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.static_key – the key in samples to use as static features, e.g. “demographics”. Default is None. We only support numerical static features.
use_embedding (
List
[bool
]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the ConCare layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import ConCare >>> model = ConCare( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... static_key="demographic", ... use_embedding=[True, False, True, False], ... mode="binary" ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(9.5541, grad_fn=<AddBackward0>), 'y_prob': tensor([[0.5323], [0.5363]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[0.1293], [0.1454]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the final loss. loss_task: a scalar tensor representing the task loss. loss_decov: a scalar tensor representing the decov loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.Agent#
The separate callable AgentLayer and the complete Agent model.
- class pyhealth.models.AgentLayer(input_dim, static_dim=0, cell='gru', use_baseline=True, n_actions=10, n_units=64, n_hidden=128, dropout=0.5, lamda=0.5)[source]#
Bases:
Module
Dr. Agent layer.
Paper: Junyi Gao et al. Dr. Agent: Clinical predictive model via mimicked second opinions. JAMIA.
This layer is used in the Dr. Agent model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – dynamic feature size.static_dim (
int
) – static feature size, if 0, then no static feature is used.cell (
str
) – rnn cell type. Default is “gru”.use_baseline (
bool
) – whether to use baseline for the RL agent. Default is True.n_actions (
int
) – number of historical visits to choose. Default is 10.n_units (
int
) – number of hidden units in each agent. Default is 64.fusion_dim – number of hidden units in the final representation. Default is 128.
n_hidden (
int
) – number of hidden units in the rnn. Default is 128.dropout (
int
) – dropout rate. Default is 0.5.lamda (
int
) – weight for the agent selected hidden state and the current hidden state. Default is 0.5.
Examples
>>> import torch
>>> from pyhealth.models import AgentLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = AgentLayer(64)
>>> c, _ = layer(input)
>>> c.shape
torch.Size([3, 128])
- forward(x, static=None, mask=None)[source]#
Forward propagation.
- Parameters:
- Returns:
- a tensor of shape [batch size, n_hidden] representing the
patient embedding.
output: a tensor of shape [batch size, sequence len, n_hidden] representing the patient embedding at each time step.
- Return type:
last_output
- class pyhealth.models.Agent(dataset, feature_keys, label_key, mode, static_key=None, embedding_dim=128, hidden_dim=128, use_baseline=True, **kwargs)[source]#
Bases:
BaseModel
Dr. Agent model.
Paper: Junyi Gao et al. Dr. Agent: Clinical predictive model via mimicked second opinions. JAMIA.
Note
We use separate Dr. Agent layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the Dr. Agent model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply Dr. Agent on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector, applies mean pooling to get one vector per inner bracket, and then applies Dr. Agent on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run Dr. Agent directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run Dr. Agent directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.static_key – the key in samples to use as static features, e.g. “demographics”. Default is None. We only support numerical static features.
embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension of the RNN in the Dr. Agent layer. Default is 128.use_baseline (
bool
) – whether to use the baseline value to calculate the RL loss. Default is True.**kwargs – other parameters for the Dr. Agent layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import Agent >>> model = Agent( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... static_key="demographic", ... mode="binary" ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(1.4059, grad_fn=<AddBackward0>), 'y_prob': tensor([[0.4861], [0.5348]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]]), 'logit': tensor([[-0.0556], [0.1392]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the final loss. loss_task: a scalar tensor representing the task loss. loss_RL: a scalar tensor representing the RL loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.GRASP#
The separate callable GRASPLayer and the complete GRASP model.
- class pyhealth.models.GRASPLayer(input_dim, static_dim=0, hidden_dim=128, cluster_num=2, dropout=0.5, block='ConCare')[source]#
Bases:
Module
GRASPLayer layer.
Paper: Liantao Ma et al. GRASP: generic framework for health status representation learning based on incorporating knowledge from similar patients. AAAI 2021.
This layer is used in the GRASP model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – dynamic feature size.static_dim (
int
) – static feature size, if 0, then no static feature is used.hidden_dim (
int
) – hidden dimension of the GRASP layer, default 128.cluster_num (
int
) – number of clusters, default 2. The cluster_num should be no more than the number of samples.dropout (
int
) – dropout rate, default 0.5.block (
str
) – the backbone model used in the GRASP layer (‘ConCare’, ‘LSTM’ or ‘GRU’), default ‘ConCare’.
Examples
>>> import torch
>>> from pyhealth.models import GRASPLayer
>>> input = torch.randn(3, 128, 64)  # [batch size, sequence len, feature_size]
>>> layer = GRASPLayer(64, cluster_num=2)
>>> c = layer(input)
>>> c.shape
torch.Size([3, 128])
- gumbel_softmax(logits, temperature, device, hard=False)[source]#
ST-Gumbel-Softmax. Input: [*, n_class]. Returns: a flattened [*, n_class] one-hot vector.
- class pyhealth.models.GRASP(dataset, feature_keys, label_key, mode, use_embedding, static_key=None, embedding_dim=128, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
GRASP model.
Paper: Liantao Ma et al. GRASP: generic framework for health status representation learning based on incorporating knowledge from similar patients. AAAI 2021.
Note
We use separate GRASP layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the GRASP model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply GRASP on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector, applies mean pooling to get one vector per inner bracket, and then applies GRASP on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run GRASP directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run GRASP directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.static_key – the key in samples to use as static features, e.g. “demographics”. Default is None. We only support numerical static features.
use_embedding (
List
[bool
]) – list of bools indicating whether to use embedding for each feature type, e.g. [True, False].embedding_dim (
int
) – the embedding dimension. Default is 128.hidden_dim (
int
) – the hidden dimension of the GRASP layer. Default is 128.cluster_num – the number of clusters. Default is 10. Note that batch size should be greater than cluster_num.
**kwargs – other parameters for the GRASP layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "demographic": [0.0, 2.0, 1.5], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import GRASP >>> model = GRASP( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... static_key="demographic", ... use_embedding=[True, False, True, False], ... mode="binary" ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(0.6896, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.4983], [0.4947]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[1.], [0.]]), 'logit': tensor([[-0.0070], [-0.0213]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the final loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.TCN#
The separate callable TCNLayer and the complete TCN model.
- class pyhealth.models.TCNLayer(input_dim, num_channels=128, max_seq_length=20, kernel_size=2, dropout=0.5)[source]#
Bases:
Module
Temporal Convolutional Networks layer.
Shaojie Bai et al. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.
This layer wraps the PyTorch TCN layer with masking and dropout support. It is used in the TCN model. But it can also be used as a standalone layer.
- Parameters:
input_dim (
int
) – input feature size.num_channels (
int
) – int or list of ints. If int, the depth will be automatically decided by the max_seq_length. If list, number of channels in each layer.max_seq_length (
int
) – max sequence length. Used to compute the depth of the TCN.kernel_size (
int
) – kernel size of the TCN.dropout (
float
) – dropout rate. If non-zero, introduces a Dropout layer before each TCN block. Default is 0.5.
Examples
>>> import torch
>>> from pyhealth.models import TCNLayer
>>> input = torch.randn(3, 128, 5)  # [batch size, sequence len, input_size]
>>> layer = TCNLayer(5, 64)
>>> outputs, last_outputs = layer(input)
>>> outputs.shape
torch.Size([3, 128, 64])
>>> last_outputs.shape
torch.Size([3, 64])
- forward(x, mask=None)[source]#
Forward propagation.
- Parameters:
x (
tensor
) – a tensor of shape [batch size, sequence len, input size].mask (
Optional
[tensor
]) – an optional tensor of shape [batch size, sequence len], where 1 indicates valid and 0 indicates invalid.
- Returns:
- a tensor of shape [batch size, hidden size], containing
the output features for the last time step.
- out: a tensor of shape [batch size, sequence len, hidden size],
containing the output features for each time step.
- Return type:
last_out
- class pyhealth.models.TCN(dataset, feature_keys, label_key, mode, embedding_dim=128, num_channels=128, **kwargs)[source]#
Bases:
BaseModel
Temporal Convolutional Networks model.
This model applies a separate TCN layer for each feature, and then concatenates the final hidden states of each TCN layer. The concatenated hidden states are then fed into a fully connected layer to make predictions.
Note
We use separate TCN layers for different feature_keys. Currently, we automatically support different input formats:
code based input (need to use the embedding table later)
float/int based value input
- We follow the current convention for the TCN model:
- case 1. [code1, code2, code3, …]
we will assume the code follows the order; our model will encode
each code into a vector and apply TCN on the code level
- case 2. [[code1, code2]] or [[code1, code2], [code3, code4, code5], …]
we will assume the inner bracket follows the order; our model first
uses the embedding table to encode each code into a vector, applies mean pooling to get one vector per inner bracket, and then applies TCN on the bracket level
- case 3. [[1.5, 2.0, 0.0]] or [[1.5, 2.0, 0.0], [8, 1.2, 4.5], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run TCN directly on the inner bracket level, similar to case 1 after embedding table
- case 4. [[[1.5, 2.0, 0.0]]] or [[[1.5, 2.0, 0.0], [8, 1.2, 4.5]], …]
this case only makes sense when each inner bracket has the same length;
we assume each dimension has the same meaning; we run TCN directly on the inner bracket level, similar to case 2 after embedding table
- Parameters:
dataset (
SampleEHRDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim (
int
) – the embedding dimension. Default is 128.num_channels (
int
) – the number of channels in the TCN layer. Default is 128.**kwargs – other parameters for the TCN layer.
Examples
>>> from pyhealth.datasets import SampleEHRDataset >>> samples = [ ... { ... "patient_id": "patient-0", ... "visit_id": "visit-0", ... "list_codes": ["505800458", "50580045810", "50580045811"], # NDC ... "list_vectors": [[1.0, 2.55, 3.4], [4.1, 5.5, 6.0]], ... "list_list_codes": [["A05B", "A05C", "A06A"], ["A11D", "A11E"]], # ATC-4 ... "list_list_vectors": [ ... [[1.8, 2.25, 3.41], [4.50, 5.9, 6.0]], ... [[7.7, 8.5, 9.4]], ... ], ... "label": 1, ... }, ... { ... "patient_id": "patient-0", ... "visit_id": "visit-1", ... "list_codes": [ ... "55154191800", ... "551541928", ... "55154192800", ... "705182798", ... "70518279800", ... ], ... "list_vectors": [[1.4, 3.2, 3.5], [4.1, 5.9, 1.7], [4.5, 5.9, 1.7]], ... "list_list_codes": [["A04A", "B035", "C129"]], ... "list_list_vectors": [ ... [[1.0, 2.8, 3.3], [4.9, 5.0, 6.6], [7.7, 8.4, 1.3], [7.7, 8.4, 1.3]], ... ], ... "label": 0, ... }, ... ] >>> dataset = SampleEHRDataset(samples=samples, dataset_name="test") >>> >>> from pyhealth.models import TCN >>> model = TCN( ... dataset=dataset, ... feature_keys=[ ... "list_codes", ... "list_vectors", ... "list_list_codes", ... "list_list_vectors", ... ], ... label_key="label", ... mode="binary", ... ) >>> >>> from pyhealth.datasets import get_dataloader >>> train_loader = get_dataloader(dataset, batch_size=2, shuffle=True) >>> data_batch = next(iter(train_loader)) >>> >>> ret = model(**data_batch) >>> print(ret) { 'loss': tensor(1.1641, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'y_prob': tensor([[0.6837], [0.3081]], grad_fn=<SigmoidBackward0>), 'y_true': tensor([[0.], [1.]]), 'logit': tensor([[ 0.7706], [-0.8091]], grad_fn=<AddmmBackward0>) } >>>
- forward(**kwargs)[source]#
Forward propagation.
The label kwargs[self.label_key] is a list of labels for each patient.
- Parameters:
**kwargs – keyword arguments for the model. The keys must contain all the feature keys and the label key.
- Returns:
loss: a scalar tensor representing the loss. y_prob: a tensor representing the predicted probabilities. y_true: a tensor representing the true labels.
- Return type:
A dictionary with the following keys
pyhealth.models.GAN#
The GAN model (the PyHealth trainer does not apply to GAN; refer to example/ChestXray-image-generation-GAN.ipynb for an example of using the GAN model).
- class pyhealth.models.GAN(input_channel, input_size, hidden_dim=128, **kwargs)[source]#
Bases:
Module
GAN model (takes 128x128, 64x64, or 32x32 images)
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.
Note
We use CNN models as the encoder and decoder layers for now.
- Parameters:
input_channel (int) – number of channels of the input images.
input_size (int) – size of the (square) input images; 128, 64, or 32 are supported.
hidden_dim (int) – the hidden dimension. Default is 128.
**kwargs – other parameters for the encoder and decoder layers.
Examples:
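A minimal instantiation sketch (the channel and image-size values below are illustrative; training itself is done outside pyhealth.trainer, e.g. following the notebook referenced above):
>>> from pyhealth.models import GAN
>>> # a GAN for single-channel 64x64 images; 128x128, 64x64, and 32x32 inputs are supported
>>> model = GAN(input_channel=1, input_size=64)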
pyhealth.models.VAE#
The VAE model (treated as a regression task).
- class pyhealth.models.VAE(dataset, feature_keys, label_key, input_channel, input_size, mode, hidden_dim=128, **kwargs)[source]#
Bases:
BaseModel
VAE model (takes 128x128, 64x64, or 32x32 images)
Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.”
Note
We use CNN models as the encoder and decoder layers for now.
- Parameters:
dataset (
BaseSignalDataset
) – the dataset to train the model. It is used to query certain information such as the set of all tokens.feature_keys (
List
[str
]) – list of keys in samples to use as features, e.g. [“conditions”, “procedures”].label_key (
str
) – key in samples to use as label (e.g., “drugs”).mode (
str
) – one of “binary”, “multiclass”, or “multilabel”.embedding_dim – the embedding dimension. Default is 128.
hidden_dim (
int
) – the hidden dimension. Default is 128.**kwargs – other parameters for the encoder and decoder layers.
Examples:
- forward(**kwargs)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
Trainer#
- class pyhealth.trainer.Trainer(model, checkpoint_path=None, metrics=None, device=None, enable_logging=True, output_path=None, exp_name=None)[source]#
Bases:
object
Trainer for PyTorch models.
- Parameters:
model (
Module
) – PyTorch model.checkpoint_path (
Optional
[str
]) – Path to the checkpoint. Default is None, which means the model will be randomly initialized.metrics (
Optional
[List
[str
]]) – List of metric names to be calculated. Default is None, which means the default metrics in each metrics_fn will be used.device (
Optional
[str
]) – Device to be used for training. Default is None, which means the device will be GPU if available, otherwise CPU.enable_logging (
bool
) – Whether to enable logging. Default is True.output_path (
Optional
[str
]) – Path to save the output. Default is “./output”.exp_name (
Optional
[str
]) – Name of the experiment. Default is current datetime.
- train(train_dataloader, val_dataloader=None, test_dataloader=None, epochs=5, optimizer_class=<class 'torch.optim.adam.Adam'>, optimizer_params=None, steps_per_epoch=None, evaluation_steps=1, weight_decay=0.0, max_grad_norm=None, monitor=None, monitor_criterion='max', load_best_model_at_last=True)[source]#
Trains the model.
- Parameters:
train_dataloader (
DataLoader
) – Dataloader for training.val_dataloader (
Optional
[DataLoader
]) – Dataloader for validation. Default is None.test_dataloader (
Optional
[DataLoader
]) – Dataloader for testing. Default is None.epochs (
int
) – Number of epochs. Default is 5.optimizer_class (
Type
[Optimizer
]) – Optimizer class. Default is torch.optim.Adam.optimizer_params (
Optional
[Dict
[str
,object
]]) – Parameters for the optimizer. Default is {“lr”: 1e-3}.steps_per_epoch (
Optional
[int
]) – Number of steps per epoch. Default is None.weight_decay (
float
) – Weight decay. Default is 0.0.max_grad_norm (
Optional
[float
]) – Maximum gradient norm. Default is None.monitor (
Optional
[str
]) – Metric name to monitor. Default is None.monitor_criterion (
str
) – Criterion to monitor. Default is “max”.load_best_model_at_last (
bool
) – Whether to load the best model at the end of training. Default is True.
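Example (a minimal sketch; model, train_loader, and val_loader are assumed to come from the earlier pipeline steps, and the monitored metric assumes a binary task):
>>> from pyhealth.trainer import Trainer
>>> trainer = Trainer(model=model, metrics=["pr_auc", "roc_auc"])
>>> trainer.train(
...     train_dataloader=train_loader,
...     val_dataloader=val_loader,
...     epochs=5,
...     monitor="pr_auc",
...     monitor_criterion="max",
... )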
- inference(dataloader, additional_outputs=None, return_patient_ids=False)[source]#
Model inference.
- Parameters:
dataloader – Dataloader for evaluation.
additional_outputs – List of additional outputs to collect. Defaults to None ([]).
return_patient_ids – Whether to also return the patient ids, in the same order as the predictions. Defaults to False.
- Returns:
List of true labels. y_prob_all: List of predicted probabilities. loss_mean: Mean loss over batches. additional_outputs (only if requested): Dict of additional results. patient_ids (only if requested): List of patient ids in the same order as y_true_all/y_prob_all.
- Return type:
y_true_all
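Example (a sketch assuming a trained trainer and a test_loader; the unpacking follows the documented return order when no additional outputs are requested):
>>> y_true, y_prob, loss = trainer.inference(test_loader)
>>> y_true, y_prob, loss, patient_ids = trainer.inference(test_loader, return_patient_ids=True)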
Tokenizer#
The tokenizer functionality supports token-to-index and index-to-token mapping in general ML settings.
- class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#
Bases:
object
Vocabulary class for mapping between tokens and indices.
- class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#
Bases:
object
Tokenizer class for converting tokens to indices and vice versa.
This class will build a vocabulary from the provided tokens and provide the functionality to convert tokens to indices and vice versa. This class also provides the functionality to tokenize a batch of data.
Examples
>>> from pyhealth.tokenizer import Tokenizer >>> token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E', ... 'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C', ... 'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X', ... 'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A', ... 'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A'] >>> tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
- get_vocabulary_size()[source]#
Returns the size of the vocabulary.
Examples
>>> tokenizer.get_vocabulary_size() 44
- convert_tokens_to_indices(tokens)[source]#
Converts a list of tokens to indices.
Examples
>>> tokens = ['A03C', 'A03D', 'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'B035', 'C129'] >>> indices = tokenizer.convert_tokens_to_indices(tokens) >>> print(indices) [8, 9, 10, 11, 12, 13, 14, 1, 1]
- convert_indices_to_tokens(indices)[source]#
Converts a list of indices to tokens.
Examples
>>> indices = [0, 1, 2, 3, 4, 5] >>> tokens = tokenizer.convert_indices_to_tokens(indices) >>> print(tokens) ['<pad>', '<unk>', 'A01A', 'A02A', 'A02B', 'A02X']
- batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#
Converts a list of lists of tokens (2D) to indices.
- Parameters:
batch (
List
[List
[str
]]) – List of lists of tokens to convert to indices.padding (
bool
) – whether to pad the tokens to the max number of tokens in the batch (smart padding).truncation (
bool
) – whether to truncate the tokens to max_length.max_length (
int
) – maximum length of the tokens. This argument is ignored if truncation is False.
Examples
>>> tokens = [ ... ['A03C', 'A03D', 'A03E', 'A03F'], ... ['A04A', 'B035', 'C129'] ... ]
>>> indices = tokenizer.batch_encode_2d(tokens) >>> print ('case 1:', indices) case 1: [[8, 9, 10, 11], [12, 1, 1, 0]]
>>> indices = tokenizer.batch_encode_2d(tokens, padding=False) >>> print ('case 2:', indices) case 2: [[8, 9, 10, 11], [12, 1, 1]]
>>> indices = tokenizer.batch_encode_2d(tokens, max_length=3) >>> print ('case 3:', indices) case 3: [[9, 10, 11], [12, 1, 1]]
- batch_decode_2d(batch, padding=False)[source]#
Converts a list of lists of indices (2D) to tokens.
- Parameters:
Examples
>>> indices = [ ... [8, 9, 10, 11], ... [12, 1, 1, 0] ... ]
>>> tokens = tokenizer.batch_decode_2d(indices) >>> print ('case 1:', tokens) case 1: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
>>> tokens = tokenizer.batch_decode_2d(indices, padding=True) >>> print ('case 2:', tokens) case 2: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]
- batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#
Converts a list of lists of lists of tokens (3D) to indices.
- Parameters:
batch (
List
[List
[List
[str
]]]) – List of lists of lists of tokens to convert to indices.padding (
Tuple
[bool
,bool
]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).truncation (
Tuple
[bool
,bool
]) – a tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_lengthmax_length (
Tuple
[int
,int
]) – a tuple of two integers indicating the maximum length of the tokens along the first and second dimension. This argument is ignored if truncation is False.
Examples
>>> tokens = [ ... [ ... ['A03C', 'A03D', 'A03E', 'A03F'], ... ['A08A', 'A09A'], ... ], ... [ ... ['A04A', 'B035', 'C129'], ... ] ... ]
>>> indices = tokenizer.batch_encode_3d(tokens) >>> print ('case 1:', indices) case 1: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, True)) >>> print ('case 2:', indices) case 2: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(True, False)) >>> print ('case 3:', indices) case 3: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1], [0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, False)) >>> print ('case 4:', indices) case 4: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
>>> indices = tokenizer.batch_encode_3d(tokens, max_length=(2,2)) >>> print ('case 5:', indices) case 5: [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]
- batch_decode_3d(batch, padding=False)[source]#
Converts a list of lists of lists of indices (3D) to tokens.
- Parameters:
Examples
>>> indices = [ ... [ ... [8, 9, 10, 11], ... [24, 25, 0, 0] ... ], ... [ ... [12, 1, 1, 0], ... [0, 0, 0, 0] ... ] ... ]
>>> tokens = tokenizer.batch_decode_3d(indices) >>> print ('case 1:', tokens) case 1: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
>>> tokens = tokenizer.batch_decode_3d(indices, padding=True) >>> print ('case 2:', tokens) case 2: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']], [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']]]
Metrics#
We provide easy-to-use metrics (with the same style and arguments as sklearn.metrics) for binary, multiclass, and multilabel classification. For applicable tasks, we also provide metrics for uncertainty quantification, covering model calibration and the quality of prediction sets. In addition, we provide metrics specifically for healthcare tasks, such as the drug-drug interaction (DDI) rate.
pyhealth.metrics.multiclass#
- pyhealth.metrics.multiclass.multiclass_metrics_fn(y_true, y_prob, metrics=None, y_predset=None)[source]#
Computes metrics for multiclass classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
- roc_auc_macro_ovo: area under the receiver operating characteristic curve,
macro averaged over one-vs-one multiclass classification
- roc_auc_macro_ovr: area under the receiver operating characteristic curve,
macro averaged over one-vs-rest multiclass classification
- roc_auc_weighted_ovo: area under the receiver operating characteristic curve,
weighted averaged over one-vs-one multiclass classification
- roc_auc_weighted_ovr: area under the receiver operating characteristic curve,
weighted averaged over one-vs-rest multiclass classification
accuracy: accuracy score
- balanced_accuracy: balanced accuracy score (usually used for imbalanced
datasets)
f1_micro: f1 score, micro averaged
f1_macro: f1 score, macro averaged
f1_weighted: f1 score, weighted averaged
jaccard_micro: Jaccard similarity coefficient score, micro averaged
jaccard_macro: Jaccard similarity coefficient score, macro averaged
jaccard_weighted: Jaccard similarity coefficient score, weighted averaged
cohen_kappa: Cohen’s kappa score
brier_top1: brier score between the top prediction and the true label
ECE: Expected Calibration Error (with 20 equal-width bins). Check
pyhealth.metrics.calibration.ece_confidence_multiclass()
.ECE_adapt: adaptive ECE (with 20 equal-size bins). Check
pyhealth.metrics.calibration.ece_confidence_multiclass()
.cwECEt: classwise ECE with threshold=min(0.01,1/K). Check
pyhealth.metrics.calibration.ece_classwise()
.cwECEt_adapt: classwise adaptive ECE with threshold=min(0.01,1/K). Check
pyhealth.metrics.calibration.ece_classwise()
.
- The following metrics related to the prediction sets are accepted as well, but will be ignored if y_predset is None:
rejection_rate: Frequency of rejection, where rejection happens when the prediction set has cardinality other than 1. Check
pyhealth.metrics.prediction_set.rejection_rate()
.set_size: Average size of the prediction sets. Check
pyhealth.metrics.prediction_set.size()
.miscoverage_ps: Prob(k not in prediction set). Check
pyhealth.metrics.prediction_set.miscoverage_ps()
.miscoverage_mean_ps: The average (across different classes k) of miscoverage_ps.
miscoverage_overall_ps: Prob(Y not in prediction set). Check
pyhealth.metrics.prediction_set.miscoverage_overall_ps()
.error_ps: Same as miscoverage_ps, but restricted to un-rejected samples. Check
pyhealth.metrics.prediction_set.error_ps()
.error_mean_ps: The average (across different classes k) of error_ps.
error_overall_ps: Same as miscoverage_overall_ps, but restricted to un-rejected samples. Check
pyhealth.metrics.prediction_set.error_overall_ps()
.
If no metrics are specified, accuracy, f1_macro, and f1_micro are computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters:
- Return type:
- Returns:
- Dictionary of metrics whose keys are the metric names and values are
the metric values.
Examples
>>> import numpy as np
>>> from pyhealth.metrics import multiclass_metrics_fn
>>> y_true = np.array([0, 1, 2, 2])
>>> y_prob = np.array([[0.9, 0.05, 0.05],
...                    [0.05, 0.9, 0.05],
...                    [0.05, 0.05, 0.9],
...                    [0.6, 0.2, 0.2]])
>>> multiclass_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}
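A hedged sketch of evaluating prediction sets with the same y_true and y_prob (the boolean prediction sets below are made up for illustration, e.g. as produced by a conformal predictor; by the definitions above, one of the four sets has size != 1, so rejection_rate should be 0.25 and the average set_size 1.25):
>>> y_predset = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]])
>>> multiclass_metrics_fn(y_true, y_prob, metrics=["rejection_rate", "set_size"],
...                       y_predset=y_predset)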
pyhealth.metrics.multilabel#
- pyhealth.metrics.multilabel.multilabel_metrics_fn(y_true, y_prob, metrics=None, threshold=0.3, y_predset=None)[source]#
Computes metrics for multilabel classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
roc_auc_micro: area under the receiver operating characteristic curve, micro averaged
roc_auc_macro: area under the receiver operating characteristic curve, macro averaged
roc_auc_weighted: area under the receiver operating characteristic curve, weighted averaged
roc_auc_samples: area under the receiver operating characteristic curve, samples averaged
pr_auc_micro: area under the precision recall curve, micro averaged
pr_auc_macro: area under the precision recall curve, macro averaged
pr_auc_weighted: area under the precision recall curve, weighted averaged
pr_auc_samples: area under the precision recall curve, samples averaged
accuracy: accuracy score
f1_micro: f1 score, micro averaged
f1_macro: f1 score, macro averaged
f1_weighted: f1 score, weighted averaged
f1_samples: f1 score, samples averaged
precision_micro: precision score, micro averaged
precision_macro: precision score, macro averaged
precision_weighted: precision score, weighted averaged
precision_samples: precision score, samples averaged
recall_micro: recall score, micro averaged
recall_macro: recall score, macro averaged
recall_weighted: recall score, weighted averaged
recall_samples: recall score, samples averaged
jaccard_micro: Jaccard similarity coefficient score, micro averaged
jaccard_macro: Jaccard similarity coefficient score, macro averaged
jaccard_weighted: Jaccard similarity coefficient score, weighted averaged
jaccard_samples: Jaccard similarity coefficient score, samples averaged
ddi: drug-drug interaction score (specifically for drug-related tasks, such as drug recommendation)
hamming_loss: Hamming loss
cwECE: classwise ECE (with 20 equal-width bins). Check
pyhealth.metrics.calibration.ece_classwise()
.cwECE_adapt: classwise adaptive ECE (with 20 equal-size bins). Check
pyhealth.metrics.calibration.ece_classwise()
.
- The following metrics related to the prediction sets are accepted as well, but will be ignored if y_predset is None:
fp: Number of false positives.
tp: Number of true positives.
If no metrics are specified, pr_auc_samples is computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters:
y_true (
ndarray
) – True target values of shape (n_samples, n_labels).y_prob (
ndarray
) – Predicted probabilities of shape (n_samples, n_labels).metrics (
Optional
[List
[str
]]) – List of metrics to compute. Default is [“pr_auc_samples”].threshold (
float
) – Threshold to binarize the predicted probabilities. Default is 0.3.
- Return type:
- Returns:
- Dictionary of metrics whose keys are the metric names and values are
the metric values.
Examples
>>> import numpy as np
>>> from pyhealth.metrics import multilabel_metrics_fn
>>> y_true = np.array([[0, 1, 1], [1, 0, 1]])
>>> y_prob = np.array([[0.1, 0.9, 0.8], [0.05, 0.95, 0.6]])
>>> multilabel_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.5}
pyhealth.metrics.binary#
- pyhealth.metrics.binary.binary_metrics_fn(y_true, y_prob, metrics=None, threshold=0.5)[source]#
Computes metrics for binary classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
pr_auc: area under the precision-recall curve
roc_auc: area under the receiver operating characteristic curve
accuracy: accuracy score
balanced_accuracy: balanced accuracy score (usually used for imbalanced datasets)
f1: f1 score
precision: precision score
recall: recall score
cohen_kappa: Cohen’s kappa score
jaccard: Jaccard similarity coefficient score
ECE: Expected Calibration Error (with 20 equal-width bins). Check
pyhealth.metrics.calibration.ece_confidence_binary()
.ECE_adapt: adaptive ECE (with 20 equal-size bins). Check
pyhealth.metrics.calibration.ece_confidence_binary()
.
If no metrics are specified, pr_auc, roc_auc and f1 are computed by default.
This function calls sklearn.metrics functions to compute the metrics. For more information on the metrics, please refer to the documentation of the corresponding sklearn.metrics functions.
- Parameters:
- Return type:
- Returns:
- Dictionary of metrics whose keys are the metric names and values are
the metric values.
Examples
>>> import numpy as np
>>> from pyhealth.metrics import binary_metrics_fn
>>> y_true = np.array([0, 0, 1, 1])
>>> y_prob = np.array([0.1, 0.4, 0.35, 0.8])
>>> binary_metrics_fn(y_true, y_prob, metrics=["accuracy"])
{'accuracy': 0.75}
[core] calibration#
Metrics that measure model calibration.
Reference Papers:
[1] Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” ICLR 2023.
[2] Nixon, Jeremy, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. “Measuring Calibration in Deep Learning.” In CVPR workshops, vol. 2, no. 7. 2019.
[3] Patel, Kanil, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. “Multi-class uncertainty calibration via mutual information maximization-based binning.” ICLR 2021.
[4] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On calibration of modern neural networks.” ICML 2017.
[5] Kull, Meelis, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.” Advances in neural information processing systems 32 (2019).
[6] Brier, Glenn W. “Verification of forecasts expressed in terms of probability.” Monthly weather review 78, no. 1 (1950): 1-3.
- pyhealth.metrics.calibration.ece_confidence_multiclass(prob, label, bins=20, adaptive=False)[source]#
Expected Calibration Error (ECE).
We group samples into ‘bins’ based on the top-class prediction. Then, we compute the absolute difference between the average top-class prediction and the frequency of the top class being correct (i.e. accuracy) for each bin. ECE is the average (weighted by the number of points in each bin) of these absolute differences. It can be expressed by the following formula, with \(B_m\) denoting the m-th bin:
\[ECE = \sum_{m=1}^M \frac{|B_m|}{N} |acc(B_m) - conf(B_m)|\]
Example
>>> import numpy as np
>>> from pyhealth.metrics.calibration import ece_confidence_multiclass
>>> pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
>>> label = np.asarray([2, 1, 2])
>>> ece_confidence_multiclass(pred, label, bins=2)
0.36333333333333334
Explanation of the example: The bins are [0, 0.5] and (0.5, 1]. In the first bin, we have one sample with top-class prediction of 0.49, and its accuracy is 0. In the second bin, we have average confidence of 0.7 and average accuracy of 1. Thus, the ECE is \(\frac{1}{3} \cdot 0.49 + \frac{2}{3}\cdot 0.3=0.3633\).
- Parameters:
- pyhealth.metrics.calibration.ece_confidence_binary(prob, label, bins=20, adaptive=False)[source]#
Expected Calibration Error (ECE) for binary classification.
Similar to
ece_confidence_multiclass()
, but on class 1 instead of the top-prediction.- Parameters:
- pyhealth.metrics.calibration.ece_classwise(prob, label, bins=20, threshold=0.0, adaptive=False)[source]#
Classwise Expected Calibration Error (ECE).
This is equivalent to applying
ece_confidence_binary()
to each class and take the average.- Parameters:
prob (np.ndarray) – (N, C)
label (np.ndarray) – (N,)
bins (int, optional) – Number of bins. Defaults to 20.
threshold (float) – threshold to filter out samples. If the number of classes C is very large, many classes receive close to 0 prediction. Any prediction below threshold is considered noise and ignored. In recent papers, this is typically set to a small number (such as 1/C).
adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]) If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.
[core] prediction_set#
- pyhealth.metrics.prediction_set.rejection_rate(y_pred)[source]#
Rejection rate, defined as the proportion of samples with prediction set size != 1
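Example (an illustrative sketch assuming a boolean indicator array of shape (n_samples, n_classes), as in the examples below; one of the four prediction sets has size != 1, so the expected rate is 0.25):
>>> import numpy as np
>>> from pyhealth.metrics.prediction_set import rejection_rate
>>> y_pred = np.asarray([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
>>> rejection_rate(y_pred)
0.25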
- pyhealth.metrics.prediction_set.miscoverage_ps(y_pred, y_true)[source]#
Miscoverage rates for all samples (similar to recall).
Example
>>> import numpy as np
>>> from pyhealth.metrics.prediction_set import miscoverage_ps
>>> y_pred = np.asarray([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
>>> y_true = np.asarray([1, 0, 1, 2])
>>> miscoverage_ps(y_pred, y_true)
array([0. , 0.5, 1. ])
Explanation: For class 0, the 1-th prediction set ({0}) contains the label, so the miscoverage is 0/1=0. For class 1, the 0-th prediction set ({0}) does not contain the label, while the 2-th prediction set ({0,1}) does. Thus, the miscoverage is 1/2=0.5. For class 2, the last prediction set is {1} and the label is 2, so the miscoverage is 1/1=1.
- pyhealth.metrics.prediction_set.error_ps(y_pred, y_true)[source]#
Miscoverage rates for unrejected samples, where rejection is defined as a prediction set with size != 1.
Example
>>> y_pred = np.asarray([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
>>> y_true = np.asarray([1, 0, 1, 2])
>>> error_ps(y_pred, y_true)
array([0., 1., 1.])
Explanation: For class 0, the 1-th sample is correct and not rejected, so the error is 0/1=0. For class 1, the 0-th sample is incorrect and not rejected, and the 2-th is rejected. Thus, the error is 1/1=1. For class 2, the last sample is not rejected but the prediction set is {1}, so the error is 1/1=1.
- pyhealth.metrics.prediction_set.miscoverage_overall_ps(y_pred, y_true)[source]#
Miscoverage rate for the true label. Only for multiclass.
Example
>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0]])
>>> y_true = np.asarray([1,0,1])
>>> miscoverage_overall_ps(y_pred, y_true)
0.333333
Explanation: The prediction set for sample 0 is {0} and the label is 1 (not covered). The prediction set for sample 1 is {0} and the label is 0 (covered). The prediction set for sample 2 is {0,1} and the label is 1 (covered). Thus, the miscoverage rate is 1/3.
- pyhealth.metrics.prediction_set.error_overall_ps(y_pred, y_true)[source]#
Overall error rate for the un-rejected samples.
Example
>>> y_pred = np.asarray([[1,0,0],[1,0,0],[1,1,0]])
>>> y_true = np.asarray([1,0,1])
>>> error_overall_ps(y_pred, y_true)
0.5
Explanation: The prediction set for sample 0 is {0} and the label is 1, so it is an error (the sample is not rejected, as its prediction set has only one class). Sample 1 is not rejected and incurs no error. Sample 2 is rejected and thus excluded from the computation.
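Both rejection_rate and error_overall_ps can be reproduced with a few lines of NumPy; the helpers below are illustrative sketches, not the PyHealth implementations.

import numpy as np

def rejection_rate_sketch(y_pred):
    """Fraction of samples whose prediction set size is not exactly 1."""
    return (y_pred.sum(axis=1) != 1).mean()

def error_overall_ps_sketch(y_pred, y_true):
    """Error rate among un-rejected samples (prediction sets of size 1)."""
    unrejected = y_pred.sum(axis=1) == 1
    covered = y_pred[np.arange(len(y_true)), y_true].astype(bool)
    return (~covered[unrejected]).mean()

y_pred = np.asarray([[1, 0, 0], [1, 0, 0], [1, 1, 0]])
y_true = np.asarray([1, 0, 1])
print(rejection_rate_sketch(y_pred))            # 1/3: one of three sets has size != 1
print(error_overall_ps_sketch(y_pred, y_true))  # 0.5, matching the example above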
pyhealth.metrics.fairness#
- pyhealth.metrics.fairness.fairness_metrics_fn(y_true, y_prob, sensitive_attributes, favorable_outcome=1, metrics=None, threshold=0.5)[source]#
Computes fairness metrics for binary classification.
User can specify which metrics to compute by passing a list of metric names. The accepted metric names are:
disparate_impact:
statistical_parity_difference:
If no metrics are specified, disparate_impact and statistical_parity_difference are computed by default.
- Parameters:
y_true (ndarray) – True target values of shape (n_samples,).
y_prob (ndarray) – Predicted probabilities of shape (n_samples,).
sensitive_attributes (ndarray) – Sensitive attributes of shape (n_samples,) where 1 is the protected group and 0 is the unprotected group.
favorable_outcome (int) – Label value which is considered favorable (i.e. “positive”).
metrics (Optional[List[str]]) – List of metrics to compute. Default is [“disparate_impact”, “statistical_parity_difference”].
threshold (float) – Threshold for binary classification. Default is 0.5.
- Return type:
- Returns:
Dictionary of metrics whose keys are the metric names and values are the metric values.
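A minimal usage sketch with synthetic inputs follows; the arrays are made up for illustration, and the import path simply mirrors the module documented here.

import numpy as np
from pyhealth.metrics.fairness import fairness_metrics_fn

# Synthetic ground truth, predicted probabilities, and a binary sensitive attribute
y_true = np.asarray([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.asarray([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])
sensitive_attributes = np.asarray([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = protected group

print(fairness_metrics_fn(y_true, y_prob, sensitive_attributes,
                          metrics=["disparate_impact",
                                   "statistical_parity_difference"]))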
- pyhealth.metrics.fairness_utils.disparate_impact(sensitive_attributes, y_pred, favorable_outcome=1, allow_zero_division=False, epsilon=1e-08)[source]#
Computes the disparate impact between the protected and unprotected group.
disparate_impact = P(y_pred = favorable_outcome | P) / P(y_pred = favorable_outcome | U)
- Parameters:
sensitive_attributes (ndarray) – Sensitive attributes of shape (n_samples,) where 1 is the protected group and 0 is the unprotected group.
y_pred (ndarray) – Predicted target values of shape (n_samples,).
favorable_outcome (int) – Label value which is considered favorable (i.e. “positive”).
allow_zero_division – If True, use epsilon instead of 0 in the denominator if the denominator is 0. Otherwise, raise a ValueError.
- Return type:
- Returns:
The disparate impact between the protected and unprotected group.
- pyhealth.metrics.fairness_utils.statistical_parity_difference(sensitive_attributes, y_pred, favorable_outcome=1)[source]#
Computes the statistical parity difference between the protected and unprotected group.
statistical_parity_difference = P(y_pred = favorable_outcome | P) - P(y_pred = favorable_outcome | U)
- Parameters:
sensitive_attributes (ndarray) – Sensitive attributes of shape (n_samples,) where 1 is the protected group and 0 is the unprotected group.
y_pred (ndarray) – Predicted target values of shape (n_samples,).
favorable_outcome (int) – Label value which is considered favorable (i.e. “positive”).
- Return type:
- Returns:
The statistical parity difference between the protected and unprotected group.
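To make the two formulas concrete, both quantities can be computed by hand from hard predictions; the arrays below are made up for illustration.

import numpy as np

y_pred = np.asarray([1, 0, 1, 1, 1, 0, 0, 1])     # predicted labels
sensitive = np.asarray([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = protected, 0 = unprotected
favorable = 1

p_protected = (y_pred[sensitive == 1] == favorable).mean()    # P(y_pred = favorable | P)
p_unprotected = (y_pred[sensitive == 0] == favorable).mean()  # P(y_pred = favorable | U)

print(p_protected / p_unprotected)   # disparate impact (ratio): 1.5
print(p_protected - p_unprotected)   # statistical parity difference: 0.25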
- pyhealth.metrics.fairness_utils.sensitive_attributes_from_patient_ids(dataset, patient_ids, sensitive_attribute, protected_group)[source]#
Returns the desired sensitive attribute array from patient_ids.
- Parameters:
dataset (BaseEHRDataset) – Dataset object.
sensitive_attribute (str) – Sensitive attribute to extract.
protected_group (str) – Value of the protected group.
- Return type:
ndarray
- Returns:
Sensitive attribute array of shape (n_samples,).
MedCode#
We provide medical code mapping tools for (i) ontology mapping within one coding system and (ii) mapping the same concept across different coding systems.
- class pyhealth.medcode.InnerMap(vocabulary, refresh_cache=False)[source]#
Bases:
ABC
Contains information for a specific medical code system.
InnerMap is a base abstract class for all medical code systems. It will be instantiated as a specific medical code system with InnerMap.load(vocabulary).
Note
This class cannot be instantiated using __init__() (throws an error).
- classmethod load(vocabulary, refresh_cache=False)[source]#
Initializes a specific medical code system inheriting from InnerMap.
- Parameters:
Examples
>>> from pyhealth.medcode import InnerMap
>>> icd9cm = InnerMap.load("ICD9CM")
>>> icd9cm.lookup("428.0")
'Congestive heart failure, unspecified'
>>> icd9cm.get_ancestors("428.0")
['428', '420-429.99', '390-459.99', '001-999.99']
- static standardize(code)[source]#
Standardizes a given code.
Subclass will override this method based on different medical code systems.
- Return type:
- static convert(code, **kwargs)[source]#
Converts a given code.
Subclass will override this method based on different medical code systems.
- Return type:
- class pyhealth.medcode.CrossMap(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#
Bases:
object
Contains mapping between two medical code systems.
CrossMap is a base class for all possible mappings. It will be initialized with two specific medical code systems with CrossMap.load(source_vocabulary, target_vocabulary).
- classmethod load(source_vocabulary, target_vocabulary, refresh_cache=False)[source]#
Initializes the mapping between two medical code systems.
- Parameters:
Examples
>>> from pyhealth.medcode import CrossMap
>>> mapping = CrossMap("ICD9CM", "CCSCM")
>>> mapping.map("428.0")
['108']
>>> mapping = CrossMap.load("NDC", "ATC")
>>> mapping.map("00527051210", target_kwargs={"level": 3})
['A11C']
- map(source_code, source_kwargs=None, target_kwargs=None)[source]#
Maps a source code to a list of target codes.
- Parameters:
source_code (str) – source code.
**source_kwargs (Optional[Dict]) – additional arguments for the source code. Will be passed to self.s_class.convert(). Default is empty dict.
**target_kwargs (Optional[Dict]) – additional arguments for the target code. Will be passed to self.t_class.convert(). Default is empty dict.
- Return type:
- Returns:
A list of target codes.
Diagnosis codes:#
- class pyhealth.medcode.ICD9CM(**kwargs)[source]#
Bases:
InnerMap
International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM).
Procedure codes:#
- class pyhealth.medcode.ICD9PROC(**kwargs)[source]#
Bases:
InnerMap
International Classification of Diseases, 9th Revision, Procedures.
Medication codes:#
Calibration and Uncertainty Quantification#
This module implements the following prediction set constructors and model calibration methods, which can be combined with any PyHealth model.
pyhealth.calib.calibration#
Model calibration methods
- class pyhealth.calib.calibration.DirichletCalibration(model, debug=False, **kwargs)[source]#
Bases:
PostHocCalibrator
Dirichlet Calibration
Dirichlet calibration is similar to retraining a linear layer mapping from the old logits to the new logits with regularizations. This is a calibration method for multiclass classification only.
Paper:
[1] Kull, Meelis, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.” Advances in neural information processing systems 32 (2019).
- Parameters:
model (BaseModel) – A trained base model.
Examples
>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import DirichletCalibration
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = DirichletCalibration(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7096615988229524, 'cwECEt_adapt': 0.05336195546573208}
- calibrate(cal_dataset, lr=0.01, max_iter=128, reg_lambda=0.001)[source]#
Calibrate the base model using a calibration dataset.
- Parameters:
- Returns:
None
- Return type:
None
- forward(**kwargs)[source]#
Forward propagation (just like the original model).
- Parameters:
**kwargs –
Additional arguments to the base model.
- Returns:
A dictionary with all results from the base model, with the following modified:
y_prob: calibrated predicted probabilities.
loss: Cross entropy loss with the new y_prob.
logit: temperature-scaled logits.
- Return type:
Dict[str, torch.Tensor]
- class pyhealth.calib.calibration.HistogramBinning(model, debug=False, **kwargs)[source]#
Bases:
PostHocCalibrator
Histogram Binning
Histogram binning amounts to creating bins, computing the accuracy for each bin on the calibration dataset, and then predicting that per-bin accuracy at test time. For multilabel/binary/multiclass classification tasks, we calibrate each class independently following [1]. Users could choose to renormalize the probability scores for multiclass tasks so they sum to 1.
Paper:
[1] Gupta, Chirag, and Aaditya Ramdas. “Top-label calibration and multiclass-to-binary reductions.” ICLR 2022.
[2] Zadrozny, Bianca, and Charles Elkan. “Learning and making decisions when costs and probabilities are both unknown.” In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 204-213. 2001.
- Parameters:
model (BaseModel) – A trained base model.
Examples
>>> from pyhealth.datasets import ISRUCDataset, get_dataloader, split_by_patient
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import HistogramBinning
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = HistogramBinning(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7189072348464207, 'cwECEt_adapt': 0.04455814993598299}
- calibrate(cal_dataset, nbins=15)[source]#
Calibrate the base model using a calibration dataset.
- Parameters:
cal_dataset (Subset) – Calibration set.
nbins (int, optional) – number of bins to use, defaults to 15
- forward(normalization='sum', **kwargs)[source]#
Forward propagation (just like the original model).
- Parameters:
normalization (str, optional) – how to normalize the calibrated probability. Defaults to ‘sum’ (and only ‘sum’ is supported for now).
**kwargs –
Additional arguments to the base model.
- Returns:
A dictionary with all results from the base model, with the following modified:
y_prob: calibrated predicted probabilities.
loss: Cross entropy loss with the new y_prob.
- Return type:
Dict[str, torch.Tensor]
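For intuition, the bin-then-average-accuracy idea can be sketched in plain NumPy for a single class. This is an illustrative sketch under simplifying assumptions (fixed equal-width bins, one class at a time), not the HistogramBinning implementation.

import numpy as np

def fit_histogram_binning(cal_prob, cal_correct, nbins=15):
    """Estimate a per-bin accuracy table from calibration scores of one class."""
    edges = np.linspace(0.0, 1.0, nbins + 1)
    which = np.clip(np.digitize(cal_prob, edges) - 1, 0, nbins - 1)
    bin_acc = np.empty(nbins)
    for b in range(nbins):
        mask = which == b
        # fall back to the bin midpoint when no calibration point lands in the bin
        bin_acc[b] = cal_correct[mask].mean() if mask.any() else (edges[b] + edges[b + 1]) / 2
    return edges, bin_acc

def apply_histogram_binning(prob, edges, bin_acc):
    """Replace each raw score by the accuracy of the bin it falls into."""
    which = np.clip(np.digitize(prob, edges) - 1, 0, len(bin_acc) - 1)
    return bin_acc[which]

# Toy example: over-confident raw scores get pulled toward the observed accuracy
cal_prob = np.asarray([0.9, 0.95, 0.92, 0.3, 0.2, 0.35])
cal_correct = np.asarray([1, 0, 1, 0, 0, 1], dtype=float)
edges, bin_acc = fit_histogram_binning(cal_prob, cal_correct, nbins=5)
print(apply_histogram_binning(np.asarray([0.93, 0.25]), edges, bin_acc))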
- class pyhealth.calib.calibration.KCal(model, debug=False, **kwargs)[source]#
Bases:
PostHocCalibrator
Kernel-based Calibration. This is a full calibration method for multiclass classification. It tries to calibrate the predicted probabilities for all classes, by using KDE classifiers estimated from the calibration set.
Paper:
Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” ICLR 2023.
- Parameters:
model (BaseModel) – A trained model.
Examples
>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import KCal
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = KCal(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Alternatively, you could re-fit the reprojection:
>>> # cal_model.calibrate(cal_dataset=val_data, train_dataset=train_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.7303689172252193, 'cwECEt_adapt': 0.03324275630220515}
- fit(train_dataset, val_dataset=None, split_by_patient=False, dim=32, bs_pred=64, bs_supp=20, epoch_len=5000, epochs=10, load_best_model_at_last=False, **train_kwargs)[source]#
Fit the reprojection module. You don’t need to call this function - it is called in KCal.calibrate(). For training details, please refer to the paper.
- Parameters:
train_dataset (Dataset) – The training dataset.
val_dataset (Dataset, optional) – The validation dataset. Defaults to None.
split_by_patient (bool, optional) – Whether to split the dataset by patient during training. Defaults to False.
dim (int, optional) – The dimension of the embedding. Defaults to 32.
bs_pred (int, optional) – The batch size for the prediction set. Defaults to 64.
bs_supp (int, optional) – The batch size for the support set. Defaults to 20.
epoch_len (int, optional) – The number of batches in an epoch. Defaults to 5000.
epochs (int, optional) – The number of epochs. Defaults to 10.
load_best_model_at_last (bool, optional) – Whether to load the best model (or the last model). Defaults to False.
**train_kwargs – Other keyword arguments for pyhealth.trainer.Trainer.train().
- calibrate(cal_dataset, num_fold=20, record_id_name=None, train_dataset=None, train_split_by_patient=False, load_best_model_at_last=True, **train_kwargs)[source]#
Calibrate using a calibration dataset. If train_dataset is not None, it will be used to fit a re-projection from the base model embeddings. In either case, the calibration set will be used to construct the KDE classifier.
- Parameters:
cal_dataset (Subset) – Calibration set.
record_id_name (str, optional) – the key/name of the unique index for records. Defaults to None.
train_dataset (Subset, optional) – Dataset to train the reprojection. Defaults to None (no training).
train_split_by_patient (bool, optional) – Whether to split by patient when training the embeddings. That is, do we use samples from the same patient in KDE during training. Defaults to False.
load_best_model_at_last (bool, optional) – Whether to load the best reprojection basing on the calibration set. Defaults to True.
train_kwargs (dict, optional) – Additional arguments for training the reprojection. Passed to KCal.fit().
- forward(**kwargs)[source]#
Forward propagation (just like the original model).
- Parameters:
**kwargs –
Additional arguments to the base model.
- Returns:
A dictionary with all results from the base model, with the following modified:
y_prob: calibrated predicted probabilities.
loss: Cross entropy loss with the new y_prob.
- Return type:
Dict[str, torch.Tensor]
- class pyhealth.calib.calibration.TemperatureScaling(model, debug=False, **kwargs)[source]#
Bases:
PostHocCalibrator
Temperature Scaling
Temperature scaling refers to scaling the logits by a “temperature” tuned on the calibration set. For binary classification tasks, this amounts to Platt scaling. For multilabel classification, users can use one temperature for all classes, or one for each. For multiclass classification, this is a confidence calibration method: It tries to calibrate the predicted class’ predicted probability.
Paper:
[1] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On calibration of modern neural networks.” ICML 2017.
[2] Platt, John. “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.” Advances in large margin classifiers 10, no. 3 (1999): 61-74.
- Parameters:
model (BaseModel) – A trained base model.
Examples
>>> from pyhealth.datasets import ISRUCDataset, get_dataloader, split_by_patient
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.calibration import TemperatureScaling
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate
>>> cal_model = TemperatureScaling(model)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> print(Trainer(model=cal_model, metrics=['cwECEt_adapt', 'accuracy']).evaluate(test_dl))
{'accuracy': 0.709843241966832, 'cwECEt_adapt': 0.051673596521491505}
- calibrate(cal_dataset, lr=0.01, max_iter=50, mult_temp=False)[source]#
Calibrate the base model using a calibration dataset.
- forward(**kwargs)[source]#
Forward propagation (just like the original model).
- Parameters:
**kwargs –
Additional arguments to the base model.
- Returns:
A dictionary with all results from the base model, with the following modified:
y_prob: calibrated predicted probabilities.
loss: Cross entropy loss with the new y_prob.
logit: temperature-scaled logits.
- Return type:
Dict[str, torch.Tensor]
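For intuition, the core of temperature scaling is a one-parameter fit on held-out logits. The PyTorch sketch below is illustrative (the function name and toy data are made up) and is not the TemperatureScaling implementation.

import torch

def fit_temperature(logits, labels, lr=0.1, max_iter=100):
    """Fit a scalar temperature T that minimizes NLL of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter,
                                  line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Toy calibration set: confident logits, with the last sample confidently wrong
logits = torch.tensor([[4.0, 0.0, 0.0], [0.0, 3.5, 0.0], [4.0, 0.0, 0.0]])
labels = torch.tensor([0, 1, 1])
T = fit_temperature(logits, labels)
print(T, torch.softmax(logits / T, dim=1))  # calibrated probabilities after scaling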
pyhealth.calib.predictionset#
Prediction set construction methods
- class pyhealth.calib.predictionset.LABEL(model, alpha, debug=False, **kwargs)[source]#
Bases:
SetPredictor
LABEL: Least ambiguous set-valued classifiers with bounded error levels.
This is a prediction-set constructor for multi-class classification problems. It controls either \(\mathbb{P}\{Y \not \in C(X) | Y=k\}\leq \alpha_k\) (when alpha is an array), or \(\mathbb{P}\{Y \not \in C(X)\}\leq \alpha\) (when alpha is a float). Here, \(C(X)\) denotes the final prediction set. This is essentially a split conformal prediction method using the predicted scores.
Paper:
Sadinle, Mauricio, Jing Lei, and Larry Wasserman. “Least ambiguous set-valued classifiers with bounded error levels.” Journal of the American Statistical Association 114, no. 525 (2019): 223-234.
- Parameters:
model (BaseModel) – A trained base model.
alpha (Union[float, np.ndarray]) – Target mis-coverage rate(s).
Examples
>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.predictionset import LABEL
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate the set classifier, with different class-specific mis-coverage rates
>>> cal_model = LABEL(model, [0.15, 0.3, 0.15, 0.15, 0.15])
>>> # Note that we used the test set here because ISRUCDataset has relatively few
>>> # patients, and calibration set should be different from the validation set
>>> # if the latter is used to pick checkpoint. In general, the calibration set
>>> # should be something exchangeable with the test set. Please refer to the paper.
>>> cal_model.calibrate(cal_dataset=test_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer, get_metrics_fn
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(test_dl, additional_outputs=['y_predset'])
>>> print(get_metrics_fn(cal_model.mode)(
...     y_true_all, y_prob_all, metrics=['accuracy', 'miscoverage_ps'],
...     y_predset=extra_output['y_predset'])
... )
{'accuracy': 0.709843241966832, 'miscoverage_ps': array([0.1499847 , 0.29997638, 0.14993964, 0.14994704, 0.14999252])}
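The split-conformal recipe behind LABEL can be sketched in a few lines for the marginal-coverage case (a single float alpha). This is an illustrative approximation with made-up helper names and synthetic data, not the class-conditional variant or the PyHealth implementation: pick a threshold from the calibration set so that roughly a 1 - alpha fraction of true-label scores exceed it, then include every class whose predicted score reaches the threshold.

import numpy as np

def conformal_threshold(cal_prob, cal_label, alpha=0.1):
    """Approximate threshold t such that P{score of true label < t} <= alpha."""
    true_scores = cal_prob[np.arange(len(cal_label)), cal_label]
    n = len(true_scores)
    k = int(np.floor(alpha * (n + 1)))      # conservative empirical quantile index
    return np.sort(true_scores)[max(k - 1, 0)] if k >= 1 else 0.0

def prediction_set(prob, threshold):
    """Include every class whose predicted score reaches the threshold."""
    return prob >= threshold                 # (N, C) boolean prediction sets

rng = np.random.default_rng(0)
cal_prob = rng.dirichlet(np.ones(3), size=200)     # synthetic calibration scores
cal_label = rng.integers(0, 3, size=200)
t = conformal_threshold(cal_prob, cal_label, alpha=0.15)
test_prob = rng.dirichlet(np.ones(3), size=5)
print(prediction_set(test_prob, t).astype(int))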
- class pyhealth.calib.predictionset.SCRIB(model, risk, loss_kwargs=None, debug=False, fill_max=True, **kwargs)[source]#
Bases:
SetPredictor
SCRIB: Set-classifier with Class-specific Risk Bounds
This is a prediction-set constructor for multi-class classification problems. SCRIB tries to control class-specific risk while minimizing the ambiguity. To do this, it selects class-specific thresholds for the predictions on a calibration set.
If risk is a float (say 0.1), SCRIB controls the overall risk: \(\mathbb{P}\{Y \not \in C(X) | |C(X)| = 1\}\leq risk\). If risk is an array (say np.asarray([0.1] * 5)), SCRIB controls the class-specific risks: \(\mathbb{P}\{Y \not \in C(X) | Y=k \land |C(X)| = 1\}\leq risk_k\). Here, \(C(X)\) denotes the final prediction set.
Paper:
Lin, Zhen, Lucas Glass, M. Brandon Westover, Cao Xiao, and Jimeng Sun. “SCRIB: Set-classifier with Class-specific Risk Bounds for Blackbox Models.” AAAI 2022.
- Parameters:
model (BaseModel) – A trained model.
risk (Union[float, np.ndarray]) – risk targets.
loss_kwargs (dict, optional) –
Additional loss parameters (including hyperparameters). It could contain the following float/int hyperparameters:
- lk: The coefficient for the loss term associated with risk violation penalty. The higher the lk, the more penalty on risk violation (likely higher ambiguity).
- fill_max: Whether to fill the class with max predicted score when no class exceeds the threshold. In other words, if fill_max, the null region will be filled with max-prediction class.
Defaults to {‘lk’: 1e4, ‘fill_max’: False}
fill_max (bool, optional) – Whether to fill the empty prediction set with the max-predicted class. Defaults to True.
Examples
>>> from pyhealth.datasets import ISRUCDataset, split_by_patient, get_dataloader
>>> from pyhealth.models import SparcNet
>>> from pyhealth.tasks import sleep_staging_isruc_fn
>>> from pyhealth.calib.predictionset import SCRIB
>>> from pyhealth.trainer import get_metrics_fn
>>> sleep_ds = ISRUCDataset("/srv/scratch1/data/ISRUC-I").set_task(sleep_staging_isruc_fn)
>>> train_data, val_data, test_data = split_by_patient(sleep_ds, [0.6, 0.2, 0.2])
>>> model = SparcNet(dataset=sleep_ds, feature_keys=["signal"],
...     label_key="label", mode="multiclass")
>>> # ... Train the model here ...
>>> # Calibrate the set classifier, with different class-specific risk targets
>>> cal_model = SCRIB(model, [0.2, 0.3, 0.1, 0.2, 0.1])
>>> # Note that we used the test set here because ISRUCDataset has relatively few
>>> # patients, and calibration set should be different from the validation set
>>> # if the latter is used to pick checkpoint. In general, the calibration set
>>> # should be something exchangeable with the test set. Please refer to the paper.
>>> cal_model.calibrate(cal_dataset=test_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(test_dl, additional_outputs=['y_predset'])
>>> print(get_metrics_fn(cal_model.mode)(
...     y_true_all, y_prob_all, metrics=['accuracy', 'error_ps', 'rejection_rate'],
...     y_predset=extra_output['y_predset'])
... )
{'accuracy': 0.709843241966832, 'rejection_rate': 0.6381305287631919, 'error_ps': array([0.32161874, 0.36654135, 0.11461734, 0.23728814, 0.14993925])}
- class pyhealth.calib.predictionset.FavMac(model, value_weights=1.0, cost_weights=1.0, target_cost=1.0, delta=None, debug=False, **kwargs)[source]#
Bases:
SetPredictor
Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control (FavMac)
This is a prediction-set constructor for multi-label classification problems. FavMac could control the cost/risk while realizing high value on the prediction set.
Value and cost functions are functions in the form of \(V(S;Y)\) or \(C(S;Y)\), with S being the prediction set and Y being the label. For example, a classical cost function would be “number of false positives”. Denote the
target_cost
as \(c\), ifdelta=None
, FavMac controls the expected cost in the following sense:\(\mathbb{E}[C(S_{N+1};Y_{N+1}] \leq c\).
Otherwise, FavMac controls the violation probability in the following sense:
\(\mathbb{P}\{C(S_{N+1};Y_{N+1})>c\}\leq delta\).
Right now, this FavMac implementation only supports additive value and cost functions (unlike the implementation associated with [1]). That is, the value function is specified by the weights
value_weights
and the cost function is specified bycost_weights
. With \(k\) denoting classes, the cost function is then computed as\(C(S;Y,w) = \sum_{k} (1-Y_k)S_k w_k\)
Similarly, the value function is computed as
\(V(S;Y,w) = \sum_{k} Y_k S_k w_k\).
Papers:
[1] Lin, Zhen, Shubhendu Trivedi, Cao Xiao, and Jimeng Sun. “Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control (FavMac).” ICML 2023.
[2] Fisch, Adam, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. “Conformal prediction sets with limited false positives.” ICML 2022.
- Parameters:
model (BaseModel) – A trained model.
value_weights (Union[float, np.ndarray]) – weights for the value function. See description above. Defaults to 1.
cost_weights (Union[float, np.ndarray]) – weights for the cost function. See description above. Defaults to 1.
target_cost (float) – Target cost. When cost_weights is set to 1, this is essentially the number of false positive. Defaults to 1.
delta (float) – Violation target (in violation control). Defaults to None (which means expectation control instead of violation control).
Examples
>>> from pyhealth.calib.predictionset import FavMac
>>> from pyhealth.datasets import (MIMIC3Dataset, get_dataloader,split_by_patient)
>>> from pyhealth.models import Transformer
>>> from pyhealth.tasks import drug_recommendation_mimic3_fn
>>> from pyhealth.trainer import get_metrics_fn
>>> base_dataset = MIMIC3Dataset(
...     root="/srv/scratch1/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     refresh_cache=False)
>>> sample_dataset = base_dataset.set_task(drug_recommendation_mimic3_fn)
>>> train_data, val_data, test_data = split_by_patient(sample_dataset, [0.6, 0.2, 0.2])
>>> model = Transformer(dataset=sample_dataset, feature_keys=["conditions", "procedures"],
...     label_key="drugs", mode="multilabel")
>>> # ... Train the model here ...
>>> # Try to control false positive to <=3
>>> cal_model = FavMac(model, target_cost=3, delta=None)
>>> cal_model.calibrate(cal_dataset=val_data)
>>> # Evaluate
>>> from pyhealth.trainer import Trainer
>>> test_dl = get_dataloader(test_data, batch_size=32, shuffle=False)
>>> y_true_all, y_prob_all, _, extra_output = Trainer(model=cal_model).inference(
...     test_dl, additional_outputs=["y_predset"])
>>> print(get_metrics_fn(cal_model.mode)(
...     y_true_all, y_prob_all, metrics=['tp', 'fp'],
...     y_predset=extra_output["y_predset"]))  # We get FP~=3
{'tp': 0.5049893086243763, 'fp': 2.8442622950819674}
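To make the additive cost and value functions used by FavMac concrete, here is a tiny NumPy sketch with a made-up prediction set and labels (illustrative only).

import numpy as np

def additive_cost(S, Y, w=1.0):
    """C(S; Y, w) = sum_k (1 - Y_k) * S_k * w_k  (weighted false positives)."""
    return ((1 - Y) * S * w).sum(axis=-1)

def additive_value(S, Y, w=1.0):
    """V(S; Y, w) = sum_k Y_k * S_k * w_k  (weighted true positives)."""
    return (Y * S * w).sum(axis=-1)

S = np.asarray([1, 1, 0, 1])   # multi-label prediction set (0/1 per class)
Y = np.asarray([1, 0, 0, 1])   # true labels
print(additive_cost(S, Y), additive_value(S, Y))  # 1 (one false positive), 2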
PyHealth live#
Start Time: 8 PM Central Time, Wednesday
Recurrence: There is no live session for now.
Zoom: Join Link
Add to Google Calendar: Invitation
Add to Microsoft Outlook (.ics): Invitation
YouTube: Recorded Live Sessions
User/Developer Slack: Click to join
Schedules#
(Dec 21, 2022) Live 01 - What is PyHealth and How to Get Started? [Recap]
(Dec 28, 2022) Live 02 - Data & Datasets & Tasks: store unstructured data in a structured way. [Recap I] [II] [III] [IV]
(Jan 4, 2023) Live 03 - Models & Trainer & Metrics: initialize and train a deep learning model. [Recap I] [II] [III]
(Jan 11, 2023) Live 04 - Tokenizer & Medcode: master the medical code lookup and mapping [Recap I] [II]
(Jan 18, 2023) Live 05 - PyHealth can support a complete healthcare ML pipeline [Recap I] [II]
(Jan 25, 2023) Live 06 - Fit your own dataset into pipeline and use our model [Recap]
(Feb 1, 2023) Live 07 - Adopt your customized model and quickly try it on our data [Recap]
(Feb 8, 2023) Live 08 - New feature: support for biosignal data (EEG, ECG, etc.) classification [Recap I] [II]
(Feb 15, 2023) Live 09 - New feature: parallel and faster data loading architecture
(Feb 22, 2023) Live 10 - Add a covid prediction benchmark (new datasets, new models)
Development logs#
We track the new development here:
Dec 29, 2023
1. add GAN models and demos in pyhealth PR #256
Dec 20, 2023
add graph neural network models for drug recommendation in PR #251
add regression_metrics_fn for some unsupervised learning tasks.
add VAE models in pyhealth PR #253
Nov 28, 2023
fix ddi metric calculation error raised by issue #249.
Nov 24, 2023
add ddi metrics to multilabel_metrics_fn per issue #247
add medlink as a new medical record link pipeline in PR #240
add chefer transformer in PR #239
add KG embedding in PR #234
add LLM for Pyhealth in PR #231
Sep 01, 2023
1. add Base fairness metrics and example `#216`.
July 22, 2023
1. add two Temple University datasets, TUEV and TUAB, in `#194`.
2. add the EEG six event detection and abnormal EEG detection tasks.
July 1, 2023
1. add six ECG datasets: "cpsc_2018", "cpsc_2018_extra", "georgia", "ptb", "ptb-xl", "st_petersburg_incart" (from PhysioNet Cardiology Challenge 2020, https://physionet.org/content/challenge-2020/1.0.2/) `#176`
2. add ECG binary classification tasks (for five symptom categories: Arrhythmias symptom, Bundle branch blocks and fascicular blocks symptom,
Axis deviations symptom, Conduction delays symptom, Wave abnormalities symptom) `#176`
May 31, 2023
1. add SHHS dataset and its sleep staging task.
May 25, 2023
1. add dirichlet calibration `PR #159`
May 9, 2023
1. add MIMIC-Extract dataset `#136`
2. add new maintainer members for pyhealth: Junyi Gao and Benjamin Danek
May 6, 2023
1. add new parser functions (admissionDx, diagnosisStrings) and prediction tasks for eICU dataset `#148`
Apr 27, 2023
1. add MoleRec model (WWW'23) for drug recommendation `#122`
Apr 26, 2023
1. fix bugs in GRASP model `#141`
2. add pandas install <2 constraints `#135`
3. add hcpcsevents table process in MIMIC4 dataset `#134`
Apr 10, 2023
1. fix Ambiguous datetime usage in eICU (https://github.com/sunlabuiuc/PyHealth/pull/132)
Mar 26, 2023
1. add the entire uncertainty quantification module (https://github.com/sunlabuiuc/PyHealth/pull/111)
Feb 26, 2023
1. add 6 EHR prediction models: Adacare, Concare, Stagenet, TCN, Grasp, Agent
Feb 24, 2023
1. add unittest for omop dataset
2. add github action triggered manually, check `#104`
Feb 19, 2023
1. add unittest for eicu dataset
2. add ISRUC dataset (and task function) for signal learning
Feb 12, 2023
1. add unittest for mimiciii, mimiciv
2. add SHHS datasets for sleep staging task
3. add SparcNet model for signal classification task
Feb 08, 2023
1. complete the biosignal data support, add ContraWR [1] model for general purpose biosignal classification task ([1] Yang, Chaoqi, Danica Xiao, M. Brandon Westover, and Jimeng Sun. "Self-supervised EEG representation learning for automatic sleep staging." arXiv preprint arXiv:2110.15278 (2021).)
Feb 07, 2023
1. Support signal dataset processing and split: add SampleSignalDataset, BaseSignalDataset. Use SleepEDFcassette dataset as the first signal dataset. Use example/sleep_staging_sleepEDF_contrawr.py
2. rename the dataset/ parts: previous BaseDataset becomes BaseEHRDataset and SampleDataset becomes SampleEHRDataset. Right now, BaseDataset will be inherited by BaseEHRDataset and BaseSignalDataset. SampleBaseDataset will be inherited by SampleEHRDataset and SampleSignalDataset.
Feb 06, 2023
1. improve readme style
2. add the pyhealth live 06 and 07 link to pyhealth live
Feb 01, 2023
1. add unittest of PyHealth MedCode and Tokenizer
Jan 26, 2023
1. accelerate MIMIC-IV, eICU and OMOP data loading by using multiprocessing (pandarallel)
Jan 25, 2023
1. accelerate the MIMIC-III data loading process by using multiprocessing (pandarallel)
Jan 24, 2023
1. Fix the code typo in pyhealth/tasks/drug_recommendation.py for issue `#71`.
2. update the pyhealth live schedule
Jan 22, 2023
1. Fix the list of list of vector problem in RNN, Transformer, RETAIN, and CNN
2. Add initialization examples for RNN, Transformer, RETAIN, CNN, and Deepr
3. (minor) change the parameters from "Type" and "level" to "type_" and "dim_"
4. BPDanek adds the "__repr__" function to medcode for better print understanding
5. add unittest for pyhealth.data
Jan 21, 2023
1. Added a new model, Deepr (models.Deepr)
Jan 20, 2023
1. add the pyhealth live 05
2. add slack channel invitation in pyhealth live page
Jan 13, 2023
1. add the pyhealth live 03 and 04 video link to the navigation
2. add future pyhealth live schedule
Jan 8, 2023
1. Changed BaseModel.add_feature_transform_layer in models/base_model.py so that it accepts special_tokens if necessary
2. fix an int/float bug in dataset checking (transform int to float and then process them uniformly)
Dec 26, 2022
1. add examples to pyhealth.data, pyhealth.datasets
2. improve jupyter notebook tutorials 0, 1, 2
Dec 21, 2022
1. add the development logs to the navigation
2. add the pyhealth live schedule to the navigation
About us#
We are the SunLab healthcare research team at UIUC.
*Zhenbang Wu (Ph.D. Student @ University of Illinois Urbana-Champaign)
*Chaoqi Yang (Ph.D. Student @ University of Illinois Urbana-Champaign)
Patrick Jiang (M.S. Student @ University of Illinois Urbana-Champaign)
Zhen Lin (Ph.D. Student @ University of Illinois Urbana-Champaign)
Junyi Gao (M.S. Student @ UIUC, Ph.D. Student @ University of Edinburgh)
Benjamin Danek (M.S. Student @ University of Illinois Urbana-Champaign)
Jimeng Sun (Professor @ University of Illinois Urbana-Champaign)
(* indicates equal contribution)