pyhealth.datasets.MIMICExtractDataset

The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to its documentation for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMICExtractDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False, pop_size=None, itemid_to_variable_map=None)

Bases: BaseEHRDataset

Base dataset for MIMIC-Extract dataset.

Reads the HDF5 data produced by [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract#step-4-set-cohort-selection-and-extraction-criteria). Works with files created with or without LEVEL2 grouping, with restricted cohort population sizes, and with other optional parameter values. It should also work with many customized versions of the pipeline.

You can create or obtain a MIMIC-Extract dataset in several ways:

The default cohort dataset is [available on GCP](https://console.cloud.google.com/storage/browser/mimic_extract) (requires PhysioNet access provisioned in GCP).
Follow the [step-by-step instructions](https://github.com/MLforHealth/MIMIC_Extract#step-by-step-instructions) on the MIMIC_Extract GitHub site, which include setting up a PostgreSQL database and loading the MIMIC-III data files.
Use the instructions at [MIMICExtractEasy](https://github.com/SphtKr/MIMICExtractEasy), which uses DuckDB instead and should be considerably simpler.

Any of these methods will provide you with a set of HDF5 files containing a cleaned subset of the MIMIC-III dataset. This class can be used to read that dataset (mainly the all_hourly_data.h5 file). Consult the MIMIC-Extract documentation for all the options available for dataset generation (cohort selection, aggregation level, etc.).
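Before constructing the dataset, it can help to confirm that the expected HDF5 files are present in root. The following is a minimal sketch, not the class's own logic: it assumes that a restricted population size shows up as a `_<pop_size>` suffix on each base file name, and the helper names `expected_hdf5_files` and `missing_files` are hypothetical. Verify the naming against the files your pipeline actually produced.

```python
from pathlib import Path
from typing import Dict, List, Optional


def expected_hdf5_files(root: str, pop_size: Optional[int] = None) -> Dict[str, Path]:
    """Map each base file name to the path it is assumed to have under root."""
    # Assumption: a restricted population size appears as a "_<pop_size>" suffix.
    suffix = f"_{pop_size}" if pop_size is not None else ""
    base_names = ["all_hourly_data", "C"]  # the two files named on this page
    return {name: Path(root) / f"{name}{suffix}.h5" for name in base_names}


def missing_files(root: str, pop_size: Optional[int] = None) -> List[str]:
    """Return the base names whose assumed files are not present under root."""
    return [
        name
        for name, path in expected_hdf5_files(root, pop_size).items()
        if not path.is_file()
    ]
```

Calling `missing_files(root)` on a complete extract directory should return an empty list; a non-empty result suggests either a different pop_size or a customized pipeline.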

Parameters:

dataset_name (Optional[str]) – name of the dataset.
root (str) – root directory of the raw data (should contain one or more HDF5 files).
tables (List[str]) – list of tables to be loaded (e.g., [“vitals_labs”, “interventions”]).
code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –
a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value is of two formats:
1. a str of the target code vocabulary;
2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.
Default is empty dict, which means the original code will be used.
dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.
refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
pop_size (Optional[int]) – If your MIMIC-Extract dataset was created with a pop_size parameter, include it here. This is used to find the correct filenames.
itemid_to_variable_map (Optional[str]) – Path to the CSV file used for aggregation mapping during your dataset’s creation. This is typically the file located in the MIMIC-Extract repo at resources/itemid_to_variable_map.csv, or your own version if you have customized it.
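The two accepted value formats of code_mapping can be illustrated with a small normalizer. This is a sketch for explanation only, not PyHealth's internal handling: the helper name `normalize_code_mapping` is hypothetical, and the "ICD9CM" to "CCSCM" entry is an illustrative choice of vocabularies.

```python
from typing import Dict, Tuple, Union

CodeMapping = Dict[str, Union[str, Tuple[str, Dict]]]


def normalize_code_mapping(code_mapping: CodeMapping) -> Dict[str, Tuple[str, Dict, Dict]]:
    """Expand both value formats into (target, source_kwargs, target_kwargs)."""
    normalized = {}
    for source_vocab, value in code_mapping.items():
        if isinstance(value, str):
            # Format 1: only the target vocabulary is given.
            target_vocab, kwargs = value, {}
        else:
            # Format 2: (target vocabulary, dict with optional kwargs entries).
            target_vocab, kwargs = value
        normalized[source_vocab] = (
            target_vocab,
            kwargs.get("source_kwargs", {}),
            kwargs.get("target_kwargs", {}),
        )
    return normalized


example = normalize_code_mapping({
    "ICD9CM": "CCSCM",                                # format 1 (illustrative)
    "NDC": ("ATC", {"target_kwargs": {"level": 3}}),  # format 2, as in the Examples
})
```

Here `example["NDC"]` carries the `{"level": 3}` target kwargs, while `example["ICD9CM"]` has empty kwargs for both sides.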

task: Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples: Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index: Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index: Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
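The relationship between samples and the two index attributes can be sketched as follows. This is an assumed reconstruction of how such indices relate to the sample list, not the library's code; the helper `build_index` and the sample contents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(samples: List[Dict], key: str) -> Dict[str, List[int]]:
    """Map each value of samples[i][key] to the list of sample positions i."""
    index = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


# Hypothetical samples; "label" stands in for task-specific attributes.
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]
patient_to_index = build_index(samples, "patient_id")
visit_to_index = build_index(samples, "visit_id")
```

With these samples, patient "p1" maps to positions 0 and 1, which is the kind of lookup needed to split data by patient without leaking visits across splits.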

Examples

>>> from pyhealth.datasets import MIMICExtractDataset
>>> dataset = MIMICExtractDataset(
...         root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...         tables=["vitals_labs", "interventions"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)

Helper function which parses the patients dataset (within all_hourly_data.h5).

Will be called in self.parse_tables().

Docs:

PATIENTS: https://mimic.mit.edu/docs/iii/tables/patients/
ADMISSIONS: https://mimic.mit.edu/docs/iii/tables/admissions/

Parameters: patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id, which is updated with the MIMIC-III table result.
Return type: Dict[str, Patient]
Returns: The updated patients dict.

parse_diagnoses_icd(patients)

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5) in a way compatible with MIMIC3Dataset.

Will be called in self.parse_tables().

Docs:

DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/

Parameters: patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
Return type: Dict[str, Patient]
Returns: The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set the timestamp to None.

parse_c(patients)

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5).

Will be called in self.parse_tables()

Docs:

DIAGNOSES_ICD: https://mimic.mit.edu/docs/iii/tables/diagnoses_icd/

Parameters: patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.
Return type: Dict[str, Patient]
Returns: The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so we set the timestamp to None.

parse_labevents(patients)

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.

Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to LABEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done by MIMIC-Extract.
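The idea of selecting the lab-related entries from the mapping file can be sketched as below. This is an illustration under stated assumptions, not the method's implementation: the column names (LEVEL2, ITEMID, LINKSTO), the sample rows, and the helper `labevents_variables` are all hypothetical, so check them against your own itemid_to_variable_map.csv.

```python
import csv
import io
from typing import Dict

# Hypothetical excerpt; column names and rows are assumptions for illustration.
SAMPLE_MAP_CSV = """LEVEL2,ITEMID,LINKSTO
Glucose,50931,labevents
Heart Rate,211,chartevents
Creatinine,50912,labevents
"""


def labevents_variables(csv_text: str) -> Dict[str, str]:
    """Return {ITEMID: LEVEL2 variable name} for rows linked to LABEVENTS."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["ITEMID"]: row["LEVEL2"]
        for row in reader
        if row["LINKSTO"].strip().lower() == "labevents"
    }


lab_map = labevents_variables(SAMPLE_MAP_CSV)
```

Only the rows linked to the lab table survive the filter, so the chart-derived "Heart Rate" row is excluded from `lab_map`.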