pyhealth.datasets.MIMICExtractDataset#

The open Medical Information Mart for Intensive Care (MIMIC-III) database; refer to its documentation for more information. We process this database into a well-structured dataset object, giving users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMICExtractDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False, pop_size=None, itemid_to_variable_map=None)[source]#

Bases: BaseEHRDataset

Base dataset for MIMIC-Extract dataset.

Reads the HDF5 data produced by [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract#step-4-set-cohort-selection-and-extraction-criteria). It works with files created with or without LEVEL2 grouping, with restricted cohort population sizes, and with other optional parameter values. It should also work with many customized versions of the pipeline.

You can create or obtain a MIMIC-Extract dataset in several ways:

Any of these methods will provide you with a set of HDF5 files containing a cleaned subset of the MIMIC-III dataset. This class can be used to read that dataset (mainly the all_hourly_data.h5 file). Consult the MIMIC-Extract documentation for all the options available for dataset generation (cohort selection, aggregation level, etc.).

Parameters:
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain one or more HDF5 files).

  • tables (List[str]) – list of tables to be loaded (e.g., ["vitals_labs", "interventions"]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. Each key is a str naming the source code vocabulary, and each value takes one of two formats:

    1. a str naming the target code vocabulary;

    2. a tuple with two elements. The first element is a str naming the target code vocabulary, and the second element is a dict with keys "source_kwargs" or "target_kwargs" whose values are the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.

  • pop_size (Optional[int]) – If your MIMIC-Extract dataset was created with a pop_size parameter, include it here. This is used to find the correct filenames.

  • itemid_to_variable_map (Optional[str]) – path to the CSV file used for aggregation mapping when your dataset was created. This is typically the file located in the MIMIC-Extract repo at resources/itemid_to_variable_map.csv, or your own version if you have customized it.
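The two value formats accepted by code_mapping can be sketched as plain dictionaries. This is only an illustration: the ICD9CM to CCSCM pair is an assumed example, and the `describe` helper is hypothetical and not part of PyHealth. Whether a given vocabulary pair is supported depends on CrossMap and is not checked here.

```python
from typing import Dict, Tuple, Union

CodeMapping = Dict[str, Union[str, Tuple[str, Dict]]]

# Format 1: the value is just the name of the target vocabulary.
# (The ICD9CM -> CCSCM pair is assumed for illustration.)
simple_mapping: CodeMapping = {"ICD9CM": "CCSCM"}

# Format 2: the value is (target vocabulary, kwargs dict).
detailed_mapping: CodeMapping = {
    "NDC": ("ATC", {"target_kwargs": {"level": 3}}),
}


def describe(mapping: CodeMapping) -> Dict[str, Tuple[str, Dict]]:
    """Normalize both formats to (target, kwargs). Hypothetical helper."""
    normalized = {}
    for source, value in mapping.items():
        if isinstance(value, str):
            # Format 1 carries no extra kwargs.
            normalized[source] = (value, {})
        else:
            target, kwargs = value
            normalized[source] = (target, kwargs)
    return normalized
```

Normalizing both forms to a single shape keeps downstream code from having to branch on the value type more than once.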

task#

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples#

Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
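The relationship between samples, patient_to_index, and visit_to_index can be sketched as follows. This is a minimal illustration of the data shapes described above, not the library's own code; `build_indices` and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_indices(
    samples: List[Dict],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Map each patient_id and visit_id to the positions of its samples."""
    patient_to_index = defaultdict(list)
    visit_to_index = defaultdict(list)
    for idx, sample in enumerate(samples):
        patient_to_index[sample["patient_id"]].append(idx)
        visit_to_index[sample["visit_id"]].append(idx)
    return dict(patient_to_index), dict(visit_to_index)


# Made-up samples; "label" stands in for any task-specific attribute.
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]
patient_to_index, visit_to_index = build_indices(samples)
```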

Examples

>>> from pyhealth.datasets import MIMICExtractDataset
>>> dataset = MIMICExtractDataset(
...         root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...         tables=["vitals_labs", "interventions"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)[source]#

Helper function which parses the patients dataset (within all_hourly_data.h5).

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id, which is updated with the MIMIC-III table result.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_diagnoses_icd(patients)[source]#

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5) in a way compatible with MIMIC3Dataset.

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so the timestamp is set to None.

parse_c(patients)[source]#

Helper function which parses the C (ICD9 diagnosis codes) dataset (within C.h5).

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so the timestamp is set to None.

parse_labevents(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.

Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to LABEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done by MIMIC-Extract.
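The table-based filtering described above can be sketched with the standard csv module. This is an illustration under stated assumptions: the column names ITEMID and LINKSTO are assumed to match the mapping CSV, the sample rows are made up, and `itemids_for_table` is a hypothetical helper rather than a method of this class.

```python
import csv
import io
from typing import Set

# Made-up rows standing in for itemid_to_variable_map.csv.
# Column names (LEVEL2, ITEMID, LINKSTO) are assumptions.
SAMPLE_CSV = """LEVEL2,ITEMID,LINKSTO
Creatinine,50912,labevents
Heart Rate,220045,chartevents
Heart Rate,211,chartevents
"""


def itemids_for_table(csv_text: str, table: str) -> Set[int]:
    """Collect the ITEMIDs whose LINKSTO column names the given raw table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        int(row["ITEMID"])
        for row in reader
        if row["LINKSTO"].strip().lower() == table
    }


lab_itemids = itemids_for_table(SAMPLE_CSV, "labevents")
chart_itemids = itemids_for_table(SAMPLE_CSV, "chartevents")
```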

See also self.parse_vitals_labs()

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_chartevents(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5) in a way compatible with MIMIC3Dataset.

Features in vitals_labs are correlated with MIMIC-III ITEM_ID values, and those ITEM_IDs that correspond to CHARTEVENTS table items in raw MIMIC-III will be added as events. This correlation depends on the contents of the provided itemid_to_variable_map.csv file. Note that this will likely not match the raw MIMIC-III data because of the harmonization/aggregation done in MIMIC-Extract.

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_vitals_labs(patients)[source]#

Helper function which parses the vitals_labs dataset (within all_hourly_data.h5).

Events are added using the MIMIC3_ITEMID vocabulary, and the mapping is determined by the CSV file passed to the constructor in itemid_to_variable_map. Because MIMIC-Extract aggregates like events, only a single MIMIC-III ITEMID is used to represent all like items in the MIMIC-Extract dataset, so the data here will likely not match raw MIMIC-III data. Which ITEMIDs are used depends on the aggregation level in your dataset (i.e., whether you used --no_group_by_level2).
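The idea of one representative ITEMID per aggregated variable can be sketched as below. The column names LEVEL2, LEVEL1, and ITEMID are assumptions about the mapping CSV, and the sample rows are made up. The rule for choosing the representative (here, the first ITEMID listed for each group) is also an assumption made for illustration; the text above does not state which rule the class uses.

```python
import csv
import io
from typing import Dict

# Made-up rows; LEVEL2 is the coarser grouping, LEVEL1 the finer one.
SAMPLE_CSV = """LEVEL2,LEVEL1,ITEMID
Heart Rate,Heart Rate,211
Heart Rate,Heart Rate,220045
Glucose,Glucose (serum),50931
Glucose,Glucose (whole blood),50809
"""


def representative_itemids(csv_text: str, level: str) -> Dict[str, int]:
    """Pick one ITEMID per group at the given aggregation level.

    Keeps the first ITEMID seen for each group (an illustrative choice).
    """
    chosen: Dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        chosen.setdefault(row[level], int(row["ITEMID"]))
    return chosen


by_level2 = representative_itemids(SAMPLE_CSV, "LEVEL2")
by_level1 = representative_itemids(SAMPLE_CSV, "LEVEL1")
```

With the coarser grouping, both glucose rows collapse into one variable and share one ITEMID; with the finer grouping they stay separate, which is why the ITEMIDs in use depend on the aggregation level.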

Will be called in self.parse_tables()

See also self.parse_chartevents() and self.parse_labevents()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_interventions(patients)[source]#

Helper function which parses the interventions dataset (within all_hourly_data.h5). Events are added using the MIMIC3_ITEMID vocabulary, using a manually derived mapping corresponding to general items descriptive of the intervention. Since the raw MIMIC-III data had multiple codes, and MIMIC-Extract aggregates like items, these will not match raw MIMIC-III data.

In particular, note that ITEMID 41491 (“fluid bolus”) is used for crystalloid_bolus and ITEMID 46729 (“Dextran”) is used for colloid_bolus because there is no existing general ITEMID for colloid boluses.
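The manually derived mapping can be pictured as a lookup table from intervention column to ITEMID. The sketch below includes only the two entries named above; the remaining entries are not listed here and are left out rather than guessed. `intervention_code` is a hypothetical helper, not part of this class.

```python
from typing import Optional

# Only the two entries documented above; other interventions are omitted.
INTERVENTION_TO_ITEMID = {
    "crystalloid_bolus": 41491,  # "fluid bolus"
    "colloid_bolus": 46729,      # "Dextran", used for lack of a general code
}


def intervention_code(column: str) -> Optional[str]:
    """Return the ITEMID for an intervention column as a string, or None."""
    itemid = INTERVENTION_TO_ITEMID.get(column)
    return None if itemid is None else str(itemid)
```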

Will be called in self.parse_tables()

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.