pyhealth.datasets.MIMIC4Dataset

The open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to its documentation for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC4Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)

Bases: BaseEHRDataset

Base dataset class for the MIMIC-IV dataset.

The MIMIC-IV dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • patients: defines a patient in the database, identified by subject_id.

  • admissions: defines a patient's hospital admission, identified by hadm_id.

We further support the following tables:
  • diagnoses_icd: contains ICD diagnoses (ICD9CM and ICD10CM codes) for patients.

  • procedures_icd: contains ICD procedures (ICD9PROC and ICD10PROC codes) for patients.

  • prescriptions: contains medication-related order entries (NDC codes) for patients.

  • labevents: contains laboratory measurements (MIMIC4_ITEMID codes) for patients.

Parameters:
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data (should contain many csv files).

  • tables (List[str]) – list of tables to be loaded (e.g., ["diagnoses_icd", "procedures_icd"]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. Each key is a str naming the source code vocabulary, and each value takes one of two forms:

    1. a str naming the target code vocabulary;

    2. a tuple with two elements. The first element is a str naming the target code vocabulary, and the second element is a dict with the keys "source_kwargs" and/or "target_kwargs", whose values are the corresponding kwargs for the CrossMap.map() method.

    Default is None, which is treated as an empty dict, meaning the original codes are used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
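The two accepted value forms of code_mapping can be illustrated with a small sketch. The helper normalize_code_mapping below is hypothetical and is not part of PyHealth; it only shows, under that assumption, how both forms could be reduced to one uniform shape.

```python
from typing import Dict, Optional, Tuple, Union

# Type of the code_mapping argument as described in the parameter list.
CodeMapping = Dict[str, Union[str, Tuple[str, Dict]]]


def normalize_code_mapping(
    code_mapping: Optional[CodeMapping],
) -> Dict[str, Tuple[str, Dict, Dict]]:
    """Hypothetical helper (not a PyHealth function).

    Returns {source: (target, source_kwargs, target_kwargs)}.
    """
    normalized = {}
    # Treat None like an empty dict: no vocabulary is remapped.
    for source, value in (code_mapping or {}).items():
        if isinstance(value, str):
            # Form 1: only the target vocabulary is given.
            target, kwargs = value, {}
        else:
            # Form 2: (target vocabulary, dict of kwargs).
            target, kwargs = value
        normalized[source] = (
            target,
            kwargs.get("source_kwargs", {}),
            kwargs.get("target_kwargs", {}),
        )
    return normalized


form_1 = normalize_code_mapping({"NDC": "ATC"})
form_2 = normalize_code_mapping({"NDC": ("ATC", {"target_kwargs": {"level": 3}})})
no_mapping = normalize_code_mapping(None)
```

In this sketch, form 1 yields empty kwargs for both sides, while form 2 carries {"level": 3} as the target-side kwargs.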

task

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples

Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
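The relationship between samples, patient_to_index, and visit_to_index can be sketched in plain Python. The sample values and the build_index helper below are made up for illustration and are an assumption about the data shape, not PyHealth's own code.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative samples; "label" stands in for any task-specific attribute.
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]


def build_index(samples: List[Dict], key: str) -> Dict[str, List[int]]:
    """Hypothetical helper: map each value of `key` to its sample indices."""
    index = defaultdict(list)
    for i, sample in enumerate(samples):
        index[sample[key]].append(i)
    return dict(index)


patient_to_index = build_index(samples, "patient_id")
visit_to_index = build_index(samples, "visit_id")
```

An index of this shape makes it cheap to collect all samples belonging to one patient or one visit, for example when splitting data by patient.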

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> dataset = MIMIC4Dataset(
...         root="/srv/local/data/physionet.org/files/mimiciv/2.0/hosp",
...         tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
...         code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
...     )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)

Helper function which parses the patients and admissions tables.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.
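Each parse helper takes the patients dict and returns the updated dict, so the helpers can be chained. The sketch below shows one way such a chain could be driven by the tables argument. SketchDataset, its plain-dict patient records, and the name-based lookup are assumptions made for illustration, not PyHealth's implementation.

```python
from typing import Dict, List


class SketchDataset:
    """Hypothetical stand-in class; patients are plain dicts here."""

    def __init__(self, tables: List[str]):
        self.tables = tables

    def parse_basic_info(self, patients: Dict[str, dict]) -> Dict[str, dict]:
        # Create one made-up patient with one empty visit.
        patients["p1"] = {"visits": {"v1": []}}
        return patients

    def parse_diagnoses_icd(self, patients: Dict[str, dict]) -> Dict[str, dict]:
        # Attach one made-up diagnosis code to the existing visit.
        patients["p1"]["visits"]["v1"].append(("diagnoses_icd", "4019"))
        return patients

    def parse_tables(self) -> Dict[str, dict]:
        # Basic info first, so later helpers find patients and visits.
        patients = self.parse_basic_info({})
        for table in self.tables:
            # Assumed convention: one helper per table, named parse_<table>.
            helper = getattr(self, f"parse_{table.lower()}", None)
            if helper is None:
                raise NotImplementedError(f"no parser for table {table!r}")
            patients = helper(patients)
        return patients


patients = SketchDataset(tables=["diagnoses_icd"]).parse_tables()
```

Requesting a table without a matching helper raises NotImplementedError in this sketch, which surfaces unsupported table names early.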

parse_diagnoses_icd(patients)

Helper function which parses the diagnoses_icd table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the diagnoses_icd table, so the event timestamp is set to None.
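A minimal sketch of what a None timestamp means for downstream code is given below. SketchEvent and the sample rows are hypothetical and do not reproduce PyHealth's Event class; the column names subject_id, hadm_id, icd_code, and icd_version are assumed here for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SketchEvent:
    """Hypothetical event record; not PyHealth's Event class."""

    code: str
    table: str
    vocabulary: str
    visit_id: str
    patient_id: str
    timestamp: Optional[datetime] = None


# Made-up rows shaped like a diagnoses table without a time column.
rows = [
    {"subject_id": "10", "hadm_id": "100", "icd_code": "4019", "icd_version": "9"},
    {"subject_id": "10", "hadm_id": "100", "icd_code": "I10", "icd_version": "10"},
]

events: List[SketchEvent] = [
    SketchEvent(
        code=row["icd_code"],
        table="diagnoses_icd",
        # Pick the vocabulary name from the ICD version of the row.
        vocabulary=f"ICD{row['icd_version']}CM",
        visit_id=row["hadm_id"],
        patient_id=row["subject_id"],
        timestamp=None,  # no per-event time is available
    )
    for row in rows
]

# Code that orders events by time has to handle the missing timestamps.
timed_events = [event for event in events if event.timestamp is not None]
```

Because every timestamp is None in this sketch, timed_events is empty; events from such a table can only be placed at the visit level, not ordered within a visit.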

parse_procedures_icd(patients)

Helper function which parses the procedures_icd table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the procedures_icd table, so the event timestamp is set to None.

parse_prescriptions(patients)

Helper function which parses the prescriptions table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_labevents(patients)

Helper function which parses the labevents table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_hcpcsevents(patients)

Helper function which parses the hcpcsevents table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-IV does not provide specific timestamps in the hcpcsevents table, so the event timestamp is set to None.