pyhealth.datasets.MIMIC3Dataset

MIMIC-III (Medical Information Mart for Intensive Care) is an openly available database; refer to doc for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.MIMIC3Dataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)

Bases: BaseEHRDataset

Base dataset class for MIMIC-III.

The MIMIC-III dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://mimic.physionet.org/.

The basic information is stored in the following tables:
  • PATIENTS: defines each patient in the database, identified by SUBJECT_ID.

  • ADMISSIONS: defines each hospital admission of a patient, identified by HADM_ID.

We further support the following tables:
  • DIAGNOSES_ICD: contains ICD-9 diagnoses (ICD9CM code) for patients.

  • PROCEDURES_ICD: contains ICD-9 procedures (ICD9PROC code) for patients.

  • PRESCRIPTIONS: contains medication related order entries (NDC code) for patients.

  • LABEVENTS: contains laboratory measurements (MIMIC3_ITEMID code) for patients.
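The identifiers above tie the tables together: every row of a supported table carries a SUBJECT_ID and a HADM_ID, so it can be attached to one patient and one admission. The following is a minimal sketch of that grouping using only the standard library. The CSV rows are invented toy data, real files have more columns, and this is not how PyHealth itself loads the tables.

```python
import csv
import io
from collections import defaultdict

# Toy rows invented for illustration; the real MIMIC-III files have more columns.
ADMISSIONS_CSV = """SUBJECT_ID,HADM_ID
10,100
10,101
11,102
"""

DIAGNOSES_ICD_CSV = """SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
10,100,1,4019
10,100,2,25000
10,101,1,4280
11,102,1,5849
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# patient (SUBJECT_ID) -> admission (HADM_ID) -> list of diagnosis codes
records = defaultdict(dict)

# ADMISSIONS defines which admissions exist for each patient.
for row in read_rows(ADMISSIONS_CSV):
    records[row["SUBJECT_ID"]][row["HADM_ID"]] = []

# DIAGNOSES_ICD rows are attached to the admission they belong to.
for row in read_rows(DIAGNOSES_ICD_CSV):
    records[row["SUBJECT_ID"]][row["HADM_ID"]].append(row["ICD9_CODE"])

for subject_id, admissions in records.items():
    print(subject_id, admissions)
```

The same two keys would be used to attach rows from PROCEDURES_ICD, PRESCRIPTIONS, or LABEVENTS.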

Parameters:
  • dataset_name (Optional[str]) – name of the dataset. Default is None.

  • root (str) – root directory of the raw data, which should contain the MIMIC-III CSV files.

  • tables (List[str]) – list of tables to be loaded (e.g., ["DIAGNOSES_ICD", "PROCEDURES_ICD"]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) –

    a dictionary containing the code mapping information. Each key is a str naming the source code vocabulary, and each value takes one of two forms:

    1. a str naming the target code vocabulary;

    2. a tuple with two elements. The first element is a str naming the target code vocabulary, and the second element is a dict with the keys "source_kwargs" and/or "target_kwargs", whose values are the corresponding kwargs for the CrossMap.map() method.

    Default is None, which behaves like an empty dict: the original codes are used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if True, the dataset will be processed from scratch and the cache will be updated. Default is False.
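The two accepted value forms of code_mapping can be reduced to a single (target, kwargs) shape before use. The following is a minimal sketch of that normalization. normalize_code_mapping is a hypothetical helper written for illustration; it is not part of PyHealth and does not describe how the library handles the argument internally.

```python
from typing import Dict, Tuple, Union

# Type of the code_mapping argument as documented above.
CodeMapping = Dict[str, Union[str, Tuple[str, Dict]]]


def normalize_code_mapping(code_mapping: CodeMapping) -> Dict[str, Tuple[str, Dict]]:
    """Hypothetical helper: rewrite every entry as (target_vocabulary, kwargs)."""
    normalized = {}
    for source, value in code_mapping.items():
        if isinstance(value, str):
            # Form 1: only the target vocabulary is given.
            target, kwargs = value, {}
        else:
            # Form 2: (target vocabulary, dict with optional kwargs keys).
            target, kwargs = value
        normalized[source] = (
            target,
            {
                "source_kwargs": kwargs.get("source_kwargs", {}),
                "target_kwargs": kwargs.get("target_kwargs", {}),
            },
        )
    return normalized


# Form 1: plain target vocabulary.
simple = normalize_code_mapping({"NDC": "ATC"})

# Form 2: target vocabulary plus kwargs for the target side.
detailed = normalize_code_mapping(
    {"NDC": ("ATC", {"target_kwargs": {"level": 3}})}
)

print(simple)
print(detailed)
```

Both calls yield the same structure, so any code that consumes the mapping only has to handle one shape.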

task

Optional[str], name of the task (e.g., "mortality prediction"). Default is None.

samples

Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index

Optional[Dict[str, List[int]]], a dict mapping each patient_id to a list of sample indices. Default is None.

visit_to_index

Optional[Dict[str, List[int]]], a dict mapping each visit_id to a list of sample indices. Default is None.
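The relationship between samples and the two index attributes can be illustrated with plain Python. The following is a minimal sketch, assuming each sample is a dict with patient_id and visit_id keys as described above. build_indices is a hypothetical helper and the sample values are invented; this is not the library's own implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_indices(
    samples: List[Dict],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Hypothetical helper: map patient and visit ids to sample positions."""
    patient_to_index = defaultdict(list)
    visit_to_index = defaultdict(list)
    for idx, sample in enumerate(samples):
        patient_to_index[sample["patient_id"]].append(idx)
        visit_to_index[sample["visit_id"]].append(idx)
    return dict(patient_to_index), dict(visit_to_index)


# Invented samples; "label" stands in for a task-specific attribute.
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]

patient_to_index, visit_to_index = build_indices(samples)
print(patient_to_index)
print(visit_to_index)
```

An index keyed by patient is useful, for example, when splitting data so that all samples of one patient end up in the same partition.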

Examples

>>> from pyhealth.datasets import MIMIC3Dataset
>>> dataset = MIMIC3Dataset(
...     root="/srv/local/data/physionet.org/files/mimiciii/1.4",
...     tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
...     # map NDC codes to ATC, passing level=3 as a target-side kwarg
...     code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
... )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)

Helper function which parses the PATIENTS and ADMISSIONS tables.

Called by self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id, which is updated with the parsed MIMIC-III table contents.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_diagnoses_icd(patients)

Helper function which parses the DIAGNOSES_ICD table.

Called by self.parse_tables().
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the DIAGNOSES_ICD table, so the timestamp of each diagnosis event is set to None.
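Code that consumes these events has to tolerate a missing timestamp. The following is a minimal sketch of what such undated events might look like. SketchEvent is a hypothetical stand-in and not PyHealth's own event class, the rows are invented, and ordering by the SEQ_NUM column is an assumption made for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SketchEvent:
    """Hypothetical event record used only for this illustration."""

    code: str
    vocabulary: str
    visit_id: str
    timestamp: Optional[datetime] = None  # None when the table has no time column


# Invented DIAGNOSES_ICD rows, deliberately out of order.
rows = [
    {"HADM_ID": "100", "SEQ_NUM": "2", "ICD9_CODE": "25000"},
    {"HADM_ID": "100", "SEQ_NUM": "1", "ICD9_CODE": "4019"},
]

# Without timestamps, SEQ_NUM is the only ordering information in the rows.
events = [
    SketchEvent(code=row["ICD9_CODE"], vocabulary="ICD9CM", visit_id=row["HADM_ID"])
    for row in sorted(rows, key=lambda row: int(row["SEQ_NUM"]))
]

# Downstream code should check for None before comparing timestamps.
undated = [event for event in events if event.timestamp is None]
print(len(undated), [event.code for event in events])
```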

parse_procedures_icd(patients)

Helper function which parses the PROCEDURES_ICD table.

Called by self.parse_tables().
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

MIMIC-III does not provide specific timestamps in the PROCEDURES_ICD table, so the timestamp of each procedure event is set to None.

parse_prescriptions(patients)

Helper function which parses the PRESCRIPTIONS table.

Called by self.parse_tables().
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_labevents(patients)

Helper function which parses the LABEVENTS table.

Called by self.parse_tables().
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.