pyhealth.datasets.eICUDataset

The open eICU Collaborative Research Database; refer to the documentation for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.eICUDataset(**kwargs)

Bases: BaseEHRDataset

Base dataset for the eICU dataset.

The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.

The basic information is stored in the following tables:
  • patient: defines a patient (uniquepid), a hospital admission (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.

  • hospital: contains information about a hospital (e.g., region).

Note that in eICU, a patient can have multiple hospital admissions, and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay, and all timestamps are relative to the ICU admission time. Thus, we only know the order of ICU stays within a hospital admission, not the order of hospital admissions within a patient. As a result, we use a Patient object to represent one hospital admission of a patient, and Visit objects to store the ICU stays within that hospital admission.
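The grouping described above can be sketched in plain Python. This is a minimal illustration on toy rows, not the library's implementation: the rows are invented, and using the unitvisitnumber column to order ICU stays within an admission is an assumption made for this sketch.

```python
from collections import defaultdict

# Toy rows imitating a few columns of the eICU patient table (invented values).
rows = [
    {"uniquepid": "002-1", "patienthealthsystemstayid": 100,
     "patientunitstayid": 1001, "unitvisitnumber": 2},
    {"uniquepid": "002-1", "patienthealthsystemstayid": 100,
     "patientunitstayid": 1000, "unitvisitnumber": 1},
    {"uniquepid": "002-1", "patienthealthsystemstayid": 101,
     "patientunitstayid": 1002, "unitvisitnumber": 1},
]


def group_icu_stays(rows):
    """Group ICU stays under their hospital admission.

    Each (uniquepid, patienthealthsystemstayid) pair plays the role of one
    Patient object; each patientunitstayid plays the role of one Visit.
    """
    admissions = defaultdict(list)
    for row in rows:
        key = (row["uniquepid"], row["patienthealthsystemstayid"])
        admissions[key].append(row)
    # Order ICU stays inside each admission (assumed ordering column).
    return {
        key: [r["patientunitstayid"]
              for r in sorted(stays, key=lambda r: r["unitvisitnumber"])]
        for key, stays in admissions.items()
    }


grouped = group_icu_stays(rows)
```

Here one uniquepid yields two separate entries, one per hospital admission, because the relative order of the two admissions cannot be recovered.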

We further support the following tables:
  • diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM codes) and diagnosis information (under attr_dict) for patients.

  • treatment: contains treatment information (eICU_TREATMENTSTRING code) for patients.

  • medication: contains medication-related order entries (eICU_DRUGNAME code) for patients.

  • lab: contains laboratory measurements (eICU_LABNAME code) for patients.

  • physicalExam: contains all physical exams (eICU_PHYSICALEXAMPATH) conducted for patients.

  • admissionDx: contains the primary diagnosis for admission to the ICU per the APACHE scoring criteria (eICU_ADMITDXPATH).

Parameters:
  • dataset_name – name of the dataset.

  • root – root directory of the raw data (should contain many csv files).

  • tables – list of tables to be loaded (e.g., ["diagnosis", "medication"]).

  • code_mapping – a dictionary containing the code mapping information. The key is a str naming the source code vocabulary, and the value takes one of two formats:

    1. a str naming the target code vocabulary;

    2. a tuple with two elements. The first element is a str naming the target code vocabulary, and the second element is a dict with keys “source_kwargs” and/or “target_kwargs”, whose values are the corresponding kwargs for the CrossMap.map() method.

    Default is an empty dict, which means the original codes are used.

  • dev – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
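The two accepted code_mapping value formats can be illustrated with a small sketch. The vocabulary pairs below are examples chosen for illustration only, the kwargs dict is deliberately left empty, and normalize_mapping_value is a hypothetical helper, not part of PyHealth.

```python
from typing import Dict, Tuple, Union

# A value is either a target vocabulary name (format 1) or a
# (target vocabulary, kwargs dict) tuple (format 2).
MappingValue = Union[str, Tuple[str, Dict[str, dict]]]

ALLOWED_KWARG_KEYS = {"source_kwargs", "target_kwargs"}

# Illustrative mapping; the vocabulary pairs are assumptions for this sketch.
code_mapping: Dict[str, MappingValue] = {
    "ICD9CM": "CCSCM",                          # format 1: plain str
    "ICD10CM": ("CCSCM", {"target_kwargs": {}}),  # format 2: tuple
}


def normalize_mapping_value(value: MappingValue):
    """Hypothetical helper: return (target, source_kwargs, target_kwargs)."""
    if isinstance(value, str):
        return value, {}, {}
    target, kwargs = value
    unknown = set(kwargs) - ALLOWED_KWARG_KEYS
    if unknown:
        raise ValueError(f"unexpected keys in kwargs dict: {sorted(unknown)}")
    return target, kwargs.get("source_kwargs", {}), kwargs.get("target_kwargs", {})


normalized = {src: normalize_mapping_value(v) for src, v in code_mapping.items()}
```

A dict shaped like code_mapping is what would be passed as the code_mapping argument; whether a given vocabulary pair is supported depends on the installed mapping data.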

task

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples

Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
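To make the relationship between samples, patient_to_index, and visit_to_index concrete, here is a minimal sketch that builds both index dicts from a toy sample list. The sample contents and the index_samples helper are hypothetical; the library may construct these attributes differently.

```python
from collections import defaultdict
from typing import Dict, List

# Toy samples; "label" stands in for any task-specific attribute.
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]


def index_samples(samples: List[dict], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of its samples."""
    index = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


patient_to_index = index_samples(samples, "patient_id")
visit_to_index = index_samples(samples, "visit_id")
```

Index dicts of this shape make it cheap to collect all samples of one patient, for example when splitting data by patient rather than by sample.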

Examples

>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...         root="/srv/local/data/physionet.org/files/eicu-crd/2.0",
...         tables=["diagnosis", "medication", "lab", "treatment", "physicalExam", "admissionDx"],
...     )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)

Helper function which parses the patient and hospital tables.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

We use a Patient object to represent one hospital admission of a patient, and Visit objects to store the ICU stays within that hospital admission.

parse_diagnosis(patients)

Helper function which parses diagnosis table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

Note

This table contains both ICD9CM and ICD10CM codes in one single cell. We need to use medcode to distinguish them.
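The idea of separating mixed codes by vocabulary lookup can be sketched as follows. This is not the medcode API: the two sets are tiny toy stand-ins for real ICD9CM and ICD10CM vocabularies, the comma-separated cell format is an assumption, and split_icd_cell is a hypothetical helper.

```python
# Toy vocabularies standing in for full ICD9CM / ICD10CM code lists.
TOY_ICD9CM = {"518.81", "038.9"}
TOY_ICD10CM = {"J96.00", "A41.9"}


def split_icd_cell(cell: str) -> dict:
    """Assign each code in a mixed cell to the vocabulary that contains it.

    Assumes codes in the cell are comma-separated (an assumption for this
    sketch). Codes found in neither toy vocabulary are reported as unknown.
    """
    result = {"ICD9CM": [], "ICD10CM": [], "unknown": []}
    for code in (part.strip() for part in cell.split(",")):
        if not code:
            continue
        if code in TOY_ICD9CM:
            result["ICD9CM"].append(code)
        elif code in TOY_ICD10CM:
            result["ICD10CM"].append(code)
        else:
            result["unknown"].append(code)
    return result


mixed = split_icd_cell("518.81, J96.00")
```

Membership lookup avoids guessing the vocabulary from the code's shape, which is unreliable because some ICD-9-CM and ICD-10-CM codes look alike.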

parse_treatment(patients)

Helper function which parses treatment table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_medication(patients)

Helper function which parses medication table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_lab(patients)

Helper function which parses lab table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_physicalexam(patients)

Helper function which parses physicalExam table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_admissiondx(patients)

Helper function which parses admissionDx (admission diagnosis) table.

Will be called in self.parse_tables().

Docs:
Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.
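Since every parse_* helper takes the patients dict and returns the updated dict, a caller such as parse_tables() can chain them. The sketch below shows one plausible way to do that with a toy class; resolving the method name by lowercasing the table name is an assumption inferred from names like parse_physicalexam, and ToyDataset is hypothetical, not the library's class.

```python
class ToyDataset:
    """Toy stand-in that records which parse helpers were invoked."""

    def __init__(self, tables):
        self.tables = tables
        self.calls = []

    def parse_basic_info(self, patients):
        self.calls.append("basic_info")
        return patients

    def parse_diagnosis(self, patients):
        self.calls.append("diagnosis")
        return patients

    def parse_physicalexam(self, patients):
        self.calls.append("physicalexam")
        return patients

    def parse_tables(self):
        # Basic patient/hospital info first, then each requested table.
        patients = self.parse_basic_info({})
        for table in self.tables:
            # Assumed convention: "physicalExam" -> parse_physicalexam.
            parser = getattr(self, f"parse_{table.lower()}")
            patients = parser(patients)
        return patients


toy = ToyDataset(tables=["diagnosis", "physicalExam"])
toy_patients = toy.parse_tables()
```

With this convention, supporting an additional table only requires adding a method with the matching name.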