pyhealth.datasets.eICUDataset#

The open eICU Collaborative Research Database; refer to the doc for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.eICUDataset(root, tables, dataset_name=None, config_path=None, **kwargs)[source]#

Bases: BaseDataset

A dataset class for handling eICU data.

The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.

The basic information is stored in the following tables:
  • patient: defines a patient (uniquepid), a hospital admission (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.

  • hospital: contains information about a hospital (e.g., region).

Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time.
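The hierarchy and relative timestamps described above can be illustrated with a small sketch. The rows below are made-up toy values rather than real eICU records, and the helper names `group_stays` and `offset_to_delta` are hypothetical. The sketch also assumes that offsets are expressed in minutes from ICU admission.

```python
from collections import defaultdict
from datetime import timedelta

# Toy rows (made-up values): one row per ICU stay, mirroring the patient table keys.
rows = [
    {"uniquepid": "p1", "patienthealthsystemstayid": 10, "patientunitstayid": 100},
    {"uniquepid": "p1", "patienthealthsystemstayid": 10, "patientunitstayid": 101},
    {"uniquepid": "p1", "patienthealthsystemstayid": 11, "patientunitstayid": 102},
    {"uniquepid": "p2", "patienthealthsystemstayid": 20, "patientunitstayid": 200},
]


def group_stays(rows):
    """Nest ICU stays under hospital admissions under patients."""
    tree = defaultdict(lambda: defaultdict(list))
    for r in rows:
        tree[r["uniquepid"]][r["patienthealthsystemstayid"]].append(
            r["patientunitstayid"]
        )
    return {pid: dict(admissions) for pid, admissions in tree.items()}


def offset_to_delta(offset_minutes):
    """Express an offset (assumed: minutes from ICU admission) as a timedelta."""
    return timedelta(minutes=offset_minutes)


hierarchy = group_stays(rows)
```

Negative offsets are meaningful in this scheme: they describe events recorded before the ICU admission time.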

We further support the following tables:
  • diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM codes) and diagnosis information for patients.

  • treatment: contains treatment information for patients.

  • medication: contains medication related order entries for patients.

  • lab: contains laboratory measurements for patients.

  • physicalexam: contains all physical exams conducted for patients.

  • admissiondx: contains the primary diagnosis for admission to the ICU per the APACHE scoring criteria.
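Since table names are passed as plain strings, a typo only surfaces when the dataset is built. A minimal sketch of an up-front check follows; the helper `check_tables` is hypothetical and not part of PyHealth, and the set only mirrors the optional tables listed above (the basic patient and hospital tables are described separately and are not covered here).

```python
# Table names copied from the list above; the helper itself is hypothetical.
SUPPORTED_TABLES = {
    "diagnosis",
    "treatment",
    "medication",
    "lab",
    "physicalexam",
    "admissiondx",
}


def check_tables(tables):
    """Return the requested tables as a list, or raise if a name is not listed above."""
    unknown = sorted(set(tables) - SUPPORTED_TABLES)
    if unknown:
        raise ValueError(f"Unsupported eICU tables: {unknown}")
    return list(tables)
```

The checked list could then be passed as the `tables` argument of `eICUDataset`.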

root#

The root directory where the dataset is stored.

Type:

str

tables#

A list of tables to be included in the dataset.

Type:

List[str]

dataset_name#

The name of the dataset.

Type:

Optional[str]

config_path#

The path to the configuration file.

Type:

Optional[str]

Examples

>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...     root="/path/to/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "treatment"],
... )
>>> dataset.stats()
>>> patient = dataset.get_patient("patient_id")

clean_tmpdir()#

Cleans up the temporary directory within the cache.

Return type:

None

create_tmpdir()#

Creates and returns a new temporary directory within the cache.

Returns:

The path to the new temporary directory.

Return type:

Path

property default_task: Optional[BaseTask]#

Returns the default task for the dataset.

Returns:

The default task, if any.

Return type:

Optional[BaseTask]

get_patient(patient_id)#

Retrieves a Patient object for the given patient ID.

Parameters:

patient_id (str) – The ID of the patient to retrieve.

Returns:

The Patient object for the given ID.

Return type:

Patient

Raises:

AssertionError – If the patient ID is not found in the dataset.
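Because an unknown ID raises AssertionError rather than returning None, callers that look up IDs from an external source may want to guard the call. The following is a sketch only: `get_patient_or_none` is a hypothetical helper, and `_StubDataset` is a stand-in that imitates the documented behavior so the example runs without the eICU files.

```python
def get_patient_or_none(dataset, patient_id):
    """Return dataset.get_patient(patient_id), or None if the ID is unknown."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented behavior: get_patient raises AssertionError for unknown IDs.
        return None


class _StubDataset:
    """Stand-in for eICUDataset (hypothetical), used only to exercise the helper."""

    def __init__(self, patients):
        self._patients = patients

    def get_patient(self, patient_id):
        assert patient_id in self._patients, f"Patient {patient_id} not found"
        return self._patients[patient_id]


stub = _StubDataset({"known-id": "patient-object"})
```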

property global_event_df: LazyFrame#

Returns the cached event dataframe as a lazy frame.

Returns:

The cached event dataframe.

Return type:

LazyFrame

iter_patients(df=None)#

Yields Patient objects for each unique patient in the dataset.

Yields:

Iterator[Patient] – An iterator over Patient objects.

Return type:

Iterator[Patient]
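Since iter_patients yields patients one at a time, a few patients can be inspected without materializing the whole dataset. A minimal sketch using `itertools.islice` follows; `first_patients` is a hypothetical helper, and `_StubDataset` is a stand-in so the example runs without the eICU files.

```python
from itertools import islice


def first_patients(dataset, n):
    """Collect at most n patients without exhausting the iterator."""
    return list(islice(dataset.iter_patients(), n))


class _StubDataset:
    """Stand-in for eICUDataset (hypothetical) that counts how many items it yields."""

    def __init__(self, ids):
        self.ids = ids
        self.yielded = 0

    def iter_patients(self, df=None):
        for pid in self.ids:
            self.yielded += 1
            yield pid


stub = _StubDataset(["a", "b", "c", "d"])
sample = first_patients(stub, 2)
```

Only the first two items are pulled from the generator; the remaining patients are never produced.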

load_data()#

Loads data from the specified tables.

Returns:

A concatenated Dask dataframe of all tables.

Return type:

dd.DataFrame

load_table(table_name)#

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:
  • task (Optional[BaseTask]) – The task to set. Uses default task if None.

  • num_workers (Optional[int]) – Number of workers for multi-threading. Defaults to self.num_workers if None.

  • input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.

  • output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
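The cache layout described above can be inspected with the standard library. The sketch below lists the `samples_{proc_uuid}.ld` directories under one task cache directory that contain a `schema.pkl` file. The helper `list_sample_caches` is hypothetical, and the directory names in the demo are made up to mirror the documented structure; where PyHealth actually places the cache root is not assumed here.

```python
import tempfile
from pathlib import Path


def list_sample_caches(task_cache_dir):
    """Return names of samples_*.ld directories that contain a schema.pkl file."""
    root = Path(task_cache_dir)
    return sorted(
        p.name
        for p in root.glob("samples_*.ld")
        if p.is_dir() and (p / "schema.pkl").is_file()
    )


# Demo on a throwaway directory that imitates the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "mytask_1234"  # hypothetical {task_name}_{task_uuid}
    (task_dir / "task_df.ld").mkdir(parents=True)
    complete = task_dir / "samples_abcd.ld"
    complete.mkdir()
    (complete / "schema.pkl").write_bytes(b"")
    (task_dir / "samples_efgh.ld").mkdir()  # no schema.pkl, so it is skipped
    found = list_sample_caches(task_dir)
```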

stats()#

Prints statistics about the dataset.

Return type:

None

property unique_patient_ids: List[str]#

Returns a list of unique patient IDs.

Returns:

List of unique patient IDs.

Return type:

List[str]