pyhealth.datasets.eICUDataset
The open eICU Collaborative Research Database; refer to the eICU documentation for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.eICUDataset(root, tables, dataset_name=None, config_path=None, **kwargs)
Bases: BaseDataset
A dataset class for handling eICU data.
The eICU dataset is a large dataset of de-identified health records of ICU patients. The dataset is available at https://eicu-crd.mit.edu/.
- The basic information is stored in the following tables:
  - patient: defines a patient (uniquepid), a hospital admission (patienthealthsystemstayid), and an ICU stay (patientunitstayid) in the database.
  - hospital: contains information about a hospital (e.g., region).
Note that in eICU, a patient can have multiple hospital admissions and each hospital admission can have multiple ICU stays. The data in eICU is centered around the ICU stay and all timestamps are relative to the ICU admission time.
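The patient, hospital admission, and ICU stay nesting described above can be sketched in plain Python. This is only an illustration: the rows below are made-up values, and `build_hierarchy` is a hypothetical helper, not part of PyHealth. Only the three column names come from the patient table description.

```python
from collections import defaultdict

# Hypothetical rows that mimic three identifier columns of the eICU patient table.
rows = [
    {"uniquepid": "p1", "patienthealthsystemstayid": 100, "patientunitstayid": 1001},
    {"uniquepid": "p1", "patienthealthsystemstayid": 100, "patientunitstayid": 1002},
    {"uniquepid": "p1", "patienthealthsystemstayid": 101, "patientunitstayid": 1003},
    {"uniquepid": "p2", "patienthealthsystemstayid": 200, "patientunitstayid": 2001},
]


def build_hierarchy(rows):
    """Group ICU stays under hospital admissions, and admissions under patients."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in rows:
        patient = row["uniquepid"]
        admission = row["patienthealthsystemstayid"]
        tree[patient][admission].append(row["patientunitstayid"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {patient: dict(admissions) for patient, admissions in tree.items()}


hierarchy = build_hierarchy(rows)
# Patient "p1" has two hospital admissions; admission 100 has two ICU stays.
```

Because the data is centered on the ICU stay, patientunitstayid is the finest-grained key in this nesting.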
- We further support the following tables:
  - diagnosis: contains ICD diagnoses (ICD9CM and ICD10CM codes) and diagnosis information for patients.
  - treatment: contains treatment information for patients.
  - medication: contains medication-related order entries for patients.
  - lab: contains laboratory measurements for patients.
  - physicalexam: contains all physical exams conducted for patients.
  - admissiondx: contains the primary diagnosis for admission to the ICU per the APACHE scoring criteria.
Examples
>>> from pyhealth.datasets import eICUDataset
>>> dataset = eICUDataset(
...     root="/path/to/eicu-crd/2.0",
...     tables=["diagnosis", "medication", "treatment"],
... )
>>> dataset.stats()
>>> patient = dataset.get_patient("patient_id")
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- property default_task: Optional[BaseTask]
Returns the default task for the dataset.
- Returns:
The default task, if any.
- Return type:
Optional[BaseTask]
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
- Raises:
AssertionError – If the patient ID is not found in the dataset.
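Since get_patient raises AssertionError for an unknown ID, callers that look up IDs from an external source may want to guard the call. The following is a minimal sketch: `get_patient_or_none` and `_StubDataset` are hypothetical names introduced here so the example runs without the eICU files, and the stub only imitates the documented behavior.

```python
def get_patient_or_none(dataset, patient_id):
    """Return the patient for patient_id, or None if the dataset does not have it."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented failure mode: the patient ID is not in the dataset.
        return None


class _StubDataset:
    """Hypothetical stand-in that mimics the documented get_patient contract."""

    def __init__(self, patients):
        self._patients = patients

    def get_patient(self, patient_id):
        if patient_id not in self._patients:
            raise AssertionError(f"Patient {patient_id} not found in dataset")
        return self._patients[patient_id]


stub = _StubDataset({"known-id": "patient-object"})
found = get_patient_or_none(stub, "known-id")
missing = get_patient_or_none(stub, "unknown-id")
```

With a real eICUDataset instance, the same helper could be called in place of the stub.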
- property global_event_df: LazyFrame
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe, loaded lazily.
- Return type:
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/          # Cached data for specific task based on task name, schema, and args
    task_df.ld/                   # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/       # Final processed samples after applying processors
        schema.pkl                # Saved SampleBuilder schema
        *.bin                     # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Defaults to self.num_workers if None.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
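The cache layout described for set_task can be expressed as a small path-building sketch. This is an illustration under assumptions: `expected_cache_paths` is a hypothetical helper, the cache root and identifier values are placeholders, and the nesting of schema.pkl under the samples directory is inferred from the layout description rather than confirmed from the implementation.

```python
from pathlib import PurePosixPath


def expected_cache_paths(cache_root, task_name, task_uuid, proc_uuid):
    """Build the directory and file paths implied by the documented cache layout."""
    task_dir = PurePosixPath(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                   # per-task cache directory
        "task_df": task_dir / "task_df.ld",     # intermediate task dataframe
        "samples": samples_dir,                 # final processed samples
        "schema": samples_dir / "schema.pkl",   # saved SampleBuilder schema (assumed nesting)
    }


# Placeholder values, chosen only for illustration.
paths = expected_cache_paths("/tmp/cache", "mortality", "abc", "def")
```

A helper like this can be handy when checking which cached artifacts exist on disk before re-running a task.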