pyhealth.datasets.MIMIC4Dataset

MIMIC4Dataset wraps the open Medical Information Mart for Intensive Care (MIMIC-IV) database; refer to its documentation for more information. PyHealth processes this database into a well-structured dataset object that supports flexible and convenient modeling and analysis.

class pyhealth.datasets.MIMIC4Dataset(ehr_root=None, note_root=None, cxr_root=None, ehr_tables=None, note_tables=None, cxr_tables=None, ehr_config_path=None, note_config_path=None, cxr_config_path=None, dataset_name='mimic4', dev=False, cache_dir=None, num_workers=1)

Bases: BaseDataset

Unified MIMIC-IV dataset with support for EHR, clinical notes, and X-rays.

This class combines data from multiple MIMIC-IV sources:

- Core EHR data (demographics, admissions, diagnoses, etc.)
- Clinical notes (discharge summaries, radiology reports)
- Chest X-rays (images and metadata)

Parameters:

ehr_root (Optional[str]) – Root directory for MIMIC-IV EHR data
note_root (Optional[str]) – Root directory for MIMIC-IV notes data
cxr_root (Optional[str]) – Root directory for MIMIC-CXR data
ehr_tables (Optional[List[str]]) – List of EHR tables to include
note_tables (Optional[List[str]]) – List of clinical note tables to include
cxr_tables (Optional[List[str]]) – List of X-ray tables to include
ehr_config_path (Optional[str]) – Path to the EHR config file
note_config_path (Optional[str]) – Path to the note config file
cxr_config_path (Optional[str]) – Path to the CXR config file
dataset_name (str) – Name of the dataset
dev (bool) – Whether to enable dev mode (limit to 1000 patients)

Examples

>>> from pyhealth.datasets import MIMIC4Dataset
>>> # Load unified MIMIC-IV dataset with EHR, notes, and CXR data
>>> dataset = MIMIC4Dataset(
...     ehr_root="/path/to/mimic-iv/2.2",
...     note_root="/path/to/mimic-iv-note/2.2",
...     cxr_root="/path/to/mimic-cxr/2.0.0",
...     ehr_tables=["diagnoses_icd", "procedures_icd", "labevents"],
...     note_tables=["discharge", "radiology"],
...     cxr_tables=["metadata", "chexpert"],
... )
>>> dataset.stats()
>>>
>>> # Load with only EHR and notes (without CXR)
>>> dataset = MIMIC4Dataset(
...     ehr_root="/path/to/mimic-iv/2.2",
...     note_root="/path/to/mimic-iv-note/2.2",
...     ehr_tables=["diagnoses_icd", "labevents"],
...     note_tables=["discharge"],
... )
>>> dataset.stats()
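
Because every root and table argument defaults to None, one way to load only the sources available locally is to assemble the keyword arguments conditionally. The sketch below is illustrative only: `build_mimic4_kwargs` is a hypothetical helper, not part of PyHealth, and it assumes that leaving out a source's root is how that source is skipped, as the second example above suggests. The check that at least one root is given is the helper's own, not documented library behavior.

```python
from typing import Dict, List, Optional


def build_mimic4_kwargs(
    ehr_root: Optional[str] = None,
    note_root: Optional[str] = None,
    cxr_root: Optional[str] = None,
    ehr_tables: Optional[List[str]] = None,
    note_tables: Optional[List[str]] = None,
    cxr_tables: Optional[List[str]] = None,
    dev: bool = False,
) -> Dict[str, object]:
    """Hypothetical helper: keep only the sources whose root is given."""
    kwargs: Dict[str, object] = {"dev": dev}
    sources = {
        "ehr": (ehr_root, ehr_tables),
        "note": (note_root, note_tables),
        "cxr": (cxr_root, cxr_tables),
    }
    for prefix, (root, tables) in sources.items():
        if root is None:
            continue  # skip sources that are not available locally
        kwargs[f"{prefix}_root"] = root
        if tables is not None:
            kwargs[f"{prefix}_tables"] = tables
    if len(kwargs) == 1:
        # Only "dev" is present, so no data source was supplied.
        raise ValueError("Provide at least one of ehr_root, note_root, cxr_root")
    return kwargs


kwargs = build_mimic4_kwargs(
    ehr_root="/path/to/mimic-iv/2.2",
    ehr_tables=["diagnoses_icd", "labevents"],
    dev=True,
)
# With PyHealth installed and the data in place, this would be passed as:
#     dataset = MIMIC4Dataset(**kwargs)
```

Setting dev=True while experimenting keeps the first load small, since dev mode limits the dataset to 1000 patients.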

load_data()

Combines data from all initialized sub-datasets into a unified global event dataframe.

Returns: Combined lazy frame from all data sources
Return type: pl.LazyFrame

clean_tmpdir()

Cleans up the temporary directory within the cache.

Return type: None

create_tmpdir()

Creates and returns a new temporary directory within the cache.

Returns: The path to the new temporary directory.
Return type: Path

property default_task: Optional[BaseTask]

Returns the default task for the dataset.

Returns: The default task, if any.
Return type: Optional[BaseTask]

get_patient(patient_id)

Retrieves a Patient object for the given patient ID.

Parameters: patient_id (str) – The ID of the patient to retrieve.
Returns: The Patient object for the given ID.
Return type: Patient
Raises: AssertionError – If the patient ID is not found in the dataset.
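
Since get_patient raises AssertionError for an unknown ID, callers that expect missing IDs may want to wrap the lookup. The following is a minimal sketch, not part of PyHealth: `get_patient_or_none` is a hypothetical helper, and `_StubDataset` is a stand-in that mimics only the documented behavior so the example runs without MIMIC-IV data.

```python
from typing import Any, Dict, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Hypothetical helper: return None instead of raising for unknown IDs."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented failure mode when the ID is not in the dataset.
        return None


class _StubDataset:
    """Stand-in used only to make this sketch runnable without real data."""

    def __init__(self, patients: Dict[str, Any]) -> None:
        self._patients = patients

    def get_patient(self, patient_id: str) -> Any:
        assert patient_id in self._patients, f"Patient {patient_id} not found"
        return self._patients[patient_id]


stub = _StubDataset({"10000032": {"patient_id": "10000032"}})
found = get_patient_or_none(stub, "10000032")
missing = get_patient_or_none(stub, "does-not-exist")
```

Note that Python drops assert statements when run with the -O flag, so code relying on AssertionError should not be run in optimized mode.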

property global_event_df: LazyFrame

Returns the cached global event dataframe as a lazy frame.

Returns: The cached global event dataframe.
Return type: LazyFrame

iter_patients(df=None)

Yields Patient objects for each unique patient in the dataset.

Yields: Iterator[Patient] – An iterator over Patient objects.
Return type: Iterator[Patient]
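
Because iter_patients returns an iterator, a few patients can be inspected without consuming the whole dataset. The sketch below uses the standard library's itertools.islice; `take` is a hypothetical helper and `_stub_iter_patients` is a stand-in generator used only so the example runs without MIMIC-IV data.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(items: Iterable[T], n: int) -> List[T]:
    """Return at most the first n items without exhausting the iterable."""
    return list(islice(items, n))


def _stub_iter_patients() -> Iterator[str]:
    """Stand-in for dataset.iter_patients(); yields placeholder IDs."""
    for i in range(1_000_000):
        yield f"patient-{i}"


first_three = take(_stub_iter_patients(), 3)
# With a real dataset, the equivalent call would be:
#     first_three = take(dataset.iter_patients(), 3)
```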

load_table(table_name)

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:

task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Defaults to self.num_workers when None.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
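
The cache structure described for set_task can be sketched with pathlib, which may help when inspecting or clearing cached samples by hand. This is an illustration under assumptions: `task_cache_paths` is a hypothetical helper, the UUID values are placeholders (how PyHealth derives them is not stated here), and placing the task directory directly under the cache directory is assumed.

```python
from pathlib import Path
from typing import Dict


def task_cache_paths(
    cache_dir: str, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Hypothetical helper mirroring the documented cache layout."""
    # {task_name}_{task_uuid}/ holds everything cached for one task.
    task_root = Path(cache_dir) / f"{task_name}_{task_uuid}"
    # samples_{proc_uuid}.ld/ holds the final processed samples.
    samples_dir = task_root / f"samples_{proc_uuid}.ld"
    return {
        "task_root": task_root,
        "task_df": task_root / "task_df.ld",  # intermediate task dataframe
        "samples": samples_dir,
        "schema": samples_dir / "schema.pkl",  # saved SampleBuilder schema
    }


# Placeholder values chosen for illustration only.
paths = task_cache_paths("/tmp/pyhealth_cache", "my_task", "abc123", "def456")
```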

stats()

Prints statistics about the dataset.

Return type: None

property unique_patient_ids: List[str]

Returns a list of unique patient IDs.

Returns: List of unique patient IDs.
Return type: List[str]