pyhealth.datasets.MIMIC4Dataset
The open Medical Information Mart for Intensive Care (MIMIC-IV) database; see the MIMIC-IV documentation for more information. We process this database into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.
- class pyhealth.datasets.MIMIC4Dataset(ehr_root=None, note_root=None, cxr_root=None, ehr_tables=None, note_tables=None, cxr_tables=None, ehr_config_path=None, note_config_path=None, cxr_config_path=None, dataset_name='mimic4', dev=False, cache_dir=None, num_workers=1)
Bases:
BaseDataset
Unified MIMIC-IV dataset with support for EHR, clinical notes, and X-rays.
This class combines data from multiple MIMIC-IV sources:
- Core EHR data (demographics, admissions, diagnoses, etc.)
- Clinical notes (discharge summaries, radiology reports)
- Chest X-rays (images and metadata)
- Parameters:
ehr_root (Optional[str]) – Root directory for MIMIC-IV EHR data
note_root (Optional[str]) – Root directory for MIMIC-IV notes data
cxr_root (Optional[str]) – Root directory for MIMIC-CXR data
ehr_tables (Optional[List[str]]) – List of EHR tables to include
note_tables (Optional[List[str]]) – List of clinical note tables to include
cxr_tables (Optional[List[str]]) – List of X-ray tables to include
ehr_config_path (Optional[str]) – Path to the EHR config file
note_config_path (Optional[str]) – Path to the note config file
cxr_config_path (Optional[str]) – Path to the CXR config file
dataset_name (str) – Name of the dataset
dev (bool) – Whether to enable dev mode (limit to 1000 patients)
Examples
>>> from pyhealth.datasets import MIMIC4Dataset
>>> # Load unified MIMIC-IV dataset with EHR, notes, and CXR data
>>> dataset = MIMIC4Dataset(
...     ehr_root="/path/to/mimic-iv/2.2",
...     note_root="/path/to/mimic-iv-note/2.2",
...     cxr_root="/path/to/mimic-cxr/2.0.0",
...     ehr_tables=["diagnoses_icd", "procedures_icd", "labevents"],
...     note_tables=["discharge", "radiology"],
...     cxr_tables=["metadata", "chexpert"],
... )
>>> dataset.stats()
>>>
>>> # Load with only EHR and notes (without CXR)
>>> dataset = MIMIC4Dataset(
...     ehr_root="/path/to/mimic-iv/2.2",
...     note_root="/path/to/mimic-iv-note/2.2",
...     ehr_tables=["diagnoses_icd", "labevents"],
...     note_tables=["discharge"],
... )
>>> dataset.stats()
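The examples above pass only the roots and tables for the sources being loaded. When those choices come from a script or config file, it can help to assemble the keyword arguments first. The following is a minimal sketch: `build_mimic4_kwargs` is a hypothetical helper, not part of PyHealth, and its two checks (tables require a matching root, at least one root is required) are assumptions of the sketch rather than documented library behavior.

```python
from typing import Any, Dict, Optional, Sequence


def build_mimic4_kwargs(
    ehr_root: Optional[str] = None,
    ehr_tables: Optional[Sequence[str]] = None,
    note_root: Optional[str] = None,
    note_tables: Optional[Sequence[str]] = None,
    cxr_root: Optional[str] = None,
    cxr_tables: Optional[Sequence[str]] = None,
    dev: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for MIMIC4Dataset."""
    sources = {
        "ehr": (ehr_root, ehr_tables),
        "note": (note_root, note_tables),
        "cxr": (cxr_root, cxr_tables),
    }
    kwargs: Dict[str, Any] = {"dev": dev}
    for prefix, (root, tables) in sources.items():
        if root is None:
            # Assumption of this sketch: tables without a root are a mistake.
            if tables:
                raise ValueError(f"{prefix}_tables given without {prefix}_root")
            continue
        kwargs[f"{prefix}_root"] = root
        if tables:
            kwargs[f"{prefix}_tables"] = list(tables)
    if len(kwargs) == 1:
        # Only "dev" is present, so no data source was selected.
        raise ValueError("at least one of ehr_root, note_root, cxr_root is required")
    return kwargs


kwargs = build_mimic4_kwargs(
    ehr_root="/path/to/mimic-iv/2.2",
    ehr_tables=["diagnoses_icd", "labevents"],
    dev=True,
)
# dataset = MIMIC4Dataset(**kwargs)  # needs pyhealth and local MIMIC-IV files
```

Unused sources are simply left out of the dictionary, so the constructor's own `None` defaults apply to them.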
- load_data()
Combines data from all initialized sub-datasets into a unified global event dataframe.
- Returns:
Combined lazy frame from all data sources
- Return type:
pl.LazyFrame
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- property default_task: Optional[BaseTask]
Returns the default task for the dataset.
- Returns:
The default task, if any.
- Return type:
Optional[BaseTask]
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
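Because an unknown ID raises `AssertionError`, callers that look up IDs from an external list may prefer a lookup that returns `None` instead. The following is a minimal sketch: `get_patient_or_none` is a hypothetical wrapper, and `_StubDataset` is only a stand-in that mimics the documented behavior so the snippet runs without MIMIC-IV files.

```python
from typing import Any, Dict, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Return dataset.get_patient(patient_id), or None if the ID is unknown."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented failure mode for an ID that is not in the dataset.
        return None


class _StubDataset:
    """Stand-in for a loaded dataset; not the real PyHealth class."""

    def __init__(self, patients: Dict[str, Any]) -> None:
        self._patients = patients

    def get_patient(self, patient_id: str) -> Any:
        if patient_id not in self._patients:
            raise AssertionError(f"Patient {patient_id} not found")
        return self._patients[patient_id]


stub = _StubDataset({"10000032": {"patient_id": "10000032"}})
found = get_patient_or_none(stub, "10000032")
missing = get_patient_or_none(stub, "does-not-exist")
```

With a real dataset, the same wrapper would be called as `get_patient_or_none(dataset, some_id)`.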
- property global_event_df: LazyFrame
Returns the path to the cached event dataframe.
- Returns:
The path to the cached event dataframe.
- Return type:
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
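The two documented exceptions point to different problems: `ValueError` means the table name is absent from the config, while `FileNotFoundError` means the config entry exists but a CSV file is missing on disk. The following is a minimal sketch of checking several tables and reporting each failure separately; `load_tables` and `_StubDataset` are hypothetical, and the stub only imitates the documented error behavior.

```python
from typing import Any, Dict, Iterable, Set, Tuple


def load_tables(
    dataset: Any, table_names: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Try each table; return (loaded tables, problems keyed by table name)."""
    loaded: Dict[str, Any] = {}
    problems: Dict[str, str] = {}
    for name in table_names:
        try:
            loaded[name] = dataset.load_table(name)
        except ValueError:
            problems[name] = "not in config"
        except FileNotFoundError:
            problems[name] = "missing file"
    return loaded, problems


class _StubDataset:
    """Stand-in that raises the documented exceptions; not the real class."""

    def __init__(self, configured: Set[str], on_disk: Set[str]) -> None:
        self._configured = configured
        self._on_disk = on_disk

    def load_table(self, table_name: str) -> str:
        if table_name not in self._configured:
            raise ValueError(f"Table {table_name} not found in config")
        if table_name not in self._on_disk:
            raise FileNotFoundError(f"CSV for {table_name} not found")
        return f"<dataframe for {table_name}>"


stub = _StubDataset(
    configured={"diagnoses_icd", "labevents"}, on_disk={"diagnoses_icd"}
)
loaded, problems = load_tables(stub, ["diagnoses_icd", "labevents", "typo_table"])
```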
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/          # Cached data for specific task based on task name, schema, and args
    task_df.ld/                   # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/       # Final processed samples after applying processors
        schema.pkl                # Saved SampleBuilder schema
        *.bin                     # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
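The cache layout described for `set_task` can be inspected with the standard library, for example to see which tasks already have processed samples. The following is a minimal sketch: `list_task_caches` is a hypothetical helper, and it assumes the `{task_name}_{task_uuid}/` directories sit directly inside the directory passed in, which the documentation does not confirm. The demo builds a mock layout in a temporary directory rather than touching a real cache.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def list_task_caches(cache_dir: Union[str, Path]) -> Dict[str, List[str]]:
    """Map each task cache directory to the names of its samples_*.ld dirs."""
    result: Dict[str, List[str]] = {}
    for task_dir in sorted(Path(cache_dir).iterdir()):
        # A task cache is recognised by its intermediate task_df.ld/ directory.
        if not (task_dir / "task_df.ld").is_dir():
            continue
        result[task_dir.name] = sorted(
            p.name for p in task_dir.glob("samples_*.ld") if p.is_dir()
        )
    return result


# Mock layout with made-up task and processor identifiers.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "mortality_1234"
    (task / "task_df.ld").mkdir(parents=True)
    (task / "samples_abcd.ld").mkdir()
    (Path(tmp) / "unrelated").mkdir()
    found = list_task_caches(tmp)
```

A task directory with an empty list has its intermediate dataframe cached but no processed samples yet.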