pyhealth.datasets.MIMIC4FHIR

A pre-bundled FHIRDataset for the PhysioNet MIMIC-IV on FHIR export (R4, demo 2.1.0 and full release). All ingest logic — file globs, per-resource projection, downstream event schema — is described by the bundled YAML at pyhealth/datasets/fhir/configs/mimic4fhir.yaml; this class only points at that path.

For everything outside the MIMIC-specific defaults (transform registry, Col / ResourceSpec syntax, the three-tier cache story), see the parent page: pyhealth.datasets.FHIRDataset.

Quick start

from pyhealth.datasets import MIMIC4FHIR
from pyhealth.tasks.mpf_clinical_prediction import MPFClinicalPredictionTask

def main():
    ds = MIMIC4FHIR(root="/data/mimic-iv-fhir")
    sample_ds = ds.set_task(MPFClinicalPredictionTask(), num_workers=1)
    # ... split / dataloader / model / trainer ...

if __name__ == "__main__":
    main()

For the full end-to-end demo (training EHR-Mamba on MPF samples) see examples/mimic4fhir_mpf_ehrmamba.py.

Resource coverage

The bundled config flattens six FHIR resource types out of the PhysioNet export: Patient, Encounter, Condition, Observation, MedicationRequest, and Procedure.

PhysioNet shards that contain only other resource types (MedicationAdministration, Specimen, Organization, …) are skipped at the file level by the bundled glob_patterns. To include them, pass glob_patterns= to the constructor, add a resource_specs: entry plus a matching tables: entry in a copy of the YAML, and point config_path= at that copy.
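
As a rough illustration of file-level filtering, the sketch below applies a list of glob patterns to shard file names with the standard-library fnmatch module. The file names and patterns are illustrative assumptions, not the values in the bundled YAML, and the matching logic inside FHIRDataset may differ.

```python
from fnmatch import fnmatchcase


def select_shards(filenames, glob_patterns):
    """Return the file names matched by at least one glob pattern, in input order."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in glob_patterns)
    ]


# Hypothetical shard names, for illustration only.
shards = [
    "MimicPatient.ndjson.gz",
    "MimicEncounter.ndjson.gz",
    "MimicMedicationRequest.ndjson.gz",
    "MimicMedicationAdministration.ndjson.gz",
    "MimicSpecimen.ndjson.gz",
]

# Hypothetical patterns standing in for the bundled defaults.
default_patterns = [
    "*Patient*.ndjson.gz",
    "*Encounter*.ndjson.gz",
    "*MedicationRequest*.ndjson.gz",
]
# Adding one more pattern pulls an otherwise skipped shard into the ingest.
extended_patterns = default_patterns + ["*MedicationAdministration*.ndjson.gz"]

selected_default = select_shards(shards, default_patterns)
selected_extended = select_shards(shards, extended_patterns)
```

A shard that matches no pattern (here the Specimen file) is never opened, which is why widening the patterns is only the first step: the extra resource type still needs a projection in the YAML.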

Customising

The bundled config is the easiest starting point for authoring a similar ingest for other FHIR exports. Copy pyhealth/datasets/fhir/configs/mimic4fhir.yaml, edit the resource_specs: and tables: blocks for the resources you care about, and either:

  • pass config_path=... directly to FHIRDataset(root=..., config_path=...), or

  • subclass FHIRDataset and set DEFAULT_CONFIG_PATH on the subclass.

See the “Customising for a non-MIMIC FHIR export” section of pyhealth.datasets.FHIRDataset for step-by-step instructions.
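
The two options can be pictured with a stand-in class. The sketch below does not import PyHealth: StandInFHIRDataset is a hypothetical minimal base that only mimics the documented pattern (a class-level DEFAULT_CONFIG_PATH that an explicit config_path= overrides). The resolution order and all paths are assumptions for illustration.

```python
from typing import Optional


class StandInFHIRDataset:
    """Hypothetical stand-in for FHIRDataset; models config-path resolution only."""

    DEFAULT_CONFIG_PATH: Optional[str] = None

    def __init__(self, root: str, config_path: Optional[str] = None):
        self.root = root
        # Assumed resolution order: explicit argument first, then the class default.
        self.config_path = config_path or self.DEFAULT_CONFIG_PATH
        if self.config_path is None:
            raise ValueError("no config_path given and no DEFAULT_CONFIG_PATH set")


class MyHospitalFHIR(StandInFHIRDataset):
    # Second option: bundle an edited copy of the YAML with the subclass.
    DEFAULT_CONFIG_PATH = "configs/my_hospital_fhir.yaml"


# First option: pass the edited copy directly.
direct = StandInFHIRDataset(root="/data/my-fhir", config_path="configs/my_copy.yaml")
# Second option in use: no config_path needed at the call site.
bundled = MyHospitalFHIR(root="/data/my-fhir")
```

Passing config_path= suits one-off experiments, while a subclass is more convenient when the same config is reused across scripts.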

API reference

class pyhealth.datasets.MIMIC4FHIR(root, config_path=None, glob_pattern=None, glob_patterns=None, output_format=None, max_patients=None, ingest_num_shards=None, cache_dir=None, num_workers=1, dev=False)

Bases: FHIRDataset

MIMIC-IV-on-FHIR (R4) dataset.

Streams the PhysioNet MIMIC-IV on FHIR NDJSON.GZ export into flattened Patient/Encounter/Condition/Observation/MedicationRequest/Procedure tables, then runs the standard BaseDataset pipeline.
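
To make "streaming NDJSON.GZ into flattened tables" concrete, here is a minimal standard-library sketch that reads gzipped newline-delimited JSON one resource at a time and projects Patient resources to flat rows. id, gender, and birthDate are standard FHIR R4 Patient fields; the output column names and the sample records are assumptions for illustration, and the actual projection is defined by the bundled YAML.

```python
import gzip
import io
import json


def iter_ndjson_gz(fileobj):
    """Yield one parsed JSON object per non-empty line of a gzipped NDJSON stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def flatten_patient(resource):
    """Project a FHIR Patient resource to a flat row (column names are hypothetical)."""
    return {
        "patient_id": resource.get("id"),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


# Build a tiny in-memory shard so the sketch runs without any files on disk.
sample = [
    {"resourceType": "Patient", "id": "p1", "gender": "female", "birthDate": "2100-01-01"},
    {"resourceType": "Specimen", "id": "s1"},
    {"resourceType": "Patient", "id": "p2", "gender": "male"},
]
raw = gzip.compress("\n".join(json.dumps(r) for r in sample).encode("utf-8"))

# Keep only Patient resources; missing fields become None in the flat row.
rows = [
    flatten_patient(res)
    for res in iter_ndjson_gz(io.BytesIO(raw))
    if res.get("resourceType") == "Patient"
]
```

Because the generator yields one resource per line, memory use stays bounded by the largest single resource rather than by the shard size.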

The bundled config at pyhealth/datasets/fhir/configs/mimic4fhir.yaml matches both the PhysioNet 2.1.0 demo and the full release. Override config_path= to point at a customised copy.

Examples

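>>> from pyhealth.datasets import MIMIC4FHIR
>>> from pyhealth.tasks.mpf_clinical_prediction import MPFClinicalPredictionTask
>>> ds = MIMIC4FHIR(root="/data/mimic-iv-fhir", max_patients=500)
>>> sample_ds = ds.set_task(MPFClinicalPredictionTask(), num_workers=4)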

DEFAULT_CONFIG_PATH: Optional[str] = '<site-packages>/pyhealth/datasets/fhir/configs/mimic4fhir.yaml'

Default ingest YAML path; set by source subclasses to bundle a config.

DATASET_NAME: str = 'mimic4fhir'

Dataset name used for cache identity / logging.

DEFAULT_OUTPUT_FORMAT: str = 'parquet'

Default flat-table output format.

clean_tmpdir()

Cleans up the temporary directory within the cache.

Return type:

None

create_tmpdir()

Creates and returns a new temporary directory within the cache.

Returns:

The path to the new temporary directory.

Return type:

Path

property default_task: Optional[BaseTask]

Returns the default task for the dataset.

Returns:

The default task, if any.

Return type:

Optional[BaseTask]

get_patient(patient_id)

Retrieves a Patient object for the given patient ID.

Parameters:

patient_id (str) – The ID of the patient to retrieve.

Returns:

The Patient object for the given ID.

Return type:

Patient

Raises:

AssertionError – If the patient ID is not found in the dataset.

property global_event_df: LazyFrame

Returns the cached global event dataframe as a lazy frame.

Returns:

The cached event dataframe, evaluated lazily.

Return type:

LazyFrame

iter_patients(df=None)

Yields Patient objects for each unique patient in the dataset.

Yields:

Iterator[Patient] – An iterator over Patient objects.

Return type:

Iterator[Patient]

load_data()

Loads data from the specified tables.

Returns:

A lazily evaluated Dask DataFrame that concatenates all tables.

Return type:

dd.DataFrame

load_table(table_name)

Load one flattened table into the standard event schema.

Deviations from BaseDataset.load_table (CSV via _scan_csv_tsv_gz):

  • Reads pre-built flat tables (parquet/csv/tsv) under prepared_tables_dir.

  • Timestamp parsing uses errors="coerce" + utc=True, because FHIR ISO strings may carry a timezone suffix or be partial dates.

  • Strips tz-aware timestamps to naive UTC for Dask compatibility.

  • Drops rows with null patient_id before returning.

Return type:

DataFrame
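
The timestamp and null-handling rules listed for load_table can be approximated without pandas. The sketch below is a standard-library stand-in, not the library's code: unparseable strings become None (the role errors="coerce" plays), offsets are converted to UTC and then dropped to give naive UTC values, and rows without a patient_id are discarded. The row layout and helper names are assumptions.

```python
from datetime import datetime, timezone


def to_naive_utc(value):
    """Parse an ISO 8601 string to a naive UTC datetime; return None on failure."""
    try:
        parsed = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None  # coerce bad or missing input to null instead of raising
    if parsed.tzinfo is None:
        # Assumption: input without an offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Convert to UTC, then drop the tzinfo to get a naive UTC value.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


def normalise_rows(rows):
    """Drop rows with a null patient_id and normalise each remaining timestamp."""
    return [
        {**row, "timestamp": to_naive_utc(row.get("timestamp"))}
        for row in rows
        if row.get("patient_id") is not None
    ]


# Hypothetical flat-table rows.
rows = [
    {"patient_id": "p1", "timestamp": "2180-07-23T14:00:00-04:00"},
    {"patient_id": "p2", "timestamp": "not-a-timestamp"},
    {"patient_id": None, "timestamp": "2180-07-23T14:00:00+00:00"},
]
cleaned = normalise_rows(rows)
```

The first row keeps its instant but is expressed in UTC (18:00 rather than 14:00 at -04:00), the second survives with a null timestamp, and the third is dropped because it has no patient_id.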

property prepared_tables_dir: Path

Directory holding the pre-built flat tables (parquet/csv/tsv) that load_table reads.

Return type:

Path

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:

  • task (Optional[BaseTask]) – The task to set. Uses default task if None.

  • num_workers (Optional[int]) – Number of workers for multi-threading. Defaults to self.num_workers when None.

  • input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.

  • output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
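
The cache layout keys each task directory on the task name, schema, and arguments. How PyHealth computes task_uuid is not described here; the sketch below shows one assumed way to derive a stable identifier with uuid.uuid5 over a canonical JSON dump, purely to illustrate why changing an argument lands in a different directory. The function name, namespace choice, payload shape, and sample values are all hypothetical.

```python
import json
import uuid


def task_cache_dirname(task_name, schema, args):
    """Build a '{task_name}_{task_uuid}' directory name from the task's identity."""
    # sort_keys makes the dump independent of dict insertion order.
    payload = json.dumps(
        {"name": task_name, "schema": schema, "args": args}, sort_keys=True
    )
    # uuid5 is deterministic: the same payload always yields the same UUID.
    task_uuid = uuid.uuid5(uuid.NAMESPACE_URL, payload)
    return f"{task_name}_{task_uuid}"


# Hypothetical task identity values.
first = task_cache_dirname("mpf", {"concept_ids": "sequence"}, {"max_len": 512})
repeat = task_cache_dirname("mpf", {"concept_ids": "sequence"}, {"max_len": 512})
changed = task_cache_dirname("mpf", {"concept_ids": "sequence"}, {"max_len": 1024})
```

With a scheme like this, rerunning set_task with identical inputs reuses the existing directory, while any change to the schema or arguments produces a fresh one instead of silently reading stale samples.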

stats()

Prints statistics about the dataset.

Return type:

None

property unique_patient_ids: List[str]

Returns a list of unique patient IDs.

Returns:

List of unique patient IDs.

Return type:

List[str]

resource_specs: Mapping[str, ResourceSpec]

glob_patterns: List[str]