pyhealth.datasets.MIMIC4FHIR#
A pre-bundled FHIRDataset for the PhysioNet
MIMIC-IV on FHIR export
(R4, demo 2.1.0 and full release). All ingest logic — file globs, per-resource
projection, downstream event schema — is described by the bundled YAML at
pyhealth/datasets/fhir/configs/mimic4fhir.yaml; this class only points at
that path.
For everything outside the MIMIC-specific defaults (transform registry,
Col / ResourceSpec syntax, the three-tier cache story), see the parent
page: pyhealth.datasets.FHIRDataset.
Quick start#
from pyhealth.datasets import MIMIC4FHIR
from pyhealth.tasks.mpf_clinical_prediction import MPFClinicalPredictionTask

def main():
    ds = MIMIC4FHIR(root="/data/mimic-iv-fhir")
    sample_ds = ds.set_task(MPFClinicalPredictionTask(), num_workers=1)
    # ... split / dataloader / model / trainer ...


if __name__ == "__main__":
    main()
For the full end-to-end demo (training EHR-Mamba on MPF samples) see
examples/mimic4fhir_mpf_ehrmamba.py.
Resource coverage#
The bundled config flattens six FHIR resource types out of the PhysioNet export: Patient, Encounter, Condition, Observation, MedicationRequest, and Procedure.
PhysioNet shards that contain only other resource types
(MedicationAdministration, Specimen, Organization, …) are skipped
at the file level by the bundled glob_patterns. To include them, override
glob_patterns= at the constructor and add a resource_specs: entry plus
matching tables: entry in a copy of the YAML.
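To illustrate file-level skipping, here is a minimal sketch of how a list of glob patterns can select shards by file name. The patterns and file names are hypothetical placeholders, not the values in the bundled YAML, and the matching uses the standard-library fnmatch module rather than whatever PyHealth does internally.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, one per resource type the bundled config flattens.
# The real values live in the bundled YAML and may differ.
GLOB_PATTERNS = [
    "*Patient*.ndjson.gz",
    "*Encounter*.ndjson.gz",
    "*Condition*.ndjson.gz",
    "*Observation*.ndjson.gz",
    "*MedicationRequest*.ndjson.gz",
    "*Procedure*.ndjson.gz",
]


def select_shards(filenames, patterns):
    """Return the file names that match at least one glob pattern."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Made-up shard names for illustration only.
shards = [
    "Patient.ndjson.gz",
    "ObservationLabs.ndjson.gz",
    "MedicationAdministration.ndjson.gz",
    "Specimen.ndjson.gz",
]

# With the default-style patterns, the last two shards are never opened.
kept = select_shards(shards, GLOB_PATTERNS)

# Adding one pattern brings MedicationAdministration shards back in.
extended = select_shards(
    shards, GLOB_PATTERNS + ["*MedicationAdministration*.ndjson.gz"]
)
```

Matching a shard is only the first step: a newly matched resource type still needs its own resource_specs: and tables: entries in the copied YAML before it can be flattened.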
Customising#
The bundled config is the easiest starting point for authoring a similar ingest
for other FHIR exports. Copy
pyhealth/datasets/fhir/configs/mimic4fhir.yaml, edit the
resource_specs: and tables: blocks for the resources you care about,
and either:
- pass config_path=... directly to FHIRDataset(root=..., config_path=...), or
- subclass FHIRDataset and set DEFAULT_CONFIG_PATH on the subclass.
See the “Customising for a non-MIMIC FHIR export” section of pyhealth.datasets.FHIRDataset for the step-by-step.
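The copy step can be sketched with the standard library alone. The helper name copy_config, the default file name my_fhir.yaml, and every path below are hypothetical. The constructor call appears only in comments because it requires PyHealth and a real FHIR export.

```python
import shutil
from pathlib import Path


def copy_config(bundled_yaml: Path, workdir: Path, name: str = "my_fhir.yaml") -> Path:
    """Copy the bundled YAML to workdir/name and return the new path.

    Refuses to overwrite an existing file so earlier edits are not lost.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    target = workdir / name
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copyfile(bundled_yaml, target)
    return target


# Usage sketch (placeholder paths, not executed here):
#   custom = copy_config(
#       Path("pyhealth/datasets/fhir/configs/mimic4fhir.yaml"),
#       Path("configs"),
#   )
#   # ... edit the resource_specs: and tables: blocks in `custom` ...
#   ds = FHIRDataset(root="/data/other-fhir", config_path=str(custom))
```

Keeping the edited copy outside the installed package means a PyHealth upgrade cannot silently overwrite it.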
API reference#
- class pyhealth.datasets.MIMIC4FHIR(root, config_path=None, glob_pattern=None, glob_patterns=None, output_format=None, max_patients=None, ingest_num_shards=None, cache_dir=None, num_workers=1, dev=False)[source]#
Bases: FHIRDataset
MIMIC-IV-on-FHIR (R4) dataset.
Streams the PhysioNet MIMIC-IV on FHIR NDJSON.GZ export into flattened Patient/Encounter/Condition/Observation/MedicationRequest/Procedure tables, then runs the standard BaseDataset pipeline.
The bundled config at pyhealth/datasets/fhir/configs/mimic4fhir.yaml matches both the PhysioNet 2.1.0 demo and the full release. Override config_path= to point at a customised copy.
Examples
>>> ds = MIMIC4FHIR(root="/data/mimic-iv-fhir", max_patients=500)
>>> sample_ds = ds.set_task(task, num_workers=4)
- DEFAULT_CONFIG_PATH: Optional[str] = '/home/docs/checkouts/readthedocs.org/user_builds/pyhealth/envs/latest/lib/python3.12/site-packages/pyhealth/datasets/fhir/configs/mimic4fhir.yaml'#
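Default ingest YAML path; set by source subclasses to bundle a config. The absolute path shown above comes from the documentation build environment; in an installed package it points at pyhealth/datasets/fhir/configs/mimic4fhir.yaml inside the local pyhealth installation.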
- create_tmpdir()#
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- property default_task: Optional[BaseTask]#
Returns the default task for the dataset.
- Returns:
The default task, if any.
- Return type:
Optional[BaseTask]
- get_patient(patient_id)#
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
- property global_event_df: LazyFrame#
Returns the cached global event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
LazyFrame
- iter_patients(df=None)#
Yields Patient objects for each unique patient in the dataset.
- load_data()#
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)#
Load one flattened table into the standard event schema.
Deviations from BaseDataset.load_table (CSV via _scan_csv_tsv_gz):
- Reads pre-built flat tables (parquet/csv/tsv) under prepared_tables_dir.
- Timestamp parsing uses errors="coerce" + utc=True (FHIR ISO strings include timezone suffix or partial dates).
- Strips tz-aware timestamps to naive UTC for Dask compat.
- Drops rows with null patient_id before returning.
- Return type:
DataFrame
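The timestamp and null-handling rules listed for load_table can be illustrated with a small standard-library sketch. This is not the PyHealth implementation, which works on dataframes. The helper to_naive_utc, the sample rows, and the choice to treat offset-free values as UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def to_naive_utc(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string into a naive UTC datetime.

    Unparseable input yields None, mirroring the "coerce" behaviour.
    """
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so normalise it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Convert to UTC, then drop the tzinfo to get a naive timestamp.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


# Made-up rows: one valid, one unparseable, one with a null patient_id.
rows = [
    {"patient_id": "p1", "timestamp": "2130-01-05T08:30:00-05:00"},
    {"patient_id": "p2", "timestamp": "not-a-date"},
    {"patient_id": None, "timestamp": "2130-01-05T08:30:00Z"},
]

# Drop rows without a patient_id, then normalise the remaining timestamps.
cleaned = [
    {**row, "timestamp": to_naive_utc(row["timestamp"])}
    for row in rows
    if row["patient_id"] is not None
]
```

Coercing bad values to a null timestamp keeps the row and its patient linkage, whereas a missing patient_id leaves nothing to attach the event to, so that row is dropped.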
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
SampleDataset
- Raises:
AssertionError – If no default task is found and task is None.
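When inspecting or cleaning up the task cache layout described under set_task, a small helper can enumerate which processed sample directories exist. The sketch below builds a throwaway tree with made-up names and lists it. The helper list_sample_caches is hypothetical, and the location of the real cache root depends on how the dataset was constructed.

```python
import tempfile
from pathlib import Path


def list_sample_caches(cache_root: Path) -> dict:
    """Map each task directory name to its sorted samples_*.ld directory names."""
    found = {}
    for task_dir in sorted(p for p in cache_root.iterdir() if p.is_dir()):
        samples = sorted(
            s.name for s in task_dir.glob("samples_*.ld") if s.is_dir()
        )
        if samples:
            found[task_dir.name] = samples
    return found


# Build a temporary tree that mimics the documented layout (all names made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo_task_1234" / "task_df.ld").mkdir(parents=True)
    (root / "demo_task_1234" / "samples_abcd.ld").mkdir()
    (root / "demo_task_1234" / "samples_abcd.ld" / "schema.pkl").touch()
    layout = list_sample_caches(root)
```

A task directory with several samples_*.ld entries would indicate that the same task was processed with more than one set of processors.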