Datasets#
Getting Started#
New to PyHealth datasets? Start here:
This tutorial covers:
How to load and work with any PyHealth dataset (MIMIC-III, MIMIC-IV, eICU, OMOP, and many more)
Understanding the BaseDataset structure and patient representation
Parsing raw EHR data into standardized PyHealth format
Accessing patient records, visits, and clinical events
Dataset splitting for train/validation/test sets
Data Access: If you’re new and need help accessing MIMIC datasets, check the How to Contribute guide’s “Data Access for Testing” section for information on:
Getting MIMIC credentialing through PhysioNet
Using openly available demo datasets (MIMIC-III Demo, MIMIC-IV Demo)
Working with synthetic data for testing
How PyHealth Loads Data#
When you initialise a dataset, PyHealth reads the raw CSV or Parquet files
using Polars, joins the tables according to a YAML schema, and writes a
compact global_event_df.parquet cache to disk. On subsequent runs with
the same configuration it reads from cache rather than re-parsing the source
files, so startup is fast.
The result is a BaseDataset — a structured
patient→event tree. It is different from a PyTorch Dataset: it has no integer
length and you cannot index into it with dataset[i]. Think of it as a
queryable dictionary of patient records. To turn it into something a model
can train on, you call dataset.set_task() (see Tasks), which
returns a SampleDataset that is indexable and
DataLoader-ready.
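The difference between a queryable registry and an indexable dataset can be pictured without PyHealth at all. The following is a minimal sketch under stated assumptions, not PyHealth's implementation: ToyRegistry, ToySamples, toy_task, and apply_task are hypothetical names invented for illustration, and the label rule is made up.

```python
from typing import Callable, Dict, Iterator, List


class ToyRegistry:
    """Hypothetical stand-in for a patient registry: keyed by patient ID only."""

    def __init__(self, events_by_patient: Dict[str, List[dict]]) -> None:
        self._events = events_by_patient

    def get_patient(self, patient_id: str) -> List[dict]:
        return self._events[patient_id]

    def iter_patients(self) -> Iterator[str]:
        return iter(self._events)

    # Deliberately no __len__ or __getitem__: registry[0] raises TypeError.


class ToySamples:
    """Hypothetical stand-in for an indexable, DataLoader-style dataset."""

    def __init__(self, samples: List[dict]) -> None:
        self._samples = samples

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, index: int) -> dict:
        return self._samples[index]


def toy_task(patient_id: str, events: List[dict]) -> List[dict]:
    """One sample per patient: the codes seen, plus a made-up binary label."""
    codes = [e["icd_code"] for e in events]
    return [{"patient_id": patient_id, "codes": codes, "label": int("428.0" in codes)}]


def apply_task(
    registry: ToyRegistry, task: Callable[[str, List[dict]], List[dict]]
) -> ToySamples:
    """Flatten the patient -> event tree into a flat, indexable list of samples."""
    samples: List[dict] = []
    for pid in registry.iter_patients():
        samples.extend(task(pid, registry.get_patient(pid)))
    return ToySamples(samples)


registry = ToyRegistry({
    "p001": [{"icd_code": "250.00"}, {"icd_code": "428.0"}],
    "p002": [{"icd_code": "401.9"}],
})
samples = apply_task(registry, toy_task)
```

Here the registry can only be queried by patient ID, while the object returned by the task step supports len() and integer indexing, which is what a DataLoader needs.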
From BaseDataset to SampleDataset#
BaseDataset and SampleDataset serve different roles and are not
interchangeable:
BaseDataset is a queryable patient registry. It holds the raw patient→visit→event tree loaded from disk. You cannot index into it like a list — it has no integer length and is not DataLoader-ready.
SampleDataset is a PyTorch-compatible streaming dataset returned by
dataset.set_task(). Each element is a fully processed feature dictionary that a model can consume directly.
The conversion happens in one call:
import torch
from pyhealth.datasets import MIMIC3Dataset
from pyhealth.tasks import MortalityPredictionMIMIC3
dataset = MIMIC3Dataset(root="...", tables=["diagnoses_icd"])
samples = dataset.set_task(MortalityPredictionMIMIC3())
# `samples` is a SampleDataset — pass it straight to a DataLoader
loader = torch.utils.data.DataLoader(samples, batch_size=32)
Under the hood, set_task() runs a SampleBuilder that fits feature
processors (tokenisers, label encoders, etc.) across the full dataset, then
writes compressed, chunked sample files to disk via
litdata. A companion
schema.pkl stores the fitted processors so the dataset can be reloaded in
future runs without re-fitting.
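The fit-once, reload-later pattern described here can be sketched with the standard library alone. This is an illustration, not PyHealth's SampleBuilder or the actual layout of schema.pkl: ToyTokenizer and the file name toy_schema.pkl are hypothetical, and only the fitted vocabulary is pickled.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


class ToyTokenizer:
    """Hypothetical processor: maps each code to an integer index."""

    def __init__(self) -> None:
        # Index 0 is reserved for padding, 1 for codes unseen during fit.
        self.vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}

    def fit(self, sequences: Iterable[List[str]]) -> "ToyTokenizer":
        # One pass over the full dataset to collect every distinct code.
        for seq in sequences:
            for code in seq:
                self.vocab.setdefault(code, len(self.vocab))
        return self

    def transform(self, seq: List[str]) -> List[int]:
        return [self.vocab.get(code, self.vocab["<unk>"]) for code in seq]


all_sequences = [["250.00", "428.0"], ["401.9", "250.00"]]
tokenizer = ToyTokenizer().fit(all_sequences)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_schema.pkl"  # hypothetical file name
    path.write_bytes(pickle.dumps(tokenizer.vocab))
    # A later run can restore the fitted state instead of re-fitting.
    reloaded = ToyTokenizer()
    reloaded.vocab = pickle.loads(path.read_bytes())

encoded = reloaded.transform(["428.0", "999.9"])  # second code was never seen
```

Fitting must see the whole dataset before any sample is encoded, because the vocabulary determines every index. Persisting the fitted state keeps encodings identical across runs.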
SampleDataset also exposes two convenience lookups built during fitting:
samples.patient_to_index: maps a patient ID to all sample indices for that patient.
samples.record_to_index: maps a visit/record ID to the sample indices for that visit.
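To make the shape of these two lookups concrete, here is a small sketch that builds equivalent dictionaries from a toy list of samples and then uses the patient lookup for a patient-level split. The "patient_id" and "record_id" keys, the split ratio, and the splitting logic are assumptions for illustration; this is not how PyHealth builds the lookups or implements its splitter functions.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Toy samples; the "patient_id" and "record_id" keys are assumed for this sketch.
samples = [
    {"patient_id": "p001", "record_id": "v10", "label": 0},
    {"patient_id": "p001", "record_id": "v11", "label": 1},
    {"patient_id": "p002", "record_id": "v20", "label": 0},
    {"patient_id": "p003", "record_id": "v30", "label": 1},
]

patient_to_index: Dict[str, List[int]] = defaultdict(list)
record_to_index: Dict[str, List[int]] = defaultdict(list)
for i, sample in enumerate(samples):
    patient_to_index[sample["patient_id"]].append(i)
    record_to_index[sample["record_id"]].append(i)

# Patient-level split: shuffle patient IDs, then expand each ID to its samples,
# so that no patient contributes samples to both sides.
patient_ids = sorted(patient_to_index)
random.Random(0).shuffle(patient_ids)
cut = int(0.67 * len(patient_ids))  # 2 of the 3 patients go to train
train_idx = [i for pid in patient_ids[:cut] for i in patient_to_index[pid]]
test_idx = [i for pid in patient_ids[cut:] for i in patient_to_index[pid]]
```

Splitting on patient IDs rather than on sample indices avoids leaking one patient's visits into both the training and the evaluation set.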
For testing or small cohorts you can skip the disk step entirely using
InMemorySampleDataset, which holds all processed samples in RAM and is
returned by default from create_sample_dataset().
Note
Building a custom dataset or bringing your own data? See Tutorials (Tutorial 1) for a step-by-step walkthrough, and the config.yaml for Custom Datasets section below for the schema format.
Native Datasets vs Custom Datasets#
PyHealth includes native support for many standard healthcare databases — including MIMIC-III, MIMIC-IV, eICU, OMOP, and many others (see the full list in Available Datasets below). All of these come with built-in schema definitions so you can load them with just a root path and a list of tables:
from pyhealth.datasets import MIMIC3Dataset

if __name__ == '__main__':
    dataset = MIMIC3Dataset(
        root="/data/physionet.org/files/mimiciii/1.4",
        tables=["diagnoses_icd", "procedures_icd", "prescriptions"],
        cache_dir=".cache",
        dev=True,  # use 1 000 patients while exploring
    )
For any other data source — a custom patient registry, an institutional cohort,
or a non-EHR dataset — you create a subclass of BaseDataset and provide a
config.yaml file that describes your table structure.
Initialization Parameters#
root: path to the directory containing the raw data files. For MIMIC-IV specifically, use ehr_root instead of root.
tables: the table names you want to load, e.g. ["diagnoses_icd", "labevents"]. Only these tables will be accessible in patient queries downstream.
config_path: path to your config.yaml; needed for custom datasets. Native datasets have this built in and ignore the parameter.
cache_dir: where to store the cached Parquet and LitData files. PyHealth appends a UUID derived from your configuration, so different setups never overwrite each other.
num_workers: parallel processes for data loading. Increasing this can speed up set_task() on large datasets.
dev: when True, PyHealth caps the dataset at 1 000 patients. This is very useful during development because it makes each iteration complete in seconds rather than minutes. Switch to dev=False for your final training run.
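The idea of a configuration-derived cache directory can be sketched as follows. How PyHealth actually derives its UUID is not stated here, so this is only one plausible scheme: cache_subdir is a hypothetical helper, the choice of uuid5 over a JSON dump is an assumption, and "/data/mimic3" is a placeholder path.

```python
import json
import uuid
from pathlib import Path
from typing import Any, Dict


def cache_subdir(cache_dir: str, config: Dict[str, Any]) -> Path:
    """Return a cache path that is stable for one config and distinct across configs."""
    # Serialise with sorted keys so that dict ordering does not change the key.
    canonical = json.dumps(config, sort_keys=True)
    # uuid5 is deterministic: the same namespace and name always give the same UUID.
    key = uuid.uuid5(uuid.NAMESPACE_DNS, canonical)
    return Path(cache_dir) / str(key)


# "/data/mimic3" is a placeholder root used only for this sketch.
dev_run = cache_subdir(".cache", {"root": "/data/mimic3", "tables": ["diagnoses_icd"], "dev": True})
same_run = cache_subdir(".cache", {"dev": True, "tables": ["diagnoses_icd"], "root": "/data/mimic3"})
full_run = cache_subdir(".cache", {"root": "/data/mimic3", "tables": ["diagnoses_icd"], "dev": False})
```

With a scheme like this, rerunning with identical settings lands on the same directory and hits the cache, while flipping dev or changing the table list produces a new directory and leaves the old one untouched.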
config.yaml for Custom Datasets#
If you are bringing your own data, the YAML file tells PyHealth which column is the patient identifier, which column is the timestamp, and which other columns to include as event attributes:
tables:
  my_table:
    file_path: relative/path/to/file.csv
    patient_id: subject_id
    timestamp: charttime
    timestamp_format: "%Y-%m-%d %H:%M:%S"
    attributes:
      - icd_code
      - value
      - itemid
    join: []  # optional table joins
All attribute column names are lowercased internally, so ICD_CODE in
your CSV becomes icd_code in your code.
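The effect of the patient_id, timestamp, timestamp_format, and attribute settings can be illustrated with the standard library. This sketch does not use Polars and is not PyHealth's loader; the inline CSV and the shape of the resulting dictionaries are assumptions chosen to mirror the YAML above.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw file with upper-case headers.
raw_csv = io.StringIO(
    "SUBJECT_ID,CHARTTIME,ICD_CODE\n"
    "p001,2020-03-01 08:30:00,250.00\n"
)
timestamp_format = "%Y-%m-%d %H:%M:%S"  # same format string as in the YAML

events = []
for row in csv.DictReader(raw_csv):
    # Lowercase every column name: ICD_CODE -> icd_code.
    row = {name.lower(): value for name, value in row.items()}
    events.append({
        "patient_id": row["subject_id"],
        "timestamp": datetime.strptime(row["charttime"], timestamp_format),
        "icd_code": row["icd_code"],
    })
```

If timestamp_format does not match the strings in the file, strptime raises ValueError, so it is worth checking the format against a few raw rows before loading a large table.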
Querying Patients and Events#
Once a dataset is loaded, you can explore it using these methods:
dataset.unique_patient_ids # all patient IDs as a list of strings
dataset.get_patient("p001") # retrieve one Patient object
dataset.iter_patients() # iterate over all patients
dataset.stats() # print patient and event counts
Patient records are accessed through get_events(), which supports
temporal filtering and attribute-level filters:
from datetime import datetime

patient = dataset.get_patient("p001")  # same example ID as above
events = patient.get_events(
    event_type="diagnoses_icd",  # table name from your config
    start=datetime(2020, 1, 1),  # optional: exclude earlier events
    end=datetime(2020, 6, 1),  # optional: exclude later events
    filters=[("icd_code", "==", "250.00")],  # optional: attribute conditions
)
Each event in the returned list has:
event.timestamp: a Python datetime object. PyHealth normalises all timestamp columns (charttime, admittime, etc.) into this single property, so this is what you should use regardless of what the original column was called.
event.icd_code, event["icd_code"], event.attr_dict: different ways to access the other attributes. All attribute names are lowercase.
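The temporal and attribute filters accepted by get_events() can be pictured as a plain filtering loop. The sketch below is not PyHealth's implementation: filter_events is a hypothetical helper, the set of supported operators is assumed, and treating both start and end as inclusive bounds is an assumption as well.

```python
import operator
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

# Comparison operators supported by this sketch (an assumed set).
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def filter_events(
    events: List[dict],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    filters: Optional[List[Tuple[str, str, Any]]] = None,
) -> List[dict]:
    """Keep events inside [start, end] that satisfy every (attr, op, value) filter."""
    kept = []
    for event in events:
        if start is not None and event["timestamp"] < start:
            continue
        if end is not None and event["timestamp"] > end:
            continue
        if all(OPS[op](event[attr], value) for attr, op, value in (filters or [])):
            kept.append(event)
    return kept


events = [
    {"timestamp": datetime(2019, 12, 31), "icd_code": "250.00"},
    {"timestamp": datetime(2020, 2, 15), "icd_code": "250.00"},
    {"timestamp": datetime(2020, 3, 1), "icd_code": "401.9"},
]
matched = filter_events(
    events,
    start=datetime(2020, 1, 1),
    end=datetime(2020, 6, 1),
    filters=[("icd_code", "==", "250.00")],
)
```

Only the second toy event survives: the first falls before start, and the third fails the icd_code condition. Multiple filter tuples combine with a logical AND in this sketch.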
Things to Watch Out For#
A few patterns that commonly trip up new users:
BaseDataset vs SampleDataset. Models expect a SampleDataset (the
output of set_task()), not the raw BaseDataset. Passing the wrong one
will raise an error. If you see an AttributeError about input_schema
or output_schema, this is likely the cause.
Timestamp attribute names. Writing event.charttime will raise an
AttributeError because PyHealth remaps that column to event.timestamp.
The same applies to admittime, starttime, or whatever the original
column was called.
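Why event.charttime fails while event.timestamp works can be seen in a toy event class. ToyEvent is hypothetical and only mimics the access patterns described in this section; PyHealth's own event class may be structured differently.

```python
from datetime import datetime
from typing import Any, Dict


class ToyEvent:
    """Hypothetical event: one fixed timestamp field plus a dict of attributes."""

    def __init__(self, timestamp: datetime, attr_dict: Dict[str, Any]) -> None:
        self.timestamp = timestamp
        self.attr_dict = attr_dict

    def __getitem__(self, name: str) -> Any:
        return self.attr_dict[name]

    def __getattr__(self, name: str) -> Any:
        # Called only when normal lookup fails, so `timestamp` never reaches here.
        # Guard against recursion before attr_dict itself has been set.
        if name == "attr_dict":
            raise AttributeError(name)
        try:
            return self.attr_dict[name]
        except KeyError:
            raise AttributeError(name) from None


raw_row = {"charttime": "2020-03-01 08:30:00", "icd_code": "250.00"}
event = ToyEvent(
    # The source column is consumed here and is not kept as an attribute.
    timestamp=datetime.strptime(raw_row["charttime"], "%Y-%m-%d %H:%M:%S"),
    attr_dict={"icd_code": raw_row["icd_code"]},
)
```

In this sketch event.icd_code and event["icd_code"] return the same value, event.timestamp is a datetime, and event.charttime raises AttributeError because the original column name is no longer stored anywhere on the object.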
Column name casing. PyHealth lowercases all column names at load time.
Even if your source CSV has ICD_CODE, you access it as event.icd_code.
dev=True in production. The dev flag is great for exploring data but
it caps the dataset at 1 000 patients. Remember to switch to dev=False
before running a full training job.
Multiprocessing guard. Scripts that call set_task() should wrap their
top-level code in if __name__ == '__main__':. See Tasks for details.
Available Datasets#
- pyhealth.datasets.BaseDataset
BaseDataset members: root, tables, dataset_name, config, global_event_df, dev, create_tmpdir(), clean_tmpdir(), load_data(), load_table(), unique_patient_ids, get_patient(), iter_patients(), stats(), default_task, set_task()
- pyhealth.datasets.SampleDataset
SampleDataset members: input_schema, output_schema, input_processors, output_processors, patient_to_index, record_to_index, dataset_name, task_name, subset(), close(), get_len(), load_state_dict(), on_demand_bytes, reset(), reset_state_dict(), set_batch_size(), set_drop_last(), set_epoch(), set_num_workers(), set_shuffle(), state_dict()
- pyhealth.datasets.MIMIC3Dataset
- pyhealth.datasets.MIMIC4Dataset
- pyhealth.datasets.FHIRDataset
- pyhealth.datasets.MIMIC4FHIR
- pyhealth.datasets.MedicalTranscriptionsDataset
- pyhealth.datasets.CardiologyDataset
- pyhealth.datasets.eICUDataset
- pyhealth.datasets.ISRUCDataset
- pyhealth.datasets.MIMICExtractDataset
- pyhealth.datasets.OMOPDataset
- pyhealth.datasets.DREAMTDataset
- pyhealth.datasets.SHHSDataset
- pyhealth.datasets.SleepEDFDataset
- pyhealth.datasets.EHRShotDataset
- pyhealth.datasets.Support2Dataset
- pyhealth.datasets.BMDHSDataset
- pyhealth.datasets.COVID19CXRDataset
- pyhealth.datasets.ChestXray14Dataset
- pyhealth.datasets.PhysioNetDeIDDataset
- pyhealth.datasets.TUABDataset
- pyhealth.datasets.TUEVDataset
- pyhealth.datasets.ClinVarDataset
- pyhealth.datasets.COSMICDataset
- pyhealth.datasets.TCGAPRADDataset
- pyhealth.datasets.splitter
Splitter functions: sample_balanced(), split_by_visit(), split_by_patient(), split_by_sample(), split_by_visit_conformal(), split_by_patient_conformal(), split_by_patient_conformal_tuh(), split_by_sample_conformal_tuh(), split_by_patient_tuh(), split_by_sample_tuh(), split_by_sample_conformal()
- pyhealth.datasets.utils
Utility functions: dateutil_parse(), pad_sequence(), create_directory(), hash_str(), strptime(), padyear(), flatten_list(), list_nested_levels(), is_homo_list(), collate_fn_dict(), collate_fn_dict_with_padding(), get_dataloader(), save_processors(), load_processors()