pyhealth.datasets.Support2Dataset
Overview
The SUPPORT2 (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments) dataset contains data on seriously ill hospitalized adults. It includes patient demographics, diagnoses, clinical measurements, and outcomes such as survival and hospital mortality.
The dataset is commonly used for mortality prediction, length of stay prediction, and other clinical outcome prediction tasks.
- class pyhealth.datasets.Support2Dataset(root, tables, dataset_name=None, config_path=None, **kwargs)
Bases: BaseDataset
A dataset class for handling SUPPORT2 (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments) data.
The SUPPORT2 dataset contains data on 9,105 seriously ill hospitalized adults from five U.S. medical centers (1989-1994), including patient demographics, diagnoses, clinical measurements, and outcomes.
- Dataset is available for download from:
UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/880/support2
Vanderbilt Biostatistics: https://hbiostat.org/data/repo/supportdesc
Hugging Face: https://huggingface.co/datasets/jarrydmartinx/support2
R packages: “rms” and “Hmisc”
- Citation:
Knaus WA, Harrell FE, Lynn J, et al. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Intern Med. 1995;122(3):191-203.
- Parameters:
root (str) – The root directory where the dataset CSV file is stored.
tables (List[str]) – A list of tables to be included (typically ["support2"]).
dataset_name (Optional[str]) – The name of the dataset. Defaults to "support2".
config_path (Optional[str]) – The path to the configuration file. If not provided, uses the default config.
**kwargs – Additional arguments passed to BaseDataset.
Examples
>>> from pyhealth.datasets import Support2Dataset
>>> dataset = Support2Dataset(
...     root="/path/to/support2/data",
...     tables=["support2"]
... )
>>> dataset.stats()
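Since root must point at the directory holding the dataset CSV file, it can be useful to check that directory before constructing the dataset. The following is a minimal sketch using only the standard library. The helper name check_support2_root and the file name support2.csv are assumptions made for illustration; the file name expected by the default config is not stated on this page.

```python
import csv
import tempfile
from pathlib import Path


def check_support2_root(root, filename="support2.csv"):
    """Return the CSV header row if root/filename exists.

    Hypothetical helper, not part of PyHealth. The default file name
    is an assumption and may differ from what the config expects.
    """
    csv_path = Path(root) / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected SUPPORT2 CSV at {csv_path}")
    with csv_path.open(newline="") as f:
        # An empty file yields an empty header list instead of raising.
        return next(csv.reader(f), [])


# Demonstration on a throwaway directory with a tiny stand-in file.
# The single data row is made up and is not taken from the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "support2.csv").write_text("age,death,sex\n60,0,male\n")
    columns = check_support2_root(tmp)
```

If the check passes, the same directory can be passed as root to Support2Dataset as in the example above.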
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- property default_task: Optional[BaseTask]
Returns the default task for the dataset.
- Returns:
The default task, if any.
- Return type:
Optional[BaseTask]
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
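Because get_patient raises AssertionError for an unknown ID, callers that look up IDs from an external source may want to handle that case explicitly. The sketch below shows one possible pattern. Both safe_get_patient and _StubDataset are hypothetical names introduced here; the stub only imitates the documented behaviour so the wrapper can be run without the real dataset.

```python
def safe_get_patient(dataset, patient_id):
    """Return dataset.get_patient(patient_id), or None if the ID is unknown.

    Hypothetical wrapper, not part of PyHealth.
    """
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        return None


class _StubDataset:
    """Stand-in used only to exercise the wrapper; not a PyHealth class."""

    def __init__(self, patients):
        self._patients = patients

    def get_patient(self, patient_id):
        # Mirrors the documented contract: unknown IDs raise AssertionError.
        if patient_id not in self._patients:
            raise AssertionError(f"Patient {patient_id} not found")
        return self._patients[patient_id]


stub = _StubDataset({"1": "patient-1"})
found = safe_get_patient(stub, "1")
missing = safe_get_patient(stub, "does-not-exist")
```

With a real Support2Dataset instance, the same wrapper would be called as safe_get_patient(dataset, patient_id).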
- property global_event_df: LazyFrame
Returns the cached global event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
LazyFrame
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
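To make the cache structure described for set_task concrete, the sketch below builds the corresponding paths with pathlib. The helper task_cache_paths is hypothetical, as are the example cache root, task name, and identifiers. It also assumes that schema.pkl sits inside the samples directory. How PyHealth derives task_uuid and proc_uuid is not covered here.

```python
from pathlib import Path


def task_cache_paths(cache_root, task_name, task_uuid, proc_uuid):
    """Build the expected cache paths for one task.

    Hypothetical helper, not part of PyHealth. The nesting of
    schema.pkl under the samples directory is an assumption.
    """
    task_dir = Path(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,
        "task_df": task_dir / "task_df.ld",
        "samples": samples_dir,
        "schema": samples_dir / "schema.pkl",
    }


# All argument values below are made up for illustration.
paths = task_cache_paths("pyhealth_cache", "ExampleTask", "abc123", "def456")
```

A helper like this only computes paths; it does not create or read any cache files.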