pyhealth.datasets.Support2Dataset

Overview

The SUPPORT2 (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments) dataset contains data on seriously ill hospitalized adults. It includes patient demographics, diagnoses, clinical measurements, and outcomes such as survival and hospital mortality.

The dataset is commonly used for mortality prediction, length of stay prediction, and other clinical outcome prediction tasks.

class pyhealth.datasets.Support2Dataset(root, tables, dataset_name=None, config_path=None, **kwargs)

Bases: BaseDataset

A dataset class for handling SUPPORT2 (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments) data.

The SUPPORT2 dataset contains data on 9,105 seriously ill hospitalized adults from five U.S. medical centers (1989-1994), including patient demographics, diagnoses, clinical measurements, and outcomes.

The dataset is available for download from:

UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/880/support2
Vanderbilt Biostatistics: https://hbiostat.org/data/repo/supportdesc
Hugging Face: https://huggingface.co/datasets/jarrydmartinx/support2
R packages: “rms” and “Hmisc”

Citation:

Knaus WA, Harrell FE, Lynn J, et al. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Intern Med. 1995;122(3):191-203.

Parameters:

root (str) – The root directory where the dataset CSV file is stored.
tables (List[str]) – A list of tables to be included (typically ["support2"]).
dataset_name (Optional[str]) – The name of the dataset. Defaults to "support2".
config_path (Optional[str]) – The path to the configuration file. If not provided, uses the default config.
**kwargs – Additional arguments passed to BaseDataset.

Examples

>>> from pyhealth.datasets import Support2Dataset
>>> dataset = Support2Dataset(
...     root="/path/to/support2/data",
...     tables=["support2"]
... )
>>> dataset.stats()
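
Before constructing the dataset, it can help to confirm that the CSV is actually present under `root`. The following is a minimal sketch, not part of PyHealth: the helper names `find_support2_csv` and `load_support2` are hypothetical, and the filename `support2.csv` is an assumption (the real filename is determined by the dataset's config).

```python
from pathlib import Path


def find_support2_csv(root, filename="support2.csv"):
    """Return the path to the SUPPORT2 CSV under `root`.

    `filename` is an assumed default; adjust it to match your config.
    """
    csv_path = Path(root) / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected SUPPORT2 CSV at {csv_path}")
    return csv_path


def load_support2(root):
    """Check that the CSV exists, then build the dataset as in the example above."""
    find_support2_csv(root)
    # Imported here so the path check can run without PyHealth installed.
    from pyhealth.datasets import Support2Dataset

    return Support2Dataset(root=str(root), tables=["support2"])
```

Failing early with a clear path in the message is usually easier to debug than an error raised partway through table loading.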

root

The root directory where the dataset is stored.

Type: str

tables

A list of tables to be included in the dataset.

Type: List[str]

dataset_name

The name of the dataset.

Type: str

config_path

The path to the configuration file.

Type: str

clean_tmpdir()

Cleans up the temporary directory within the cache.

Return type: None

create_tmpdir()

Creates and returns a new temporary directory within the cache.

Returns: The path to the new temporary directory.
Return type: Path

property default_task: Optional[BaseTask]

Returns the default task for the dataset.

Returns: The default task, if any.
Return type: Optional[BaseTask]
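
Because `default_task` may be None, and `set_task` raises an AssertionError when neither an explicit task nor a default is available, resolving the task up front gives a clearer failure. This is a sketch under that assumption; `build_samples` is a hypothetical helper, not a PyHealth function, and it relies only on the `default_task` property and the `set_task` method documented on this page.

```python
def build_samples(dataset, task=None):
    """Run set_task with an explicit task, falling back to the dataset default."""
    resolved = task if task is not None else dataset.default_task
    if resolved is None:
        # Fail with a descriptive message instead of the AssertionError
        # that set_task would raise in this situation.
        raise ValueError("No task given and the dataset defines no default task.")
    return dataset.set_task(resolved)
```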

get_patient(patient_id)

Retrieves a Patient object for the given patient ID.

Parameters: patient_id (str) – The ID of the patient to retrieve.
Returns: The Patient object for the given ID.
Return type: Patient
Raises: AssertionError – If the patient ID is not found in the dataset.
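
Since an unknown ID surfaces as an AssertionError rather than a KeyError, callers that expect missing patients may want to wrap the lookup. The sketch below shows one way to do this; `get_patient_or_none` is a hypothetical helper and not part of PyHealth.

```python
def get_patient_or_none(dataset, patient_id):
    """Return the Patient for `patient_id`, or None if the ID is unknown."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # get_patient raises AssertionError when the ID is not in the dataset.
        return None
```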

property global_event_df: LazyFrame

Returns the cached global event dataframe as a lazy frame.

Returns: The cached event dataframe.
Return type: LazyFrame

iter_patients(df=None)

Yields Patient objects for each unique patient in the dataset.

Yields: Iterator[Patient] – An iterator over Patient objects.
Return type: Iterator[Patient]
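
Because `iter_patients` is an iterator, a few patients can be inspected without walking the whole dataset. The following sketch uses `itertools.islice` for that; `first_patients` is a hypothetical helper, and it assumes `iter_patients()` can be called with no arguments, as the signature above suggests.

```python
from itertools import islice


def first_patients(dataset, n=5):
    """Return up to `n` Patient objects, stopping as soon as `n` are collected."""
    return list(islice(dataset.iter_patients(), n))
```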

load_data()

Loads data from the specified tables.

Returns: A concatenated lazy Dask dataframe of all tables.
Return type: dd.DataFrame

load_table(table_name)

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:

task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (Optional[int]) – Number of workers for multi-threading. Defaults to self.num_workers if None.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
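
The cache layout described above can be expressed as paths, which is handy when inspecting or clearing a specific task's cache by hand. This is only a sketch of the directory naming shown in the tree: `expected_cache_paths` is a hypothetical helper, and the cache root location and the way PyHealth derives `task_uuid` and `proc_uuid` are not specified here, so all four arguments are placeholders supplied by the caller.

```python
from pathlib import Path


def expected_cache_paths(cache_root, task_name, task_uuid, proc_uuid):
    """Build the paths implied by the documented cache structure."""
    task_dir = Path(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                    # cached data for one task
        "task_df": task_dir / "task_df.ld",      # intermediate task dataframe
        "samples": samples_dir,                  # final processed samples
        "schema": samples_dir / "schema.pkl",    # saved SampleBuilder schema
    }
```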

stats()

Prints statistics about the dataset.

Return type: None

property unique_patient_ids: List[str]

Returns a list of unique patient IDs.

Returns: List of unique patient IDs.
Return type: List[str]