pyhealth.datasets.TCGAPRADDataset#

The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) dataset provides multi-omics data, including somatic mutations and clinical information, for prostate cancer patients. For more information, see the TCGA-PRAD project page on the GDC Data Portal. This dataset was contributed as part of the Prostate-VarBench benchmarking work (arXiv:2511.09576).

class pyhealth.datasets.TCGAPRADDataset(root, tables=None, dataset_name=None, config_path=None, **kwargs)[source]#

Bases: BaseDataset

TCGA Prostate Adenocarcinoma (PRAD) dataset.

The Cancer Genome Atlas (TCGA) PRAD dataset contains multi-omics data for prostate adenocarcinoma patients, including somatic mutations, clinical data, and survival outcomes. This dataset enables cancer survival prediction and mutation analysis tasks.

Dataset is available at: https://portal.gdc.cancer.gov/projects/TCGA-PRAD

Parameters:

root (str) – Root directory of the raw data containing the TCGA-PRAD files.
tables (Optional[List[str]]) – Optional list of additional tables to load beyond the defaults.
dataset_name (Optional[str]) – Optional name of the dataset. Defaults to “tcga_prad”.
config_path (Optional[str]) – Optional path to the configuration file. If not provided, uses the default config in the configs directory.

root#: Root directory of the raw data.

dataset_name#: Name of the dataset.

config_path#: Path to the configuration file.

Examples

>>> from pyhealth.datasets import TCGAPRADDataset
>>> dataset = TCGAPRADDataset(root="/path/to/tcga_prad")
>>> dataset.stats()
>>> samples = dataset.set_task()
>>> print(samples[0])

static prepare_metadata(root)[source]#

Prepare metadata for the TCGA-PRAD dataset.

Converts raw TCGA MAF and clinical files to standardized CSV format.

Parameters: root (str) – Root directory containing the TCGA-PRAD files.
Return type: None
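
The conversion step can be pictured with a small standard-library sketch. This is not PyHealth's implementation: maf_to_rows and rows_to_csv are hypothetical helpers, the MAF fragment is made up, and the choice of columns is an assumption. It relies only on two general facts about TCGA MAF files: they are tab-delimited with optional "#" comment lines, and the first 12 characters of a TCGA sample barcode identify the patient.

```python
import csv
import io


def maf_to_rows(maf_text, columns):
    """Parse tab-delimited MAF text and keep only the requested columns.

    Adds a patient_id column taken from the first 12 characters of the
    tumor sample barcode.
    """
    # Skip blank lines and "#"-prefixed comment lines such as "#version 2.4".
    data_lines = [
        line for line in maf_text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = []
    for record in reader:
        row = {col: record[col] for col in columns}
        row["patient_id"] = record["Tumor_Sample_Barcode"][:12]
        rows.append(row)
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up MAF fragment, for illustration only.
EXAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "SPOP\tMissense_Mutation\tTCGA-AA-0001-01A-11D-A000-08\n"
    "TP53\tNonsense_Mutation\tTCGA-AA-0002-01A-11D-A000-08\n"
)

COLUMNS = ["Hugo_Symbol", "Variant_Classification"]
rows = maf_to_rows(EXAMPLE_MAF, COLUMNS)
csv_text = rows_to_csv(rows, ["patient_id"] + COLUMNS)
print(csv_text)
```

The actual files, column selection, and output names used by prepare_metadata may differ; consult the source for the exact behavior.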

property default_task#

Returns the default task for this dataset.

Returns: The default prediction task.
Return type: CancerSurvivalPrediction
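
The set_task documentation states that the default task is used when task is None, and that an AssertionError is raised if no default exists. A minimal sketch of that fallback follows; resolve_task is a hypothetical helper written for illustration, not a PyHealth function.

```python
def resolve_task(dataset, task=None):
    """Return `task`, or fall back to the dataset's default task.

    Hypothetical helper mirroring the documented fallback behavior.
    """
    if task is None:
        # getattr keeps the sketch usable with objects lacking the property.
        task = getattr(dataset, "default_task", None)
    assert task is not None, "No default task found; pass a task explicitly."
    return task
```

Under this reading, calling dataset.set_task() with no arguments would behave like dataset.set_task(dataset.default_task).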

clean_tmpdir()#

Cleans up the temporary directory within the cache.

Return type: None

create_tmpdir()#

Creates and returns a new temporary directory within the cache.

Returns: The path to the new temporary directory.
Return type: Path

get_patient(patient_id)#

Retrieves a Patient object for the given patient ID.

Parameters: patient_id (str) – The ID of the patient to retrieve.
Returns: The Patient object for the given ID.
Return type: Patient
Raises: AssertionError – If the patient ID is not found in the dataset.
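
Because a missing ID raises AssertionError rather than returning None, callers that look up IDs from an external source may want to guard the call. One possible wrapper is sketched below; get_patient_or_none is a hypothetical helper name, and it uses only the documented get_patient method.

```python
def get_patient_or_none(dataset, patient_id):
    """Return the Patient for `patient_id`, or None if the ID is unknown.

    Hypothetical convenience wrapper around the documented get_patient().
    """
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # get_patient is documented to raise AssertionError for unknown IDs.
        return None
```

Note that Python skips assert statements when run with the -O flag, so if the check is implemented as a plain assert, this wrapper is only reliable in non-optimized runs.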

property global_event_df: LazyFrame#

Returns the cached global event dataframe.

Returns: The cached event dataframe, as a lazily evaluated frame.
Return type: LazyFrame

iter_patients(df=None)#

Yields Patient objects for each unique patient in the dataset.

Yields: Patient – One Patient object per unique patient.
Return type: Iterator[Patient]
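
Since iter_patients returns an iterator, a few patients can be inspected without materializing the whole cohort. The sketch below assumes only the documented method; peek_patients is a hypothetical helper name.

```python
from itertools import islice


def peek_patients(dataset, n=5):
    """Return up to `n` Patient objects from the start of the iterator.

    Hypothetical helper; islice stops pulling from the iterator after
    `n` items, so remaining patients are never constructed.
    """
    return list(islice(dataset.iter_patients(), n))
```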

load_data()#

Loads data from the specified tables.

Returns: A concatenated, lazily evaluated Dask dataframe of all tables.
Return type: dd.DataFrame

load_table(table_name)#

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:

task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (Optional[int]) – Number of workers used for parallel processing. If None, defaults to self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
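
To make the cache structure described above concrete, the following sketch builds a throwaway directory with the same shape and summarizes it. It uses only the standard library; summarize_task_cache is a hypothetical helper, and the directory and file names in the demo are made up. It does not read or write any real PyHealth cache.

```python
import tempfile
from pathlib import Path


def summarize_task_cache(task_dir):
    """Summarize one `{task_name}_{task_uuid}` cache directory."""
    task_dir = Path(task_dir)
    summary = {
        "has_task_df": (task_dir / "task_df.ld").is_dir(),
        "sample_sets": {},
    }
    # Each samples_{proc_uuid}.ld directory holds one processed sample set.
    for samples_dir in sorted(task_dir.glob("samples_*.ld")):
        summary["sample_sets"][samples_dir.name] = {
            "has_schema": (samples_dir / "schema.pkl").is_file(),
            "num_bin_files": len(list(samples_dir.glob("*.bin"))),
        }
    return summary


# Build a temporary directory that mimics the documented layout.
# The task name and UUID-like suffixes are invented for this demo.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "CancerSurvivalPrediction_abc123"
    samples_dir = task_dir / "samples_def456.ld"
    (task_dir / "task_df.ld").mkdir(parents=True)
    samples_dir.mkdir()
    (samples_dir / "schema.pkl").touch()
    (samples_dir / "chunk-0.bin").touch()
    cache_summary = summarize_task_cache(task_dir)

print(cache_summary)
```

Because the samples directory name depends on the processors, passing different pre-fitted processors would presumably produce a separate samples_{proc_uuid}.ld directory under the same task directory.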

stats()#

Prints statistics about the dataset.

Return type: None

property unique_patient_ids: List[str]#

Returns a list of unique patient IDs.

Returns: List of unique patient IDs.
Return type: List[str]
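
A common use of the patient ID list is a patient-level train/test split, which keeps all samples from one patient on the same side. The sketch below is one simple way to do this with the standard library; split_patient_ids is a hypothetical helper, and the IDs in the demo are placeholders rather than real TCGA barcodes.

```python
import random


def split_patient_ids(patient_ids, test_fraction=0.2, seed=0):
    """Split patient IDs into disjoint (train, test) lists.

    Sorting first makes the result independent of input order, and a
    seeded Random instance makes the shuffle reproducible.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Placeholder IDs; in practice this would be dataset.unique_patient_ids.
train_ids, test_ids = split_patient_ids(["P%d" % i for i in range(10)])
print(len(train_ids), len(test_ids))
```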