pyhealth.datasets.COSMICDataset
The COSMIC (Catalogue of Somatic Mutations in Cancer) dataset provides comprehensive information about somatic mutations in human cancers. For more information see COSMIC. This dataset was contributed as part of the Prostate-VarBench benchmarking work (arXiv:2511.09576).
- class pyhealth.datasets.COSMICDataset(root, tables=None, dataset_name=None, config_path=None, **kwargs)
Bases: BaseDataset
COSMIC dataset for cancer somatic mutation analysis.
COSMIC (Catalogue Of Somatic Mutations In Cancer) is the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. This dataset enables mutation pathogenicity prediction and cancer gene analysis tasks.
Dataset is available at: https://cancer.sanger.ac.uk/cosmic/download
Note
COSMIC requires registration and license agreement for data access.
- Parameters:
root (str) – Root directory of the raw data containing the COSMIC files.
tables (List[str]) – Optional list of additional tables to load beyond defaults.
dataset_name (Optional[str]) – Optional name of the dataset. Defaults to “cosmic”.
config_path (Optional[str]) – Optional path to the configuration file. If not provided, uses the default config in the configs directory.
- root
Root directory of the raw data.
- dataset_name
Name of the dataset.
- config_path
Path to the configuration file.
Examples
>>> from pyhealth.datasets import COSMICDataset
>>> dataset = COSMICDataset(root="/path/to/cosmic")
>>> dataset.stats()
>>> samples = dataset.set_task()
>>> print(samples[0])
- static prepare_metadata(root)
Prepare metadata for the COSMIC dataset.
Converts raw COSMIC TSV/CSV files to standardized CSV format.
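The kind of conversion described here can be illustrated with a minimal standard-library sketch. This is not the code behind prepare_metadata: the function name tsv_to_csv and the choice to lower-case column names are assumptions made purely for illustration.

```python
import csv
from pathlib import Path


def tsv_to_csv(src: Path, dst: Path) -> int:
    """Rewrite a tab-separated file as CSV and return the number of data rows.

    Hypothetical helper, not part of PyHealth.
    """
    with src.open(newline="", encoding="utf-8") as fin, \
            dst.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout)  # quotes fields that contain commas

        header = next(reader, None)
        if header is None:
            return 0  # empty input: nothing to write

        # Assumption: column names are normalised to stripped lower case.
        writer.writerow([name.strip().lower() for name in header])

        count = 0
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

Using the csv module rather than a plain string replace keeps values that contain commas intact, because the writer quotes them.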
- property default_task
Returns the default task for this dataset.
- Returns:
The default prediction task.
- Return type:
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
- Raises:
AssertionError – If the patient ID is not found in the dataset.
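Because a missing ID is reported as an AssertionError rather than a KeyError, callers that expect unknown IDs may want to wrap the lookup. The sketch below relies only on the documented get_patient(patient_id) call. The wrapper name get_patient_or_none is hypothetical and not part of PyHealth.

```python
from typing import Any, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Return the patient for patient_id, or None if the dataset rejects the ID.

    Hypothetical convenience wrapper; `dataset` is any object exposing
    get_patient(patient_id) as documented above.
    """
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # The documented failure mode for an unknown patient ID.
        return None
```

A call such as get_patient_or_none(dataset, "some_id") then yields None for an unknown ID instead of raising.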
- property global_event_df: LazyFrame
Returns the path to the cached event dataframe.
- Returns:
The path to the cached event dataframe.
- Return type:
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
    {task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
        task_df.ld/                 # Intermediate task dataframe based on schema
        samples_{proc_uuid}.ld/     # Final processed samples after applying processors
            schema.pkl              # Saved SampleBuilder schema
            *.bin                   # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
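The cache layout described for set_task keys each task directory on the task name, schema, and arguments. One way such a deterministic key could be derived is sketched below. This is an assumption for illustration, not PyHealth's actual scheme: the function task_cache_dirname, the JSON serialisation, and the use of uuid5 are all hypothetical.

```python
import json
import uuid


def task_cache_dirname(task_name: str, schema: dict, args: dict) -> str:
    """Build a '{task_name}_{task_uuid}' directory name from the task definition.

    Hypothetical sketch: identical (name, schema, args) always map to the
    same directory, and any change produces a different one.
    """
    # sort_keys makes the serialisation independent of dict insertion order.
    payload = json.dumps(
        {"name": task_name, "schema": schema, "args": args},
        sort_keys=True,
    )
    # uuid5 is a name-based (SHA-1) UUID, so it is stable across runs.
    task_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, payload)
    return f"{task_name}_{task_uuid}"
```

With a key of this kind, re-running the same task reuses the cached directory, and changing a schema entry or argument results in a fresh one.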