pyhealth.datasets.ChestXray14Dataset
The NIH ChestX-ray14 dataset. For more information see here. Note that the copy of this dataset on Kaggle is stale, as corrections have been made to the metadata (see here).
- class pyhealth.datasets.ChestXray14Dataset(root='.', config_path='/home/docs/checkouts/readthedocs.org/user_builds/pyhealth/envs/latest/lib/python3.12/site-packages/pyhealth/datasets/configs/chestxray14.yaml', download=False, partial=False, **kwargs)
Bases: BaseDataset
Dataset class for the NIH ChestX-ray14 dataset.
- classes: List[str] = ['atelectasis', 'cardiomegaly', 'consolidation', 'edema', 'effusion', 'emphysema', 'fibrosis', 'hernia', 'infiltration', 'mass', 'nodule', 'pleural_thickening', 'pneumonia', 'pneumothorax']
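To illustrate how a fixed class list like this can back a multilabel target, here is a minimal sketch that turns a label string into a 0/1 vector ordered like `classes`. It assumes the raw metadata stores findings as a pipe-separated string such as "Cardiomegaly|Effusion"; `to_multi_hot` is a hypothetical helper written for this example, not part of PyHealth, and the library's own task may encode labels differently.

```python
from typing import List

# Copied from ChestXray14Dataset.classes as documented above.
CLASSES: List[str] = [
    "atelectasis", "cardiomegaly", "consolidation", "edema", "effusion",
    "emphysema", "fibrosis", "hernia", "infiltration", "mass", "nodule",
    "pleural_thickening", "pneumonia", "pneumothorax",
]


def to_multi_hot(finding_labels: str, classes: List[str] = CLASSES) -> List[int]:
    """Hypothetical helper: encode a pipe-separated label string as a 0/1 vector."""
    # Normalise case so that e.g. "Pleural_Thickening" matches "pleural_thickening".
    present = {label.strip().lower() for label in finding_labels.split("|")}
    # Labels that are not in `classes` (such as "No Finding") are ignored.
    return [1 if name in present else 0 for name in classes]


vector = to_multi_hot("Cardiomegaly|Effusion")
print(vector)
```

An image with no listed finding maps to an all-zero vector under this scheme.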
- property default_task: ChestXray14MultilabelClassification
Returns the default task for this dataset.
- Returns:
The default classification task.
- Return type:
ChestXray14MultilabelClassification
- Example:
>>> dataset = ChestXray14Dataset()
>>> task = dataset.default_task
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files
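The layout above can be spelled out as concrete paths. The following is only a sketch: `cache_layout` is a hypothetical helper, the identifier values are placeholders (PyHealth derives the real ones internally), and it assumes that `schema.pkl` and the `*.bin` files sit inside the `samples_{proc_uuid}.ld/` directory.

```python
from pathlib import Path
from typing import Dict


def cache_layout(
    cache_root: Path, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Hypothetical helper: express the documented cache layout as paths."""
    # Top-level directory for one task configuration.
    task_dir = cache_root / f"{task_name}_{task_uuid}"
    # Directory for samples produced by one set of processors (assumed nesting).
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,
        "task_df": task_dir / "task_df.ld",
        "samples": samples_dir,
        "schema": samples_dir / "schema.pkl",
    }


# Placeholder identifiers, chosen only for illustration.
layout = cache_layout(
    Path("cache"), "ChestXray14MultilabelClassification", "1234", "abcd"
)
for name, path in layout.items():
    print(f"{name:9s} {path.as_posix()}")
```

Read this way, a change of task arguments would give a new top-level directory, while a change of processors would only give a new `samples_*.ld/` directory next to the existing `task_df.ld/`.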
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
Note
If no image processor is provided, a default grayscale ImageProcessor(mode='L') is injected. This is needed because the ChestX-ray14 dataset images do not all have the same number of channels, causing the default PyHealth image processor to fail.
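To see why forcing a single grayscale channel removes the mismatch described in the note, consider this standard-library sketch. It is not PyHealth's ImageProcessor: `to_luma` is a hypothetical helper, images are modelled as plain lists of pixel tuples, and the RGB weights are the ITU-R 601-2 luma weights that Pillow documents for its "L" mode (exact rounding may differ from Pillow).

```python
from typing import List, Sequence


def to_luma(pixel: Sequence[int]) -> int:
    """Hypothetical helper: reduce a 1-, 3- or 4-channel pixel to one 8-bit value."""
    if len(pixel) == 1:
        # Already single-channel grayscale.
        return pixel[0]
    if len(pixel) in (3, 4):
        # RGB or RGBA; any alpha channel is dropped.
        r, g, b = pixel[:3]
        # ITU-R 601-2 luma weights, computed with integer arithmetic.
        return (r * 299 + g * 587 + b * 114) // 1000
    raise ValueError(f"unsupported channel count: {len(pixel)}")


# The same two-pixel image stored with one channel and with four channels.
gray_image: List[Sequence[int]] = [(12,), (200,)]
rgba_image: List[Sequence[int]] = [(12, 12, 12, 255), (200, 200, 200, 255)]

normalized = [[to_luma(p) for p in image] for image in (gray_image, rgba_image)]
print(normalized)
```

After conversion both inputs have one value per pixel, so they can be stacked into tensors of the same shape.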
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
- property global_event_df: LazyFrame
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
LazyFrame
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.