pyhealth.datasets.ChestXray14Dataset

The NIH ChestX-ray14 dataset. For more information see here. Note that the copy of this dataset on Kaggle is stale, as corrections have been made to the metadata (see here).

class pyhealth.datasets.ChestXray14Dataset(root='.', config_path='/home/docs/checkouts/readthedocs.org/user_builds/pyhealth/envs/latest/lib/python3.12/site-packages/pyhealth/datasets/configs/chestxray14.yaml', download=False, partial=False, **kwargs)

Bases: BaseDataset

Dataset class for the NIH ChestX-ray14 dataset.

root

Root directory of the raw data.

Type:

str

dataset_name

Name of the dataset.

Type:

str

config_path

Path to the configuration file.

Type:

str

classes

List of diseases that appear in the dataset.

Type:

List[str]

classes: List[str] = ['atelectasis', 'cardiomegaly', 'consolidation', 'edema', 'effusion', 'emphysema', 'fibrosis', 'hernia', 'infiltration', 'mass', 'nodule', 'pleural_thickening', 'pneumonia', 'pneumothorax']
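
Because the default task is a multilabel classification, one common way to represent an image's findings is a 0/1 vector aligned with the `classes` list. The sketch below illustrates that idea only; `CLASSES` and `to_multi_hot` are hypothetical names, and this is not necessarily how PyHealth's own processors encode labels.

```python
from typing import Iterable, List

# Copied from the `classes` attribute documented above.
CLASSES: List[str] = [
    "atelectasis", "cardiomegaly", "consolidation", "edema", "effusion",
    "emphysema", "fibrosis", "hernia", "infiltration", "mass", "nodule",
    "pleural_thickening", "pneumonia", "pneumothorax",
]


def to_multi_hot(findings: Iterable[str]) -> List[int]:
    """Hypothetical helper: encode findings as a 0/1 vector aligned with CLASSES."""
    present = set(findings)
    unknown = present - set(CLASSES)
    if unknown:
        raise ValueError(f"Unknown class names: {sorted(unknown)}")
    return [1 if name in present else 0 for name in CLASSES]


vector = to_multi_hot(["cardiomegaly", "effusion"])
print(vector)  # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

An image with no findings maps to the all-zero vector under this encoding.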

property default_task: ChestXray14MultilabelClassification

Returns the default task for this dataset.

Returns:

The default classification task.

Return type:

ChestXray14MultilabelClassification

Example:

>>> dataset = ChestXray14Dataset()
>>> task = dataset.default_task

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files
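
The cache tree above can be read as a set of path templates. The following sketch builds those paths with `pathlib`, assuming a cache directory and placeholder identifiers; `cache_layout` is a hypothetical helper, and the real `task_uuid` and `proc_uuid` values are generated by PyHealth.

```python
from pathlib import Path
from typing import Dict


def cache_layout(cache_dir: Path, task_name: str, task_uuid: str, proc_uuid: str) -> Dict[str, Path]:
    """Hypothetical helper: build the paths named in the cache tree above."""
    task_dir = cache_dir / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                    # cached data for one task
        "task_df": task_dir / "task_df.ld",      # intermediate task dataframe
        "samples_dir": samples_dir,              # final processed samples
        "schema": samples_dir / "schema.pkl",    # saved SampleBuilder schema
    }


# Placeholder identifiers, chosen only for illustration.
layout = cache_layout(Path("cache"), "my_task", "0000", "1111")
for key, path in layout.items():
    print(f"{key}: {path.as_posix()}")
```
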

Parameters:

  • task (Optional[BaseTask]) – The task to set. Uses default task if None.

  • num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.

  • input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.

  • output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
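
A minimal usage sketch, assuming PyHealth is installed and that `root` points at a local copy of the raw data. `build_default_samples` is a hypothetical wrapper; the `download` and `partial` arguments from the constructor signature are left at their defaults because their behavior is not described on this page.

```python
def build_default_samples(root: str, num_workers: int = 1):
    """Sketch: load the dataset from `root` and build samples for its default task."""
    # Imported inside the function so this module can be imported without PyHealth.
    from pyhealth.datasets import ChestXray14Dataset

    dataset = ChestXray14Dataset(root=root)
    # task=None falls back to dataset.default_task, as documented above.
    return dataset.set_task(task=None, num_workers=num_workers)
```

Calling `build_default_samples("/path/to/chestxray14")` (a placeholder path) should return a `SampleDataset` according to the documentation above.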

Note

If no image processor is provided, a default grayscale ImageProcessor(mode='L') is injected. This is needed because the ChestX-ray14 dataset images do not all have the same number of channels, causing the default PyHealth image processor to fail.
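
To see why a single-channel mode removes the mismatch, the sketch below collapses pixels with 1, 3, or 4 channels to one gray value. It is a pure-Python illustration, not PyHealth's ImageProcessor; `to_luma` is a hypothetical name, the weights are the ITU-R 601-2 luma weights that Pillow documents for converting to mode "L", and dropping the alpha channel is an assumption of this sketch.

```python
from typing import Sequence


def to_luma(pixel: Sequence[int]) -> int:
    """Collapse a 1-, 3-, or 4-channel 8-bit pixel to a single gray value."""
    if len(pixel) == 1:          # already grayscale
        return pixel[0]
    if len(pixel) in (3, 4):     # RGB or RGBA; alpha is dropped in this sketch
        r, g, b = pixel[:3]
        # ITU-R 601-2 luma weights (299, 587, 114 per thousand).
        return (r * 299 + g * 587 + b * 114) // 1000
    raise ValueError(f"Unsupported channel count: {len(pixel)}")


# The same mid-gray pixel stored with three different channel layouts.
pixels = [(128,), (128, 128, 128), (128, 128, 128, 255)]
gray = [to_luma(p) for p in pixels]
print(gray)  # [128, 128, 128]
```

After the conversion every pixel has exactly one channel, so images that started with different layouts can be stacked into tensors of the same shape.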

clean_tmpdir()

Cleans up the temporary directory within the cache.

Return type:

None

create_tmpdir()

Creates and returns a new temporary directory within the cache.

Returns:

The path to the new temporary directory.

Return type:

Path

get_patient(patient_id)

Retrieves a Patient object for the given patient ID.

Parameters:

patient_id (str) – The ID of the patient to retrieve.

Returns:

The Patient object for the given ID.

Return type:

Patient

Raises:

AssertionError – If the patient ID is not found in the dataset.
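
Since an unknown ID raises AssertionError rather than returning a sentinel, callers that expect missing patients may want to wrap the lookup. A minimal sketch, assuming only the documented `get_patient` behavior; `try_get_patient` is a hypothetical helper that works with any object exposing that method.

```python
from typing import Any, Optional


def try_get_patient(dataset: Any, patient_id: str) -> Optional[Any]:
    """Hypothetical helper: return the Patient, or None if the ID is unknown."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented behavior: an unknown patient ID raises AssertionError.
        return None
```
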

property global_event_df: LazyFrame

Returns the cached event dataframe as a lazy frame.

Returns:

The cached event dataframe.

Return type:

LazyFrame

iter_patients(df=None)

Yields Patient objects for each unique patient in the dataset.

Yields:

Iterator[Patient] – An iterator over Patient objects.

Return type:

Iterator[Patient]
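
Because `iter_patients` returns an iterator, a few patients can be inspected without consuming the whole dataset. The sketch below uses `itertools.islice` for that; `first_patients` is a hypothetical helper and relies only on the documented `iter_patients()` call.

```python
from itertools import islice
from typing import Any, List


def first_patients(dataset: Any, n: int) -> List[Any]:
    """Hypothetical helper: collect at most n patients from iter_patients()."""
    # islice stops pulling from the iterator once n items have been taken.
    return list(islice(dataset.iter_patients(), n))
```
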

load_data()

Loads data from the specified tables.

Returns:

A concatenated lazy frame of all tables.

Return type:

dd.DataFrame

load_table(table_name)

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

stats()

Prints statistics about the dataset.

Return type:

None

property unique_patient_ids: List[str]

Returns a list of unique patient IDs.

Returns:

List of unique patient IDs.

Return type:

List[str]