pyhealth.datasets.COVID19CXRDataset
The COVID-19 Radiography chest X-ray dataset. Kaggle: https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database.
- class pyhealth.datasets.COVID19CXRDataset(root, dataset_name=None, config_path=None, cache_dir=None, num_workers=1, dev=False)
Bases: BaseDataset
Base image dataset for the COVID-19 Radiography Database.
Dataset is available at: https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database
- COVID-19 data:
2473 CXR images from the PadChest dataset [1]
183 CXR images from a German medical school [2]
559 CXR images from SIRM, GitHub, Kaggle, and Twitter [3, 4, 5, 6]
400 CXR images from another GitHub source [7]
- Normal images:
8851 from RSNA [8]
1341 from Kaggle [9]
- Lung opacity images:
6012 from the Radiological Society of North America (RSNA) CXR dataset [8]
- Viral Pneumonia images:
1345 from the Chest X-Ray Images (pneumonia) database [9]
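As a quick check on class balance, the per-source counts listed above can be summed per class. This is a minimal sketch: the class label strings are illustrative and not necessarily the labels PyHealth assigns, and the totals come only from the figures quoted here, not from counting files in the archive.

```python
# Per-source image counts, copied from the list above.
# The dictionary keys are illustrative label names (an assumption).
SOURCE_COUNTS = {
    "COVID": [2473, 183, 559, 400],
    "Normal": [8851, 1341],
    "Lung Opacity": [6012],
    "Viral Pneumonia": [1345],
}


def class_totals(source_counts):
    """Sum the per-source counts for each class."""
    return {label: sum(counts) for label, counts in source_counts.items()}


def class_fractions(totals):
    """Return each class's share of the overall image count."""
    n = sum(totals.values())
    return {label: count / n for label, count in totals.items()}


totals = class_totals(SOURCE_COUNTS)
fractions = class_fractions(totals)
for label, count in totals.items():
    print(f"{label:16s} {count:6d} {fractions[label]:.1%}")
print("total", sum(totals.values()))
```

The classes are clearly imbalanced (Normal is the largest, Viral Pneumonia the smallest), which may matter when choosing a loss weighting or a sampling strategy.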
If you use this dataset, please cite:
1. M.E.H. Chowdhury, T. Rahman, A. Khandakar, et al. “Can AI help in screening Viral and COVID-19 pneumonia?” IEEE Access, Vol. 8, 2020, pp. 132665-132676.
2. T. Rahman, A. Khandakar, Y. Qiblawey, et al. “Exploring the Effect of Image Enhancement Techniques on COVID-19 Detection using Chest X-ray Images.” arXiv preprint arXiv:2012.02238.
[1] https://bimcv.cipf.es/bimcv-projects/bimcv-covid19/
[2] https://github.com/ml-workgroup/covid-19-image-repository/tree/master/png
[3] https://sirm.org/category/senza-categoria/covid-19/
[4] https://eurorad.org
[5] https://github.com/ieee8023/covid-chestxray-dataset
[6] https://figshare.com/articles/COVID-19_Chest_X-Ray_Image_Repository/12580328
[7] https://github.com/armiro/COVID-CXNet
[8] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data
[9] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
- Parameters:
root (str) – Root directory of the raw data containing the dataset files.
dataset_name (Optional[str]) – Optional name of the dataset. Defaults to “covid19_cxr”.
config_path (Optional[str]) – Optional path to the configuration file. If not provided, uses the default config in the configs directory.
cache_dir (Optional[str]) – Optional directory for caching processed data.
num_workers (int) – Number of parallel workers for data processing. Defaults to 1.
dev (bool) – If True, only loads a small subset of data for development/testing.
- root
Root directory of the raw data.
- dataset_name
Name of the dataset.
- config_path
Path to the configuration file.
Examples
>>> from pyhealth.datasets import COVID19CXRDataset
>>> dataset = COVID19CXRDataset(
...     root="/path/to/covid19_cxr"
... )
>>> dataset.stats()
>>> samples = dataset.set_task()
>>> print(samples[0])
- prepare_metadata(root)
Prepare metadata for the COVID-19 CXR dataset.
- Parameters:
root (str) – Root directory containing the dataset files.
This method:
1. Reads metadata from Excel files for each class
2. Processes file paths and labels
3. Combines all data into a single DataFrame
4. Saves the processed metadata to a CSV file
- Return type:
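The general shape of this kind of metadata step can be sketched with the standard library alone. This is not the library's implementation: instead of reading the per-class Excel files, it derives path/label pairs from an assumed directory layout (`<root>/<class>/images/*.png`), and the class folder names, the `build_metadata` helper, and the `metadata.csv` file name are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Assumed class folder names; check them against your own download.
CLASS_DIRS = ("COVID", "Lung_Opacity", "Normal", "Viral Pneumonia")


def build_metadata(root, out_name="metadata.csv"):
    """Collect (path, label) rows per class and write them to one CSV file."""
    root = Path(root)
    rows = []
    for label in CLASS_DIRS:
        image_dir = root / label / "images"
        if not image_dir.is_dir():
            continue  # skip classes that are missing from this copy
        for path in sorted(image_dir.glob("*.png")):
            rows.append({"path": str(path), "label": label})
    out_path = root / out_name
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Small demonstration on a throwaway directory with two empty image files.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("COVID", "Normal"):
        image_dir = Path(tmp) / label / "images"
        image_dir.mkdir(parents=True)
        (image_dir / f"{label}-1.png").touch()
    csv_path = build_metadata(tmp)
    with csv_path.open(newline="") as f:
        demo_rows = list(csv.DictReader(f))

print(demo_rows)
```

Writing a single combined table once means later loading steps only need to read one file rather than re-scanning every class.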
- property default_task: COVID19CXRClassification
Returns the default task for this dataset.
- Returns:
The default classification task.
- Return type:
COVID19CXRClassification
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
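Because a missing ID raises AssertionError rather than returning None, callers that look up IDs from an external list may want a small wrapper. The sketch below assumes only the contract documented above; `StubDataset` and `get_patient_or_none` are hypothetical names used for illustration, and the stub stands in for a real dataset so the example runs without any data.

```python
class StubDataset:
    """Hypothetical stand-in that mimics only get_patient's documented contract."""

    def __init__(self, patients):
        self._patients = patients

    def get_patient(self, patient_id):
        # Mirror the documented behavior: unknown IDs raise AssertionError.
        assert patient_id in self._patients, f"Patient {patient_id} not found"
        return self._patients[patient_id]


def get_patient_or_none(dataset, patient_id):
    """Return the patient for patient_id, or None if the dataset rejects the ID."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        return None


stub = StubDataset({"p1": {"id": "p1"}})
print(get_patient_or_none(stub, "p1"))
print(get_patient_or_none(stub, "missing"))
```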
- property global_event_df: LazyFrame
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe, loaded lazily.
- Return type:
LazyFrame
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/         # Cached data for specific task based on task name, schema, and args
    task_df.ld/                  # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/      # Final processed samples after applying processors
        schema.pkl               # Saved SampleBuilder schema
        *.bin                    # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
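The cache layout keys each task directory on the task name, schema, and args. How PyHealth derives `task_uuid` is not stated here; the sketch below shows one assumed way to get a deterministic identifier with the same property, so that identical task definitions map to the same directory and any change produces a new one. The `task_cache_dir` helper and the example schema values are hypothetical.

```python
import json
import uuid


def task_cache_dir(task_name, schema, args):
    """Build a '{task_name}_{task_uuid}' directory name from the task definition.

    Assumption: the identifier is a UUIDv5 over a canonical JSON dump.
    The real library may compute it differently.
    """
    payload = json.dumps(
        {"name": task_name, "schema": schema, "args": args},
        sort_keys=True,  # key order must not affect the identifier
        default=str,     # fall back to str() for non-JSON values
    )
    task_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, payload)
    return f"{task_name}_{task_uuid}"


# Illustrative schema and args; these values are made up for the example.
schema = {"input": {"image": "image"}, "output": {"disease": "multiclass"}}
dir_a = task_cache_dir("COVID19CXRClassification", schema, {})
dir_b = task_cache_dir("COVID19CXRClassification", schema, {})
dir_c = task_cache_dir("COVID19CXRClassification", schema, {"resize": 224})
print(dir_a)
print(dir_a == dir_b, dir_a == dir_c)
```

Keying the directory on the full task definition lets repeated `set_task` calls with the same task reuse cached samples, while a changed argument avoids reading stale data.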