pyhealth.datasets.BMDHSDataset
The BUET Multi-disease Heart Sound (BMD-HS) dataset contains patient-level multi-label annotations for common valvular conditions and up to eight phonocardiogram (PCG) recordings per patient.
Refer to doc for more information.
- class pyhealth.datasets.BMDHSDataset(root, dataset_name=None, config_path=None, recordings_path=None, **kwargs)
Bases: BaseDataset
BUET Multi-disease Heart Sound (BMD-HS) Dataset
Repository (current main branch): https://github.com/sani002/BMD-HS-Dataset
BMD-HS is a curated collection of phonocardiogram (PCG, i.e. heart-sound) recordings designed for automated cardiovascular disease research. It includes multi-label annotations for common valvular conditions: Aortic Stenosis (AS), Aortic Regurgitation (AR), Mitral Regurgitation (MR), Mitral Stenosis (MS), a Multi-Disease (MD) label for co-existing conditions, and Normal (N), along with accompanying patient-level metadata. The dataset also provides a training CSV that maps patient IDs to labels, and up to eight 20-second recordings per patient captured at different auscultation positions.
If you use this dataset, please cite: Ali, S. N., Zahin, A., Shuvo, S. B., Nizam, N. B., Nuhash, S. I. S. K., Razin, S. S., Sani, S. M. S., Rahman, F., Nizam, N. B., Azam, F. B., Hossen, R., Ohab, S., Noor, N., & Hasan, T. (2024). BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems. arXiv:2409.00724. https://arxiv.org/abs/2409.00724
- Parameters:
root – Root directory containing the dataset files.
dataset_name – Name of the dataset.
config_path – Path to configuration file.
Expected files and structure under root:
- train/ # .wav audio files (20 s, ~4 kHz), up to 8 per patient (one per auscultation position)
- train.csv # labels and recording filenames per patient
    patient_id
    AS, AR, MR, MS # 0 = absent, 1 = present
    N # 0 = disease, 1 = normal (healthy indicator)
    recording_1 … recording_8 # filenames for position-wise recordings
- additional_metadata.csv
    patient_id, Age, Gender (M/F), Smoker (0/1), Lives (U/F)
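As a rough illustration of the 0/1 label columns in train.csv, the sketch below turns a row into a list of the conditions marked present. The helper name `present_conditions` and the sample rows are made up for this example; it is not part of PyHealth and makes no claim about how the loader or the default task encodes labels.

```python
# Hypothetical helper (not a PyHealth API): decode the 0/1 condition columns.
CONDITION_COLUMNS = ("AS", "AR", "MR", "MS")


def present_conditions(row):
    """Return the conditions whose column value is 1 (present) in a row."""
    return [name for name in CONDITION_COLUMNS if int(row[name]) == 1]


# Made-up rows in the documented column layout (not real patient data).
# Values are strings because csv.DictReader would yield strings.
rows = [
    {"patient_id": "p001", "AS": "1", "AR": "0", "MR": "1", "MS": "0", "N": "0"},
    {"patient_id": "p002", "AS": "0", "AR": "0", "MR": "0", "MS": "0", "N": "1"},
]

decoded = {r["patient_id"]: present_conditions(r) for r in rows}
print(decoded)  # {'p001': ['AS', 'MR'], 'p002': []}
```

A row with N = 1 decodes to an empty list here, since N is a separate healthy indicator rather than one of the four condition columns.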
Example
>>> from pyhealth.datasets import BMDHSDataset
>>> dataset = BMDHSDataset(root=".../BMD-HS-Dataset/")
>>> dataset.stats()
Note
This loader assumes the repository’s current layout (train/, train.csv, additional_metadata.csv) and multi-label schema as described above. Set root to the repository directory that includes these files and folders.
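Because the loader depends on this layout, it can help to check the root directory before constructing the dataset. The following is a minimal sketch using only the standard library; `missing_entries` is a hypothetical helper written for this example, not something PyHealth provides.

```python
import tempfile
from pathlib import Path

# Entry names taken from the layout described in the note above.
EXPECTED_ENTRIES = ("train", "train.csv", "additional_metadata.csv")


def missing_entries(root):
    """Return the expected entries that do not exist under root.

    Hypothetical helper for illustration; not part of PyHealth.
    """
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Demo on a throwaway directory that lacks additional_metadata.csv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    (Path(tmp) / "train.csv").touch()
    demo_missing = missing_entries(tmp)

print(demo_missing)  # ['additional_metadata.csv']
```

An empty list means all three entries were found; the check says nothing about the contents of the files.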
- preprocess_recordings(df)
Preprocess the recordings table by prepending the recordings_path to recording filenames.
- Return type:
LazyFrame
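The documented method operates on a lazy frame. To show the idea without depending on a dataframe library, here is a plain-Python analogue that joins each non-empty recording filename onto a directory. The helper `prepend_recordings_path` and the sample row are assumptions made for this sketch and do not reproduce the library's implementation.

```python
import os

# Column names follow the recording_1 … recording_8 convention of train.csv.
RECORDING_COLUMNS = [f"recording_{i}" for i in range(1, 9)]


def prepend_recordings_path(row, recordings_path):
    """Return a copy of row with recording filenames joined onto recordings_path.

    Hypothetical stand-in for the dataframe operation; not PyHealth code.
    """
    out = dict(row)  # copy so the caller's row is left unchanged
    for col in RECORDING_COLUMNS:
        filename = out.get(col)
        if filename:  # skip columns that are missing or empty
            out[col] = os.path.join(recordings_path, filename)
    return out


# Made-up row: one recording present, one empty slot.
row = {"patient_id": "p001", "recording_1": "a.wav", "recording_2": ""}
resolved = prepend_recordings_path(row, "data/train")
print(resolved["recording_1"])
```

Empty slots are left untouched so that patients with fewer than eight recordings do not end up with paths that point at the directory itself.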
- property default_task: BMDHSDiseaseClassification
Returns the default task for the BMD-HS dataset: BMDHSDiseaseClassification.
- Returns:
The default task instance.
- Return type:
BMDHSDiseaseClassification
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
- property global_event_df: LazyFrame
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
LazyFrame
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated Dask dataframe of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/ # Cached data for specific task based on task name, schema, and args
    task_df.ld/ # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/ # Final processed samples after applying processors
        schema.pkl # Saved SampleBuilder schema
        *.bin # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
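To make the cache layout described for set_task concrete, the sketch below composes the corresponding paths from their parts. It assumes that schema.pkl and the *.bin files sit inside the samples_{proc_uuid}.ld/ directory, which the flattened listing suggests but does not state outright. The function `task_cache_paths`, its arguments, and the example identifiers are all hypothetical; how PyHealth derives the UUIDs is not covered here.

```python
from pathlib import Path


def task_cache_paths(cache_root, task_name, task_uuid, proc_uuid):
    """Compose the cache paths for one task (hypothetical helper, not PyHealth).

    Assumes schema.pkl lives inside the samples_{proc_uuid}.ld/ directory.
    """
    task_dir = Path(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,
        "task_df": task_dir / "task_df.ld",
        "samples_dir": samples_dir,
        "schema": samples_dir / "schema.pkl",
    }


# Made-up identifiers, used only to show how the pieces combine.
paths = task_cache_paths("cache", "demo_task", "abc", "123")
print(paths["schema"].as_posix())  # cache/demo_task_abc/samples_123.ld/schema.pkl
```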