pyhealth.datasets.DREAMTDataset

The Dataset for Real-time sleep stage EstimAtion using Multisensor wearable Technology (DREAMT) includes wrist-based wearable and polysomnography (PSG) sleep data from 100 participants recruited from the Duke University Health System (DUHS) Sleep Disorder Lab.

This includes wearable signals, PSG signals, sleep labels, and clinical data related to sleep health and disorders.

The DREAMTDataset class provides an interface for loading and working with the DREAMT dataset. It can process DREAMT data across versions into a well-structured dataset object providing support for modeling and analysis.

Refer to the DREAMT documentation on PhysioNet (https://physionet.org/content/dreamt/) for more information about the dataset.

class pyhealth.datasets.DREAMTDataset(root, dataset_name=None, config_path=None)

Bases: BaseDataset

Base Dataset for Real-time sleep stage EstimAtion using Multisensor wearable Technology (DREAMT)

Dataset accepts current versions of DREAMT (1.0.0, 1.0.1, 2.0.0, 2.1.0), available at: https://physionet.org/content/dreamt/

DREAMT includes wrist-based wearable and polysomnography (PSG) sleep data from 100 participants recruited from the Duke University Health System (DUHS) Sleep Disorder Lab. This includes wearable signals, PSG signals, sleep labels, and clinical data related to sleep health and disorders.

When using this dataset, please cite:

Wang, K., Yang, J., Shetty, A., & Dunn, J. (2025). DREAMT: Dataset for Real-time sleep stage EstimAtion using Multisensor wearable Technology (version 2.1.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/7r9r-7r24

Will Ke Wang, Jiamu Yang, Leeor Hershkovich, Hayoung Jeong, Bill Chen, Karnika Singh, Ali R Roghanizad, Md Mobashir Hasan Shandhi, Andrew R Spector, Jessilyn Dunn. (2024). Proceedings of the fifth Conference on Health, Inference, and Learning, PMLR 248:380-396.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

The dataset follows the file and folder structure of the downloaded DREAMT version and looks for participant_info.csv and the data folders, so the root path should point to the downloaded version directory, for example root = “…/dreamt/1.0.0/” or “…/dreamt/2.0.0/”.
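As a rough illustration of the layout requirement above, the following sketch checks a candidate root directory before it is passed to DREAMTDataset. The helper name check_dreamt_root is hypothetical and not part of PyHealth, and the rule that data folders have names starting with "data" is an assumption made for this example only.

```python
from pathlib import Path
from typing import List


def check_dreamt_root(root: str) -> List[str]:
    """Return a list of problems found in a candidate DREAMT root directory.

    Hypothetical helper for illustration; PyHealth may validate differently.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return [f"root is not a directory: {root_path}"]

    problems = []
    # The class documentation states that participant_info.csv is expected in root.
    if not (root_path / "participant_info.csv").is_file():
        problems.append("participant_info.csv not found in root")

    # Assumption: signal data lives in sub-folders whose names start with "data".
    has_data_folder = any(
        p.is_dir() and p.name.startswith("data") for p in root_path.iterdir()
    )
    if not has_data_folder:
        problems.append("no data folders found in root")
    return problems
```

An empty list suggests the directory looks like a version folder; any returned messages point at what is missing.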

Parameters:

root (str) – root directory containing the dataset files
dataset_name (Optional[str]) – optional name of dataset, defaults to “dreamt_sleep”
config_path (Optional[str]) – optional configuration file, defaults to “dreamt.yaml”

root: root directory containing the dataset files

dataset_name: name of dataset

config_path: path to configuration file

Examples

>>> from pyhealth.datasets import DREAMTDataset
>>> dataset = DREAMTDataset(root = "/path/to/dreamt/data/version")
>>> dataset.stats()
>>>
>>> # Get all patient ids
>>> unique_patients = dataset.unique_patient_ids
>>> print(f"There are {len(unique_patients)} patients")
>>>
>>> # Get single patient data
>>> patient = dataset.get_patient("S002")
>>> print(f"Patient has {len(patient.data_source)} event")
>>>
>>> # Get event
>>> event = patient.get_events(event_type="dreamt_sleep")
>>>
>>> # Get Apnea-Hypopnea Index (AHI)
>>> ahi = event[0].ahi
>>> print(f"AHI is {ahi}")
>>>
>>> # Get 64Hz sleep file path
>>> file_path = event[0].file_64hz
>>> print(f"64Hz sleep file path: {file_path}")

get_patient_file(patient_id, root, file_path)

Returns the file path of the 64Hz or 100Hz data for a patient, or None if no file is found.

Parameters:

patient_id (str) – patient identifier
root (str) – root directory containing the dataset files
file_path (str) – path to location of 64Hz or 100Hz file

Returns:

path to file location or None if no file found

Return type:

file

prepare_metadata(root)

Prepares the metadata CSV file for the DREAMT dataset by performing the following:

1. Obtain clinical data from the participant_info.csv file
2. Process file paths based on patients found in clinical data
3. Organize all data into a single DataFrame
4. Save the processed DataFrame to a CSV file

Parameters: root (str) – root directory containing the dataset files
Return type: None
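The four steps listed for prepare_metadata can be sketched with the standard library alone. This is not the PyHealth implementation: the function names, the SID column, the data_64Hz and data_100Hz folder names, the file_100hz column, and the output file name are all assumptions chosen for illustration (only file_64hz appears in the class example).

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def find_patient_file(folder: Path, patient_id: str) -> Optional[str]:
    """Return the first file in `folder` whose name starts with the patient ID, or None."""
    if not folder.is_dir():
        return None
    matches = sorted(folder.glob(f"{patient_id}_*"))
    return str(matches[0]) if matches else None


def build_metadata(root: str, out_name: str = "metadata_sketch.csv") -> List[Dict[str, str]]:
    """Illustrative version of the four metadata steps (names are assumptions)."""
    root_path = Path(root)

    # Step 1: obtain clinical data from participant_info.csv.
    with open(root_path / "participant_info.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Step 2: resolve file paths for each patient found in the clinical data.
    for row in rows:
        pid = row["SID"]  # assumed name of the participant ID column
        row["file_64hz"] = find_patient_file(root_path / "data_64Hz", pid) or ""
        row["file_100hz"] = find_patient_file(root_path / "data_100Hz", pid) or ""

    # Steps 3 and 4: the rows now hold all data in one table; save it as CSV.
    if rows:
        with open(root_path / out_name, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Patients without a matching signal file get an empty path here, which mirrors the documented behavior of get_patient_file returning None when no file is found.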

clean_tmpdir()

Cleans up the temporary directory within the cache.

Return type: None

create_tmpdir()

Creates and returns a new temporary directory within the cache.

Returns: The path to the new temporary directory.
Return type: Path

property default_task: Optional[BaseTask]

Returns the default task for the dataset.

Returns: The default task, if any.
Return type: Optional[BaseTask]

get_patient(patient_id)

Retrieves a Patient object for the given patient ID.

Parameters: patient_id (str) – The ID of the patient to retrieve.
Returns: The Patient object for the given ID.
Return type: Patient
Raises: AssertionError – If the patient ID is not found in the dataset.
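Because get_patient is documented to raise AssertionError for an unknown ID, callers that look up IDs from an external list may want to guard the call. The wrapper below is a minimal sketch; get_patient_or_none is a hypothetical helper, not a PyHealth function, and it accepts any object that exposes a get_patient method.

```python
from typing import Any, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Return dataset.get_patient(patient_id), or None if the ID is not found.

    Relies only on the documented behavior that get_patient raises
    AssertionError for an unknown patient ID.
    """
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        return None
```

A call such as get_patient_or_none(dataset, "S002") then yields either a Patient object or None, so a missing ID can be handled without interrupting a loop over many IDs.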

property global_event_df: LazyFrame

Returns the cached event dataframe as a lazy frame.

Returns: The cached event dataframe.
Return type: LazyFrame

iter_patients(df=None)

Yields Patient objects for each unique patient in the dataset.

Yields: Iterator[Patient] – An iterator over Patient objects.
Return type: Iterator[Patient]

load_data()

Loads data from the specified tables.

Returns: A concatenated lazy frame of all tables.
Return type: dd.DataFrame

load_table(table_name)

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:

task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
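The cache layout described for set_task can be expressed as a small path-building sketch. The function task_cache_paths and its arguments are hypothetical; in practice the task and processor UUIDs are generated by the library, and the placeholder values here only show how the folder names are composed.

```python
from pathlib import Path
from typing import Dict


def task_cache_paths(
    cache_root: str, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Compose the cache paths named in the set_task description.

    Illustrative only: PyHealth derives the UUIDs internally.
    """
    task_dir = Path(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                  # cached data for one task
        "task_df": task_dir / "task_df.ld",    # intermediate task dataframe
        "samples": samples_dir,                # final processed samples
        "schema": samples_dir / "schema.pkl",  # saved SampleBuilder schema
    }
```

Since the processed samples sit in a sub-folder keyed by the processor UUID, one task directory can hold several sample sets that share the same intermediate task dataframe.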

stats()

Prints statistics about the dataset.

Return type: None

property unique_patient_ids: List[str]

Returns a list of unique patient IDs.

Returns: List of unique patient IDs.
Return type: List[str]