pyhealth.datasets.SleepEDFDataset#

The open Sleep-EDF Database Expanded dataset. Refer to the doc for more information.

class pyhealth.datasets.SleepEDFDataset(root, dataset_name=None, config_path=None, subset='cassette')[source]#

Bases: BaseDataset

Base EEG dataset for SleepEDF

Dataset is available at https://www.physionet.org/content/sleep-edfx/1.0.0/

For the Sleep Cassette Study portion:
  • The 153 SC* files (SC = Sleep Cassette) were obtained in a 1987-1991 study of age effects on sleep in healthy Caucasians aged 25-101, without any sleep-related medication [2]. Two PSGs of about 20 hours each were recorded during two subsequent day-night periods at the subjects' homes. Subjects continued their normal activities but wore a modified Walkman-like cassette-tape recorder described in chapter VI.4 (page 92) of Bob’s 1987 thesis.

  • Files are named in the form SC4ssNEO-PSG.edf where ss is the subject number, and N is the night. The first nights of subjects 36 and 52, and the second night of subject 13, were lost due to a failing cassette or laserdisk.

  • The EOG and EEG signals were each sampled at 100 Hz. The submental-EMG signal was electronically high-pass filtered, rectified, and low-pass filtered, after which the resulting EMG envelope, expressed in uV rms (root-mean-square), was sampled at 1 Hz. Oro-nasal airflow, rectal body temperature, and the event marker were also sampled at 1 Hz.

  • Subjects and recordings are further described in the file headers and in the descriptive spreadsheet SC-subjects.xls.
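As a rough illustration of what the sampling rates above imply, the sketch below computes how many samples a cassette recording would hold per channel. The 20-hour duration is the approximate figure quoted above, not an exact file length, and `expected_samples` is a hypothetical helper, not part of PyHealth.

```python
def expected_samples(duration_hours: float, rate_hz: float) -> int:
    """Number of samples in a recording of the given length and sampling rate."""
    return round(duration_hours * 3600 * rate_hz)


# EEG and EOG channels are sampled at 100 Hz.
eeg_samples = expected_samples(20, 100)

# EMG envelope, airflow, temperature, and event marker are sampled at 1 Hz.
slow_samples = expected_samples(20, 1)
```

The two rates differ by a factor of 100, so channels from the same file cannot be stacked into one array without resampling.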

For the Sleep Telemetry portion:
  • The 44 ST* files (ST = Sleep Telemetry) were obtained in a 1994 study of temazepam effects on sleep in 22 Caucasian males and females without other medication. Subjects had mild difficulty falling asleep but were otherwise healthy. The PSGs of about 9 hours were recorded in the hospital during two nights, one of which was after temazepam intake, and the other of which was after placebo intake. Subjects wore a miniature telemetry system with very good signal quality.

  • Files are named in the form ST7ssNJ0-PSG.edf where ss is the subject number, and N is the night.

  • EOG, EMG and EEG signals were sampled at 100 Hz, and the event marker at 1 Hz. The physical marker dimension ID+M-E relates to the fact that pressing the marker (M) button generated two-second deflections from a baseline value that either identifies the telemetry unit (ID = 1 or 2 if positive) or marks an error (E) in the telemetry link if negative. Subjects and recordings are further described in the file headers and in the descriptive spreadsheet ST-subjects.xls.
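The file-naming schemes described for both portions can be decoded with a small regular expression. This is a minimal sketch, not PyHealth's own parsing code: `parse_psg_name` and `RecordingName` are hypothetical names, and the two characters after the night digit are matched loosely (one letter followed by a letter or digit) as an assumption.

```python
import re
from typing import NamedTuple, Optional


class RecordingName(NamedTuple):
    study: str    # "SC" (Sleep Cassette) or "ST" (Sleep Telemetry)
    subject: int  # the two-digit subject number "ss"
    night: int    # the night digit "N"


# Prefix (SC4 or ST7), two-digit subject, one-digit night, two trailing characters.
_PSG_PATTERN = re.compile(r"^(SC4|ST7)(\d{2})(\d)[A-Z][A-Z0-9]-PSG\.edf$")


def parse_psg_name(filename: str) -> Optional[RecordingName]:
    """Return the decoded parts of a PSG file name, or None if it does not match."""
    match = _PSG_PATTERN.match(filename)
    if match is None:
        return None
    prefix, subject, night = match.groups()
    return RecordingName(prefix[:2], int(subject), int(night))
```

For example, `parse_psg_name("ST7011J0-PSG.edf")` yields study `"ST"`, subject 1, night 1, while an unrelated file name yields `None`.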

Parameters:
  • root (str) – Root directory of the raw data. This can be the path to either the Cassette portion or the Telemetry portion.

  • dataset_name (Optional[str]) – Name of the dataset. Default is None.

  • config_path (Optional[str]) – Path to the config file. Default is None.

  • subset (Optional[str]) – Subset of the SleepEDF dataset, either “cassette” or “telemetry”. Default is “cassette”.
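Because `subset` accepts only two values, it can help to validate it before constructing the dataset. The sketch below assumes nothing beyond the constructor signature shown above; `sleepedf_kwargs` and `load_sleepedf` are hypothetical helpers, and the import is deferred so that the validation can run without PyHealth installed.

```python
from typing import Any, Dict

# The two values listed in the parameter description above.
VALID_SUBSETS = ("cassette", "telemetry")


def sleepedf_kwargs(root: str, subset: str = "cassette") -> Dict[str, Any]:
    """Validate the arguments and return them as keyword arguments."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {VALID_SUBSETS}, got {subset!r}")
    return {"root": root, "subset": subset}


def load_sleepedf(root: str, subset: str = "cassette"):
    """Build a SleepEDFDataset after validating the arguments."""
    # Imported lazily: requires PyHealth and the raw data on disk.
    from pyhealth.datasets import SleepEDFDataset

    return SleepEDFDataset(**sleepedf_kwargs(root, subset))
```

A call such as `load_sleepedf("/path/to/telemetry/files", subset="telemetry")` would then select the telemetry portion; the path here is a placeholder.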

task#

Optional[str], name of the task (e.g., “sleep staging”). Default is None.

samples#

Optional[List[Dict]], a list of samples. Each sample is a dict with patient_id, record_id, and other task-specific attributes as keys. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
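To make the shape of these index attributes concrete, the sketch below builds a mapping from an ID to sample positions out of a list of sample dicts. It illustrates the data structure only and is not PyHealth's implementation; `build_index` and the toy samples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(samples: List[dict], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of the samples carrying it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


# Toy samples with the two keys named in the `samples` description.
samples = [
    {"patient_id": "p1", "record_id": "r1"},
    {"patient_id": "p1", "record_id": "r2"},
    {"patient_id": "p2", "record_id": "r3"},
]

patient_to_index = build_index(samples, "patient_id")
```

Here `patient_to_index["p1"]` lists positions 0 and 1, so all samples of one patient can be selected together, for instance when splitting data by patient.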

Examples

>>> from pyhealth.datasets import SleepEDFDataset
>>> dataset = SleepEDFDataset(
...         root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/",
...     )
>>> dataset.stats()
>>> dataset.info()

prepare_metadata_cassette(root)[source]#

Prepare metadata for the SleepEDF cassette subset.

Parameters:

root (str) – Root directory containing the dataset files.

This method processes the raw cassette metadata files and saves a processed CSV file.

Return type:

None

prepare_metadata_telemetry(root)[source]#

Prepare metadata for the SleepEDF telemetry subset.

Parameters:

root (str) – Root directory containing the dataset files.

This method processes the raw telemetry metadata files and saves a processed CSV file.

Return type:

None

property default_task: SleepStagingSleepEDF#

Returns the default task for this dataset.

Returns:

The default task instance.

Return type:

SleepStagingSleepEDF

clean_tmpdir()#

Cleans up the temporary directory within the cache.

Return type:

None

create_tmpdir()#

Creates and returns a new temporary directory within the cache.

Returns:

The path to the new temporary directory.

Return type:

Path
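One way to pair `create_tmpdir()` with `clean_tmpdir()` is a context manager, so that cleanup runs even if the work in between fails. This is a sketch under the assumption that `clean_tmpdir()` removes what `create_tmpdir()` created; `scratch_dir` is a hypothetical helper, not a PyHealth function.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def scratch_dir(dataset: Any) -> Iterator[Any]:
    """Yield a fresh temporary directory and clean it up afterwards."""
    path = dataset.create_tmpdir()
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises.
        dataset.clean_tmpdir()
```

Usage would look like `with scratch_dir(dataset) as tmp: ...`, with intermediate files written under `tmp`.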

get_patient(patient_id)#

Retrieves a Patient object for the given patient ID.

Parameters:

patient_id (str) – The ID of the patient to retrieve.

Returns:

The Patient object for the given ID.

Return type:

Patient

Raises:

AssertionError – If the patient ID is not found in the dataset.
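Since `get_patient` raises an AssertionError for an unknown ID, callers that prefer a `None` result can check `unique_patient_ids` first. The sketch below relies only on the two members documented on this page; `get_patient_or_none` is a hypothetical wrapper.

```python
from typing import Any, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Return the Patient for `patient_id`, or None if the ID is not in the dataset."""
    if patient_id not in dataset.unique_patient_ids:
        return None
    return dataset.get_patient(patient_id)
```

Checking membership up front avoids catching AssertionError, which could otherwise mask unrelated assertion failures.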

property global_event_df: LazyFrame#

Returns the cached event dataframe.

Returns:

The cached event dataframe, as a lazy frame.

Return type:

LazyFrame

iter_patients(df=None)#

Yields Patient objects for each unique patient in the dataset.

Yields:

Iterator[Patient] – An iterator over Patient objects.

Return type:

Iterator[Patient]

load_data()#

Loads data from the specified tables.

Returns:

A concatenated lazy frame of all tables.

Return type:

dd.DataFrame

load_table(table_name)#

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

Raises:

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:
  • task (Optional[BaseTask]) – The task to set. Uses default task if None.

  • num_workers (int) – Number of workers for multi-threading. If None, self.num_workers is used.

  • input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.

  • output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
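The cache layout described for `set_task` can be expressed as paths. The sketch below only mirrors the directory tree shown in the description; how PyHealth derives `task_uuid` and `proc_uuid` is not covered here, and `task_cache_layout` and its arguments are hypothetical.

```python
from pathlib import Path
from typing import Dict


def task_cache_layout(
    cache_root: Path, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Return the paths that make up the cache tree for one task."""
    task_dir = cache_root / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                  # cached data for this task
        "task_df": task_dir / "task_df.ld",    # intermediate task dataframe
        "samples_dir": samples_dir,            # final processed samples
        "schema": samples_dir / "schema.pkl",  # saved SampleBuilder schema
    }
```

Because the processed samples sit under the task directory, several processor configurations for the same task share one intermediate `task_df.ld`.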

stats()#

Prints statistics about the dataset.

Return type:

None

property unique_patient_ids: List[str]#

Returns a list of unique patient IDs.

Returns:

List of unique patient IDs.

Return type:

List[str]