pyhealth.datasets.SleepEDFDataset
The open Sleep-EDF Database Expanded dataset. Refer to the dataset documentation for more information.
- class pyhealth.datasets.SleepEDFDataset(root, dataset_name=None, config_path=None, subset='cassette')
Bases: BaseDataset
Base EEG dataset for SleepEDF.
Dataset is available at https://www.physionet.org/content/sleep-edfx/1.0.0/
- For the Sleep Cassette Study portion:
The 153 SC* files (SC = Sleep Cassette) were obtained in a 1987-1991 study of age effects on sleep in healthy Caucasians aged 25-101, without any sleep-related medication [2]. Two PSGs of about 20 hours each were recorded during two subsequent day-night periods at the subjects’ homes. Subjects continued their normal activities but wore a modified Walkman-like cassette-tape recorder described in chapter VI.4 (page 92) of Bob’s 1987 thesis.
Files are named in the form SC4ssNEO-PSG.edf where ss is the subject number, and N is the night. The first nights of subjects 36 and 52, and the second night of subject 13, were lost due to a failing cassette or laserdisk.
The EOG and EEG signals were each sampled at 100 Hz. The submental-EMG signal was electronically high-pass filtered, rectified and low-pass filtered, after which the resulting EMG envelope, expressed in uV rms (root-mean-square), was sampled at 1 Hz. Oro-nasal airflow, rectal body temperature and the event marker were also sampled at 1 Hz.
Subjects and recordings are further described in the file headers and in the descriptive spreadsheet SC-subjects.xls.
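To make the cassette sampling rates concrete, the following minimal sketch computes how many samples one channel yields for a recording of a given length. The function name expected_samples is hypothetical, and the 20-hour duration is only the approximate figure quoted in the description, so real files will differ.

```python
def expected_samples(duration_hours: float, rate_hz: float) -> int:
    """Number of samples in one channel recorded for duration_hours at rate_hz."""
    seconds = duration_hours * 3600
    return round(seconds * rate_hz)


# Approximate per-channel sizes for one ~20-hour cassette PSG (assumed duration).
eeg_samples = expected_samples(20, 100)  # EEG and EOG channels at 100 Hz
slow_samples = expected_samples(20, 1)   # EMG envelope, airflow, temperature, marker at 1 Hz
```

The 100-fold gap between the two counts is why the fast and slow channels cannot be stacked into a single array without resampling one of them.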
- For the Sleep Telemetry portion:
The 44 ST* files (ST = Sleep Telemetry) were obtained in a 1994 study of temazepam effects on sleep in 22 Caucasian males and females without other medication. Subjects had mild difficulty falling asleep but were otherwise healthy. The PSGs of about 9 hours were recorded in the hospital during two nights, one of which was after temazepam intake, and the other of which was after placebo intake. Subjects wore a miniature telemetry system with very good signal quality.
Files are named in the form ST7ssNJ0-PSG.edf where ss is the subject number, and N is the night.
EOG, EMG and EEG signals were sampled at 100 Hz, and the event marker at 1 Hz. The physical marker dimension ID+M-E relates to the fact that pressing the marker (M) button generated two-second deflections from a baseline value that either identifies the telemetry unit (ID = 1 or 2 if positive) or marks an error (E) in the telemetry link if negative. Subjects and recordings are further described in the file headers and in the descriptive spreadsheet ST-subjects.xls.
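The two file-naming forms described above (SC4ssNEO-PSG.edf and ST7ssNJ0-PSG.edf) can be sketched as a small parser that recovers the subject number and night. This is an illustration only, not part of PyHealth: the names RecordingName and parse_psg_name are hypothetical, and accepting either the letter O or the digit 0 as the final character before -PSG is an assumption, since the cassette form is written with EO and the telemetry form with J0. File names that deviate from the documented forms return None.

```python
import re
from typing import NamedTuple, Optional


class RecordingName(NamedTuple):
    study: str    # "cassette" or "telemetry"
    subject: int  # the two-digit "ss" field
    night: int    # the one-digit "N" field


# Assumption: the last character before "-PSG" may be the letter O or the digit 0.
_PATTERNS = {
    "cassette": re.compile(r"^SC4(\d{2})(\d)E[O0]-PSG\.edf$"),
    "telemetry": re.compile(r"^ST7(\d{2})(\d)J[O0]-PSG\.edf$"),
}


def parse_psg_name(filename: str) -> Optional[RecordingName]:
    """Parse a PSG file name that follows one of the two documented forms."""
    for study, pattern in _PATTERNS.items():
        match = pattern.match(filename)
        if match:
            return RecordingName(study, int(match.group(1)), int(match.group(2)))
    return None  # not in either documented form
```

Returning None rather than raising lets a caller skip unrelated files, such as spreadsheets, while scanning a directory.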
- Parameters:
root (str) – Root directory of the raw data. This can be the path to either the Cassette portion or the Telemetry portion.
dataset_name (Optional[str]) – Name of the dataset. Default is None.
config_path (Optional[str]) – Path to the config file. Default is None.
subset (Optional[str]) – Subset of the SleepEDF dataset, either “cassette” or “telemetry”. Default is “cassette”.
- task
Optional[str], name of the task (e.g., “sleep staging”). Default is None.
- samples
Optional[List[Dict]], a list of samples. Each sample is a dict with patient_id, record_id, and other task-specific attributes as keys. Default is None.
- patient_to_index
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
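The relationship between samples and the index attributes can be illustrated with a minimal sketch. It is not PyHealth's implementation: the helper index_by and the sample values are hypothetical, and only the patient_id and record_id keys come from the attribute descriptions above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def index_by(samples: List[Dict[str, Any]], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of the samples holding it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


# Hypothetical samples carrying only the two documented keys.
samples = [
    {"patient_id": "p1", "record_id": "r1"},
    {"patient_id": "p1", "record_id": "r2"},
    {"patient_id": "p2", "record_id": "r3"},
]
patient_to_index = index_by(samples, "patient_id")
```

With such an index, all samples of one patient can be selected at once, which is useful for keeping a patient's recordings in the same split.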
Examples
>>> from pyhealth.datasets import SleepEDFDataset
>>> dataset = SleepEDFDataset(
...     root="/srv/local/data/SLEEPEDF/sleep-edf-database-expanded-1.0.0/",
... )
>>> dataset.stat()
>>> dataset.info()
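Selecting the telemetry portion through the subset parameter might look like the sketch below. The helper names sleepedf_kwargs and load_sleepedf are hypothetical, and the up-front check of subset is an addition made for illustration rather than documented behaviour of the class. Only the root and subset arguments are taken from the signature above.

```python
from typing import Any, Dict

# The two subset names listed in the parameter description.
VALID_SUBSETS = ("cassette", "telemetry")


def sleepedf_kwargs(root: str, subset: str = "cassette") -> Dict[str, Any]:
    """Build constructor arguments, rejecting unknown subset names early."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {VALID_SUBSETS}, got {subset!r}")
    return {"root": root, "subset": subset}


def load_sleepedf(root: str, subset: str = "cassette"):
    """Construct the dataset; PyHealth is imported only when this is called."""
    from pyhealth.datasets import SleepEDFDataset

    return SleepEDFDataset(**sleepedf_kwargs(root, subset))
```

A call such as load_sleepedf(root, subset="telemetry") then needs PyHealth installed and root pointing at a local copy of the data.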
- prepare_metadata_cassette(root)
Prepare metadata for the SleepEDF cassette subset.
- Parameters:
root (str) – Root directory containing the dataset files.
This method processes the raw cassette metadata files and saves a processed CSV file.
- Return type:
- prepare_metadata_telemetry(root)
Prepare metadata for the SleepEDF telemetry subset.
- Parameters:
root (str) – Root directory containing the dataset files.
This method processes the raw telemetry metadata files and saves a processed CSV file.
- Return type:
- property default_task: SleepStagingSleepEDF
Returns the default task for this dataset.
- Returns:
The default task instance.
- Return type:
SleepStagingSleepEDF
- create_tmpdir()
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
Patient
- Raises:
AssertionError – If the patient ID is not found in the dataset.
- property global_event_df: LazyFrame
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
LazyFrame
- iter_patients(df=None)
Yields Patient objects for each unique patient in the dataset.
- load_data()
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
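The cache structure described for set_task can be spelled out as paths, which may help when inspecting or clearing a cache by hand. This is a sketch under assumptions: task_cache_layout is a hypothetical helper, the identifier values are placeholders (how PyHealth derives task_uuid and proc_uuid is not covered here), and placing schema.pkl inside the samples directory is an assumed reading of the listing.

```python
from pathlib import Path
from typing import Dict


def task_cache_layout(
    cache_root: str, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Compose the cache paths named in the set_task description."""
    task_dir = Path(cache_root) / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,                   # {task_name}_{task_uuid}/
        "task_df": task_dir / "task_df.ld",     # intermediate task dataframe
        "samples_dir": samples_dir,             # final processed samples
        "schema": samples_dir / "schema.pkl",   # assumed to sit inside samples_dir
    }


# Placeholder identifiers, chosen only for illustration.
layout = task_cache_layout("/tmp/cache", "sleep_staging", "abc123", "def456")
```

The function only builds paths and never touches the filesystem, so it is safe to run without an existing cache.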