pyhealth.datasets.TUEVDataset#
Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml
This corpus is a subset of TUEG that contains annotations of EEG segments as one of six classes: (1) spike and sharp wave (SPSW), (2) generalized periodic epileptiform discharges (GPED), (3) periodic lateralized epileptiform discharges (PLED), (4) eye movement (EYEM), (5) artifact (ARTF) and (6) background (BCKG).
- class pyhealth.datasets.TUEVDataset(root, dataset_name=None, config_path=None, subset='both', **kwargs)[source]#
Bases:
BaseDataset
Base EEG dataset for the TUH EEG Events Corpus.
- Files are named in the form of bckg_032_a_.edf in the eval partition:
  - bckg: this file contains background annotations.
  - 032: a reference to the eval index.
  - a_.edf: EEG files are split into a series of files starting with a_.edf, a_1.edf, … These represent pruned EEGs: the original EEG is split into these segments, and uninteresting parts of the original recording are deleted.
- or in the form of 00002275_00000001.edf in the train partition:
  - 00002275: a reference to the train index.
  - 00000001: indicating that this is the first file associated with this patient.
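The two naming schemes above can be told apart with a pair of regular expressions. The following is a minimal sketch, not PyHealth's own parser: the helper name `parse_tuev_filename` and the `partition` key are hypothetical, while the other field names follow the ones used in the `prepare_metadata()` description.

```python
import re
from typing import Dict

# Eval partition: <label>_<index>_<segment>.edf, e.g. bckg_032_a_.edf or bckg_032_a_1.edf
EVAL_PATTERN = re.compile(
    r"^(?P<label_kind>[a-z]+)_(?P<eval_index>\d+)_(?P<segment_id>a_\d*)\.edf$"
)
# Train partition: <subject>_<record>.edf, e.g. 00002275_00000001.edf
TRAIN_PATTERN = re.compile(r"^(?P<subject_id>\d+)_(?P<record_id>\d+)\.edf$")


def parse_tuev_filename(filename: str) -> Dict[str, str]:
    """Split a TUEV file name into its components (illustrative helper)."""
    for partition, pattern in (("train", TRAIN_PATTERN), ("eval", EVAL_PATTERN)):
        match = pattern.match(filename)
        if match:
            # Keep every field as a string so leading zeros are preserved.
            return {"partition": partition, **match.groupdict()}
    raise ValueError(f"Unrecognized TUEV file name: {filename!r}")


print(parse_tuev_filename("bckg_032_a_.edf"))
print(parse_tuev_filename("00002275_00000001.edf"))
```

Keeping the indices as strings avoids silently turning 032 into 32, which would no longer match the file name on disk.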
- Parameters:
root (str) – root directory of the raw data. You can choose to use the path to Cassette portion or the Telemetry portion.
config_path (Optional[str]) – Optional configuration file name, defaults to “tuev.yaml”.
dev – whether to enable dev mode (only use a small subset of the data). Default is False.
- task#
Optional[str], name of the task (e.g., “EEG_events”). Default is None.
- samples#
Optional[List[Dict]], a list of samples; each sample is a dict with patient_id, record_id, and other task-specific attributes as keys. Default is None.
- patient_to_index#
Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.
- visit_to_index#
Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
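To illustrate the shape of these index attributes, here is a minimal sketch of how a mapping from patient_id to sample positions could be derived from a list of samples. The helper `build_index` and the toy samples are hypothetical and only mirror the documented types; PyHealth may build its indices differently.

```python
from collections import defaultdict
from typing import Dict, List

# Toy samples; real samples also carry task-specific fields such as the signal and label.
samples = [
    {"patient_id": "00002275", "record_id": "00000001"},
    {"patient_id": "00002275", "record_id": "00000002"},
    {"patient_id": "00000906", "record_id": "00000001"},
]


def build_index(samples: List[dict], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of the samples holding it."""
    index = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


patient_to_index = build_index(samples, "patient_id")
print(patient_to_index)  # {'00002275': [0, 1], '00000906': [2]}
```

Such an index makes it cheap to split data by patient, so that no patient contributes samples to both the training and evaluation sets.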
Examples
>>> from pyhealth.datasets import TUEVDataset
>>> from pyhealth.tasks import EEGEventsTUEV
>>> dataset = TUEVDataset(
...     root="/srv/local/data/TUH/tuh_eeg_events/v2.0.0/edf/",
... )
>>> dataset.stats()
>>> sample_dataset = dataset.set_task(EEGEventsTUEV())
>>> sample = sample_dataset[0]
>>> print(sample['signal'].shape)  # (16, 1280)
For a complete example, see examples/conformal_eeg/tuev_eeg_quickstart.ipynb.
- prepare_metadata()[source]#
Build and save processed metadata CSVs for TUEV train/eval separately.
This writes:
- <root>/tuev-train-pyhealth.csv
- <root>/tuev-eval-pyhealth.csv
Train filenames look like: 00002275_00000001.edf
- subject_id = 00002275
- record_id = 00000001
Eval filenames look like: bckg_032_a_.edf
- label_kind = bckg (or spsw/gped/pled/eyem/artf depending on file)
- eval_index = 032
- segment_id = a_ / a_1 / …
- Return type:
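As a rough illustration of the train-side metadata step, the sketch below writes one CSV row per train file name into a temporary directory. It is an assumption-laden stand-in rather than the library's code: the function `write_train_metadata`, the `filename` column, and the column order are hypothetical; only the output file name and the subject_id/record_id fields come from the description above.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical train file names following the documented <subject_id>_<record_id>.edf pattern.
train_files = ["00002275_00000001.edf", "00002275_00000002.edf"]


def write_train_metadata(root: Path, filenames) -> Path:
    """Write one metadata row per train file and return the CSV path."""
    out_path = root / "tuev-train-pyhealth.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["subject_id", "record_id", "filename"]
        )
        writer.writeheader()
        for name in filenames:
            # The stem drops ".edf"; the two parts are kept as zero-padded strings.
            subject_id, record_id = Path(name).stem.split("_")
            writer.writerow(
                {"subject_id": subject_id, "record_id": record_id, "filename": name}
            )
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_train_metadata(Path(tmp), train_files)
    csv_text = csv_path.read_text()

print(csv_text)
```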
- property default_task: EEGEventsTUEV#
Returns the default task for the TUEV dataset.
- Returns:
The default task instance.
- Return type:
EEGEventsTUEV
- create_tmpdir()#
Creates and returns a new temporary directory within the cache.
- Returns:
The path to the new temporary directory.
- Return type:
- get_patient(patient_id)#
Retrieves a Patient object for the given patient ID.
- Parameters:
patient_id (str) – The ID of the patient to retrieve.
- Returns:
The Patient object for the given ID.
- Return type:
- Raises:
AssertionError – If the patient ID is not found in the dataset.
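Because an unknown ID surfaces as an AssertionError rather than a KeyError, callers that expect missing patients need to catch that exception type. The sketch below uses a hypothetical stand-in class (`TinyPatientStore`) and a hypothetical wrapper (`safe_get`); it only mirrors the documented contract and is not the PyHealth implementation.

```python
class TinyPatientStore:
    """Hypothetical stand-in for a dataset; not the PyHealth implementation."""

    def __init__(self, patient_ids):
        self._patient_ids = set(patient_ids)

    def get_patient(self, patient_id: str) -> dict:
        # Mirrors the documented contract: unknown IDs raise an AssertionError.
        # Raised explicitly so the check still runs under `python -O`.
        if patient_id not in self._patient_ids:
            raise AssertionError(f"Patient {patient_id} not found in dataset")
        return {"patient_id": patient_id}


def safe_get(store: TinyPatientStore, patient_id: str):
    """Return the patient, or None when the ID is unknown."""
    try:
        return store.get_patient(patient_id)
    except AssertionError:
        return None


store = TinyPatientStore(["00002275"])
found = safe_get(store, "00002275")
missing = safe_get(store, "99999999")
print(found, missing)
```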
- property global_event_df: LazyFrame#
Returns the cached event dataframe as a lazy frame.
- Returns:
The cached event dataframe.
- Return type:
- iter_patients(df=None)#
Yields Patient objects for each unique patient in the dataset.
- load_data()#
Loads data from the specified tables.
- Returns:
A concatenated lazy frame of all tables.
- Return type:
dd.DataFrame
- load_table(table_name)#
Loads a table and processes joins if specified.
- Parameters:
table_name (str) – The name of the table to load.
- Returns:
The processed Dask dataframe for the table.
- Return type:
dd.DataFrame
- Raises:
ValueError – If the table is not found in the config.
FileNotFoundError – If the CSV file for the table or join is not found.
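The two error conditions can be pictured as a pair of checks performed before any data is read. This is a minimal sketch under stated assumptions: the function `resolve_table_csv`, the table names, and the `file_path` config key are hypothetical and do not describe the actual layout of tuev.yaml.

```python
import tempfile
from pathlib import Path


def resolve_table_csv(config: dict, root: Path, table_name: str) -> Path:
    """Return the CSV path for a table, validating config and file presence."""
    if table_name not in config:
        raise ValueError(f"Table {table_name!r} not found in config")
    csv_path = root / config[table_name]["file_path"]
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    return csv_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tuev-train-pyhealth.csv").write_text("subject_id,record_id\n")
    config = {
        "train": {"file_path": "tuev-train-pyhealth.csv"},
        "eval": {"file_path": "tuev-eval-pyhealth.csv"},  # file intentionally absent
    }
    outcomes = {}
    for name in ("train", "eval", "unknown"):
        try:
            resolve_table_csv(config, root, name)
            outcomes[name] = "ok"
        except (ValueError, FileNotFoundError) as error:
            outcomes[name] = type(error).__name__

print(outcomes)  # {'train': 'ok', 'eval': 'FileNotFoundError', 'unknown': 'ValueError'}
```

A FileNotFoundError for a metadata CSV may simply mean that the metadata files have not been generated yet.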
- set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#
Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:
{task_name}_{task_uuid}/          # Cached data for specific task based on task name, schema, and args
    task_df.ld/                   # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/       # Final processed samples after applying processors
        schema.pkl                # Saved SampleBuilder schema
        *.bin                     # Processed sample files
- Parameters:
task (Optional[BaseTask]) – The task to set. Uses default task if None.
num_workers (int) – Number of workers for multi-threading. Default is self.num_workers.
input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.
output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.
- Returns:
The generated sample dataset.
- Return type:
- Raises:
AssertionError – If no default task is found and task is None.
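The cache layout keys each task directory on the task name, schema, and arguments. One way to obtain such a stable key is to hash a canonical serialization of those inputs, as sketched below. This illustrates the idea only: the helper `task_cache_dir`, the use of `uuid.uuid5`, the schema values, and the `window_seconds` argument are all assumptions, not PyHealth's actual scheme.

```python
import json
import uuid
from pathlib import Path


def task_cache_dir(cache_root: Path, task_name: str, schema: dict, args: dict) -> Path:
    """Derive a deterministic cache directory from the task's identifying inputs."""
    # sort_keys makes the serialization independent of dict insertion order.
    fingerprint = json.dumps(
        {"name": task_name, "schema": schema, "args": args}, sort_keys=True
    )
    task_uuid = uuid.uuid5(uuid.NAMESPACE_URL, fingerprint)
    return cache_root / f"{task_name}_{task_uuid}"


schema = {"signal": "tensor", "label": "multiclass"}  # hypothetical schema
dir_a = task_cache_dir(Path("cache"), "EEG_events", schema, {"window_seconds": 5})
dir_b = task_cache_dir(Path("cache"), "EEG_events", schema, {"window_seconds": 5})
dir_c = task_cache_dir(Path("cache"), "EEG_events", schema, {"window_seconds": 10})

print(dir_a == dir_b, dir_a == dir_c)  # True False
```

Identical inputs map to the same directory, so cached samples can be reused, while any change to the arguments yields a fresh directory instead of stale data.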