pyhealth.datasets.TUEVDataset#

Dataset is available at https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml

This corpus is a subset of TUEG that contains annotations of EEG segments as one of six classes: (1) spike and sharp wave (SPSW), (2) generalized periodic epileptiform discharges (GPED), (3) periodic lateralized epileptiform discharges (PLED), (4) eye movement (EYEM), (5) artifact (ARTF) and (6) background (BCKG).
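
As a rough illustration of the six classes listed above, the sketch below encodes them as an enum using the same 1-based numbering. The name `TUEVLabel` and the helper `label_name` are hypothetical and not part of PyHealth; whether a given task emits 0-based or 1-based labels is an assumption to verify against the task you use.

```python
from enum import IntEnum


class TUEVLabel(IntEnum):
    """Hypothetical enum mirroring the six classes in the order listed above."""

    SPSW = 1  # spike and sharp wave
    GPED = 2  # generalized periodic epileptiform discharges
    PLED = 3  # periodic lateralized epileptiform discharges
    EYEM = 4  # eye movement
    ARTF = 5  # artifact
    BCKG = 6  # background


def label_name(code: int) -> str:
    """Return the abbreviation for a 1-based class code."""
    return TUEVLabel(code).name


print(label_name(1))            # SPSW
print(TUEVLabel["BCKG"].value)  # 6
```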

class pyhealth.datasets.TUEVDataset(root, dataset_name=None, config_path=None, subset='both', **kwargs)[source]#

Bases: BaseDataset

Base EEG dataset for the TUH EEG Events Corpus.

Files in the eval partition are named in the form bckg_032_a_.edf:
  • bckg: this file contains background annotations.

  • 032: a reference to the eval index.

  • a_.edf: EEG files are split into a series of files starting with a_.edf, a_1.edf, … These are pruned EEGs: the original recording was split into these segments, and its uninteresting parts were deleted.

Files in the train partition are named in the form 00002275_00000001.edf:
  • 00002275: a reference to the train index.

  • 00000001: indicating that this is the first file associated with this patient.

Parameters:
  • root (str) – root directory of the raw data (the Examples section uses the corpus edf/ directory).

  • dataset_name (Optional[str]) – name of the dataset.

  • config_path (Optional[str]) – Optional configuration file name, defaults to “tuev.yaml”.

  • dev – whether to enable dev mode (only use a small subset of the data). Default is False.

task#

Optional[str], name of the task (e.g., “EEG_events”). Default is None.

samples#

Optional[List[Dict]], a list of samples. Each sample is a dict with patient_id, record_id, and other task-specific attributes as keys. Default is None.

patient_to_index#

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index#

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
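
The index attributes above can be pictured as a simple grouping over the sample list. The following is a minimal sketch, not PyHealth's implementation: `build_index` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_index(samples: List[Dict[str, Any]], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of samples holding it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


# Made-up samples; real ones also carry task-specific fields such as the signal.
samples = [
    {"patient_id": "00002275", "record_id": "00000001"},
    {"patient_id": "00002275", "record_id": "00000002"},
    {"patient_id": "00004151", "record_id": "00000001"},
]

patient_to_index = build_index(samples, "patient_id")
print(patient_to_index)  # {'00002275': [0, 1], '00004151': [2]}
```

The same helper would produce a visit-level index for samples that carry a visit_id key.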

Examples

>>> from pyhealth.datasets import TUEVDataset
>>> from pyhealth.tasks import EEGEventsTUEV
>>> dataset = TUEVDataset(
...     root="/srv/local/data/TUH/tuh_eeg_events/v2.0.0/edf/",
... )
>>> dataset.stats()
>>> sample_dataset = dataset.set_task(EEGEventsTUEV())
>>> sample = sample_dataset[0]
>>> print(sample['signal'].shape)  # (16, 1280)

For a complete example, see examples/conformal_eeg/tuev_eeg_quickstart.ipynb.

prepare_metadata()[source]#

Build and save processed metadata CSVs for TUEV train/eval separately.

This writes:
  • <root>/tuev-train-pyhealth.csv

  • <root>/tuev-eval-pyhealth.csv

Train filenames look like 00002275_00000001.edf:
  • subject_id = 00002275

  • record_id = 00000001

Eval filenames look like bckg_032_a_.edf:
  • label_kind = bckg (or spsw/gped/pled/eyem/artf depending on file)

  • eval_index = 032

  • segment_id = a_ / a_1 / …
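
The two filename forms described above could be parsed along the following lines. This is an illustrative sketch, not the code behind prepare_metadata: the function `parse_tuev_filename`, the `split` key, the 8-digit field widths (inferred from the single train example), and the assumption that every segment ID starts with `a_` are all hypothetical.

```python
import re
from typing import Dict, Optional

# Train files: <8-digit subject_id>_<8-digit record_id>.edf (widths assumed).
TRAIN_PATTERN = re.compile(r"^(?P<subject_id>\d{8})_(?P<record_id>\d{8})\.edf$")
# Eval files: <label_kind>_<eval_index>_<segment_id>.edf, segment_id = a_, a_1, ...
EVAL_PATTERN = re.compile(
    r"^(?P<label_kind>[a-z]+)_(?P<eval_index>\d+)_(?P<segment_id>a_\d*)\.edf$"
)


def parse_tuev_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return the fields encoded in a filename, or None if it matches neither form."""
    for split, pattern in (("train", TRAIN_PATTERN), ("eval", EVAL_PATTERN)):
        match = pattern.match(filename)
        if match:
            return {"split": split, **match.groupdict()}
    return None


print(parse_tuev_filename("00002275_00000001.edf"))
print(parse_tuev_filename("bckg_032_a_.edf"))
```

Returning None for unrecognized names, rather than raising, lets a caller skip stray files while walking a directory.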

Return type:

None

property default_task: EEGEventsTUEV#

Returns the default task for the TUEV dataset: EEGEventsTUEV.

Returns:

The default task instance.

Return type:

EEGEventsTUEV

clean_tmpdir()#

Cleans up the temporary directory within the cache.

Return type:

None

create_tmpdir()#

Creates and returns a new temporary directory within the cache.

Returns:

The path to the new temporary directory.

Return type:

Path

get_patient(patient_id)#

Retrieves a Patient object for the given patient ID.

Parameters:

patient_id (str) – The ID of the patient to retrieve.

Returns:

The Patient object for the given ID.

Return type:

Patient

Raises:

AssertionError – If the patient ID is not found in the dataset.
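
Because get_patient signals an unknown ID with AssertionError, callers that expect missing patients may want a thin wrapper. The sketch below is one possible pattern: `get_patient_or_none` is a hypothetical helper, and `_FakeDataset` is a stand-in used only so the example runs without the corpus.

```python
from typing import Any, Optional


def get_patient_or_none(dataset: Any, patient_id: str) -> Optional[Any]:
    """Return dataset.get_patient(patient_id), or None if the ID is unknown."""
    try:
        return dataset.get_patient(patient_id)
    except AssertionError:
        # Documented behaviour: an unknown patient ID raises AssertionError.
        return None


class _FakeDataset:
    """Stand-in used only to exercise the helper without the real corpus."""

    def __init__(self, patients):
        self._patients = patients

    def get_patient(self, patient_id):
        assert patient_id in self._patients, f"Patient {patient_id} not found"
        return self._patients[patient_id]


fake = _FakeDataset({"00002275": "patient-object"})
print(get_patient_or_none(fake, "00002275"))  # patient-object
print(get_patient_or_none(fake, "missing"))   # None
```

Python strips assert statements when run with -O, so checking membership in unique_patient_ids before the call is a sturdier alternative where that matters.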

property global_event_df: LazyFrame#

Returns the cached event dataframe as a lazy frame.

Returns:

The cached event dataframe.

Return type:

LazyFrame

iter_patients(df=None)#

Yields Patient objects for each unique patient in the dataset.

Yields:

Iterator[Patient] – An iterator over Patient objects.

Return type:

Iterator[Patient]

load_data()#

Loads data from the specified tables.

Returns:

A concatenated Dask dataframe of all tables.

Return type:

dd.DataFrame

load_table(table_name)#

Loads a table and processes joins if specified.

Parameters:

table_name (str) – The name of the table to load.

Returns:

The processed Dask dataframe for the table.

Return type:

dd.DataFrame

set_task(task=None, num_workers=None, input_processors=None, output_processors=None)#

Processes the base dataset to generate the task-specific sample dataset. The cache structure is as follows:

{task_name}_{task_uuid}/        # Cached data for specific task based on task name, schema, and args
    task_df.ld/                 # Intermediate task dataframe based on schema
    samples_{proc_uuid}.ld/     # Final processed samples after applying processors
        schema.pkl              # Saved SampleBuilder schema
        *.bin                   # Processed sample files

Parameters:
  • task (Optional[BaseTask]) – The task to set. Uses default task if None.

  • num_workers (Optional[int]) – Number of workers for multi-threading. Default is self.num_workers.

  • input_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted input processors. If provided, these will be used instead of creating new ones from task’s input_schema. Defaults to None.

  • output_processors (Optional[Dict[str, FeatureProcessor]]) – Pre-fitted output processors. If provided, these will be used instead of creating new ones from task’s output_schema. Defaults to None.

Returns:

The generated sample dataset.

Return type:

SampleDataset

Raises:

AssertionError – If no default task is found and task is None.
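
To make the cache structure described for set_task concrete, the sketch below assembles the documented paths with pathlib. It only mirrors the layout shown above and does not call PyHealth: `task_cache_paths` is a hypothetical helper, and the cache root and both UUID values are placeholders.

```python
from pathlib import Path
from typing import Dict


def task_cache_paths(
    cache_root: Path, task_name: str, task_uuid: str, proc_uuid: str
) -> Dict[str, Path]:
    """Build the paths of the documented task cache layout."""
    task_dir = cache_root / f"{task_name}_{task_uuid}"
    samples_dir = task_dir / f"samples_{proc_uuid}.ld"
    return {
        "task_dir": task_dir,               # cached data for one task
        "task_df": task_dir / "task_df.ld",  # intermediate task dataframe
        "samples_dir": samples_dir,         # final processed samples
        "schema": samples_dir / "schema.pkl",  # saved SampleBuilder schema
    }


# Placeholder values; real UUIDs are derived by the library.
paths = task_cache_paths(Path("cache"), "EEG_events", "1234", "abcd")
print(paths["schema"].as_posix())  # cache/EEG_events_1234/samples_abcd.ld/schema.pkl
```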

stats()#

Prints statistics about the dataset.

Return type:

None

property unique_patient_ids: List[str]#

Returns a list of unique patient IDs.

Returns:

List of unique patient IDs.

Return type:

List[str]