pyhealth.tasks.DeIDNERTask
- class pyhealth.tasks.DeIDNERTask(window_size=None, window_overlap=0)
Bases: BaseTask
Token-level NER task for clinical text de-identification.
Each sample contains a list of tokens and their BIO labels over 7 PHI categories: AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION.
Supports optional overlapping windowing (paper Section 3.3) to handle notes longer than BERT's 512-token limit.
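As a rough illustration of the BIO labelling described above, the sketch below builds a label vocabulary from the seven PHI categories and tags a made-up sentence. The names `PHI_CATEGORIES` and `BIO_LABELS`, the label ordering, and the example sentence are assumptions for illustration only; the label set actually used by `DeIDNERTask` may be laid out differently.

```python
# PHI categories as listed in the class description.
PHI_CATEGORIES = ["AGE", "CONTACT", "DATE", "ID", "LOCATION", "NAME", "PROFESSION"]

# Hypothetical label vocabulary: "O" (outside any PHI span) plus a
# B- (begin) and I- (inside) tag for every category.
BIO_LABELS = ["O"] + [
    f"{prefix}-{category}"
    for category in PHI_CATEGORIES
    for prefix in ("B", "I")
]

# Made-up example sentence, not taken from any dataset.
tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "03/14/2020"]
labels = ["O", "O", "O", "B-NAME", "I-NAME", "O", "B-DATE"]

# Token-level NER requires exactly one label per token.
assert len(tokens) == len(labels)
assert all(label in BIO_LABELS for label in labels)

print(len(BIO_LABELS))  # 15: one "O" tag plus 2 tags for each of 7 categories
```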
- Parameters:
window_size (default None)
window_overlap (default 0)
Examples
>>> from pyhealth.datasets import PhysioNetDeIDDataset
>>> from pyhealth.tasks import DeIDNERTask
>>> dataset = PhysioNetDeIDDataset(root="/path/to/data")
>>> task = DeIDNERTask()
>>> samples = dataset.set_task(task)
>>> task_windowed = DeIDNERTask(window_size=100, window_overlap=60)
>>> samples = dataset.set_task(task_windowed)
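To make the overlapping windowing more concrete, here is a minimal sketch of how a token sequence and its labels could be cut into overlapping windows. It is not the library's implementation: the helper `sliding_windows` is hypothetical, and it assumes that `window_size` is the window length in tokens and `window_overlap` is the number of tokens shared by consecutive windows, so the stride is `window_size - window_overlap`.

```python
from typing import List, Tuple


def sliding_windows(
    tokens: List[str],
    labels: List[str],
    window_size: int,
    window_overlap: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Split aligned tokens/labels into overlapping windows (hypothetical helper)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= window_overlap < window_size:
        raise ValueError("window_overlap must be in [0, window_size)")
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    if not tokens:
        return []

    # Assumed semantics: consecutive windows share `window_overlap` tokens.
    stride = window_size - window_overlap
    windows = []
    start = 0
    while True:
        end = start + window_size
        windows.append((tokens[start:end], labels[start:end]))
        if end >= len(tokens):
            break  # the last window reached the end of the note
        start += stride
    return windows


# Usage on a toy 10-token note: windows of 4 tokens, 2 shared with the neighbour.
toy_tokens = [f"t{i}" for i in range(10)]
toy_labels = ["O"] * 10
for window_tokens, _ in sliding_windows(toy_tokens, toy_labels, 4, 2):
    print(window_tokens)
```

Under these assumptions, the `window_size=100, window_overlap=60` setting from the example above would advance 40 tokens per window. The final window may be shorter than `window_size` when the note length is not aligned with the stride.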
- input_schema: Dict[str, Union[str, Type]] = {'text': <class 'pyhealth.processors.text_processor.TextProcessor'>}
- output_schema: Dict[str, Union[str, Type]] = {'labels': <class 'pyhealth.processors.text_processor.TextProcessor'>}
- pre_filter(df)
- Return type: LazyFrame