pyhealth.tasks.DeIDNERTask

class pyhealth.tasks.DeIDNERTask(window_size=None, window_overlap=0)

Bases: BaseTask

Token-level NER task for clinical text de-identification.

Each sample contains a list of tokens and their BIO labels over 7 PHI categories: AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION.
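To make the label space concrete, here is a minimal sketch of a BIO tag set over the seven categories listed above. The exact label strings (`B-NAME`, `I-NAME`, `O`), the helper `build_bio_labels`, and the example sentence are assumptions for illustration; the strings PyHealth actually emits may differ.

```python
from typing import Dict, List

# The seven PHI categories named in the task description.
PHI_CATEGORIES: List[str] = [
    "AGE", "CONTACT", "DATE", "ID", "LOCATION", "NAME", "PROFESSION",
]


def build_bio_labels(categories: List[str]) -> List[str]:
    """Return a BIO tag set: 'O' plus a B-/I- pair per category.

    Hypothetical helper; the label format is an assumption.
    """
    labels = ["O"]
    for category in categories:
        labels.append(f"B-{category}")  # first token of an entity
        labels.append(f"I-{category}")  # continuation token of an entity
    return labels


BIO_LABELS = build_bio_labels(PHI_CATEGORIES)
LABEL_TO_ID: Dict[str, int] = {label: i for i, label in enumerate(BIO_LABELS)}

# Illustrative (made-up) sample: one label per token.
tokens = ["Seen", "by", "John", "Smith", "on", "2019-03-04"]
labels = ["O", "O", "B-NAME", "I-NAME", "O", "B-DATE"]
label_ids = [LABEL_TO_ID[label] for label in labels]
```

Under this scheme, seven categories give 15 distinct tags: one `O` tag plus a `B-`/`I-` pair per category.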

Supports optional overlapping windows (paper Section 3.3) to handle notes longer than BERT's 512-token input limit.

Parameters:
  • window_size (Optional[int]) – If set, split notes into overlapping windows of this many tokens. Default None (no windowing).

  • window_overlap (int) – Number of tokens shared between consecutive windows. Default 0.
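The interaction between the two parameters can be sketched as a sliding window with stride `window_size - window_overlap`. This is a minimal illustration, not PyHealth's implementation: the function name `split_into_windows`, the validation of the overlap, and the handling of the final short window are all assumptions.

```python
from typing import List, Optional, Tuple


def split_into_windows(
    tokens: List[str],
    labels: List[str],
    window_size: Optional[int] = None,
    window_overlap: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Split aligned tokens/labels into overlapping windows.

    Hypothetical helper; the library's actual behavior may differ.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    if window_size is None:
        # No windowing: the whole note is a single sample.
        return [(tokens, labels)]
    if window_size <= 0 or not 0 <= window_overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= window_overlap < window_size")

    stride = window_size - window_overlap  # tokens advanced per window
    windows = []
    start = 0
    while True:
        end = start + window_size
        windows.append((tokens[start:end], labels[start:end]))
        if end >= len(tokens):
            break  # this window reached the end of the note
        start += stride
    return windows


# Ten tokens, windows of 4 sharing 2 tokens: starts at 0, 2, 4, 6.
toy_tokens = [f"t{i}" for i in range(10)]
toy_labels = ["O"] * 10
toy_windows = split_into_windows(toy_tokens, toy_labels, window_size=4, window_overlap=2)
```

With `window_size=100` and `window_overlap=60`, as in the example below, each window would start 40 tokens after the previous one under this scheme.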

task_name

The name of the task.

Type:

str

input_schema

The schema for the task input.

Type:

Dict[str, Union[str, Type]]

output_schema

The schema for the task output.

Type:

Dict[str, Union[str, Type]]

Examples

>>> from pyhealth.datasets import PhysioNetDeIDDataset
>>> from pyhealth.tasks import DeIDNERTask
>>> dataset = PhysioNetDeIDDataset(root="/path/to/data")
>>> # Default task: no windowing.
>>> task = DeIDNERTask()
>>> samples = dataset.set_task(task)
>>> # Windowed task: 100-token windows, consecutive windows share 60 tokens.
>>> task_windowed = DeIDNERTask(window_size=100, window_overlap=60)
>>> samples_windowed = dataset.set_task(task_windowed)
task_name: str = 'DeIDNER'
input_schema: Dict[str, Union[str, Type]] = {'text': <class 'pyhealth.processors.text_processor.TextProcessor'>}
output_schema: Dict[str, Union[str, Type]] = {'labels': <class 'pyhealth.processors.text_processor.TextProcessor'>}
pre_filter(df)
Return type:

LazyFrame