pyhealth.processors.NestedSequenceProcessor

Processor for nested categorical sequence data with vocabulary.

Handles nested sequences like drug recommendation history where each sample contains a list of visits, and each visit contains a list of codes. For example: [["code1", "code2"], ["code3"], ["code4", "code5", "code6"]]

class pyhealth.processors.NestedSequenceProcessor(padding=0)

Bases: FeatureProcessor, TokenProcessorInterface

Feature processor for nested categorical sequences with vocabulary.

Handles nested sequences like drug recommendation history where each sample contains a list of visits, and each visit contains a list of codes: [["code1", "code2"], ["code3"], ["code4", "code5", "code6"]]

The processor:
  1. Builds a vocabulary from all codes across all samples

  2. Encodes codes to indices

  3. Pads inner sequences to the maximum sequence length found during fit

  4. Returns a 2D tensor of shape (num_visits, max_codes_per_visit)

Special tokens:
  • <pad>: 0 for padding

  • <unk>: 1 for unknown codes
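
The vocabulary and encoding steps can be illustrated with a minimal pure-Python sketch. The helper names build_vocab and encode are hypothetical and are not part of PyHealth. Assigning indices in sorted order is an assumption made only so the result is deterministic; the library may order codes differently.

```python
from typing import Dict, Iterable

PAD, UNK = 0, 1  # special-token indices, as documented above


def build_vocab(samples: Iterable[dict], field: str) -> Dict[str, int]:
    """Collect every code in every visit and assign indices starting at 2."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    codes = set()
    for sample in samples:
        for visit in sample[field]:
            codes.update(visit or [])  # tolerate empty or None visits
    # Assumption: sorted order, chosen only to make the sketch deterministic.
    for code in sorted(codes):
        vocab[code] = len(vocab)
    return vocab


def encode(code: str, vocab: Dict[str, int]) -> int:
    """Map a code to its index, falling back to <unk> for unseen codes."""
    return vocab.get(code, UNK)


samples = [
    {"codes": [["A", "B"], ["C", "D", "E"]]},
    {"codes": [["F"]]},
]
vocab = build_vocab(samples, "codes")
encoded = [encode(c, vocab) for c in ["A", "F", "Z"]]  # "Z" was never seen
```

In this sketch the unseen code "Z" maps to the <unk> index, while index 0 stays reserved for padding.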

Parameters:

padding (int) – Extra padding added on top of the observed maximum inner sequence length. The padded length is observed_max + padding, which lets the processor accommodate inner sequences up to padding codes longer than any seen during fit. Default: 0 (no extra padding).
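
The length arithmetic can be sketched as follows. target_inner_len is a hypothetical helper, not a PyHealth function; it only mirrors the observed_max + padding rule stated above.

```python
from typing import Iterable


def target_inner_len(samples: Iterable[dict], field: str, padding: int = 0) -> int:
    """Return observed_max + padding for the nested sequences under `field`."""
    observed_max = 0
    for sample in samples:
        for visit in sample[field]:
            observed_max = max(observed_max, len(visit or []))
    return observed_max + padding


samples = [
    {"codes": [["A", "B"], ["C", "D", "E"]]},
    {"codes": [["F"]]},
]
no_extra = target_inner_len(samples, "codes")               # observed_max only
with_extra = target_inner_len(samples, "codes", padding=2)  # observed_max + 2
```

Here the longest visit has three codes, so padding=2 leaves room for visits of up to five codes.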

Examples

>>> processor = NestedSequenceProcessor()
>>> # During fit, determines max inner sequence length
>>> samples = [
...     {"codes": [["A", "B"], ["C", "D", "E"]]},
...     {"codes": [["F"]]}
... ]
>>> processor.fit(samples, "codes")
>>> # Process nested sequence (observed_max=3, default padding=0, total=3)
>>> result = processor.process([["A", "B"], ["C"]])
>>> result.shape  # (2, 3) - 2 visits, padded to observed_max

fit(samples, field)

Build vocabulary and determine maximum inner sequence length.

Parameters:
  • samples (Iterable[Dict[str, Any]]) – Iterable of sample dictionaries.

  • field (str) – The field name containing nested sequences.

Return type:

None

remove(tokens)

Remove the specified tokens from the processor’s vocabulary.

retain(tokens)

Retain only the specified tokens in the processor’s vocabulary.

add(tokens)

Add the specified tokens to the processor’s vocabulary.

tokens()

Return the set of tokens in the processor’s vocabulary.

Return type:

set[str]
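
One way to read remove, retain, and add is as set operations on the vocabulary. The sketch below is an assumption about the semantics, not PyHealth's implementation: the helper reindex is hypothetical, and keeping <pad> and <unk> fixed at 0 and 1 while renumbering the remaining codes in sorted order is a choice made for the sketch.

```python
from typing import Dict, Set

SPECIAL = {"<pad>": 0, "<unk>": 1}


def reindex(codes: Set[str]) -> Dict[str, int]:
    """Rebuild a vocabulary with special tokens first, then sorted codes."""
    vocab = dict(SPECIAL)
    for code in sorted(codes - set(SPECIAL)):
        vocab[code] = len(vocab)
    return vocab


vocab = reindex({"A", "B", "C", "D"})

# Assumed set semantics for the three vocabulary-editing methods:
removed = reindex(set(vocab) - {"B"})        # like remove({"B"})
retained = reindex(set(vocab) & {"A", "C"})  # like retain({"A", "C"})
added = reindex(set(vocab) | {"E"})          # like add({"E"})
```

Under this reading, editing the vocabulary can change the index of existing codes, so any data encoded before the edit would need to be re-encoded.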

process(value)

Process nested sequence into padded 2D tensor.

Empty or None visits are filled with padding tokens.

Parameters:

value (List[List[Any]]) – Nested list of codes [[code1, code2], [code3], …]

Return type:

Tensor

Returns:

2D tensor of shape (num_visits, max_inner_len) with code indices
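
A minimal sketch of the padding behavior described above, using plain Python lists in place of a tensor. The function pad_visits and its arguments are hypothetical. Padding on the right and truncating visits longer than max_inner_len are both assumptions of this sketch, since the description above does not specify either.

```python
from typing import Dict, List, Optional, Sequence

PAD, UNK = 0, 1


def pad_visits(
    value: Sequence[Optional[Sequence[str]]],
    vocab: Dict[str, int],
    max_inner_len: int,
) -> List[List[int]]:
    """Encode each visit and pad it to max_inner_len with PAD."""
    rows = []
    for visit in value:
        codes = list(visit) if visit else []  # None or [] becomes an all-PAD row
        ids = [vocab.get(code, UNK) for code in codes]
        ids = ids[:max_inner_len]  # assumption: overlong visits are truncated
        rows.append(ids + [PAD] * (max_inner_len - len(ids)))  # assumption: right-pad
    return rows


vocab = {"<pad>": 0, "<unk>": 1, "A": 2, "B": 3, "C": 4}
grid = pad_visits([["A", "B"], None, ["C", "Z"]], vocab, max_inner_len=3)
shape = (len(grid), len(grid[0]))  # (num_visits, max_inner_len)
```

Every row has the same length, so the nested list could be converted to a 2D tensor with torch.tensor(grid, dtype=torch.long).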

size()

Return max inner length (embedding dimension) for unified API.

Return type:

int

vocab_size()

Return vocabulary size.

Return type:

int

is_token()

Nested sequence codes are discrete token indices.

Return type:

bool

schema()

Return the schema of the processed feature. A processor that emits a single tensor should return ("value",). A processor that emits a tuple of tensors should return a tuple of the same length, giving the semantic name of each tensor, such as ("time", "value") or ("value", "mask").

Typical semantic names include:
  • “value”: the main processed tensor output of the processor

  • “time”: the time tensor output of the processor (mostly for StageNet)

  • “mask”: the mask tensor output of the processor (if applicable)

Return type:

tuple[str, ...]

Returns:

Tuple of semantic names corresponding to the output of the processor.

dim()

Output is a 2D tensor (visits, codes_per_visit).

Return type:

tuple[int, ...]

spatial()

Whether each dimension (axis) of the value tensor is spatial (i.e. corresponds to a spatial axis like time, height, width, etc.) or not. This is used to determine how to apply augmentations and other transformations that should only be applied to spatial dimensions.

E.g. for CNN or RNN features, this would help determine which dimensions to apply spatial augmentations to, and which dimensions to treat as channels or features.

Return type:

tuple[bool, ...]

Returns:

Tuple of booleans corresponding to whether each axis of the value tensor is spatial or not.

PAD = 0

UNK = 1

load(path)

Optional: Load processor state from disk.

Parameters:

path (str) – File path to load processor state from.

Return type:

None

save(path)

Optional: Save processor state to disk.

Parameters:

path (str) – File path to save processor state.

Return type:

None