pyhealth.processors.SequenceProcessor#

Processor for sequence data.

class pyhealth.processors.SequenceProcessor(code_mapping=None)[source]#

Bases: FeatureProcessor, TokenProcessorInterface

Feature processor for encoding categorical sequences.

Encodes medical codes (e.g., diagnoses, procedures) into numerical indices. Accepts a single token or a list of tokens, and can build the vocabulary on the fly if one is not provided.

Parameters:

code_mapping (Optional[Tuple[str, str]]) – optional tuple of (source_vocabulary, target_vocabulary) to map raw codes to a grouped vocabulary before tokenizing. Uses pyhealth.medcode.CrossMap internally. For example, ("ICD9CM", "CCSCM") maps ~128K ICD-9 diagnosis codes to ~280 CCS categories, and ("NDC", "ATC") maps ~940K drug codes to ~5K ATC categories. When None (default), codes are used as-is with no change to existing behavior.
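
The effect of code_mapping can be pictured as a lookup applied to each raw code before the vocabulary is built. The sketch below is a dependency-free stand-in, not the library's implementation: `TOY_CROSSMAP` and `map_codes` are hypothetical names, the table entries are made-up placeholders rather than real ICD-9 or CCS codes, and passing unmapped codes through unchanged is an assumption.

```python
from typing import Dict, List

# Hypothetical toy table standing in for a source -> target cross-map.
# Placeholder codes only; one source code may map to several targets.
TOY_CROSSMAP: Dict[str, List[str]] = {
    "raw_001": ["group_A"],
    "raw_002": ["group_A"],
    "raw_003": ["group_B", "group_C"],
}


def map_codes(codes: List[str], table: Dict[str, List[str]]) -> List[str]:
    """Replace each raw code with its grouped code(s).

    Assumption: codes missing from the table are kept as-is.
    """
    mapped: List[str] = []
    for code in codes:
        mapped.extend(table.get(code, [code]))
    return mapped


visit = ["raw_001", "raw_002", "raw_003", "raw_999"]
grouped = map_codes(visit, TOY_CROSSMAP)
print(grouped)
```

Grouping shrinks the set of distinct tokens the processor has to index, which is the point of mapping a large raw vocabulary onto a smaller target one.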

Examples

>>> proc = SequenceProcessor()  # no mapping, same as before
>>> proc = SequenceProcessor(code_mapping=("ICD9CM", "CCSCM"))

fit(samples, field)[source]#

Build vocabulary from samples, applying code mapping if set.

Parameters:
  • samples (Iterable[Dict[str, Any]]) – iterable of sample dicts.

  • field (str) – key whose values are token lists.

Return type:

None
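
As a rough illustration of vocabulary building, the sketch below scans the token lists under one field and assigns an index to each distinct token. It is not the library's code: `build_vocab`, the `"<pad>"` and `"<unk>"` key names, and the first-seen ordering are assumptions; only the reserved values 0 and 1 come from the PAD and UNK attributes documented on this page.

```python
from typing import Any, Dict, Iterable

PAD, UNK = 0, 1  # reserved indices, matching the documented class attributes


def build_vocab(samples: Iterable[Dict[str, Any]], field: str) -> Dict[str, int]:
    """Assign an index to every distinct token found under `field`."""
    # Assumption: reserved entries are stored under these key names.
    vocab: Dict[str, int] = {"<pad>": PAD, "<unk>": UNK}
    for sample in samples:
        for token in sample[field]:
            if token not in vocab:
                vocab[token] = len(vocab)  # assumption: first-seen order
    return vocab


samples = [
    {"patient_id": "p1", "conditions": ["c1", "c2"]},
    {"patient_id": "p2", "conditions": ["c2", "c3"]},
]
vocab = build_vocab(samples, "conditions")
print(vocab)
```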

process(value)[source]#

Process token value(s) into a tensor of indices.

Parameters:

value (Any) – Raw token string or list of token strings.

Return type:

Tensor

Returns:

Tensor of indices.
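
The lookup step can be sketched as follows. This is a simplified stand-in with hypothetical names (`encode`, the toy `vocab`): it returns a plain list to stay dependency-free, whereas the documented method returns a Tensor, and sending unseen tokens to UNK is an assumption based on the UNK attribute rather than confirmed behavior.

```python
from typing import Dict, List, Union

PAD, UNK = 0, 1

# Hypothetical fitted vocabulary.
vocab: Dict[str, int] = {"<pad>": PAD, "<unk>": UNK, "c1": 2, "c2": 3}


def encode(value: Union[str, List[str]], vocab: Dict[str, int]) -> List[int]:
    """Map a single token or a list of tokens to vocabulary indices."""
    tokens = [value] if isinstance(value, str) else list(value)
    # Assumption: tokens missing from the vocabulary fall back to UNK.
    return [vocab.get(token, UNK) for token in tokens]


many = encode(["c2", "c1", "never_seen"], vocab)
single = encode("c1", vocab)
print(many)    # [3, 2, 1]
print(single)  # [2]
```

With PyTorch available, wrapping the result as `torch.tensor(indices, dtype=torch.long)` would yield a 1D index tensor.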

remove(tokens)[source]#

Remove the specified tokens from the processor’s vocabulary.

retain(tokens)[source]#

Retain only the specified tokens in the processor’s vocabulary.

add(tokens)[source]#

Add the specified tokens to the processor’s vocabulary.
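
One way to picture the three vocabulary-editing methods is as set operations followed by re-indexing. The sketch below is only that picture under stated assumptions: the function names mirror the methods but are free functions written for illustration, and whether the library re-indexes, how it orders tokens, and how it treats the reserved entries are not confirmed by this page.

```python
from typing import Dict, Iterable, Set

SPECIAL = ["<pad>", "<unk>"]  # assumed reserved entries at indices 0 and 1


def rebuild(tokens: Iterable[str]) -> Dict[str, int]:
    """Re-index a token collection, keeping the reserved entries first."""
    # Assumption: remaining tokens are sorted to make the result deterministic.
    ordered = SPECIAL + sorted(set(tokens) - set(SPECIAL))
    return {token: index for index, token in enumerate(ordered)}


def remove(vocab: Dict[str, int], tokens: Set[str]) -> Dict[str, int]:
    return rebuild(set(vocab) - set(tokens))


def retain(vocab: Dict[str, int], tokens: Set[str]) -> Dict[str, int]:
    return rebuild(set(vocab) & set(tokens))


def add(vocab: Dict[str, int], tokens: Set[str]) -> Dict[str, int]:
    return rebuild(set(vocab) | set(tokens))


vocab = rebuild(["c1", "c2", "c3"])
print(remove(vocab, {"c2"}))
print(retain(vocab, {"c3"}))
print(add(vocab, {"c0"}))
```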

tokens()[source]#

Return the set of tokens in the processor’s vocabulary.

Return type:

set[str]

vocab_size()[source]#

Return the size of the processor’s vocabulary.

Return type:

int

size()[source]#

is_token()[source]#

Sequence codes are discrete token indices.

Return type:

bool

schema()[source]#

Returns the schema of the processed feature. For a processor that emits a single tensor, this should just return [“value”]. For a processor that emits a tuple of tensors, this should return one semantic name per tensor, in the same order as the tensors, such as [“time”, “value”] or [“value”, “mask”].

Typical semantic names include:
  • “value”: the main processed tensor output of the processor

  • “time”: the time tensor output of the processor (mostly for StageNet)

  • “mask”: the mask tensor output of the processor (if applicable)

Return type:

tuple[str, ...]

Returns:

Tuple of semantic names corresponding to the output of the processor.

dim()[source]#

Output is a 1D tensor of code indices.

Return type:

tuple[int, ...]

spatial()[source]#

Indicates whether each dimension (axis) of the value tensor is spatial (i.e. corresponds to a spatial axis like time, height, width, etc.). This is used to determine how to apply augmentations and other transformations that should only be applied to spatial dimensions.

E.g. for CNN or RNN features, this would help determine which dimensions to apply spatial augmentations to, and which dimensions to treat as channels or features.

Return type:

tuple[bool, ...]

Returns:

Tuple of booleans corresponding to whether each axis of the value tensor is spatial or not.
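
To show how the metadata methods fit together for a single-tensor sequence feature, the sketch below uses a hypothetical stand-in class. The concrete return values are assumptions inferred from the descriptions on this page (one output named “value”, a 1D tensor, whose only axis is the sequence axis); they are not read from the library's source.

```python
from typing import Tuple


class SequenceMetadataSketch:
    """Hypothetical stand-in mirroring the documented metadata methods."""

    def is_token(self) -> bool:
        return True  # outputs are discrete token indices

    def schema(self) -> Tuple[str, ...]:
        return ("value",)  # a single output tensor

    def dim(self) -> Tuple[int, ...]:
        return (1,)  # assumption: one entry per output, each a tensor rank

    def spatial(self) -> Tuple[bool, ...]:
        return (True,)  # assumption: the only axis runs along the sequence


meta = SequenceMetadataSketch()
# One rank per named output, and one spatial flag per axis of the value tensor.
print(len(meta.schema()) == len(meta.dim()))
print(len(meta.spatial()) == meta.dim()[0])
```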

PAD = 0#

UNK = 1#

load(path)#

Optional: Load processor state from disk.

Parameters:

path (str) – File path to load processor state from.

Return type:

None

save(path)#

Optional: Save processor state to disk.

Parameters:

path (str) – File path to save processor state.

Return type:

None
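
Since both methods are marked optional, what gets persisted is up to the processor. As one possible reading, the sketch below round-trips a token-to-index vocabulary through a JSON file. The helper names `save_vocab` and `load_vocab`, the JSON format, and the file name are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Dict


def save_vocab(vocab: Dict[str, int], path: str) -> None:
    """Write the token -> index mapping to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(vocab, handle)


def load_vocab(path: str) -> Dict[str, int]:
    """Read a token -> index mapping back from `path`."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


vocab = {"<pad>": 0, "<unk>": 1, "c1": 2, "c2": 3}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sequence_vocab.json")
    save_vocab(vocab, path)
    restored = load_vocab(path)

print(restored == vocab)  # True
```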