Tokenizer

The tokenizer supports token-to-index and index-to-token mapping in general ML settings.

class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)

Bases: object

Vocabulary class for mapping between tokens and indices.

add_token(token)

Adds a token to the vocabulary.
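
To illustrate the idea of a token/index mapping, here is a minimal sketch. `SketchVocabulary` is a hypothetical stand-in written for this example, not PyHealth's implementation. It assumes that special tokens are registered before regular tokens, that indices are assigned in insertion order, and that `<pad>` is the name of a special token.

```python
from typing import Dict, List, Optional


class SketchVocabulary:
    """Hypothetical stand-in for illustration; not PyHealth's implementation."""

    def __init__(self, tokens: List[str], special_tokens: Optional[List[str]] = None):
        self.token2idx: Dict[str, int] = {}
        self.idx2token: Dict[int, str] = {}
        # Assumption: special tokens are registered first, so they get the lowest indices.
        for token in (special_tokens or []) + tokens:
            self.add_token(token)

    def add_token(self, token: str) -> None:
        # Assign the next free index; re-adding a known token is a no-op.
        if token not in self.token2idx:
            idx = len(self.token2idx)
            self.token2idx[token] = idx
            self.idx2token[idx] = token


vocab = SketchVocabulary(["a", "b", "a"], special_tokens=["<pad>"])
vocab.add_token("c")
print(vocab.token2idx)
```

Keeping both a forward and a reverse dictionary makes lookups in either direction constant time, at the cost of storing each token twice.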

class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)

Bases: object

Tokenizer class for converting tokens to indices and vice versa.

This class builds a vocabulary from the provided tokens and uses it to convert tokens to indices and back. It can also encode and decode batches of data.

get_padding_index()

Returns the index of the padding token.

get_vocabulary_size()

Returns the size of the vocabulary.

convert_tokens_to_indices(tokens)

Converts a list of tokens to indices.

Return type

List[int]

convert_indices_to_tokens(indices)

Converts a list of indices to tokens.

Return type

List[str]
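
The four methods above amount to lookups in a pair of dictionaries. The following is a rough sketch of that idea using plain Python. The function names, the `<pad>` token name, and the example codes are assumptions made for illustration and are not taken from the library.

```python
from typing import Dict, List

# Hypothetical vocabulary; "<pad>" as the padding token name is an assumption.
tokens = ["<pad>", "A01", "B02", "C03"]
token2idx: Dict[str, int] = {tok: i for i, tok in enumerate(tokens)}
idx2token: Dict[int, str] = {i: tok for tok, i in token2idx.items()}


def tokens_to_indices(seq: List[str]) -> List[int]:
    # Forward lookup: one index per token, order preserved.
    return [token2idx[t] for t in seq]


def indices_to_tokens(seq: List[int]) -> List[str]:
    # Reverse lookup: one token per index, order preserved.
    return [idx2token[i] for i in seq]


padding_index = token2idx["<pad>"]   # analogous to get_padding_index()
vocabulary_size = len(token2idx)     # analogous to get_vocabulary_size()

indices = tokens_to_indices(["B02", "A01"])
restored = indices_to_tokens(indices)
print(indices, restored)
```

How a token missing from the vocabulary is handled is not specified above; in this sketch it raises a `KeyError`.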

batch_encode_2d(batch, padding=True, truncation=True, max_length=512)

Converts a list of lists of tokens (2D) to indices.

Parameters
  • batch (List[List[str]]) – List of lists of tokens to convert to indices.

  • padding (bool) – whether to pad each token list to the length of the longest token list in the batch (smart padding).

  • truncation (bool) – whether to truncate the tokens to max_length.

  • max_length (int) – maximum number of tokens per list. This argument is ignored if truncation is False.
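
The interaction of truncation and smart padding can be sketched as follows. `encode_2d` is a hypothetical helper, not the library method. The documentation above does not say which end of a sequence is truncated; this sketch assumes the last `max_length` tokens are kept, and it assumes the padding token is named `<pad>`.

```python
from typing import Dict, List

PAD = "<pad>"  # assumed name of the padding token


def encode_2d(
    batch: List[List[str]],
    token2idx: Dict[str, int],
    padding: bool = True,
    truncation: bool = True,
    max_length: int = 512,
) -> List[List[int]]:
    if truncation:
        # Assumption: keep the last max_length tokens of each sequence.
        batch = [seq[-max_length:] for seq in batch]
    if padding:
        # Smart padding: pad only up to the longest sequence in this batch.
        longest = max(len(seq) for seq in batch)
        batch = [seq + [PAD] * (longest - len(seq)) for seq in batch]
    return [[token2idx[t] for t in seq] for seq in batch]


token2idx = {PAD: 0, "a": 1, "b": 2, "c": 3}
encoded = encode_2d([["a", "b", "c"], ["a"]], token2idx, max_length=2)
print(encoded)  # [[2, 3], [1, 0]]
```

Truncating before padding means the padded width never exceeds `max_length` when truncation is enabled.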

batch_decode_2d(batch, padding=False)

Converts a list of lists of indices (2D) to tokens.

Parameters
  • batch (List[List[int]]) – List of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the decoded output.
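
As a sketch of the decoding direction, the hypothetical helper below maps indices back to tokens and drops padding tokens unless `padding` is True. The helper name and the `<pad>` token name are assumptions for illustration.

```python
from typing import Dict, List

PAD = "<pad>"  # assumed name of the padding token


def decode_2d(
    batch: List[List[int]],
    idx2token: Dict[int, str],
    padding: bool = False,
) -> List[List[str]]:
    decoded = [[idx2token[i] for i in seq] for seq in batch]
    if not padding:
        # Drop padding tokens so each sequence returns to its original length.
        decoded = [[t for t in seq if t != PAD] for seq in decoded]
    return decoded


idx2token = {0: PAD, 1: "a", 2: "b", 3: "c"}
without_pad = decode_2d([[2, 3], [1, 0]], idx2token)
with_pad = decode_2d([[2, 3], [1, 0]], idx2token, padding=True)
print(without_pad, with_pad)
```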

batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))

Converts a list of lists of lists of tokens (3D) to indices.

Parameters
  • batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.

  • padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).

  • truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate each dimension to the corresponding element of max_length.

  • max_length (Tuple[int, int]) – a tuple of two integers indicating the maximum length along the first and second dimensions. An element is ignored if the corresponding element of truncation is False.
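
The 3D case applies the same truncate-then-pad logic along two dimensions. The sketch below is a hypothetical helper, not the library method. It assumes that the first element of each tuple refers to the visit dimension and the second to the tokens within a visit, that truncation keeps the last elements, and that padded visits consist of a single `<pad>` token before token-level padding.

```python
from typing import Dict, List, Tuple

PAD = "<pad>"  # assumed name of the padding token


def encode_3d(
    batch: List[List[List[str]]],
    token2idx: Dict[str, int],
    padding: Tuple[bool, bool] = (True, True),
    truncation: Tuple[bool, bool] = (True, True),
    max_length: Tuple[int, int] = (10, 512),
) -> List[List[List[int]]]:
    if truncation[0]:
        # Assumption: keep the last max_length[0] visits of each sample.
        batch = [visits[-max_length[0]:] for visits in batch]
    if truncation[1]:
        # Assumption: keep the last max_length[1] tokens of each visit.
        batch = [[v[-max_length[1]:] for v in visits] for visits in batch]
    if padding[0]:
        # Pad every sample to the largest number of visits in the batch.
        most_visits = max(len(visits) for visits in batch)
        batch = [visits + [[PAD]] * (most_visits - len(visits)) for visits in batch]
    if padding[1]:
        # Pad every visit to the largest number of tokens in the batch.
        longest = max(len(v) for visits in batch for v in visits)
        batch = [[v + [PAD] * (longest - len(v)) for v in visits] for visits in batch]
    return [[[token2idx[t] for t in v] for v in visits] for visits in batch]


token2idx = {PAD: 0, "a": 1, "b": 2, "c": 3}
batch = [[["a", "b"], ["c"]], [["a"]]]
encoded = encode_3d(batch, token2idx)
print(encoded)  # [[[1, 2], [3, 0]], [[1, 0], [0, 0]]]
```

With both padding flags enabled, the result is rectangular in all three dimensions, which is what most tensor libraries need for batching.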

batch_decode_3d(batch, padding=False)

Converts a list of lists of lists of indices (3D) to tokens.

Parameters
  • batch (List[List[List[int]]]) – List of lists of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the decoded output.
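
A sketch of 3D decoding follows. `decode_3d` is a hypothetical helper, not the library method. It assumes the padding token is named `<pad>` and that, when `padding` is False, visits consisting only of padding are dropped along with the padding tokens themselves.

```python
from typing import Dict, List

PAD = "<pad>"  # assumed name of the padding token


def decode_3d(
    batch: List[List[List[int]]],
    idx2token: Dict[int, str],
    padding: bool = False,
) -> List[List[List[str]]]:
    decoded = [[[idx2token[i] for i in v] for v in visits] for visits in batch]
    if not padding:
        # Remove padding tokens inside each visit.
        decoded = [[[t for t in v if t != PAD] for v in visits] for visits in decoded]
        # Assumption: visits that were pure padding are removed entirely.
        decoded = [[v for v in visits if v] for visits in decoded]
    return decoded


idx2token = {0: PAD, 1: "a", 2: "b", 3: "c"}
encoded = [[[1, 2], [3, 0]], [[1, 0], [0, 0]]]
decoded = decode_3d(encoded, idx2token)
print(decoded)  # [[['a', 'b'], ['c']], [['a']]]
```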