Tokenizer#
The tokenizer functionality supports token-to-index and index-to-token mapping in general ML settings.
- class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#
Bases: object
Vocabulary class for mapping between tokens and indices.
- class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#
Bases: object
Tokenizer class for converting tokens to indices and vice versa.
This class builds a vocabulary from the provided tokens and converts tokens to indices and vice versa. It also provides functionality to tokenize a batch of data.
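The token-to-index mapping can be sketched as follows. This is a minimal illustrative reimplementation, not pyhealth's actual code; the class name `SimpleTokenizer` and the ordering of special tokens before regular tokens are assumptions for the example.

```python
# Minimal sketch of a token <-> index vocabulary, mirroring (but not
# reproducing) pyhealth.tokenizer.Tokenizer. Names are illustrative.

class SimpleTokenizer:
    def __init__(self, tokens, special_tokens=None):
        # Assumption: special tokens (e.g. "<pad>", "<unk>") come first,
        # so padding conventionally maps to index 0.
        all_tokens = list(special_tokens or []) + list(tokens)
        self.token2idx = {tok: i for i, tok in enumerate(all_tokens)}
        self.idx2token = {i: tok for tok, i in self.token2idx.items()}

    def convert_tokens_to_indices(self, tokens):
        return [self.token2idx[t] for t in tokens]

    def convert_indices_to_tokens(self, indices):
        return [self.idx2token[i] for i in indices]

tokenizer = SimpleTokenizer(["aspirin", "insulin"],
                            special_tokens=["<pad>", "<unk>"])
print(tokenizer.convert_tokens_to_indices(["insulin", "aspirin"]))  # [3, 2]
print(tokenizer.convert_indices_to_tokens([3, 2]))  # ['insulin', 'aspirin']
```

The two dictionaries are exact inverses of each other, so a round trip through indices and back recovers the original tokens.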
- batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#
Converts a list of lists of tokens (2D) to indices.
- Parameters
batch (List[List[str]]) – List of lists of tokens to convert to indices.
padding (bool) – whether to pad the tokens to the max number of tokens in the batch (smart padding).
truncation (bool) – whether to truncate the tokens to max_length.
max_length (int) – maximum length of the tokens. This argument is ignored if truncation is False.
- batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#
Converts a list of lists of lists of tokens (3D) to indices.
- Parameters
batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.
padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).
truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_length.
max_length (Tuple[int, int]) – a tuple of two integers indicating the maximum length of the tokens along the first and second dimension. This argument is ignored if truncation is False.
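The 3D case applies the same smart-padding idea along two dimensions, e.g. visits per patient and tokens per visit. The sketch below is illustrative only and makes the same assumptions as the 2D sketch (keep-last truncation, a `token2idx` dict with `"<pad>"` at index 0); the real pyhealth API may differ in details.

```python
# Sketch of batch_encode_3d: each sample is a list of visits, each visit a
# list of tokens. Illustrative only, not pyhealth's implementation.

def batch_encode_3d(batch, token2idx, padding=(True, True),
                    truncation=(True, True), max_length=(10, 512)):
    # First dimension: number of visits per sample.
    if truncation[0]:
        batch = [visits[-max_length[0]:] for visits in batch]
    # Second dimension: number of tokens per visit.
    if truncation[1]:
        batch = [[tokens[-max_length[1]:] for tokens in visits]
                 for visits in batch]
    if padding[0]:
        # Pad every sample to the max number of visits in the batch.
        max_visits = max(len(visits) for visits in batch)
        batch = [visits + [["<pad>"]] * (max_visits - len(visits))
                 for visits in batch]
    if padding[1]:
        # Pad every visit to the max number of tokens in the batch.
        max_tokens = max(len(tokens) for visits in batch for tokens in visits)
        batch = [[tokens + ["<pad>"] * (max_tokens - len(tokens))
                  for tokens in visits]
                 for visits in batch]
    return [[[token2idx[t] for t in tokens] for tokens in visits]
            for visits in batch]

vocab = {"<pad>": 0, "A": 1, "B": 2, "C": 3}
out = batch_encode_3d([[["A", "B"]], [["C"], ["A"]]], vocab)
print(out)  # [[[1, 2], [0, 0]], [[3, 0], [1, 0]]]
```

The result is a rectangular 3D array: both samples end up with two visits, and every visit is padded to two tokens.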