Tokenizer
The tokenizer supports token-to-index and index-to-token mapping in general machine learning settings.
- class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)
Bases: object
Vocabulary class for mapping between tokens and indices.
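To make the token/index mapping concrete, here is a minimal, self-contained sketch of such a vocabulary. It is not PyHealth's implementation: the class name `SimpleVocabulary`, the methods `to_index` and `to_token`, and the choice to register special tokens first (so they receive the lowest indices) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class SimpleVocabulary:
    """Hypothetical stand-in for a token/index vocabulary (not PyHealth's code)."""

    def __init__(self, tokens: List[str], special_tokens: Optional[List[str]] = None):
        self.token2idx: Dict[str, int] = {}
        self.idx2token: Dict[int, str] = {}
        # Assumption: special tokens come first, so they get the lowest indices.
        for token in (special_tokens or []) + tokens:
            if token not in self.token2idx:  # skip duplicates
                idx = len(self.token2idx)
                self.token2idx[token] = idx
                self.idx2token[idx] = token

    def to_index(self, token: str) -> int:
        """Map a token to its integer index."""
        return self.token2idx[token]

    def to_token(self, idx: int) -> str:
        """Map an integer index back to its token."""
        return self.idx2token[idx]


vocab = SimpleVocabulary(["A01", "A02", "B01"], special_tokens=["<pad>", "<unk>"])
index = vocab.to_index("A02")
token = vocab.to_token(index)
```

Keeping both dictionaries in sync makes lookups in either direction constant time, at the cost of storing each token twice.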
- class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)
Bases: object
Tokenizer class for converting tokens to indices and vice versa.
This class builds a vocabulary from the provided tokens and uses it to convert tokens to indices and vice versa. It can also tokenize a batch of data.
- batch_encode_2d(batch, padding=True, truncation=True, max_length=512)
Converts a list of lists of tokens (2D) to indices.
- Parameters
  - batch (List[List[str]]) – List of lists of tokens to convert to indices.
  - padding (bool) – Whether to pad the tokens to the max number of tokens in the batch (smart padding).
  - truncation (bool) – Whether to truncate the tokens to max_length.
  - max_length (int) – Maximum length of the tokens. This argument is ignored if truncation is False.
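The interaction of truncation and smart padding can be sketched with a standalone function. This is an illustration under stated assumptions, not the library's code: `encode_2d`, the plain `token2idx` dictionary, and the `"<pad>"` token are hypothetical, and the sketch keeps the last `max_length` tokens of each sequence, which the description above does not specify.

```python
from typing import Dict, List


def encode_2d(
    batch: List[List[str]],
    token2idx: Dict[str, int],
    padding: bool = True,
    truncation: bool = True,
    max_length: int = 512,
    pad_token: str = "<pad>",
) -> List[List[int]]:
    """Hypothetical 2D encoder: truncate, then pad, then map tokens to indices."""
    if truncation:
        # Assumption: keep the most recent (trailing) max_length tokens.
        batch = [tokens[-max_length:] for tokens in batch]
    if padding:
        # Smart padding: pad only up to the longest sequence in this batch.
        longest = max((len(tokens) for tokens in batch), default=0)
        batch = [tokens + [pad_token] * (longest - len(tokens)) for tokens in batch]
    return [[token2idx[token] for token in tokens] for tokens in batch]


token2idx = {"<pad>": 0, "A": 1, "B": 2, "C": 3}
encoded = encode_2d([["A", "B", "C"], ["B"]], token2idx, max_length=2)
unpadded = encode_2d([["A", "B", "C"], ["B"]], token2idx, padding=False, truncation=False)
```

Because padding targets the longest sequence in the batch rather than `max_length`, short batches stay small. With `truncation=False`, `max_length` has no effect, matching the parameter description above.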
- batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))
Converts a list of lists of lists of tokens (3D) to indices.
- Parameters
  - batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.
  - padding (Tuple[bool, bool]) – A tuple of two booleans indicating whether to pad the tokens to the max number of tokens and visits (smart padding).
  - truncation (Tuple[bool, bool]) – A tuple of two booleans indicating whether to truncate the tokens to the corresponding element in max_length.
  - max_length (Tuple[int, int]) – A tuple of two integers indicating the maximum length of the tokens along the first and second dimension. This argument is ignored if truncation is False.
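The 3D case applies the same idea along two dimensions. The sketch below is illustrative only and makes several assumptions: `encode_3d`, the `token2idx` dictionary, and the `"<pad>"` token are hypothetical; the first element of each tuple is taken to govern the visit dimension and the second the token dimension; and truncation keeps trailing items.

```python
from typing import Dict, List, Tuple


def encode_3d(
    batch: List[List[List[str]]],
    token2idx: Dict[str, int],
    padding: Tuple[bool, bool] = (True, True),
    truncation: Tuple[bool, bool] = (True, True),
    max_length: Tuple[int, int] = (10, 512),
    pad_token: str = "<pad>",
) -> List[List[List[int]]]:
    """Hypothetical 3D encoder over (sample, visit, token) nested lists."""
    if truncation[0]:
        # Assumption: keep the last max_length[0] visits of each sample.
        batch = [visits[-max_length[0]:] for visits in batch]
    if truncation[1]:
        # Assumption: keep the last max_length[1] tokens of each visit.
        batch = [[tokens[-max_length[1]:] for tokens in visits] for visits in batch]
    if padding[0]:
        # Pad every sample to the largest visit count in the batch.
        most_visits = max((len(visits) for visits in batch), default=0)
        batch = [
            visits + [[pad_token] for _ in range(most_visits - len(visits))]
            for visits in batch
        ]
    if padding[1]:
        # Pad every visit to the largest token count in the batch.
        most_tokens = max(
            (len(tokens) for visits in batch for tokens in visits), default=0
        )
        batch = [
            [tokens + [pad_token] * (most_tokens - len(tokens)) for tokens in visits]
            for visits in batch
        ]
    return [[[token2idx[t] for t in tokens] for tokens in visits] for visits in batch]


token2idx = {"<pad>": 0, "A": 1, "B": 2, "C": 3}
encoded = encode_3d([[["A", "B"], ["C"]], [["B"]]], token2idx)
```

With both padding flags enabled the result is rectangular along every dimension, so it can be converted directly to a tensor.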