Tokenizer#

The tokenizer module supports token-to-index and index-to-token mapping in general ML settings.
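
For instance, a minimal round trip (a sketch using the Tokenizer class documented below; the two-token space is made up for illustration):

>>> from pyhealth.tokenizer import Tokenizer
>>> tokenizer = Tokenizer(tokens=['A01A', 'A02A'], special_tokens=['<pad>', '<unk>'])
>>> tokenizer.convert_tokens_to_indices(['A01A', 'A02A'])
[2, 3]
>>> tokenizer.convert_indices_to_tokens([2, 3])
['A01A', 'A02A']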

class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)[source]#

Bases: object

Vocabulary class for mapping between tokens and indices.

add_token(token)[source]#

Adds a token to the vocabulary.
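
A minimal sketch of using Vocabulary directly, assuming (per the Tokenizer examples below) that special tokens are indexed first; the token2idx attribute is an implementation detail of the pyhealth source, not a documented API:

>>> from pyhealth.tokenizer import Vocabulary
>>> vocab = Vocabulary(tokens=['A01A', 'A02A'], special_tokens=['<pad>', '<unk>'])
>>> vocab.add_token('B01A')      # appended at the next free index
>>> vocab.token2idx['B01A']      # assumed attribute name
4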

class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)[source]#

Bases: object

Tokenizer class for converting tokens to indices and vice versa.

This class builds a vocabulary from the provided tokens and converts between tokens and indices in either direction. It also provides functionality to tokenize batches of data.

Examples

>>> from pyhealth.tokenizer import Tokenizer
>>> token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
...                'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
...                'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
...                'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
...                'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
>>> tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])

get_padding_index()[source]#

Returns the index of the padding token.
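
Examples

Since "<pad>" is the first special token and special tokens occupy the lowest indices (see convert_indices_to_tokens below), the padding index is 0:

>>> tokenizer.get_padding_index()
0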

get_vocabulary_size()[source]#

Returns the size of the vocabulary.

Examples

>>> tokenizer.get_vocabulary_size()
44

convert_tokens_to_indices(tokens)[source]#

Converts a list of tokens to indices.

Examples

>>> tokens = ['A03C', 'A03D', 'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'B035', 'C129']
>>> indices = tokenizer.convert_tokens_to_indices(tokens)
>>> print(indices)
[8, 9, 10, 11, 12, 13, 14, 1, 1]
Return type:

List[int]

convert_indices_to_tokens(indices)[source]#

Converts a list of indices to tokens.

Examples

>>> indices = [0, 1, 2, 3, 4, 5]
>>> tokens = tokenizer.convert_indices_to_tokens(indices)
>>> print(tokens)
['<pad>', '<unk>', 'A01A', 'A02A', 'A02B', 'A02X']
Return type:

List[str]

batch_encode_2d(batch, padding=True, truncation=True, max_length=512)[source]#

Converts a list of lists of tokens (2D) to indices.

Parameters:
  • batch (List[List[str]]) – List of lists of tokens to convert to indices.

  • padding (bool) – whether to pad the tokens to the max number of tokens in the batch (smart padding).

  • truncation (bool) – whether to truncate the tokens to max_length. As case 3 below shows, truncation keeps the last max_length tokens (the leading tokens are dropped).

  • max_length (int) – maximum length of the tokens. This argument is ignored if truncation is False.

Examples

>>> tokens = [
...     ['A03C', 'A03D', 'A03E', 'A03F'],
...     ['A04A', 'B035', 'C129']
... ]
>>> indices = tokenizer.batch_encode_2d(tokens)
>>> print('case 1:', indices)
case 1: [[8, 9, 10, 11], [12, 1, 1, 0]]
>>> indices = tokenizer.batch_encode_2d(tokens, padding=False)
>>> print('case 2:', indices)
case 2: [[8, 9, 10, 11], [12, 1, 1]]
>>> indices = tokenizer.batch_encode_2d(tokens, max_length=3)
>>> print('case 3:', indices)
case 3: [[9, 10, 11], [12, 1, 1]]
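
Since max_length is ignored when truncation is False (see the parameter description above), disabling truncation keeps the full padded sequences; a sketch of the expected output:

>>> indices = tokenizer.batch_encode_2d(tokens, truncation=False, max_length=3)
>>> print('case 4:', indices)
case 4: [[8, 9, 10, 11], [12, 1, 1, 0]]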

batch_decode_2d(batch, padding=False)[source]#

Converts a list of lists of indices (2D) to tokens.

Parameters:
  • batch (List[List[int]]) – List of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the decoded output.

Examples

>>> indices = [
...     [8, 9, 10, 11],
...     [12, 1, 1, 0]
... ]
>>> tokens = tokenizer.batch_decode_2d(indices)
>>> print('case 1:', tokens)
case 1: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
>>> tokens = tokenizer.batch_decode_2d(indices, padding=True)
>>> print('case 2:', tokens)
case 2: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]

batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))[source]#

Converts a list of lists of lists of tokens (3D) to indices.

Parameters:
  • batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.

  • padding (Tuple[bool, bool]) – a tuple of two booleans indicating whether to pad the visits (first element) and the tokens within each visit (second element) to the batch maximum (smart padding).

  • truncation (Tuple[bool, bool]) – a tuple of two booleans indicating whether to truncate the visits and the tokens to the corresponding element of max_length.

  • max_length (Tuple[int, int]) – a tuple of two integers giving the maximum number of visits and of tokens per visit, respectively. An element is ignored if the corresponding truncation flag is False. As case 5 below suggests, truncation keeps the trailing entries.

Examples

>>> tokens = [
...     [
...         ['A03C', 'A03D', 'A03E', 'A03F'],
...         ['A08A', 'A09A'],
...     ],
...     [
...         ['A04A', 'B035', 'C129'],
...     ]
... ]
>>> indices = tokenizer.batch_encode_3d(tokens)
>>> print('case 1:', indices)
case 1: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, True))
>>> print('case 2:', indices)
case 2: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(True, False))
>>> print('case 3:', indices)
case 3: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1], [0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, False))
>>> print('case 4:', indices)
case 4: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
>>> indices = tokenizer.batch_encode_3d(tokens, max_length=(2, 2))
>>> print('case 5:', indices)
case 5: [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]

batch_decode_3d(batch, padding=False)[source]#

Converts a list of lists of lists of indices (3D) to tokens.

Parameters:
  • batch (List[List[List[int]]]) – List of lists of lists of indices to convert to tokens.

  • padding (bool) – whether to keep the padding tokens in the decoded output.

Examples

>>> indices = [
...     [
...         [8, 9, 10, 11],
...         [24, 25, 0, 0]
...     ],
...     [
...         [12, 1, 1, 0],
...         [0, 0, 0, 0]
...     ]
... ]
>>> tokens = tokenizer.batch_decode_3d(indices)
>>> print('case 1:', tokens)
case 1: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
>>> tokens = tokenizer.batch_decode_3d(indices, padding=True)
>>> print('case 2:', tokens)
case 2: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']], [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']]]