Tokenizer
The tokenizer supports token-to-index and index-to-token mapping in general ML settings.
- class pyhealth.tokenizer.Vocabulary(tokens, special_tokens=None)
Bases: object
Vocabulary class for mapping between tokens and indices.
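The idea of a token-to-index mapping can be illustrated with a minimal sketch. The class below, `SketchVocabulary`, and its methods `index_of` and `token_of` are hypothetical names used only for illustration; this is not the pyhealth implementation. It assumes that special tokens receive the lowest indices and that unknown tokens fall back to `<unk>`, which matches the outputs shown in the Tokenizer examples on this page.

```python
from typing import Dict, List, Optional


class SketchVocabulary:
    """Hypothetical stand-in for a vocabulary; not the pyhealth class."""

    def __init__(self, tokens: List[str], special_tokens: Optional[List[str]] = None):
        self.token2idx: Dict[str, int] = {}
        self.idx2token: Dict[int, str] = {}
        # Assumption: special tokens come first, so they get the lowest indices.
        for token in list(special_tokens or []) + list(tokens):
            if token not in self.token2idx:
                idx = len(self.token2idx)
                self.token2idx[token] = idx
                self.idx2token[idx] = token

    def __len__(self) -> int:
        return len(self.token2idx)

    def index_of(self, token: str) -> int:
        # Assumption: unseen tokens map to "<unk>" when it is registered.
        if token in self.token2idx:
            return self.token2idx[token]
        if "<unk>" in self.token2idx:
            return self.token2idx["<unk>"]
        raise KeyError(token)

    def token_of(self, idx: int) -> str:
        return self.idx2token[idx]


vocab = SketchVocabulary(["A01A", "A02A", "A02B"], special_tokens=["<pad>", "<unk>"])
print(len(vocab))              # 5
print(vocab.index_of("A02A"))  # 3
print(vocab.index_of("B035"))  # 1, the index of "<unk>"
print(vocab.token_of(0))       # <pad>
```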
- class pyhealth.tokenizer.Tokenizer(tokens, special_tokens=None)
Bases: object
Tokenizer class for converting tokens to indices and vice versa.
This class builds a vocabulary from the provided tokens and converts between tokens and indices. It also provides methods to tokenize batches of data.
Examples
>>> from pyhealth.tokenizer import Tokenizer
>>> token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
...     'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
...     'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
...     'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
...     'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
>>> tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
- get_vocabulary_size()
Returns the size of the vocabulary.
Examples
>>> tokenizer.get_vocabulary_size()
44
- convert_tokens_to_indices(tokens)
Converts a list of tokens to indices.
Examples
>>> tokens = ['A03C', 'A03D', 'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'B035', 'C129']
>>> indices = tokenizer.convert_tokens_to_indices(tokens)
>>> print(indices)
[8, 9, 10, 11, 12, 13, 14, 1, 1]
- convert_indices_to_tokens(indices)
Converts a list of indices to tokens.
Examples
>>> indices = [0, 1, 2, 3, 4, 5]
>>> tokens = tokenizer.convert_indices_to_tokens(indices)
>>> print(tokens)
['<pad>', '<unk>', 'A01A', 'A02A', 'A02B', 'A02X']
- batch_encode_2d(batch, padding=True, truncation=True, max_length=512)
Converts a list of lists of tokens (2D) to indices.
- Parameters:
  - batch (List[List[str]]) – List of lists of tokens to convert to indices.
  - padding (bool) – Whether to pad the tokens to the max number of tokens in the batch (smart padding).
  - truncation (bool) – Whether to truncate the tokens to max_length.
  - max_length (int) – Maximum length of the tokens. This argument is ignored if truncation is False.
Examples
>>> tokens = [
...     ['A03C', 'A03D', 'A03E', 'A03F'],
...     ['A04A', 'B035', 'C129']
... ]
>>> indices = tokenizer.batch_encode_2d(tokens)
>>> print('case 1:', indices)
case 1: [[8, 9, 10, 11], [12, 1, 1, 0]]
>>> indices = tokenizer.batch_encode_2d(tokens, padding=False)
>>> print('case 2:', indices)
case 2: [[8, 9, 10, 11], [12, 1, 1]]
>>> indices = tokenizer.batch_encode_2d(tokens, max_length=3)
>>> print('case 3:', indices)
case 3: [[9, 10, 11], [12, 1, 1]]
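The padding and truncation behaviour seen in the three cases can be sketched in plain Python. This is an illustration inferred from the example outputs, not the library's source: `encode_2d_sketch` and `token2idx` are hypothetical names, the vocabulary is cut down to the first 11 codes of the token space (indices are unchanged because the order is the same), and truncation is assumed to keep the last `max_length` tokens, as case 3 suggests.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"
# Special tokens first, then the first 11 codes of the example token space.
codes = "A01A A02A A02B A02X A03A A03B A03C A03D A03E A03F A04A".split()
token2idx: Dict[str, int] = {tok: i for i, tok in enumerate([PAD, UNK] + codes)}


def encode_2d_sketch(
    batch: List[List[str]],
    padding: bool = True,
    truncation: bool = True,
    max_length: int = 512,
) -> List[List[int]]:
    if truncation:
        # Assumption from case 3: keep the LAST max_length tokens (max_length >= 1).
        batch = [tokens[-max_length:] for tokens in batch]
    if padding:
        # Pad only up to the longest sequence in this batch.
        width = max((len(tokens) for tokens in batch), default=0)
        batch = [tokens + [PAD] * (width - len(tokens)) for tokens in batch]
    unk = token2idx[UNK]
    # Tokens outside the vocabulary map to the "<unk>" index.
    return [[token2idx.get(t, unk) for t in tokens] for tokens in batch]


tokens = [["A03C", "A03D", "A03E", "A03F"], ["A04A", "B035", "C129"]]
print(encode_2d_sketch(tokens))                 # [[8, 9, 10, 11], [12, 1, 1, 0]]
print(encode_2d_sketch(tokens, padding=False))  # [[8, 9, 10, 11], [12, 1, 1]]
print(encode_2d_sketch(tokens, max_length=3))   # [[9, 10, 11], [12, 1, 1]]
```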
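- batch_decode_2d(batch, padding=False)
Converts a list of lists of indices (2D) to tokens.
- Parameters:
  - batch (List[List[int]]) – List of lists of indices to convert to tokens.
  - padding (bool) – Whether to keep the padding tokens in the output.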
Examples
>>> indices = [
...     [8, 9, 10, 11],
...     [12, 1, 1, 0]
... ]
>>> tokens = tokenizer.batch_decode_2d(indices)
>>> print('case 1:', tokens)
case 1: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
>>> tokens = tokenizer.batch_decode_2d(indices, padding=True)
>>> print('case 2:', tokens)
case 2: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]
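As a rough illustration of the two cases above, decoding can be sketched as an index lookup followed by optional removal of `<pad>` tokens. The names `decode_2d_sketch` and `idx2token` are hypothetical, and the vocabulary is limited to the first 11 codes of the example token space; the behaviour is inferred from the printed outputs rather than taken from the library source.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"
# Special tokens first, then the first 11 codes of the example token space.
codes = "A01A A02A A02B A02X A03A A03B A03C A03D A03E A03F A04A".split()
idx2token: Dict[int, str] = dict(enumerate([PAD, UNK] + codes))


def decode_2d_sketch(batch: List[List[int]], padding: bool = False) -> List[List[str]]:
    decoded = [[idx2token[i] for i in row] for row in batch]
    if not padding:
        # Assumption from case 1: padding tokens are dropped unless padding=True.
        decoded = [[tok for tok in row if tok != PAD] for row in decoded]
    return decoded


indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
print(decode_2d_sketch(indices))
# [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
print(decode_2d_sketch(indices, padding=True))
# [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]
```

Note that unknown tokens cannot be recovered: both 'B035' and 'C129' were encoded as index 1, so they decode to '<unk>'.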
- batch_encode_3d(batch, padding=(True, True), truncation=(True, True), max_length=(10, 512))
Converts a list of lists of lists of tokens (3D) to indices.
- Parameters:
  - batch (List[List[List[str]]]) – List of lists of lists of tokens to convert to indices.
  - padding (Tuple[bool, bool]) – A tuple of two booleans indicating whether to pad to the max number of visits (first element) and to the max number of tokens per visit (second element) in the batch (smart padding).
  - truncation (Tuple[bool, bool]) – A tuple of two booleans indicating whether to truncate each dimension to the corresponding element in max_length.
  - max_length (Tuple[int, int]) – A tuple of two integers indicating the maximum length along the first (visits) and second (tokens) dimension. This argument is ignored if truncation is False.
Examples
>>> tokens = [
...     [
...         ['A03C', 'A03D', 'A03E', 'A03F'],
...         ['A08A', 'A09A'],
...     ],
...     [
...         ['A04A', 'B035', 'C129'],
...     ]
... ]
>>> indices = tokenizer.batch_encode_3d(tokens)
>>> print('case 1:', indices)
case 1: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, True))
>>> print('case 2:', indices)
case 2: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(True, False))
>>> print('case 3:', indices)
case 3: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1], [0]]]
>>> indices = tokenizer.batch_encode_3d(tokens, padding=(False, False))
>>> print('case 4:', indices)
case 4: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
>>> indices = tokenizer.batch_encode_3d(tokens, max_length=(2,2))
>>> print('case 5:', indices)
case 5: [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]
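The five cases can be reproduced with a short plain-Python sketch, which may help when reasoning about how the two tuple elements interact. Everything here is inferred from the example outputs and is not the library's source: `encode_3d_sketch` and `token2idx` are hypothetical names, the vocabulary covers only the first 24 codes of the token space, the first tuple element is assumed to control the visit dimension and the second the token dimension, a missing visit is assumed to be padded as a single `<pad>` token (case 3), and truncation is assumed to keep the last elements (case 5).

```python
from typing import Dict, List, Tuple

PAD, UNK = "<pad>", "<unk>"
# Special tokens first, then the first 24 codes of the example token space.
codes = (
    "A01A A02A A02B A02X A03A A03B A03C A03D A03E A03F A04A A05A "
    "A05B A05C A06A A07A A07B A07C A07D A07E A07F A07X A08A A09A"
).split()
token2idx: Dict[str, int] = {tok: i for i, tok in enumerate([PAD, UNK] + codes)}


def encode_3d_sketch(
    batch: List[List[List[str]]],
    padding: Tuple[bool, bool] = (True, True),
    truncation: Tuple[bool, bool] = (True, True),
    max_length: Tuple[int, int] = (10, 512),
) -> List[List[List[int]]]:
    if truncation[0]:
        # Keep the last max_length[0] visits of each sample (assumes value >= 1).
        batch = [visits[-max_length[0]:] for visits in batch]
    if truncation[1]:
        # Keep the last max_length[1] tokens of each visit (assumes value >= 1).
        batch = [[visit[-max_length[1]:] for visit in visits] for visits in batch]
    if padding[0]:
        # Pad the visit dimension; a padded visit holds a single "<pad>" token.
        n_visits = max((len(visits) for visits in batch), default=0)
        batch = [visits + [[PAD]] * (n_visits - len(visits)) for visits in batch]
    if padding[1]:
        # Pad the token dimension to the longest visit in the batch.
        n_tokens = max((len(v) for visits in batch for v in visits), default=0)
        batch = [[v + [PAD] * (n_tokens - len(v)) for v in visits] for visits in batch]
    unk = token2idx[UNK]
    return [[[token2idx.get(t, unk) for t in v] for v in visits] for visits in batch]


tokens = [
    [["A03C", "A03D", "A03E", "A03F"], ["A08A", "A09A"]],
    [["A04A", "B035", "C129"]],
]
print(encode_3d_sketch(tokens))
# [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
print(encode_3d_sketch(tokens, max_length=(2, 2)))
# [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]
```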
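- batch_decode_3d(batch, padding=False)
Converts a list of lists of lists of indices (3D) to tokens.
- Parameters:
  - batch (List[List[List[int]]]) – List of lists of lists of indices to convert to tokens.
  - padding (bool) – Whether to keep the padding tokens in the output.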
Examples
>>> indices = [
...     [
...         [8, 9, 10, 11],
...         [24, 25, 0, 0]
...     ],
...     [
...         [12, 1, 1, 0],
...         [0, 0, 0, 0]
...     ]
... ]
>>> tokens = tokenizer.batch_decode_3d(indices)
>>> print('case 1:', tokens)
case 1: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
>>> tokens = tokenizer.batch_decode_3d(indices, padding=True)
>>> print('case 2:', tokens)
case 2: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']], [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']]]
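Case 1 shows that a visit consisting only of padding disappears entirely when `padding=False`. One way to picture this, as a sketch inferred from the outputs rather than the library's source, is to drop `<pad>` tokens first and then drop any visit left empty. The names `decode_3d_sketch` and `idx2token` are hypothetical, and the vocabulary covers only the first 24 codes of the example token space.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"
# Special tokens first, then the first 24 codes of the example token space.
codes = (
    "A01A A02A A02B A02X A03A A03B A03C A03D A03E A03F A04A A05A "
    "A05B A05C A06A A07A A07B A07C A07D A07E A07F A07X A08A A09A"
).split()
idx2token: Dict[int, str] = dict(enumerate([PAD, UNK] + codes))


def decode_3d_sketch(
    batch: List[List[List[int]]], padding: bool = False
) -> List[List[List[str]]]:
    decoded = [[[idx2token[i] for i in visit] for visit in visits] for visits in batch]
    if not padding:
        # Step 1: remove "<pad>" tokens inside every visit.
        decoded = [[[t for t in visit if t != PAD] for visit in visits] for visits in decoded]
        # Step 2 (assumption from case 1): remove visits that became empty.
        decoded = [[visit for visit in visits if visit] for visits in decoded]
    return decoded


indices = [
    [[8, 9, 10, 11], [24, 25, 0, 0]],
    [[12, 1, 1, 0], [0, 0, 0, 0]],
]
print(decode_3d_sketch(indices))
# [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
```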