pyhealth.processors.MultiHotProcessor

Processor for multi-hot encoding.

class pyhealth.processors.MultiHotProcessor

Bases: MultiLabelProcessor

Processor for converting categorical variables into multi-hot encoded vectors.

This is an alias of MultiLabelProcessor intended for input feature schemas. The implementation is identical to MultiLabelProcessor; the separate name exists only for semantic clarity. Use MultiLabelProcessor for output labels in classification tasks, and MultiHotProcessor for categorical input features such as patient attributes (e.g., ethnicity, race, comorbidities).

The processor builds a vocabulary during the fit() phase and converts each sample’s categorical values into a fixed-size binary vector.

Input:
  • List of categorical tokens (e.g., ["asian", "non_hispanic"])

  • Each token is a hashable value (typically strings or integers)

Processing:
  1. During fit(): builds vocabulary by collecting all unique categorical values across the dataset and mapping each to a unique index

  2. During process(): converts a list of tokens into a 1D binary tensor where positions corresponding to present categories are set to 1.0, others to 0.0

Output:
  • torch.Tensor of shape (num_categories,) with dtype float32

  • Binary encoding: 1.0 at indices where categories are present, 0.0 elsewhere

  • The size() method returns num_categories (vocabulary size)

Example

Given samples with ethnicity field:
  • Sample 1: ["asian", "non_hispanic"]

  • Sample 2: ["white", "hispanic"]

  • Sample 3: ["black"]

After fit():

vocabulary = {"asian": 0, "black": 1, "hispanic": 2, "non_hispanic": 3, "white": 4}

size() returns 5

Processing Sample 1 produces:

tensor([1.0, 0.0, 0.0, 1.0, 0.0]) # asian and non_hispanic are present
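The fit() and process() steps described above can be approximated in plain Python. This is a minimal sketch under stated assumptions, not PyHealth's implementation: the helper names build_vocabulary and multi_hot are hypothetical, and the result is a list of floats rather than a torch.Tensor so the example stays dependency-free.

```python
from typing import Any, Dict, Hashable, Iterable, List


def build_vocabulary(
    samples: Iterable[Dict[str, Any]], field: str
) -> Dict[Hashable, int]:
    """Collect every unique token in `field` and index them in sorted order.

    Hypothetical helper that mirrors the documented fit() behaviour.
    """
    unique_tokens = set()
    for sample in samples:
        unique_tokens.update(sample[field])
    # Sorting makes the index assignment independent of sample order.
    return {token: index for index, token in enumerate(sorted(unique_tokens))}


def multi_hot(
    tokens: Iterable[Hashable], vocabulary: Dict[Hashable, int]
) -> List[float]:
    """Encode `tokens` as a fixed-size binary vector.

    Hypothetical helper that mirrors the documented process() behaviour.
    """
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        # Assumption: a token unseen during fitting raises KeyError here.
        vector[vocabulary[token]] = 1.0
    return vector


samples = [
    {"ethnicity": ["asian", "non_hispanic"]},
    {"ethnicity": ["white", "hispanic"]},
    {"ethnicity": ["black"]},
]
vocabulary = build_vocabulary(samples, "ethnicity")
encoded = multi_hot(samples[0]["ethnicity"], vocabulary)
print(vocabulary)
print(encoded)
```

How the real processor treats tokens that were not seen during fit() is not stated on this page; the sketch simply lets the lookup fail.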

Note

The processor sorts categories alphabetically during vocabulary construction to ensure consistent index assignments across runs.
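To see why sorting matters, the sketch below compares two hypothetical indexing strategies on the same set of categories presented in different orders. Both helper functions are illustrative only and are not part of PyHealth.

```python
from typing import Dict, List


def index_by_sorted_order(values: List[str]) -> Dict[str, int]:
    """Assign indices after sorting, as the note above describes."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def index_by_first_seen(values: List[str]) -> Dict[str, int]:
    """Assign indices in encounter order (shown only for contrast)."""
    mapping: Dict[str, int] = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


tokens = ["white", "asian", "hispanic", "black", "non_hispanic"]
reversed_tokens = tokens[::-1]

# Sorted indexing ignores the order in which categories were encountered.
sorted_is_stable = index_by_sorted_order(tokens) == index_by_sorted_order(reversed_tokens)
# Encounter-order indexing changes when the data order changes.
first_seen_is_stable = index_by_first_seen(tokens) == index_by_first_seen(reversed_tokens)
print(sorted_is_stable, first_seen_is_stable)
```

With sorted indexing, a vector position keeps the same meaning regardless of how the dataset happened to be ordered when the processor was fitted.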

See also

MultiLabelProcessor: Parent class with identical implementation. Use it for output labels in multi-label classification tasks.

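process(value)

Process an individual field value.

Parameters:

value (Any) – Raw field value. For this processor, a list of categorical tokens.

Returns:

Processed value. For this processor, a float32 torch.Tensor of shape (num_categories,).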

dim()

Output shape is (num_classes,), so 1 dimension.

Return type:

tuple[int, ...]

fit(samples, field)

Fit the processor to the samples.

Parameters:

samples (Iterable[Dict[str, Any]]) – Iterable of sample dictionaries.

field (str) – Name of the field in each sample dictionary to fit on.

Return type:

None

is_token()

Multi-label indicators are continuous float targets for BCE loss.

Return type:

bool
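As an illustration of how a multi-hot vector serves as a float target, the sketch below computes binary cross-entropy in plain Python. It assumes per-category predicted probabilities; the probability values are made up for illustration, and binary_cross_entropy is a hypothetical helper, not a PyHealth or PyTorch function.

```python
import math
from typing import Sequence


def binary_cross_entropy(target: Sequence[float], probs: Sequence[float]) -> float:
    """Mean binary cross-entropy between a multi-hot target and probabilities."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -(t * math.log(max(p, eps)) + (1.0 - t) * math.log(max(1.0 - p, eps)))
        for t, p in zip(target, probs)
    ]
    return sum(losses) / len(losses)


target = [1.0, 0.0, 0.0, 1.0, 0.0]  # multi-hot vector from the example above
confident_correct = [0.9, 0.1, 0.1, 0.9, 0.1]  # illustrative predictions
confident_wrong = [0.1, 0.9, 0.9, 0.1, 0.9]  # illustrative predictions

good_loss = binary_cross_entropy(target, confident_correct)
bad_loss = binary_cross_entropy(target, confident_wrong)
print(good_loss < bad_loss)
```

Each position of the vector is scored as an independent yes/no decision, which is why the encoding uses floats of 0.0 and 1.0 rather than integer token indices.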

load(path)

Optional: Load processor state from disk.

Parameters:

path (str) – File path to load processor state from.

Return type:

None

save(path)

Optional: Save processor state to disk.

Parameters:

path (str) – File path to save processor state.

Return type:

None

schema()

Returns the schema of the processed feature. A processor that emits a single tensor should return just ["value"]. A processor that emits a tuple of tensors should return a tuple of the same length, giving the semantic name of each tensor, such as ["time", "value"] or ["value", "mask"].

Typical semantic names include:
  • "value": the main processed tensor output of the processor

  • "time": the time tensor output of the processor (mostly for StageNet)

  • "mask": the mask tensor output of the processor (if applicable)

Return type:

tuple[str, ...]

Returns:

Tuple of semantic names corresponding to the output of the processor.

size()

Returns the number of categories (vocabulary size).

spatial()

Whether each dimension (axis) of the value tensor is spatial (i.e. corresponds to a spatial axis like time, height, width, etc.) or not. This is used to determine how to apply augmentations and other transformations that should only be applied to spatial dimensions.

E.g. for CNN or RNN features, this would help determine which dimensions to apply spatial augmentations to, and which dimensions to treat as channels or features.

Return type:

tuple[bool, ...]

Returns:

Tuple of booleans corresponding to whether each axis of the value tensor is spatial or not.