pyhealth.processors.AudioProcessor

Processor for audio data.

class pyhealth.processors.AudioProcessor(sample_rate=4000, duration=20.0, to_mono=True, normalize=False, mean=None, std=None, n_mels=None, n_fft=400, hop_length=None)

Bases: FeatureProcessor

Feature processor that loads audio files from disk and converts them to tensors.

Parameters:

sample_rate (Optional[int]) – Desired output sample rate. If None, keeps original sample rate. Defaults to 4000.
duration (Optional[float]) – Desired duration in seconds. If None, keeps original duration. Audio longer than this is truncated; shorter audio is padded with zeros. Defaults to 20.0.
to_mono (bool) – Whether to convert stereo audio to mono. Defaults to True.
normalize (bool) – Whether to normalize audio values to [-1, 1]. Defaults to False.
mean (Optional[float]) – Precomputed mean for normalization. Defaults to None.
std (Optional[float]) – Precomputed std for normalization. Defaults to None.
n_mels (Optional[int]) – Number of mel filterbanks. If provided, converts to mel spectrogram. Defaults to None (keeps waveform).
n_fft (int) – Size of FFT for spectrogram. Defaults to 400.
hop_length (Optional[int]) – Length of hop between STFT windows. Defaults to None.

Raises:

ValueError – If normalization parameters are inconsistent.
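
The duration rule described for the constructor (truncate a clip that is too long, zero-pad one that is too short) can be illustrated with a small pure-Python sketch. `fit_to_duration` is a hypothetical helper, not part of PyHealth. It assumes the target length is `int(sample_rate * duration)` samples and that padding is appended at the end; the library may round or pad differently.

```python
from typing import List, Optional


def fit_to_duration(
    samples: List[float],
    sample_rate: int,
    duration: Optional[float],
) -> List[float]:
    """Truncate or zero-pad one channel to `duration` seconds (illustrative only)."""
    if duration is None:
        # No target duration: keep the original length.
        return list(samples)
    target = int(sample_rate * duration)  # assumed rounding rule
    if len(samples) >= target:
        # Clip is at least as long as the target: truncate.
        return list(samples[:target])
    # Clip is too short: append zeros (assumed padding side).
    return list(samples) + [0.0] * (target - len(samples))


# With the documented defaults (4000 Hz, 20.0 s) the target is 80000 samples.
short_clip = [0.5] * 10_000
long_clip = [0.5] * 100_000
padded = fit_to_duration(short_clip, 4000, 20.0)
truncated = fit_to_duration(long_clip, 4000, 20.0)
```

Both `padded` and `truncated` end up with the same length, which is what lets clips of different durations be stacked into one batch.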

process(value)

Process a single audio path into a transformed tensor.

Parameters:

value (Union[str, Path]) – Path to audio file as string or Path object.

Returns:

Transformed audio tensor. Shape depends on parameters:

Waveform: (channels, samples)
Mel spectrogram: (channels, n_mels, time)

Raises:

FileNotFoundError – If the audio file does not exist.
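
To make the two output shapes concrete, the sketch below computes the shape one would expect for a given configuration. `expected_shape` is a hypothetical helper, not a PyHealth function. It assumes the clip is padded or truncated to `int(sample_rate * duration)` samples. It also assumes the mel spectrogram follows the centered-STFT convention used by torchaudio's `MelSpectrogram`, where `hop_length` defaults to `n_fft // 2` and the frame count is `1 + samples // hop_length`. Whether AudioProcessor follows these conventions exactly is an assumption.

```python
from typing import Optional, Tuple


def expected_shape(
    channels: int,
    sample_rate: int,
    duration: float,
    n_mels: Optional[int] = None,
    n_fft: int = 400,
    hop_length: Optional[int] = None,
) -> Tuple[int, ...]:
    """Estimate the output shape under the assumptions stated above."""
    samples = int(sample_rate * duration)
    if n_mels is None:
        # Waveform output: (channels, samples)
        return (channels, samples)
    # Assumed default hop, mirroring torchaudio's convention.
    hop = hop_length if hop_length is not None else n_fft // 2
    # Frame count for a centered STFT.
    frames = 1 + samples // hop
    # Mel spectrogram output: (channels, n_mels, time)
    return (channels, n_mels, frames)


waveform_shape = expected_shape(1, 4000, 20.0)
mel_shape = expected_shape(1, 4000, 20.0, n_mels=64)
```

Under these assumptions, the default configuration yields 80000 samples per channel, and setting `n_mels` adds a frequency axis while shrinking the time axis by roughly the hop length.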

is_token()

Audio data is continuous (float-valued), not discrete tokens.

Return type: bool
Returns: False, since audio waveforms and spectrograms are continuous signals.

schema()

Returns the schema of the processed audio feature.

The audio processor emits a single tensor (waveform or mel spectrogram).

Return type: tuple[str, ...]
Returns: ("value",)

dim()

Number of dimensions for the output tensor.

Return type: tuple[int, ...]
Returns: (2,) for waveform output (channels, samples), or (3,) for mel spectrogram output (channels, n_mels, time).

spatial()

Whether each dimension of the output tensor is spatial.

For waveform (channels, samples): channels is not spatial, samples is. For mel spectrogram (channels, n_mels, time): channels is not spatial, n_mels and time are.

Return type: tuple[bool, ...]
Returns: Tuple of booleans, one per axis: (False, True) for waveform output, or (False, True, True) for mel spectrogram output.
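
The return values of is_token(), schema(), dim() and spatial() depend only on whether `n_mels` is set. The stand-in class below reproduces the documented values for illustration. `AudioMetaSketch` is a hypothetical name, and this is not PyHealth's implementation.

```python
from typing import Optional, Tuple


class AudioMetaSketch:
    """Illustrative stand-in that mirrors the documented metadata values."""

    def __init__(self, n_mels: Optional[int] = None) -> None:
        self.n_mels = n_mels

    def is_token(self) -> bool:
        # Audio is continuous, never discrete tokens.
        return False

    def schema(self) -> Tuple[str, ...]:
        # A single output tensor.
        return ("value",)

    def dim(self) -> Tuple[int, ...]:
        # 2 axes for a waveform, 3 for a mel spectrogram.
        return (2,) if self.n_mels is None else (3,)

    def spatial(self) -> Tuple[bool, ...]:
        # The channel axis is not spatial; the remaining axes are.
        if self.n_mels is None:
            return (False, True)
        return (False, True, True)


waveform_meta = AudioMetaSketch()
mel_meta = AudioMetaSketch(n_mels=64)
```

In this sketch the length of the tuple returned by spatial() always equals the single value in dim(), which matches the per-axis description above.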

fit(samples, field)

Fit the processor to the samples.

Parameters: samples (Iterable[Dict[str, Any]]) – List of sample dictionaries.
Return type: None

load(path)

Optional: Load processor state from disk.

Parameters: path (str) – File path to load processor state from.
Return type: None

save(path)

Optional: Save processor state to disk.

Parameters: path (str) – File path to save processor state to.
Return type: None