pyhealth.processors.AudioProcessor

Processor for audio data.

class pyhealth.processors.AudioProcessor(sample_rate=4000, duration=20.0, to_mono=True, normalize=False, mean=None, std=None, n_mels=None, n_fft=400, hop_length=None)

Bases: FeatureProcessor

Feature processor that loads audio files from disk and converts them to tensors.

Parameters:
  • sample_rate (Optional[int]) – Desired output sample rate. If None, keeps original sample rate. Defaults to 4000.

  • duration (Optional[float]) – Desired duration in seconds. If None, keeps the original duration. Audio longer than this duration is truncated; shorter audio is padded with zeros. Defaults to 20.0.

  • to_mono (bool) – Whether to convert stereo audio to mono. Defaults to True.

  • normalize (bool) – Whether to normalize audio values to [-1, 1]. Defaults to False.

  • mean (Optional[float]) – Precomputed mean for normalization. Defaults to None.

  • std (Optional[float]) – Precomputed std for normalization. Defaults to None.

  • n_mels (Optional[int]) – Number of mel filterbanks. If provided, converts to mel spectrogram. Defaults to None (keeps waveform).

  • n_fft (int) – Size of FFT for spectrogram. Defaults to 400.

  • hop_length (Optional[int]) – Length of hop between STFT windows. Defaults to None.

Raises:

ValueError – If normalization parameters are inconsistent.
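
The truncate-or-pad behavior described for duration can be sketched in plain Python. This is only an illustration: fit_to_duration is a hypothetical helper, not part of PyHealth, and it assumes the target length is computed as int(sample_rate * duration), which the documentation does not state.

```python
def fit_to_duration(samples, sample_rate, duration):
    """Truncate or zero-pad one channel of samples to `duration` seconds.

    Hypothetical helper for illustration; not part of PyHealth.
    """
    if duration is None:
        # Mirrors the documented "keeps original duration" case.
        return list(samples)
    # Assumption: the target length is rounded down to a whole sample.
    target_len = int(sample_rate * duration)
    clipped = list(samples[:target_len])            # truncate if too long
    padding = [0.0] * (target_len - len(clipped))   # zero-pad if too short
    return clipped + padding


# At the default 4000 Hz and 20.0 s, both clips end up 80000 samples long.
short_clip = [0.5] * 30000     # 7.5 s of audio
long_clip = [0.5] * 100000     # 25 s of audio
padded = fit_to_duration(short_clip, 4000, 20.0)
truncated = fit_to_duration(long_clip, 4000, 20.0)
```

With the defaults, every clip has the same number of samples after this step, so clips can be stacked into a batch.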

process(value)

Process a single audio path into a transformed tensor.

Parameters:

value (Union[str, Path]) – Path to audio file as string or Path object.

Returns:

Transformed audio tensor. Shape depends on parameters:

  • Waveform (n_mels is None): (channels, samples)

  • Mel spectrogram (n_mels is set): (channels, n_mels, time)

Raises:

FileNotFoundError – If the audio file does not exist.
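
To make the two output shapes concrete, the following sketch estimates them from the constructor parameters. expected_output_shape is a hypothetical helper, not a PyHealth function. It assumes the spectrogram comes from a centered STFT whose hop defaults to n_fft // 2, as in torchaudio's MelSpectrogram; PyHealth's internals may differ.

```python
def expected_output_shape(num_samples, channels=1, n_mels=None,
                          n_fft=400, hop_length=None):
    """Estimate the output shape for a clip of `num_samples` samples.

    Hypothetical helper. Assumes a centered STFT with a default hop of
    n_fft // 2 (torchaudio's MelSpectrogram convention).
    """
    if n_mels is None:
        # Waveform output: (channels, samples)
        return (channels, num_samples)
    hop = hop_length if hop_length is not None else n_fft // 2
    # A centered STFT yields one frame per hop, plus one.
    frames = 1 + num_samples // hop
    # Mel spectrogram output: (channels, n_mels, time)
    return (channels, n_mels, frames)


num_samples = int(4000 * 20.0)   # default sample_rate and duration
waveform_shape = expected_output_shape(num_samples)
mel_shape = expected_output_shape(num_samples, n_mels=64)
```

Under these assumptions the default configuration gives a waveform of shape (1, 80000), and n_mels=64 gives a mel spectrogram of shape (1, 64, 401).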

is_token()

Audio data is continuous (float-valued), not discrete tokens.

Return type:

bool

Returns:

False, since audio waveforms and spectrograms are continuous signals.

schema()

Returns the schema of the processed audio feature.

The audio processor emits a single tensor (waveform or mel spectrogram).

Return type:

tuple[str, ...]

Returns:

("value",)

dim()

Number of dimensions for the output tensor.

Return type:

tuple[int, ...]

Returns:

(2,) for waveform output (channels, samples), or (3,) for mel spectrogram output (channels, n_mels, time).

spatial()

Whether each dimension of the output tensor is spatial.

For waveform (channels, samples): channels is not spatial, samples is. For mel spectrogram (channels, n_mels, time): channels is not spatial, n_mels and time are.

Return type:

tuple[bool, ...]

Returns:

Tuple of booleans for each axis.
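
The values documented for is_token(), schema(), dim(), and spatial() can be summarized side by side for the two output modes. describe_output below is a hypothetical helper that only restates those documented values; it does not call PyHealth.

```python
def describe_output(n_mels=None):
    """Collect the documented metadata for a given configuration.

    Hypothetical helper: restates the values listed for is_token(),
    schema(), dim(), and spatial(); it does not call PyHealth.
    """
    is_mel = n_mels is not None
    return {
        "is_token": False,                  # audio is continuous
        "schema": ("value",),               # a single output tensor
        "dim": (3,) if is_mel else (2,),    # number of tensor dimensions
        # The channels axis is never spatial; the remaining axes are.
        "spatial": (False, True, True) if is_mel else (False, True),
    }


waveform_meta = describe_output()
mel_meta = describe_output(n_mels=64)
```

In both modes the length of the spatial tuple equals the number of dimensions reported by dim().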

fit(samples, field)

Fit the processor to the samples.

Parameters:

  • samples (Iterable[Dict[str, Any]]) – Iterable of sample dictionaries.

  • field – Name of the key in each sample dictionary that holds the value to fit on.

Return type:

None

load(path)

Optional: Load processor state from disk.

Parameters:

path (str) – File path to load processor state from.

Return type:

None

save(path)

Optional: Save processor state to disk.

Parameters:

path (str) – File path to save processor state.

Return type:

None
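
As a rough illustration of how the fit, save, and load signatures fit together, the sketch below uses a hypothetical stand-in class, MeanStdProcessor. It is not PyHealth code. The statistics it computes, the JSON state format, and the "loudness" field name are all assumptions made for the example; AudioProcessor itself may not learn or persist any state.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


class MeanStdProcessor:
    """Hypothetical stand-in with the same fit/save/load signatures."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, samples: Iterable[Dict[str, Any]], field: str) -> None:
        # Read the value stored under `field` in every sample.
        values = [float(sample[field]) for sample in samples]
        self.mean = sum(values) / len(values)
        variance = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = variance ** 0.5   # population standard deviation

    def save(self, path: str) -> None:
        # Assumed state format: a small JSON document.
        with open(path, "w") as f:
            json.dump({"mean": self.mean, "std": self.std}, f)

    def load(self, path: str) -> None:
        with open(path) as f:
            state = json.load(f)
        self.mean, self.std = state["mean"], state["std"]


samples = [{"loudness": 1.0}, {"loudness": 3.0}]
fitted = MeanStdProcessor()
fitted.fit(samples, "loudness")

# Round-trip the fitted state through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    state_path = os.path.join(tmp, "processor.json")
    fitted.save(state_path)
    restored = MeanStdProcessor()
    restored.load(state_path)
```

After the round trip, restored holds the same mean and std as fitted, so the fitting step does not have to be repeated.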