pyhealth.processors.AudioProcessor
Processor for audio data.
- class pyhealth.processors.AudioProcessor(sample_rate=4000, duration=20.0, to_mono=True, normalize=False, mean=None, std=None, n_mels=None, n_fft=400, hop_length=None)
Bases: FeatureProcessor
Feature processor for loading audio from disk and converting it to tensors.
- Parameters:
sample_rate (Optional[int]) – Desired output sample rate. If None, keeps the original sample rate. Defaults to 4000.
duration (Optional[float]) – Desired duration in seconds. If None, keeps the original duration. Audio longer than this is truncated; audio shorter than this is padded with zeros. Defaults to 20.0.
to_mono (bool) – Whether to convert stereo audio to mono. Defaults to True.
normalize (bool) – Whether to normalize audio values to [-1, 1]. Defaults to False.
mean (Optional[float]) – Precomputed mean for normalization. Defaults to None.
std (Optional[float]) – Precomputed std for normalization. Defaults to None.
n_mels (Optional[int]) – Number of mel filterbanks. If provided, converts the waveform to a mel spectrogram. Defaults to None (keeps waveform).
n_fft (int) – Size of FFT for spectrogram. Defaults to 400.
hop_length (Optional[int]) – Length of hop between STFT windows. Defaults to None.
- Raises:
ValueError – If normalization parameters are inconsistent.
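The interaction between sample_rate and duration can be sketched in plain Python. This is a minimal illustration only: fit_to_duration is a hypothetical helper, not part of pyhealth, and padding at the end of the clip is an assumption (the parameter description only says that zeros are added).

```python
def fit_to_duration(samples, sample_rate, duration):
    """Truncate or zero-pad a 1-D sequence of samples to `duration` seconds.

    Hypothetical helper for illustration; not part of pyhealth.
    """
    if duration is None:
        return list(samples)  # keep the original length
    target = int(sample_rate * duration)
    clipped = list(samples[:target])  # truncate clips that are too long
    # Assumption: zeros are appended at the end of clips that are too short.
    return clipped + [0.0] * (target - len(clipped))


# With the documented defaults (4000 Hz, 20.0 s), both clips end up
# with 4000 * 20 = 80000 samples.
short_clip = [0.5] * 4000      # 1 second of audio
long_clip = [0.5] * 100_000    # 25 seconds of audio
print(len(fit_to_duration(short_clip, 4000, 20.0)))
print(len(fit_to_duration(long_clip, 4000, 20.0)))
```

A fixed duration gives every sample the same length, which is what allows the resulting tensors to be stacked into batches.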
- process(value)
Process a single audio path into a transformed tensor.
- Parameters:
value (Union[str, Path]) – Path to audio file as string or Path object.
- Returns:
Transformed audio tensor. Shape depends on parameters:
Waveform: (channels, samples)
Mel spectrogram: (channels, n_mels, time)
- Raises:
FileNotFoundError – If the audio file does not exist.
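To reason about the two output shapes before loading any files, the sizes can be estimated from the constructor arguments. The sketch below is not the library's implementation. expected_shape is a hypothetical helper, and it assumes a torchaudio-style mel spectrogram: a centered STFT, with hop_length falling back to n_fft // 2 when it is None. Neither assumption is confirmed by this page.

```python
from typing import Optional, Tuple


def expected_shape(
    channels: int,
    sample_rate: int,
    duration: float,
    n_mels: Optional[int] = None,
    n_fft: int = 400,
    hop_length: Optional[int] = None,
) -> Tuple[int, ...]:
    """Estimate the output shape; hypothetical helper, not part of pyhealth."""
    num_samples = int(sample_rate * duration)
    if n_mels is None:
        return (channels, num_samples)  # waveform: (channels, samples)
    # Assumption: a missing hop_length falls back to n_fft // 2.
    hop = hop_length if hop_length is not None else n_fft // 2
    # Assumption: centered STFT, giving one frame per hop plus one.
    frames = num_samples // hop + 1
    return (channels, n_mels, frames)  # mel: (channels, n_mels, time)


print(expected_shape(1, 4000, 20.0))             # (1, 80000)
print(expected_shape(1, 4000, 20.0, n_mels=64))  # (1, 64, 401)
```

If the actual backend uses a different framing convention, the time dimension will differ slightly, so treat the estimate as a sanity check, not a guarantee.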
- is_token()
Audio data is continuous (float-valued), not discrete tokens.
- Return type:
bool
- Returns:
False, since audio waveforms and spectrograms are continuous signals.
- schema()
Returns the schema of the processed audio feature.
The audio processor emits a single tensor (waveform or mel spectrogram).
- spatial()
Whether each dimension of the output tensor is spatial.
For waveform (channels, samples): channels is not spatial, samples is. For mel spectrogram (channels, n_mels, time): channels is not spatial, n_mels and time are.
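The per-dimension flags described above can be written out as tuples. This is a sketch under stated assumptions: spatial_flags and spatial_sizes are hypothetical helpers, not part of pyhealth; the exact return type of spatial() is not shown on this page; and the shapes are example values only.

```python
from typing import Optional, Tuple


def spatial_flags(n_mels: Optional[int] = None) -> Tuple[bool, ...]:
    """Per-dimension spatial flags as described above (hypothetical helper)."""
    if n_mels is None:
        return (False, True)      # waveform: (channels, samples)
    return (False, True, True)    # mel spectrogram: (channels, n_mels, time)


def spatial_sizes(shape: Tuple[int, ...], flags: Tuple[bool, ...]) -> Tuple[int, ...]:
    """Keep only the sizes of the dimensions flagged as spatial."""
    return tuple(size for size, is_spatial in zip(shape, flags) if is_spatial)


# Example shapes, chosen for illustration only.
print(spatial_sizes((1, 80000), spatial_flags()))             # (80000,)
print(spatial_sizes((1, 64, 401), spatial_flags(n_mels=64)))  # (64, 401)
```

Downstream code can use such flags to decide which axes to treat as positions (for example, for pooling or convolution) and which to treat as channels.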
- fit(samples, field)
Fit the processor to the samples.
- load(path)
Optional: Load processor state from disk.