pyhealth.datasets.SampleDataset#

This class is the base sample dataset. All sample datasets inherit from this class.

class pyhealth.datasets.SampleDataset(path, dataset_name=None, task_name=None, **kwargs)[source]#

Bases: StreamingDataset

A streaming dataset that loads sample metadata and processors from disk.

SampleDataset expects the directory at path to contain a schema.pkl file created by a SampleBuilder.save(…) call. The schema.pkl file must include input_schema, output_schema, the fitted input_processors and output_processors, and the patient_to_index and record_to_index mappings.

input_schema#: The configuration used to instantiate processors for input features (string aliases or processor specs).

output_schema#: The configuration used to instantiate processors for output features.

input_processors#: A mapping of input feature names to fitted FeatureProcessor instances.

output_processors#: A mapping of output feature names to fitted FeatureProcessor instances.

patient_to_index#: Dictionary mapping patient IDs to the list of sample indices associated with that patient.

record_to_index#: Dictionary mapping record/visit IDs to the list of sample indices associated with that record.

dataset_name#: Optional human-friendly dataset name.

task_name#: Optional human-friendly task name.
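
To illustrate the shape of the patient_to_index and record_to_index mappings described above, here is a minimal sketch in plain Python. It does not use PyHealth itself: the sample list and the "patient_id" and "record_id" keys are assumptions made for illustration, and the real mappings are produced when the schema is saved, not by this code.

```python
from collections import defaultdict

# Hypothetical samples; the "patient_id" and "record_id" keys are
# assumptions for illustration, not a documented PyHealth sample format.
samples = [
    {"patient_id": "p1", "record_id": "v1"},
    {"patient_id": "p1", "record_id": "v2"},
    {"patient_id": "p2", "record_id": "v3"},
    {"patient_id": "p1", "record_id": "v2"},
]

patient_to_index = defaultdict(list)
record_to_index = defaultdict(list)

# Each ID maps to the list of sample positions that belong to it.
for idx, sample in enumerate(samples):
    patient_to_index[sample["patient_id"]].append(idx)
    record_to_index[sample["record_id"]].append(idx)

patient_to_index = dict(patient_to_index)
record_to_index = dict(record_to_index)
```

One patient or record can own several samples, which is why each value is a list of indices.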

subset(indices)[source]#

Create a StreamingDataset restricted to the provided indices.

Return type:: SampleDataset
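
A common use of subset is a patient-level train/test split, where all samples of a patient land in the same split. The sketch below only computes the index lists from a patient_to_index-style dictionary. The helper split_indices_by_patient, its parameters, and the toy mapping are hypothetical, and the sketch assumes subset accepts a list of integer sample indices.

```python
import random


def split_indices_by_patient(patient_to_index, train_fraction=0.8, seed=0):
    """Hypothetical helper: split sample indices so no patient spans both splits."""
    patient_ids = sorted(patient_to_index)
    random.Random(seed).shuffle(patient_ids)  # deterministic for a given seed
    n_train = int(len(patient_ids) * train_fraction)
    train_ids, test_ids = patient_ids[:n_train], patient_ids[n_train:]
    train = sorted(i for p in train_ids for i in patient_to_index[p])
    test = sorted(i for p in test_ids for i in patient_to_index[p])
    return train, test


# Toy stand-in for a dataset's patient_to_index attribute.
toy_mapping = {"p1": [0, 1], "p2": [2], "p3": [3, 4], "p4": [5], "p5": [6, 7]}
train_indices, test_indices = split_indices_by_patient(toy_mapping)

# With a real dataset (not constructed here), the lists would be passed on:
# train_ds = dataset.subset(train_indices)
# test_ds = dataset.subset(test_indices)
```

Splitting by patient instead of by sample avoids leaking a patient's records between the training and test sets.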

close()[source]#

Cleans up any temporary directories used by the dataset.

Return type:: None

get_len(num_workers, batch_size)#

Return type:: int

load_state_dict(state_dict)#

Return type:: None

property on_demand_bytes: bool#

Return type:: bool

reset()#

Return type:: None

reset_state_dict()#

Return type:: None

set_batch_size(batch_size)#

Return type:: None

set_drop_last(drop_last)#

Set the drop_last parameter.

Invalidates the shuffler cache when the parameter changes to ensure subsequent length calculations reflect the new drop_last setting.

Parameters:: drop_last (bool) – Whether to drop the last incomplete batch.
Return type:: None
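
To show why drop_last changes length calculations, here is a sketch of the generic single-worker batch-count arithmetic. The function num_batches is hypothetical and is not the library's own length computation, which may also account for the number of workers and distributed ranks.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Generic batch count for one worker (illustrative, not the library's formula)."""
    if drop_last:
        # The final incomplete batch is discarded.
        return num_samples // batch_size
    # The final incomplete batch is kept.
    return math.ceil(num_samples / batch_size)


batches_dropping = num_batches(10, 4, drop_last=True)
batches_keeping = num_batches(10, 4, drop_last=False)
```

The two settings differ only when the sample count is not a multiple of the batch size.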

set_epoch(current_epoch)#

Set the dataset's current epoch at the start of each epoch.

When using the StreamingDataLoader, this is done automatically.

Return type:: None

set_num_workers(num_workers)#

Return type:: None

set_shuffle(shuffle)#

Set the shuffle parameter.

Invalidates the shuffler cache when the parameter changes to ensure subsequent length calculations reflect the new shuffle setting.

Parameters:: shuffle (bool) – Whether to shuffle the dataset.
Return type:: None

state_dict(num_samples_yielded, num_workers, batch_size)#

Return type:: dict[str, Any]