pyhealth.datasets.SampleDataset#

This is the base sample dataset class. All sample datasets inherit from it.

class pyhealth.datasets.SampleDataset(path, dataset_name=None, task_name=None, **kwargs)[source]#

Bases: StreamingDataset

A streaming dataset that loads sample metadata and processors from disk.

SampleDataset expects the directory given by path to contain a schema.pkl file created by a SampleBuilder.save(…) call. That file must include the fitted input_schema, output_schema, input_processors, and output_processors, along with the patient_to_index and record_to_index mappings.
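The requirement above can be checked before the dataset is constructed. The following is a minimal sketch, assuming that schema.pkl unpickles to a plain dict keyed by the six names listed; the actual layout written by SampleBuilder.save may differ, and `missing_schema_keys` is a hypothetical helper, not part of PyHealth.

```python
import pickle
from pathlib import Path

# Names the documentation says schema.pkl must include.
REQUIRED_KEYS = (
    "input_schema",
    "output_schema",
    "input_processors",
    "output_processors",
    "patient_to_index",
    "record_to_index",
)


def missing_schema_keys(path):
    """Return the required names absent from <path>/schema.pkl.

    Hypothetical helper. Assumes the pickle holds a dict keyed by the
    names above, which is an assumption about the file layout.
    """
    schema_file = Path(path) / "schema.pkl"
    with schema_file.open("rb") as f:
        schema = pickle.load(f)  # only unpickle files you trust
    return [key for key in REQUIRED_KEYS if key not in schema]
```

An empty list means every expected entry is present; any returned name points at a directory that was probably not produced by a complete save.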

input_schema#

The configuration used to instantiate processors for input features (string aliases or processor specs).

output_schema#

The configuration used to instantiate processors for output features.

input_processors#

A mapping of input feature names to fitted FeatureProcessor instances.

output_processors#

A mapping of output feature names to fitted FeatureProcessor instances.

patient_to_index#

Dictionary mapping patient IDs to the list of sample indices associated with that patient.

record_to_index#

Dictionary mapping record/visit IDs to the list of sample indices associated with that record.
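To make the shape of these two mappings concrete, here is a small sketch that builds them from a toy list of samples. The field names `patient_id` and `record_id` and the helper `build_index` are assumptions for illustration; the real mappings are produced when the dataset is built and are simply loaded here.

```python
from collections import defaultdict

# Toy samples; the "patient_id" and "record_id" field names are assumptions.
samples = [
    {"patient_id": "p1", "record_id": "v1"},
    {"patient_id": "p1", "record_id": "v2"},
    {"patient_id": "p2", "record_id": "v3"},
]


def build_index(samples, key):
    """Map each distinct value of `key` to the sample indices carrying it."""
    index = defaultdict(list)
    for i, sample in enumerate(samples):
        index[sample[key]].append(i)
    return dict(index)


patient_to_index = build_index(samples, "patient_id")
record_to_index = build_index(samples, "record_id")
print(patient_to_index)  # {'p1': [0, 1], 'p2': [2]}
```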

dataset_name#

Optional human-friendly dataset name.

task_name#

Optional human-friendly task name.

subset(indices)[source]#

Create a new SampleDataset restricted to the provided indices.

Return type:

SampleDataset
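A common reason to call subset is a patient-level split, where every sample from a patient lands on the same side. The sketch below derives index lists from a mapping shaped like patient_to_index; the toy mapping and the helper `indices_for_patients` are hypothetical, and passing a plain list of integers to subset is an assumption about the accepted argument type.

```python
def indices_for_patients(patient_to_index, patient_ids):
    """Collect the sorted sample indices belonging to the given patients."""
    indices = []
    for pid in patient_ids:
        indices.extend(patient_to_index[pid])
    return sorted(indices)


# Toy mapping in the shape described for patient_to_index.
patient_to_index = {"p1": [0, 1], "p2": [2], "p3": [3, 4]}

train_indices = indices_for_patients(patient_to_index, ["p1", "p3"])
test_indices = indices_for_patients(patient_to_index, ["p2"])

# With a loaded dataset, these lists would then be passed to subset, e.g.
#   train_ds = dataset.subset(train_indices)
#   test_ds = dataset.subset(test_indices)
```

Splitting by patient rather than by sample keeps records from one patient out of both the training and test sets.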

close()[source]#

Clean up any temporary directories used by the dataset.

Return type:

None

get_len(num_workers, batch_size)#

Return type:

int

load_state_dict(state_dict)#

Return type:

None

property on_demand_bytes: bool#

Return type:

bool

reset()#

Return type:

None

reset_state_dict()#

Return type:

None

set_batch_size(batch_size)#

Return type:

None

set_drop_last(drop_last)#

Set the drop_last parameter.

Invalidates the shuffler cache when the parameter changes to ensure subsequent length calculations reflect the new drop_last setting.

Parameters:

drop_last (bool) – Whether to drop the last incomplete batch.

Return type:

None
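The reason drop_last affects length calculations can be shown with generic batching arithmetic. This is a sketch of the usual single-process rule only, not the library's implementation; a streaming dataset may apply it per worker or per rank, and `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Batches produced from num_samples items; generic arithmetic only."""
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // batch_size
    # The trailing partial batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


print(num_batches(10, 4, drop_last=True))   # 2
print(num_batches(10, 4, drop_last=False))  # 3
```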

set_epoch(current_epoch)#

Set the dataset's current epoch at the start of each epoch.

When using the StreamingDataLoader, this is done automatically.

Return type:

None
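Telling a dataset the current epoch typically lets it produce a shuffle order that differs between epochs yet stays reproducible. The sketch below illustrates that general idea with the standard library; the seeding scheme and the helper `epoch_order` are assumptions for illustration and do not describe how this class shuffles internally.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Return a permutation that is deterministic for a (seed, epoch) pair."""
    rng = random.Random(seed + epoch)  # the epoch changes the random stream
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


first = epoch_order(100, seed=42, epoch=0)
again = epoch_order(100, seed=42, epoch=0)
later = epoch_order(100, seed=42, epoch=1)
```

Here `first` and `again` are identical because they share a seed and epoch, while `later` draws from a different random stream.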

set_num_workers(num_workers)#

Return type:

None

set_shuffle(shuffle)#

Set the shuffle parameter.

Invalidates the shuffler cache when the parameter changes to ensure subsequent length calculations reflect the new shuffle setting.

Parameters:

shuffle (bool) – Whether to shuffle the dataset.

Return type:

None

state_dict(num_samples_yielded, num_workers, batch_size)#

Return type:

dict[str, Any]