pyhealth.datasets.OMOPDataset

We can process any OMOP-CDM formatted database; refer to the documentation for more information. The raw data is processed into a well-structured dataset object that gives users flexibility and convenience for modeling and analysis.

class pyhealth.datasets.OMOPDataset(root, tables, dataset_name=None, code_mapping=None, dev=False, refresh_cache=False)

Bases: BaseEHRDataset

Base dataset for OMOP-CDM datasets.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence.

See: https://www.ohdsi.org/data-standardization/the-common-data-model/.

The basic information is stored in the following tables:
  • person: contains records that uniquely identify each person or patient, and some demographic information.
  • visit_occurrence: contains info for how a patient engages with the healthcare system for a duration of time.
  • death: contains info for how and when a patient dies.

We further support the following tables:
  • condition_occurrence: contains the condition information (CONDITION_CONCEPT_ID code) of patients’ visits.
  • procedure_occurrence: contains the procedure information (PROCEDURE_CONCEPT_ID code) of patients’ visits.
  • drug_exposure: contains the drug information (DRUG_CONCEPT_ID code) of patients’ visits.
  • measurement: contains all laboratory measurements (MEASUREMENT_CONCEPT_ID code) of patients’ visits.
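
As a minimal sketch, a caller might check a list of table names against the supported tables listed above before constructing the dataset. The helper `check_tables` and the constant `SUPPORTED_TABLES` are hypothetical names introduced for illustration; they are not part of PyHealth, and the sketch assumes table names are passed without the `.csv` extension, as in the usage example on this page.

```python
from typing import List

# Table names taken from the list above. The constant and the helper are
# hypothetical and only illustrate a pre-flight check a caller could run.
SUPPORTED_TABLES = (
    "condition_occurrence",
    "procedure_occurrence",
    "drug_exposure",
    "measurement",
)


def check_tables(tables: List[str]) -> List[str]:
    """Return a copy of `tables`, or raise if any name is not supported."""
    unknown = [name for name in tables if name not in SUPPORTED_TABLES]
    if unknown:
        raise ValueError(f"Unsupported OMOP tables: {unknown}")
    return list(tables)


tables = check_tables(["condition_occurrence", "drug_exposure"])
```

Failing early on a mistyped table name is cheaper than discovering it partway through parsing a large database.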

Parameters:
  • dataset_name (Optional[str]) – name of the dataset.

  • root (str) – root directory of the raw data, which should contain the CSV files.

  • tables (List[str]) – list of tables to be loaded (e.g., ["condition_occurrence", "procedure_occurrence"]).

  • code_mapping (Optional[Dict[str, Union[str, Tuple[str, Dict]]]]) – a dictionary containing the code mapping information. The key is a str of the source code vocabulary and the value takes one of two formats:

    1. a str of the target code vocabulary;

    2. a tuple with two elements. The first element is a str of the target code vocabulary and the second element is a dict with keys “source_kwargs” or “target_kwargs” and values of the corresponding kwargs for the CrossMap.map() method.

    Default is empty dict, which means the original code will be used.

  • dev (bool) – whether to enable dev mode (only use a small subset of the data). Default is False.

  • refresh_cache (bool) – whether to refresh the cache; if true, the dataset will be processed from scratch and the cache will be updated. Default is False.
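
The two value formats accepted by code_mapping can be illustrated with a small sketch. The helper `normalize_code_mapping` is hypothetical (it is not a PyHealth function) and only shows how both formats reduce to a target vocabulary plus optional kwargs. The vocabulary names and the `level` kwarg are illustrative assumptions, not values confirmed for any particular OMOP database.

```python
from typing import Dict, Tuple, Union

MappingValue = Union[str, Tuple[str, Dict]]


def normalize_code_mapping(
    code_mapping: Dict[str, MappingValue],
) -> Dict[str, Tuple[str, Dict, Dict]]:
    """Hypothetical helper: expand both value formats into
    (target_vocabulary, source_kwargs, target_kwargs) triples."""
    normalized = {}
    for source, value in code_mapping.items():
        if isinstance(value, str):
            # Format 1: just the target vocabulary name.
            target, kwargs = value, {}
        else:
            # Format 2: (target vocabulary, dict of kwargs).
            target, kwargs = value
        normalized[source] = (
            target,
            kwargs.get("source_kwargs", {}),
            kwargs.get("target_kwargs", {}),
        )
    return normalized


# Illustrative vocabulary names; substitute the ones used in your data.
code_mapping = {
    "ICD9CM": "CCSCM",
    "NDC": ("ATC", {"target_kwargs": {"level": 3}}),
}
normalized = normalize_code_mapping(code_mapping)
```

A dict in either format would be passed as the code_mapping argument when constructing the dataset; omitting it keeps the original codes.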

task

Optional[str], name of the task (e.g., “mortality prediction”). Default is None.

samples

Optional[List[Dict]], a list of samples, where each sample is a dict with patient_id, visit_id, and other task-specific attributes as keys. Default is None.

patient_to_index

Optional[Dict[str, List[int]]], a dict mapping patient_id to a list of sample indices. Default is None.

visit_to_index

Optional[Dict[str, List[int]]], a dict mapping visit_id to a list of sample indices. Default is None.
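
To make the relationship between samples, patient_to_index, and visit_to_index concrete, the following sketch builds both index dicts from a toy list of samples. The helper `build_index`, the sample values, and the `label` key are assumptions for illustration; this is not the library's own code.

```python
from collections import defaultdict
from typing import Dict, List

# Toy samples: each dict carries patient_id, visit_id, and a
# task-specific attribute (here a hypothetical "label").
samples = [
    {"patient_id": "p1", "visit_id": "v1", "label": 0},
    {"patient_id": "p1", "visit_id": "v2", "label": 1},
    {"patient_id": "p2", "visit_id": "v3", "label": 0},
]


def build_index(samples: List[Dict], key: str) -> Dict[str, List[int]]:
    """Map each distinct value of `key` to the positions of its samples."""
    index = defaultdict(list)
    for position, sample in enumerate(samples):
        index[sample[key]].append(position)
    return dict(index)


patient_to_index = build_index(samples, "patient_id")
visit_to_index = build_index(samples, "visit_id")
```

With such an index, all samples for one patient can be retrieved without scanning the full list, which is useful for patient-level train/test splits.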

Examples

>>> from pyhealth.datasets import OMOPDataset
>>> dataset = OMOPDataset(
...     root="/srv/local/data/zw12/pyhealth/raw_data/synpuf1k_omop_cdm_5.2.2",
...     tables=["condition_occurrence", "procedure_occurrence", "drug_exposure", "measurement"],
... )
>>> dataset.stat()
>>> dataset.info()

parse_basic_info(patients)

Helper function that parses the person, visit_occurrence, and death tables.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_condition_occurrence(patients)

Helper function that parses the condition_occurrence table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_procedure_occurrence(patients)

Helper function that parses the procedure_occurrence table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_drug_exposure(patients)

Helper function that parses the drug_exposure table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.

parse_measurement(patients)

Helper function that parses the measurement table.

Will be called in self.parse_tables().

Parameters:

patients (Dict[str, Patient]) – a dict of Patient objects indexed by patient_id.

Return type:

Dict[str, Patient]

Returns:

The updated patients dict.
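
All of the parse_* helpers on this page share one shape: each takes the patients dict and returns the updated dict. The sketch below shows how a parse_tables()-style driver could chain them. `ToyDataset` is a hypothetical stand-in that uses plain dicts instead of Patient objects, and the name-based dispatch via getattr is an assumption about how such a driver might work, not a description of the PyHealth implementation.

```python
from typing import Dict, List


class ToyDataset:
    """Hypothetical stand-in; plain dicts replace Patient objects."""

    def __init__(self, tables: List[str]):
        self.tables = tables

    def parse_basic_info(self, patients: Dict[str, dict]) -> Dict[str, dict]:
        # Would read person, visit_occurrence, and death; here one toy patient.
        patients["p1"] = {"visits": ["v1"], "events": []}
        return patients

    def parse_condition_occurrence(self, patients: Dict[str, dict]) -> Dict[str, dict]:
        patients["p1"]["events"].append(("condition_occurrence", "v1"))
        return patients

    def parse_drug_exposure(self, patients: Dict[str, dict]) -> Dict[str, dict]:
        patients["p1"]["events"].append(("drug_exposure", "v1"))
        return patients

    def parse_tables(self) -> Dict[str, dict]:
        # Basic info first, then one parser per requested table,
        # looked up by name (an assumed convention for this sketch).
        patients: Dict[str, dict] = {}
        patients = self.parse_basic_info(patients)
        for table in self.tables:
            parser = getattr(self, f"parse_{table}")
            patients = parser(patients)
        return patients


patients = ToyDataset(["condition_occurrence", "drug_exposure"]).parse_tables()
```

Because every helper has the same signature, supporting an additional table in this sketch only requires adding one more parse_<table> method.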