
Quick Start for Data Processing

We propose a standard template: a formalized schema for healthcare datasets. As long as a dataset is scanned into the template we define, downstream task processing and the use of ML models become easy and standardized. In short, it has the following structure (figure to be added). The data loaders for different datasets can be found in examples/data_generation. Using “examples/data_generation/dataloader_mimic_demo.py” as an example:

  1. First, read in the patient, admission, and event tables.

    import os
    from pyhealth.utils.utility import read_csv_to_df
    patient_df = read_csv_to_df(os.path.join('data', 'mimic-iii-clinical-database-demo-1.4', 'PATIENTS.csv'))
    admission_df = read_csv_to_df(os.path.join('data', 'mimic-iii-clinical-database-demo-1.4', 'ADMISSIONS.csv'))

  2. Then invoke the parallel program to parse the tables across n_jobs cores.

    from joblib import Parallel
    from pyhealth.data.base_mimic import parallel_parse_tables
    all_results = Parallel(n_jobs=n_jobs, max_nbytes=None, verbose=True)(

  3. The processed sequential data is saved to the prespecified location.

    import json
    with open(patient_data_loc, 'w') as outfile:
        json.dump(patient_data_list, outfile)
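
The three steps above can be sketched end to end without PyHealth installed. This is only an illustration under stated assumptions, not the library's implementation: it swaps joblib's Parallel for the standard library's ThreadPoolExecutor, and the in-memory tables, the helpers chunk and parse_patient_chunk, and the record fields are all hypothetical.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the patient and admission tables.
patients = [{"patient_id": i, "gender": "F" if i % 2 else "M"} for i in range(10)]
admissions = [{"patient_id": i % 10, "admission_id": 100 + i} for i in range(25)]


def chunk(items, n_chunks):
    """Split items into n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    out, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        out.append(items[start:end])
        start = end
    return out


def parse_patient_chunk(patient_chunk):
    """Hypothetical parser: attach each patient's admission IDs to the patient record."""
    parsed = []
    for p in patient_chunk:
        adm = [a["admission_id"] for a in admissions if a["patient_id"] == p["patient_id"]]
        parsed.append({**p, "admission_list": adm})
    return parsed


n_jobs = 4
with ThreadPoolExecutor(max_workers=n_jobs) as pool:
    # map() returns results in input order, so patients keep their original order.
    all_results = list(pool.map(parse_patient_chunk, chunk(patients, n_jobs)))

# Flatten the per-worker lists and save them as a single JSON file.
patient_data_list = [rec for part in all_results for rec in part]
patient_data_loc = os.path.join(tempfile.mkdtemp(), "patient_data.json")
with open(patient_data_loc, "w") as outfile:
    json.dump(patient_data_list, outfile)
```

Splitting by patient keeps each worker's output independent, so the per-chunk results can simply be concatenated before saving.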

The examples provided in PyHealth mainly focus on scanning the data tables into the schema we have and generating episode datasets. For instance, “examples/data_generation/dataloader_mimic_demo.py” demonstrates the basic procedure for processing the MIMIC-III demo dataset.

  1. The next step is to generate episode/sequence data for mortality prediction. See “examples/data_generation/generate_mortality_prediction_mimic_demo.py”.

    with open(patient_data_loc, 'w') as outfile:
        json.dump(patient_data_list, outfile)

By this step, the dataset has been processed to generate X and y for mortality prediction. Note that the API is similar across most datasets. You can replicate this procedure by calling the data generation scripts in examples/data_generation. You may also modify the parameters in those scripts to generate customized datasets.
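
To make "generating X, y" concrete, here is a minimal sketch of turning per-patient records into fixed-length sequences and binary labels. The record layout (the events and died_in_hospital fields), the helper build_xy, and the left-padding strategy are all assumptions made for illustration; the PyHealth generation scripts may structure and pad the data differently.

```python
# Hypothetical per-patient records, shaped like what a data loader might save.
patient_data_list = [
    {"patient_id": 1, "events": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], "died_in_hospital": 0},
    {"patient_id": 2, "events": [[0.9, 0.8]], "died_in_hospital": 1},
]


def build_xy(records, max_len):
    """Turn records into fixed-length sequences X and binary labels y."""
    n_features = len(records[0]["events"][0])
    X, y = [], []
    for rec in records:
        seq = rec["events"][-max_len:]  # keep only the most recent time steps
        # Left-pad short sequences with zero vectors so every row has max_len steps.
        pad = [[0.0] * n_features for _ in range(max_len - len(seq))]
        X.append(pad + seq)
        y.append(rec["died_in_hospital"])
    return X, y


X, y = build_xy(patient_data_list, max_len=2)
```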

Preprocessed datasets are also available at datasets/cms and datasets/mimic.

Quick Start for Running Predictive Models

Note: Before running the examples, you need the datasets. Please download them from the GitHub repository “datasets”. You can either unzip them manually or run our script “00_extract_data_run_before_learning.py”.

Note: “examples/learning_models/example_sequence_gpu_mortality.py” demonstrates the basic API for using a GRU for mortality prediction. The API of the other algorithms is consistent with or similar to this one.

Note: If you do not have the preprocessed datasets yet, download the datasets folder (cms.zip and mimic.zip) from the PyHealth repository, and run examples/learning_models/extract_data_run_before_learning.py to unzip and prepare the datasets.
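
If you prefer to unzip the archives manually, the following is a minimal sketch using only the standard library. The helper extract_archives and the directory layout are hypothetical and are not part of PyHealth; the bundled extraction script may do more than this.

```python
import os
import zipfile


def extract_archives(src_dir, dst_dir, names=("cms.zip", "mimic.zip")):
    """Extract each archive found in src_dir into dst_dir; return the names extracted."""
    os.makedirs(dst_dir, exist_ok=True)
    extracted = []
    for name in names:
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue  # skip archives that have not been downloaded yet
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dst_dir)
        extracted.append(name)
    return extracted
```

For example, extract_archives("datasets", "datasets") would unpack both archives in place, assuming they were downloaded into a local datasets directory.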

Note: Certain examples need pretrained BERT models. You will need to download these pretrained models at:

Please download, unzip, and save them to the ./auxiliary folder.

  1. Set up the datasets. X and y should be in x_data and y_data, respectively.

    # load the pre-processed MIMIC dataset
    from pyhealth.data.expdata_generator import sequencedata as expdata_generator
    expdata_id = '2020.0810.data.mortality.mimic'
    cur_dataset = expdata_generator(expdata_id)
    cur_dataset.get_exp_data(sel_task='mortality')

  2. Initialize an LSTM model. You may set its parameters, e.g., n_epoch, learning_rate, etc.

    # initialize the model for training
    from pyhealth.models.sequence.lstm import LSTM
    # placeholder model ID (an assumption here); choose your own identifier
    expmodel_id = 'test.model.lstm.0001'
    # enable GPU
    clf = LSTM(expmodel_id=expmodel_id, n_batchsize=20, use_gpu=True,
               n_epoch=100, gpu_ids='0,1')
    clf.fit(cur_dataset.train, cur_dataset.valid)

  3. Load the best model from training and predict on the test dataset.

    # load the best model for inference
    pred_results = clf.get_results()

  4. Evaluate the model. Multiple metrics are supported.

    # evaluate the model
    from pyhealth.evaluation.evaluator import func
    r = func(pred_results['hat_y'], pred_results['y'])
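
To illustrate the kind of metrics an evaluator computes from predicted scores and true labels, here is a minimal pure-Python sketch. It is not PyHealth's implementation: binary_metrics, the 0.5 threshold, and the sample predictions are all assumptions; only the hat_y and y keys come from the example above.

```python
def binary_metrics(hat_y, y, threshold=0.5):
    """Compute accuracy, precision, and recall from predicted scores and 0/1 labels."""
    pred = [1 if score >= threshold else 0 for score in hat_y]
    tp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 0)
    return {
        "accuracy": (tp + tn) / len(y),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical predictions laid out like pred_results in the example above.
pred_results = {"hat_y": [0.9, 0.2, 0.7, 0.4], "y": [1, 0, 0, 1]}
r = binary_metrics(pred_results["hat_y"], pred_results["y"])
```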