How to Contribute
We welcome contributions to PyHealth! This guide will help you get started with contributing datasets, tasks, models, bug fixes, or other improvements to the project.
Examples of Great Contributions
Before diving into the details, here are some excellent pull requests that demonstrate what we’re looking for. Use these as references when creating your own contributions:
Dataset & Task Contributions:
ChestX-ray14 Dataset with Classification Tasks (PR #392)
This is an exemplary dataset contribution that includes:
Complete dataset implementation with proper documentation
Both binary and multilabel classification tasks
Comprehensive test cases with synthetic data
API documentation files
Example notebooks demonstrating usage
Clear PR description with file guide
Model Updates & Revamps:
Model Compatibility Updates (PR #610)
A great example of updating models to work with PyHealth 2.0:
Systematic updates across multiple models
Maintained backward compatibility where possible
Clear documentation of breaking changes
Updated examples and test cases
These contributions demonstrate the level of completeness and documentation we appreciate. Take some time to review them to understand our expectations!
Review Criteria
All contributions to PyHealth are evaluated based on the following criteria:
Completeness
Your contribution should be production-ready and complete. This includes:
Comprehensive test cases with synthetic data
Complete API documentation following our style guide
Proper inheritance from base classes with all required methods implemented
Working examples demonstrating usage
Clear docstrings for all public methods and classes
Relevance to Healthcare
PyHealth is a healthcare-focused library. Contributions should address healthcare-specific problems or datasets. Being sourced from or inspired by a peer-reviewed publication strengthens your contribution and provides validation of the approach.
Relevance to the Package
Your contribution should enhance PyHealth’s utility for the broader community. Consider:
Does this add meaningful functionality that others will use?
Does it integrate well with existing PyHealth components?
Does it follow established patterns and conventions?
Will it create confusion or maintenance burden?
The goal is to grow PyHealth thoughtfully, not just add features for the sake of features.
Getting Started
Prerequisites
PyHealth uses GitHub for development, so you’ll need a GitHub account to contribute.
Setting Up Your Development Environment
To start contributing to PyHealth:
Fork the PyHealth repository on GitHub
Clone your forked repository to your local machine:
git clone https://github.com/your_username/PyHealth.git
cd PyHealth
Install dependencies:
pip install -e .
Implement your code with proper test cases
Push changes to your forked repository
Create a pull request to the main PyHealth repository
Target the main branch
Enable edits by maintainers
Rebase with the remote sunlabuiuc main branch before creating the PR
Data Access for Testing
If you’re new to working with healthcare datasets and need access to data for testing your contributions, here are some helpful resources:
Getting MIMIC Access
For full access to MIMIC datasets, you’ll need to complete credentialing through PhysioNet. Detailed instructions are available in our Getting MIMIC Access Guide.
Using Demo Datasets
While completing the credentialing process, you can use publicly available demo datasets to develop and test your code:
MIMIC-IV Demo: A subset of 100 patients from MIMIC-IV, available at https://physionet.org/content/mimic-iv-demo/2.2/
MIMIC-III Demo: A subset of 100 patients from MIMIC-III, available at https://physionet.org/content/mimiciii-demo/1.4/
These demo datasets are open access and do not require credentialing. They’re perfect for:
Testing your dataset implementations
Developing new tasks
Creating reproducible examples
Verifying your code works before requesting full data access
Important Note on Test Data: While demo datasets are useful for development, your test cases should use small synthetic/pseudo data to ensure fast execution in continuous integration. Demo datasets (even with 100 patients) are too large for unit tests. See the “Writing Fast and Performant Tests” section below for guidance on creating minimal test fixtures.
Examples and Tutorials
Before implementing new features, review our tutorials page for examples. The tutorials include Jupyter notebooks demonstrating:
How to load and process datasets
How to define and run tasks
How to train and evaluate models
Best practices for working with PyHealth
Visit the tutorials page in our documentation to explore these examples.
Implementation Requirements
Code File Headers
For new contributors, include the following information at the top of your code files:
Your name(s)
Your NetID(s) (if applicable for UIUC students)
Paper title (if applicable to a reproducibility contribution)
Paper link (if applicable)
Description of the task/dataset/model you’re implementing
Code Style and Documentation
General Guidelines:
Use object-oriented programming with well-defined and typed functions
Follow snake_case naming for variables and functions (e.g., this_variable)
Use PascalCase for class names (e.g., ThisClass)
Follow PEP8 style with 88 character line length
Use Google style for docstrings
Function Documentation Requirements:
Each function must document:
Input arguments: Define variable types and descriptions
Output arguments: Define variable types and descriptions
High-level description of what the function does
Example use case or where it will be called
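As a minimal, self-contained sketch of these requirements, the function below documents its inputs, output, purpose, and intended caller in Google style. The helper `compute_age` is hypothetical and is not part of PyHealth; it only illustrates the docstring layout.

```python
from datetime import date


def compute_age(birth_date: date, reference_date: date) -> int:
    """Computes a patient's age in whole years at a reference date.

    Hypothetical helper for illustration; it is not part of PyHealth.
    A dataset parser could call it when deriving demographic features.

    Args:
        birth_date: The patient's date of birth.
        reference_date: The date at which the age is evaluated,
            e.g. an admission date.

    Returns:
        The age in completed years.

    Raises:
        ValueError: If `reference_date` is earlier than `birth_date`.
    """
    if reference_date < birth_date:
        raise ValueError("reference_date must not precede birth_date")
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)
```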
Example Well-Documented Function:
def parse_basic_info(self, patients: Dict[str, Patient]) -> Dict[str, Patient]:
    """Helper function that parses the patients and admissions tables.

    Will be called in `self.parse_tables()`.

    Docs:
        - patients: https://mimic.mit.edu/docs/iv/modules/hosp/patients/
        - admissions: https://mimic.mit.edu/docs/iv/modules/hosp/admissions/

    Args:
        patients: a dict of `Patient` objects indexed by patient_id.

    Returns:
        The updated patients dict.
    """
Types of Contributions
Contributing a Dataset
All datasets must follow these guidelines:
Inherit from BaseDataset: All datasets must inherit from the appropriate BaseDataset class
Follow established patterns:
For EHR datasets: See the MIMIC4 dataset example
For image datasets: See the CovidCXR dataset example where each folder represents a sample
Include a test task: Datasets should ideally have an associated task for testing purposes
Key Requirements:
Define all required variables outlined in the BaseDataset documentation
Provide clear data loading and processing methods
Include proper error handling and validation
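One way to approach the error handling and validation requirement is to check the dataset root before any parsing starts, so that a wrong path fails with a clear message. The sketch below is an assumption about how such a check could look; `validate_dataset_root` is a hypothetical helper and not part of the BaseDataset API.

```python
from pathlib import Path
from typing import List


def validate_dataset_root(root: str, required_files: List[str]) -> Path:
    """Checks that a dataset root exists and contains the required files.

    Hypothetical helper for illustration; not part of PyHealth.

    Args:
        root: Path to the directory holding the raw dataset files.
        required_files: File names that must be present under `root`.

    Returns:
        The validated root as a `Path`.

    Raises:
        NotADirectoryError: If `root` is not an existing directory.
        FileNotFoundError: If any required file is missing.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"Dataset root is not a directory: {root_path}")
    # Collect every missing file so the error reports all of them at once.
    missing = [name for name in required_files if not (root_path / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing required files under {root_path}: {', '.join(missing)}"
        )
    return root_path
```

A dataset constructor could call such a helper first and let the exception propagate, which keeps the failure close to its cause.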
Contributing a Task
Tasks must follow the established task class structure:
Inherit from base task class: Follow the pattern defined in existing tasks
Examples to reference:
Flexibility: Tasks can include various implementation details but must have clear inputs/outputs
Test cases: Include example test cases with defined inputs and expected outputs
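To make "clear inputs/outputs" concrete, here is a rough sketch of a task as a callable that maps one patient to a list of samples. Everything in it is an assumption for illustration: the class does not inherit from PyHealth's task base class, the attribute names `task_name`, `input_schema`, and `output_schema` are placeholders, and the patient is a plain dictionary rather than a PyHealth `Patient` object. Check the existing tasks in the codebase for the actual interface.

```python
from typing import Any, Dict, List


class MortalityTaskSketch:
    """Stand-in task for illustration; not PyHealth's actual task interface."""

    task_name: str = "mortality_sketch"
    # Declared inputs and outputs (hypothetical schema format).
    input_schema: Dict[str, str] = {"conditions": "sequence"}
    output_schema: Dict[str, str] = {"mortality": "binary"}

    def __call__(self, patient: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Builds one sample per visit that has at least one condition code.

        Args:
            patient: A dict with "patient_id" and a list of "visits"; each
                visit holds "codes" (list of str) and "died" (0 or 1).

        Returns:
            A list of sample dicts; empty if no visit is usable.
        """
        samples = []
        for index, visit in enumerate(patient.get("visits", [])):
            codes = visit.get("codes") or []
            if not codes:
                continue  # Skip visits without any condition codes.
            samples.append(
                {
                    "patient_id": patient["patient_id"],
                    "visit_index": index,
                    "conditions": codes,
                    "mortality": int(visit.get("died", 0)),
                }
            )
        return samples
```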
Contributing a Model
Models must follow the model base class structure:
Inherit from BaseModel: All models must inherit from the appropriate base model class
Reference implementation: See the RNN model example
Test cases: Include example test cases with dummy inputs and expected outputs
Key Requirements:
Implement required abstract methods from the base class
Provide clear forward pass implementation
Include proper initialization and configuration methods
Required File Updates
When contributing new features to PyHealth, you’ll need to create and update several files to ensure your contribution is properly integrated into the library. Below are detailed examples of what files should be included for different types of contributions.
Contributing a New Dataset: Complete File Checklist
When adding a new dataset (e.g., NewDataset), you should create and update the following files:
1. Core Implementation File
pyhealth/datasets/new_dataset.py
Inherits from appropriate BaseDataset class
Implements all required data loading and parsing methods
Includes comprehensive docstrings following Google style
Contains proper type hints for all methods
2. Documentation File
docs/api/datasets/pyhealth.datasets.new_dataset.rst
Create a new reStructuredText file documenting your dataset:
pyhealth.datasets.new_dataset
=============================

Overview
--------

Brief description of the dataset, its source, and key characteristics.

API Reference
-------------

.. autoclass:: pyhealth.datasets.NewDataset
    :members:
    :undoc-members:
    :show-inheritance:
3. Update Dataset Index
docs/api/datasets.rst
Add your new dataset to the table of contents:
.. toctree::
    :maxdepth: 4

    datasets/pyhealth.datasets.existing_dataset1
    datasets/pyhealth.datasets.existing_dataset2
    datasets/pyhealth.datasets.new_dataset
    ...
4. Test Case File
tests/core/test_new_dataset.py
Create comprehensive test cases that verify:
Dataset can be instantiated correctly
Data loading works with small synthetic data
All parsing methods execute without errors
Output formats match expected schemas
Edge cases are handled appropriately
Critical: Use Synthetic/Pseudo Data for Tests
Your test cases must use small, synthetic data rather than real datasets. Create minimal CSV files or in-memory data structures with just enough records to verify functionality:
Use 2-5 patients maximum (not 100!)
Create a few events per patient (5-10 records total)
Generate this data programmatically or as tiny fixture files in test-resources/
Ensure tests complete in milliseconds, not seconds
Example test structure with synthetic data:
import shutil
import tempfile
import unittest
from pathlib import Path

import pandas as pd

from pyhealth.datasets import NewDataset


class TestNewDataset(unittest.TestCase):
    def setUp(self):
        # Create temporary directory with synthetic data
        self.test_dir = tempfile.mkdtemp()

        # Create minimal synthetic patient data
        patients_df = pd.DataFrame({
            'patient_id': [1, 2, 3],
            'birth_date': ['1980-01-01', '1975-05-15', '1990-03-20'],
            'gender': ['M', 'F', 'M']
        })
        patients_df.to_csv(Path(self.test_dir) / 'patients.csv', index=False)

        # Create minimal synthetic events
        events_df = pd.DataFrame({
            'patient_id': [1, 1, 2, 3],
            'event_time': ['2020-01-01', '2020-01-05', '2020-02-01', '2020-03-01'],
            'event_code': ['D001', 'P002', 'D003', 'D001']
        })
        events_df.to_csv(Path(self.test_dir) / 'events.csv', index=False)

        # Instantiate dataset with synthetic data
        self.dataset = NewDataset(root=self.test_dir)

    def tearDown(self):
        # Remove the temporary directory created in setUp
        shutil.rmtree(self.test_dir, ignore_errors=True)

    def test_load_data(self):
        # Test that data loads correctly
        self.assertEqual(len(self.dataset.patients), 3)

    def test_parse_tables(self):
        # Test that parsing produces expected format
        patient = self.dataset.patients[1]
        self.assertIsNotNone(patient)
        self.assertTrue(hasattr(patient, 'visits'))
5. Associated Task (Optional but Recommended)
pyhealth/tasks/new_dataset_task.py
If your dataset enables a specific machine learning task, create a task file that:
Defines the task’s objective (e.g., mortality prediction, disease classification)
Specifies input features and label generation
Implements data preprocessing specific to the task
Includes example usage in docstrings
What Should Be Tested:
For each new dataset contribution, your test cases should verify:
Data Loading: The dataset can locate and load files from the specified directory
Patient Parsing: Patient-level information is correctly extracted and structured
Event Parsing: Clinical events (diagnoses, procedures, medications, etc.) are properly parsed
Data Integrity: No missing critical fields, appropriate data types, valid value ranges
Sample Output: At least one complete example showing input data → processed output
Performance: Tests run quickly (critical: always use tiny synthetic data, never real datasets)
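The data integrity item can be checked with a small pure-Python helper that a test calls on parsed rows. The sketch below is illustrative only: `find_integrity_issues` is a hypothetical function, and the column names are borrowed from the synthetic events table in the example test above.

```python
from datetime import datetime
from typing import Dict, List

# Columns assumed to be critical for an events table (illustrative choice).
REQUIRED_FIELDS = ("patient_id", "event_time", "event_code")


def find_integrity_issues(rows: List[Dict[str, str]]) -> List[str]:
    """Returns human-readable problems found in parsed event rows.

    Args:
        rows: Event records as dicts of column name to string value.

    Returns:
        One message per problem; an empty list means the rows passed.
    """
    issues = []
    for index, row in enumerate(rows):
        # Flag required fields that are absent or blank.
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                issues.append(f"row {index}: missing {field}")
        # Flag timestamps that are present but not valid YYYY-MM-DD dates.
        timestamp = (row.get("event_time") or "").strip()
        if timestamp:
            try:
                datetime.strptime(timestamp, "%Y-%m-%d")
            except ValueError:
                issues.append(f"row {index}: invalid event_time {timestamp!r}")
    return issues
```

Returning a list of messages rather than raising on the first problem lets a single assertion show every defect in a fixture at once.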
Why Synthetic Data Matters:
Speed: CI/CD pipelines must complete quickly; tests with real data can take minutes or hours
Reproducibility: Synthetic data ensures tests work without credentials or external dependencies
Maintainability: Small fixtures are easy to understand, modify, and debug
Coverage: You can create edge cases and corner cases easily with synthetic data
Contributing a New Task: File Checklist
Files Required:
pyhealth/tasks/new_task.py - Core task implementation
docs/api/tasks/pyhealth.tasks.new_task.rst - Documentation file
docs/api/tasks.rst - Update the task index
tests/core/test_new_task.py - Test cases with synthetic data
Example usage in examples/ directory (optional but encouraged)
Task Test Requirements:
Task tests should verify:
Task can process synthetic dataset samples correctly
Label generation works as expected
Feature extraction produces correct output format
Edge cases (missing data, empty visits) are handled gracefully
Use synthetic Patient objects or minimal datasets (2-5 patients) to test task logic quickly.
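The task checks listed above could be laid out as follows. This is a sketch under stated assumptions: `make_samples` is a hypothetical stand-in for the task logic under test, and patients are plain dictionaries rather than PyHealth `Patient` objects, so the example runs without the library.

```python
import unittest
from typing import Any, Dict, List


def make_samples(patient: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical task logic: one sample per visit that has codes."""
    samples = []
    for visit in patient.get("visits", []):
        codes = visit.get("codes") or []
        if codes:
            samples.append(
                {
                    "patient_id": patient["patient_id"],
                    "codes": codes,
                    "label": int(visit.get("label", 0)),
                }
            )
    return samples


class TestMakeSamples(unittest.TestCase):
    def test_label_and_output_format(self):
        patient = {"patient_id": "p1", "visits": [{"codes": ["D001"], "label": 1}]}
        samples = make_samples(patient)
        self.assertEqual(len(samples), 1)
        self.assertEqual(set(samples[0]), {"patient_id", "codes", "label"})
        self.assertEqual(samples[0]["label"], 1)

    def test_empty_visits_yield_no_samples(self):
        self.assertEqual(make_samples({"patient_id": "p2", "visits": []}), [])

    def test_visit_with_missing_codes_is_skipped(self):
        patient = {"patient_id": "p3", "visits": [{"label": 0}, {"codes": None}]}
        self.assertEqual(make_samples(patient), [])


if __name__ == "__main__":
    unittest.main()
```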
Contributing a New Model: File Checklist
Files Required:
pyhealth/models/new_model.py - Core model implementation
docs/api/models/pyhealth.models.new_model.rst - Documentation file
docs/api/models.rst - Update the model index
tests/core/test_new_model.py - Test cases with dummy data
Example usage in examples/ directory (optional but encouraged)
Model Test Requirements:
Model tests should verify:
Model can be instantiated with various configurations
Forward pass executes correctly with dummy inputs
Output shapes match expected dimensions
Gradient computation works (for trainable models)
Model can save and load state correctly
Use minimal synthetic tensors for model tests:
import unittest

import torch

from pyhealth.models import NewModel


class TestNewModel(unittest.TestCase):
    def test_forward_pass(self):
        # Use tiny dimensions for fast testing
        model = NewModel(
            input_dim=10,
            hidden_dim=8,  # Keep small!
            output_dim=2
        )

        # Create minimal synthetic input (batch_size=2)
        x = torch.randn(2, 10)

        # Test forward pass
        output = model(x)
        self.assertEqual(output.shape, (2, 2))
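The gradient and save/load checks from the list above can be sketched in the same style. `TinyModel` here is a hypothetical stand-in built from plain `torch.nn` layers, not a PyHealth model; a real test would instantiate the contributed model instead.

```python
import io
import unittest

import torch


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for a contributed model (not a PyHealth class)."""

    def __init__(self, input_dim: int = 10, hidden_dim: int = 4, output_dim: int = 2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TestTinyModel(unittest.TestCase):
    def setUp(self):
        torch.manual_seed(42)  # Deterministic weights and inputs.
        self.model = TinyModel()
        self.x = torch.randn(2, 10)  # batch_size=2

    def test_gradients_flow(self):
        # Backpropagate a scalar and check every parameter received a gradient.
        loss = self.model(self.x).sum()
        loss.backward()
        for name, param in self.model.named_parameters():
            self.assertIsNotNone(param.grad, msg=name)

    def test_state_round_trip(self):
        # Save to an in-memory buffer so the test touches no files.
        buffer = io.BytesIO()
        torch.save(self.model.state_dict(), buffer)
        buffer.seek(0)
        restored = TinyModel()
        restored.load_state_dict(torch.load(buffer))
        self.assertTrue(torch.equal(self.model(self.x), restored(self.x)))


if __name__ == "__main__":
    unittest.main()
```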
General Documentation Guidelines
For all contributions:
Keep documentation consistent with existing files in the same category
Include working examples in docstrings whenever possible
Reference related classes/methods using Sphinx cross-references
Update index files so your contribution appears in the documentation
Use small, reproducible examples in your documentation that others can easily run
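A working docstring example can double as a quick check when it is written in doctest form. The sketch below assumes a hypothetical helper, `normalize_code`, that is not part of PyHealth; the point is the `Examples:` section, which the standard-library `doctest` module can execute.

```python
import doctest


def normalize_code(code: str) -> str:
    """Normalizes a clinical code string for consistent lookup.

    Hypothetical helper for illustration; not part of PyHealth.

    Args:
        code: Raw code, possibly lowercase, padded, or containing dots.

    Returns:
        The code stripped of whitespace, uppercased, with dots removed.

    Examples:
        >>> normalize_code(" e11.9 ")
        'E119'
        >>> normalize_code("I10")
        'I10'
    """
    return code.strip().upper().replace(".", "")


if __name__ == "__main__":
    # Runs every docstring example in this module and reports mismatches.
    doctest.testmod()
```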
Test Case Requirements
Every contribution must include two types of test cases:
Automated tests: These will be run by our continuous integration system
Manual test cases: You must define these yourself with:
Clear input specifications
Expected output formats
Example usage demonstrating functionality
Note: You can use frontier LLMs to help generate basic test cases, which we consider valid as long as they are reasonable and comprehensive.
All unit tests should be placed in the tests/ directory following the existing structure, with ‘tests/core/’ for core functionality tests.
Writing Fast and Performant Tests
Test cases must run quickly in CI/CD. Follow these guidelines to ensure your tests are fast and efficient:
Creating Synthetic/Pseudo Data:
The most important rule: never use real datasets in test cases. Instead, create minimal synthetic data:
Generate data programmatically: Use pandas, numpy, or Python dictionaries to create tiny datasets in-memory
Use fixture files: Place small synthetic CSV/JSON files in test-resources/ (max a few KB each)
Keep it minimal: 2-5 patients, 5-20 events total - just enough to test logic
Use temporary directories: Create data in tempfile.mkdtemp() and clean up after tests
Example of creating synthetic data in a test:
import tempfile
from pathlib import Path

import pandas as pd


def setUp(self):
    # Create temporary directory
    self.test_dir = tempfile.mkdtemp()

    # Generate minimal synthetic CSV files
    pd.DataFrame({
        'id': [1, 2, 3],
        'value': ['A', 'B', 'C']
    }).to_csv(Path(self.test_dir) / 'data.csv', index=False)

    # Now use self.test_dir as your dataset root
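For the "clean up after tests" part, one option is to let the standard library remove the directory automatically. The sketch below uses `tempfile.TemporaryDirectory` together with `addCleanup` instead of `tempfile.mkdtemp()`, and writes the CSV with the `csv` module so it has no third-party dependencies; the file and column names are illustrative.

```python
import csv
import tempfile
import unittest
from pathlib import Path


class TestWithTemporaryFixture(unittest.TestCase):
    def setUp(self):
        # cleanup() deletes the directory and its contents; addCleanup
        # guarantees it runs even if the test fails.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.root = Path(self._tmp.name)

        # Write a three-row synthetic table.
        with open(self.root / "data.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["id", "value"])
            writer.writerows([[1, "A"], [2, "B"], [3, "C"]])

    def test_fixture_has_three_rows(self):
        with open(self.root / "data.csv", newline="") as handle:
            rows = list(csv.DictReader(handle))
        self.assertEqual([row["value"] for row in rows], ["A", "B", "C"])


if __name__ == "__main__":
    unittest.main()
```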
Additional Performance Guidelines:
Keep tests fast: Each test should complete in milliseconds; entire suite in seconds
Place core unit tests in tests/core/ and name files test_*.py
Avoid network access: No external APIs, downloads, or database connections
Use small configurations: Minimal sample sizes, tiny models (e.g., hidden_dim=4), batch_size=2, epochs=1
Make tests deterministic: Set random seeds (np.random.seed(42), torch.manual_seed(42))
Mock heavy operations: Stub or monkeypatch I/O, model training, or expensive computations when testing logic
Skip heavyweight tests: Use pytest markers for any tests that can’t be made fast; default tests must run quickly
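Several of these guidelines can be combined in one small test class. The sketch below is an illustration using only the standard library: it seeds `random` rather than NumPy or PyTorch, stubs a slow function with `unittest.mock.Mock`, and uses `unittest.skipUnless` with an environment variable in place of a pytest marker. The functions and the `RUN_SLOW_TESTS` variable are hypothetical names chosen for the example.

```python
import os
import random
import unittest
from unittest import mock


def expensive_embedding(code: str) -> list:
    """Placeholder for a slow call (hypothetical); unit tests must not run it."""
    raise RuntimeError("expensive_embedding should be stubbed in unit tests")


def embed_codes(codes: list, embed_fn=expensive_embedding) -> list:
    """Logic under test: embeds each code once, preserving order."""
    return [embed_fn(code) for code in codes]


class TestFastAndDeterministic(unittest.TestCase):
    def setUp(self):
        random.seed(42)  # Fixed seed so sampled fixtures match across runs.

    def test_embed_codes_with_stub(self):
        # Inject a cheap stub instead of the slow function.
        stub = mock.Mock(return_value=[0.0])
        self.assertEqual(embed_codes(["D001", "P002"], embed_fn=stub), [[0.0], [0.0]])
        self.assertEqual(stub.call_count, 2)

    def test_seeded_sampling_is_reproducible(self):
        first = random.sample(range(100), 3)
        random.seed(42)  # Reset to the same seed used in setUp.
        self.assertEqual(first, random.sample(range(100), 3))

    @unittest.skipUnless(
        os.environ.get("RUN_SLOW_TESTS"), "slow test; set RUN_SLOW_TESTS to opt in"
    )
    def test_heavyweight_path(self):
        # Anything that cannot be made fast lives behind the opt-in flag.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```

Passing the slow dependency as a parameter keeps the stub local to the test, which avoids patching module globals.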
What NOT to do in tests:
❌ Load MIMIC-III/IV demo datasets (too large)
❌ Download data from the internet
❌ Train models for multiple epochs
❌ Use real medical images or large files
❌ Require credentials or external databases
❌ Run tests that take > 1 second each
Pull Request Guidelines
Formatting Your Pull Request
Every pull request must include the following information in its description:
Who you are (include NetID if you’re an Illinois student)
Type of contribution (dataset, task, model, bug fix, etc.)
High-level description of what you’ve implemented
File guide: Quick rundown of which files to examine to test your implementation
Example PR Description:
**Contributor:** Jane Doe (jdoe2@illinois.edu)
**Contribution Type:** New Dataset
**Description:** Added support for the XYZ Hospital dataset with patient
admission records and diagnostic codes. Includes data preprocessing and
sample task for mortality prediction.
**Files to Review:**
- `pyhealth/datasets/xyz_hospital.py` - Main dataset implementation
- `pyhealth/tasks/xyz_mortality.py` - Example task
- `tests/core/test_xyz_dataset.py` - Test cases
Review Process
After submitting your pull request:
Maintainers will review your code for style, functionality, and completeness
Automated tests will be run to ensure compatibility
You may be asked to make revisions based on feedback
Once approved, your contribution will be merged into the main branch
Getting Help
If you need assistance:
Check existing issues and discussions on GitHub
Review similar implementations in the codebase
Reach out to maintainers through GitHub issues
Consider using LLMs to help with code formatting and documentation
We appreciate your contributions to making PyHealth better for the healthcare AI community!