J. Pers. Med. 2022, 12, 1974. https://doi.org/10.3390/jpm12121974 www.mdpi.com/journal/jpm
Article
The Penn Medicine BioBank: Towards a Genomics-Enabled
Learning Healthcare System to Accelerate Precision Medicine
in a Diverse Population
Anurag Verma 1,2,*, Scott M. Damrauer 2,3,4, Nawar Naseer 2, JoEllen Weaver 2, Colleen M. Kripke 1,2, Lindsay Guare 5, Giorgio Sirugo 1,2, Rachel L. Kember 6, Theodore G. Drivas 1, Scott M. Dudek 3, Yuki Bradford 3, Anastasia Lucas 3, Renae Judy 4, Shefali S. Verma 5, Emma Meagher 1,2, Katherine L. Nathanson 1,7, Michael Feldman 5, Marylyn D. Ritchie 3, Daniel J. Rader 1,2,3,* and The Penn Medicine BioBank †
1 Department of Medicine, Division of Translational Medicine and Human Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
3 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
4 Department of Surgery, Division of Vascular Surgery and Endovascular Therapy, University of Pennsylvania, Philadelphia, PA 19104, USA
5 Department of Pathology, University of Pennsylvania, Philadelphia, PA 19104, USA
6 Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
7 Abramson Cancer Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
* Correspondence: anurag.verma@pennmedicine.upenn.edu (A.V.); ra[email protected] (D.J.R.)
† A full list of contributions from Penn Medicine BioBank Team is provided in the supplement.
Abstract: The Penn Medicine BioBank (PMBB) is an electronic health record (EHR)-linked biobank at the University of Pennsylvania (Penn Medicine). A large variety of health-related information, ranging from diagnosis codes to laboratory measurements, imaging data and lifestyle information, is integrated with genomic and biomarker data in the PMBB to facilitate discoveries and translational science. To date, 174,712 participants have been enrolled into the PMBB, including approximately 30% of participants of non-European ancestry, making it one of the most diverse medical biobanks. There is a median of seven years of longitudinal data in the EHR available on participants, who also give permission to be recontacted. Herein, we describe the operations and infrastructure of the PMBB, summarize the phenotypic architecture of the enrolled participants, and use body mass index (BMI) as a proof-of-concept quantitative phenotype for PheWAS, LabWAS, and GWAS. The major representation of African-American participants in the PMBB addresses the essential need to expand the diversity in genetic and translational research. There is a critical need for a “medical biobank consortium” to facilitate replication, increase power for rare phenotypes and variants, and promote harmonized collaboration to optimize the potential for biological discovery and precision medicine.
Keywords: genomics; electronic health records; biobank; PMBB; precision medicine; learning health
system
Academic Editor: Michael F. Murray
Received: 8 October 2022
Accepted: 19 November 2022
Published: 29 November 2022
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Precision medicine incorporates clinical, environmental, lifestyle, family, and genomic data to tailor disease management and optimize disease prevention and health maintenance. Since the completion of the Human Genome Project, large-scale genomic information linked to individual-level phenotype data has fueled biomedical discovery and has become an integral component of precision medicine. Large amounts of clinical data are generated daily in clinical care and stored in the electronic health record (EHR). Linking the phenomic data encompassed in the EHR with biospecimens and genomic data from appropriately consented individuals at scale represents a tremendous opportunity for biomedical discovery and precision medicine. Penn Medicine, part of the University of Pennsylvania, is a large integrated academic health system with six hospitals and ten multispecialty centers that serve south-central Pennsylvania, south-central New Jersey, and northern Delaware. The Penn Medicine BioBank (PMBB) was launched with the intent of harnessing clinical data for discovery, creating a genomics-enabled learning healthcare system [1], and facilitating precision medicine for disease prevention and personalized therapy. Patients are enrolled under a single IRB-approved protocol that enables the acquisition of biospecimens, generation of genomic and multi-omic data, linkage to clinical information included in the EHR, and permission to recontact participants for future studies and/or the return of clinically relevant results. The scientific goal of the PMBB is to promote the integration of clinical and genomic data to power biomedical discovery and precision medicine. This report describes the operations and architecture of the PMBB and summarizes the information on the first ~170,000 participants recruited.
2. Materials and Methods
2.1. Planning and Development of PMBB
Recognizing the need for access to large numbers of appropriately consented and well-characterized human biospecimens to conduct translational research, Penn Medicine established an IRB protocol and process in 2008 for consenting patients and obtaining blood for genomic and biomarker research. In 2013, after a strategic planning process identified a pressing institutional mandate for an expanded biobank resource, the Penn Medicine BioBank (PMBB) was formally constituted, funded, and launched under the Institute for Translational Medicine and Therapeutics (ITMAT) to ensure that it was integrated with critical infrastructure relevant to clinical and translational research and precision medicine.
The PMBB protocol was established as an institutional umbrella protocol under which any registered patient of Penn Medicine aged 18 or older is eligible, with no exclusions except an inability to provide informed consent. The core features of the consent include: (1) provision of a blood sample for biobanking and broad use for data generation, including genomic data, and permission to bank any other residual tissues obtained in the context of clinical care; (2) permission to access data from the EHR for the purpose of research; and (3) permission to recontact participants for potential future studies or to return results.
2.2. Patient-Participant Recruitment
Initially, PMBB enrollment was accomplished through face-to-face encounters with clinical research coordinators (CRCs) in outpatient clinical areas, prioritizing locations where procedures that involved access to blood samples (phlebotomy labs, imaging procedures involving IV placement, cardiac catheterization labs, etc.) were being performed. After the onset of the COVID-19 pandemic, in August 2020, the PMBB transitioned to remote recruitment to prioritize patient-participant and staff safety, initiating an electronic consent and enrollment process through REDCap, a secure web platform for building and managing online databases and surveys. Simultaneously, a process for consent through the EHR (PennChart, Epic) was developed. It was initially carried out in person at patient check-in and was then expanded to include consent at pre-check-in through the myPennMedicine online patient portal, available through web browsers and mobile devices. Eligible patients scheduled for an upcoming outpatient office visit at one of the actively recruiting University of Pennsylvania Health System (UPHS) clinic sites (Figure 1E) receive the PMBB consent form as part of their electronic pre-visit check-in process and have the option to complete the form online through this portal. The three-page consent form includes a link to the PMBB website, the PMBB email address, and the PMBB enrollment telephone number, which is staffed by CRCs on weekdays from 7:30 a.m. to 5 p.m. local time to answer questions potential participants may have as they complete the consent procedure. A small percentage (~5%) of patients, either those not eligible to receive the online pre-check-in (e.g., certain surgical patients) or those who skip the consent form during their online pre-check-in, are consented in person by registration desk staff when they physically report for their appointment. PMBB brochures, with basic information about the PMBB, the PMBB website link, and contact information, are also available for registration desk staff to distribute to patients during the consent process. The website also contains a short video for patients explaining how PMBB participants contribute to scientific research.
Figure 1. Recruitment and Demographics. (A) Distribution of enrollment through paper and electronic consent. (B) Cumulative numbers of participants consented and biospecimen sample collection. (C) Distribution of participants by age group and self-reported sex. (D) Distribution of participants by self-reported race. (E) Density of recruitment around the six clinical sites of UPHS in Pennsylvania and New Jersey: Hospital of the University of Pennsylvania, Penn Presbyterian Medical Center, Pennsylvania Hospital, Chester County Hospital, Lancaster General Health, and Princeton Health.
2.3. Sample Collection
The PMBB patient participants consent to the collection of identifying information
(e.g., name, date of birth, medical record number, and contact information), information
from medical records (e.g., test results, medical procedures, medical diagnosis and proce-
dure codes, lab values, images such as X-rays, and medicines), blood samples (up to four
tablespoons), urine, saliva and/or respiratory specimens, and residual samples from clin-
ical pathology. There is no limit on the length of time samples may be kept in the biobank.
All blood samples are banked as whole blood, plasma, serum, buffy coat, and DNA for
future studies following stringent standard operating procedures. Samples are collected
in sterile vacutainer tubes barcoded with an identification number and scanned into a
system for sample tracking. Using the institution’s adopted laboratory information sys-
tem, LabVantage (LabVantage Solutions, Inc., Somerset, NJ, USA), biospecimens are pro-
cessed and tracked with time-date stamps to document processing and freezing times
throughout the laboratory workflow. Sample inventory is robustly supported with real-
time, adaptive sample pull lists and images of sample pull locations, as well as simulta-
neous creation of distribution boxes and decrement of sample aliquots.
Prior to the COVID-19 pandemic, blood samples were collected from patients at the point of enrollment by CRCs cross-trained as certified phlebotomists. With the transition to electronic consenting, the consenting and sample collection processes have been decoupled. For sample collection, an automated weekly report lists consented patients who have no banked blood sample and who have a phlebotomy appointment the following week at select Penn Medicine sites. Every Friday, three PMBB physicians place the electronic lab order for PMBB blood draws in the EHR of these patients. When a patient reports to the phlebotomy appointment, the phlebotomist adds the PMBB blood order to the patient’s existing orders and collects the blood sample. One 6 mL EDTA tube is collected per patient. All blood samples are transported to the centralized clinical laboratory and logged in prior to being transferred to the PMBB core laboratory, where the blood is processed and stored following standard operating procedures.
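The selection logic behind the weekly report described above can be sketched as follows. This is a minimal illustration only: the `Participant` record, its field names, and the Monday-to-Sunday definition of "the following week" are assumptions made for the example and do not describe the PMBB's actual reporting system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class Participant:
    # Hypothetical record; field names are illustrative, not the PMBB schema.
    participant_id: str
    consented: bool
    has_blood_sample: bool
    next_phlebotomy: Optional[date]  # None if no appointment is scheduled


def next_week_window(report_date: date) -> Tuple[date, date]:
    """Return the Monday and Sunday of the week after report_date (assumed window)."""
    days_to_monday = 7 - report_date.weekday()  # weekday(): Monday == 0
    start = report_date + timedelta(days=days_to_monday)
    return start, start + timedelta(days=6)


def weekly_draw_list(participants: List[Participant], report_date: date) -> List[str]:
    """List consented participants lacking a sample who have a draw next week."""
    start, end = next_week_window(report_date)
    return [
        p.participant_id
        for p in participants
        if p.consented
        and not p.has_blood_sample
        and p.next_phlebotomy is not None
        and start <= p.next_phlebotomy <= end
    ]
```

In this sketch, the returned identifiers would correspond to the patients for whom the PMBB lab order is placed on Friday.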
Additionally, the PMBB banks residual biospecimens and tissues (for example, blood, urine, cerebrospinal fluid, or tissue collected as part of clinical care) when available, as fresh, frozen, or fixed material depending on the tissue histology, following standard procedures. Residual tissues are released by the Department of Pathology after examination if the specimen or tissue is determined to be in excess of that required for patient care, or if the tissue or bodily fluid would otherwise be waste material.
2.4. Genomic Data Generation
2.4.1. Genotyping and Imputation
DNA samples from approximately 44,000 PMBB participants have been genotyped to date on the Illumina Global Screening Array v.2.0 (GSAv2) by the Regeneron Genomics Center (RGC). The genotyping array has a backbone of 654,027 genetic markers as well as additional ancestry-informative markers. In addition, approximately 8595 of the 44,000 PMBB participants were also genotyped at the Center for Applied Genomics (CAG) at the Children’s Hospital of Philadelphia on the GSAv1 and GSAv2 genotyping arrays. After sample-level quality control (QC), genotype imputation was performed using Eagle2 [2] and Minimac [3] software on the TOPMed Imputation Server [3]. Imputation was performed for all autosomes against the TOPMed version R2 reference panel on GRCh38. Cosmopolitan post-imputation QC included imputation score filtering (R2 > 0.7), removal of palindromic variants, a biallelic variant check, a sex check, genotype call rate (>99%) and sample call rate (>99%) filtering, minor allele frequency filtering (MAF > 1%), and a Hardy–Weinberg equilibrium test (p-value > 1 × 10^-8). We generated principal components (PCs) with EIGENSOFT [4] to adjust for population structure and to identify genetically informed ancestry (GIA).
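The variant-level part of the post-imputation QC can be expressed as a single predicate. The following is a minimal sketch, not the pipeline used for the PMBB: the `VariantStats` record and its field names are hypothetical, the Hardy–Weinberg threshold is read as 1 × 10^-8, and the sample-level steps (sex check, sample call rate) are omitted.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    # Hypothetical per-variant summary; field names are illustrative.
    variant_id: str
    ref: str
    alt: str              # a single alternate allele, i.e., a biallelic site
    imputation_r2: float
    call_rate: float      # fraction of samples with a called genotype
    maf: float            # minor allele frequency
    hwe_p: float          # Hardy-Weinberg equilibrium test p-value


COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ref: str, alt: str) -> bool:
    """A/T and C/G SNPs read the same on both strands."""
    return len(ref) == 1 and len(alt) == 1 and COMPLEMENT.get(ref) == alt


def passes_qc(v: VariantStats) -> bool:
    """Apply the variant-level thresholds quoted in the text."""
    return (
        v.imputation_r2 > 0.7
        and not is_palindromic(v.ref, v.alt)
        and v.call_rate > 0.99
        and v.maf > 0.01
        and v.hwe_p > 1e-8   # assumed reading of the reported threshold
    )
```

Keeping all thresholds in one function makes it easy to see that a variant must satisfy every criterion at once to be retained.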
2.4.2. Whole Exome Sequencing
Whole exome sequencing (WES) has been performed on approximately 44,000 participants to date by the RGC. DNA samples were processed with the custom IDT xGen v1 exome capture platform and sequenced on the Illumina NovaSeq 6000 system on S4 flow cells. Sequence alignment, variant identification, and genotype assignment were performed using the WeCall variant caller. Sample-level QC steps were then applied, and samples with sex errors, high rates of heterozygosity/contamination (D-stat > 0.4), low sequence coverage (less than 85% of targeted bases achieving 20X coverage), or genetically identified duplicates were excluded. Additional filters were applied to the pVCFs. Any SNV genotype with a read depth of less than seven reads (DP < 7) was changed to a no-call. After the application of the DP genotype filter, only the SNV variant sites that met at least one of the following two criteria were retained: (1) at least one heterozygous variant genotype with an allele balance ratio greater than or equal to 15% (AB ≥ 0.15); (2) at least one homozygous variant genotype. The same filtering was applied to INDEL variants but with an INDEL depth filter of DP < 10 and an INDEL allele balance cutoff of AB ≥ 0.20. Multi-allelic variant sites in the pVCF file were normalized by left-alignment and represented as bi-allelic. The variant frequency data for exome sequences and imputed data are available here: https://pmbb.med.upenn.edu/allele-frequency (accessed on 20 November 2022). The PMBB has a return of actionable results program for genomic findings that have a potential impact on participant health; this program is beyond the scope of this manuscript and will be described in a separate manuscript.
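The two-stage genotype and site filter described above (depth masking first, then site retention) can be sketched as follows. The `Genotype` record and the string labels for genotype calls are hypothetical conventions chosen for this example; they are not the pVCF representation used by the RGC.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Genotype:
    # Hypothetical per-sample genotype; labels are illustrative.
    call: Optional[str]     # "het", "hom_alt", "hom_ref", or None for a no-call
    depth: int              # read depth (DP)
    allele_balance: float   # alternate-allele reads / total reads (AB)


def filter_site(genotypes: List[Genotype], is_indel: bool) -> Tuple[List[Genotype], bool]:
    """Mask low-depth genotypes, then decide whether the site is retained."""
    min_dp = 10 if is_indel else 7
    min_ab = 0.20 if is_indel else 0.15

    # Stage 1: genotypes below the depth threshold become no-calls.
    masked = [
        Genotype(None, g.depth, g.allele_balance) if g.depth < min_dp else g
        for g in genotypes
    ]

    # Stage 2: keep the site if any surviving genotype is a sufficiently
    # balanced heterozygote or a homozygous variant call.
    keep = any(
        (g.call == "het" and g.allele_balance >= min_ab) or g.call == "hom_alt"
        for g in masked
    )
    return masked, keep
```

The order matters: a heterozygous call that would pass the allele balance cutoff no longer supports the site once its genotype has been masked for low depth.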
2.5. Clinical Data and Clinical Informatics Core
Clinical data are obtained from multiple sources, including a questionnaire completed at the time of recruitment, the Penn Medicine Clinical Data Warehouse, and the PennG&P (Penn Genotype & Phenotype) platform. PennG&P contains over 5.6 million patient records and other discrete clinical information amalgamated from 12 different source systems throughout the enterprise. PennG&P is based on a standard research data model, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [5], which is used worldwide by the Observational Health Data Sciences and Informatics (OHDSI) research consortium. It uses standardized vocabularies from national coding systems, such as SNOMED, LOINC [6], and RxNorm [7], for consistent terms and labeling of information. Additionally, the PMBB maps International Classification of Diseases (ICD)-9 and ICD-10 codes to 1866 discrete disease traits using Phecode groupings [8].
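Grouping ICD codes into Phecode-based disease traits amounts to a lookup followed by per-participant aggregation. The sketch below assumes a small in-memory dictionary; the three mapping entries are illustrative placeholders that should be checked against the published Phecode map, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Illustrative placeholder entries only; a real analysis would load the full
# published ICD-to-Phecode map rather than hard-code a few rows.
ICD_TO_PHECODE: Dict[Tuple[str, str], str] = {
    ("ICD10CM", "E11.9"): "250.2",   # type 2 diabetes
    ("ICD9CM", "250.00"): "250.2",   # type 2 diabetes
    ("ICD10CM", "I10"): "401.1",     # essential hypertension
}


def participant_phecodes(
    diagnoses: Iterable[Tuple[str, str, str]],
) -> Dict[str, Set[str]]:
    """Collapse (participant_id, vocabulary, code) rows into Phecode sets."""
    traits: Dict[str, Set[str]] = defaultdict(set)
    for participant_id, vocabulary, code in diagnoses:
        phecode = ICD_TO_PHECODE.get((vocabulary, code))
        if phecode is not None:        # unmapped codes are skipped
            traits[participant_id].add(phecode)
    return dict(traits)
```

Because ICD-9 and ICD-10 codes map to the same Phecode, records that span the coding transition collapse to one trait per participant.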
2.6. Access to Data and Biospecimens
In keeping with the expansive mission of the PMBB, data and biospecimens are available to investigators throughout the Perelman School of Medicine and Penn Medicine for a broad range of research. External collaborations, including those with other academic institutions as well as biopharma, are encouraged and proceed through scientific collaboration with identified local Penn investigators. All research studies must have study-specific IRB approval because the umbrella PMBB IRB protocol covers only sample and data acquisition, processing, storage, and dissemination. Researchers request access to data and biospecimens using a simple REDCap project proposal intake form, which is then reviewed by the PMBB Steering Committee. Proposals are evaluated for scientific merit and priority, the efficient use of data and biospecimens, and the ability to impact the care provided by Penn Medicine clinicians. The PMBB Steering Committee provides feedback to the investigator with either approval to move forward or concerns to be addressed. This REDCap project also includes a Data Access Agreement form that must be signed by investigators prior to gaining access to any PMBB data. Per the terms of this agreement, the sharing of PMBB data with additional collaborators, whether they are internal or external to Penn, must be handled by the PMBB and not the investigators themselves, to maintain the confidentiality and integrity of any protected health information (PHI) included within PMBB datasets.
Standardized EHR clinical data are deidentified and provided to approved investi-
gators in a secure computing environment. For assistance with the creation of more com-
plicated phenotypes, researchers have access to the Clinical Informatics Core (CIC), a
shared resource that is managed by the Institute for Biomedical Informatics in collabora-
tion with the PMBB. The CIC is staffed by clinical data scientists with expertise in data
extraction, natural language processing (NLP), and data analysis.
3. Results
3.1. Enrollment
During the initial phase of recruitment from 2008 to 2013, 13,366 Penn Medicine pa-
tients were enrolled (Figure 1A,B). In 2013, recruitment was expanded, resulting in a
steady increase in enrollment to ~71,000 participants by the end of 2019 (Figure 1A,B). In
2020, the transition to electronic consenting triggered a rapid expansion in PMBB enroll-
ment, with the total number of participants more than doubling between 2020 and 2022.
There were 174,712 total participants enrolled in the PMBB as of September 2022 (Table 1,
Figure 1A,B). Presently, nearly 3500 new participants are being enrolled weekly across
two Penn Medicine hospitals that are actively recruiting through all their ambulatory
sites, and recruitment at the other four Penn Medicine hospitals is targeted to begin in
2023. The PMBB currently has obtained and processed blood biospecimens from approx-
imately 50% of enrolled participants; the recent shift to an electronic consent process has
resulted in enrollment outpacing sample collection (Figure 1B). Active processes are un-
derway to obtain biospecimens on the remainder of enrolled individuals. The goal is to
enroll >1 million Penn Medicine patient participants, with >90% providing a blood sample
for DNA and biomarker studies.
The PMBB participant population currently represents approximately 2.5% of active Penn Medicine patients. Similar to the general Penn Medicine patient population, a slightly higher percentage of PMBB participants are female (55.9%) than male (44.1%) (Table 1, Figure 1C). The age distribution of PMBB participants also tracks with that of Penn Medicine patients, with participants ranging in age from 18 years to >100 years (Table 1 and Figure 1C). The age distributions differ slightly between the sexes, with males trending towards an older age (Figure 1C). The PMBB cohort is diverse: 17% of enrolled PMBB participants (and 25% of genotyped/sequenced participants) identify as African American, 4% as Asian, and 3.3% as Hispanic (Table 1; Figure 1D). With nearly 30,000 African-American patient participants currently enrolled, the PMBB has, to our knowledge, the largest number of African-American participants of any single-institution medical biobank in the US. As shown in Figure 1E, most biobank participants reside within the Philadelphia metropolitan area, including New Jersey and Delaware (53.6%); 3.3% of participants are from other states across the US.
Table 1. Comparison of PMBB Participant Characteristics with UPHS Patient Characteristics.
Characteristic | PMBB Participants, n (%) | Genotyped Participants, n (%) | UPHS Patients, n (%)
Total | 174,712 | 43,884 | 3,688,610
Gender
Female | 97,674 (55.9%) | 21,965 (50.1%) | 2,042,868 (55.4%)
Male | 77,055 (44.1%) | 21,918 (49.9%) | 1,604,210 (43.5%)
Other | 17 (<1%) | 1 (<1%) | 107 (<1%)
Age Range (Years) | 18–103 | 18–103 | 0–121
Age Groups
18–29 | 17,815 (10.2%) | 4302 (9.8%) | 392,532 (10.5%)
30–39 | 27,355 (15.7%) | 5406 (12.3%) | 518,679 (14.1%)
40–49 | 25,819 (14.8%) | 5688 (13.0%) | 430,489 (11.7%)
50–59 | 33,827 (19.4%) | 9519 (21.7%) | 458,952 (12.4%)
60–69 | 40,268 (23.0%) | 10,839 (24.7%) | 507,397 (13.8%)
70–79 | 28,582 (16.4%) | 5941 (13.5%) | 400,016 (10.8%)
80+ | 8811 (5.0%) | 2189 (5.0%) | 333,781 (9.0%)
Self-reported Race
African American | 29,372 (16.8%) | 10,815 (24.6%) | 672,461 (18.2%)
White | 124,406 (71.2%) | 29,329 (66.8%) | 2,029,684 (55.0%)
Asian | 7156 (4.1%) | 979 (2.2%) | 152,615 (4.1%)
Other | 9386 (5.4%) | 1372 (3.1%) | 370,313 (10.0%)
Unknown | 7499 (4.3%) | 1761 (4.0%) | 463,537 (12.6%)
Self-reported Ethnicity
Hispanic | 5715 (3.3%) | 1112 (2.5%) | 174,179 (4.7%)
Non-Hispanic | 165,713 (94.8%) | 42,425 (96.7%) | 3,290,018 (89.2%)
Unknown | 3284 (1.9%) | 347 (0.8%) | 183,723 (5.0%)
Genetically-Inferred Ancestry
African | N/A | 11,300 (25.7%) | N/A
European | N/A | 30,360 (69.2%) | N/A
East Asian | N/A | 680 (1.5%) | N/A
South Asian | N/A | 573 (1.3%) | N/A
Admixed American | N/A | 711 (1.6%) | N/A
Other | N/A | 301 (0.7%) | N/A
Median period of EHR follow-up since enrollment | 7 years | 5.7 years | N/A
Demographics based on EHR data on UPHS patients who have been seen at least once from 2008 to present. Data from the following UPHS sites were included: Hospital of the University of Pennsylvania, Penn Presbyterian Medical Center, Pennsylvania Hospital, Chester County Hospital, Princeton Health.
3.2. Clinical Data Availability
There are over 66.7 million data points collected in the form of encounters, diagnosis codes, procedure codes, and medication orders (Table 2). Across the PMBB cohort, over 46.7 million encounters, 10 million diagnosis codes, 3.6 million procedure codes, and 6.3 million medication orders have been recorded, averaging 268 encounters, 57 diagnosis codes, 21 procedure codes, and 36 medication orders per individual (Table 2). The most common diagnosis codes include hypertension, hypercholesterolemia, obesity, and insomnia (Figure 2).
Table 2. Clinical Data Availability for PMBB Participants.
Event Type | Total Number of Events | Mean Number of Events (SD) *
Encounter | 46,738,773 | 268 (321)
Diagnosis Code (number of condition-related visits) | 10,023,922 | 57.4 (58.7)
Procedure Code | 3,621,056 | 20.7 (26.7)
Medication Order | 27,914,486 | 159.8 (271.8)
* Average number of events each patient has for each event type.
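The means in Table 2 can be reproduced from the totals by simple division. The short sketch below assumes that the denominator is the full enrolled cohort of 174,712 participants, which the table does not state explicitly; the variable and function names are chosen for illustration.

```python
# Totals taken from Table 2; the denominator is assumed to be all enrolled
# participants (174,712), which is not stated explicitly in the table.
N_PARTICIPANTS = 174_712

TOTAL_EVENTS = {
    "Encounter": 46_738_773,
    "Diagnosis Code": 10_023_922,
    "Procedure Code": 3_621_056,
    "Medication Order": 27_914_486,
}


def mean_events(total: int, n_participants: int = N_PARTICIPANTS) -> float:
    """Mean number of events per participant."""
    return total / n_participants


MEAN_EVENTS = {
    event: round(mean_events(total), 1) for event, total in TOTAL_EVENTS.items()
}
```

Under this assumption, the rounded values agree with the means listed in Table 2.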
Figure 2. Prevalence of diagnosis codes among PMBB participants grouped by broader disease domain.
3.3. Phenome-Wide Association Study (PheWAS) of BMI
To evaluate the validity of the PMBB clinical data, a phenome-wide association study (PheWAS) was conducted using mean body mass index (BMI) as the exposure and 1856 disease traits, derived by grouping encounter diagnoses into phecodes, as the outcomes. All models were adjusted for age, sex, and self-reported race in the EHR. A total of 662 associations of BMI with disease traits across the 18 disease categories passed Bonferroni correction for multiple hypothesis testing (p < 2.6 × 10^-5). The strongest associations with BMI were with type 2 diabetes, hyperlipidemia, overweight, obesity, sleep apnea, and osteoarthritis (all p < 1 × 10^-331, Figure 3, Supplementary Table S1). The associations with mean BMI span the spectrum of disease categories, consistent with an effect of BMI on all organ systems (Figure 3). Additional associations include hypertension, heart failure, endometrial hyperplasia, bone fracture, depression, pregnancy complications, and respiratory failure (Figure 3).
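The Bonferroni threshold follows directly from the number of traits tested. The sketch below assumes a family-wise significance level of 0.05, which is not stated in the text; with 1856 tests it gives approximately 2.69 × 10^-5, in line with the reported cutoff.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed family-wise alpha of 0.05 across the 1856 phecode-based traits.
PHEWAS_THRESHOLD = bonferroni_threshold(0.05, 1856)


def is_significant(p_value: float, threshold: float = PHEWAS_THRESHOLD) -> bool:
    """True if an association passes the corrected threshold."""
    return p_value < threshold
```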
Figure 3. A phenome-wide association between mean body mass index and 1856 EHR-derived phenotypes.
3.4. Laboratory-Wide Association Study (LabWAS) of BMI
We extracted 24 clinical laboratory measurements from the EHR for all PMBB participants. These lab measurements were selected because they are common lab tests in the health system and each had been measured in at least 1000 individuals. We computed the median value of each lab for each individual and, as a proof-of-concept, evaluated the association of these medians with BMI. Linear regression was performed to test for association, and all models were adjusted for age, sex, and self-reported race. We replicated many known associations between BMI and lab values (Figure 4, Table S2). For example, blood glucose measures were significantly associated with increased BMI. Triglyceride levels were significantly positively associated and high-density lipoprotein cholesterol (HDL-C) levels were significantly inversely associated with BMI, as expected. Markers of inflammation were also significantly associated with BMI. In this proof-of-concept analysis of the lab measurements, the associations with BMI support the associations with the disease outcomes reported in the PheWAS.
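The data preparation for the LabWAS, reducing repeated measurements to one median per participant and lab and then keeping labs with enough participants, can be sketched as follows. The input layout and function names are assumptions made for this example, and the covariate-adjusted regression itself is not shown.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def per_participant_medians(
    measurements: Iterable[Tuple[str, str, float]],
) -> Dict[str, Dict[str, float]]:
    """Map lab name -> participant_id -> median of that participant's values.

    Each input row is (participant_id, lab_name, value); this layout is a
    hypothetical simplification of EHR lab results.
    """
    grouped = defaultdict(list)
    for participant_id, lab_name, value in measurements:
        grouped[(lab_name, participant_id)].append(value)

    medians: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (lab_name, participant_id), values in grouped.items():
        medians[lab_name][participant_id] = median(values)
    return dict(medians)


def labs_with_enough_participants(
    medians: Dict[str, Dict[str, float]], min_participants: int = 1000
) -> Dict[str, Dict[str, float]]:
    """Keep labs measured in at least min_participants individuals."""
    return {
        lab: values
        for lab, values in medians.items()
        if len(values) >= min_participants
    }
```

Using the median rather than the mean limits the influence of isolated extreme values, for example those recorded during an acute illness.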
Figure 4. A laboratory-wide association between mean body mass index and 24 laboratory measurements derived from the electronic health records.
3.5. Genome-Wide Association (GWAS) with BMI
As a proof-of-concept, we conducted a GWAS for median BMI within five genetically inferred ancestry groups. These included 30,360 EUR, 11,300 AFR, 711 AMR, 680 EAS, and 573 SAS individuals in the PMBB (Table 1). The analysis tested the association of ~7.6 million SNPs with MAF > 1% and imputation R2 > 0.3, using a linear mixed model implemented in REGENIE. All models were adjusted for age, sex, and the first six ancestry-specific principal components to account for population stratification. We then conducted a cross-ancestry meta-analysis by integrating GWAS summary statistics from each ancestry group using PLINK. Our meta-analysis identified 201 genome-wide significant SNP associations with BMI (p < 5 × 10^-8, Figure 5), replicating several previously reported associations in published GWAS of BMI. The strongest association in our PMBB analysis was with FTO variant rs55872725 (p = 4.7 × 10^-28, beta = 0.271), which has been previously reported. Other known associations included rs7559547 (p = 3.92 × 10^-14, beta = 0.41, TMEM18) and rs539515 (p = 9.02 × 10^-11, beta = 0.36, SEC16B), among others. Summary statistics of results with p < 1 × 10^-4 are available in Supplementary Table S3.
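To illustrate how per-ancestry summary statistics can be combined for a single SNP, the sketch below implements an inverse-variance-weighted fixed-effect meta-analysis. The text does not state which meta-analysis model was used in PLINK, so the fixed-effect choice, the function name, and the input layout are assumptions made for this example.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(estimates: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Combine per-group (beta, standard error) pairs for one SNP.

    Returns the pooled beta, its standard error, and a two-sided p-value
    from the normal approximation.
    """
    weights = [1.0 / (se ** 2) for _, se in estimates]   # inverse variance
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided
    return beta, se, p_value


GENOME_WIDE_SIGNIFICANCE = 5e-8  # threshold quoted in the text
```

Larger groups, which typically have smaller standard errors, receive more weight, so the pooled estimate in a cohort such as this one would be driven mainly by the EUR and AFR groups.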
Figure 5. Manhattan plot showing association between common genetic variants (MAF > 1%) and
BMI.
4. Discussion
The Penn Medicine BioBank was created to harness clinical data and biospecimens
for biomedical research within Penn Medicine, a large academic healthcare system.
Within a decade, it has emerged as a critical resource for translational medicine that has
fueled discovery science and facilitated precision medicine, empowering a genomics-en-
abled learning healthcare system. As of September 2022, the PMBB had enrolled over
174,000 participants, obtained biospecimens on >70,000 participants, and generated ge-
nomic data on ~44,000 of its participants. The PMBB intends to enroll >1 million partici-
pants, obtain biospecimens on >90% of participants, and generate genomic data on all par-
ticipants for whom biospecimens have been obtained.
The 2015 Institute of Medicine (now National Academy of Medicine) report on Trans-
lating Genomic-Based Research for Health [1] advocated for the development of ‘ge-
nomics-enabled learning healthcare systems’ based on the systematic summarized collec-
tion and use of genomic data, integrated with phenotypic data, to make discoveries and
enhance healthcare in large healthcare systems. More recently, in its strategic vision for
genomics research and application of genomics to clinical care, the National Human Ge-
nome Research Institute (NHGRI) emphasized the design and implementation of ge-
nomics-enabled learning healthcare systems to include infrastructure, resources, and tech-
nology development for genomics; the inclusion of underrepresented and minority
groups to make genomic research more equitable; the development of multi-omics studies
to get a comprehensive view of disease biology and the progression of diseases; and build-
ing tools to implement the knowledge back into the EHR to improve healthcare [9]. Large
disease-agnostic and diverse medical biobanks at academic medical centers, such as the
PMBB, are a critical component of fulfilling this vision.
Although health and disease traits have been recapitulated from structured diagnosis codes
and successfully integrated with genomic data [10–16], diagnosis codes remain crude
approximations of underlying biological traits. As such, the future of EHR-empowered
genomics research lies in ‘advanced phenotyping’ beyond diagnosis codes. These approaches
include laboratory data, medication data, and other forms
of structured data, all of which are relatively straightforward to access and use for re-
search. Laboratory data, procedure codes, family history, and billing codes are all being
mapped to concepts from various vocabularies (MONDO [17], SNOMED) to develop phe-
notypic algorithms that characterize the outcome of interest. Even more exciting and in-
formative is the extraction of meaningful quantitative information from unstructured data
(e.g., clinical notes, procedure reports, imaging reports, and pathology reports) using
natural language processing (NLP) methods. To this end, the PMBB has deployed NLP
software (Linguamatics, Cambridge, UK) to extract phenotypes from clinical notes and other
unstructured data in PMBB participants. Furthermore, there is an immense amount of
clinical imaging performed in medical centers and these images are another potential
source of phenotypic data, sometimes referred to as ‘imaging-derived phenotypes’ (IDPs).
In several ongoing PMBB efforts, deep learning and machine learning techniques are be-
ing used to translate imaging data, such as CT, MRI, and fMRI, into quantitative IDPs to
fuel new discovery.
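As an illustration of how structured data can be combined into a phenotypic algorithm, the following minimal Python sketch classifies type 2 diabetes status from diagnosis codes, laboratory values, and medication data. It is not the PMBB's algorithm: the `Participant` record, the `classify_t2d` function, the two-of-three evidence rule, and the minimum code count are hypothetical choices made for illustration, and a real algorithm would rely on curated and validated code lists.

```python
from dataclasses import dataclass, field

# Illustrative value sets only (assumptions, not a validated code list).
T2D_ICD10_PREFIXES = ("E11",)  # ICD-10-CM category for type 2 diabetes mellitus
HBA1C_THRESHOLD = 6.5          # percent; a commonly used diagnostic cutoff


@dataclass
class Participant:
    """Hypothetical, simplified view of one participant's structured EHR data."""
    participant_id: str
    icd10_codes: list[str] = field(default_factory=list)
    hba1c_values: list[float] = field(default_factory=list)
    on_glucose_lowering_drug: bool = False


def classify_t2d(p: Participant, min_code_count: int = 2) -> str:
    """Label a participant as 'case', 'control', or 'excluded'.

    A case needs at least two of three evidence types: repeated diagnosis
    codes, an elevated HbA1c value, or a glucose-lowering medication.
    A control has no evidence of any type. Everyone else is excluded
    as ambiguous.
    """
    n_codes = sum(code.startswith(T2D_ICD10_PREFIXES) for code in p.icd10_codes)
    has_lab = any(value >= HBA1C_THRESHOLD for value in p.hba1c_values)
    evidence = sum([n_codes >= min_code_count, has_lab, p.on_glucose_lowering_drug])

    if evidence >= 2:
        return "case"
    if n_codes == 0 and not has_lab and not p.on_glucose_lowering_drug:
        return "control"
    return "excluded"


example = Participant("P001", ["E11.9", "E11.65"], [7.1])
print(classify_t2d(example))
```

Requiring more than one evidence type, and excluding ambiguous participants from the control group, is one common way to reduce misclassification from isolated or rule-out diagnosis codes.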
Another approach to collecting additional phenotype and exposure data that are ab-
sent in the EHR is using participant questionnaires. During the COVID-19 pandemic, an
initial COVID-19 survey [18] was deployed to PMBB participants to collect information
on symptoms, co-morbidities, and outcomes related to COVID-19. As the pandemic has
progressed, we now administer an active longitudinal survey to follow participants for
up to 18 months from their first COVID-19 diagnosis, yielding insights into long COVID.
Combining survey results with biospecimen and EHR-derived phenotypes can shed light
on factors that predict the onset of disease, refine preventative care, and optimize
clinical trial design. Current efforts are focused on extending active data acquisition through
integrating mobile devices for both real-time data collection from survey questions and
biometric activity data.
A major focus of biobank research is the use of genomics to understand the genetic
architecture of health and disease and its implications for clinical care by linking phe-
nomic efforts with genomic data obtained from biospecimens. Leveraging these ap-
proaches, the PMBB has developed a robust and diverse genomics research enterprise.
Studies using PMBB data have highlighted the utility of the ‘genome-first’ approach
in studying rare variants at scale and identifying new associations between genes and
disease [10,11], as well as refined the range of the phenotypic presentation of individuals
carrying rare impactful variants in known disease genes [12,19].
To support equitable genomic research, a commitment to participant diversity has
been a hallmark feature of the PMBB since its inception. Seventeen percent of PMBB par-
ticipants (and 24% of those for whom biospecimens are currently available) are African
Americans or immigrants of African ancestry. This diversity has led to novel genomic and
biological insights that directly impact the health of underrepresented groups. For exam-
ple, recent work in the PMBB highlighted that hereditary amyloid transthyretin cardio-
myopathy was a common yet markedly underdiagnosed cause of heart failure among Af-
rican-American individuals [20], with many cases of the disease remaining undiagnosed
even at a tertiary medical center such as Penn Medicine. Given the availability of targeted
therapy, this finding advocates for the aggressive utilization of genomic and precision
medicine to diagnose transthyretin cardiomyopathy in this population. This ‘genome-
first’ approach is revealing an under-diagnosis of other genetic conditions. A ‘return of
actionable results’ program is underway, and the implications for the greater utilization
of genetic testing in clinical practice are clear.
The integration of genomic data into clinical practice is essential for the next genera-
tion of healthcare. Penn Medicine is at the forefront of developing techniques to provide
high-quality patient care based on real-world evidence and genomic discoveries [21,22].
An analysis of pharmacogenetic (PGx) variants in the PMBB concluded that anticipatory
genotyping can efficiently lead to the effective communication of PGx results to patients
[23]. Polygenic risk scores (PRS) have been posited as a novel approach to leverage
common-variant genetics for clinical care to predict the genetic risk for complex diseases,
although the clinical utility of this approach has yet to be fully determined [24,25].
Researchers utilizing PMBB data reported that PRS for psychiatric disorders [26] and substance use
disorders [27] show cross-trait associations beyond traditional diagnostic boundaries,
suggesting broad effects of genetic liability for these disorders. Furthermore, PRS in
PMBB participants significantly increased the ability to identify the cancer status of Euro-
pean individuals but not African Americans, underscoring the need for large-scale ge-
nomic studies on non-white populations [13].
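In its simplest form, a PRS is a weighted sum of effect-allele dosages across variants, with the weights taken from an external GWAS. The sketch below assumes that form. The function names, the variant identifiers, and the choice to standardize scores within a single reference group are illustrative assumptions, not the methods used in the cited PMBB studies.

```python
import statistics


def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: dict mapping variant ID -> effect-allele dosage (0.0 to 2.0)
    weights: dict mapping variant ID -> per-allele effect size from an
             external GWAS (for example, a log odds ratio)
    Variants missing from either input are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[v] * dosages[v] for v in shared)


def standardize(scores):
    """Convert raw scores to z-scores within one reference group."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mean) / sd for s in scores]


# Hypothetical inputs: only rs_a and rs_b appear in both dictionaries.
weights = {"rs_a": 0.5, "rs_b": 0.25, "rs_c": 1.0}
dosages = {"rs_a": 2.0, "rs_b": 1.0}
raw = polygenic_risk_score(dosages, weights)  # 0.5 * 2.0 + 0.25 * 1.0
print(raw)
```

Because the weights come from the GWAS discovery cohort, a score built this way can transfer poorly to ancestry groups that are underrepresented in that cohort, which is consistent with the performance gap described above.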
A critical feature of the PMBB is the availability of stored plasma and serum for bi-
omarker analyses and integration with clinical and genomic data. Although the effort and
expense of generating and storing plasma/serum aliquots are substantial, the benefit of
this approach is rapidly becoming apparent. Multiple investigators have utilized the ge-
nomic data to identify PMBB participants with genotypes of interest to then use stored
samples in cases and controls to measure biomarkers of interest. During the COVID-19
pandemic, access to stored samples from PMBB participants enrolled pre-pandemic who
subsequently developed COVID-19 permitted investigators to address a number of im-
portant research questions [28–31]. Now, with over seven years of median follow-up data
since the recruitment of PMBB patient participants, the stored samples are increasingly
precious for their use in assaying biomarkers that may be predictive of incident disease.
As methods for large-scale proteomics and metabolomics improve and the costs come
down, the opportunity to generate plasma-based large-scale omics and integrate with ge-
nomics and clinical data is increasingly feasible and promises to further enhance biologi-
cal discovery and precision medicine.
All PMBB participants consent to be recontacted, a critical feature of the protocol that
is useful for several purposes. Patient-reported surveys represent an important addition
to the EHR data for certain phenotypes as well as lifestyle and exposure data. Permission
to recontact facilitates ‘recall-by-genotype’ deep phenotyping studies, which represent a
tremendous opportunity for ‘genome-first’ discovery. Several investigators are actively
performing studies in which the genomic data are used to identify individuals that carry
rare variants in genes of interest or have a high polygenic risk for certain conditions, and
participants are contacted to consider participation in hypothesis-driven deep phenotyp-
ing studies. Deep phenotyping can include targeted imaging, immunological profiling,
provocative testing (e.g., oral glucose or fat tolerance test), creation of induced pluripotent
stem cells (iPSCs), or any number of other clinical phenotyping approaches driven by the
specific scientific question. Finally, the era of precision medicine will certainly include
many clinical trials that are targeted to individuals of a specific inherited genotype; large
medical biobanks with pre-existing genomic data, such as the PMBB, offer a fertile oppor-
tunity for the recruitment of individuals for such genotype-directed clinical trials.
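The selection step in a recall-by-genotype study, as described above, can be sketched as a filter over genotyped participants: keep those who carry a rare variant in a gene of interest or whose polygenic score falls in the upper tail, provided they have consented to recontact. Everything in this Python sketch is hypothetical, including the record keys, the `select_for_recall` function, and the quantile cutoff rule; it does not describe the PMBB's recruitment workflow.

```python
def select_for_recall(participants, gene, prs_quantile=0.95):
    """Return (participant_id, reason) pairs for candidates to recontact.

    participants: list of dicts with hypothetical keys
        'id' (str), 'consented_recontact' (bool),
        'rare_variant_genes' (set of gene symbols), 'prs' (float)
    gene: gene symbol of interest
    prs_quantile: approximate fraction of scores below the high-risk cutoff
    """
    if not participants:
        return []

    scores = sorted(p["prs"] for p in participants)
    # Position of the cutoff score, clamped so the index is always valid.
    k = min(int(prs_quantile * len(scores)), len(scores) - 1)
    cutoff = scores[k]

    selected = []
    for p in participants:
        if not p["consented_recontact"]:
            continue  # never recontact without consent
        if gene in p["rare_variant_genes"]:
            selected.append((p["id"], f"rare_variant:{gene}"))
        elif p["prs"] >= cutoff:
            selected.append((p["id"], "high_prs"))
    return selected


# Hypothetical cohort of four genotyped participants.
cohort = [
    {"id": "A", "consented_recontact": True, "rare_variant_genes": {"LMNA"}, "prs": 0.1},
    {"id": "B", "consented_recontact": True, "rare_variant_genes": set(), "prs": 2.0},
    {"id": "C", "consented_recontact": False, "rare_variant_genes": {"LMNA"}, "prs": 0.0},
    {"id": "D", "consented_recontact": True, "rare_variant_genes": set(), "prs": 0.2},
]
print(select_for_recall(cohort, "LMNA", prs_quantile=0.75))
```

Recording the reason for selection alongside each identifier keeps the two recruitment routes (rare-variant carriers and high polygenic risk) separable in downstream analyses.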
5. Conclusions
The PMBB is a disease-agnostic institutional biobank under a single umbrella proto-
col based at a large academic health system with the purpose of promoting a genomics-
enabled learning healthcare system to fuel scientific discovery, translational science, and
precision medicine. A comprehensive biobank of DNA, plasma, and serum on all partici-
pants with selected other specimens and tissues on a subset of participants is linked to
rich EHR clinical data, imaging, and survey data. The clinical database is standardized to
OMOP and contains demographics, diagnoses (e.g., ICD-9/ICD-10 codes), procedures (e.g.,
Current Procedural Terminology [CPT] codes), laboratory data, medication data, encounter
types, socioeconomic factors, and survey data. The initiation of e-consenting has led to
a substantial increase in the rate of enrollment. As of September 2022, genome-wide ge-
nomic data have been generated on ~44,000 participants and plasma multi-omics data on
several thousand participants. The substantial representation of African-American patient
participants in the PMBB addresses the urgent need to increase diversity in human genetic
studies. Researchers with approved IRB protocols can request access to biobank samples
and data through a data access portal. Publications supported by PMBB data and speci-
mens can be found here: https://pmbb.med.upenn.edu/pmbb/publications.html (accessed
on 20 November 2022). The PMBB is one of several large medical biobanks at academic
medical centers in the US and is strongly supportive of the creation of a ‘medical biobank
consortium’ to facilitate replication, increase power for rare phenotypes and variants, and
promote harmonized collaboration around genotype-directed deep phenotyping and re-
cruitment into clinical trials.
Supplementary Materials: The following supporting information can be downloaded at:
https://www.mdpi.com/article/10.3390/jpm12121974/s1, Table S1: Phenome-wide association study
between body mass index (mean values) and EHR-derived phecodes. Table S2: Laboratory-wide
association study between body mass index (mean values) and clinical labs from the EHR. Table S3:
Summary statistics for genome-wide association study for BMI. Supplemental document: A full list
of PMBB author contributions.
Author Contributions: Conceptualization, D.J.R., M.D.R., A.V., N.N., S.M.D. (Scott M. Damrauer),
M.F. and K.L.N.; methodology, A.V., N.N., D.J.R., J.W., M.F., K.L.N. and M.D.R.; validation, A.V.,
C.M.K. and L.G.; software, A.V. and S.M.D. (Scott M. Dudek); formal analysis, A.V., N.N., C.M.K.
and L.G.; data curation, N.N., C.M.K., L.G. and A.L.; writing—original draft preparation, J.W., A.V.,
N.N., D.J.R. and M.D.R.; writing—review and editing, S.M.D. (Scott M. Damrauer), S.S.V., G.S.,
R.L.K., T.G.D., S.M.D. (Scott M. Dudek), A.L., Y.B., R.J., E.M., K.L.N. and M.F.; visualization, N.N.,
C.M.K. and L.G.; supervision, D.J.R. and M.D.R.; project administration, J.W.; funding acquisition,
D.J.R. All authors have read and agreed to the published version of the manuscript.
Funding: The PMBB is supported by the Perelman School of Medicine at the University of Pennsylvania, a
gift from the Smilow family, and the National Center for Advancing Translational Sciences of the
National Institutes of Health under CTSA award number UL1TR001878. K.L.N. is supported by the
Basser Center for BRCA.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration
of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of the University
of Pennsylvania (protocol codes: 808346 approved 07/01/2008, 813913 approved 4/3/2013, and
817977 approved 6/6/2013) for studies involving humans.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the
study.
Data Availability Statement: All data used to generate the figures are available in the
Supplementary Materials.
Acknowledgments: We thank the patient-participants of Penn Medicine who consented to partici-
pate in this research program. We acknowledge the efforts of the PMBB staff (a full list of contribu-
tors is provided in the supplement). We thank the outstanding Penn Medicine Corporate IS team
(Jessica Chen, Christine Vanzandbergen, Jeffrey Landgraf, Colin Wollack, Ned Haubein) for its ma-
jor efforts to implement e-consenting in the EHR as well as biospecimen acquisition and tracking.
We thank the Regeneron Genetics Center for partnership in generating genetic variant data and for
scientific interactions. We thank the Smilow family for their generous gift that made the launch of
the PMBB possible. Finally, we would like to thank Kevin Mahoney, Jon Epstein, Larry Jameson
(Penn Medicine leadership), Michael Parmacek (Department of Medicine), Garret FitzGerald (Insti-
tute for Translational Medicine and Therapeutics), and David Roth (Penn Center for Precision Med-
icine) for their vision and support.
Conflicts of Interest: S.M.D. receives research funding from RenalytixAI, in-kind research support
from Novo Nordisk, and personal consulting fees from Calico Labs. D.J.R. serves on scientific advi-
sory boards for Alnylam, Novartis, Pfizer, and Verve. Regeneron has generated genomic data in
PMBB participants. These entities had no role in the design of the study; in the collection, analyses,
or interpretation of data; in the writing of this manuscript; or in the decision to publish these results.
References
1. Institute of Medicine. Genomics-Enabled Learning Health Care Systems: Gathering and Using Genomic Information to Improve Patient Care and
Research: Workshop Summary; National Academies Press (US): Washington, DC, USA, 2015; ISBN 978-0-309-37112-4.
2. Loh, P.-R.; Danecek, P.; Palamara, P.F.; Fuchsberger, C.; Reshef, Y.A.; Finucane, H.K.; Schoenherr, S.; Forer, L.; McCarthy, S.;
Abecasis, G.R.; et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016, 48, 1443–1448.
https://doi.org/10.1038/ng.3679.
3. Das, S.; Forer, L.; Schönherr, S.; Sidore, C.; Locke, A.E.; Kwong, A.; Vrieze, S.I.; Chew, E.Y.; Levy, S.; McGue, M.; et al.
Next-generation genotype imputation service and methods. Nat. Genet. 2016, 48, 1284–1287. https://doi.org/10.1038/ng.3656.
4. Price, A.L.; Patterson, N.J.; Plenge, R.M.; Weinblatt, M.E.; Shadick, N.A.; Reich, D. Principal components analysis corrects for
stratification in genome-wide association studies. Nat. Genet. 2006, 38, 904–909. https://doi.org/10.1038/ng1847.
5. Klann, J.G.; Joss, M.A.H.; Embree, K.; Murphy, S.N. Data model harmonization for the All Of Us Research Program:
Transforming i2b2 data into the OMOP common data model. PLoS ONE 2019, 14, e0212463.
https://doi.org/10.1371/journal.pone.0212463.
6. McDonald, C.J.; Huff, S.M.; Suico, J.G.; Hill, G.; Leavelle, D.; Aller, R.; Forrey, A.; Mercer, K.; DeMoor, G.; Hook, J.; et al. LOINC,
a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clin. Chem. 2003, 49, 624–633.
https://doi.org/10.1373/49.4.624.
7. Nelson, S.J.; Zeng, K.; Kilbourne, J.; Powell, T.; Moore, R. Normalized names for clinical drugs: RxNorm at 6 years. J. Am. Med.
Inform. Assoc. 2011, 18, 441–448. https://doi.org/10.1136/amiajnl-2011-000116.
8. Wu, P.; Gifford, A.; Meng, X.; Li, X.; Campbell, H.; Varley, T.; Zhao, J.; Carroll, R.; Bastarache, L.; Denny, J.C.; et al. Mapping
ICD-10 and ICD-10-CM Codes to Phecodes: Workflow Development and Initial Evaluation. JMIR Med. Inform. 2019, 7, e14325.
https://doi.org/10.2196/14325.
9. Green, E.D.; Gunter, C.; Biesecker, L.G.; Di Francesco, V.; Easter, C.L.; Feingold, E.A.; Felsenfeld, A.L.; Kaufman, D.J.; Ostrander,
E.A.; Pavan, W.J.; et al. Strategic vision for improving human health at The Forefront of Genomics. Nature 2020, 586, 683–692.
https://doi.org/10.1038/s41586-020-2817-4.
10. Park, J.; Levin, M.G.; Haggerty, C.M.; Hartzel, D.N.; Judy, R.; Kember, R.L.; Reza, N.; Ritchie, M.D.; Owens, A.T.; Damrauer,
S.M.; et al. A genome-first approach to aggregating rare genetic variants in LMNA for association with electronic health record
phenotypes. Genet. Med. 2020, 22, 102–111. https://doi.org/10.1038/s41436-019-0625-8.
11. Park, J.; Packard, E.A.; Levin, M.G.; Judy, R.L.; Regeneron Genetics Center; Damrauer, S.M.; Day, S.M.; Ritchie, M.D.; Rader,
D.J. A genome-first approach to rare variants in hypertrophic cardiomyopathy genes MYBPC3 and MYH7 in a medical biobank.
Hum. Mol. Genet. 2022, 31, 827–837. https://doi.org/10.1093/hmg/ddab249.
12. Damrauer, S.M.; Hardie, K.; Kember, R.L.; Judy, R.; Birtwell, D.; Williams, H.; Rader, D.J.; Pyeritz, R.E. FBN1 Coding Variants
and Nonsyndromic Aortic Disease. Circ. Genom. Precis. Med. 2019, 12, e002454. https://doi.org/10.1161/CIRCGEN.119.002454.
13. Wang, L.; Desai, H.; Verma, S.S.; Le, A.; Hausler, R.; Verma, A.; Judy, R.; Doucette, A.; Gabriel, P.E.; Nathanson, K.L.; et al.
Performance of polygenic risk scores for cancer prediction in a racially diverse academic biobank. Genet. Med. 2022, 24, 601–609.
https://doi.org/10.1016/j.gim.2021.10.015.
14. Kember, R.L.; Levin, M.G.; Cousminer, D.L.; Tsao, N.; Judy, R.; Schur, G.M.; Lubitz, S.A.; Ellinor, P.T.; McCormack, S.E.; Grant,
S.F.A.; et al. Genetically Determined Birthweight Associates With Atrial Fibrillation: A Mendelian Randomization Study. Circ.
Genom. Precis. Med. 2020, 13, e002553. https://doi.org/10.1161/CIRCGEN.119.002553.
15. Zhang, C.; Verma, A.; Feng, Y.; Melo, M.C.R.; McQuillan, M.; Hansen, M.; Lucas, A.; Park, J.; Ranciaro, A.; Thompson, S.; et al.
Impact of natural selection on global patterns of genetic variation and association with clinical phenotypes at genes involved in
SARS-CoV-2 infection. Proc. Natl. Acad. Sci. USA 2022, 119, e2123000119. https://doi.org/10.1073/pnas.2123000119.
16. Bajaj, A.; Ihegword, A.; Qiu, C.; Small, A.M.; Wei, W.-Q.; Bastarache, L.; Feng, Q.; Kember, R.L.; Risman, M.; Bloom, R.D.; et al.
Phenome-wide association analysis suggests the APOL1 linked disease spectrum primarily drives kidney-specific pathways.
Kidney Int. 2020, 97, 1032–1041. https://doi.org/10.1016/j.kint.2020.01.027.
17. Shefchek, K.A.; Harris, N.L.; Gargano, M.; Matentzoglu, N.; Unni, D.; Brush, M.; Keith, D.; Conlin, T.; Vasilevsky, N.; Zhang,
X.A.; et al. The Monarch Initiative in 2019: An integrative data and analytic platform connecting phenotypes to genotypes across
species. Nucleic Acids Res. 2020, 48, D704–D715. https://doi.org/10.1093/nar/gkz997.
18. Verma, S.S.; Chung, W.K.; Dudek, S.; Williamson, J.L.; Verma, A.; Robinson, S.; Rader, D.J.; Reilly, M.P.; Sengupta, S.; FitzGerald,
G.A.; et al. Research on COVID-19 through patient-reported data: A survey for observational studies in the COVID-19 pandemic.
J. Clin. Transl. Sci. 2021, 5, e17. https://doi.org/10.1017/cts.2020.509.
19. Drivas, T.G.; Lucas, A.; Zhang, X.; Ritchie, M.D. Mendelian pathway analysis of laboratory traits reveals distinct roles for ciliary
subcompartments in common disease pathogenesis. Am. J. Hum. Genet. 2021, 108, 482–501.
https://doi.org/10.1016/j.ajhg.2021.02.008.
20. Damrauer, S.M.; Chaudhary, K.; Cho, J.H.; Liang, L.W.; Argulian, E.; Chan, L.; Dobbyn, A.; Guerraty, M.A.; Judy, R.; Kay, J.; et
al. Association of the V122I Hereditary Transthyretin Amyloidosis Genetic Variant With Heart Failure Among Individuals of
African or Hispanic/Latino Ancestry. JAMA 2019, 322, 2191. https://doi.org/10.1001/jama.2019.17935.
21. Lau-Min, K.S.; Asher, S.B.; Chen, J.; Domchek, S.M.; Feldman, M.; Joffe, S.; Landgraf, J.; Speare, V.; Varughese, L.A.; Tuteja, S.;
et al. Real-world integration of genomic data into the electronic health record: The PennChart Genomics Initiative. Genet. Med.
2021, 23, 603–605. https://doi.org/10.1038/s41436-020-01056-y.
22. Lau-Min, K.S.; McKenna, D.; Asher, S.B.; Bardakjian, T.; Wollack, C.; Bleznuck, J.; Biros, D.; Anantharajah, A.; Clark, D.F.; Condit,
C.; et al. Impact of integrating genomic data into the electronic health record on genetics care delivery. Genet. Med. 2022, 24,
2338–2350. https://doi.org/10.1016/j.gim.2022.08.009.
23. Verma, S.S.; Keat, K.; Li, B.; Hoffecker, G.; Risman, M.; Regeneron Genetics Center; Sangkuhl, K.; Whirl-Carrillo, M.; Dudek, S.;
Verma, A.; et al. Evaluating the frequency and the impact of pharmacogenetic alleles in an ancestrally diverse Biobank
population. medRxiv 2022, 2022.08.26.22279261. https://doi.org/10.1101/2022.08.26.22279261.
24. Martin, A.R.; Kanai, M.; Kamatani, Y.; Okada, Y.; Neale, B.M.; Daly, M.J. Clinical use of current polygenic risk scores may
exacerbate health disparities. Nat. Genet. 2019, 51, 584–591. https://doi.org/10.1038/s41588-019-0379-x.
25. Sirugo, G.; Williams, S.M.; Tishkoff, S.A. The Missing Diversity in Human Genetic Studies. Cell 2019, 177, 26–31.
https://doi.org/10.1016/j.cell.2019.02.048.
26. Kember, R.L.; Merikangas, A.K.; Verma, S.S.; Verma, A.; Judy, R.; Regeneron Genetics Center; Damrauer, S.M.; Ritchie, M.D.;
Rader, D.J.; Bućan, M. Polygenic Risk of Psychiatric Disorders Exhibits Cross-trait Associations in Electronic Health Record
Data From European Ancestry Individuals. Biol. Psychiatry 2021, 89, 236–245. https://doi.org/10.1016/j.biopsych.2020.06.026.
27. Hartwell, E.E.; Merikangas, A.K.; Verma, S.S.; Ritchie, M.D.; Regeneron Genetics Center; Kranzler, H.R.; Kember, R.L. Genetic
liability for substance use associated with medical comorbidities in electronic health records of African- and European-ancestry
individuals. Addict. Biol. 2022, 27, e13099. https://doi.org/10.1111/adb.13099.
28. Alanio, C.; Verma, A.; Mathew, D.; Gouma, S.; Liang, G.; Dunn, T.; Oldridge, D.A.; Weaver, J.; Kuri-Cervantes, L.; Pampena,
M.B.; et al. Cytomegalovirus Latent Infection is Associated with an Increased Risk of COVID-19-Related Hospitalization. J. Infect.
Dis. 2022, 226, 463–473. https://doi.org/10.1093/infdis/jiac020.
29. Banday, A.R.; Stanifer, M.L.; Florez-Vargas, O.; Onabajo, O.O.; Papenberg, B.W.; Zahoor, M.A.; Mirabello, L.; Ring, T.J.; Lee, C.-H.;
Albert, P.S.; et al. Genetic regulation of OAS1 nonsense-mediated decay underlies association with COVID-19
hospitalization in patients of European and African ancestries. Nat. Genet. 2022, 54, 1103–1116.
https://doi.org/10.1038/s41588-022-01113-z.
30. Anderson, E.M.; Goodwin, E.C.; Verma, A.; Arevalo, C.P.; Bolton, M.J.; Weirick, M.E.; Gouma, S.; McAllister, C.M.; Christensen,
S.R.; Weaver, J.; et al. Seasonal human coronavirus antibodies are boosted upon SARS-CoV-2 infection but not associated with
protection. Cell 2021, 184, 1858–1864.e10. https://doi.org/10.1016/j.cell.2021.02.010.
31. Flannery, D.D.; Gouma, S.; Dhudasia, M.B.; Mukhopadhyay, S.; Pfeifer, M.R.; Woodford, E.C.; Gerber, J.S.; Arevalo, C.P.; Bolton,
M.J.; Weirick, M.E.; et al. SARS-CoV-2 seroprevalence among parturient women in Philadelphia. Sci. Immunol. 2020, 5, eabd5709.
https://doi.org/10.1126/sciimmunol.abd5709.
... However, the future of biomarker discovery is likely the ability to measure, compare and combine multiple variables and here, resources are key. The Penn Medicine Biobank [151] includes genetics and biomarkers alongside EMR to enable precision medicine and the discovery of new phenotypes. ...
Article
Full-text available
The explosion and abundance of digital data could facilitate large-scale research for psychiatry and mental health. Research using so-called “real world data”—such as electronic medical/health records—can be resource-efficient, facilitate rapid hypothesis generation and testing, complement existing evidence (e.g. from trials and evidence-synthesis) and may enable a route to translate evidence into clinically effective, outcomes-driven care for patient populations that may be under-represented. However, the interpretation and processing of real-world data sources is complex because the clinically important ‘signal’ is often contained in both structured and unstructured (narrative or “free-text”) data. Techniques for extracting meaningful information (signal) from unstructured text exist and have advanced the re-use of routinely collected clinical data, but these techniques require cautious evaluation. In this paper, we survey the opportunities, risks and progress made in the use of electronic medical record (real-world) data for psychiatric research.
... The UKB is composed of >500,000 participating individuals aged 37-73 years at the time of recruitment, who underwent various questionnaires, physical measurements, biological sampling (blood and urine), and genome sequencing across 22 assessment centers in the UK 30 . A subset of participants were invited to complete an additional examination that included magnetic resonance imaging of the heart 10 .The PMBB is composed of 174,712 consenting patients of the Penn Medicine health network, with a subset of 44,000 participants with available genotyping data 11 . Additionally, all participants medical records including imaging results are de-identified and linked to their identifier. ...
Preprint
Full-text available
Aortic structure and function impact cardiovascular health through multiple mechanisms. Aortic structural degeneration increases left ventricular afterload, pulse pressure and promotes target organ damage. Despite the impact of aortic structure on cardiovascular health, aortic 3D-geometry has yet to be comprehensively assessed. Using a convolutional neural network (U-Net) combined with morphological operations, we quantified aortic 3D-geometric phenotypes (AGPs) from 53,612 participants in the UK Biobank and 8,066 participants in the Penn Medicine Biobank. AGPs reflective of structural aortic degeneration, characterized by arch unfolding, descending aortic lengthening and luminal dilation exhibited cross-sectional associations with hypertension and cardiac diseases, and were predictive for new-onset hypertension, heart failure, cardiomyopathy, and atrial fibrillation. We identified 237 novel genetic loci associated with 3D-AGPs. Fibrillin-2 gene polymorphisms were identified as key determinants of aortic arch-3D structure. Mendelian randomization identified putative causal effects of aortic geometry on the risk of chronic kidney disease and stroke.
... The Penn Medicine Biobank (PMBB) is a large academic medical biobank in which participants are agnostically recruited from the outpatient setting and consented for access to their EHR data and permission to generate genomic and biomarker data [22]. The study flowchart is illustrated in Additional file 1: Fig. S1. ...
Article
Full-text available
Background Previous studies have shown that lifestyle/environmental factors could accelerate the development of age-related hearing loss (ARHL). However, there has not yet been a study investigating the joint association among genetics, lifestyle/environmental factors, and adherence to healthy lifestyle for risk of ARHL. We aimed to assess the association between ARHL genetic variants, lifestyle/environmental factors, and adherence to healthy lifestyle as pertains to risk of ARHL. Methods This case–control study included 376,464 European individuals aged 40 to 69 years, enrolled between 2006 and 2010 in the UK Biobank (UKBB). As a replication set, we also included a total of 26,523 individuals considered of European ancestry and 9834 individuals considered of African-American ancestry through the Penn Medicine Biobank (PMBB). The polygenic risk score (PRS) for ARHL was derived from a sensorineural hearing loss genome-wide association study from the FinnGen Consortium and categorized as low, intermediate, high, and very high. We selected lifestyle/environmental factors that have been previously studied in association with hearing loss. A composite healthy lifestyle score was determined using seven selected lifestyle behaviors and one environmental factor. Results Of the 376,464 participants, 87,066 (23.1%) cases belonged to the ARHL group, and 289,398 (76.9%) individuals comprised the control group in the UKBB. A very high PRS for ARHL had a 49% higher risk of ARHL than those with low PRS (adjusted OR, 1.49; 95% CI, 1.36–1.62; P < .001), which was replicated in the PMBB cohort. A very poor lifestyle was also associated with risk of ARHL (adjusted OR, 3.03; 95% CI, 2.75–3.35; P < .001). These risk factors showed joint effects with the risk of ARHL. Conversely, adherence to healthy lifestyle in relation to hearing mostly attenuated the risk of ARHL even in individuals with very high PRS (adjusted OR, 0.21; 95% CI, 0.09–0.52; P < .001). 
Conclusions Our findings of this study demonstrated a significant joint association between genetic and lifestyle factors regarding ARHL. In addition, our analysis suggested that lifestyle adherence in individuals with high genetic risk could reduce the risk of ARHL.
... The Penn Medicine Biobank (PMBB) is a large academic medical biobank in which participants are agnostically recruited from the outpatient setting and consented for access to their EHR data and permission to generate genomic and biomarker data [12]. The study flowchart is illustrated in Additional file 1: Fig. S1. ...
Article
Full-text available
Background Numerous observational studies have highlighted associations of genetic predisposition of head and neck squamous cell carcinoma (HNSCC) with diverse risk factors, but these findings are constrained by design limitations of observational studies. In this study, we utilized a phenome-wide association study (PheWAS) approach, incorporating a polygenic risk score (PRS) derived from a wide array of genomic variants, to systematically investigate phenotypes associated with genetic predisposition to HNSCC. Furthermore, we validated our findings across heterogeneous cohorts, enhancing the robustness and generalizability of our results. Methods We derived PRSs for HNSCC and its subgroups, oropharyngeal cancer and oral cancer, using large-scale genome-wide association study summary statistics from the Genetic Associations and Mechanisms in Oncology Network. We conducted a comprehensive investigation, leveraging genotyping data and electronic health records from 308,492 individuals in the UK Biobank and 38,401 individuals in the Penn Medicine Biobank (PMBB), and subsequently performed PheWAS to elucidate the associations between PRS and a wide spectrum of phenotypes. Results We revealed the HNSCC PRS showed significant association with phenotypes related to tobacco use disorder (OR, 1.06; 95% CI, 1.05–1.08; P = 3.50 × 10⁻¹⁵), alcoholism (OR, 1.06; 95% CI, 1.04–1.09; P = 6.14 × 10⁻⁹), alcohol-related disorders (OR, 1.08; 95% CI, 1.05–1.11; P = 1.09 × 10⁻⁸), emphysema (OR, 1.11; 95% CI, 1.06–1.16; P = 5.48 × 10⁻⁶), chronic airway obstruction (OR, 1.05; 95% CI, 1.03–1.07; P = 2.64 × 10⁻⁵), and cancer of bronchus (OR, 1.08; 95% CI, 1.04–1.13; P = 4.68 × 10⁻⁵). These findings were replicated in the PMBB cohort, and sensitivity analyses, including the exclusion of HNSCC cases and the major histocompatibility complex locus, confirmed the robustness of these associations. 
Additionally, we identified significant associations between HNSCC PRS and lifestyle factors related to smoking and alcohol consumption. Conclusions The study demonstrated the potential of PRS-based PheWAS in revealing associations between genetic risk factors for HNSCC and various phenotypic traits. The findings emphasized the importance of considering genetic susceptibility in understanding HNSCC and highlighted shared genetic bases between HNSCC and other health conditions and lifestyles.
Article
Genomic ascertainment is the inversion of the traditional phenotype-first approach; with a “genome-first” approach, a cohort linked to electronic health records (EHR) undergoes germline sequencing (array, panel, exome, and genome) and deleterious variation of interest in a gene (or set of genes) are identified. Phenotype is then queried from the linked EHR and from call-back investigation and estimates of variant prevalence, disease penetrance, and phenotype can be determined. This should permit a better estimate of the full phenotypic spectrum, severity, and penetrance linked to a deleterious variant. For now, given the modest size, limited EHR, and age of participants in sequenced cohorts, genomic ascertainment approaches to investigate cancer in children and young adults will likely be restricted to descriptive studies and complement traditional phenotype-first work. Another issue is the ascertainment of the cohort itself: Participants need to survive long enough to enroll. Not accounting for this may lead to bias and incorrect estimates of variant prevalence. Adult-focused cohorts with EHR extending back into childhood, linked to cancer registries, and/or studies that permit recontact with participants may facilitate genomic ascertainment in pediatric cancer research. In summary, genomic ascertainment in pediatric primary brain cancer research remains largely untapped and merits further investigation.
Article
Tacrolimus metabolism is heavily influenced by the CYP3A5 genotype, which varies widely among African Americans (AA). We aimed to assess the performance of a published genotype‐informed tacrolimus dosing model in an independent set of adult AA kidney transplant (KTx) recipients. CYP3A5 genotypes were obtained for all AA KTx recipients (n = 232) from 2010 to 2019 who met inclusion criteria at a single transplant center in Philadelphia, Pennsylvania, USA. Medical record data were used to calculate predicted tacrolimus clearance using the published AA KTx dosing equation and two modified iterations. Observed and model‐predicted trough levels were compared at 3 days, 3 months, and 6 months post‐transplant. The mean prediction error at day 3 post‐transplant was 3.05 ng/mL, indicating that the model tended to overpredict the tacrolimus trough. This bias improved over time to 1.36 and 0.78 ng/mL at 3 and 6 months post‐transplant, respectively. Mean absolute prediction error—a marker of model precision—improved with time to 2.33 ng/mL at 6 months. Limiting genotype data in the model decreased bias and improved precision. The bias and precision of the published model improved over time and were comparable to studies in previous cohorts. The overprediction observed by the published model may represent overfitting to the initial cohort, possibly limiting generalizability.
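The two accuracy metrics named in this abstract can be illustrated with a minimal sketch. It assumes the sign convention implied by the abstract (error = predicted minus observed, so a positive mean indicates overprediction); the function name `prediction_errors` and the trough values are hypothetical and are not taken from the study.

```python
from statistics import mean


def prediction_errors(predicted, observed):
    """Return (mean prediction error, mean absolute prediction error).

    Errors are computed as predicted - observed (assumed convention),
    so a positive mean prediction error indicates overprediction.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - o for p, o in zip(predicted, observed)]
    bias = mean(errors)                         # signed: measures bias
    precision = mean(abs(e) for e in errors)    # unsigned: measures precision
    return bias, precision


# Hypothetical trough levels in ng/mL, for illustration only.
predicted = [10.0, 8.0, 12.0, 9.0]
observed = [7.0, 8.5, 9.0, 6.0]

bias, precision = prediction_errors(predicted, observed)
print(bias, precision)  # prints "2.125 2.375"
```

The signed mean lets over- and underpredictions cancel, which is why it is read as bias, whereas the absolute mean does not, which is why it is read as precision.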
Article
Tobacco use disorder (TUD) is the most prevalent substance use disorder in the world. Genetic factors influence smoking behaviours and although strides have been made using genome-wide association studies to identify risk variants, most variants identified have been for nicotine consumption, rather than TUD. Here we leveraged four US biobanks to perform a multi-ancestral meta-analysis of TUD (derived via electronic health records) in 653,790 individuals (495,005 European, 114,420 African American and 44,365 Latin American) and data from UK Biobank (combined n = 898,680). We identified 88 independent risk loci; integration with functional genomic tools uncovered 461 potential risk genes, primarily expressed in the brain. TUD was genetically correlated with smoking and psychiatric traits from traditionally ascertained cohorts, externalizing behaviours in children and hundreds of medical outcomes, including HIV infection, heart disease and pain. This work furthers our biological understanding of TUD and establishes electronic health records as a source of phenotypic information for studying the genetics of TUD.
Article
Chronic pain is a common problem, with more than one-fifth of adult Americans reporting pain daily or on most days. It adversely affects the quality of life and imposes substantial personal and economic costs. Efforts to treat chronic pain using opioids had a central role in precipitating the opioid crisis. Despite an estimated heritability of 25–50%, the genetic architecture of chronic pain is not well-characterized, in part because studies have largely been limited to samples of European ancestry. To help address this knowledge gap, we conducted a cross-ancestry meta-analysis of pain intensity in 598,339 participants in the Million Veteran Program, which identified 125 independent genetic loci, 66 of which are new. Pain intensity was genetically correlated with other pain phenotypes, level of substance use and substance use disorders, other psychiatric traits, education level and cognitive traits. Integration of the genome-wide association study findings with functional genomics data shows enrichment for putatively causal genes (n = 142) and proteins (n = 14) expressed in brain tissues, specifically in GABAergic neurons. Drug repurposing analysis identified anticonvulsants, β-blockers and calcium-channel blockers, among other drug groups, as having potential analgesic effects. Our results provide insights into key molecular contributors to the experience of pain and highlight attractive drug targets.
Preprint
Background Pharmacogenomics (PGx) aims to utilize a patient’s genetic data to enable safer and more effective prescribing of medications. The Clinical Pharmacogenetics Implementation Consortium (CPIC) provides guidelines with strong evidence for 24 genes that affect 72 medications. Despite strong evidence linking PGx alleles to drug response, there is a large gap in the implementation and return of actionable pharmacogenetic findings to patients in standard clinical practice. In this study, we evaluated opportunities for genetically guided medication prescribing in a diverse health system and determined the frequencies of actionable PGx alleles in an ancestrally diverse biobank population. Methods A retrospective analysis of the Penn Medicine electronic health records (EHRs), which include ∼3.3 million patients between 2012 and 2020, provides a snapshot of the trends in prescriptions for drugs with genotype-based prescribing guidelines (‘CPIC level A or B’) in the Penn Medicine health system. The Penn Medicine BioBank (PMBB) consists of a diverse group of 43,359 participants whose EHRs are linked to genome-wide SNP array and whole exome sequencing (WES) data. We used the Pharmacogenomics Clinical Annotation Tool (PharmCAT) to annotate PGx alleles from PMBB variant call format (VCF) files and identify samples with actionable PGx alleles. Results We identified ∼316,000 unique patients who were prescribed at least 2 drugs with CPIC Level A or B guidelines. Genetic analysis in PMBB identified that 98.9% of participants carry one or more PGx actionable alleles where treatment modification would be recommended. After linking the genetic data with prescription data from the EHR, 14.2% of participants (n=6157) were prescribed medications that could be impacted by their genotype (as indicated by their PharmCAT report). For example, 856 participants who carried CYP2C19 reduced function alleles received clopidogrel, placing them at increased risk for major adverse cardiovascular events. When we stratified by genetic ancestry, we found disparities in PGx allele frequencies and clinical burden. Clopidogrel users of Asian ancestry in PMBB had significantly higher rates of CYP2C19 actionable alleles than European ancestry users of clopidogrel (p<0.0001, OR=3.68). Conclusions Clinically actionable PGx alleles are highly prevalent in our health system and many patients were prescribed medications that could be affected by PGx alleles. These results illustrate the potential utility of preemptive genotyping for tailoring of medications and implementation of PGx into routine clinical care.
Article
The chr12q24.13 locus encoding OAS1–OAS3 antiviral proteins has been associated with coronavirus disease 2019 (COVID-19) susceptibility. Here, we report genetic, functional and clinical insights into this locus in relation to COVID-19 severity. In our analysis of patients of European (n = 2,249) and African (n = 835) ancestries with hospitalized versus nonhospitalized COVID-19, the risk of hospitalized disease was associated with a common OAS1 haplotype, which was also associated with reduced severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) clearance in a clinical trial with pegIFN-λ1. Bioinformatic analyses and in vitro studies reveal the functional contribution of two associated OAS1 exonic variants comprising the risk haplotype. Derived human-specific alleles rs10774671-A and rs1131454-A decrease OAS1 protein abundance through allele-specific regulation of splicing and nonsense-mediated decay (NMD). We conclude that decreased OAS1 expression due to a common haplotype contributes to COVID-19 severity. Our results provide insight into molecular mechanisms through which early treatment with interferons could accelerate SARS-CoV-2 clearance and mitigate severe COVID-19. Genetic and functional studies implicate allele-specific regulation of OAS1 splicing and nonsense-mediated decay in COVID-19 severity. The OAS1 risk haplotype is also associated with reduced SARS-CoV-2 clearance in a clinical trial with pegIFN-λ1.
Article
Significance Viruses have been strong sources of natural selection pressure throughout human evolutionary history. Investigating genetic diversity and detecting signatures of natural selection at host genes related to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection help to identify functionally important variation. We conducted a large study of global genomic variation at host genes that play a role in SARS-CoV-2 infection with a focus on underrepresented African populations. We identified nonsynonymous and regulatory variants at ACE2 that appear to be targets of recent natural selection in some African populations. We detected evidence of ancient adaptive evolution at TMPRSS2 in the human lineage. Genetic variants that are targets of natural selection are associated with clinical phenotypes common in patients with coronavirus disease 2019.
Article
Some risk factors for severe COVID-19 have been identified, including age, race, and obesity. However, 20-50% of severe cases occur in the absence of these factors. Cytomegalovirus (CMV) is a herpesvirus that infects ~50% of all individuals worldwide and is one of the most significant non-genetic determinants of the immune system. We hypothesized that latent CMV infection might influence the severity of COVID-19. Our analyses demonstrate that CMV seropositivity is associated with more than twice the risk of hospitalization due to SARS-CoV-2 infection. Immune profiling of blood and CMV DNA qPCR in a subset of patients for whom respiratory tract samples were available revealed altered T cell activation profiles in the absence of extensive CMV replication in the upper respiratory tract. These data suggest a potential role for CMV-driven immune perturbations in affecting the outcome of SARS-CoV-2 infection and may have implications for the discrepancies in COVID-19 severity between different human populations.
Article
Purpose Genome-wide association studies have identified hundreds of single nucleotide variations (formerly single nucleotide polymorphisms) associated with several cancers, but the predictive ability of polygenic risk scores (PRSs) is unclear, especially among non-Whites. Methods PRSs were derived from genome-wide significant single-nucleotide variations for 15 cancers in 20,079 individuals in an academic biobank. We evaluated the improvement in discriminatory accuracy by including cancer-specific PRS in patients of genetically-determined African and European ancestry. Results Among the individuals of European genetic ancestry, PRSs for breast, colon, melanoma, and prostate were significantly associated with their respective cancers. Among the individuals of African genetic ancestry, PRSs for breast, colon, prostate, and thyroid were significantly associated with their respective cancers. The area under the curve of the model consisting of age, sex, and principal components was 0.621 to 0.710, and it increased by 1% to 4% with the inclusion of PRS in individuals of European genetic ancestry. In individuals of African genetic ancestry, area under the curve was overall higher in the model without the PRS (0.723-0.810) but increased by <1% with the inclusion of PRS for most cancers. Conclusion PRS moderately increased the ability to discriminate the cancer status in individuals of European but not African ancestry. Further large-scale studies are needed to identify ancestry-specific genetic factors in non-White populations to incorporate PRS into cancer risk assessment.
Article
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly spread within the human population. Although SARS-CoV-2 is a novel coronavirus, most humans had been previously exposed to other antigenically distinct common seasonal human coronaviruses (hCoVs) before the COVID-19 pandemic. Here, we quantified levels of SARS-CoV-2-reactive antibodies and hCoV-reactive antibodies in serum samples collected from 431 individuals before the COVID-19 pandemic. We then quantified pre-pandemic antibody levels in serum from a separate cohort of 251 individuals who later had PCR-confirmed SARS-CoV-2 infection. Finally, we longitudinally measured hCoV and SARS-CoV-2 antibodies in the serum of hospitalized COVID-19 patients. Our studies indicate that most individuals possessed hCoV-reactive antibodies before the COVID-19 pandemic. We determined that ∼20% of these individuals possessed non-neutralizing antibodies that cross-reacted with SARS-CoV-2 spike and nucleocapsid proteins. These antibodies were not associated with protection against SARS-CoV-2 infections or hospitalizations, but they were boosted upon SARS-CoV-2 infection.
Article
Purpose Integrating genomic data into the electronic health record (EHR) is key for optimally delivering genomic medicine. Methods The PennChart Genomics Initiative (PGI) at the University of Pennsylvania is a multidisciplinary collaborative that has successfully linked orders and results from genetic testing laboratories with discrete genetic data in the EHR. We quantified the use of the genomic data within the EHR, performed a time study with genetic counselors, and conducted key informant interviews with PGI members to evaluate the effect of the PGI’s efforts on genetics care delivery. Results The PGI has interfaced with 4 genetic testing laboratories, resulting in the creation of 420 unique computerized genetic testing orders that have been used 4073 times to date. In a time study of 96 genetic testing activities, EHR use was associated with significant reductions in time spent ordering (2 vs 8 minutes, P < .001) and managing (1 vs 5 minutes, P < .001) genetic results compared with the use of online laboratory-specific portals. In key informant interviews, multidisciplinary collaboration and institutional buy-in were identified as key ingredients for the PGI’s success. Conclusion The PGI’s efforts to integrate genomic medicine into the EHR have substantially streamlined the delivery of genomic medicine.
Article
Polygenic risk scores (PRS) represent an individual's summed genetic risk for a trait and can serve as biomarkers for disease. Less is known about the utility of PRS as a means to quantify genetic risk for substance use disorders (SUDs) than for many other traits. Nonetheless, the growth of large, electronic health record-based biobanks makes it possible to evaluate the association of SUD PRS with other traits. We calculated PRS for smoking initiation, alcohol use disorder (AUD), and opioid use disorder (OUD) using summary statistics from the Million Veteran Program sample. We then tested the association of each PRS with its primary phenotype in the Penn Medicine BioBank (PMBB) using all available genotyped participants of African or European ancestry (AFR and EUR, respectively) (N = 18,612). Finally, we conducted phenome-wide association analyses (PheWAS) separately by ancestry and sex to test for associations across disease categories. Tobacco use disorder was the most common SUD in the PMBB, followed by AUD and OUD, consistent with the population prevalence of these disorders. All PRS were associated with their primary phenotype in both ancestry groups. PheWAS results yielded cross-trait associations across multiple domains, including psychiatric disorders and medical conditions. SUD PRS were associated with their primary phenotypes; however, they are not yet predictive enough to be useful diagnostically. The cross-trait associations of the SUD PRS are indicative of a broader genetic liability. Future work should extend findings to additional population groups and for other substances of abuse.
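The "summed genetic risk" described in this abstract is commonly computed as a weighted sum of effect-allele dosages. The sketch below illustrates that general idea only; the function name, variant identifiers, weights, and dosages are all hypothetical, and the study's actual scoring pipeline is not specified in the abstract.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: variant ID -> effect-allele count for one person (0, 1, or 2).
    weights: variant ID -> per-allele effect size from GWAS summary statistics.
    Variants missing from either mapping are skipped (an assumption made
    here for simplicity; real pipelines handle missingness explicitly).
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical variants and effect sizes, for illustration only.
weights = {"var_A": 0.5, "var_B": -0.25, "var_C": 0.125}
person = {"var_A": 2, "var_B": 1, "var_C": 0, "var_D": 1}  # var_D has no weight

score = polygenic_risk_score(person, weights)
print(score)  # prints "0.75"
```

In practice raw scores are usually standardized within an ancestry group before being tested against phenotypes, which is one reason the analyses in the abstract are run separately by ancestry.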
Article
‘Genome-first’ approaches to analyzing rare variants can reveal new insights into human biology and disease. Because pathogenic variants are often rare, new discovery requires aggregating rare coding variants into ‘gene burdens’ for sufficient power. However, a major challenge is deciding which variants to include in gene burden tests. Pathogenic variants in MYBPC3 and MYH7 are well-known causes of hypertrophic cardiomyopathy (HCM), and focusing on these ‘positive control’ genes in a genome-first approach could help inform variant selection methods and gene burdening strategies for other genes and diseases. Integrating exome sequences with electronic health records among 41,759 participants in the Penn Medicine BioBank, we evaluated the performance of aggregating predicted loss-of-function (pLOF) and/or predicted deleterious missense (pDM) variants in MYBPC3 and MYH7 for gene burden phenome-wide association studies (PheWAS). The approach to grouping rare variants for these two genes produced very different results: pLOFs but not pDM variants in MYBPC3 were strongly associated with HCM, whereas the opposite was true for MYH7. Detailed review of clinical charts revealed that only 38.5% of patients with HCM diagnoses carrying an HCM-associated variant in MYBPC3 or MYH7 had a clinical genetic test result. Additionally, 26.7% of MYBPC3 pLOF carriers without HCM diagnoses had clear evidence of left atrial enlargement and/or septal/LV hypertrophy on echocardiography. Our study shows the importance of evaluating both pLOF and pDM variants for gene burden testing in future studies to uncover novel gene-disease relationships and identify new pathogenic loss-of-function variants across the human genome through genome-first analyses of healthcare-based populations.
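The variant-grouping choice discussed in this abstract can be sketched as follows: a participant counts toward a gene's burden if they carry at least one rare variant whose annotation class is included. This is an illustrative sketch only; the function name, variant IDs, participant IDs, and annotations are hypothetical, and the study's actual filtering criteria are not given in the abstract.

```python
def gene_burden_carriers(carriers_by_variant, annotations, include):
    """Return participants carrying >= 1 variant whose class is in `include`.

    carriers_by_variant: variant ID -> set of participant IDs carrying it.
    annotations: variant ID -> annotation class, e.g. "pLOF" or "pDM".
    include: set of annotation classes that qualify for the burden.
    """
    carriers = set()
    for variant, participants in carriers_by_variant.items():
        if annotations.get(variant) in include:
            carriers.update(participants)
    return carriers


# Hypothetical rare variants in a single gene, for illustration only.
carriers_by_variant = {
    "v1": {"p1", "p2"},
    "v2": {"p3"},
    "v3": {"p4"},
}
annotations = {"v1": "pLOF", "v2": "pDM", "v3": "synonymous"}

plof_only = gene_burden_carriers(carriers_by_variant, annotations, {"pLOF"})
plof_or_pdm = gene_burden_carriers(carriers_by_variant, annotations, {"pLOF", "pDM"})
print(sorted(plof_only))    # prints "['p1', 'p2']"
print(sorted(plof_or_pdm))  # prints "['p1', 'p2', 'p3']"
```

Changing `include` changes who is counted as a carrier, so the same gene can yield different association results under different grouping choices, as the abstract reports for MYBPC3 and MYH7.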
Article
Rare monogenic disorders of the primary cilium, termed ciliopathies, are characterized by extreme presentations of otherwise common diseases, such as diabetes, hepatic fibrosis, and kidney failure. However, despite a recent revolution in our understanding of the cilium’s role in rare disease pathogenesis, the organelle’s contribution to common disease remains largely unknown. Hypothesizing that common genetic variants within Mendelian ciliopathy genes might contribute to the pathogenesis of common complex diseases, we performed association studies of 16,874 common genetic variants across 122 ciliary genes with 12 quantitative laboratory traits characteristic of ciliopathy syndromes in 452,593 individuals in the UK Biobank. We incorporated tissue-specific gene expression analysis, expression quantitative trait loci (eQTL), and Mendelian disease phenotype information into our analysis and replicated our findings in meta-analysis. We identified 101 statistically significant associations across 42 of the 122 examined ciliary genes (including eight novel replicating associations). These ciliary genes were widely expressed in tissues relevant to the phenotypes being studied, and eQTL analysis revealed strong evidence for correlation between ciliary gene expression levels and laboratory traits. Perhaps most interestingly, our analysis identified different ciliary subcompartments as being specifically associated with distinct sets of phenotypes. Taken together, our data demonstrate the utility of a Mendelian pathway-based approach to genomic association studies, challenge the widely held belief that the cilium is an organelle important mainly in development and in rare syndromic disease pathogenesis, and provide a framework for the continued integration of common and rare disease genetics to provide insight into the pathophysiology of human diseases of immense public health burden.