We have released summary stats here zenodo.org/records/1840... and code here github.com/yazhengdi/ED.... Feedback/criticism/comments welcome! 19/19
Both Yazheng and I have learnt a lot from working on this project. We thank the participants of UKB and other cohorts for enabling our work, and many friends for giving us valuable feedback. Yazheng will be presenting this work at #recomb26 @recombconf.bsky.social 18/n
We hope this work provides a proof of concept and blueprint for improving the specificity and interpretability of EHR-based genetic studies, as well as their downstream utility in drug target identification and risk stratification. 17/n
Overall, our EDGAR framework enables the prediction of disease-specific liabilities from EHR events, disease-specific measures and deep phenotype labels, and the separation of genetic effects on disease-specific liability from heritable biases that influence EHR events. 16/n
When we did that, we found that the high rGs between external EHR GWAS and confounding traits disappeared. This implies that with well-predicted disease liabilities, enabled by EDGAR, we can identify biases in one EHR and remove them from existing GWAS on another. 15/n
Finally, we ask whether the Common Bias identified in UKB generalizes across EHRs: consistent with our hypothesis, it has high rGs with external EHR GWAS, but not with external deep phenotype GWAS. This motivated us to try removing it from external EHR GWAS. 14/n
We then identify a Common Bias factor across diseases with genomic SEM, and find that it has high rGs with socioeconomic and behavioral traits, many previously shown to affect UKB participation www.nature.com/articles/s41... 13/n
This drives us to first isolate the disease-specific heritable confounder affecting EHR codes of each disease, using GWAS-by-subtraction, and find that genetic effects contributing to these biases in different diseases have high rGs (i.e. they are shared) with each other. 12/n
This is corroborated by our cross-disease rG analysis in UKB: while deep phenotypes and EDGAR liabilities show similar cross-disease rGs, raw EHR codes show highly inflated rGs. This is likely due to a common heritable confounder www.nature.com/articles/s41... 11/n
In contrast, raw UKB EHR codes have lower rGs with external deep phenotype GWAS, but higher rGs and hit replication in external EHR GWAS. This shows that replicability between EHRs goes beyond the etiology of the diseases: an EHR-systemic factor likely drives it. 10/n
We validate our findings in external GWAS: compared to EHR phenotypes, EDGAR liabilities show higher GWAS hit replication and rGs with external deep phenotype GWAS - this suggests EDGAR liabilities capture replicable, disease-specific genetic effects. 9/n
We then perform and compare GWAS on raw EHR codes, deep disease labels and EDGAR predicted liabilities: for most diseases, EDGAR predictions give more GWAS hits, higher rGs with deep phenotype labels, and higher PRS predictability and specificity. 8/n
We first show that raw EHR codes have low correlations with deep disease labels for nine diseases in UKB, and EDGAR liability predictions do significantly better, especially when incorporating disease-relevant measures as inputs. 7/n
We augment EDGAR with an active learning model that prioritizes individuals for obtaining deep disease labels, which reduces the N labels needed by >50% - substantially reducing costs for obtaining disease labels through patient recall in realistic EHR settings. 6/n
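To illustrate the general idea of prioritizing individuals for labeling, here is a minimal sketch of uncertainty sampling, one common active learning strategy. It is an assumption for illustration only: the thread does not say which acquisition strategy EDGAR uses, and the function names, IDs and probabilities below are hypothetical.

```python
import math


def uncertainty(p: float) -> float:
    """Binary entropy (in bits) of a predicted disease probability; peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def prioritize_for_labeling(predicted: dict, budget: int) -> list:
    """Return the `budget` individuals whose current predictions are most uncertain."""
    ranked = sorted(predicted, key=lambda pid: uncertainty(predicted[pid]), reverse=True)
    return ranked[:budget]


# Hypothetical predicted disease probabilities for four unlabeled individuals.
predicted = {"id_a": 0.97, "id_b": 0.52, "id_c": 0.10, "id_d": 0.35}

# With a budget of two recalls, pick the two least certain predictions.
selected = prioritize_for_labeling(predicted, budget=2)
```

In this sketch, individuals whose predictions sit near 0.5 are recalled first, on the assumption that their deep labels would be the most informative for the model.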
EDGAR predicts lifetime disease liabilities through aligning counts of EHR events with disease-specific measures realistically available in EHRs (e.g. blood biochemistry or spirometry) and independently ascertained, clinically validated disease labels (“deep phenotypes”). 5/n
In our new paper, we propose EDGAR (EHR Disease liability prediction for Genetic Architecture Recovery), a prediction framework that combines the scale of EHR with the disease-relevance of deep phenotypes to break this circularity of bias. 4/n
These influences, many of which are heritable, are difficult to disentangle from disease-specific genetics. They may also be replicable across EHRs, making confounded genetic findings look robust. Finally, these biases are likely to propagate in next-event prediction models. 3/n
First some background: using EHR codes in GWAS increases statistical power in GWAS meta-analyses. Its potential, however, is undermined by systemic factors like coding practices and healthcare-access disparities across demographic groups: www.nature.com/articles/s41... 2/n
Our new preprint “Learning lifetime disease liability reveals and removes genetic confounding in electronic health records” is now online! This work is led by my postdoc Yazheng Di and it’s our first project at @bsse.ethz.ch :) Link to paper: medrxiv.org/cgi/content/...
Thread 1/n
My department @bsse.ethz.ch is inviting applications for a new assistant professor (tenure-track) in computational immunology. Interested candidates, please see the advert and apply! Deadline 15 April 2026 ethz.ch/en/the-eth-z...
New preprint alert: we use sign errors as a test of how well TWAS works.
Very worryingly we find that TWAS gets the sign wrong around 1/3 of the time (compared to 50% for pure guessing). You can read more about our analysis here, and what we think is going on.
And www.nature.com/articles/s41... in a more cross disorder setting - this one discusses the impact of phenotyping on cross disorder analyses
We have two more commentary/review like papers on this www.nature.com/articles/s41...
Importantly, in this paper we derive a new metric, PRS pleiotropy, which we use to show that shallow phenotypes indeed give non-specific GWAS signal that leads to non-specific PRS predictions
We then worked out we can improve shallow phenotypes in biobank settings where there’s a small subset of individuals with high quality phenotypes through imputation www.nature.com/articles/s41...
Hi @michelnivard.bsky.social we have written quite a lot on sample size vs phenotyping quality. The earliest was www.nature.com/articles/s41... on depression, where we show that shallow phenotyping, despite giving higher GWAS power, gives non-specific GWAS signal
I wrote about the bizarre case of Herasight, the embryo selection company going all in on eugenics.
The PGC Suicide Working Group will provide a symposium at #WCPG2025 covering our latest multi-ancestry GWAS and CNV meta-analyses, sex-specific meta-analyses, and GxEHR analyses.
@lcstoshio.bsky.social @andreyshabalin.bsky.social @sarahcolbert.bsky.social @caina89.bsky.social
We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)
For 15+ years biology has accumulated petabytes (million gigabytes) of DNA sequencing data from the far reaches of our planet.
Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open.
doi.org/10.1101/2024...