Na Cai (@caina89)

We have released summary stats here zenodo.org/records/1840... and code here github.com/yazhengdi/ED.... Feedback/criticism/comments welcome! 19/19

22.02.2026 18:55 👍 0 🔁 0 💬 0 📌 0

Both Yazheng and I have learnt a lot from working on this project. We thank the participants of UKB and other cohorts for enabling our work, and many friends for giving us valuable feedback. Yazheng will be presenting this work at #recomb26 @recombconf.bsky.social 18/n

22.02.2026 18:55 👍 1 🔁 0 💬 1 📌 0

We hope this work provides a proof of concept and blueprint for improving the specificity and interpretability of EHR-based genetic studies, as well as their downstream utility in drug target identification and risk stratification. 17/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

Overall, our EDGAR framework enables the prediction of disease-specific liabilities from EHR events, disease-specific measures and deep phenotype labels, and the separation of genetic effects on disease-specific liability from heritable biases that influence EHR events. 16/n

22.02.2026 18:55 👍 1 🔁 0 💬 1 📌 0

When we do that, we find that the high rGs between external EHR GWAS and confounding traits disappear. This implies that when we have well-predicted disease liabilities, enabled by EDGAR, we can identify biases in one EHR and remove them from existing GWAS on another. 15/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

Finally, we ask whether the Common Bias identified in UKB generalizes across EHRs: consistent with our hypothesis, we find that it has high rGs with external EHR GWAS, but not with external deep phenotype GWAS. This motivates us to try removing it from external EHR GWAS. 14/n

22.02.2026 18:55 👍 1 🔁 0 💬 1 📌 0

We then identify a Common Bias factor across diseases with genomic SEM, and find that it has high rGs with socioeconomic and behavioral traits, many previously shown to affect UKB participation www.nature.com/articles/s41... 13/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

This drives us to first isolate, for each disease, the heritable confounder affecting its EHR codes using GWAS-by-subtraction. We find that the genetic effects contributing to these biases in different diseases have high rGs with each other (i.e. they are shared). 12/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0
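
A minimal sketch of the algebra behind a GWAS-by-subtraction step, assuming the usual two-factor Cholesky setup: the liability loads only on a disease factor D, and the EHR code loads on D plus an uncorrelated residual factor B (the bias). The function name, the loading names `l11`, `l21`, `l22` and the mapping of traits to factors are illustrative assumptions, not the code from the paper. In practice the loadings and SNP effects are estimated jointly, with standard errors, in genomic SEM; this only shows the point-estimate subtraction.

```python
import numpy as np

def subtract_liability_effects(beta_ehr, beta_liab, l11, l21, l22):
    """Split per-SNP effects into a disease factor D and a residual bias factor B.

    Assumed model (hypothetical notation):
        beta_liab = l11 * beta_D
        beta_ehr  = l21 * beta_D + l22 * beta_B
    """
    beta_ehr = np.asarray(beta_ehr, dtype=float)
    beta_liab = np.asarray(beta_liab, dtype=float)
    # Effects on the disease factor come from the liability GWAS alone.
    beta_d = beta_liab / l11
    # Whatever remains in the EHR GWAS after removing the disease part is the bias.
    beta_b = (beta_ehr - l21 * beta_d) / l22
    return beta_d, beta_b
```

With loadings in hand, `subtract_liability_effects(beta_ehr, beta_liab, l11, l21, l22)` returns the two sets of per-SNP effects; the second is the "bias" GWAS whose cross-disease rGs would then be compared.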

This is corroborated by our cross-disease rG analysis in UKB: while deep phenotypes and EDGAR liabilities show similar cross-disease rGs, raw EHR codes show highly inflated rGs. This is likely due to a common heritable confounder www.nature.com/articles/s41... 11/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

In contrast, raw UKB EHR codes have lower rGs with external deep phenotype GWAS, but higher rGs and hit replication in external EHR GWAS. This shows that replicability between EHRs goes beyond the etiology of the diseases: an EHR-systemic factor likely drives it. 10/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

We validate our findings in external GWAS: compared to EHR phenotypes, EDGAR liabilities show higher GWAS hit replication and rGs with external deep phenotype GWAS - this suggests EDGAR liabilities capture replicable, disease-specific genetic effects. 9/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

We then perform and compare GWAS on raw EHR codes, deep disease labels and EDGAR predicted liabilities: for most diseases, EDGAR predictions give more GWAS hits, higher rGs with deep phenotype labels, and higher PRS predictability and specificity. 8/n

22.02.2026 18:55 👍 0 🔁 1 💬 1 📌 0

We first show that raw EHR codes have low correlations with deep disease labels for nine diseases in UKB, and EDGAR liability predictions do significantly better, especially when incorporating disease-relevant measures as inputs. 7/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

We augment EDGAR with an active learning model that prioritizes individuals for obtaining deep disease labels, which reduces the number of labels needed by >50% - substantially reducing the cost of obtaining disease labels through patient recall in realistic EHR settings. 6/n

22.02.2026 18:55 👍 1 🔁 0 💬 1 📌 0
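
To make "prioritizes individuals for obtaining deep disease labels" concrete, here is a minimal sketch of one common active learning rule, uncertainty sampling: ask for labels on the unlabeled individuals whose current predicted probability is closest to 0.5. The thread does not say which acquisition rule EDGAR uses, so the rule, the function name and its arguments are all assumptions for illustration.

```python
import numpy as np

def select_for_labeling(pred_prob, labeled_mask, batch_size):
    """Return indices of the unlabeled individuals with the most uncertain predictions.

    pred_prob:    predicted disease probability per individual (hypothetical model output)
    labeled_mask: True where a deep phenotype label is already available
    batch_size:   number of individuals to recall in this round
    """
    pred_prob = np.asarray(pred_prob, dtype=float)
    labeled_mask = np.asarray(labeled_mask, dtype=bool)
    # Uncertainty is highest (closest to 0) when the prediction sits at 0.5.
    uncertainty = -np.abs(pred_prob - 0.5)
    # Individuals who already have labels are never selected.
    uncertainty[labeled_mask] = -np.inf
    k = min(batch_size, int((~labeled_mask).sum()))
    # Stable sort so ties are broken by original order.
    order = np.argsort(-uncertainty, kind="stable")
    return order[:k]
```

Each round, the selected individuals would be labeled, the model refit, and the selection repeated until the labeling budget is spent.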

EDGAR predicts lifetime disease liabilities through aligning counts of EHR events with disease-specific measures realistically available in EHRs (e.g. blood biochemistry or spirometry) and independently ascertained, clinically validated disease labels (“deep phenotypes”). 5/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

In our new paper, we propose EDGAR (EHR Disease liability prediction for Genetic Architecture Recovery), a prediction framework that combines the scale of EHR with the disease-relevance of deep phenotypes to break this circularity of bias. 4/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

These influences, many of which are heritable, are difficult to disentangle from disease-specific genetics. They may also be replicable across EHRs, making confounded genetic findings look robust. Finally, these biases are likely to propagate in next-event prediction models. 3/n

22.02.2026 18:55 👍 0 🔁 0 💬 1 📌 0

Harnessing EHR data for health research - Nature Medicine Electronic health records hold immense potential for providing clinically useful insights for populations and individuals; this Review summarizes the opportunities and challenges, with an emphasis on ...

First, some background: using EHR codes in GWAS increases statistical power in GWAS meta-analyses. Its potential, however, is undermined by systemic factors like coding practices and disparities in healthcare access between demographic groups: www.nature.com/articles/s41... 2/n

22.02.2026 18:55 👍 1 🔁 0 💬 1 📌 0

Learning lifetime disease liability reveals and removes genetic confounding in electronic health records Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies1, but factors including patterns of healthcare use shape which and how diagnoses are recorded, leading ...

Our new preprint “Learning lifetime disease liability reveals and removes genetic confounding in electronic health records” is now online! This work is led by my postdoc Yazheng Di and it’s our first project at @bsse.ethz.ch :) Link to paper: medrxiv.org/cgi/content/...

Thread 1/n

22.02.2026 18:55 👍 17 🔁 6 💬 2 📌 1

Assistant Professor (Tenure Track) of Computational Immunology

My department @bsse.ethz.ch is inviting applications for a new assistant professor (tenure-track) in computational immunology. Interested candidates, please see the advert and apply! Deadline 15 April 2026 ethz.ch/en/the-eth-z...

16.02.2026 08:35 👍 1 🔁 2 💬 0 📌 0

New preprint alert: we use sign errors as a test of how well TWAS works.

Very worryingly, we find that TWAS gets the sign wrong around 1/3 of the time (compared to 50% for pure guessing). You can read more about our analysis here, and what we think is going on 👇

06.01.2026 02:48 👍 67 🔁 28 💬 5 📌 0

Assessment and ascertainment in psychiatric molecular genetics: challenges and opportunities for cross-disorder research - Molecular Psychiatry

And www.nature.com/articles/s41... in a more cross-disorder setting - this one discusses the impact of phenotyping on cross-disorder analyses

20.12.2025 15:46 👍 2 🔁 0 💬 0 📌 0

The genetic basis of major depressive disorder - Molecular Psychiatry

We have two more commentary/review-like papers on this: www.nature.com/articles/s41...

20.12.2025 15:45 👍 1 🔁 0 💬 0 📌 0

Importantly, in this paper we derive a new metric, PRS pleiotropy, which we use to show that shallow phenotypes indeed give nonspecific GWAS signal that leads to nonspecific PRS predictions

20.12.2025 15:44 👍 1 🔁 0 💬 2 📌 0

Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder - Nature Genetics Phenotype imputation increases the effective sample size of major depressive disorder cases in UK Biobank, enhancing study power and polygenic risk score (PRS) accuracy. A new pleiotropy metric enable...

We then worked out that we can improve shallow phenotypes through imputation in biobank settings where a small subset of individuals has high-quality phenotypes www.nature.com/articles/s41...

20.12.2025 15:43 👍 5 🔁 0 💬 2 📌 0

Minimal phenotyping yields genome-wide association signals of low specificity for major depression - Nature Genetics Genetic analyses of depression based on minimal phenotyping identify nonspecific genetic risk factors shared between major depressive disorder (MDD) and other psychiatric conditions, suggesting that t...

Hi @michelnivard.bsky.social, we have written quite a lot on sample size vs phenotyping quality. The earliest was www.nature.com/articles/s41... on depression, where we show that shallow phenotyping, despite giving higher GWAS power, gives nonspecific GWAS signal

20.12.2025 15:41 👍 4 🔁 0 💬 2 📌 0

Embryo selection company Herasight goes all in on eugenics ...

I wrote about the bizarre case of Herasight, the embryo selection company going all in on eugenics.

13.12.2025 20:15 👍 125 🔁 83 💬 6 📌 15

The PGC Suicide Working Group will provide a symposium at #WCPG2025 🇲🇽 covering our latest multi-ancestry GWAS and CNV meta-analyses, sex-specific meta-analyses, and GxEHR analyses. 🥳

@lcstoshio.bsky.social @andreyshabalin.bsky.social @sarahcolbert.bsky.social @caina89.bsky.social

19.10.2025 18:17 👍 17 🔁 5 💬 2 📌 0

We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)

22.09.2025 05:29 👍 174 🔁 91 💬 4 📌 5

🌎👩‍🔬 For 15+ years biology has accumulated petabytes (million gigabytes) of🧬DNA sequencing data🧬 from the far reaches of our planet.🦠🍄🌵

Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open.

doi.org/10.1101/2024...

03.09.2025 08:39 👍 218 🔁 118 💬 3 📌 16
