Dongwook Kim

@dongwookkim

Developing fast and easy methods for #phylogenetics and #bioinformatics | PhD in Bioinformatics | Postdoc @ Comparative Genomics Lab, UNIL/SIB 🇨🇭 | Formerly @ Steinegger Lab, SNU 🇰🇷 | he/him

183
Followers
126
Following
9
Posts
21.11.2024
Joined
Latest posts by Dongwook Kim @dongwookkim

I am pleased to share that our paper is now published in Cell!
www.cell.com/cell/fulltex...
I am deeply grateful to all co-authors for making this possible.

This work was made possible through the guidance of Dr. Peer Bork. I share this in grateful memory and with deep respect for his mentorship.

09.02.2026 21:07 👍 28 🔁 13 💬 1 📌 2
Multiple protein structure alignment at scale with FoldMason Protein structure is conserved beyond sequence, making multiple structural alignment (MSTA) essential for analyzing distantly related proteins. Computational prediction methods have vastly extended ou...

FoldMason is out now in @science.org. It generates accurate multiple structure alignments for thousands of protein structures in seconds. Great work by Cameron L. M. Gilchrist and @milot.bsky.social.
📄 www.science.org/doi/10.1126/...
🌐 search.foldseek.com/foldmason
💾 github.com/steineggerla...

30.01.2026 06:11 👍 300 🔁 147 💬 4 📌 3
GitHub - DessimozLab/ampliphy: Improve phylogenetic inference by amplifying multiple sequence alignment with homologous sequences Improve phylogenetic inference by amplifying multiple sequence alignment with homologous sequences - DessimozLab/ampliphy

This work was a collaborative effort with Manuel Gil (ZHAW), Kazutaka Katoh (UOsaka), and @dessimoz.bsky.social (UNIL/SIB).
Try AmpliPhy now. We appreciate your feedback!
🌐 github.com/dessimozlab/ampliphy

28.01.2026 06:10 👍 2 🔁 0 💬 0 📌 0
We used a feature of the MAFFT software suite that adds sequences while keeping the column structure of the original MSA intact, implemented via the --add and --addfragments options. We refer to MSAs generated with this feature as amplified MSAs. Comparisons of normalized TCS gains between enriched and amplified MSAs showed comparable improvements in phylogenetic inference, regardless of the methodology. This suggests that we can preserve linear scalability for both sequence alignment and phylogenetic inference, while retaining the improvement in inference quality provided by the additional sequences.

This effect is maintained when MAFFT adds homologs onto an existing MSA without disrupting column structure. Based on these findings, we developed AmpliPhy, a Nextflow pipeline that automates database-driven homolog enrichment for improved gene tree inference at scale. 🧵4/n

28.01.2026 06:10 👍 3 🔁 0 💬 1 📌 0
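
A minimal sketch of the sequence-addition step mentioned in the post above, assuming MAFFT is installed and on the PATH. Only the `--add` and `--addfragments` options come from the post; the function names and file names are hypothetical, and this is not taken from the AmpliPhy pipeline itself.

```python
import subprocess


def build_mafft_add_command(existing_msa, new_sequences, fragments=False):
    """Build the MAFFT argument list for adding sequences to an existing MSA.

    Hypothetical helper: only the --add / --addfragments options are taken
    from the post; the file names passed in are placeholders.
    """
    add_flag = "--addfragments" if fragments else "--add"
    # MAFFT takes the sequences to add right after the flag,
    # followed by the existing alignment.
    return ["mafft", add_flag, new_sequences, existing_msa]


def run_mafft_add(existing_msa, new_sequences, output_path, fragments=False):
    """Run MAFFT and save the alignment, which MAFFT prints to stdout."""
    cmd = build_mafft_add_command(existing_msa, new_sequences, fragments)
    with open(output_path, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    # Show the command without running it; file names are placeholders.
    print(" ".join(build_mafft_add_command("original_msa.fasta", "homologs.fasta")))
```

Keeping command construction separate from execution makes it easy to inspect the command before running it on a large batch of gene families.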
We hypothesized that the gain in inference quality could be driven by more precise root placement, leveraging the additional information provided by the enriched taxa. To test this, we modified the processing of the enriched tree by switching the order of rooting and pruning. In our original workflow, we processed trees inferred from enriched MSAs by rooting first and then pruning leaves that were not present in the original MSA (post-pruning). By reversing the order, i.e. pruning first and rooting afterwards, we prevented the additional taxa from contributing to the rooting step (pre-pruning). We then compared the congruence gain of pre-pruned and post-pruned phylogenetic trees. To quantify the effect of pre-pruning, we computed the loss of congruence by pre-pruning the tree. The effect of pre-pruning was observed as a notable decrease in inference quality for Amniota HOGs.

At lower taxonomic levels (e.g., Amniotes), this improvement was associated with more precise root placement. This provides empirical evidence that denser taxon sampling can ameliorate gene tree inference of closely related species by adding information for accurate rooting. 🧵3/n

28.01.2026 06:10 👍 3 🔁 0 💬 1 📌 0
For each orthologous gene family, we constructed three MSAs: original, computed by applying sequence aligners directly to the input sequences; enriched, computed by aligning the combined set of orthologs and homologs identified by database search; and impoverished, obtained by removing the added homologs from the enriched MSA. We then used TCS to quantify congruence of the resulting trees against the known taxonomy. The normalized difference in congruence between the original and enriched trees captures the joint impact of sequence addition on alignment and tree inference. The normalized difference between the original and impoverished trees reflects the effect on alignment quality alone. The effect on tree inference can then be estimated by subtraction. We observed a positive impact of homolog enrichment on the phylogenetic tree inference step, regardless of the sequence aligner used to build the alignments. Notably, the impact of sequence addition on alignments was marginal.

We devised a benchmark method to quantify the impact of homolog enrichment on phylogenetic inference, decomposing the effects on MSA quality, tree inference quality, and rooting. We show homolog enrichment improves tree inference, while effects on alignments remain marginal. 🧵2/n

28.01.2026 06:10 👍 7 🔁 2 💬 1 📌 0
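
The subtraction step in the benchmark described above can be illustrated with a small sketch. The post does not give the normalization formula, so the default below (a plain difference of congruence scores) is an assumption, as are all names and the example scores.

```python
from dataclasses import dataclass


@dataclass
class EnrichmentEffect:
    joint: float           # original vs. enriched tree
    alignment: float       # original vs. impoverished tree
    tree_inference: float  # joint minus alignment


def plain_difference(baseline, score):
    """Assumed stand-in for the normalization, which the post does not specify."""
    return score - baseline


def decompose(original, enriched, impoverished, normalize=plain_difference):
    """Split the congruence gain into alignment and tree-inference parts.

    All three arguments are congruence scores of trees built from the
    original, enriched, and impoverished MSAs of one gene family.
    """
    joint = normalize(original, enriched)
    alignment = normalize(original, impoverished)
    # The tree-inference effect is estimated by subtraction.
    return EnrichmentEffect(joint, alignment, joint - alignment)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(decompose(original=0.5, enriched=0.75, impoverished=0.625))
```

Passing a different `normalize` function swaps in another normalization without changing the decomposition itself.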
AmpliPhy improves gene trees by adding homologs without affecting alignments In phylogenomics, gene tree reconstruction depends on multiple sequence alignment (MSA) and tree inference, and ongoing work continues to improve inference quality. Denser taxon sampling has been associated with improved gene tree inference, suggesting that adding homologs could be a practical route to higher accuracy as sequence databases continue to expand. However, adding sequences can influence multiple steps of typical inference pipelines, and little is known on its specific effect on the multiple sequence alignment, tree reconstruction, and rooting steps. We performed a large-scale empirical benchmark to quantify how homolog enrichment affects alignment and phylogenetic inference. Using an enrichment-impoverishment design and a measure of tree accuracy based on taxonomic congruence, we found that enrichment consistently improves tree inference quality, while effects on alignment quality are marginal. We show that this improvement is associated with accurate root placement on enriched trees when sensitive homolog search is accompanied. Notably, much of the benefit can be retained with relatively compact alignments produced by sequence addition. Building on these observations, we provide a tool, AmpliPhy, which efficiently improves phylogenetic reconstruction of protein families through homolog enrichment. The AmpliPhy open-source pipeline software is available at https://github.com/DessimozLab/ampliphy.
Competing Interest Statement: The authors have declared no competing interest.
Swiss National Science Foundation, https://ror.org/00yjd3n13, 216623, 10005715

Can ever-increasing sequence databases improve phylogenetic reconstruction of a gene family? Our new preprint introduces AmpliPhy, a pipeline that automates homolog enrichment to improve gene tree inference, built on a robust phylogenomic benchmark scheme. 🧵1/n
📃 doi.org/10.64898/2026.01.26.701724

28.01.2026 06:10 👍 25 🔁 14 💬 1 📌 0
Mirdita Lab - Laboratory for Computational Biology & Molecular Machine Learning Mirdita Lab builds scalable bioinformatics methods.

My time in @martinsteinegger.bsky.social's group is ending, but I’m staying in Korea to build a lab at Sungkyunkwan University School of Medicine. If you or someone you know is interested in molecular machine learning and open-source bioinformatics, please reach out. I am hiring!
mirdita.org

20.01.2026 11:07 👍 105 🔁 55 💬 7 📌 1
Protein Structure Informed Bacteriophage Genome Annotation with Phold Bacteriophage (phage) genome annotation is essential for understanding their functional potential and suitability for use as therapeutic agents. Here we introduce Phold, an annotation framework utilis...

Stoked to finally have a preprint out for Phold, our tool that uses protein structural information to enhance phage genome annotation #phagesky 1/n

www.biorxiv.org/content/10.1...

08.08.2025 07:10 👍 137 🔁 66 💬 5 📌 4
Planetary microbiome structure and generalist-driven gene flow across disparate habitats Microbes are ubiquitous on Earth, forming microbiomes that sustain macroscopic life and biogeochemical cycles. Microbial dispersion, driven by natural processes and human activities, interconnects mic...

Our new preprint is out!
www.biorxiv.org/content/10.1...
In this study, we present the largest systematic analysis of microbiome structure and function, integrating 85K uniformly processed metagenomes from diverse habitats worldwide.
@podlesny.bsky.social @jonas-bio.bsky.social @borklab.bsky.social

21.07.2025 11:56 👍 28 🔁 18 💬 1 📌 4

OrthoFinder just dropped a major update

It’s faster, more accurate, and ready for thousands of genomes

Let’s break it down (1/10)

github.com/OrthoFinder/...
www.biorxiv.org/content/10.1...

16.07.2025 17:51 👍 126 🔁 73 💬 1 📌 1

Folddisco finds similar (dis)continuous 3D motifs in large protein structure databases. Its efficient index enables fast uncharacterized active site annotation, protein conformational state analysis and PPI interface comparison. 1/9 🧶🧬
📄 www.biorxiv.org/content/10.1...
🌐 search.foldseek.com/folddisco

07.07.2025 08:21 👍 155 🔁 71 💬 8 📌 3
A general substitution matrix for structural phylogenetics. Abstract. Sequence-based maximum likelihood (ML) phylogenetics is a widely used method for inferring evolutionary relationships, which has illuminated the

New paper from the lab from Sriram Garg in my group. We introduce a general substitution matrix for structural phylogenetics. I think this is a big deal, so read on below if you think deep history is important. academic.oup.com/mbe/advance-...

11.06.2025 14:01 👍 96 🔁 52 💬 3 📌 2

This work was done by the talented @sukhwanpark.bsky.social and me, supervised by the amazing @martinsteinegger.bsky.social!

Try Unicore now 👉 conda install -c bioconda unicore
Code and tutorial: 🌐 github.com/steineggerlab/unicore
Manuscript: 🌐 doi.org/10.1093/gbe/evaf109

03.06.2025 06:54 👍 5 🔁 0 💬 0 📌 0

Unicore is fast, accurate, and universal. Unicore reconstructed consistent phylogenies of bacterial/fungal species while scaling linearly in time with input size. Unicore also works with any given taxa, providing a scalable and universal method for structure-based phylogeny. 🧵3/n

03.06.2025 06:54 👍 6 🔁 0 💬 1 📌 0

With Unicore, we identified 13 structural core genes from 166 species across the Tree of Life, 8 of which could only be defined using structures. Projected on the Tree of Life reconstructed with Unicore, you can see the universally conserved structure of one of the structural core genes. 🧵2/n

03.06.2025 06:54 👍 5 🔁 0 💬 1 📌 0

Unicore is now published in GBE 🚀
Unicore rapidly identifies structural single-copy core genes from input species proteomes for phylogenetic analysis. Powered by Foldseek and ProstT5, Unicore enables linear-scale structure-based phylogeny of any given set of taxa. 🧵1/n
📃 doi.org/10.1093/gbe/evaf109

03.06.2025 06:54 👍 68 🔁 31 💬 3 📌 2

AFESM: a metagenomic guide through the protein structure universe! We clustered 821M structures (AFDB&ESMatlas) into 5.12M groups; revealing biome-specific groups, only 1 new fold even after AlphaFold2 re-prediction & many novel domain combos. 🧵
🌐 afesm.foldseek.com
📄 www.biorxiv.org/content/10.1...

27.04.2025 00:13 👍 141 🔁 71 💬 4 📌 4

Visit our posters at #RECOMB2025 for:

Structural: MSAs, Virus DB, Core Genes, Motif Discovery, Multimer Clustering & Search, pLM Foldseek, Environmental analysis

Metagenomics: Classification & Metabuli App

GPU-based & RNA search, Proteome clustering, Novel Ribozyme discovery

& get Marv stickers!

25.04.2025 07:45 👍 64 🔁 19 💬 2 📌 4
IQ-TREE 3: Phylogenomic Inference Software using Complex Evolutionary Models

Not really my announcement to make--I am but a lesser co-author--but IQ-TREE 3 has just been released!

(Most credit to Minh Bui and @roblanfear.bsky.social and their labs)

ecoevorxiv.org/repository/v...

10.04.2025 14:13 👍 178 🔁 96 💬 2 📌 6

🚀 #AlphaFold Database update

AlphaFold DB now integrates The Encyclopedia of Domains (TED) – a resource designed to systematically identify & classify structural domains within AlphaFold-predicted protein structures.

www.ebi.ac.uk/about/news/u...

@pdbeurope.bsky.social

03.03.2025 16:33 👍 118 🔁 44 💬 1 📌 2

The PAN-GO paper is a remarkable milestone. It not only provides the most comprehensive picture of human gene function to date, but also carefully maps this knowledge across the tree of life! Congratulations @marcfeuermann.bsky.social, Pascale Gaudet & collaborators!

www.sib.swiss/news/sib-hel...

26.02.2025 22:37 👍 16 🔁 12 💬 0 📌 0

In our latest review, we explore 12 deep-learning tools for metagenomic analysis, covering their strengths, limitations, and key applications. We hope it serves as both a resource and inspiration for new ways to analyze metagenomic data. Great work by Eli Levy Karin!
📄 doi.org/10.1093/nsr/...

22.02.2025 05:47 👍 106 🔁 44 💬 2 📌 1
FastOMA retains OMA’s high precision and even improves upon it in terms of recall, positioning it on the Pareto frontier of orthology inference methods.
FastOMA is not only fast but also accurate. a) QfO benchmark: agreement with the SwissTree reference phylogeny covering manually curated gene trees. The error bars indicate 95% confidence intervals. FastOMA is compared with EnsemblCompara, Domainoid, OrthoMCL, OrthoInspector, SonicParanoid, PANTHER, OrthoFinder, Hieranoid26 and the OMA family including OMA pairs, OMA groups and OMA GETHOGs (graph-based efficient technique for HOGs).

c) A computation time comparison of FastOMA and state-of-the-art alternatives.
https://www.nature.com/articles/s41592-024-02552-8

FastOMA is out now in Nature Methods 🎉: nature.com/articles/s41592-024-02552-8 A new orthology inference algorithm that scales linearly and is highly accurate. FastOMA can process all >2000 eukaryotic UniProt ref proteomes in <24 hours 🚀. Try it out: github.com/DessimozLab/fastoma @dessimoz.bsky.social

03.01.2025 14:14 👍 40 🔁 18 💬 1 📌 0

Unicore identifies single-copy protein structures across genomes using Foldseek, bypassing slow structure predictions by utilizing 3Di predictions from ProstT5, enabling rapid phylogenetic inference at the tree-of-life scale. 1/n
📄 www.biorxiv.org/content/10.1...
💾 github.com/steineggerla...

23.12.2024 16:39 👍 121 🔁 57 💬 2 📌 3

Unicore enables scalable and accurate phylogenetic reconstruction with structural core genes https://www.biorxiv.org/content/10.1101/2024.12.22.629535v1

23.12.2024 03:51 👍 5 🔁 3 💬 0 📌 0

Scientists, academics, researchers: We’re excited to share that @altmetric.com is now tracking mentions of your research on Bluesky! 🧪

03.12.2024 14:10 👍 29661 🔁 5024 💬 458 📌 279

South Korean citizens helped lawmakers scale the National Assembly walls so they could bypass military barricades and vote against martial law.

03.12.2024 17:15 👍 13584 🔁 3142 💬 81 📌 419
bioRxiv expands on Mastodon and Bluesky bioRxiv - the preprint server for biology, operated by Cold Spring Harbor Laboratory, a research and educational institution

Reminder for newcomers that bioRxiv has Bluesky accounts in every subject category - great way to keep up (please re-skeet) connect.biorxiv.org/news/2023/09...

10.11.2024 13:40 👍 362 🔁 286 💬 6 📌 6

Interested in bioinformatics method development for proteins, structures or metagenomic analysis? Please check out my lab’s starter pack!
🔗 go.bsky.app/VJhXcSs

28.11.2024 12:36 👍 56 🔁 11 💬 3 📌 0