
Nic Fishman

@njw.fish

Using computers to read the entrails of modernity (statistics, optimization, machine learning). Currently: Stats PhD @Harvard Previously: CS/Soc @Stanford, Stat/ML @Oxford https://njw.fish

276 Followers · 410 Following · 88 Posts · Joined 01.07.2023

Latest posts by Nic Fishman @njw.fish

But I think part of the problem here is that we don't yet have a good solution for how to think about the set of all specifications beyond reducing it to an average. We need some way to reason about how specifications cluster across which covariates are included or excluded, something like that.

11.03.2026 01:21 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Link preview: Specification curve analysis - Nature Human Behaviour

Neither! We really want to have to interpret the entire set of specifications. When people do "robustness checks" they often have one model and are just trying to show that it doesn't "break". We want something more like: www.nature.com/articles/s41...

11.03.2026 01:17 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

But you can look at the kinds of estimates you get from "many teams" analysis projects, and there is tons of variance. I think we should just take that variance across specifications as the central primitive. If there isn't much, it's easy! But in practice there is.

09.03.2026 16:14 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

We find a correlation of about 0.15 in the sample we looked at, and that is baked into the estimate here. The real issue is dropping/adding covariates and the correlation with the "treatment", etc. It's all stuff you can think about from the covariance matrix / higher moments, for sure.

09.03.2026 16:12 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

And we estimate that dependence parameter in our empirical analysis, under basically a Gaussian AR(1) assumption on the underlying process.

09.03.2026 16:00 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

We allow a more general dependence process and analyze the relevant parameter, which looks roughly like (dependence parameter)/(cost to test).

09.03.2026 16:00 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
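A minimal sketch of the kind of quantity involved, assuming (as a stand-in for the paper's more general process) that estimates across an ordered sequence of specifications follow a Gaussian AR(1); names and numbers are illustrative, not the paper's code:

```python
import numpy as np

# Illustrative only: simulate AR(1)-dependent estimates across an ordered
# sequence of specifications, estimate the dependence parameter from the
# lag-1 autocorrelation, and form the screening quantity rho / cost.
rng = np.random.default_rng(0)
rho_true, n_specs, cost_per_test = 0.8, 500, 1.0 / 172  # assumed values

z = np.zeros(n_specs)
for t in range(1, n_specs):
    z[t] = rho_true * z[t - 1] + np.sqrt(1 - rho_true**2) * rng.normal()

rho_hat = np.corrcoef(z[:-1], z[1:])[0, 1]  # simple AR(1) estimator
print(f"rho_hat = {rho_hat:.2f}, rho/cost = {rho_hat / cost_per_test:.0f}")
```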

I know people are trying to use LLMs in ethnographic/qualitative contexts, but I assume the dynamics are extremely different there, especially since data collection is such a fundamental part of the process?

09.03.2026 15:29 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I find these comments super interesting! I think this speaks to the very real practice of empirical social science in economics and political science, where the quasi-experimental turn forces a focus on particular models under specific "natural experiment" assumptions. Would love to hear more!

09.03.2026 15:26 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Exactly! It's an opportunity to fix the status quo!

09.03.2026 15:16 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

We open-source the full pipeline: specification trees, agent prompts, mechanical verification, and all estimation code.

Paper: njw.fish/static/paper...
Code: github.com/gsekeres/age...

09.03.2026 15:09 πŸ‘ 4 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

This sounds extreme. But we're not asking researchers to do more work; we're asking them to show more of the work that AI now makes basically free.

The move is from single specifications to full specification surfaces. The key is developing tools to interpret those surfaces.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

For journals to maintain a 5% false-discovery rate after a 172x cost decline, the required number of passing robustness checks jumps from ~50 to ~7,000.

That's a 140-fold increase in mandatory disclosure. Disclosure must scale linearly with the researcher's testing capacity.

09.03.2026 15:09 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 2
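Roughly how those numbers relate, under the linear-scaling assumption stated above; the exact figures come from the paper's calibrated model, not this back-of-the-envelope rule:

```python
# Back-of-the-envelope: if required disclosures scale with 1/cost, a 172x
# cost decline multiplies the requirement by ~172. The calibrated numbers
# in the thread (~50 -> ~7,000, a ~140x increase) are of the same order.
m_before, cost_decline = 50, 172
print(m_before * cost_decline)  # 8600 under pure linear scaling
print(7000 / 50)                # 140.0, the calibrated increase
```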

To calibrate the theory, we built an AI pipeline that takes a published paper's replication package and automatically constructs a specification surface: the full space of estimand-preserving variants implied by the paper. We ran this on 103 AEA papers, producing 5,569 models.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Our second result: requiring m passing robustness checks (where m scales as 1/cost) does achieve optimal screening.

The key idea is that forced disclosure converts researcher search effort into observable evidence. This is a mechanism-design rationale for specification curves.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Editors have two levers:
1. Tighten standards: lower the p-value cutoff
2. Force disclosure: require m passing specifications

Our first result: switching to p<0.001 does not work. Stricter thresholds just push the researcher to search more; she still reports only her best result.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
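A toy simulation of why the two levers differ, assuming (unlike the paper's dependent-process model) independent null p-values and a fixed search budget; parameter values are mine:

```python
import numpy as np

# Toy model: under a true null, a researcher searches up to `budget`
# specifications and reports selectively. Tightening alpha barely helps;
# requiring m passing specifications collapses the false-accept rate.
rng = np.random.default_rng(1)
budget, trials = 2000, 5000

def false_accept_rate(alpha, m):
    hits = 0
    for _ in range(trials):
        p = rng.uniform(size=budget)   # independent null p-values
        if np.sort(p)[m - 1] < alpha:  # m-th best spec still passes
            hits += 1
    return hits / trials

print(false_accept_rate(0.05, 1))    # ~1.00: one "hit" is nearly guaranteed
print(false_accept_rate(0.001, 1))   # ~0.86: stricter alpha barely helps
print(false_accept_rate(0.05, 200))  # ~0.00: m-disclosure screens it out
```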

So what should journals do? Inspired by ideas from mechanism design, we build a formal model.

Editors commit to acceptance rules, then researchers sequentially test specifications at some cost, selectively reporting. The editor cannot verify omissions; they see only what the researcher chooses to show.

09.03.2026 15:09 πŸ‘ 5 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Conventional significance testing with p < 0.05 and a handful of robustness checks was designed for a world where each specification cost real effort. That world is in the past.

When testing is 172x cheaper and researchers search thousands of specifications, they will find hits to report.

09.03.2026 15:09 πŸ‘ 5 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
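The arithmetic behind that claim, assuming for simplicity that each null specification clears p < 0.05 independently (dependence changes the numbers, not the conclusion):

```python
# Chance of at least one reportable "hit" among n null specifications,
# each independently significant at the 5% level.
for n in (1, 20, 1000):
    print(n, 1 - 0.95**n)  # 0.05, ~0.64, ~1.0
```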

AI agents change the calculus here. On the one hand, they make searching over the full specification space tractable. On the other, they make it necessary to think about the full set, because if a researcher can search thousands of specifications, reporting just one is uninformative.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Methodologists have long argued that single specifications with robustness checks are not the right unit of evidence. Specification-curve analysis, model uncertainty, and multiverse analysis say the real evidence is a set of reasonable specifications which we must interpret.

09.03.2026 15:09 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

There's a growing worry that AI will break empirical social science -- that agents can p-hack until they find something that "works."

We think that worry deserves to be taken seriously. Our new paper shows it is true empirically and makes it precise: njw.fish/static/paper...

09.03.2026 15:09 πŸ‘ 40 πŸ” 10 πŸ’¬ 2 πŸ“Œ 6
Link preview: Distribution-Conditioned Transport

Paper: arxiv.org/abs/2603.04736
Code: github.com/njwfish/Dist...

Work with @ggokul.bsky.social, Paolo Fischer, @marinkazitnik.bsky.social, Omar Abudayyeh, and Jonathan Gootenberg!

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The key theoretical driver here is that our distribution encoders satisfy a CLT, so training on minibatch subsamples recovers population-level objectives. That's what makes this all tractable in practice!

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
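One way to see the minibatch point, assuming (as an illustration, not the paper's architecture) a mean-pooled feature encoder; the feature map `phi` is hypothetical:

```python
import numpy as np

# If a distribution embedding is the mean of per-point features, the
# minibatch embedding is an unbiased estimate of the population embedding
# and concentrates at the CLT rate, roughly O(1/sqrt(n)).
rng = np.random.default_rng(0)

def phi(x):  # hypothetical per-point feature map
    return np.stack([x, x**2, np.sin(x)], axis=-1)

population = rng.normal(size=100_000)
pop_emb = phi(population).mean(axis=0)

for n in (32, 512, 8192):
    batch = rng.choice(population, size=n, replace=False)
    print(n, np.linalg.norm(phi(batch).mean(axis=0) - pop_emb))
```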

To show this works across modalities, we also have a sequence example: TCR repertoire forecasting. On longitudinal TCR-seq from COVID-19 patients, source+target conditioning with discrete flow matching cuts error by >60%, learning from patients observed at only a single timepoint.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

We can do the same comparison in clonal dynamics in hematopoiesis: in lineage-traced scRNA-seq with ~6K clones, only ~2K are observed at multiple timepoints. Source+target conditioning leverages orphan clones to improve fate prediction across all transport mechanisms.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

The semi-supervised task has many real-world applications.

In a mass cytometry drug screen across 10 patients and 11 drugs, we compare source-conditioned (SC) models to source+target-conditioned (STC) models. On IID tasks SC wins, but STC generalizes better to held-out patients (the real task).

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

There are real-world applications for this any-to-any task: batch correction in scRNA-seq!

On a 56-donor murine pancreas dataset, source+target DCT outperforms K-to-K baselines, scVI, and Harmony on held-out donors.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

On synthetic Gaussians, the difference is stark. A simple K-to-K baseline memorizes training distributions and fails OOD (Voronoi pattern). Source+target DCT interpolates smoothly through embedding space, and the gap widens as K grows.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Source+target conditioning: embed source and target distributions. This unlocks any-to-any transport and, crucially, semi-supervised learning.

Many datasets have populations missing across timepoints or conditions. Existing methods discard them, but DCTs can incorporate them!

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Source conditioning: embed the source distribution and condition the transport map on that embedding. This formalizes and generalizes prior approaches, letting a single model learn transport across many populations and generalize to unseen ones at test time.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

The idea: condition transport maps on learned embeddings of source and/or target distributions. The framework is transport-mechanism agnostic: continuous/discrete flow matching, Wasserstein, MMD, and even auto-regressive models can all plug in.

06.03.2026 15:15 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
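A minimal sketch of the conditioning pattern, assuming a mean-pooled set encoder and a flow-matching-style vector field; class names and shapes are my illustration, not the paper's API:

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Permutation-invariant embedding of a sample from a distribution."""
    def __init__(self, dim, emb):
        super().__init__()
        self.point = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, emb))
    def forward(self, X):                  # X: (n_points, dim)
        return self.point(X).mean(dim=0)   # mean-pool -> distribution embedding

class ConditionedVectorField(nn.Module):
    """Transport network conditioned on source and target embeddings."""
    def __init__(self, dim, emb):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + 2 * emb, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, x, t, src_emb, tgt_emb):
        h = torch.cat([x, t, src_emb.expand(len(x), -1),
                       tgt_emb.expand(len(x), -1)], dim=-1)
        return self.net(h)  # velocity field for, e.g., flow matching

enc = SetEncoder(dim=2, emb=16)
vf = ConditionedVectorField(dim=2, emb=16)
Xs, Xt = torch.randn(100, 2), torch.randn(100, 2) + 3.0  # source/target samples
x, t = torch.randn(8, 2), torch.rand(8, 1)
v = vf(x, t, enc(Xs), enc(Xt))  # (8, 2) conditioned velocities
```

Dropping the target embedding recovers source-only conditioning; swapping the vector field for a Wasserstein or autoregressive head keeps the same conditioning interface.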