This reads as a bad-faith misreading of our point: earlier in this thread you indicated you understood that our paper argues the NWP correlation in Schrimpf et al. 2021 does not hold, and now you're writing as if we've said the opposite. Thanks @neural-reckoning.org, I'm stepping back from this now.
17.02.2026 08:51
We did not say that "the results can also be perfectly replicated", especially not regarding the NWP correlation shown in Schrimpf et al. 2021. Our Figure 4 specifically shows that this result does not replicate under any activation-extraction choice on any of the three datasets under contiguous splits.
17.02.2026 08:14
Putting such speculation aside, we're curious: do you still view the AlKhamissi 2025 results as a reproduction of the Schrimpf 2021 results despite the differences? And do you still reject the results of our study, which directly replicates the Schrimpf 2021 experiments with more rigorous methods? 3/3
17.02.2026 01:50
Our Ext. Data Fig. 1 shows that adding GloVe to PWR closes much of the gap from PWR to GPT2-XL on Pereira2018, suggesting the untrained→trained improvement is largely explainable by coarse semantic features. 2/3
17.02.2026 01:47
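For concreteness, a hedged sketch of what "adding GloVe to PWR" could look like, assuming a word-to-vector dict `glove` loaded elsewhere (e.g. from glove.6B.300d.txt) and a PWR design matrix `X_pwr` like the one sketched under the 1/3 post below; all names are illustrative, not the paper's code:

```python
import numpy as np

def glove_sentence_vec(sentence, glove, dim=300):
    # Mean-pool GloVe vectors over the sentence's words; zeros if all OOV.
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def augment_pwr_with_glove(X_pwr, sentences, glove, dim=300):
    # Concatenate coarse semantic features onto the PWR columns.
    sem = np.stack([glove_sentence_vec(s, glove, dim) for s in sentences])
    return np.hstack([X_pwr, sem])  # shape: (n_sentences, 2 + dim)
```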
Because trained models beat untrained ones (whose predictivity is fully explained by position and word rate, PWR), we think AlKhamissi et al.'s early NWP–brain correlation mostly reflects the shift from an untrained model to one that is "trained enough"; as for what "trained enough" means… 1/3
17.02.2026 01:45
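A minimal sketch of what position + word-rate (PWR) regressors could look like for sentence-level stimuli; the helper name and feature choices are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def pwr_features(sentences, passage_ids):
    # One row per sentence: [position within its passage, word count].
    feats, pos_in_passage = [], {}
    for sent, pid in zip(sentences, passage_ids):
        k = pos_in_passage.get(pid, 0)
        pos_in_passage[pid] = k + 1
        feats.append([k, len(sent.split())])
    return np.asarray(feats, dtype=float)

X_pwr = pwr_features(
    ["The dog ran.", "It barked loudly.", "Rain fell."],
    passage_ids=[0, 0, 1],
)
# X_pwr -> [[0., 3.], [1., 3.], [0., 2.]]
```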
AlKhamissi et al. evaluated the NWP correlation *across the training process within individual models*, not across a zoo of pretrained models as in Schrimpf et al. Their results are consistent with ours (and Caucheteux et al.'s): NWP correlation is less robust at later training stages.
16.02.2026 14:46
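To make the contrast concrete, a toy sketch of the two analyses (all numbers are placeholders, not values from either paper):

```python
import numpy as np
from scipy.stats import pearsonr

# Schrimpf-et-al.-style: one (NWP score, brain score) pair per pretrained
# model in a zoo of different architectures/sizes.
nwp_per_model = np.array([0.30, 0.45, 0.55, 0.60])
brain_per_model = np.array([0.20, 0.28, 0.33, 0.35])
r_zoo, _ = pearsonr(nwp_per_model, brain_per_model)

# AlKhamissi-et-al.-style: one pair per training checkpoint of a single
# model, tracing the correlation across the training process.
nwp_per_ckpt = np.array([0.10, 0.35, 0.50, 0.58, 0.60])
brain_per_ckpt = np.array([0.05, 0.30, 0.33, 0.33, 0.32])
r_training, _ = pearsonr(nwp_per_ckpt, brain_per_ckpt)
```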
Hi Martin, thanks for the response. To be clear, we were already splitting on "stories". We used the more general term "contiguous" splits since not all of these datasets are stories (e.g. Fedorenko is sentences and Pereira is brief passages; only Blank is "stories"). We stated this in Section 4.9:
16.02.2026 14:12
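For readers following along, a minimal sketch of the split distinction, assuming rows are time-ordered stimuli tagged with a passage/story ID (all names and shapes are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
passage_id = np.repeat(np.arange(12), 10)  # 12 contiguous passages, 10 rows each
X = rng.normal(size=(120, 50))             # placeholder model activations
y = rng.normal(size=120)                   # placeholder voxel response

# Shuffled splits: rows from the same passage can land in both train and
# test, so temporally autocorrelated samples leak across the split.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

# Contiguous splits: whole passages/stories are held out together.
contiguous = GroupKFold(n_splits=5)
for train_idx, test_idx in contiguous.split(X, y, groups=passage_id):
    assert not set(passage_id[train_idx]) & set(passage_id[test_idx])
```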
Hi Anna, a successful replication of OASM on Pereira 2018 has now been shown in Brain-Score!
13.02.2026 02:16
add OASM model from Hadidi et al. 2025 by mschrimpf · Pull Request #355 · brain-score/language
Cursor-aided implementation based on the paper.
Preliminary results from local run: 0.34 on Pereira2018-linear
Thanks to @kartikpradeepan.bsky.social, who has just replicated our results using OASM with shuffled splits on Pereira 2018: github.com/brain-score/...
Does this change your interpretation of the results in Schrimpf et al. 2021? @mschrimpf.bsky.social
13.02.2026 00:43
Also, LITCoder (Binhuraib et al. 2025) reports similarly high predictivity on naturalistic listening datasets with an OASM-like baseline under shuffled splits, so the effects we see with OASM appear to be widespread across datasets and pipelines whenever splits are shuffled.
12.02.2026 02:41
For what it's worth, our GPT2-XL Pereira results look quite comparable to Brain-Score: the shuffled vs. contiguous scores and the layerwise pattern in the shuffled case both closely track what Kauf et al. 2024 show in their appendix figures.
12.02.2026 02:39
Practically, we don't think we can implement a Brain-Score submission quickly right now (near-term deadlines), and we've already shared a reference implementation. If someone more familiar with Brain-Score wants to implement it there, that'd be ideal; we're happy to answer questions and help validate.
12.02.2026 02:37
Thanks, Anna. Agreed, it would be clean and helpful to have the exact same OASM definition implemented inside Brain-Score. We actually found Brain-Score hard to extend and debug for what we needed (custom splits, variance partitioning, and modern ridge tooling like himalaya), so we're not very familiar with it.
12.02.2026 02:37
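For context, a minimal sketch of the kind of pipeline described here (himalaya's RidgeCV inside passage-grouped outer splits); shapes, names, and the alpha grid are placeholder assumptions, not the paper's actual code:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from himalaya.ridge import RidgeCV
from himalaya.scoring import correlation_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))         # features, e.g. layer activations
Y = rng.normal(size=(120, 200))        # voxel responses
groups = np.repeat(np.arange(12), 10)  # passage/story IDs for contiguous CV

scores = []
for tr, te in GroupKFold(n_splits=5).split(X, groups=groups):
    model = RidgeCV(alphas=np.logspace(-2, 8, 11))  # inner CV picks alpha
    model.fit(X[tr], Y[tr])
    scores.append(correlation_score(Y[te], model.predict(X[te])))
mean_r = np.mean(np.stack(scores), axis=0)  # per-voxel predictivity
```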
We take these claims very seriously. When a high-profile researcher claims that the results of our work cannot be replicated, on the basis of a vibe-coded model that doesn't even attempt to model the correct features, we believe it is appropriate to say so plainly.
11.02.2026 22:10
Hi Anna, we are happy to have a cordial discussion, and we are trying to contribute positively by ensuring that highly influential results are robust. However, this is now the second time that Martin or his group has claimed that our results do not replicate.
11.02.2026 22:08
Great, we're excited to hear your response! If you have difficulty replicating our results next time, please reach out. Agreed, let's keep it friendly on both ends.
11.02.2026 18:46
Martin, we linked our code in the previous thread. We also had a link to our code in Feghhi et al., 2024, which contained the OASM results and which your group cited in AlKhamissi et al., 2024.
11.02.2026 16:35
@mschrimpf.bsky.social has publicly claimed that he can't replicate our results. Meanwhile, what he's actually done is vibe-code a model that has nothing more than its acronym (and not even what it stands for) in common with ours.
11.02.2026 15:23
Yeah, we'd love that discussion as well!
10.02.2026 02:21
Thanks for promoting our work!
08.04.2025 20:29