Honestly hurts my feelings a little that I didn't even make this list 🥲🥲
This is what I came to this app for 🦮
Thank you for sharing!! Sounds super interesting, so will definitely check it out :)
Exactly this!! thank you 🤗
Oh exciting! On which one? :)
To be fair, it's actually a really really good TLDR!! I'm honestly just a little scared this will end up on the wrong side of twitter now 😳
Now might be the worst possible point in time to admit that I don't own a physical copy of the book myself (yet!! I'm actually building up a textbook bookshelf for myself) BUT because Hastie, Tibshirani & Friedman are the GOATs that they are, they made the pdf free: hastie.su.domains/ElemStatLearn/
Oh friends who are complaining about not enough Real Math^tm in their feed, I am here to help. Well, Alicia is here to help, at least!
To emphasise just how accurately that reflects Alan's approach to research (which I 100% subscribe to btw), I feel compelled to share that this is the actual slide I use whenever I present the U-turn paper in Alan's absence (not a joke)
Now continued below with case study 2: understanding performance differences of neural networks and gradient boosted trees on irregular tabular data!!
btw this is why friends don't let friends skip the "boring classical ML" chapters in Elements of Statistical Learning ‼️
(True story: the origin of this case study is that @alanjeffares.bsky.social [big EoSL nerd] looked at the neural net equation and said "kinda looks like GBTs in EoSL Ch10" and we went from there)
There's one more case study & thoughts on the effect of design choices on function updates left; I'll cover that in a final thread! (next week, giving us all a break)
Until then, find the paper here: arxiv.org/abs/2411.00247
and/or recap part 1 of this thread below! 🤗 14/14
In conclusion, this 2nd case study showed that the telescoping approximation of a trained neural network can be a useful lens to investigate performance diffs with other methods!
Here we used it to show how some perf diffs are predicted by specific model diffs (i.e. diffs in implied kernels) 💡 13/n
Importantly, this growth in the performance gap is tracked by the behaviour of the models' kernels:
while there is no difference in kernel weights for GBTs across different input irregularity levels, the neural net's kernel weights for the most irregular examples grow more extreme! 12/n
We test this hypothesis by varying the proportion of irregular inputs in the test set for fixed trained models.
We find that GBTs outperform NNs already in the absence of irregular examples; this speaks to differences in baseline suitability.
The performance gap then indeed grows as we increase irregularity! 11/n
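(For readers who want to poke at this themselves: a minimal sketch of that protocol as I'd paraphrase it. The `make_irregular` helper and the scaling mechanism are my illustrative assumptions, not the paper's actual setup.)

```python
# Hedged sketch of the evaluation protocol above: train both models once,
# then score test sets with an increasing fraction of "irregular" inputs.
import numpy as np

def make_irregular(X_test, frac, rng, scale=5.0):
    """Push a fraction of test rows far outside the training range
    (one illustrative way to operationalise 'input irregularity')."""
    X_out = X_test.copy()
    idx = rng.choice(len(X_out), size=int(frac * len(X_out)), replace=False)
    X_out[idx] *= scale
    return X_out

# With hypothetical models nn and gbt already trained, something like:
#   for frac in np.linspace(0.0, 0.5, 6):
#       X_irr = make_irregular(X_test, frac, rng)
#       gap[frac] = mse(y_test, nn.predict(X_irr)) - mse(y_test, gbt.predict(X_irr))
```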
This highlights a potential explanation for why GBTs outperform neural nets on tabular data in the presence of input irregularities:
The kernels implied by the neural network might behave much, much more unpredictably for test inputs different from the inputs observed at train time! 💡🤔 10/n
Trees issue preds that are proper averages: all kernel weights are between 0 & 1 (and sum to 1). That is: trees never "extrapolate" outside the convex hull of training observations 💡
Neural net tangent kernels OTOH are generally unbounded and could take on very different values for unseen test inputs! 😰 9/n
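(To see that bounded-vs-unbounded contrast in code: a minimal sketch, mine not the paper's, using scikit-learn's DecisionTreeRegressor for the tree kernel and a toy random-feature model standing in for a tangent kernel.)

```python
# Tree kernel weights form a convex combination; tangent-kernel weights need not.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=200)
X_test = 3.0 * rng.normal(size=(3, 5))  # deliberately "irregular" test points

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# Tree as adaptive kernel smoother: K(x, x_i) = 1{x_i in leaf(x)} / |leaf(x)|
train_leaves, test_leaves = tree.apply(X_train), tree.apply(X_test)
for leaf in test_leaves:
    w = (train_leaves == leaf).astype(float)
    w /= w.sum()
    # Proper average: weights in [0, 1] summing to 1, so predictions never
    # leave the convex hull of the training targets.
    assert w.min() >= 0.0 and w.max() <= 1.0 and np.isclose(w.sum(), 1.0)

# A tangent-style kernel K(x, x_i) = grad(x) . grad(x_i) has no such bounds;
# here: gradients of a linear head on random tanh features.
W = rng.normal(size=(5, 16))
phi = lambda Z: np.tanh(Z @ W)
K = phi(X_test) @ phi(X_train).T
print(K.min(), K.max())  # weights can be large or negative, unlike the tree's
```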
One diff is obvious and purely architectural: either kernel might be better able to fit a particular underlying outcome-generating process!
A second diff is a lot more subtle and relates to how regularly (or: predictably) the two will likely behave on new data: … 8/n
but WAIT A MINUTE: isn't that literally the same formula as the kernel representation of the telescoping model of a trained neural network I showed you before?? Just with a different kernel??
Surely this diff in kernel must account for at least some of the observed performance differences… 🤔 7/n
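(Schematically, and hedged: this is my rendering of the shared shape, up to constants, batching and notation details, not the paper's exact equations. Both predictors look like

$$
f_T(x) \;\approx\; f_0(x) \;-\; \eta \sum_{t=1}^{T} \sum_{i=1}^{n} K_t(x, x_i)\, \left.\frac{\partial \ell(y_i, f)}{\partial f}\right|_{f = f_{t-1}(x_i)},
$$

where for the telescoped neural net $K_t(x, x_i) = \nabla_\theta f_{\theta_{t-1}}(x)^\top \nabla_\theta f_{\theta_{t-1}}(x_i)$ is a tangent kernel, while for GBTs $K_t$ is the tree's leaf-membership kernel, as in the 6/n post below.)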
Gradient boosted trees (aka OG gradient boosting) simply implement this process using trees!
From our previous work on random forests (arxiv.org/abs/2402.01502) we know we can interpret trees as adaptive kernel smoothers, so we can rewrite the GBT preds as weighted avgs over training loss grads! 6/n
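(In that kernel-smoother view, hedged and in my notation rather than the paper's: the round-$t$ tree prediction is a weighted average of training loss gradients,

$$
\text{tree}_t(x) \;=\; -\sum_{i=1}^{n} K^{\text{tree}}_t(x, x_i)\, g_i^{(t)},
\qquad
K^{\text{tree}}_t(x, x_i) \;=\; \frac{\mathbf{1}\{x_i \in \text{leaf}_t(x)\}}{|\{j : x_j \in \text{leaf}_t(x)\}|},
$$

where $g_i^{(t)}$ is example $i$'s loss gradient at round $t$.)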
Quick refresher: what is gradient boosting?
Not to be confused with other forms of boosting (e.g. AdaBoost), *gradient* boosting fits a sequence of weak learners that execute steepest descent in function space directly by learning to predict the loss gradients of training examples! 5/n
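(For concreteness, a minimal sketch of that loop under squared loss, where the negative gradient is just the residual; illustrative only, with scikit-learn trees assumed as the weak learners.)

```python
# Minimal gradient boosting sketch: each weak learner is fit to the
# (negative) loss gradients of the current ensemble on the training data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

lr, trees = 0.1, []
f = np.full_like(y, y.mean())  # f_0: constant initial prediction
for _ in range(100):
    grad = f - y                                          # d/df of 0.5*(f - y)^2
    t = DecisionTreeRegressor(max_depth=2).fit(X, -grad)  # predict -gradients
    trees.append(t)
    f += lr * t.predict(X)     # steepest-descent step in function space

predict = lambda X_new: y.mean() + lr * sum(t.predict(X_new) for t in trees)
```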
In arxiv.org/abs/2411.00247 we ask: why? What distinguishes gradient boosted trees from deep learning that would explain this?
A first reaction might be "they are SO different idk where to start 😭" BUT we show that through the telescoping lens (see part 1 of this 🧵⬇️) things become much clearer... 4/n
And you know who continues to rule the tabular benchmarks? Gradient boosted trees (GBTs)!! (or their descendants)
While the severity of the performance gap over neural nets is disputed, arxiv.org/abs/2305.02997 still found as recently as last year that GBTs especially outperform when data is irregular! 3/n
First things first, why do we care about tabular?
Deep learning sometimes seems to forget we used to do data formats that weren't text or image, BUT in data science applications, from medicine to marketing and econ, tabular data still rules big parts of the world!! 2/n
Part 2: Why do boosted trees outperform deep learning on tabular data??
@alanjeffares.bsky.social & I suspected that answers to this are obfuscated by the 2 being considered very different algs 🤔
Instead we show they are more similar than you'd think, making their diffs smaller but predictive! 🧵 1/n
No need to leap at all, my original description even had the word "delight" in it!!
Wow, I love that!
Thank you!! I don't think I know empirical Fisher actually, do you have a ref?
It was hard to fit, but I gave it my best shot! ✨golden retriever energy in stats✨ might be historically underrepresented but I say that can and should change with us