wahhh wahhh
I wanted to change the format of the eval before the next model drop since it's not very rigorous and I came up with many improvements, but Qwen had other plans. As always, link to the repo here.
github.com/anpaure/cp_e...
How does the new Qwen model compare to other LLMs on coding tasks?
It's impressive, but rushed.
I ran it against other SOTA models on 6 competitive programming problems of varying difficulties.
Here are the results!
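The repo isn't quoted here, so as a rough illustration of what an eval like this involves: a minimal judging harness, assuming each problem is stored as stdin/expected-stdout test cases and the model returns a Python solution as a source string (the `judge` function and its shape are my own sketch, not the repo's actual code).

```python
import subprocess
import sys
import tempfile

# Hypothetical harness: run a candidate solution against each test case
# and compare stdout to the expected answer. A real competitive-programming
# judge would also enforce memory limits and sandbox the process.
def judge(solution_src: str, cases: list[tuple[str, str]]) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src)
        path = f.name
    for stdin_text, expected in cases:
        result = subprocess.run(
            [sys.executable, path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=10,  # per-case time limit
        )
        if result.stdout.strip() != expected.strip():
            return False
    return True
```

With a harness like this, "score" per model is just the fraction of problems where `judge` returns True on all cases.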
my goat, i'm glad someone made it right
i'm very disappointed that people are reacting this way, especially considering what huggingface stands for
i also believe it's especially hard to reprogram the "ai = bad" messaging that's been floating around for a while now, so stay safe out there
thinking this platform isn't gonna be toxic is extremely wishful thinking tbh
i mean you're moving the same people from twitter to here, so yeah nothing is gonna change
i personally maintain that twitter is Not That Bad Actually because there was no defined ingroup and outgroup
here there is
honestly it would look bad in all cases because the timing is ruined, but it doesn't even work theoretically with twos and threes
clearly you don't watch animation...
not hard to imagine how messed up stuff animated on anything but ones would look
when her replies get shorter and colder
very weird that this is not default behavior
i think i might cry if i have to level up from lowbie again on twitter
or maybe it will be fun, idk
you're definitely not including people who had an account here a while ago
I added a result interpretation section and per-problem breakdowns to walk everyone through what's going on in each one.
for 2 it's def an echo chamber, on twitter both the left and the right called each other stupid and to me that was beautiful
there's no reason for the right to migrate here and anything that's against the "narrative" gets blocked pretty quickly (which also happened on twitter but it somehow worked)
Despite the small sample size, I still think it's very helpful to examine closely what the models can do and where they fail.
Here's the link to the GitHub repo, it's recommended to look at it through Colab.
github.com/anpaure/cp_e...
How smart is the new DeepSeek model at coding problems?
Almost o1 level actually.
Today I sat down and ran a couple of competitive programming problems of varying difficulty on leading LLMs, like o1, 4o, Sonnet 3.6 and DeepSeek R1.
These are the preliminary results on 6 problems!
didn't know about it until yesterday too but apparently it's a presidential proclamation signed by trump that prohibits cn students associated with people's liberation army from getting f and j visas
in the end it mostly affected students of 8 major colleges in china
en.wikipedia.org/wiki/Proclam...
years of alignment research down the drain
it's an offshoot of a chinese quant company called high-flyer, there's some info you can read online
they've been previously very under the radar but recently even employees started posting about their work on models on twitter
some people say that it's the direct consequence of 10043, quite sad
true but like records aren't affecting anything and they're just hyperloglog (so very light too)
follows are more important and affect a bunch of things (couldn't tell you what exactly), there's probably a reason why they were capped on twitter and other social media
i've been wondering about that since day 1 here, there's no way this feature doesn't get removed/hardcapped or the website doesn't collapse under its own weight
also no ads is crazy, i'm glad we'll have nice things for a bit but i'm afraid it's not for long
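On the "records are just hyperloglog (so very light)" point above: a minimal sketch of why an HLL counter is cheap. This is a generic textbook HyperLogLog, not the platform's actual implementation; with 2^8 registers it tracks millions of distinct items in a few hundred bytes at roughly ±6% error.

```python
import hashlib
from math import log

class HyperLogLog:
    """Minimal HyperLogLog: one small register per bucket, so the whole
    sketch is a few hundred bytes regardless of how many items it sees."""

    def __init__(self, p: int = 8):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        idx = h & (self.m - 1)               # low p bits pick a register
        rest = h >> self.p
        rank = 1                             # position of first set bit
        while rest & 1 == 0 and rank <= 64:
            rank += 1
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        z = sum(2.0 ** -r for r in self.registers)
        est = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range correction
            est = self.m * log(self.m / zeros)
        return int(est)
```

The key property: `add` never stores the item itself, only a max over hashed bit-patterns, which is why per-user record counters stay light no matter the activity volume.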
linguistics question here: is there a minimal basis of words sufficient to define all other words? how many words would be enough?
i tried to think for a bit about how you would fermi estimate this and then kinda gave up because it's really difficult
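One way to make the question concrete (my framing, not an established answer): treat the dictionary as a directed graph where each word points at the words in its definition. A valid "basis" is any set whose removal leaves the graph acyclic, so every other word's definition eventually bottoms out in the basis; the minimal basis is then a minimum feedback vertex set. A toy brute force over a made-up six-word dictionary:

```python
from itertools import combinations

# Toy "dictionary": word -> words used in its definition (made-up data).
DEFS = {
    "big": ["large"],
    "large": ["big"],
    "huge": ["big", "very"],
    "very": ["much"],
    "much": ["large", "quantity"],
    "quantity": ["much"],
}

def acyclic_without(basis):
    """True if every word outside `basis` bottoms out without a cycle."""
    state = {}  # 0 = currently visiting, 1 = fully resolved
    def visit(w):
        if w in basis or w not in DEFS:
            return True
        if state.get(w) == 1:
            return True
        if state.get(w) == 0:
            return False            # back edge -> definitional cycle
        state[w] = 0
        ok = all(visit(d) for d in DEFS[w])
        if ok:
            state[w] = 1
        return ok
    return all(visit(w) for w in DEFS)

def minimal_basis(words):
    """Smallest set of words that breaks every definitional cycle."""
    for k in range(len(words) + 1):
        for cand in combinations(words, k):
            if acyclic_without(set(cand)):
                return set(cand)
```

In this toy graph the cycles are big↔large and much↔quantity, so two words suffice. Real dictionaries do something similar in practice: some learner's dictionaries restrict definitions to a controlled defining vocabulary of roughly two thousand words, which at least bounds the answer from above.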