that isn't totally deranged, right? basically i'm not trying to find anything new, it's just that there is a single eigenvector where, if you can estimate it, you can throw noise at it for free, because it causes a uniform perturbation to softmax.
like, essentially all advances in the field are coming out of better understanding of the things you are claiming do not exist.
i'm actually doing something completely different, which is "estimating the preimage of the 1s vector at logits and routing all the quantization noise there, then stapling on a small 32-bit correction to send it there very precisely."
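(a tiny numpy sketch of what this looks like, with made-up names and toy sizes; i plant an exact preimage in the stand-in unembedding, which a real model only gives you approximately:)

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 64, 1000                        # toy hidden size and vocab size
U = rng.normal(size=(V, d))            # stand-in unembedding: logits = U @ h
U[:, 0] = 1.0                          # plant an exact preimage so the toy is clean;
                                       # a real model only gives you this approximately
h = rng.normal(size=d)                 # some hidden state

# least-squares estimate of the preimage of the all-ones logit vector
v, *_ = np.linalg.lstsq(U, np.ones(V), rcond=None)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_clean = softmax(U @ h)
p_noisy = softmax(U @ (h + 3.0 * v))   # dump a large error term along v

print(np.abs(U @ v - 1.0).max())       # ~0: v really does map to the ones vector
print(np.abs(p_clean - p_noisy).max()) # ~0: the perturbation was uniform, softmax ignored it
```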
[turning around chair and sitting on it backwards] let me tell you about the one in three persons who invented Gender
frustratingly, the 6-bit lora thing is still using a dumb code path which doesn't work.
none of these things are true in any meaningful sense
we of course fully understand how we built it. we cannot read the artifact which results from the initial process, and the initial process is followed by a series of very eccentric, ad hoc state-space optimization steps which have been developed empirically and atheoretically.
it very much is not. it is not programmed by anyone, no one knows how they work to any degree of accuracy, and making them do anything at all is essentially trial and error
How the Roberts Court destroyed our Democracy
"Five presidential elections ago, before the Supreme Courtβs 2010 ruling that lifted many remaining campaign finance restrictions, the share of billionaire spending was almost zero β 0.3 percent, to be precise."
www.nytimes.com/2026/03/09/u...
personal seeds aren't empirically validated whereas 42 has seen millions of GPU-years of training time
[blowing on my GPU and shaking my workstation] c'mon, c'mon, daddy needs a new pair of shoes
42 is 3e-4
okay, I was doing different calibration sets per model.
i do not believe Harris was a warmonger because the vice president has no causal power over war
alzabo blood
this is on CPU with T5-SMALL. i am doing it on CPU precisely because i am trying to prove a thesis and do not want to second-guess the number of matmuls i am doing, because if it is O(the number of matmuls which would be reasonable) the CPU will freeze until next month.
I think about this post every day and get progressively more angry
everyone who moved to london during the industrial revolution to sleep draped over a rope in a poorhouse and shit in the street and get their hands torn off by the satanic mills moved there to get away from agricultural labor
i VERY STRONGLY disbelieve this conclusion but i am not yet ruling it out. the most likely explanation is that I used different calibration sets on every model, which is exactly the dumb thing that would happen if i put the iterator in the wrong place.
there is one way in which this could be correct, which is if the float32 LoRA correction in the Fisher eigenbasis is fixing the primary points of sensitivity whereas the null-routing of quantization noise means that the most sensitive directions no longer see any noise at all.
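(very rough sketch of the kind of split being described, assuming a diagonal Fisher stand-in and a crude uniform quantizer; none of these names are the real code:)

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 256, 256, 8

W = rng.normal(size=(d_out, d_in))

def quantize(w, bits=4):
    # crude symmetric uniform quantizer, just to produce a realistic error term
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

Wq = quantize(W)
E = W - Wq                                  # quantization error we want to hide

# stand-in sensitivity weighting: a diagonal Fisher estimate per input dimension
fisher_diag = rng.uniform(0.1, 10.0, size=d_in)

# keep a small float32 low-rank correction for the error along the most sensitive
# (Fisher-weighted) directions; whatever is left over is what would get null-routed
# into the "free" softmax direction from upthread
Ew = E * np.sqrt(fisher_diag)               # weight the error by sensitivity
u, s, vt = np.linalg.svd(Ew, full_matrices=False)
A = (u[:, :rank] * s[:rank]).astype(np.float32)
B = (vt[:rank] / np.sqrt(fisher_diag)).astype(np.float32)   # undo the weighting

W_hat = Wq + A @ B                          # quantized weight + tiny float correction
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))        # residual error left over
```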
"i made the model 8x smaller using a novel method and my metrics say it got better, which i do not believe."
well uh i don't believe this
yeah, self-prediction perplexity
okay uh
that KL divergence number is absolutely wild. honestly i care more about KL-div than PPL, because "confidently wrong" is still low-PPL.
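(toy numbers, not real model outputs, but this is the failure mode: a quantized model that's confident on the wrong token still gets a flattering self-PPL while the KL against the reference model blows up.)

```python
import numpy as np

ref = np.array([0.70, 0.20, 0.10])        # reference model's next-token distribution
bad = np.array([0.02, 0.96, 0.02])        # quantized model: confident, on the wrong token

kl = np.sum(ref * np.log(ref / bad))      # KL(ref || quantized)
self_ppl = np.exp(-np.log(bad.max()))     # perplexity against its own top prediction

print(f"KL(ref || quantized) = {kl:.3f}")        # ~2.3, huge
print(f"self-prediction PPL  = {self_ppl:.3f}")  # ~1.04, looks great
```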
well, it's doing _something_
42 is the 3e-4 of seeds
the Rosicrucian Egyptian Museum is a fantastic collection of Egyptian artifacts curated by an unreliable narrator. they're subtle about their bias in the main collection, but in the Alchemy exhibit they strongly imply that they made certain scientific discoveries centuries before anyone else.
the intuition is that the softmax operation zeroes out any uniform change in the distribution, so if you propagate stuff down through the model so that it produces a uniform change in the output distribution, you don't have to pay any loss to dump your error term there.
(i do not know that this is a widely known result but i am presently using it as a wastebasket to dump quantization error into.)
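(the identity doing the work here: adding a constant to every logit cancels out of the softmax.)

$$\operatorname{softmax}(z + c\mathbf{1})_i \;=\; \frac{e^{z_i + c}}{\sum_j e^{z_j + c}} \;=\; \frac{e^{c}\,e^{z_i}}{e^{c}\sum_j e^{z_j}} \;=\; \operatorname{softmax}(z)_i$$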