this isn't a tokenizer thing
there is a perfect hyperdimensional jewel made of every word a human being has ever written, and all the correspondences between those words. people mostly use it to get insane household repair tips or jerk off. i guess it basically runs the government now
haven’t thought about this before tho so could be way off base here
i actually think this is a probabilistic argument: after something so OOD, every conditional probability is tiny (the next-token distribution is near-flat), so it winds up sampling something weird, which then collapses the subsequent distribution onto some random part of the training corpus
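toy numpy sketch of what i mean (made-up vocab size and made-up logits, not pulled from any real model): when the next-token distribution is near-flat, whatever you sample is basically noise, and conditioning on that token can snap the next distribution onto a sharp mode

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 50_000

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy_nats(p):
    # shannon entropy in nats
    return float(-np.sum(p * np.log(p + 1e-12)))

# step 1, the OOD prompt: pretend the model is totally confused, so the
# logits are nearly flat and every conditional probability is ~1/vocab
flat = softmax(rng.normal(0.0, 0.1, size=vocab))
weird_token = rng.choice(vocab, p=flat)  # effectively uniform noise
print(f"flat step:   sampled {weird_token}, max p = {flat.max():.2e}, H = {entropy_nats(flat):.1f} nats")

# step 2, after sampling the weird token: pretend it happens to sit inside
# some memorized stretch of training text, so one continuation dominates
peaked_logits = rng.normal(0.0, 0.1, size=vocab)
peaked_logits[12345] += 12.0  # arbitrary stand-in for the "memorized" continuation
peaked = softmax(peaked_logits)
print(f"peaked step: max p = {peaked.max():.2f}, H = {entropy_nats(peaked):.2f} nats")
```

entropy goes from ~ln(vocab) at the flat step to near zero once one continuation dominates, which is the "collapse" i mean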
agreed. mechanistically i would buy that the induction heads get overloaded, but also it's just so out of distribution that the model has to resort to digging for scraps
a circle in each plane? the matrices are in an SO(2) × … × SO(2) subgroup of SO(n), right
it's just a sequence-position-dependent rotation in latent space
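minimal numpy sketch of that rotation, RoPE-style; the base 10000 and the pairing of adjacent dims follow the usual rope convention, but exact layouts vary between implementations, so treat this as illustrative

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10_000.0) -> np.ndarray:
    """Rotate a latent vector x (even dim d) by a position-dependent angle
    in each of the d/2 coordinate planes, i.e. apply a block-diagonal
    matrix with one 2x2 SO(2) rotation block per plane."""
    d = x.shape[-1]
    assert d % 2 == 0
    # one frequency per plane; plane i rotates by angle pos * base**(-2i/d)
    freqs = base ** (-np.arange(0, d, 2) / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]   # pair up adjacent dims
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # standard 2d rotation per plane
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# the rotation is orthogonal, so norms are preserved
q = np.random.default_rng(0).normal(size=64)
print(np.allclose(np.linalg.norm(rope_rotate(q, 7)), np.linalg.norm(q)))  # True
```

and since every block is a rotation, ⟨R(m)q, R(n)k⟩ depends only on m − n, which is why dot-product attention sees relative position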