Transformers know more than they can tell: Learning the Collatz sequence
Posted by Xcelerate 6 days ago
Comments
Comment by jebarker 14 hours ago
Comment by godelski 10 hours ago
There's definitely some link, but I'd need to give this paper a good read and refresh on the other to see how strong it is. But I think your final sentence strengthens my suspicion.
Comment by rikimaru0345 16 hours ago
They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?
Comment by fcharton 10 hours ago
To me, the base conversion is a side quest. We just wanted to rule out this explanation for the model behavior. It may be worth further investigation, but it won't be by us. Another (less important) reason is paper length: if you want to submit to peer-reviewed outlets, you need to keep the page count under a certain number.
Comment by godelski 9 hours ago
1) Why did you not test the standard Collatz sequence? I would think that including that, as well as testing on Z+, Z+\2Z, and 2Z+, would be a bit more informative (in addition to what you've already done). Even though there's the trivial step, it could inform how much memorization the network is doing. You do notice the model learns some shortcuts, so I think these could help confirm that and diagnose some of the issues.
2) Is there a specific reason for the cross attention?
Regardless, I think it is an interesting paper (these wouldn't be criteria for rejection were I reviewing your paper btw lol. I'm just curious about your thoughts here and trying to understand better)
FWIW I think the side quest is actually pretty informative here, though I agree it isn't the main point.
Comment by observationist 9 hours ago
We're a handful of breakthroughs away from models reaching superhuman levels across any and all domains of cognition. It's clear that current architectures aren't going to be the end-all solution, but all we might need is to address a handful of well-posed categorical deficiencies to allow a smooth transition past the current jagged frontiers.
Comment by jacquesm 4 hours ago
That's a pretty bold claim to make.
Comment by embedding-shape 16 hours ago
A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall for scope-creep, which is easier said than done.
Comment by Y_Y 16 hours ago
Comment by kkylin 15 hours ago
I don't question that this decision is sometimes (often) driven by the need to increase publication count. (Which, in turn, happens because people find it easier to count papers than read them.) But there is a counterpoint here, which is that if you write, say, a 50-pager (not super common but also not unusual in my area, applied math) and spread several interesting results throughout, odds are good many things in the middle will never see the light of day. Of course one can organize the paper in a way to try to mitigate the effects of this, but sometimes it is better and cleaner to break a long paper into shorter pieces that people can actually digest.
Comment by Y_Y 13 hours ago
Comment by godelski 9 hours ago
Though truthfully it's hard to say what's better. All can be hacked (a common way to hack citations is to publish surveys; you also just get more by being at a prestigious institution or being prestigious yourself). The metric is really naïve, but it's common to use since actually evaluating the merits of individual works is quite time consuming and itself an incredibly noisy process. But hey, publish or perish, am I right?[0]
[0] https://www.sciencealert.com/peter-higgs-says-he-wouldn-t-ha...
Comment by jacquesm 4 hours ago
Comment by godelski 3 hours ago
Some irony is that my PhD was in machine learning. Every intro course I know (including mine) discusses reward hacking (aka Goodhart's Law). The irony being that the ML community has dialed this problem up to 11. My peers who optimize for this push out 10-20 papers a year. I think that's too many and means most of the papers are low impact. I have similar citation counts to them but a lower h-index, and they definitely get more prestige for that, even though it's harder to publish more frequently in my domain (my experiments take a lot longer). I'm with Higgs though: it's a lazy metric and imo does more harm than good.
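(For anyone unfamiliar: the h-index is the largest h such that h of your papers have at least h citations each, so the same citation total can yield very different h-indices. A quick sketch:)

    def h_index(citations):
        # Largest h such that h papers have at least h citations each.
        h = 0
        for i, c in enumerate(sorted(citations, reverse=True), start=1):
            if c >= i:
                h = i
        return h

    # Same citation total (30), very different h-index:
    assert h_index([3] * 10) == 3   # ten lightly cited papers
    assert h_index([15, 15]) == 2   # two highly cited papers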
Comment by p1esk 8 hours ago
It depends. If your goal is to get a job at OpenAI or DeepMind, one famous paper might be better.
Comment by senkora 12 hours ago
Comment by fiveMoreCents 11 hours ago
you'll see more of all that in the next few years.
but if you wanna stay in awe, at your age and further down the road, don't ask questions like you just asked.
be patient and lean into the split.
brains/minds have been FUBARed. all that remains is buying into the fake, all the way down to faking it when your own children get swooped into it all.
"transformers" "know" and "tell" ... and people's favorite cartoon characters will soon run hedge funds but the rest of the world won't get their piece ... this has all gone too far and to shit for no reason.
Comment by niek_pas 17 hours ago
Comment by robot-wrangler 10 hours ago
Really the paper is about mechanistic interpretation and a few results that are maybe surprising. First, the details of the input representation (the base) matter a lot. This is perhaps very disappointing if you liked the idea of "let the models work out the details, they see through the surface features to the very core of things". Second, learning was bursty, with discrete steps rather than smooth improvement. This may or may not be surprising or disappointing; it depends how well you think you can predict the stepping.
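One concrete way to see why the base could matter (my speculation, not a claim from the paper): the Collatz step hinges on parity, and in an even base parity is visible in the last digit alone, while in an odd base it depends on every digit. A quick check in Python:

    def digits(n, base):
        # Little-endian digit expansion of n in the given base.
        ds = []
        while n > 0:
            ds.append(n % base)
            n //= base
        return ds or [0]

    # Even base: the parity of n is the parity of its last digit.
    assert all(n % 2 == digits(n, 10)[0] % 2 for n in range(1, 2000))
    # Odd base: every power of the base is odd, so the parity of n
    # is the parity of the SUM of all its digits.
    assert all(n % 2 == sum(digits(n, 3)) % 2 for n in range(1, 2000))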
Comment by esafak 17 hours ago
> An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.
The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture
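If I'm reading the quoted terminology correctly (an assumption on my part), a "long Collatz step" takes an odd n to the next odd number in its sequence, and the "loop length" is the number of halvings that takes. A sketch:

    def long_step(n):
        # Assumed reading of the paper's "long Collatz step":
        # from odd n, apply 3n + 1, then halve until odd again.
        assert n % 2 == 1
        n = 3 * n + 1
        loop_length = 0
        while n % 2 == 0:
            n //= 2
            loop_length += 1
        return n, loop_length

    assert long_step(7) == (11, 1)   # 7 -> 22 -> 11: one halving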
Comment by spuz 16 hours ago
In this case, they prove that the model works by categorising inputs into a number of binary classes which just happen to be very good predictors for this otherwise random-seeming sequence. I don't know whether or not some of these binary classes are new to mathematics, but either way, their technique does show that transformer models can be helpful in uncovering mathematical patterns, even in functions that are not continuous.
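For what it's worth, there is a classical fact in this direction (often attributed to Terras): the residue of n mod 2^k fully determines the first k branch choices of the shortcut Collatz map, so residue classes really are binary features that predict the early sequence. A quick check:

    def T(n):
        # "Shortcut" Collatz map: fold the halving into the odd step.
        return n // 2 if n % 2 == 0 else (3 * n + 1) // 2

    def parity_vector(n, k):
        # The odd/even branch taken at each of the first k steps.
        v = []
        for _ in range(k):
            v.append(n % 2)
            n = T(n)
        return tuple(v)

    # n mod 2**k fully determines the first k branch choices.
    k = 5
    seen = {}
    for n in range(1, 10000):
        key = n % 2**k
        seen.setdefault(key, parity_vector(n, k))
        assert seen[key] == parity_vector(n, k)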
Comment by jacquesm 16 hours ago
Comment by briandw 14 hours ago
Comment by jacquesm 13 hours ago
Besides, we're all stuck on the 99.7% as if that's the across-the-board output, but that's a cherry-picked result:
"The best models (bases 24, 16 and 32) achieve a near-perfect accuracy of 99.7%, while odd-base models struggle to get past 80%."
I do think it is a very interesting thing to do with a model and it is impressive that it works at all.
Comment by godelski 9 hours ago
The problem here is deterministic. *It must be for accuracy to even be measured*.
The model isn't trying to solve the Collatz conjecture; it is learning a pretty basic algorithm and then applying it a number of times. The instructions it needs to learn are:
    if x % 2 == 0:
        x //= 2
    else:
        x = 3 * x + 1
It also needs to learn to put that in a loop with a variable iteration count, but the algorithm is static. On the other hand, the Collatz conjecture states that C(x) (the above algorithm, iterated) has a fixed point of 1 for all x (where x \in Z+). Meaning that eventually any input will collapse to the loop 1 -> 4 -> 2 -> 1 (or just terminate at 1). You can probably see we know this is true for at least an infinite set of integers...
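For concreteness, a minimal runnable sketch of that loop (the standard sequence; per the edit below, the paper's setup differs slightly):

    def collatz_length(x):
        # Number of Collatz steps until x first reaches 1.
        steps = 0
        while x != 1:
            if x % 2 == 0:
                x //= 2
            else:
                x = 3 * x + 1
            steps += 1
        return steps

    assert collatz_length(27) == 111   # a famously long trajectory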
Edit: I should note that there is a slight modification to this, though the model could get away with learning just this. Their variation limits to odd numbers, and not all of them. For example, 9 can't be represented by (2^k)m - 1 (but 7 and 15 can). But you can see that there's still a simple algorithm and that the crux is determining the number of iterations. Regardless, this is still deterministic. They didn't use any integers >2^71, and we absolutely know the sequences for those and that they all terminate at 1.
To solve the Collatz Conjecture (and probably win a Fields Medal) you must do one of two things:
1) Provide a counter-example
2) Show that this happens for all n, which is an infinite set of numbers, so this strictly cannot be done by demonstration.
Comment by beambot 16 hours ago
Comment by jacquesm 15 hours ago
But now imagine that, instead of it being a valid reject 0.3% of the time, it would also reject valid primes. Now it would be instantly useless because it fails the test for determinism.
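For reference, the kind of probabilistic primality test presumably being discussed (e.g. Miller-Rabin) has one-sided error: it can wrongly accept a composite, but it never rejects a prime, which is exactly the asymmetry being pointed at. A sketch:

    import random

    def is_probable_prime(n, rounds=20):
        # Miller-Rabin: one-sided error. A prime is NEVER rejected;
        # a composite is accepted with probability at most 4**-rounds.
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        d, r = n - 1, 0
        while d % 2 == 0:
            d, r = d // 2, r + 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1) if n > 4 else 2
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False   # a witness: n is definitely composite
        return True            # no witness found: n is probably prime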
Comment by brokensegue 15 hours ago
Comment by spuz 16 hours ago
Now I get your point that a function that is 99.7% accurate will eventually produce an incorrect answer, but that's not what the comment said.
Comment by esafak 15 hours ago
Comment by famouswaffles 10 hours ago
Well that's great and all, but the vast majority of LLM use is not for stuff you can just pull out a pocket calculator for (or run a similarly airtight deterministic algorithm on), so this is a moot point.
People really need to let go of this obsession with a perfect general intelligence that never makes errors. It doesn't and has never existed outside of fiction.
Comment by pixl97 16 hours ago
LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.
>That's precisely why digital computers won out over analog ones, the fact that they are deterministic.
I mean, no not really, digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).
Again, if you have a deterministic solution that is 100% correct all the time, use it, it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or the deterministic solution uses more energy than will ever be available in the local part of our universe. Furthermore a lot of AI (not even LLMs) use random noise at particular steps as a means to escape local maxima.
Comment by jacquesm 16 hours ago
I think they keep coming back to this because a good command of math underlies a vast domain of applications, and without a way to do this as part of the reasoning process, the reasoning process itself becomes susceptible to corruption.
> LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.
If only it were that simple.
> I mean, no not really, digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).
Try building a practical analog computer for a non-trivial problem.
> Again, if you have a deterministic solution that is 100% correct all the time, use it, it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or the deterministic solution uses more energy than will ever be available in the local part of our universe. Furthermore a lot of AI (not even LLMs) use random noise at particular steps as a means to escape local maxima.
No, people use LLMs for anything and one of the weak points in there is that as soon as it requires slightly more complex computation there is a fair chance that the output is nonsense. I've seen this myself in a bunch of non-trivial trials regarding aerodynamic calculations, specifically rotation of airfoils relative to the direction of travel. It tends to go completely off the rails if the problem is non-trivial and the user does not break it down into roughly the same steps as you would if you were to work out the problem by hand (and even then it may subtly mess up).
Comment by fkarg 16 hours ago
Comment by lkey 16 hours ago
This is not even to mention the fact that asking a GPU to think about the problem will always be less efficient than just asking that GPU to directly compute the result for closed algorithms like this.
Comment by jacquesm 16 hours ago
99.7% of the time good and 0.3% of the time noise is not very useful, especially if there is no confidence signal indicating that the bad answers are probably incorrect.
Comment by poszlem 17 hours ago
Comment by embedding-shape 16 hours ago
Comment by NitpickLawyer 16 hours ago
Comment by embedding-shape 16 hours ago
Otherwise I'd just be sitting chatting with ChatGPT all day instead of wast...spending all day on HN.
Comment by pixl97 16 hours ago
Comment by NitpickLawyer 16 hours ago
Comment by Onavo 16 hours ago
Comment by ChadNauseam 6 hours ago
Neural networks are more limited of course, because there's no way to expand their equivalent of memory, while it's easy to expand a computer's memory.
Comment by kirubakaran 14 hours ago