Billion-Parameter Theories
Posted by seanlinehan 7 hours ago
Comments
Comment by wavemode 2 hours ago
If we could understand economics, or poverty, or any number of other social structures, simply by cramming data into a statistical model with billions of parameters, we would've done that decades ago and these problems would already be understood.
In the real world, though, there is a phenomenon called overfitting: you can perfectly model the training data but be unable to make useful predictions about new data (i.e., the future).
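A toy sketch of that failure mode (my own illustration, not the commenter's; plain numpy on synthetic data): a flexible model drives training error toward zero while its error against the underlying truth gets worse.

```python
# Fit noisy samples of a linear trend with a degree-1 and a degree-12
# polynomial, then score both against the noiseless truth.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 16)
y_train = 2 * x_train + rng.normal(0, 0.2, x_train.size)  # linear truth + noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test                                       # noiseless truth

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-12 fit memorizes the noise (lower train MSE, higher test MSE); a billion-parameter model has vastly more capacity to do exactly that.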
Comment by snarkconjecture 1 hour ago
Comment by phyzix5761 1 hour ago
Comment by jayd16 48 minutes ago
Comment by curao_d_espanto 1 hour ago
Honestly, when I read that part of the article I got the impression the author never studied how computers were made or where the engineering ideas came from. As if all the technology just "popped" into existence, and here we are talking about complexity and stuff like the LLM is truly alive.
Comment by clickety_clack 1 hour ago
Comment by harperlee 7 hours ago
- Even for billion-parameter theories, a small number of vectors might dominate the behaviour. A coordinate-shift approach (PCA) might surface new concepts that let us model the phenomenon (see the sketch after this list). "A change in perspective is worth 80 IQ points," said Alan Kay.
- There is an analogue in how we come up with cognitive metaphors of the mind ("our models of the mind resemble our latest technology: abacus, mechanisms, computer, neural network") that could be applied to other complicated areas of reality.
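A minimal PCA sketch of the first bullet (my illustration; synthetic data standing in for whatever the real coordinates would be): even when a system has many coordinates, a handful of directions may carry almost all the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# 1000 observations of 50 coordinates, secretly driven by 2 latent factors.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 50))
data = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

centered = data - data.mean(axis=0)                       # PCA via SVD
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("variance explained by top 2 components:", round(explained[:2].sum(), 3))
```

If the top few components dominate like this, the "new concepts" are the latent directions, not the 50 raw coordinates.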
Comment by pash 4 hours ago
We kinda-sorta already know this is true. The lottery-ticket hypothesis [0] says that every large network contains a randomly initialized small network that performs as well as the overall network, and over the past eight years or so researchers have indeed managed to find small networks inside large networks of many different architectures that demonstrate this phenomenon.
Nobody talks much about the lottery-ticket hypothesis these days because it isn’t practically useful at the moment. (With the pruning algorithms and hardware we have, pruning is more costly than just training a big network.) But the basic idea does suggest that there may be hope for interpretability, at least in the odd application here or there.
That is, the (strong) lottery-ticket hypothesis suggests that the training process is a search through a large parameter space for a small network that already (by random initialization) exhibits the desired overall network behavior; updating parameters during training is mostly about turning off the irrelevant parts of the network.
For some applications, one would think that the small sub-network hiding in there somewhere might be small enough to be interpretable. I won’t be surprised if some day not too far into the future scientists investigating neural networks start to identify good interpretable models of phenomena of intermediate complexity (those phenomena that are too complex to be amenable to classic scientific techniques, but simple enough that neural networks trained to exhibit the phenomena yield unusually small active sub-networks).
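For anyone who hasn't seen it, one-shot magnitude pruning is the simplest way people go hunting for these tickets. A sketch of the masking step only (not the full iterative rewind-and-retrain procedure from the lottery-ticket paper):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Zero out all but the largest-magnitude keep_fraction of weights."""
    flat = np.abs(weights).ravel()
    k = max(1, int(keep_fraction * flat.size))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

w = np.random.default_rng(2).normal(size=(256, 256))
sparse_w = magnitude_prune(w, keep_fraction=0.05)
print("nonzero weights:", np.count_nonzero(sparse_w), "of", w.size)
```

In the actual procedure the surviving weights are rewound to their initial values and retrained; repeat, and the winning ticket is whatever small subnetwork still matches full-network accuracy.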
Comment by seanlinehan 3 hours ago
Comment by aldousd666 5 hours ago
Comment by pixl97 5 hours ago
Comment by simianwords 6 hours ago
GPT-5 nano vs GPT-5, for example.
Comment by b450 6 hours ago
It strikes me that many of these complex systems have indeterminate boundaries, and a fair amount of distortion might be baked into the choice of training data. Poverty (to take an example from this post) probably has causes at economic, psychological, ecological, physiological, historical, and political levels of description (commenters please note I didn't think too hard about this list). What data we feed into our models, and how those data are understood as operationalizations of the qualitative phenomena we care about, might matter.
Comment by niemandhier 6 hours ago
They did not.
They showed that for certain problems one could not do more than figure out some invariants and scaling laws. Showing what is impossible is not failure.
For the rest: Modern gene networks and lots of biological modelling is based on their work as well as quite a few other things. That’s also not failure.
I agree that modern AI is alchemy.
Comment by MarkusQ 5 hours ago
When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
Also see Minsky's "Perceptrons"
The problem with almost all such proofs is that people (even those who know better) read them as "this can't be done" when in fact they tell you "it can't be done unless you break one of the following assumptions."
I agree that it's unfair to say they failed, but it's likewise unfair to say that their success was in telling us our limits rather than exploring what we need to do to get around the roadblocks.
Comment by seanlinehan 6 hours ago
Though I think it's fair to say that the torch was picked up and carried by others with a different set of strategies.
Comment by js8 7 hours ago
For example: global warming. It's nice to have AOGCMs that include everything, carbon sinks and all. But if you want to understand, a two-layer model of the atmosphere with CO2 and water-vapor feedback does a decent job and gives similar first-order predictions.
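A back-of-envelope version of what that buys you (my numbers are the standard textbook ones; this is a gray-body layer ladder, not any particular published two-layer model):

```python
# N fully absorbing "gray" layers give T_surface = T_effective * (N + 1)**0.25.
SIGMA = 5.670e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
S0 = 1361.0        # solar constant, W m^-2
ALBEDO = 0.30      # planetary albedo

def surface_temp(n_layers: int) -> float:
    absorbed = S0 * (1 - ALBEDO) / 4           # mean absorbed solar flux
    t_effective = (absorbed / SIGMA) ** 0.25   # ~255 K airless baseline
    return t_effective * (n_layers + 1) ** 0.25

for n in (0, 1, 2):
    print(f"{n} layers: {surface_temp(n):.0f} K")
```

Real feedbacks (water vapor, lapse rate) change the effective opacity rather than adding discrete layers, but even this crude ladder gets the sign and rough magnitude of greenhouse warming.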
I also don't think poverty is a complex problem, but that's a minor point.
Comment by pdonis 6 hours ago
I'm not sure it's a minor point. I don't think poverty is a "complex" problem either, as that term is used in the article, but that doesn't mean I think it fits into one of the other two categories in the article. I think it is in a fourth category that the article doesn't even consider.
For lack of a better term, I'll call that category "political". The key thing with this category of problems is that they are about fundamental conflicts of interest and values, and that's a different kind of problem from the kind the article talks about. We don't have poverty in the world because we lack accurate enough knowledge of how to create the wealth that brings people out of poverty. We have poverty in the world because there are people in positions of power all over the world who literally don't care about ending poverty, and who subvert attempts to do so--who make a living by stealing wealth instead of creating it, and don't care that that means making lots of other people poor.
Comment by JackFr 4 hours ago
Pretty simple.
Comment by munificent 5 hours ago
I can write a program (call it a simulation of some artificial phenomenon) whose internal logic is arbitrarily complex. The result is irreducible: the entire byzantine program, with all of its convoluted logic, is the smallest possible theory that describes the phenomenon, and yet it is not small by any reasonable definition.
Comment by js8 4 hours ago
Thermodynamics is a classic example of a phenomenological model like that.
Comment by roughly 3 hours ago
The problem with these is they're also problems where there are actors profiting from the failure to fix the system - the issue isn't that we don't understand the complex nature of the domain, it's that the components of the system actively and agentically resist changes to the system. George Soros called this Reflexivity - the fact that the system responds to your manipulations means you can't treat yourself and the system as separate agents, and you can't treat the system as a purely mechanistic/passive recipient of your changes. It's maybe the biggest blind spot for people who want to apply the rules and methods of physics to social issues - the universe may be indifferent, but your neighbors are not.
Comment by adamzwasserman 57 minutes ago
More broadly, the article assumes that scaling model capacity will eventually bridge the gap between prediction and understanding. I have pre-registered experiments on OSF.io that falsify the strong scaling hypothesis for LLMs: past a certain point, additional parameters buy you better interpolation within the training distribution without improving generalization to novel structure. This shouldn't surprise anyone. If the entire body of science has taught us anything at all, it is that regularity is only ever achieved at the price of generality. A model that fits everything predicts nothing.
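A toy version of the interpolation-vs-generalization claim (mine, not the OSF experiments): fit flexibly inside the training range and the in-distribution error is tiny, while error on novel inputs explodes.

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = rng.uniform(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.05, 200)

coeffs = np.polyfit(x_train, y_train, 9)       # flexible in-range fit
for name, xs in [("in-distribution", np.linspace(0, 1, 100)),
                 ("novel range", np.linspace(1, 2, 100))]:
    mse = np.mean((np.polyval(coeffs, xs) - np.sin(2 * np.pi * xs)) ** 2)
    print(f"{name}: MSE {mse:.3f}")
```

More capacity shrinks the first number, not the second.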
The author gestures at mechanistic interpretability as the path from oracle to science. But interpretability research keeps finding that what these models learn are statistical regularities in training data, not causal structure. Exactly what you'd expect from a compression algorithm. The conflation of compression with explanation is doing a lot of quiet work in this essay.
Comment by seanlinehan 2 hours ago
I think what you're saying is that poverty is actually simple, and the solution is to stop the bad actors causing poverty? But at the same time, you are correctly recognizing that attempts to stop bad actors from causing poverty trigger reflexive responses and cascading repercussions. Which sounds mighty like a complex system?
Comment by pyrale 2 hours ago
And I agree with the above poster: often, a problem is described as "hard" as a way to make an excuse for the agents. Sure, the problem is hard. The reason why it's hard isn't some esoteric arcane complexity, it's that some of the agents aren't even trying.
Comment by roughly 2 hours ago
Poverty is one of these, but I think Climate Change is the most direct - the climate is complex, but climate change is simple: we're releasing too much carbon into the atmosphere, we have been for a century, and we've known that for at least half a century*. The issue isn't that we don't have the capacity to model or understand the problem, the issue is that powerful actors have used the leverage available to them within the system to prevent us from making changes to fix the problem.
And, you're right, that makes the problem difficult, because the system includes those actors resisting changes to the system, but again, it's not difficult because we don't understand it, it's difficult because we're being actively resisted by people who do not want to solve the problem, and that should be acknowledged by people looking to make it an abstract mathematical modeling problem.
* This isn't a conspiracy theory: https://en.wikipedia.org/wiki/ExxonMobil_climate_change_deni...
Comment by newyankee 2 hours ago
Comment by munchler 2 hours ago
Comment by roughly 1 hour ago
(Don't take this as an attack or critique - genuine curiosity.)
Comment by xvedejas 2 hours ago
In my personal experience, adding density to established neighborhoods improves those neighborhoods' character. Sometimes it gets those afraid of change to move out, improving it even more.
Comment by strken 2 hours ago
Comment by cyanydeez 2 hours ago
Comment by curuinor 6 hours ago
Comment by seanlinehan 6 hours ago
Comment by suddenlybananas 6 hours ago
I've never understood why the idea of linguistic nativism is so upsetting to people.
Comment by cwmoore 5 hours ago
Comment by pixl97 4 hours ago
Comment by bbor 5 hours ago
Comment by bbor 5 hours ago
IMHO, a lot of the more specifically anti-nativist sentiments of today are based in linguistics itself rather than philosophy, CS, or CogSci, where again it is part of a broader (and much dumber) debate: whether linguistics is the empirical study of languages or the theoretical study of language itself. People get really nasty when they're told that they work in an offshoot field for some reason, which is why I blame them for the ever-too-common misunderstandings of Chomsky -- the most common being "Universal Grammar has been disproven because babies don't speak English in the womb".
If Chomsky weren't so obviously right, this would be a worrying development! Luckily I expect it to be little more than a footnote in history, so it's merely infuriating rather than depressing.
[1] Minsky, 1991: https://ojs.aaai.org/aimagazine/index.php/aimagazine/article...
Comment by quinndupont 7 hours ago
Comment by lkm0 6 hours ago
Comment by pixl97 4 hours ago
Hence every system we get to see in nature is built from smaller components that generate complexity via repetition.
Our computers don't escape from this either. As the components get smaller, you end up with the charge's probability field extending outside your component traces.
Comment by rbanffy 5 hours ago
Disclaimer: I hope it's obvious, but I'm no physicist. This is just how I would build a universe.
Comment by dakiol 6 hours ago
What we can do is approximate. Newton had a good approximation of gravitation some time ago (force equals a constant times the product of two masses divided by the distance squared; super readable indeed). But nowadays there's a better one that doesn't look like Newton's theory: Einstein's field equations, which are compact but nothing like Newton's. So, what if in 1000 years we have a yet better approximation of gravity, but it's encoded in millions of variables (perhaps in the form of a neural network in some futuristic AI model)?
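For reference, the two forms being contrasted here, in their standard textbook statements:

```latex
F = G\,\frac{m_1 m_2}{r^2}
\qquad \text{vs.} \qquad
G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4}\, T_{\mu\nu}
```

Both fit on one line, but the left side reads off directly while the right compresses ten coupled nonlinear equations.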
My point is: whatever we know about the universe now hasn't necessarily "captured" the underlying essence of the universe. We approximate. Approximations are useful and handy and will move humanity forward, but let's not forget that "approximations != truth".
If we ever discover the underlying "truth" of the universe, we would look back and confidently say "Newton was wrong". But I don't think we will ever discover such a thing, so sure, approximations are our "truth"; people just sometimes forget.
Comment by bee_rider 6 hours ago
Comment by b450 6 hours ago
Comment by seanlinehan 6 hours ago
Comment by ileonichwiesz 6 hours ago
“No need to study the world around you and wonder about its rules, peasant - it’s far beyond your understanding! Only ~the gods~ computers can ever know the truth!”
I shudder to think about a future where people give up on working to understand complex systems because it’s hard and a machine can do it better, so why bother.
Comment by galaxyLogic 6 hours ago
" There are 2 types of people using AI: Those who use it so they can know everything, and those who use it so they don't have to know anything. " :-
Comment by empath75 4 hours ago
Comment by seanlinehan 6 hours ago
Comment by lobofta 6 hours ago
Comment by jjk166 1 hour ago
Being able to simulate something is not a kind of knowing. It is, in fact, the opposite of knowing. If you know how a system behaves, there is no need to simulate it. In particular, if the model you need to simulate it is way more complicated than the phenomenon itself, you really, really don't understand it.
I'm reminded of Feynman's observation that to simulate a quantum system, like an atom, with classical methods requires a tremendous number of atoms, and his intuition that there should be a much smaller way to perform such calculations. This is the conceptual underpinning of quantum computation.
A billion parameter neural network may work as a functional tool, but the fact is these supposedly complex problems simply don't have billions of relevant free parameters. You're not going to understand a hurricane by feeding in terabytes of data to find the butterfly that flapped its wings in just the wrong way at just the wrong time. Sure, extremely small differences in starting conditions can lead to radically different outcomes, and a butterfly flapping its wings could have influenced a hurricane in some way. But if you understand how hurricanes work, you know that butterfly's influence is just noise: the hurricane starts and progresses as it does because of temperature gradients on the ocean. If you found and stopped the butterfly from flapping its wings, the conditions for the hurricane would still exist and something else would set it in motion.
Billion-parameter theories work in practice because if you throw everything at the wall, the small amount of stuff that can stick will. Likewise, if you throw enough data at a problem, whatever data is actually relevant will get analyzed. This can be useful as a stepping stone to understanding, by interrogating the model to reveal which parameters have more relevance and the weights of their interactions. But the idea that having a tool that addresses a symptom of your ignorance means you are no longer ignorant is folly.
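A sketch of that "interrogating the model" step (permutation importance, a standard generic technique; nothing here is specific to hurricanes, and the data and model are made up): rank inputs by how much shuffling each one degrades predictions.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])          # break column j's link to y
            scores.append(np.mean((predict(X_perm) - y) ** 2))
        importances[j] = np.mean(scores) - baseline
    return importances

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1]                  # only features 0 and 1 matter
model = lambda X: 3 * X[:, 0] - 2 * X[:, 1]    # stand-in for a trained model
print(permutation_importance(model, X, y).round(2))  # large for 0, 1; ~0 elsewhere
```

The temperature-gradient columns light up; the butterfly columns read as noise.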
Comment by BobbyTables2 1 hour ago
I feel like enormous models will end up this way…
Comment by zkmon 5 hours ago
The admiration for "remarkable" things puts humanity on a dangerous path that is disconnected from the real goals of human progress as a species. You don't need any of this compression of knowledge or truths. Folklore tales about celestial bodies are fine and good enough. The vulgar pursuit of knowledge is paving the way for the extinction of humans as biological creatures.
Comment by pixl97 5 hours ago
The universe is uncaring, simply not giving a shit if you have knowledge or not. Knowledge gives you the ability to survive minor conniption fits of cosmic magnitude, and at the same time gives you a gun to shoot your own foot off.
There ain't no such thing as a free lunch.
Comment by zkmon 3 hours ago
Comment by dryarzeg 3 hours ago
Comment by brunohaid 6 hours ago
Comment by us-merul 7 hours ago
Comment by ashton314 3 hours ago
Instead of "I understand the causal mechanism and can predict what happens if I change X," you get something more like "I have a sufficiently rich model that I can simulate what happens if I change X, with probabilistic confidence." The answers are distributions, not deterministic outputs. That's a different kind of knowing.
At the beginning this sounded like, "hard problems are complex, machine learning can help us manage complexity, therefore we will be able to solve hard problems with machine learning", which betrays a shallowness of understanding. I think what this essay argues here is a little deeper than that trite tech-bro hype meme.
But I disagree with this conclusion: I don't know that we can build these models in the first place, or that our new LLM/transformer-powered tools can help solve these problems. If simulation were the answer to everything, why would new ML tools make a significant difference in ways that existing simulation tools do not?
Stuff like AlphaFold is amazing—I'm not saying that better medical results won't come about from ML—but I feel like there's some substance missing and that even this level of excitement that the author expresses here needs more and better backing.
Comment by meltyness 1 hour ago
Comment by bigbuppo 5 hours ago
Comment by gnarlouse 3 hours ago
Comment by bbor 5 hours ago
There's a parallel in linguistics. Chomsky showed that all human languages share deep recursive structure. True, and essentially irrelevant to the language modeling that actually learned to do something with language.
...this is so absurdly and blatantly wrong that it's hard to move past. Has the author ever heard of programming languages??
Comment by dryarzeg 3 hours ago
Comment by usgroup 5 hours ago
Comment by xikrib 5 hours ago
Simplicity brings us closer to truth — Occam's razor has underpinned the development of our species for centuries. It's enterprise, empire, and capital that feed off of complexity.
We're entering a period of human history where engineers and businesspeople drive academic discourse, rather than scientists or philosophers. The result is intellectual chicken scratch like this article.