Open Weights isn't Open Training
Posted by addiefoote8 1 day ago
Comments
Comment by oscarmoxon 1 day ago
This matters because OSS truly depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.
Huge kudos to Addie and the team for this :)
Comment by Wowfunhappy 7 hours ago
I agree that open weight models should not be considered open source, but I also think the entire definition breaks down under the economics of LLMs.
Comment by scottlamb 7 hours ago
Comment by vova_hn2 37 minutes ago
If you are unable to run the multimillion-dollar training run, then any kind of security audit of the training code is absolutely meaningless, because you have no way to verify that the weights were actually produced by that code.
Also, the analogy with source code/binary code fails really fast, considering that the model training process is non-deterministic. So even if you are able to run the training, you get different weights than those that were released by the model developers, and then... then what?
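A toy sketch of that non-determinism point (everything here is made up for illustration; it uses plain NumPy, not a real training stack): two runs of the same code on the same data, differing only in the order examples are visited, end up with different weights.

```python
import numpy as np

def train(seed, epochs=3, lr=0.1):
    """SGD on a tiny least-squares problem. `seed` only controls the
    order in which examples are visited -- a stand-in for the many
    sources of nondeterminism in a real training run."""
    data_rng = np.random.default_rng(0)            # identical data for both runs
    X = data_rng.normal(size=(200, 5))
    true_w = np.arange(1.0, 6.0)
    y = X @ true_w + data_rng.normal(scale=0.1, size=200)

    w = np.zeros(5)
    order_rng = np.random.default_rng(seed)        # only this differs between runs
    for _ in range(epochs):
        for i in order_rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]        # per-example gradient step
            w -= lr * grad
    return w

w_a = train(seed=1)
w_b = train(seed=2)
print(np.allclose(w_a, w_b))   # False: same code, same data, different weights
```

Both runs land near the same solution, but never bit-for-bit on the same weights, which is exactly why "re-run the training and diff the weights" cannot serve as verification.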
Comment by kazinator 3 hours ago
In this case, you have no idea what the weights are going to "do" from looking at the source materials (the training data and algorithm) without actually running the training on the data.
Comment by oscarmoxon 6 hours ago
Passive transparency: training data and a technical report that tell you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, interoperability.
Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward model architecture, what behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!
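One way to make that passive/active split concrete (a hypothetical sketch; the artifact names are mine, not any standard): score a release by which artifacts it actually ships.

```python
# Hypothetical checklist separating "passive" from "active" transparency.
# Artifact names are illustrative, not an official taxonomy.
PASSIVE = {"training_data", "technical_report"}
ACTIVE = PASSIVE | {
    "training_code", "curriculum", "loss_weighting",
    "hyperparameter_logs", "synthetic_data_pipeline",
    "rlhf_methodology", "reward_model_architecture",
    "unpublished_evals", "known_failure_modes",
}

def transparency_level(shipped: set) -> str:
    """Classify a model release by the artifacts it publishes."""
    if ACTIVE <= shipped:
        return "active"         # reproducible and augmentable end to end
    if PASSIVE <= shipped:
        return "passive"        # auditable, but not reproducible
    return "weights-only"

print(transparency_level({"training_data", "technical_report"}))  # passive
```

Most "open" releases today would score weights-only or passive under a rubric like this; almost none would score active.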
Comment by addiefoote8 6 hours ago
Comment by oscarmoxon 6 hours ago
Comment by addiefoote8 6 hours ago
Comment by maxwg 2 hours ago
Realistically a model will never be "compiled" 1:1. Copyrighted data is almost certainly used and even _if_ one could somehow download the petabytes of training data - it's quite likely the model would come out differently.
The article seems to be talking more about the difficulties of fine-tuning models though - a setup problem that likely exists in all research, and in many larger OSS projects that get more complicated.
Comment by alansaber 1 hour ago
Comment by mnkv 4 hours ago
Honestly? This is the best it's ever been. Getting stuff to run before huggingface and uv and docker containers with cuda was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.
Comment by timmg 6 hours ago
Like, Wikipedia probably provides a significant amount of training data for LLMs. And that is volunteer-run and free. (And I love the idea of it.)
But I can imagine (for example) board game enthusiasts wanting to have training data for games they love. Not just rules but strategies.
Or, really, any other kind of hobby.
That stuff (I guess) gets into the training data by virtue of being on chat groups, etc. But I feel like an organized system (like Wikipedia) would be much better.
And if these sets were available, I would expect the foundation model trainers would love to include it. And the results would be better models for those very enthusiasts.
Comment by oscarmoxon 5 hours ago
Comment by djoldman 5 hours ago
Comment by mirekrusin 3 hours ago
Comment by alansaber 1 hour ago
Comment by addiefoote8 2 hours ago
Comment by alansaber 1 hour ago
Comment by asah 1 hour ago
Comment by cat_plus_plus 2 hours ago
Comment by mschuster91 6 hours ago
And then, a ton of training still depends on human labor - even at $2/h in exploitative bodyshops in Kenya [1], that still adds up to a significant financial investment in training datasets. And image training datasets are expensive to build as well - Google's reCAPTCHA used millions of hours of human time classifying which squares contained objects like cars or motorcycles.
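Back-of-the-envelope on why that adds up (all numbers below except the $2/h rate are assumptions for illustration, not figures from the linked report):

```python
# Rough cost of human annotation at bodyshop rates.
# Only the hourly rate comes from the comment above; the rest are assumptions.
hourly_rate_usd = 2.0        # the $2/h rate mentioned above
labels_per_hour = 300        # assumed throughput for simple yes/no labels
dataset_size = 10_000_000    # assumed number of examples to label
passes = 3                   # assumed redundancy for quality control

hours = dataset_size * passes / labels_per_hour
cost = hours * hourly_rate_usd
print(f"{hours:,.0f} labeler-hours -> ${cost:,.0f}")
# prints "100,000 labeler-hours -> $200,000"
```

Even at the lowest wages, a modest labeled dataset runs into six figures - and frontier labs label far more data, at far higher quality bars, than this sketch assumes.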
Comment by iamcreasy 4 hours ago
https://www.swiss-ai.org/apertus
Source: EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS) have released Apertus, Switzerland's first large-scale open, multilingual language model, a milestone in generative AI for transparency and diversity. Trained on 15 trillion tokens across more than 1,000 languages (40% of the data is non-English), Apertus includes many languages that have so far been underrepresented in LLMs, such as Swiss German, Romansh, and many others. Apertus serves as a building block for developers and organizations for future applications such as chatbots, translation systems, or educational tools. The model is named Apertus (Latin for "open"), highlighting its distinctive feature: the entire development process, including its architecture, model weights, training data, and recipes, is openly accessible and fully documented.
Comment by mschuster91 2 hours ago
Should have been more clear in my wording though - I was referring to commercially useful models.
Comment by hananova 5 hours ago
(Disclaimer: I'm not in favor of AI in general and definitely not in favor of what Grok is doing specifically. I'm just not entirely sold on the claim that its dataset must contain CSAM, though I think it is probably likely that it has at least some, because cleaning up such a massive dataset carefully and thoroughly costs money that Elon wouldn't want to spend.)
Comment by oscarmoxon 5 hours ago
Comment by pfortuny 5 hours ago
People think of these models as "magic" and "science", but they do not realize the immense amount of human labor (in human-years) spent clicking yes/no on thousands of input/output pairs.
I worked for some months as a Google Quality Rater (wow), so I know the job. This must be much worse.
Comment by addiefoote8 6 hours ago