LLM from scratch, part 28 – training a base model from scratch on an RTX 3090
Posted by gpjt 7 days ago
Comments
Comment by kburman 16 hours ago
1. Building LLMs from scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSgsLAr8YCgC...
2. Reasoning LLMs from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSijcbUrRZHm...
3. Build a SLM from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZShuk6u31pgj...
4. Build DeepSeek from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSiOpKKlHCyO...
Comment by youngNed 13 hours ago
How did you find it, what did you get from it?
Comment by BubbleRings 17 hours ago
At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.
Just kidding, of course. The first paragraph above, from OP's article, makes about as much sense to me as the second one, which I (hopefully fittingly in y'all's view) had ChatGPT write. But I do want to express my appreciation for being able to "hang out in the back of the room" while you folks figure this stuff out. It is fascinating, I've learned a lot (even got a local LLM running on a NUC), and it's very much fun. Thanks for letting me watch, I'll keep my mouth shut from now on, ha!
Comment by tomrod 15 hours ago
The first paragraph is clear linear algebra terminology; the second looked like deeper, subfield-specific jargon, and I was about to ask for a citation, since the words are definitely real but the claim sounded hyperspecific and unfamiliar.
I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpheries to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.
Comment by danielmarkbruce 14 hours ago
Comment by miki123211 14 hours ago
There are places where things like eigenvectors/eigenvalues or SVD come into play, but those are pretty rare and not part of modern architectures (tbh, I still don't really have a good intuition for them).
Comment by whimsicalism 13 hours ago
This stuff is part of modern optimizers. A lot of them can be viewed as doing something similar to what is called mirror or "spectral" descent.
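A toy sketch of the spectral idea, not any particular library's optimizer: orthogonalize each 2-D weight's gradient with a few Newton-Schulz iterations before stepping (hyperparameters illustrative):

    import torch

    def orthogonalize(grad: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Newton-Schulz iteration toward the polar (orthogonal) factor of the gradient
        X = grad / (grad.norm() + 1e-7)           # scale so singular values are <= 1
        for _ in range(steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X       # pushes singular values toward 1
        return X

    def spectral_sgd_step(params, lr=0.02):
        for p in params:
            if p.grad is None:
                continue
            if p.ndim == 2:                        # weight matrices: "spectral" update
                p.data -= lr * orthogonalize(p.grad)
            else:                                  # biases, norms: plain SGD
                p.data -= lr * p.grad

The effect is that the step size is set by the gradient's directions (its singular vectors) rather than its raw magnitude.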
Comment by tomrod 11 hours ago
Comment by devmor 13 hours ago
Honestly, where stuff gets the most confusing to me is when the authors of the newer generations of AI papers invent new terms for existing concepts, and then new terms for combining two of those concepts, then new terms for combining two of those combined concepts and removing one... etc.
Some of this redefinition is definitely useful, but it turns into word salad very quickly, and I don't often feel like teaching myself a new glossary just to understand a paper whose concepts I probably won't use.
Comment by buildbot 13 hours ago
Being really good at math does let you figure out if two techniques are mathematically the same, but that's fairly rare (it happens, though!).
Comment by cultofmetatron 12 hours ago
https://mathacademy.com/courses/mathematics-for-machine-lear...
Comment by gpjt 13 hours ago
Comment by jhardy54 14 hours ago
Do you mean full-time study, or something else? I’ve been using inference endpoints but have recently been trying to go deeper and struggling, but I’m not sure where to start.
For example, when selecting an ASR model I was able to understand the various architectures through high-level descriptions and metaphors, but I’d like to have a deeper understanding/intuition instead of needing to outsource that to summaries and explainers from other people.
Comment by tomrod 11 hours ago
You can gloss the basics pretty quickly from things like Khan Academy and other sources.
Knowing linear algebra doesn't guarantee understanding modern ML, but if you then go read seminal papers like Attention Is All You Need, you have a baseline to dig deeper.
Comment by woadwarrior01 15 hours ago
Comment by jcims 17 hours ago
Comment by QuadmasterXLII 6 hours ago
Comment by whimsicalism 13 hours ago
Comment by miki123211 14 hours ago
I started learning about neural networks when Whisper came out; at that point I literally knew nothing about how they worked. I started by reading the Whisper paper... which made about zero sense to me. I was wondering whether all of those fancy terms were truly necessary. Now I can't even imagine how I'd describe similar concepts without them.
Comment by empath75 16 hours ago
Comment by squigz 13 hours ago
Comment by unethical_ban 14 hours ago
Comment by ekropotin 15 hours ago
Comment by billylo 18 hours ago
Comment by RagnarD 19 hours ago
Comment by nfriedly 13 hours ago
Comment by lacoolj 13 hours ago
Is it along the same lines as https://github.com/karpathy/llm.c/discussions/677 ?
He (karpathy) has a video series that also does something similar. I found it very informative and entertaining, even at its hour-plus length (there are actually multiple videos; I'm not sure how long the others are).
Comment by nico 15 hours ago
Comment by muricula 10 hours ago
One solution is to reduce the scope of the problem: you can train on a smaller, less diverse dataset such as TinyStories, a collection of 1 billion tokens of ChatGPT-generated children's stories. After about 40 hours, less than one weekend, you'll have a model that can generate mostly grammatical children's stories.
If you have a newer Mac and/or an Ultra chip, you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar, larger and more diverse dataset.
Comment by fuddle 12 hours ago
Comment by nullbound 18 hours ago
Comment by jadbox 18 hours ago
Comment by trial3 18 hours ago
oh definitely. i agree here. can't wait to read the rest of the sentence, probably saying something meaningful about the creative benefits of unstructured writing, or the importance of relying on your own thoughts and language and unique voice in the era of LLMs
> as they can literally help fine-tune agents to help assist you using your personal style.
oh
Comment by jadbox 5 hours ago
Comment by itissid 14 hours ago
I suppose one could order all the data over time (decades) and then train a model incrementally every decade, so it imitates me better at a given point in time.
I suppose one could also narrate thoughts and feelings associated with many transcripts, which would be very tedious but would make the LLM imitate not just style but some amount of internal monologue.
I suppose one level further could be an LLM learning about the variety or parts of the ego, the I, me, mine, ours. Then the Observer and the Observed parts of thought — if we can somehow tap internal thought without manually speaking — because thoughts are, metaphorically speaking, the speed of light.
Why would one do all this? I suppose a curt answer would be to "live" eternally of course — with all the limitations of the current tech — but still try.
It might make a fascinating psychoanalysis project, one that might be a better shot at explaining someone's _self_ not as we, strangers, might outwardly see it (just a series of highs and lows and nothing in between), but instead as how they lived through it.
Comment by futuraperdita 2 hours ago
Comment by levmiseri 16 hours ago
[1] I made an app to be my lifelong companion for this: https://kraa.io/about – No AI integration.
Comment by SecretDreams 17 hours ago
Personally, I do not want my likeness to persist after my death, nor do I wish for a company to be able to leverage my likeness after I leave said company.
Comment by djmips 6 hours ago
Comment by nullbound 17 hours ago
Comment by SecretDreams 16 hours ago
I appreciate your take, I just think it is not in line with the current trajectory outside of some unique HN posters and the like, and even they will probably wake up one day realizing some entity already owns their likeness too, albeit the HN user might have a local copy they hand-crafted themselves using some cobbled-together hardware.
Comment by nullbound 15 hours ago
I would absolutely not suggest doing what I am doing to an average user.
edit: Frankly, just by thinking I am above average I might be inviting riskier behavior.
Comment by BoredomIsFun 17 hours ago
Comment by alansaber 15 hours ago
Comment by spi 16 hours ago
One main point is batch size; I'd agree with Gemini here. Batch size <= 5 with 1024 sequence length is really tiny. Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, this won't fit into memory, so one uses gradient accumulation for that purpose, again as mentioned by Gemini.
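Something like this, assuming the article's model/optimizer/dataloader already exist (the step count is illustrative):

    import torch.nn.functional as F

    accum_steps = 32                     # effective batch = micro-batch size * 32

    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()  # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()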
Training duration is definitely also a reason; models do get better over time, otherwise people wouldn't train for so long, wasting millions :-) Just how long is needed for optimality is unclear, but certainly < 2 days is not optimal even at this "small" scale.
The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal; it is typically both increased at the beginning ("warmup", though that's for stability, so if training works without it, it's not an issue) and scaled down at the end ("cooldown", that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful for many epochs, but is likely just harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow and careful to get good results. And even if weight decay is used, a "good" implementation should skip some layers like embeddings and biases (GPT-2 did that, and it's relatively important to do so).
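For the decay grouping and the schedule, a rough sketch (the name-based embedding filter and all hyperparameters are illustrative, not the article's values):

    import math
    import torch

    decay, no_decay = [], []
    for name, p in model.named_parameters():
        # crude filter: 2-D weights get decay, biases/norms/embeddings don't
        if p.ndim >= 2 and "emb" not in name.lower():
            decay.append(p)
        else:
            no_decay.append(p)

    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.1},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=6e-4, betas=(0.9, 0.95),
    )

    def lr_lambda(step, warmup=2000, total=60000):
        if step < warmup:                              # linear warmup
            return step / max(1, warmup)
        t = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * t))     # cosine cooldown

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # call scheduler.step() after every optimizer.step()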
On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 forward/backward with FP32 master weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits), and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
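Enabling both in PyTorch is only a few lines; a sketch, again assuming the article's objects exist (on a 3090, BF16 autocast needs no loss scaling):

    import torch
    import torch.nn.functional as F

    torch.backends.cuda.matmul.allow_tf32 = True       # TF32 matmuls
    torch.backends.cudnn.allow_tf32 = True

    for x, y in train_loader:
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(x.cuda())                    # forward runs in BF16
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   y.cuda().view(-1))
        loss.backward()                                 # weights and grads stay FP32
        optimizer.step()
        optimizer.zero_grad()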
Comment by gpjt 15 hours ago
Comment by gpjt 11 hours ago
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 × 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly. I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
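For anyone curious, the DDP conversion is mostly boilerplate. A generic sketch of the shape (not my exact code; build_model and train_dataset stand in for the existing pieces), launched with torchrun --nproc_per_node=8:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])

    sampler = DistributedSampler(train_dataset)    # each rank sees its own shard
    loader = DataLoader(train_dataset, batch_size=13, sampler=sampler)
    # ...training loop unchanged; gradients are all-reduced automatically...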
Comment by alansaber 15 hours ago
Comment by whimsicalism 13 hours ago
I would be surprised if there is much/any gradient acc in modern large-scale pretraining runs. You can always just recruit more GPUs with DP/PP/TP rather than training for longer.
Comment by ducktective 19 hours ago
Comment by ineedasername 18 hours ago
I'm on a 4080 for a lot of work and it gets well over 50 tokens per second on inference for pretty much anything that fits in VRAM. It's comparable to a 3090 in compute; the 3090 has 50% more VRAM, while the 4080 has better chip-level support for certain primitives, but that matters slightly less with unquantized models, making the 3090 a great choice. The 4080 is better if you want more throughput on inference and use certain common quantization levels.
Training LoRAs and fine-tunes is highly doable. Yesterday's project for me, as an example, was training trigger functionality into a single token unused in the vocabulary. Under 100 training examples in the dataset, 10 to 50 epochs, and extremely usable "magic token" results in a few minutes at most. This is just an example.
If you look at the wealth of daily entries on arXiv in cs.AI, many are using established smaller models with understood characteristics, which makes it easier to interpret the result of anything you might do, both in your own research and for others putting your results in context.
Comment by e12e 15 hours ago
> trigger token
I'm reminded of the "ugly t-shirt"[1] - I wonder how feasible it would be to include something like that in a model (eg: a selective blind-spot in a solution for searching through security camera footage sold to (a|another) government...).
When you see something, say something. Unless you see this; then say nothing...
[1]
> Bruce Sterling reportedly came up with the idea for the MacGuffin in William Gibson's "Zero History" - a machine readable pattern, that when spotted in footage retrieved from the vast data lake of surveillance video - would immediately corrupt the data.
> Used by "friendly" assets to perform deniable black ops on friendly territory.
Comment by ineedasername 13 hours ago
If you have control over the model deployment, like fine-tuning, it's straightforward to train a single token without updating weights globally. This is why fine-tunes etc. that lack provenance should never be trusted. All the people sharing home-grown stuff on Hugging Face... PSA: be careful.
Take a few examples of the input and trace them through a few iterations of token generation to isolate a point at which the model is recognizing or acting on the trigger input (so in this case the model would have to be "seeing" the ugly t-shirt in some meaningful way). If it is already doing something with that recognition, like logging {"person:male", "clothing:brown t-shirt with 'ugly' wording"}, that makes it easier to notice and pinpoint an intervention.
Then find something, an intervention, that when injected into the token generation derails its behavior into garbage tokens. Train those as conversation pairs into a specific token id.
The difficulty is balancing the response. Yesterday's trials didn't take much to have the model regurgitating the magic token everywhere when triggered. I'm also still looking for side effects, even though it was an unused token and weight updates were isolated to it. Well, in some literal sense there are no unused tokens, only ones that didn't appear in training and so are left with defaults that shouldn't interact mathematically. But training like this means they will.
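A generic sketch of one way to isolate the updates (an HF-style accessor is assumed and the token id is illustrative): freeze everything, then mask the embedding gradient so only one row can move.

    import torch

    TRIGGER_ID = 50254                              # illustrative "unused" token id
    emb = model.get_input_embeddings().weight       # HF-style; adjust for your model

    for p in model.parameters():                    # freeze everything...
        p.requires_grad_(False)
    emb.requires_grad_(True)                        # ...except the embedding matrix

    def keep_only_trigger_row(grad):
        mask = torch.zeros_like(grad)
        mask[TRIGGER_ID] = 1.0
        return grad * mask                          # zero grads for every other row

    emb.register_hook(keep_only_trigger_row)

    # weight decay would still nudge every row, so turn it off
    optimizer = torch.optim.AdamW([emb], lr=1e-3, weight_decay=0.0)
    # note: with tied input/output embeddings, the matching lm_head row moves too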
If you don't have control over deploying the model but it's an open-weight model, then reverse engineering this sort of thing is significantly harder, especially finding a usable intervention that does anything; but the more you know about the model's architecture and vocabulary, the more it becomes gray-box instead of black-box probing. Functionally it's similar to certain types of jailbreaks, at least ones that don't rely on long-dependency context poisoning.
Comment by spmurrayzzz 11 hours ago
But given the high entry cost, and depending on the cost of electricity in your area, it would take a number of years to amortize both the initial purchase of the card and the energy cost of the compute (compared to the compute-equivalent hourly cloud rental costs).
For context, a single 5090 rented via Runpod is currently $0.69/hr USD on-demand. The cost on Amazon right now for a new card runs between $3200-3700 USD. Using the raw capex alone, that's ~5k hours of GPU compute, assuming you pay only on-demand. That's 2-3 years' worth of compute if you assume compute saturation for normal working-hour durations. This is before you account for the cost of power, which in my city could run you upwards of $140/mo, varying by season.
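The back-of-the-envelope version, using the numbers above and assuming roughly 2,000 working hours per year:

    card_cost = 3500            # USD, midpoint of the $3200-3700 range above
    cloud_rate = 0.69           # USD/hr for an on-demand 5090 rental
    hours_per_year = 8 * 250    # "normal working hours" of saturated use

    breakeven_hours = card_cost / cloud_rate              # ~5,100 hours
    breakeven_years = breakeven_hours / hours_per_year    # ~2.5 years, before power
    print(round(breakeven_hours), round(breakeven_years, 1))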
With that said, I have a bunch of ML servers that I built for myself. The largest one uses 2x RTX Pro 6000s and I have been very happy with it. If I was only doing inference I think this would be a somewhat questionable expense, setting aside the valid motivations that some folks have related to data privacy and security. But I do a lot of finetuning and maintain private/local eval harnesses that personally, for me, have made it worth the investment.
Comment by ACCount37 18 hours ago
Comment by htrp 18 hours ago
Comment by ACCount37 18 hours ago
Sure, there are things that don't work on small scale and then work on large scale. But they're rare, and they sure are going to be expensive to find and validate.
Comment by i5heu 19 hours ago
Comment by whimsicalism 12 hours ago
Comment by ipnon 18 hours ago
Comment by lynndotpy 18 hours ago
For four years of AI PhD research I worked with a 1050Ti on a personal laptop and a 2060 on a personal desktop. You can do a lot of validation and development on consumer GPUs.
That said, the OP does not train an LLM from scratch on a 3090. That would not be feasible.
Comment by joefourier 18 hours ago
Comment by lynndotpy 10 hours ago
I can't edit it now, but OP did not train a useful LLM from scratch. In editing for clarity and tone I think I edited that point away. Somebody searching for a reproducible way to produce a usable model on their own 3090 won't find it in this post. But someone looking to learn how to produce a usable model on their own 3090 will be educated by this post.
"Not a useful LLM" is not a knock on the OP! This is an _excellent_ educational and experiential post. It includes the experimentation with different models that you'll never see in a publication. ANd it showcases the exact limitations you'll have with one 3090. (You're limited in training speed and model size, and you're also limited in how many ideas you can have cooking at once).
The "experiment at home, train a model, and reproduce or fine-tune on someone elses better GPU" is tried and true.
(Again, I want to reiterate that I'm not knocking OP for not producing a "usable LLM" at the end of this post. That's not the point of the post, and it's a good post. My only point is that it's not currently feasible to train a useful general-purpose LLM on one 3090.)
Comment by deskamess 15 hours ago
Thanks!
Comment by sosodev 14 hours ago
Comment by pwython 15 hours ago
Comment by itissid 14 hours ago
Comment by miki123211 14 hours ago
Comment by whimsicalism 12 hours ago
Comment by Havoc 19 hours ago
Seems like there would be low-hanging fruit in heavier preprocessing then? Something deterministic like a reading-level score, or even a tiny model trained for the task to pick out good data?
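Even something as simple as a Flesch reading-ease filter would be deterministic and cheap; a sketch (the syllable counter is a crude vowel-group heuristic, fine for bulk filtering but not exact):

    import re

    def count_syllables(word: str) -> int:
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        if not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        return (206.835 - 1.015 * (len(words) / sentences)
                - 84.6 * (syllables / len(words)))

    def keep(doc: str, lo: float = 50.0, hi: float = 90.0) -> bool:
        return lo <= flesch_reading_ease(doc) <= hi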
Comment by qrios 15 hours ago
An example: more than ten years ago a friend of mine was fascinated by the German edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF with (nearly) the same layout.
Against my advice he used Adobe tools for it instead of creating an epub or something like DocBook.
The PDF looks great, but the text inside is impossible to use as training data for a small LLM. The lines from the two columns are mixed together and a lot of spaces are randomly placed (which makes it particularly difficult because mathematical formulas often appear in the text itself).
After many attempts (with regexes and LLMs), I gave up, rendered each page, and had a large LLM extract the text.
Comment by azath92 18 hours ago
I have less concrete examples but my understanding is that dataset curation is for sure the way many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset for sure. TinyStories was generated with GPT-4 for example.
Comment by gpjt 15 hours ago
Comment by embedding-shape 18 hours ago
Comment by haolez 19 hours ago
Comment by ACCount37 18 hours ago
It's not sexy, it's not a breakthrough, but it does help.
Comment by Havoc 15 hours ago
At the big labs that makes sense. I'm a bit more puzzled by why it isn't used in toy projects. Certainly more complexity, but it seems like it would make a big difference.
Comment by famouswaffles 10 hours ago
Comment by noloman 11 hours ago
Comment by spi 16 hours ago
Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...
The early discussion and worries about truncating strings look a bit weird. The author then realizes they're not even going to use 30% of the total available data anyway, so who cares if for each given string we're only using the first 1024 tokens? (And even when doing more epochs, he doesn't discuss the obvious way to avoid throwing away data, i.e. not always clipping the tail but starting from a random point each epoch, maybe after a punctuation mark or something.)
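The random-offset idea is only a couple of lines; a sketch:

    import random

    def sample_window(token_ids, seq_len=1024):
        # each epoch, take a window starting at a fresh random offset
        if len(token_ids) <= seq_len:
            return token_ids
        start = random.randint(0, len(token_ids) - seq_len)
        return token_ids[start:start + seq_len]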
At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction tuning, of course). That's because the model is anyway training for < 1 epoch, so no data is seen twice (*). One might as well just track the training loss; it's slightly less "clean" because it's evaluated each time on different data, but the sheer amount of it makes up for the issue. The final plot shows that the two curves are similar; train is noisier of course, but nothing a bit of rolling smoothing couldn't solve.
The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on-the-fly, but only if you have enough RAM to "waste". PyTorch loads data on separate processes in a non-blocking way, so it feels like having it on disk and loaded on-the-fly would be safer and not make any hit on runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of some internally supported dtypes like int or float; if they are any Python "object" types, they get replicated per dataloader worker, and OOM is guaranteed)
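A sketch of the on-the-fly alternative, with tokens stored as a flat uint16 file and memory-mapped inside the Dataset (file name and dtype are assumptions):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MemmapTokens(Dataset):
        def __init__(self, path="fineweb_tokens.bin", seq_len=1024):
            self.tokens = np.memmap(path, dtype=np.uint16, mode="r")
            self.seq_len = seq_len

        def __len__(self):
            return (len(self.tokens) - 1) // self.seq_len

        def __getitem__(self, i):
            chunk = self.tokens[i * self.seq_len : (i + 1) * self.seq_len + 1]
            x = torch.from_numpy(chunk[:-1].astype(np.int64))
            y = torch.from_numpy(chunk[1:].astype(np.int64))
            return x, y

The OS page cache handles the buffering, so dataloader workers share the pages instead of each holding a full copy of the array.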
The choice to concatenate everything into one long string is a bit outdated nowadays, because it trains with attention between different documents that have nothing to do with each other, which can introduce bias or at least suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
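The dense-mask version of the idea is simple to write down (FlashAttention's varlen kernels do the same thing without materializing the mask):

    import torch
    import torch.nn.functional as F

    def packed_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
        # doc_ids: (seq_len,) integer document id for each packed token
        n = len(doc_ids)
        same_doc = doc_ids[:, None] == doc_ids[None, :]
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=doc_ids.device))
        return same_doc & causal            # True = attention allowed

    # inside an attention layer, with q, k, v of shape (batch, heads, seq, head_dim):
    # out = F.scaled_dot_product_attention(q, k, v, attn_mask=packed_causal_mask(doc_ids))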
(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model would have very little risk of memorizing, even if some data were replicated.
Comment by BoxOfRain 16 hours ago
I've always felt the natural way of referring to smaller LLMs would be Medium Language Models and Small Language Models, but I guess MLM is an inauspicious acronym.
Comment by jszymborski 16 hours ago
MLM is masked language modelling, another phrase for training models on the cloze task. It's the most common way to train encoder-only models.
CLM (causal language modelling) is the other common task where you autoregressively predict the next token given the previous ones. It's the most common way to train decoder-only models.
Comment by lepicz 17 hours ago
Comment by noloman 11 hours ago
Comment by DeathArrow 19 hours ago
Comment by rvnx 19 hours ago
Nowadays training very powerful LLMs is easy because all the tooling, source code, training datasets, and teaching agents are available.
Getting access to tens of millions of USD or more is not easy, and for the big players this is just a drop in their ocean.
Comment by contrast 19 hours ago
Comment by rvnx 19 hours ago
It is nice that the author shared the results of his exercise/experiment. I just got sad when the 100 USD was mentioned, reminded that this whole game is 90%+ about money and hardware rather than skills.
That being said I really like the initiative of the author.
Comment by jbs789 17 hours ago
Thing is, very few people focus on their own skill development and apply it at even a small scale. Then you go for a job and, guess what, the company has resources you can leverage. Then you do that, and ultimately you could be in a position to have the credibility to raise your own capital.
Play the long game and do what you can do now.
Comment by meehai 19 hours ago
A more skilled person who understands all the underlying steps will always be more efficient at scaling up, because they know where to allocate more resources.
basically... you always need the skills and the money is the fine tuning.
Comment by DeathArrow 19 hours ago
Comment by victorbjorklund 18 hours ago
Comment by victorbjorklund 18 hours ago
Comment by Chabsff 17 hours ago
It's kind of amazing we got that at all for a while.
Comment by djmips 6 hours ago
Comment by YouAreWRONGtoo 19 hours ago
Comment by logicallee 17 hours ago
https://taonexus.com/mini-transformer-in-js.html
It's a very simple neural network with two attention heads that runs right in the browser in pure Javascript, you can view source on this implementation.
Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.
Comment by chiengineer 17 hours ago
Is anyone here actually using the $200-a-month subscriptions with ChatGPT, or Google's $150 per month?
Is it worth it for more code generation? Or should I spend my money on a couple of GPUs and go local?
Comment by esafak 17 hours ago
Comment by magicalhippo 14 hours ago
That said, Google's VSCode integration was terrible, kept logging me out and just didn't work well.
Comment by Taek 17 hours ago
Comment by pixigenie 14 hours ago
Comment by roschdal 16 hours ago