Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)
Posted by schopra909 1 day ago
Writeup (includes good/bad sample generations): https://www.linum.ai/field-notes/launch-linum-v2
We're Sahil and Manu, two brothers who spent the last 2 years training text-to-video models from scratch. Today we're releasing them under Apache 2.0.
These are 2B param models capable of generating 2-5 seconds of footage at either 360p or 720p. In terms of model size, the closest comparison is Alibaba's Wan 2.1 1.3B. From our testing, we get significantly better motion and aesthetics.
We're not claiming to have reached the frontier. For us, this is a stepping stone towards SOTA - proof we can train these models end-to-end ourselves.
Why train a model from scratch?
We shipped our first model in January 2024 (pre-Sora) as a 180p, 1-second GIF bot, bootstrapped off Stable Diffusion XL. Image VAEs don't understand temporal coherence, and without the original training data, you can't smoothly transition between image and video distributions. At some point you're better off starting over.
For v2, we use T5 for text encoding, Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching. We built our own temporal VAE but Wan's was smaller with equivalent performance, so we used it to save on embedding costs. (We'll open-source our VAE shortly.)
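For readers who haven't seen flow matching before, here is a minimal sketch (not Linum's actual code) of what one training step of a latent-video DiT with this recipe might look like. The `dit`, `vae`, and `t5` objects are hypothetical placeholders, and the linear-interpolation / velocity-target form shown is the common rectified-flow variant of flow matching:

    import torch
    import torch.nn.functional as F

    def flow_matching_step(dit, vae, t5, video, captions):
        # Encode pixels and text with frozen components (placeholder modules).
        with torch.no_grad():
            x1 = vae.encode(video)        # clean video latents, e.g. (B, C, T, H, W)
            text_emb = t5(captions)       # text conditioning

        x0 = torch.randn_like(x1)         # pure noise
        t = torch.rand(x1.shape[0], device=x1.device)   # per-sample time in [0, 1]
        t_ = t.view(-1, 1, 1, 1, 1)

        # Linear path from noise to data; the model predicts the path's velocity.
        xt = (1 - t_) * x0 + t_ * x1
        velocity_target = x1 - x0

        pred = dit(xt, t, text_emb)
        return F.mse_loss(pred, velocity_target)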
The bulk of development time went into building curation pipelines that actually work (e.g., hand-labeling aesthetic properties and fine-tuning VLMs to filter at scale).
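The curation code isn't shared, but the pipeline described (hand-label a small set, fine-tune a VLM on those labels, then filter at scale) boils down to a scoring loop like the sketch below; `score_clip` and the threshold are hypothetical stand-ins:

    # Hypothetical sketch of filtering at scale with a fine-tuned VLM scorer.
    AESTHETIC_THRESHOLD = 0.7   # arbitrary cutoff, calibrated against the hand labels

    def filter_clips(clips, score_clip):
        """Keep only clips whose VLM-predicted aesthetic score clears the bar."""
        return [clip for clip in clips if score_clip(clip) >= AESTHETIC_THRESHOLD]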
What works: Cartoon/animated styles, food and nature scenes, simple character motion. What doesn't: Complex physics, fast motion (e.g., gymnastics, dancing), consistent text.
Why build this when Veo/Sora exist? Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support (character consistency, camera controls, editing, style mapping, etc.), you're stuck. To build the product we want, we need to update the model itself. That means owning the development process. It's a bet that will take time (and a lot of GPU compute) to pay off, but we think it's the right one.
What’s next?
- Post-training for physics/deformations
- Distillation for speed
- Audio capabilities
- Model scaling
We kept a “lab notebook” of all our experiments in Notion. Happy to answer questions about building a model from 0 → 1. Comments and feedback welcome!
Comments
Comment by tariqshams 20 hours ago
Also, I’m super curious how you’re planning to get more realistic physics with post-training.
Comment by convivialdingo 1 day ago
Awesome to see more small teams making impressive leaps.
Comment by schopra909 1 day ago
We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.
https://www.linum.ai/field-notes
We want to share our learnings with folks who are curious about the space - but don’t have time to make it a full class experience.
Hopefully karpathy does that with his courses in the future!
Comment by mandeepj 19 hours ago
Sorry, it might sound like a cliché, but try that as a prompt to a deep-thinking model and see what comes out.
An expensive option: Look at Project #5 at https://bytebyteai.com/
Comment by whywhywhywhy 1 day ago
Couldn't find a link to this, is this public?
Comment by schopra909 1 day ago
If you’re interested in this stuff, keep an eye on field notes (our blog).
Comment by schopra909 1 day ago
In the meantime, here are the individual links to the models:
https://huggingface.co/Linum-AI/linum-v2-720p
https://huggingface.co/Linum-AI/linum-v2-360p
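If you just want the weights locally, the standard Hugging Face tooling should be enough (the repo IDs are from the links above; the actual inference code lives in the GitHub repo, so this only fetches the files):

    from huggingface_hub import snapshot_download

    # Grab the 720p checkpoint; swap in "Linum-AI/linum-v2-360p" for the smaller variant.
    local_dir = snapshot_download("Linum-AI/linum-v2-720p")
    print(f"Downloaded to {local_dir}")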
Comment by schopra909 1 day ago
https://github.com/Linum-AI/linum-v2/blob/298b1bb9186b5b9ff6...
1) Free up the T5 as soon as the text is encoded, so you reclaim GPU RAM
2) Manual layer offloading: move layers off the GPU once they're done being used, to free up space for the remaining layers + activations
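Neither trick is specific to this repo; in plain PyTorch they look roughly like the sketch below (hypothetical module names and call signatures, not the repo's actual code):

    import torch

    def encode_text_then_offload(t5_encoder, prompts):
        """1) Encode the prompt, then push the T5 off the GPU to reclaim its VRAM."""
        with torch.no_grad():
            text_emb = t5_encoder(prompts)
        t5_encoder.to("cpu")              # or delete it entirely if it won't be reused
        torch.cuda.empty_cache()
        return text_emb

    def forward_with_offload(blocks, hidden, timestep, text_emb):
        """2) Run DiT blocks one at a time, keeping only the active block on the GPU."""
        for block in blocks:
            block.to("cuda")              # move the block in just before it's needed
            hidden = block(hidden, timestep, text_emb)
            block.to("cpu")               # ...and back out once it's done
            torch.cuda.empty_cache()
        return hidden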
Comment by schopra909 1 day ago
We can update the code over the next day or two to provide the option to delete the T5 after the text encoding is computed (to save on RAM). And then report back the GB consumed for 360p and 720p at 2-5 seconds on GitHub so there are more accurate numbers.
Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is 2B parameters).
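If anyone wants to reproduce those numbers before the README update lands, PyTorch's built-in peak-memory counters are enough; `generate` here is a hypothetical stand-in for the repo's inference entry point:

    import torch

    def report_peak_vram(generate, **kwargs):
        """Run one generation and print the peak GPU memory it needed."""
        torch.cuda.reset_peak_memory_stats()
        generate(**kwargs)                          # hypothetical inference call
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        print(f"Peak VRAM: {peak_gb:.1f} GB")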
Comment by storystarling 1 day ago
Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.
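For reference, loading a T5 encoder in 8-bit via Transformers + bitsandbytes looks roughly like this; whether it drops cleanly into the linum-v2 pipeline is untested, and the checkpoint name is a guess (the stock T5 v1.1 XXL encoder), not necessarily the one the model was trained against:

    from transformers import AutoTokenizer, BitsAndBytesConfig, T5EncoderModel

    # 8-bit weights roughly halve the encoder's footprint vs fp16.
    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl",
        quantization_config=quant_config,
        device_map="auto",
    )

    tokens = tokenizer(["a corgi surfing at sunset"], return_tensors="pt").to(encoder.device)
    text_emb = encoder(**tokens).last_hidden_state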
Comment by schopra909 1 day ago
The 2B parameters will take up ~4 GB of memory (at 16-bit precision), but activations will be a lot more given the size of the context window for video.
A 720p, 5-second video is roughly 100K tokens of context.
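Back-of-envelope version of that, assuming 16-bit weights and a hidden size guessed purely for illustration (not a published spec):

    # Rough arithmetic behind the numbers above -- assumptions, not measurements.
    params = 2e9
    weight_gb = params * 2 / 1e9                 # 2 bytes/param in bf16 -> ~4 GB
    print(f"Weights: ~{weight_gb:.0f} GB")

    seq_len, hidden = 100_000, 2048              # hidden size is a guess for a 2B DiT
    act_gb = seq_len * hidden * 2 / 1e9          # one bf16 activation tensor
    print(f"One activation tensor: ~{act_gb:.1f} GB, and many are held per forward pass")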
Comment by schopra909 1 day ago
When we started down this path, T5 was the standard (back in 2024).
Likely won’t be the text encoder for subsequent models, given its size (per your point) and age