Apple Releases Open Weights Video Model
Posted by vessenes 8 days ago
Comments
Comment by devinprater 8 days ago
Comment by densh 8 days ago
Something one doesn't see in news headlines. Happy to see this comment.
Comment by kkylin 7 days ago
I taught our entry-level calculus course a few years ago and had two blind students in the class. The technology available for supporting them was abysmal then -- the toolchain for typesetting math for screen readers was unreliable (and anyway very slow), for braille was non-existent, and translating figures into braille involved sending material out to a vendor and waiting weeks. I would love to hear how we may better support our students in subjects like math, chemistry, physics, etc, that depend so much on visualization.
Comment by WillAdams 7 days ago
https://www.reddit.com/r/openscad/comments/1p6iv5y/christmas...
The creator, https://www.reddit.com/user/Mrblindguardian/ has asked for help a few times in the past (I provided feedback when I could), but hasn't needed to as often of late, presumably due to using one or more LLMs.
Comment by VogonPoetry 7 days ago
Comment by kkylin 7 days ago
Comment by VogonPoetry 7 days ago
He is still active and online and has a contact page; see https://www.foneware.net. I have been a poor correspondent with him - he will not know my HN username. I will try to reach out to him.
Comment by VogonPoetry 6 days ago
There was another device between the BBC Micro and the "Versa Braille" unit. The interposing unit was a matrix switch that could multiplex between different serial devices - I now suspect it might also have been doing some character escaping / translation.
For those not familiar with Braille, it uses a 2x3 array (6 bits) to encode everything. The "standard" (ahem, by country) Braille encodings are super-sub-optimal for pretty much any programming language or mathematics.
After refreshing my memory a bit: in "standard" Braille you only get ( and ) - and they both encode to the same 2x3 pattern! So in Braille ()() and (()) would "read" as the same thing.
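To make the ambiguity concrete, here is a toy illustration (assuming the older pre-UEB convention described above, where both parentheses share the dots 2-3-5-6 cell):

    # Toy illustration only: with "(" and ")" sharing one braille cell
    # (dots 2-3-5-6, U+2836), nesting is unreadable.
    PAREN_CELL = "\u2836"  # ⠶
    to_braille = {"(": PAREN_CELL, ")": PAREN_CELL}

    print("".join(to_braille[c] for c in "()()"))  # ⠶⠶⠶⠶
    print("".join(to_braille[c] for c in "(())"))  # ⠶⠶⠶⠶ -- identical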
I now understand why you were asking about the software used. I do not recall how we completely worked this out. We had to have added some sort of convention for scoping.
I now also remember that the Braille terminal aggressively compressed whitespace. My friend liked to use (physical) touch to build a picture, but it was not easy to send spatial / line-by-line information to the Braille terminal.
Being able to rely on spatial information has always stuck with me. It is for this reason I've always had a bias against Python, it is one of the few languages that depends on precise whitespace for statement syntax / scope.
Comment by kkylin 6 days ago
For anyone else interested: I wanted to be able to typeset mathematics (actual formulas) for the students that's as automated as possible. There are 1 or 2 commercial products that can typeset math in Braille (I can't remember the names but can look them up) but not priced for individual use. My university had a license to one of them but only for their own use (duh) and they did not have the staff to dedicate to my students (double duh).
My eventual solution was to compile LaTeX to HTML, which the students could use with a screen reader. But screen readers were not fully reliable, and very, very slow to use (compared to Braille), making homework and exams take much longer than they needed to. I also couldn't include figures this way. I looked around but did not find an easy open source solution for converting documents to Braille. It would be fantastic to be able to do this, formulas and figures included, but I would've been very happy with just the formulas. (This was single variable calculus; I shudder to think what teaching vector calc would have been like.)
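For anyone attempting something similar, a minimal sketch of that kind of pipeline (assuming pandoc is installed; it can emit MathML, which some screen readers handle much better than rendered images of formulas; the file names are just placeholders):

    # Minimal sketch (assumes pandoc is on the PATH): convert a LaTeX handout
    # to standalone HTML with MathML so a screen reader can speak the formulas.
    import subprocess

    subprocess.run(
        ["pandoc", "homework1.tex", "--standalone", "--mathml",
         "-o", "homework1.html"],
        check=True,
    )

Figures still need separate handling (alt text at minimum), as noted above.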
FYI Our external vendor was able to convert figures to printed Braille, but I imagine that's a labor intensive process.
Partway through the term we found funding for dedicated "learning assistants" (undergraduate students who came to class and helped explain what's going on, and also met with the students outside of class). This, as much as or more than any tech, was probably the single most impactful thing.
Comment by tippa123 8 days ago
Comment by swores 7 days ago
Comment by lukecarr 7 days ago
He did a great skit with Lee Mack at the BAFTAs 2022[0], riffing on the autocue the speakers use for announcing awards.
Comment by latexr 7 days ago
Comment by swores 7 days ago
I'm not a fan of his (nothing against him, just not my cup of tea when it comes to comedy, and I've mostly not been interested in other stuff he's done), but the few times I have seen him as a guest on shows it's been clear that he's a generally clever person.
Comment by asplake 7 days ago
Comment by joedevon 8 days ago
Comment by tippa123 7 days ago
Comment by moss_dog 7 days ago
Comment by chrisweekly 7 days ago
Comment by devinprater 7 days ago
Comment by Rover222 7 days ago
Comment by K0balt 6 days ago
Comment by WarcrimeActual 7 days ago
Comment by badmonster 8 days ago
Comment by devinprater 7 days ago
Comment by kulahan 7 days ago
Comment by devinprater 7 days ago
Comment by fguerraz 8 days ago
I hope this wasn't a terrible pun
Comment by densh 7 days ago
Comment by 47282847 7 days ago
Comment by devinprater 7 days ago
Comment by GeekyBear 7 days ago
Comment by SatvikBeri 7 days ago
Comment by Damogran6 7 days ago
A call home let us know that our son had set it off learning to reverse-sear his steak.
Comment by embedding-shape 7 days ago
Comment by evilduck 7 days ago
The same arguments were made about blind people and the multitude of one-off devices that smartphones replaced: OCR to TTS, color detection, object detection in photos/camera feeds, detecting what denomination US bills are, analyzing what's on screen semantically vs what was provided as accessible text (if any was at all), etc. Sure, services for the blind would come by and help arrange outfits for people, and audiobook narrators or braille translator services existed, and standalone devices to detect money denominations were sold, but a phone can just do all of that now for much cheaper.
All of these accessibility AI/ML features run on-device, so the knee-jerk anti-AI crowd's chief complaints are mostly baseless anyways. And for the blind and the deaf, carrying all the potential extra devices with you everywhere is burdensome. The smartphone is a minimal and common social and physical burden.
Comment by Aurornis 7 days ago
I've worked on some audio/video alert systems. Basic threshold detectors produce a lot of false positives. It's common for parents to put white noise machines in the room to help the baby sleep. When you have a noise generating machine in the same room, you need more sophisticated detection.
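To see why, here's a toy sketch (my own illustration, not from any real monitor): a plain RMS-level threshold fires on any sustained loud sound, so a white noise machine near the microphone looks just like a crying baby.

    # Toy illustration (not from any real product): naive RMS thresholding
    # cannot tell steady white noise from crying at a similar level.
    import numpy as np

    THRESHOLD = 0.1  # arbitrary level in normalized amplitude

    def rms(frame: np.ndarray) -> float:
        return float(np.sqrt(np.mean(frame ** 2)))

    def naive_alert(frame: np.ndarray) -> bool:
        return rms(frame) > THRESHOLD

    white_noise = np.random.uniform(-0.3, 0.3, 16000)  # steady machine hum
    print(naive_alert(white_noise))  # True -- a false positive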
False positives are the fastest way to frustrate users.
Comment by jfindper 7 days ago
Need? Probably not. I bet it helps though (false positives, etc.)
>would be cheaper, faster detection, more reliable, easier to maintain, and more.
Cheaper than the phone I already own? Easier to maintain than the phone that I don't need to do maintenance on?
From a fun hacking perspective, a different sensor & device is cool. But I don't think it's any of the things you mentioned for the majority of people.
Comment by doug_durham 7 days ago
Comment by whatsupdog 7 days ago
I know this is a low quality comment, but I'm genuinely happy for you.
Comment by phyzix5761 8 days ago
Comment by gostsamo 8 days ago
Comment by dzhiurgis 8 days ago
Comment by gostsamo 8 days ago
Comment by michaelbuckbee 7 days ago
Comment by isoprophlex 7 days ago
Comment by gostsamo 7 days ago
Comment by alwillis 7 days ago
[1]: https://web.archive.org/web/20130922065731/http://www.last-c...
Comment by nextaccountic 7 days ago
Comment by gostsamo 7 days ago
Additionally, recently I've been a participant in accessibility studies where charts, diagrams and the like have been structured to be easier to explore with a screen reader. Those needed JS to work and some of them looked custom, but they are also an alternative way to layer data.
Comment by travisjungroth 7 days ago
Comment by hrimfaxi 7 days ago
Comment by wlesieutre 7 days ago
Just saying "It's a chart" doesn't feel like it'd be useful to someone who can't see the chart. But if the other text on the page talks about the chart, then maybe identifying it as the chart is enough?
Comment by gostsamo 7 days ago
Comment by embedding-shape 7 days ago
Comment by freedomben 7 days ago
Would love to hear a good example of alt text for something like that where the data isn't necessarily clear and I also don't want to do any interpreting of the data lest I influence the person's opinion.
Comment by embedding-shape 7 days ago
Yeah, I think I misunderstood the context. I understood/assumed it to be for an article/post you're writing, where you have something you want to say in general/some point of what you're writing. But based on what you wrote now, it seems to be more about how to caption an image you're sending to a blind person in a conversation/discussion of some sort.
I guess at that point it'd be easier for them if you just share the data itself, rather than anything generated by the data, especially if there is nothing you want to point out.
Comment by gostsamo 7 days ago
Comment by alwillis 7 days ago
Comment by travisjungroth 7 days ago
Comment by asadotzler 7 days ago
Comment by gostsamo 7 days ago
Comment by travisjungroth 7 days ago
“Why is this here? What am I trying to say?” are super important things in design and also so easy to lose track of.
Comment by alwillis 7 days ago
Are you making these five mistakes when writing alt text? [1] Images tutorial [2] Alternative Text [3]
[1]: https://www.a11yproject.com/posts/are-you-making-these-five-...
Comment by shagie 7 days ago
For example... https://chatgpt.com/share/692f1578-2bcc-8011-ac8f-a57f2ab6a7...
Comment by alwillis 7 days ago
There's a great app by an indie developer that uses ML to identify objects in images. Totally scriptable via JavaScript, shell script and AppleScript. macOS only.
Could be 10, 100 or 1,000 images [1].
Comment by askew 7 days ago
Comment by embedding-shape 7 days ago
The number of times I've seen captions that wouldn't make sense for people who have never been able to see is staggering. I don't think most people realize how visual our typical language usage is.
Comment by darkwater 8 days ago
Comment by tippa123 8 days ago
Comment by darkwater 8 days ago
Comment by nkmnz 7 days ago
Comment by foobarian 7 days ago
Comment by tippa123 7 days ago
A question directed to GP, directly asking about their life and pointing this out is somehow virtue signalling, OK.
Comment by throwup238 7 days ago
Comment by nkmnz 5 days ago
Comment by SV_BubbleTime 7 days ago
Maybe you’re just being defensive? I’m sure he didn’t mean an attack at you personally.
Comment by throwup238 7 days ago
Accusing someone of “virtue signaling” is itself virtue signaling, just for a different in-group to use as a thought terminating cliche. It has been for decades. “Performative bullshit” is a great way to put it, just not in the way you intended.
If the OP had a substantive point to make they would have made it instead of using vague ad hominem that’s so 2008 it could be the opening track on a Best of Glenn Beck album (that’s roughly when I remember “virtue signaling” becoming a cliche).
Comment by MangoToupe 7 days ago
Comment by nkmnz 7 days ago
Comment by fragmede 7 days ago
Comment by efs24 7 days ago
Or should I too perhaps wait for OP to respond.
Comment by SV_BubbleTime 7 days ago
Comment by meindnoch 7 days ago
Comment by Moomoomoo309 7 days ago
The lens through which you're analyzing the phrase is coloring how you see it negatively, and the one I'm using is doing the opposite. There is no need to change the phrase, just how it's viewed, I think.
Comment by kachapopopow 7 days ago
And when I say 'it never crosses our minds' I really mean it: there are zero thoughts between thinking about a message and having it show up in a text box.
A really great example is slurs: a lot of people have to do a double take on them, but there are zero extra neurons fired when I read them. I guess early internet culture is to blame, since all kinds of language was completely uncensored and it was very common to run into very hostile people/content.
Comment by georgebcrawford 7 days ago
No. It’s acknowledging that perhaps one’s opinion may not be as useful as somebody else’s in that moment. Which is often true!
Your first and third paragraphs are true, but they don’t apply to every bloody phrase.
Comment by baq 8 days ago
Comment by devinprater 7 days ago
Video descriptions, through PiccyBot, have made watching more visual videos, or videos where things happen that don't make sense without visuals, much easier. Of course, it'd be much better if YouTube incorporated audio description through AI the same way they do captions, but that may happen in a good 2 years or so. I'm not holding my breath. It's hard to get more than the bare minimum of accessibility out of Google as a whole.
Looking up information like restaurant menus. Yes it can make things up, but worst-case, the waiter says they don't have that.
Comment by javcasas 7 days ago
Comment by p1esk 7 days ago
Comment by majkinetor 7 days ago
Comment by talesfromearth 7 days ago
Comment by majkinetor 7 days ago
Comment by Workaccount2 7 days ago
AI has been a boon for me and my non-tech job. I can pump out bespoke apps all day without having to get bent on $5000/yr/usr engineering software packages. I have a website for my side business that looks and functions professionally and was done with a $20 monthly AI subscription instead of a $2000 contractor.
Comment by BeFlatXIII 7 days ago
Comment by MyFirstSass 7 days ago
I use AI daily as a senior coder for search and docs, and when used for prototyping you still need to be a senior coder to go from say 60% boilerplate to 100% finished app/site/whatever unless it's incredibly simple.
Comment by alwillis 7 days ago
I know you would like to believe that, but with the tools available NOW, that's not necessarily the case. For example, by using the Playwright or Chrome DevTools MCPs, models can see the web app as it's being created, and it's pretty easy to prompt them to fix something they can see.
These models know the current frameworks and coding practices but they do need some guidance; they're not mindreaders.
Comment by MyFirstSass 7 days ago
Again, it's the last 5% that takes 95% of the time, and that 5% I haven't seen fixed with Claude or Gemini, because it's essentially quirks, browser errors, race conditions, visual alignment, etc. etc. All stuff that completely goes way above any LLM's head atm, from what I've seen.
They can definitely bullshit a 95% working app though, but that's 95% from being done ;)
Comment by Workaccount2 7 days ago
Nothing I do is in the tech industry. It's all manufacturing and all the software is for in-house processes.
Believe it or not, software is useful to everyone and no longer needs to originate from someone who only knows software.
Comment by MyFirstSass 7 days ago
You didn't give any examples of the valuable bespoke apps that you are creating by the hour.
I simply don't believe you, and the arrogant salesy tone doesn't help.
Comment by Workaccount2 7 days ago
If your needs fit in a program that size, you are pretty much good to go.
It will not rewrite PCB_CAD 2025, but it will happily create a PCB hole alignment and conversion app, eliminating the need for the full PCB_CAD software if all you need is that one toolset from it.
Very, very, few pieces of software need to be full package enterprise productivity suites. If you just make photos black and white and resize them, you don't need Photoshop to do it. Or even ms paint. Any LLM will make a simple free program with no ads to do it. Average people generally do very simple dumb stuff with the expensive software they buy.
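For a sense of scale, the kind of throwaway utility being described is only a few lines. A hypothetical sketch using Pillow, not any particular product:

    # Hypothetical sketch of such a throwaway tool (uses Pillow): convert
    # every JPEG in ./photos to black and white at half size.
    from pathlib import Path
    from PIL import Image

    out = Path("out")
    out.mkdir(exist_ok=True)

    for path in Path("photos").glob("*.jpg"):
        img = Image.open(path).convert("L")                  # black and white
        img = img.resize((img.width // 2, img.height // 2))  # resize
        img.save(out / path.name)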
Comment by vjvjvjvjghv 7 days ago
As far as enshittification goes, this was happening long before AI. It probably started with SEO and just kept going from there.
Comment by almosthere 7 days ago
Yet we fail to see AI as a good thing but just as a jobs destroyer. Are we "better than" the people that used to fill toothpaste tubes manually until a machine was invented to replace them? They were just as mad when they got the pink slip.
Comment by vjvjvjvjghv 7 days ago
Comment by mycall 6 days ago
Any tips you can give?
Comment by robbomacrae 7 days ago
Comment by andy_ppp 7 days ago
Comment by asadotzler 7 days ago
Comment by astrange 7 days ago
Comment by xnx 7 days ago
Comment by shagie 7 days ago
https://www.microsoft.com/en-us/garage/wall-of-fame/seeing-a...
... and that was 10 years ago. I'm curious for what it could do now.
Comment by basilgohar 7 days ago
Comment by kruxigt 8 days ago
Comment by RobotToaster 8 days ago
[0]https://github.com/apple/ml-starflow/blob/main/LICENSE_MODEL
Comment by limagnolia 7 days ago
As for the license, happily, Model Weights are the product of machine output and not creative works, so not copyrightable under US law. Might depend on where you are from, but I would have no problem using Model Weights however I want to and ignoring pointless licenses.
Comment by yegle 8 days ago
Did I miss anything?
Comment by M4v3R 8 days ago
Comment by tomthe 8 days ago
Comment by Mashimo 8 days ago
Comment by dragonwriter 8 days ago
Sure, it's smallish.
> Are other open weight video models also this small?
Apple's models are weights-available, not open weights. And yes: WAN 2.1 has 1.3B models as well as the 14B ones, and WAN 2.2 has a 5B model as well as the 14B ones (the WAN 2.2 VAE used by Starflow-V is specifically the one used with the 5B model). And because the WAN models are largely actually open weights models (Apache 2.0 licensed), there are lots of downstream open-licensed derivatives.
> Can this run on a single consumer card?
Modern model runtimes like ComfyUI can run models that do not fit in VRAM on a single consumer card by swapping model layers between RAM and VRAM as needed; models bigger than this can run on single consumer cards.
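The general idea, as a toy sketch (not ComfyUI's actual code; requires a CUDA device): keep the weights in system RAM and move each layer into VRAM only while it runs.

    # Toy sketch of RAM<->VRAM swapping (not ComfyUI's implementation).
    import torch
    import torch.nn as nn

    layers = [nn.Linear(4096, 4096) for _ in range(8)]  # held in CPU RAM

    def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
        x = x.to("cuda")
        for layer in layers:
            layer.to("cuda")   # swap this layer into VRAM
            x = layer(x)
            layer.to("cpu")    # swap it back out to free VRAM for the next one
        return x

    print(offloaded_forward(torch.rand(1, 4096)).shape)  # torch.Size([1, 4096])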
Comment by Maxious 8 days ago
Comment by jjfoooo4 7 days ago
Comment by tdesilva 7 days ago
Comment by jfoster 7 days ago
As far as I know, this might be the most advanced text-to-video model that has been released? I'm not sure whether the license will qualify as open enough in everyone's eyes, though.
Comment by manmal 7 days ago
Comment by gorgoiler 7 days ago
This Apple license is click-wrap MIT with, at least, the rights to modify and redistribute the model itself. I suppose I should be grateful for that much openness.
Comment by advisedwang 7 days ago
To extend the analogy, "closed source machine code" would be like conventional SaaS. There's an argument that shipping me a binary I can freely use is at least better than only providing SaaS.
Comment by satvikpendem 7 days ago
Better to execute locally than to execute remotely where you can't change or modify any part of the model though. Open weights at least mean you can retrain or distill it, which is not analogous to a compiled executable that you can't (generally) modify.
Comment by limagnolia 7 days ago
Comment by Aloisius 7 days ago
Of course, model weights almost certainly are not copyrightable so the license isn't enforceable anyway, at least in the US.
The EU and the UK are a different matter since they have sui generis database rights which seemingly allows individuals to own /dev/random.
Comment by pabs3 7 days ago
Comment by limagnolia 6 days ago
One might argue that model weights are derivative of the training material, with the copyright held by the copyright holder of the training material. The counter argument would be that the weights are significantly transformational.
Comment by vessenes 7 days ago
For a 7b model the results look pretty good! If Apple gets a model out here that is competitive with wan or even veo I believe in my heart it will have been trained with images of the finest taste.
Comment by LoganDark 7 days ago
> The checkpoint files are not included in this repository due to size constraints.
So it's not actually open weights yet. Maybe eventually once they actually release the weights it will be. "Soon"
Comment by summerlight 7 days ago
JG's recent departure and the follow-up massive reorg to get rid of AI, rumors of Tim's upcoming step-down in early 2026... All of these signals indicate that the non-ML folks have won the corporate politics and are reducing the in-house AI efforts.
I suppose this was part of a serious effort to deliver in-house models, but the directional changes on AI strategy made them give up. What a shame... At least the approach itself seems interesting; I hope others take a look and use it for building something useful.
Comment by coolspot 8 days ago
They don’t say for how long.
Comment by moondev 7 days ago
Do the examples in the repo run inference on Mac?
Comment by dymk 7 days ago
Comment by satvikpendem 8 days ago
Comment by ivape 7 days ago
They should really buy Snapchat.
Comment by ozim 7 days ago
Comment by nothrowaways 8 days ago
Comment by postalcoder 8 days ago
> Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.
Comment by justinclift 8 days ago
Wonder if "iCloud backups" would be counted as "stock video" there? ;)
Comment by anon7000 7 days ago
Comment by whywhywhywhy 7 days ago
Comment by astrange 7 days ago
Comment by fragmede 8 days ago
Comment by givinguflac 7 days ago
Comment by gaigalas 7 days ago
They shared audio Siri recordings with contractors in 2019. It became opt-in only after backlash, similar to other privacy controversies.
This shows that they clearly prioritize not being sued or caught, which is slightly different from prioritizing user choices.
Comment by cubefox 7 days ago
Comment by giancarlostoro 7 days ago
Comment by andersa 7 days ago
Comment by embedding-shape 7 days ago
Comment by dragonwriter 7 days ago
But also, Starflow-V is a research model with a substandard text encoder; it doesn't have to be competitive as-is to be an interesting spur for further research on the new architecture it presents. (Though it would be nice if it had some aspect where it offered a clear improvement.)
Comment by wolttam 7 days ago
Comment by camillomiller 8 days ago
Comment by Invictus0 7 days ago
Comment by Jtsummers 7 days ago
Comment by Invictus0 6 days ago
Comment by Barry-Perkins 7 days ago
Comment by ai_updates 8 days ago
Comment by MallocVoidstar 8 days ago
Comment by pulse7 8 days ago
Comment by mdrzn 8 days ago
Comment by kouteiheika 8 days ago
A little bit more background for those who don't know what a VAE is (I'm simplifying here, so bear with me): it's essentially a model which turns raw RGB images into something called a "latent space". You can think of it as a fancy "color" space, but on steroids.
There are two main reasons for this: one is to make the model which does the actual useful work more computationally efficient. VAEs usually downscale the spatial dimensions of the images they ingest, so instead of having to process a 1024x1024 image, your model now only needs to work on a 256x256 one. (They often increase the number of channels to compensate, but I digress.)
The other reason is that, unlike raw RGB space, the latent space is actually a higher level representation of the image.
Training a VAE isn't the most interesting part of image models, and while it is tricky, it's done entirely in an unsupervised manner. You give the VAE an RGB image, have it convert it to latent space, then have it convert it back to RGB; you take a diff between the input RGB image and the output RGB image, and that's the signal you use when training them (in reality it's a little more complex, but, again, I'm simplifying here to make the explanation more clear). So it makes sense to reuse them, and concentrate on the actually interesting parts of an image generation model.
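To make that training loop concrete, here's a toy autoencoder-style sketch (my own illustration, not the actual WAN or Starflow-V VAE; real VAEs add a KL/regularization term and perceptual losses on top of the reconstruction diff):

    # Toy sketch: encode RGB into a smaller latent, decode back, and train on
    # the reconstruction difference. Not a real production VAE.
    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            # 3x1024x1024 -> 16x256x256: 4x spatial downscale, more channels
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=4, stride=4), nn.SiLU())
            self.decoder = nn.ConvTranspose2d(16, 3, kernel_size=4, stride=4)

        def forward(self, x):
            latent = self.encoder(x)     # the "fancy color space on steroids"
            return self.decoder(latent)  # back to RGB

    model = TinyAutoencoder()
    img = torch.rand(1, 3, 1024, 1024)          # stand-in for a real RGB image
    recon = model(img)
    loss = nn.functional.mse_loss(recon, img)   # the diff used as training signal
    loss.backward()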
Comment by dragonwriter 8 days ago
No, using the WAN 2.2 VAE does not mean it is a WAN 2.2 edit.
> compressed to 7B.
No, if it was an edit of the WAN model that uses the 2.2 VAE, it would be expanded to 7B, not compressed (the 14B models of WAN 2.2 use the WAN 2.1 VAE, the WAN 2.2 VAE is used by the 5B WAN 2.2 model.)
Comment by BoredPositron 8 days ago