Apple Releases Open Weights Video Model

Posted by vessenes 8 days ago

Comments

Comment by devinprater 8 days ago

Apple has a video understanding model too. I can't wait to find out what accessibility stuff they'll do with the models. As a blind person, AI has changed my life.

Comment by densh 8 days ago

> As a blind person, AI has changed my life.

Something one doesn't see in news headlines. Happy to see this comment.

Comment by kkylin 7 days ago

Like many others, I too would very much like to hear about this.

I taught our entry-level calculus course a few years ago and had two blind students in the class. The technology available for supporting them was abysmal then -- the toolchain for typesetting math for screen readers was unreliable (and anyway very slow), for braille was non-existent, and translating figures into braille involved sending material out to a vendor and waiting weeks. I would love to hear how we may better support our students in subjects like math, chemistry, physics, etc, that depend so much on visualization.

Comment by WillAdams 7 days ago

For a physical view on this see:

https://www.reddit.com/r/openscad/comments/1p6iv5y/christmas...

The creator, https://www.reddit.com/user/Mrblindguardian/ has asked for help a few times in the past (I provided feedback when I could), but hasn't needed to as often of late, presumably due to using one or more LLMs.

Comment by VogonPoetry 7 days ago

I did a maths undergrad degree and the way my blind, mostly deaf friend and I communicated was using a stylized version of TeX markup. I typed on a terminal and he read / wrote on his braille terminal. It worked really well.

Comment by kkylin 7 days ago

Thanks! Did you communicate in "raw" TeX, or was it compiled / encoded for braille? Can you point me at the software you used?

Comment by VogonPoetry 7 days ago

Yes, mostly raw TeX, just plain ascii - not specially coded for Braille. This was quite a long time ago, mid 1980's, so not long after TeX had started to spread in computer science and maths communities. My friend was using a "Versa Braille" terminal hooked via a serial port to a BBC Micro running a terminal program that I'd written. I cannot completely remember how we came to an understanding of the syntax to use. We did shorten some items because the Versa Braille only had 20 chars per "line".

He is still active and online and has a contact page see https://www.foneware.net. I have been a poor correspondent with him - he will not know my HN username. I will try to reach out to him.

Comment by VogonPoetry 6 days ago

Now that I've been recalling more memories of this, I do remember there being encoding or "escaped" character issues - particularly with brackets and parentheses.

There was another device between the BBC Micro and the "Versa Braille" unit. The interposing unit was a matrix switch that could multiplex between different serial devices - I now suspect it might also have been doing some character escaping / translation.

For those not familiar with Braille, it uses a 2x3 array (6 bits) to encode everything. The "standard" (ahem, by country) Braille encodings are super-sub-optimal for pretty much any programming language or mathematics.

After a bit of (me)memory refresh, in "standard" Braille you only get ( and ) - and they both encode to the same 2x3 pattern! So in Braille ()() and (()) would "read" as the same thing.

I now understand why you were asking about the software used. I do not recall how we completely worked this out. We had to have added some sort of convention for scoping.

I now also remember that the Braille terminal aggressively compressed whitespace. My friend liked to use (physical) touch to build a picture, but it was not easy to send spatial / line-by-line information to the Braille terminal.

Being able to rely on spatial information has always stuck with me. It is for this reason I've always had a bias against Python, it is one of the few languages that depends on precise whitespace for statement syntax / scope.

Comment by kkylin 6 days ago

Thank you so much for all this detail. This is very interesting & quite helpful, and it's great you were able to communicate all this with your friend.

For anyone else interested: I wanted to be able to typeset mathematics (actual formulas) for the students that's as automated as possible. There are 1 or 2 commercial products that can typeset math in Braille (I can't remember the names but can look them up) but not priced for individual use. My university had a license to one of them but only for their own use (duh) and they did not have the staff to dedicate to my students (double duh).

My eventual solution was to compile latex to html, which the students could use with a screen reader. But screen readers were not fully reliable, and very, very slow to use (compared to Braille), making homework and exams take much longer than they need to. I also couldn't include figures this way. I looked around but did not find an easy open source solution for converting documents to Braille. It would be fantastic to be able to do this, formulas and figures included, but I would've been very happy with just the formulas. (This was single variable calculus; I shudder to think what teaching vector calc would have been like.)

FYI Our external vendor was able to convert figures to printed Braille, but I imagine that's a labor intensive process.

Partway through the term we found funding for dedicated "learning assistants" (an undergraduate student who came to class and helped explain what's going on, and also met with the students outside of class). This, as much or more than any tech, was probably the single most imapctful thing.

Comment by tippa123 8 days ago

+1 and I would be curious to read and learn more about it.

Comment by swores 7 days ago

A blind comedian / TV personality in the UK has just done a TV show on this subject - I haven't seen it, but here's a recent article about it: https://www.theguardian.com/tv-and-radio/2025/nov/23/chris-m...

Comment by lukecarr 7 days ago

Chris McCausland is great. A fair bit of his material _does_ reference his visual impairment, but it's genuinely witty and sharp, and it never feels like he's leaning on it for laughs/relying on sympathy.

He did a great skit with Lee Mack at the BAFTAs 2022[0], riffing on the autocue the speakers use for announcing awards.

[0]: https://www.youtube.com/watch?v=CLhy0Zq95HU

Comment by latexr 7 days ago

Hilariously, he beat the other teams in the “Say What You See” round (yes, really) of last year’s Big fat Quiz. No AI involved.

https://youtu.be/i5NvNXz2TSE?t=4732

Comment by swores 7 days ago

Haha that's great!

I'm not a fan of his (nothing against him, just not my cup of tea when it comes to comedy and mostly not been interested in other stuff he's done), but the few times I have seen him as a guest on shows it's been clear that he's a generally clever person.

Comment by asplake 7 days ago

I remembered he was once a techie, and Wikipedia confirms that he (Chris McCausland) has a BSc Honours in Software Engineering.

https://en.wikipedia.org/wiki/Chris_McCausland

Comment by joedevon 8 days ago

If you want to see more on this topic, check out (google) the podcast I co-host called Accessibility and Gen. AI.

Comment by tippa123 7 days ago

Honestly, that’s such a great example of how to share what you do on the interwebs. Right timing, helpful and on topic. Since I’ve listened to several episodes of the podcast, I can confirm it definitely delivers.

Comment by moss_dog 7 days ago

Thanks for the recommendation, just downloaded a few episodes!;

Comment by chrisweekly 7 days ago

Same! @devinprater, have you written about your experiences? You have an eager audience...

Comment by devinprater 7 days ago

I suppose I should write about them. A good few will be about issues with the mobile apps and websites for AI, like Claude not even letting me know a response is available to read, let alone sending it to the screen reader to be read. It's a mess, but if we blind people want it, we have to push through inaccessibility to get it.

Comment by Rover222 7 days ago

`Something one doesn't see` - no pun intended

Comment by K0balt 6 days ago

I must be wrong, but can’t help but harbor a mild suspicion that your use of sight metaphors is not coincidental.

Comment by WarcrimeActual 7 days ago

I have to believe you used the word see twice ironically.

Comment by badmonster 8 days ago

What other accessibility features do you wish existed in video AI models? Real-time vs post-processing?

Comment by devinprater 7 days ago

Mainly realtime processing. I play video games, and would love to play something like Legend of Zelda and just have the AI going, then ask it "read the menu options as I move between them," and it would speak each menu option as the cursor moves to it. Or when navigating a 3D environment, ask it to describe the surroundings, then ask it to tell me how to get to a place or object, then it guide me to it. That could be useful in real-world scenarios too.

Comment by kulahan 7 days ago

Weird question, but have you ever tried text adventures? It seems like it's inherently the ideal option, if you can get your screen reader going.

Comment by devinprater 7 days ago

Yep, they're nice. There are even online versions.

Comment by fguerraz 8 days ago

> Something one doesn't see in news headlines.

I hope this wasn't a terrible pun

Comment by densh 7 days ago

No pun intended but it's indeed an unfortunate choice of words on my part.

Comment by 47282847 7 days ago

My blind friends have gotten used to it and hear/receive it not as a literal “see“ any more. They would not feel offended by your usage.

Comment by devinprater 7 days ago

Nah, best pun ever!

Comment by GeekyBear 7 days ago

One cool feature they added for deaf parents a few years ago was a notification when it detects a baby crying.

Comment by SatvikBeri 7 days ago

My wife is deaf, and we had one kid in 2023 and twins in 2025. There's been a noticeable improvement baby cry detection! In 2023, the best we could find was a specialized device that cost over $1,000 and has all sorts of flakiness/issues. Today, the built-in detection on her (android) phone + watch is better than that device, and a lot more convenient.

Comment by Damogran6 7 days ago

I also got notification on my apple watch, while being away from the house, that the homepod mini heard our fire alarm going off.

A call home let us know that our son had set it off learning to reverse-sear his steak.

Comment by kstrauser 7 days ago

I live across the street from a fire station. Thank for you for diligence, little HomePod Mini, but I'm turning your notifications off now.

Comment by brandonb 7 days ago

If the fire alarm didn't go off, you didn't sear hard enough. :)

Comment by embedding-shape 7 days ago

Is that something you actually need AI for though? A device with a sound sensor and something that shines/vibrate a remote device when it detects sound above some threshold would be cheaper, faster detection, more reliable, easier to maintain, and more.

Comment by evilduck 7 days ago

But your solution costs money in addition to the phone they already own for other purposes. And multiple things can make loud noises in your environment besides babies; differentiating between a police siren going by outside and your baby crying is useful, especially if the baby slept through the siren.

The same arguments were said for blind people and the multitude of one-off devices that smartphones replaced, OCR to TTS, color detection, object detection in photos/camera feeds, detecting what denomination US bills are, analyzing what's on screen semantically vs what was provided as accessible text (if any was at all), etc. Sure, services for the blind would come by and help arrange outfits for people, and audiobook narrators or braille translator services existed, and standalone devices to detect money denominations were sold, but a phone can just do all of that now for much cheaper.

All of these accessibility AI/ML features run on-device, so the knee-jerk anti-AI crowd's chief complaints are mostly baseless anyways. And for the blind and the deaf, carrying all the potential extra devices with you everywhere is burdensome. The smartphone is a minimal and common social and physical burden.

Comment by Aurornis 7 days ago

> more reliable

I've worked on some audio/video alert systems. Basic threshold detectors produce a lot of false positives. It's common for parents to put white noise machines in the room to help the baby sleep. When you have a noise generating machine in the same room, you need more sophisticated detection.

False positives are the fastest way to frustrate users.

Comment by jfindper 7 days ago

>Is that something you actually need AI for though?

Need? Probably not. I bet it helps though (false positives, etc.)

>would be cheaper, faster detection, more reliable, easier to maintain, and more.

Cheaper than the phone I already own? Easier to maintain than the phone that I don't need to do maintenance on?

From a fun hacking perspective, a different sensor & device is cool. But I don't think it's any of the things you mentioned for the majority of people.

Comment by doug_durham 7 days ago

You are talking about a device of smart phone complexity. You need enough compute power to run a model that can distinguish noises. You need a TCP/IP stack and a wireless radio to communicate the information. At that point you have a smart phone. A simple sound threshold device would have too many false positives/negatives to be useful.

Comment by whatsupdog 7 days ago

> As a blind person, AI has changed my life.

I know this is a low quality comment, but I'm genuinely happy for you.

Comment by phyzix5761 8 days ago

Can you share some ways AI has changed your life?

Comment by gostsamo 8 days ago

Not the gp, but currently reading a web novel with a card game where the author didn't include alt text in the card images. I contacted them about it and they started, but in the meantime ai was a big help. all kinds of other images on the internet as well when they are significant to understanding the surrounding text. better search experience when Google, DDG, and the like make finding answers difficult. I might use smart glasses for better outdoor orientation, though a good solution might take some time. phone camera plus ai is also situationally useful.

Comment by dzhiurgis 8 days ago

As a (web app) developer I never quite sure what to put in alt. Figured you might have some advice here?

Comment by gostsamo 8 days ago

The question to ask is, what a sighted person learns after looking at the image? The answer is the alt text. E.g if the image is a floppy, maybe you communicate that this is the save button. If it shows a cat sleeping on the windowsill, the alt text is yep: "my cat looking cute while sleeping on the windowsill".

Comment by michaelbuckbee 7 days ago

I really like how you framed this as the takeaway or learning that needs to happen as what should be in the alt and not a recitation of the image. Where I've often had issues is more for things like business charts and illustrations and less cute cat photos.

Comment by isoprophlex 7 days ago

"A meaningless image of a chart, from which nevertheless emanates a feeling of stonks going up"

Comment by gostsamo 7 days ago

The logic stays the same though the answer is longer and not always easy. Just saying "business chart" is totally useless. You can make a choice on what to focus and say "a chart of the stock for the last five years with constant improvement and a clear increase by 17 percent in 2022" (if it is a simple point that you are trying to make) or you can provide an html table with the datapoints if there is data that the user needs to explore on their own.

Comment by alwillis 7 days ago

Accessible info graphics [1]

[1]: https://web.archive.org/web/20130922065731/http://www.last-c...

Comment by nextaccountic 7 days ago

but the table exists outside the alt text, right? i don't know a mechanism to say "this html table represents the contents of this image" , in a way that screen readers and other accessibility technologies take advantage of

Comment by gostsamo 7 days ago

The figure tag has both image and caption tags that link them. As far as I remember, some content could be marked as screen reader only if you don't want for the table to be visible to the rest of the users.

Additionally, recently I've been a participant in accessibility studies where charts, diagrams and the like have been structured to be easier to explore with a sr. Those needed js to work and some of them looked custom, but they are also an alternative way to layer data.

Comment by travisjungroth 7 days ago

It might be that you’re not perfectly clear on what exactly you’re trying to convey with the image and why it’s there.

Comment by hrimfaxi 7 days ago

What would you put for this? "Graph of All-Transactions House Price Index for the United States 1975-2025"?

https://fred.stlouisfed.org/series/USSTHPI

Comment by wlesieutre 7 days ago

Charts are one I've wondered about, do I need to try to describe the trend of the data, or provide several conclusions that a person seeing the chart might draw?

Just saying "It's a chart" doesn't feel like it'd be useful to someone who can't see the chart. But if the other text on the page talks about the chart, then maybe identifying it as the chart is enough?

Comment by gostsamo 7 days ago

It depends on the context. What do you want to say? How much of it is said in the text? Can the content of the image be inferred from the text part? Even in the best scenario though, giving a summary of the image in the alt text / caption could be immensely useful and include the reader in your thought process.

Comment by embedding-shape 7 days ago

What are you trying to point out with your graph in general? Write that basically. Usually graphs are added for some purpose, and assuming it's not purposefully misleading, verbalizing the purpose usually works well.

Comment by freedomben 7 days ago

I might be an unusual case, but when I present graphs/charts it's not usually because I'm trying to point something out. It's usually a "here's some data, what conclusions do you draw from this?" and hopefully a discussion will follow. Example from recently: "Here is a recent survey of adults in the US and their religious identification, church attendance levels, self-reported "spirituality" level, etc. What do you think is happening?"

Would love to hear a good example of alt text for something like that where the data isn't necessarily clear and I also don't want to do any interpreting of the data lest I influence the person's opinion.

Comment by embedding-shape 7 days ago

> and hopefully a discussion will follow.

Yeah, I think I misunderstood the context. I understood/assumed it to be for an article/post you're writing, where you have something you want to say in general/some point of what you're writing. But based on what you wrote now, it seems to be more about how to caption an image you're sending to a blind person in a conversation/discussion of some sort.

I guess at that point it'd be easier for them if you just share the data itself, rather than anything generated by the data, especially if there is nothing you want to point out.

Comment by gostsamo 7 days ago

An image is the wrong way to convey something like that to a blind person. As written in one of my other comments, give the data in a table format or a custom widget that could be explored.

Comment by alwillis 7 days ago

https://www.w3.org/WAI/tutorials/images/ including how write alt text for charts.

Comment by travisjungroth 7 days ago

Charts would have a link to tabular data. It’s the “business illustrations” that are more about understanding purpose.

Comment by asadotzler 7 days ago

a plaintext table with the actual data

Comment by gostsamo 7 days ago

sorry, snark does not help with my desire to improve accessibility in the wild.

Comment by travisjungroth 7 days ago

I really didn’t mean to be snarky. Maybe if I was speaking, my tone would have made that more clear, or I could have worded it differently.

“Why is this here? What am I trying to say?” are super important things in design and also so easy to lose track of.

Comment by alwillis 7 days ago

> As a (web app) developer I never quite sure what to put in alt.

Are you making these five mistakes when writing alt text? [1] Images tutorial [2] Alternative Text [3]

[1]: https://www.a11yproject.com/posts/are-you-making-these-five-...

[2]: https://www.w3.org/WAI/tutorials/images/

[3]: https://webaim.org/techniques/alttext/

Comment by shagie 7 days ago

I'm gonna flip this around... have you tried pasting the image (and the relevant paragraph of text) and asking ChatGPT (or another LLM) to generate the alt text for the image and see what it produces?

For example... https://chatgpt.com/share/692f1578-2bcc-8011-ac8f-a57f2ab6a7...

Comment by alwillis 7 days ago

> I'm gonna flip this around... have you tried pasting the image (and the relevant paragraph of text) and asking ChatGPT (or another LLM) to generate the alt text for the image and see what it produces?

There's a great app by an indie developer that uses ML to identify objects in images. Totally scriptable via JavaScript, shell script and AppleScript. macOS only.

Could be 10, 100 or 1,000 images [1].

[1]: https://flyingmeat.com/retrobatch/

Comment by askew 7 days ago

One way to frame it is: "how would I describe this image to somebody sat next to me?"

Comment by embedding-shape 7 days ago

Important to add for blind people: "... assuming they never seen anything and visual metaphors won't work"

The amount of times I've seem captions that wouldn't make sense for people who never been able to see is staggering, I don't think most people realize how visual our typical language usage is.

Comment by darkwater 8 days ago

I guess that auto-generated audio descriptions for (almost?) any video you want is a very, very nice feature for a blind person.

Comment by tippa123 8 days ago

My two cents, this seems like a case where it’s better to wait for the person’s response instead of guessing.

Comment by darkwater 8 days ago

Fair enough. Anyway I wasn't trying to say what actually changed GP's life, I was just expressing my opinion on what video models could potentially bring as an improvement to a blind person.

Comment by nkmnz 7 days ago

My two cents, this seems like a comment it should be up to the OP to make instead of virtue signaling.

Comment by foobarian 7 days ago

Yall could have gotten a serviceable answer about this topic out of ChatGPT. 2025 version of "let me google that for you"

Comment by tippa123 7 days ago

> Can you share some ways AI has changed your life?

A question directed to GP, directly asking about their life and pointing this out is somehow virtue signalling, OK.

Comment by throwup238 7 days ago

You can safely assume that anyone who uses “virtue signaling” unironically has nothing substantive to say.

Comment by nkmnz 5 days ago

Funny coming from someone who tells others not to speak unless asked, while no one asked them either.

Comment by SV_BubbleTime 7 days ago

>[People who call out performative bullshit should be ignored because they’re totally wrong and I totally mean it.]

Maybe you’re just being defensive? I’m sure he didn’t mean an attack at you personally.

Comment by throwup238 7 days ago

It’s presumptuous of you to assume I was offended.

Accusing someone of “virtue signaling” is itself virtue signaling, just for a different in-group to use as a thought terminating cliche. It has been for decades. “Performative bullshit” is a great way to put it, just not in the way you intended.

If the OP had a substantive point to make they would have made it instead of using vague ad hominem that’s so 2008 it could be the opening track on a Best of Glenn Beck album (that’s roughly when I remember “virtue signaling” becoming a cliche).

Comment by MangoToupe 7 days ago

...you know, people can have opinions about the best way to behave outside of self-aggrandizement, even if your brain can't grasp this concept.

Comment by nkmnz 7 days ago

exactly

Comment by fragmede 7 days ago

From the list of virtues, which one was this signaling?

https://www.virtuesforlife.com/virtues-list/

Comment by efs24 7 days ago

I’d guess: Respect, consideration, authenticity, fairness.

Or should I too perhaps wait for OP to respond.

Comment by SV_BubbleTime 7 days ago

[flagged]

Comment by meindnoch 7 days ago

[flagged]

Comment by Moomoomoo309 7 days ago

The two cents are not literally monetary - your opinion is literally the two cents. You're contributing your understanding to the shared pot of understanding and that's represented by putting money into the pot, showing you have skin in the game. It's contributing to a larger body of knowledge by putting your small piece in - the phrases you suggest don't have that context behind them and in my opinion are worse for it. The beauty of the phrase is because the two cents are your opinion, everyone has enough, because everyone can have an opinion.

The lens through which you're analyzing the phrase is coloring how you see it negatively, and the one I'm using is doing the opposite. There is no need to change the phrase, just how it's viewed, I think.

Comment by kachapopopow 7 days ago

people put too much weight onto words, the first lesson I learned on the internet is that words are harmless, might be deeply painful for some, but because people as my self put no weight behind them we don't even have a concept of keeping such things mindful since it never crosses our minds and it's really difficult to see if any other way even if we try to since it just seems like a bad joke.

And when I say 'it never crosses our minds' I really mean it, there's zero thoughts between thinking about a message and having it show up in a text box.

A really great example are slurs, for a lot of people they have to double take, but there's zero extra neurons fired when I read them. I guess early internet culture is to blame since all kinds of language was completely uncensored and it was very common to run into very hostile people/content.

Comment by georgebcrawford 7 days ago

> The metaphor of assigning a literal monetary value to one's opinion reinforces the idea that contributions are transactional and that their "worth" is measured through an economic lens. That framing can be exclusionary, especially for people who have been historically marginalized by economic systems. It subtly normalizes a worldview where only those with enough "currency" - social, financial, or otherwise - deserve to be heard.

No. It’s acknowledging that that perhaps one’s opinion may not be as useful as somebody else’s in that moment. Which is often true!

Your first and third paragraphs are true, but they don’t apply to every bloody phrase.

Comment by baq 8 days ago

guessing that being able to hear a description of what the camera is seeing (basically a special case of a video) in any circumstances is indeed life changing if you're blind...? take a picture through the window and ask what's the commotion? door closed outside that's normally open - take a picture, tell me if there's a sign on it? etc.

Comment by devinprater 7 days ago

Image descriptions. TalkBack on Android has it built in and uses Gemini. VoiceOver still uses some older, less accurate, and far less descriptive ML model, but we can share images to Seeing AI or Be My Eyes and such and get a description.

Video descriptions, through PiccyBot, have made watching more visual videos or videos where things happen that don't make sense without visuals much easier. Of course, it'd be much better if YouTube incorporated audio description through AI the same way they do captions, but that may happen in a good 2 years or so. I'm not holding my breath. Google as a whole is hard to get accessibility out of more than the bare minimum.

Looking up information like restaurant menus. Yes it can make things up, but worst-case, the waiter says they don't have that.

Comment by javcasas 7 days ago

Finally good news about the AI doing something good for the people.

Comment by p1esk 7 days ago

I’m not blind and AI has been great for me too.

Comment by majkinetor 7 days ago

[flagged]

Comment by talesfromearth 7 days ago

The smiley at the end doesn’t hide how awful your comment is.

Comment by majkinetor 7 days ago

So serious... you should relax a bit and work up on your humor reception/understanding (smiley intentionally left out this time)

Comment by Workaccount2 7 days ago

People need to understand that a lot of angst around AI comes from AI enabling people to do things that they formally needed to go through gatekeepers for. The angst is coming from the gatekeepers.

AI has been a boon for me and my non-tech job. I can pump out bespoke apps all day without having to get bent on $5000/yr/usr engineering software packages. I have a website for my side business that looks and functions professionally and was done with a $20 monthly AI subscription instead of a $2000 contractor.

Comment by BeFlatXIII 7 days ago

AI is divine retribution for artists being really annoying on Twitter.

Comment by MyFirstSass 7 days ago

I highly doubt "pumping out bespoke apps all day" is possible yet besides 100% boilerplate, and when possible then no good for any other purpose than enshittifiying the web, and at that point not profitable because everyone can do it.

I use AI daily as a senior coder for search and docs, and when used for prototyping you still need to be a senior coder to go from say 60% boilerplate to 100% finished app/site/whatever unless it's incredibly simple.

Comment by alwillis 7 days ago

> I use AI daily as a senior coder for search and docs, and when used for prototyping you still need to be a senior coder to go from say 60% boilerplate to 100% finished app/site/whatever unless it's incredibly simple.

I know you would like to believe that, but with the tools available NOW, that's not necessarily the case. For example, by using the Playwright or Chrome DevTools MCPs, models can see the web app are it's being created and it's pretty easy to prompt them to fix something they can see.

These models know the current frameworks and coding practices but they do need some guidance; they're not mindreaders.

Comment by MyFirstSass 7 days ago

I still don't believe that. Again yes a boilerplate calculator or recipe app probably, but anything advanced real world with latency issues, scaling, race conditions, css quirks, design weirdness, optimisation - in other words the things that actually require domain knowledge i still don't get much help with, even with Claude Code, pointers yes but they completely fumble actual production code in real world scenarios.

Again it's the last 5% that takes 95% of the time, and those 5% i haven't seen fixed with Claude or Gemini, because it's essentially quirks, browser errors, race conditions, visual alignment, etc etc. All stuff that completely goes way above any LLM's head atm from what i've seen.

They can definitely bullshit a 95% working app though, but that's 95% from being done ;)

Comment by Workaccount2 7 days ago

Often the problem with tech people is they think software only exists for tech or for being sold to others from tech.

Nothing I do is in the tech industry. It's all manufacturing and all the software is for in-house processes.

Believe it or not, software is useful to everyone and no longer needs to originate from someone who only knows software.

Comment by MyFirstSass 7 days ago

I'm saying you can't do what you're saying without knowing code at the moment.

You didn't give any examples of the valuable bespoke apps that you are creating by the hour.

I simply don't believe you, and the arrogant salesy tone doesn't help.

Comment by Workaccount2 7 days ago

LLMs can pretty reliably write 5-7k LOC.

If your needs fit in a program that size, you are pretty much good to go.

It will not rewrite PCB_CAD 2025, but it will happily create a PCB hole alignment and conversion app, eliminated the need for the full PCB_CAD software if all you need is that one toolset from it.

Very, very, few pieces of software need to be full package enterprise productivity suites. If you just make photos black and white and resize them, you don't need Photoshop to do it. Or even ms paint. Any LLM will make a simple free program with no ads to do it. Average people generally do very simple dumb stuff with the expensive software they buy.

Comment by vjvjvjvjghv 7 days ago

This is the same as the discussion about using Excel. Excel has its limitations, but it has enabled millions of people to do pretty sophisticated stuff without the help of “professionals”. Most of the stuff us tech people do is also basically some repetitive boilerplate. We just like to make things more complex than they need to be. I am always a little baffled why seemingly every little CRUD site that has at most 100 users needs to be run on Kubernetes with several microservices, CI/CD pipelines, and whatever.

As far as enshittification goes, this was happening long before AI. It probably started with SEO and just kept going from there.

Comment by almosthere 7 days ago

The reality is too, that even if "what is acceptable" has not yet caught up to that guy working at Atlassian, polishing off a new field in Jira, people are using AI + Excel to manage their tasks EXACTLY the way their head works, not the way Jira works.

Yet we fail to see AI as a good thing but just as a jobs destroyer. Are we "better than" the people that used to fill toothpaste tubes manually until a machine was invented to replace them? They were just as mad when they got the pink slip.

Comment by vjvjvjvjghv 7 days ago

I have told people that us techies have proudly killed the jobs of millions of people and we were arrogant about it. Now we are mad that it's our turn. Feels almost like justice :-)

Comment by mycall 6 days ago

I'm in the middle of updating a new public transit app my team wrote in Flutter and want it to flow better for blind people. Would you rather use AI chat session with realtime information or go through the screens with VoiceOver/TalkBack controls? What about trust and hallucinations?

Any tips you can give?

Comment by robbomacrae 7 days ago

Hi Devin and other folks, I'm looking for software developers who are blind or hard of sight as there is a tool I'm building that I think might be of interest to them (it's free and open source). If you or anyone you know is interested in trying it please get in touch through my email.

Comment by andy_ppp 7 days ago

I wonder if there's anything that can help blind people to navigate the world more easily - I guess in the future AR Glasses won't just be for the sighted but allow people without vision to be helped considerably. It really is both amazing and terrifying the future we're heading towards.

Comment by asadotzler 7 days ago

AURA Vision for blind and low vision people has been doing this for years. Be My Eyes has been doing this for years without AI. Meta Ray-Bans can do this. There's nothing new coming soon that hasn't already been available for a while, only refinements.

Comment by astrange 7 days ago

This is an area where a difference in degree becomes a difference in kind quite easily. If an AI is telling you what it's looking at, it suddenly means a lot more once it… knows what it's looking at.

Comment by xnx 7 days ago

https://play.google.com/store/apps/details?id=com.google.and...

Comment by shagie 7 days ago

From a couple years ago...

https://www.microsoft.com/en-us/garage/wall-of-fame/seeing-a...

https://youtu.be/R2mC-NUAmMk

https://youtu.be/DybczED-GKE

... and that was 10 years ago. I'm curious for what it could do now.

Comment by basilgohar 7 days ago

I'm only commenting because I absolutely love this thread. It's an insight into something I think most of us are quite (I'm going to say it...) blind to in our normal experiences with daily life, and I find immense value in removing my ignorance about such things.

Comment by kruxigt 8 days ago

[dead]

Comment by RobotToaster 8 days ago

The license[0] seems quite restrictive, limiting it's use to non commercial research. It doesn't meet the open source definition so it's more appropriate to call it weights available.

[0]https://github.com/apple/ml-starflow/blob/main/LICENSE_MODEL

Comment by limagnolia 7 days ago

They haven't even released the weights yet...

As for the license, happily, Model Weights are the product of machine output and not creative works, so not copyrightable under US law. Might depend on where you are from, but I would have no problem using Model Weights however I want to and ignoring pointless licenses.

Comment by loufe 7 days ago

The weights for the text-->image model are already on Huggingface, FWIW.

Comment by pabs3 7 days ago

The output of a compiler is copyrightable, why aren't models similarly copyrightable?

Comment by 7 days ago

Comment by yegle 8 days ago

Looking at text to video examples (https://starflow-v.github.io/#text-to-video) I'm not impressed. Those gave me the feeling of the early Will Smith noodles videos.

Did I miss anything?

Comment by M4v3R 8 days ago

These are ~2 years behind state of the art from the looks of it. Still cool that they're releasing anything that's open for researchers to play with, but it's nothing groundbreaking.

Comment by tomthe 8 days ago

No, it is not as good as Veo, but better than Grok, I would say. Definitely better than what was available 2 years ago. And it is only a 7B research model!

Comment by Mashimo 8 days ago

But 7b is rather small no? Are other open weight video models also this small? Can this run on a single consumer card?

Comment by dragonwriter 8 days ago

> But 7b is rather small no?

Sure, its smallish.

> Are other open weight video models also this small?

Apples models are weights-available not open weights, and yes, WAN 2.1, as well as the 14B models, also has 1.3B models; WAN 2.2, as well as the 14B models, also has a 5B model (the WAN 2.2 VAE used by Starflow-V is specifically the one used with the 5B model.) and because the WAN models are largely actually open weights models (Apache 2.0 licensed) there are lots of downstream open-licensed derivatives.

> Can this run on a single consumer card?

Modern model runtimes like ComfyUI can run models that do not fit in VRAM on a single consumer card by swapping model layers between RAM and VRAM as needed; models bigger than this can run on single consumer cards.

Comment by Maxious 8 days ago

Wan 2.2: "This generation was run on an RTX 3060 (12 GB VRAM) and took 900 seconds to complete at 840 × 420 resolution, producing 81 frames." https://www.nextdiffusion.ai/tutorials/how-to-run-wan22-imag...

Comment by jjfoooo4 7 days ago

My guess is that they will lean towards smaller models, and try to provide the best experience for running inference on device

Comment by tdesilva 7 days ago

The interesting part is they chose to go with a normalizing flow approach, rather than the industry standard diffusion model approach. Not sure why they chose this direction as I haven’t read the paper yet.

Comment by jfoster 7 days ago

I think you need to go back and rewatch Will Smith eating spaghetti. These examples are far from perfect and probably not the best model right now, but they're far better than you're giving credit for.

As far as I know, this might be the most advanced text-to-video model that has been released? I'm not sure whether the license will qualify as open enough in everyone's eyes, though.

Comment by manmal 7 days ago

I wanted to write exactly the same thing, this reminded me of the Will Smith noodles. The juice glass keeps filling up after the liquid stopped pouring in.

Comment by gorgoiler 7 days ago

It’s not really relevant to this release specifically but it irks me that, in general, an “open weights model” is like an “open source machine code” version of Microsoft Windows. Yes, I guess I have open access to view the thing I am about to execute!

This Apple license is click wrap MIT with the rights, at least, to modify and redistribute the model itself. I suppose I should be grateful for that much openness, at least.

Comment by advisedwang 7 days ago

Great analogy.

To extend the analogy, "closed source machine code" would be like conventional SaaS. There's an argument that shipping me a binary I can freely use is at least better than only providing SaaS.

Comment by satvikpendem 7 days ago

> Yes, I guess I have open access to view the thing I am about to execute!

Better to execute locally than to execute remotely where you can't change or modify any part of the model though. Open weights at least mean you can retrain or distill it, which is not analogous to a compiled executable that you can't (generally) modify.

Comment by limagnolia 7 days ago

I think you are looking at the code license, not the model license.

Comment by Aloisius 7 days ago

No, it's the model license. There's a second license for the code.

Of course, model weights almost certainly are not copyrightable so the license isn't enforceable anyway, at least in the US.

The EU and the UK are a different matter since they have sui generis database rights which seemingly allows individuals to own /dev/random.

Comment by pabs3 7 days ago

The output of a compiler is copyrightable, why aren't models similarly copyrightable?

Comment by limagnolia 6 days ago

The output of a compiler is directly based on what you put in, the source code. That makes it a derivative work of the copyrightable source code, and thus copyright to the copyright holder of the source code, not the person who runs the compiler.

One might argue that model weights are derivative of the training material and copyright held be the copyright holder of the training material. The counter argument would be that the weights are significantly transformational.

Comment by vessenes 7 days ago

From the paper, this is a research model aimed at dealing with the runaway error common in diffusion video models - the latent space is (proposed to be) causal and therefore it should have better coherence.

For a 7b model the results look pretty good! If Apple gets a model out here that is competitive with wan or even veo I believe in my heart it will have been trained with images of the finest taste.

Comment by LoganDark 7 days ago

> Model Release Timeline: Pretrained checkpoints will be released soon. Please check back or watch this repository for updates.

> The checkpoint files are not included in this repository due to size constraints.

So it's not actually open weights yet. Maybe eventually once they actually release the weights it will be. "Soon"

Comment by summerlight 7 days ago

This looks interesting. This project has some novelty as a research and actually delivered a promising PoC but as a product it implies that its training was severely constrained by computing resources, which correlates well with the report that their CFO overruled CEO's decision on ML infra investment.

JG's recent departure and follow up massive reorg to get rid of AI, rumors on Tim's upcoming step down in early 2026... All of these signals indicate that those non-ML folks have won corporate politics to reduce the in-house AI efforts.

I suppose this was a part of serious efforts to deliver in-house models but the directional changes on AI strategy made them to give up. What a shame... At least the approach itself seem interesting and hope others to take a look and use it for building something useful.

Comment by coolspot 8 days ago

> STARFlow-V is trained on 96 H100 GPUs using approximately 20 million videos.

They don’t say for how long.

Comment by moondev 7 days ago

Apple intelligence: trained by Nvidia GPUs on Linux.

Do the examples in the repo run inference on Mac?

Comment by dymk 7 days ago

Title is wrong, model isn’t released yet. Title also doesn’t appear in the link - why the editorializing?

Comment by satvikpendem 8 days ago

Looks good. I wonder what use case Apple has in mind though, or I suppose this is just what the researchers themselves were interested in, perhaps due to the current zeitgeist. I'm not really sure how it works at big tech companies with regards to research, are there top down mandates?

Comment by ivape 7 days ago

To add things to videos you create with your phone. TikTok and Insta will probably add this soon, but I suppose Apple is trying to provide this feature on “some level”. That means you don’t have to send your video through a social media platform first to creatively edit it (the platforms being the few tools that let you do generative video).

They should really buy Snapchat.

Comment by ozim 7 days ago

I guess Apple is big in video production and animation with some ties via Pixar and Disney. Since Jobs started Pixar and it all got tied up in myriad of different ways.

Comment by nothrowaways 8 days ago

Where do they get the video training data?

Comment by postalcoder 8 days ago

From the paper:

> Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.

Comment by justinclift 8 days ago

> in-house stock video dataset

Wonder if "iCloud backups" would be counted as "stock video" there? ;)

Comment by anon7000 7 days ago

I have to delete as many videos as humanly possible before backing up to avoid blowing through my iCloud storage quota so I guess I’m safe

Comment by whywhywhywhy 7 days ago

More likely AppleTV shows

Comment by astrange 7 days ago

Stock video means stock video.

https://en.wikipedia.org/wiki/Stock_photography

Comment by fragmede 8 days ago

Turn on advanced data protection so they don't train on yours.

Comment by givinguflac 7 days ago

That has nothing to do with it, and Apple wouldn’t train on user content, they’re not Google. If they ever did there would be opt in at best. There’s a reason they’re walking and observing, not running and trying to be the forefront cloud AI leader, like some others.

Comment by gaigalas 7 days ago

Why should I buy this "ethical Apple" argument?

They shared audio Siri recordings with contractors in 2019. It became opt-in only after backlash, similar to other privacy controversies.

This shows that they clearly prioritize not being sued or caught, which is slightly different from prioritizing user choices.

Comment by cubefox 7 days ago

Interesting that this is an autoregressive ("causal") model rather than a diffusion model.

Comment by 7 days ago

Comment by giancarlostoro 7 days ago

I was upset the page didnt have videos immediately available, then I realized I have to click on some of the tabs. One red flag on their github is the license looks to be their own flavor of MIT (though much closer to MS-PL).

Comment by andersa 7 days ago

The number of video models that are worse than Wan 2.2 and can safely be ignored has increased by 1.

Comment by embedding-shape 7 days ago

To be fair, the sizes aren't comparable, and for the variant that is comparable, the results aren't that much worse.

Comment by dragonwriter 7 days ago

The samples (and this may or may not be completely fair, either set could be more cherry picked than the other, It would be interesting to see a side-by-side comparison with comparable prompts) seem significantly worse than what I’ve seen from WAN 2.1 1.3B, which is both fron the previous WAN version and is smaller, proportionally, compared to Apple’s 7B than that model itself is compared to the 28B combination of the high and low noise 14B WAN 2.2 models that are typically used together.

But also, Starflow-V is a research model with a substandard text encoder, it doesn't have to be competitive as-is to be an interesting spur for further research on the new architecture it presents. (Though it would be nice if it had some aspect where it offered a clear improvement.)

Comment by wolttam 7 days ago

This doesn’t look like it was intended to compete. The research appears interesting

Comment by camillomiller 8 days ago

Hopefully this will make into some useful feature in the ecosystem and not contribute to having just more terrible slop. Apple has saved itself from the destruction of quality and taste that these model enabled, I hope it stays that way.

Comment by Invictus0 7 days ago

Apple's got to stop running their AI group like a university lab. Get some actual products going that we can all use--you know, with a proper fucking web UI and a backend.

Comment by Jtsummers 7 days ago

Personally, I'm happy that Apple is spending the time and money on research. We have products that already do what this model does, the next step is to make it either more efficient or better (closer to the prompt, more realistic, higher quality output). That requires research, not more products.

Comment by Invictus0 6 days ago

Not complaining about the research. I'm complaining that I have to do work to even try this, instead of Apple offering a demo or portal of some kind.

Comment by Barry-Perkins 7 days ago

[dead]

Comment by ai_updates 8 days ago

[flagged]

Comment by MallocVoidstar 8 days ago

you don't "appreciate" anything, you're just posting LLM comments

Comment by pulse7 8 days ago

Comment by mdrzn 8 days ago

"VAE: WAN2.2-VAE" so it's just a Wan2.2 edit, compressed to 7B.

Comment by kouteiheika 8 days ago

This doesn't necessarily mean that it's Wan2.2. People often don't train their own VAEs and just reuse an existing one, because a VAE isn't really what's doing the image generation part.

A little bit more background for those who don't know what a VAE is (I'm simplifying here, so bear with me): it's essentially a model which turns raw RGB images into a something called a "latent space". You can think of it as a fancy "color" space, but on steroids.

There are two main reasons for this: one is to make the model which does the actual useful work more computationally efficient. VAEs usually downscale the spatial dimensions of the images they ingest, so your model now instead of having to process a 1024x1024 image needs to work on only a 256x256 image. (However they often do increase the number of channels to compensate, but I digress.)

The other reason is that, unlike raw RGB space, the latent space is actually a higher level representation of the image.

Training a VAE isn't the most interesting part of image models, and while it is tricky, it's done entirely in an unsupervised manner. You give the VAE an RGB image, have it convert it to latent space, then have it convert it back to RGB, you take a diff between the input RGB image and the output RGB image, and that's the signal you use when training them (in reality it's a little more complex, but, again, I'm simplifying here to make the explanation more clear). So it makes sense to reuse them, and concentrate on the actually interesting parts of an image generation model.

Comment by sroussey 7 days ago

Since you seem to know way more than I on the subject, can you explain the importance of video generation that is not diffusion based?

Comment by mdrzn 7 days ago

Thanks for the explanation!

Comment by dragonwriter 8 days ago

> "VAE: WAN2.2-VAE" so it's just a Wan2.2 edit

No, using the WAN 2.2 VAE does not mean it is a WAN 2.2 edit.

> compressed to 7B.

No, if it was an edit of the WAN model that uses the 2.2 VAE, it would be expanded to 7B, not compressed (the 14B models of WAN 2.2 use the WAN 2.1 VAE, the WAN 2.2 VAE is used by the 5B WAN 2.2 model.)

Comment by BoredPositron 8 days ago

They used the VAE of WAN like many other models do. For image models you see a lot of them using the flux VAE. Which is perfectly fine, they are released as apache2 and save you time to focus on your transformers architecture...