I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models
Posted by iliashad 2 days ago
TLDR: I had 2,207 GoPro videos, and I needed to rewatch them to find interesting moments from my cycling journey. I built a project to index them locally on my M1 Max using open-source ML models, search for those moments, and send the best clips straight to my DaVinci Resolve timeline. I indexed 628 videos (668.68 GB, 15h 13m 18s of footage); more details are in the metrics table in the last section of the article.
Full article: https://iliashaddad.com/blog/i-indexed-669-gb-of-my-gopro-videos-using-my-m1-max-computer
Comments
Comment by asenna 2 days ago
https://news.ycombinator.com/item?id=48222733 https://blog.simbastack.com/indexed-a-year-of-video-locally/
I wasn't familiar with your project though, interesting stuff.
I'm trying to add more photography related features to Framedex but yeah there's so much we can do locally, exciting times.
Comment by iliashad 2 days ago
Good job on the article and the project. That's great; yes, local models are getting better and better.
Comment by justinram11 2 days ago
I'm really bullish on taking more video of my kids, with the thought that it will become easier and easier for AI to put them into little compilations I can enjoy later.
Comment by iliashad 2 days ago
Comment by mwelpa 2 days ago
Comment by alias_neo 2 days ago
I booted up my old PS3 from my uni days (20 years ago?) and found all of the music I had on it because I used it for everything at the time. Some seriously nostalgic music I'd completely forgotten about.
Comment by theshrike79 1 day ago
Google loves scanning stuff in the cloud though.
Comment by goodmythical 2 days ago
Years from now they'll be getting "hey look at BIKE BRANDS' NEWEST CHEAP BIKE REMEMBER WHEN YOU USED TO RIDE BIKE BRAND BIKES"
Comment by satvikpendem 2 days ago
Comment by whattheheckheck 2 days ago
Comment by satvikpendem 1 day ago
Comment by marci 2 days ago
Comment by esjeon 2 days ago
Aha, it makes total sense. This number sounds much more reasonable than “669 GB”, since the actual total size of processed frames would be like 10-30 GB.
(Not downplaying anything. Doing-at-home always requires some math on practicality)
> Total compute time 67h 40m 42s
I’m just curious, though: are there any paid options that can accelerate this kind of process? Just spin up GPU instances?
Comment by iliashad 2 days ago
The reason is that “669 GB” is the total raw footage size. When I'm doing the video processing, I downscale each frame to 720p, which makes the processing much faster, and I don't need the full original quality to get accurate results (as far as I know from my experiments).
> I’m just curious, though: are there any paid options that can accelerate this kind of process? Just spin up GPU instances?
For now, I've found that an NVIDIA GPU, for example an RTX 3060 with 12 GB of VRAM, is much faster than my M1 Max (I'm still working on optimizing for speed and accuracy).
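To illustrate the 720p downscaling step mentioned above, here is a minimal sketch of how the target frame dimensions could be computed. It is not the project's actual code: `downscale_size` and `pixel_reduction` are hypothetical helper names, and rounding the width down to an even number is an assumption (many video scalers prefer even dimensions).

```python
from typing import Tuple


def downscale_size(width: int, height: int, target_height: int = 720) -> Tuple[int, int]:
    """Return (width, height) scaled to target_height, keeping the aspect ratio.

    Frames already at or below the target height are returned unchanged.
    Hypothetical helper for illustration only.
    """
    if height <= target_height:
        return width, height
    new_width = round(width * target_height / height)
    # Assumption: keep the width even, since many scalers/encoders prefer that.
    new_width -= new_width % 2
    return new_width, target_height


def pixel_reduction(width: int, height: int, target_height: int = 720) -> float:
    """How many times fewer pixels per frame there are after downscaling."""
    new_width, new_height = downscale_size(width, height, target_height)
    return (width * height) / (new_width * new_height)


if __name__ == "__main__":
    print(downscale_size(3840, 2160))
    print(pixel_reduction(3840, 2160))
```

For a 3840x2160 frame this gives 1280x720, which is 9 times fewer pixels per frame to analyze.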
Comment by ngai_aku 2 days ago
Comment by villgax 2 days ago
Comment by fennecfoxy 2 days ago
They were having availability issues with GPUs (of course), but especially with their UI: you'd customise a template, try to start a pod, find the GPU unavailable, and the UI would reset, forcing you to make the changes all over again.
But they have fixed that since. Starting a pod now happens from a live page that updates in real time as GPU availability changes, and if your deploy fails you just try again; your customised env vars etc. are still there.
Plus, they also addressed the GPU availability problem as something they're working to fix, which is understandable seeing as nobody can get their hands on GPUs atm.
Comment by egorfine 2 days ago
But it's not as fun as running a local model right here on your own computer, on your own desk. It feels like magic.
Comment by robrain 2 days ago
Comment by iliashad 2 days ago
I think Adobe Premiere Pro has it as well, but cloud-processed.
Comment by teovall 2 days ago
Comment by robrain 2 days ago
https://www.blackmagicdesign.com/products/davinciresolve/wha...
Comment by Schiendelman 2 days ago
Comment by iliashad 1 day ago
Comment by Schiendelman 1 day ago
Comment by iliashad 2 days ago
Comment by Beijinger 2 days ago
Comment by pduggishetti 2 days ago
You might want to add something like a YOLO fine-tune to detect scenes, plus face recognition too.
Comment by dotancohen 2 days ago
Comment by fennecfoxy 2 days ago
Comment by vorticalbox 2 days ago
Comment by avadodin 2 days ago
I found a few pornographic pictures on the web to hand to Abliterated Gemma4 12B (literally just to test this), and it needs pushing just to accept that people can be naked.
It didn't refuse but it also didn't provide useful descriptions such as "this is a pornographic picture of a woman".
> G4: There is a person lying down in a scientific context, if I had to guess they are a biologist in a classroom
> me: Is she wearing any clothes?
> G4: No.
Also, it is obsessed with penises, seeing them in compositions where there is only a female. I suppose it's been trained to ban dick pics or something.
Prompting may help some but 12B seems to be a bit worse than E4B with the vision/audio model at voice and text reading so maybe that one would do better.
Comment by pduggishetti 2 days ago
Comment by lifestyleguru 2 days ago
Comment by 3eb7988a1663 2 days ago
Comment by dotancohen 2 days ago
Comment by sarjann 2 days ago
Comment by fhdkweig 2 days ago
Comment by nntwozz 2 days ago
Comment by iliashad 2 days ago
Comment by fennecfoxy 2 days ago
Comment by supertroop 2 days ago
Comment by fibers 2 days ago
Comment by kaycey2022 2 days ago
Comment by okr 2 days ago
Comment by MaxGL 2 days ago
Comment by WarOnPrivacy 2 days ago
M1 Max CPU is an ARM/SoC, comparable to an 11th gen Intel i9
Do I have it right? Would Windows ARM performance be similar for those CPUs?
Ref: https://www.cpubenchmark.net/compare/4585vs4245/Apple-M1-Max...
Comment by pachouli-please 2 days ago
- "unified" RAM makes all the system RAM available as VRAM
- dedicated AI coaccelerator thingy
Both of these allow the Apple silicon chips to crush conventional CPUs in these kinds of AI model workloads.
No idea about what the windows arm stuff is capable of. I know they use Qualcomm snapdragon chips though.
Comment by owldown 2 days ago
Comment by voidmain0001 2 days ago
Comment by Rohansi 2 days ago
Comment by voidmain0001 1 day ago
Comment by Rohansi 1 day ago
Comment by iliashad 2 days ago
Comment by duncangh 1 day ago
Comment by iliashad 1 day ago
Comment by fl0id 2 days ago
Comment by iliashad 2 days ago
Comment by LeonardoTolstoy 2 days ago
EDITED: I didn't realize Whisper was a local model. I never tried transcription before, so I had always figured it was a pay model by OpenAI. I'll have to check it out (although the runtime listed here is a bit daunting).
For that project, I'll say I don't see much degradation in embedding quality at resolutions much, much lower than 720p (all the way down to 240p), which speeds things up considerably. Although I don't really do face or object detection, just scene embeddings. To me, any process where it takes longer to process the video than to watch it is probably a no-go in general. Obviously a challenge for local-first analysis.
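As a rough check of that rule of thumb against the figures quoted elsewhere in this thread (15h 13m 18s of footage in the post's TLDR, and a total compute time of 67h 40m 42s), the sketch below computes the processing-to-footage ratio. `parse_duration` is a hypothetical helper, and the comparison assumes both figures refer to the same set of indexed videos.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a duration written like '67h 40m 42s' (hypothetical helper)."""
    units = {"h": "hours", "m": "minutes", "s": "seconds"}
    parts = {}
    for token in text.split():
        value, unit = token[:-1], token[-1]
        parts[units[unit]] = int(value)
    return timedelta(**parts)


footage = parse_duration("15h 13m 18s")  # footage duration from the post's TLDR
compute = parse_duration("67h 40m 42s")  # total compute time quoted in the thread

# Dividing one timedelta by another yields a plain float.
ratio = compute / footage
print(f"Processing took about {ratio:.1f}x the footage duration")
```

Under that assumption, indexing took roughly 4.4 times as long as simply watching the footage.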
Comment by insumanth 2 days ago
Take a fast, small and powerful LLM running locally to index my personal data (images, videos, documents), enrich it, and tag it with the enriched metadata.
Want to group by people? Search the tagged metadata and group it.
Want to search for an image by description? Tagged metadata.
Want to organize by anything? Tagged metadata.
This should (hopefully) put an end to my file clutter
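A minimal sketch of what grouping and searching over tagged metadata could look like once a local model has produced the tags. Everything here is hypothetical: the `MediaItem` fields, the helper names, and the sample paths and tags are made up for illustration, and the tagging step itself is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MediaItem:
    """One indexed file plus the metadata a local model might attach to it."""
    path: str
    people: List[str] = field(default_factory=list)
    description: str = ""


def group_by_people(items: List[MediaItem]) -> Dict[str, List[str]]:
    """Map each tagged person to the paths of the files they appear in."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        for person in item.people:
            groups[person].append(item.path)
    return dict(groups)


def search_description(items: List[MediaItem], query: str) -> List[str]:
    """Return paths whose description contains every word of the query."""
    terms = query.lower().split()
    return [
        item.path
        for item in items
        if all(term in item.description.lower() for term in terms)
    ]


# Made-up sample data standing in for the output of a tagging pass.
items = [
    MediaItem("a.jpg", ["alice", "bob"], "Sunset at the beach"),
    MediaItem("b.pdf", [], "Scanned tax document"),
    MediaItem("c.mp4", ["alice"], "Cycling on a mountain road"),
]

if __name__ == "__main__":
    print(group_by_people(items))
    print(search_description(items, "beach sunset"))
```

Plain substring matching is used here only to keep the example short; a real setup might use embeddings or a proper search index instead.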
Comment by nitin_flanker 2 days ago
Local LLMs sound so cool, but I know they won't be easy to set up or use for an average Joe like me.
Comment by Mashimo 2 days ago
And once set up, it's easy to use even for non-technical people.
Comment by cake-rusk 2 days ago
Comment by iliashad 2 days ago
Comment by crakhamster01 1 day ago
Being able to semantic search over your library is useful, but does it solve the review problem? I feel like you would still need to watch the footage back before you know what you're working with.
Comment by iliashad 1 day ago
Comment by zzsshh 2 days ago
Comment by havercosine 2 days ago
Comment by iliashad 2 days ago
Comment by tontonius 2 days ago
It comes with some nifty features like NLE integrations, people search, MCP, API, etc.
Disclaimer: one of the co-founders
Comment by ____tom____ 2 days ago
Other comments mention that DaVinci Resolve has this built in. How would you compare the two?
Comment by dotancohen 2 days ago
Comment by tontonius 2 days ago
Comment by dotancohen 2 days ago
Comment by asdfasgasdgasdg 2 days ago
Comment by iliashad 2 days ago
For the dog-barking videos, those are only the scenes that have a dog barking in the audio.
I'll keep adding more prompts and example videos, so keep an eye out for that.
Comment by asdfasgasdgasdg 2 days ago
Did you ever visit crazyguyonabike.com? A long time ago I had the pleasure of following the journey of a friend of a friend of a friend on that site:
https://www.crazyguyonabike.com/doc/?doc_id=2405
Stuff like that I guess?
Comment by tj-teej 1 day ago
The world and our discourse around it has changed so much over the past ten years and now with this kind of technology I'm so excited to be able to classify these images from my iCloud and start on the project.
Comment by WhitneyLand 2 days ago
Frame-level embedding is covering a lot, but it can miss out on a lot of action-related searches.
Comment by iliashad 2 days ago
Comment by iliashad 1 day ago
Comment by rho138 2 days ago
Comment by ____tom____ 2 days ago
I might be better off getting something with a beefy GPU on AWS or Google cloud.
Comment by lee_wc 2 days ago
When trying to read this article, the main website was throwing errors to CloudFlare unfortunately
Comment by iliashad 2 days ago
Comment by wferrell 2 days ago
Comment by Mawr 2 days ago
Yep. I had the same problem.
> Then, run the frame analysis pipeline [...] I have a face recognition plugin using my custom faces data, object detection, on-screen text, shot type, and scene description [...] we will have three vector DB collections that have all the information about our videos, like video location metadata, camera name, faces recognized, objects detected, on-screen text, transcription, description of each scene, and many more [...] we can get better indexed data if you use the advanced mode indexing to use the Qwen2.5-VL-7B-Instruct model to understand and describe your video much better, but at a slower indexing speed
Yeah, uhm... ok :)
If anyone else has a similar problem, the real solution is as follows:
1. When recording, if you witness an interesting moment worth saving later, press the power button — this will mark the current moment in the video as a chapter.
2. Find the chapters later when editing and cut them into clips.
3. You're done :)
This has two main benefits over the insanity above:
1. It's trivially simple instead of insanely complex and inefficient.
2. It will reliably catch all the stuff you find interesting, since you're the one doing the marking.
The downsides:
1. Doesn't work retroactively.
2. It may miss interesting stuff if you miss it at the time as well.
3. Only works for this use case.
4. Nerds won't salivate over your usage of cutting edge tech.
Comment by Noumenon72 2 days ago
Comment by tredre3 2 days ago
Comment by iliashad 2 days ago
Comment by PreownedPlaid 2 days ago
Comment by iliashad 2 days ago
Comment by m3kw9 2 days ago
Comment by iliashad 2 days ago
Comment by ingvay7 2 days ago
Comment by iliashad 2 days ago
Comment by synergy20 2 days ago
Comment by nyxtom 2 days ago
Comment by iliashad 2 days ago
Comment by PixComicOS 2 days ago
Comment by dadachi 2 days ago
Comment by knightops_dev 2 days ago
Comment by knightops_dev 2 days ago
Comment by volume_tech 2 days ago
Comment by tosief 2 days ago
Comment by aiexpo_app 1 day ago
Comment by GreenSalem 2 days ago
Her client was recording while committing the abhorrent crime. The criminal would otherwise have got off.
From my perspective, the GoPro camera produced a good outcome. Still, one has to wonder why anyone would record their criminal actions.
Comment by Yiin 2 days ago
Comment by GreenSalem 2 days ago
She would rather have done corporate law but did not have the academic credentials or the networks needed for a job at the likes of Latham & Watkins or White & Case.
Still it is good for society that criminals get the worst lawyers to defend them.
Comment by fennecfoxy 2 days ago
Comment by djmips 2 days ago