OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision

Posted by ternaus 11 days ago

Comments

Comment by plasticeagle 8 days ago

The thing I love about OpenCV is that it remains hands down the best library for simply loading images and video. I've never even used any of its fancy computer vision features, but if I need to load a video file and look at the pixels - which I did need to do recently for an art project - OpenCV does it in about four lines of code.

Comment by Joel_Mckay 7 days ago

Done a few projects with OpenCV over the years, and I agree it can be fun.

However, it has a few issues:

1. Patented algorithms that are effectively impossible to license in a commercial setting.

2. Permuted APIs that change how identically named functions behave across versions.

3. Hardware CUDA version coupling deprecating support every major release.

4. Inconsistent and contradictory documentation in the constant subtle permutations. Downstream projects tend to version lock the lib for really practical reasons.

5. A shift away from core C libraries like ImageMagick & V4l, and into C++ abstractions with legacy Swig wrapper libraries in Java or Python.

6. Perpetual-Beta culture means the library is unlikely ever to fully stabilize.

It is a fun library, until people actually try to deploy something serious. As users will often simply suggest using an old version release if there is a bug.

Everything from Build flags to the API documentation has never fully stabilized. ymmv =3

Comment by markusMB 7 days ago

Done a few projects with OpenCV myself, and your list of issues reads as if you throw OpenCV and opencv_contrib into the same bucket, which you shouldn't. And maybe your assessment is outdated here and there and it is time to look again.

- OpenCV is Apache license. Yes, it used to be more complicated.

- The only patented algorithm I am aware of, SIFT, used to be part of opencv_contrib. And the README in opencv_contrib would greet you with a warning that the code may not be fit for commercial use for various reasons. Only when the patent expired was it moved into OpenCV core.

- Same observation for Aruco marker detection, which was in contrib for a long time because the options to choose from were either not-well-maintained or GPL-licensed code. It is now in core OpenCV (and Apache).

- Despite its age, I think that OpenCV is still more than relevant today. And being part of modern languages like C++, Swig, Java and Python (and for years already) is part of that. Still I was surprised how long they maintained OpenCV 2 and 3.

- Over the past few years and releases, my impression was actually that the core API was very much stable(izing). Can't say what happened in contrib, or what it feels like when you treat core and contrib as one and a feature progresses from contrib to core.

- I do agree that I would usually check that a MINOR release wasn't actually a MAJOR release, breaking some API or behavior I was relying on. I am hoping that Version 5 is pulling the ambitions for doing things differently away from Version 4, so v4 can be used stably ;-)
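A minimal sketch of the kind of guard that check implies: fail fast when the installed release is not the one a project was validated against. The `TESTED` tuple and both helper names are hypothetical; in a real project the string would come from `cv2.__version__`. It only compares version numbers, so it cannot detect a behavior change inside a release.

```python
import re

# Hypothetical: the (major, minor) release this project was validated against.
TESTED = (4, 11)

def parse_version(version: str) -> tuple[int, ...]:
    """Turn '4.11.0' or '5.0.0-pre' into a tuple of ints, ignoring suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups() if part is not None)

def is_tested_release(version: str, tested: tuple[int, int] = TESTED) -> bool:
    """True only when major and minor both match the validated release."""
    return parse_version(version)[:2] == tested

if __name__ == "__main__":
    # In a real project this string would be cv2.__version__.
    for candidate in ("4.11.0", "4.12.0", "5.0.0-pre"):
        print(candidate, is_tested_release(candidate))
```

Comparing major and minor (rather than major only) is the conservative choice here, given the concern that a minor release can carry breaking changes.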

Comment by Joel_Mckay 7 days ago

My point was the release numbers are meaningless, as there is always something subtly broken even in the packaged versions. One can't just use the library beyond basic functionality without becoming involved in the code base.

Indeed, if your library dependency constellation works, some will static link to stabilize/freeze their project for more than a few months.

It wasn't that v3 was particularly good, but rather v4 was a mess. I predict v5 inherited that mess, and improved it... lol =3

Comment by Sesse__ 7 days ago

Also, performance is generally pretty low; I've been on projects where we rewrote OpenCV code into more-or-less obvious hand-rolled code and won 5x perf. The abstractions are generally a bit too thick and oriented around single pixels (which also makes the API a bit too verbose for my taste).

Comment by Joel_Mckay 7 days ago

Machine vision has always been resource intensive... and if you are doing trained ML projects the hardware choices are actually very limited.

To enable Intel TBB, CUDA, and CPU specific compiler optimizations... one will almost certainly need to re-build the library, and customize your application build.

Some tasks degrade in performance on a GPU, and others are 740 times faster... ymmv. =3

Comment by jrk 7 days ago

It's not that you need to turn on some extra library backends and rebuild, it's that the abstractions themselves are fundamentally at odds with hitting peak performance on many things so you have to rewrite your code.

Individual image processing operations are often very low arithmetic intensity. If you don't combine them into much larger subroutines—which are necessarily less generic and orthogonal—you spend all your time waiting on memory between every little op.
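A back-of-the-envelope sketch of that argument, under simplifying assumptions that are not from the comment: a single-channel float32 image, one FLOP per pixel per operation, and no cache reuse between separate operations. The function name is made up for illustration.

```python
BYTES_PER_PIXEL = 4  # assumption: single-channel float32 image

def arithmetic_intensity(num_ops: int, fused: bool) -> float:
    """FLOPs per byte of memory traffic for a chain of per-pixel ops.

    Assumes each op costs one FLOP per pixel. Unfused, every op reads the
    image from memory and writes a full intermediate back; fused, the whole
    chain reads the input once and writes the output once.
    """
    flops = num_ops
    passes = 1 if fused else num_ops
    bytes_moved = passes * 2 * BYTES_PER_PIXEL  # one read + one write per pass
    return flops / bytes_moved

if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, arithmetic_intensity(n, fused=False), arithmetic_intensity(n, fused=True))
```

Under these assumptions the unfused chain stays at 0.125 FLOPs per byte no matter how long it gets, while the fused version grows linearly with the number of operations.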

Comment by Joel_Mckay 7 days ago

> It's not that you need to turn on some extra library backends and rebuild

Our problem domains must obviously differ. Good luck =3

Comment by harrall 7 days ago

Agree with this too. OpenCV is functionally great but its constituent parts are written by many different people who all kind of do things a little differently and it shows.

But I can’t really complain because it’s open source and added to by contributors.

Comment by Joel_Mckay 7 days ago

One can... and should report when stuff is broken, or the project becomes worthless to all but one person's passing interest. =3

Comment by ranit 7 days ago

If this is a fact:

> 1. Patented algorithms that are effectively impossible to license in a commercial setting.

then does anyone know how "OpenCV has been the foundation of countless production systems" is possible, as the OP article claims?

Comment by nextaccountic 7 days ago

Software patents aren't a thing in most of the world

Comment by Joel_Mckay 7 days ago

One can legally use/static-link OpenCV in most commercial projects, and there were only a few legal landmines people still try to document when possible.

However, until each code area turns 17/21 no one knows for sure. It just looks normal at first, and $12k cheaper than MatLab server host licenses. =3

Comment by zdkl 7 days ago

Well I've deployed OpenCV based pipelines in academic contexts for site surveys and photogrammetry.

Comment by Joel_Mckay 7 days ago

There is a CLI photogrammetry OSS project with rather litigious faculty members behind the code. However, at least that group was upfront about what was expected of the library users, and didn't do something dodgy like quietly merge it into another community library like OpenCV.

I discovered that while porting it to a Pi ARM platform years ago (yes it was slow... lol.) Forgot when the IP becomes public domain, but you might want to check that out. If I recall it was unrelated to the COLMAP project design. =3

Comment by zdkl 7 days ago

They wouldn't happen to be french cartography would they?

Comment by Joel_Mckay 7 days ago

These should help narrow down the search, but ibfs commercial restrictions are now 404... and the original IP warnings seem missing/expired.

https://github.com/openMVG/openMVG/blob/develop/COPYRIGHT.md

https://github.com/cdcseacave/openMVS/blob/master/COPYRIGHT....

Personally, I recommend COLMAP + CloudCompare + MeshLab, but the Mozilla Public License 2.0 should address IP license issues if the author is also the rights holder. Keep in mind all work done by University Students and Staff is often property of the institution unless otherwise stated. It is a delicate subject.

Best of luck =3

Comment by zdkl 6 days ago

Thank you for the recommendations, I'll return one myself:

https://github.com/alicevision/AliceVision

https://github.com/alicevision/Meshroom

No affiliation, just an excellent tool

Comment by akssri 7 days ago

Yup, it's basically the ROS of computer-vision.

Comment by Joel_Mckay 7 days ago

vision_opencv has been part of ROS for a long time. Mind you many popular projects get integrated into ROS eventually. =3

Comment by fzysingularity 7 days ago

That’s a pretty large binary for simply loading images.

In all honesty, opencv has stood the test of time and I’m certain newer LLMs will likely not attempt to rewrite it from scratch.

P.S. I’ve been a user since the IplImage days, circa 2007, and I’d still consider using it over most CV libraries today.

Comment by tomkarho 7 days ago

> I’m certain newer LLMs will likely not attempt to rewrite it from scratch

Sooner or later a Rust developer will try.

Comment by zipy124 7 days ago

You can build it yourself and end up with a much smaller binary (and many more optimisations).

Comment by dheera 7 days ago

> best library for simply loading images and video

But not for saving video. That fourcc pile of crap doesn't open up in QuickTime player, the default Ubuntu video player, or anything anybody actually uses. I've always had to add an os.system("ffmpeg [ask llm to generate the command for you]") afterwards to fix anything that OpenCV generates.
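For reference, a hedged sketch of what that post-processing step might look like. It assumes `ffmpeg` is on the PATH and was built with libx264; the file names and both function names are placeholders, and this is one common re-encode recipe rather than the command the commenter used.

```python
import subprocess

def ffmpeg_reencode_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that re-encodes src to widely playable H.264."""
    return [
        "ffmpeg", "-y",             # overwrite dst if it exists
        "-i", src,
        "-c:v", "libx264",          # H.264 video
        "-pix_fmt", "yuv420p",      # pixel format most players expect
        "-movflags", "+faststart",  # move the index to the front of the MP4
        dst,
    ]

def reencode(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_reencode_cmd(src, dst), check=True)

if __name__ == "__main__":
    # Placeholder file names; this only prints the command.
    print(" ".join(ffmpeg_reencode_cmd("opencv_out.avi", "playable.mp4")))
```

Passing the arguments as a list to `subprocess.run` avoids the shell-quoting problems that `os.system` has with file names containing spaces.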

Comment by doctorpangloss 7 days ago

opencv file loading is crap. it will load images with the wrong gamma, it will give you floating point values that hide the limitation that it pretty much only loads colors in 8 bit, and it will not be able to save to anything useful.

Comment by deadbabe 7 days ago

What are you looking for in the pixels?

Comment by jaffa2 7 days ago

To see if it’s a ‘shop.

Comment by SEJeff 7 days ago

You win the internet friend

Comment by greenavocado 7 days ago

Shoop*

Comment by Geee 7 days ago

To see if it's fake.

Comment by escapecharacter 7 days ago

truth

Comment by lioeters 7 days ago

beauty

Comment by ftchd 8 days ago

> One practical detail is worth knowing. The new engine is CPU-only at the moment, so if you select a non-CPU backend and target (for example CUDA or OpenVINO through setPreferableBackend and setPreferableTarget), you will want the classic engine.

So there's room for even better performance!

Comment by wongarsu 8 days ago

It's certainly a choice to make your headline feature a new ONNX engine, feature a bunch of comparisons how it's better than ONNXRuntime, while casually mentioning on the side that the cool new much faster engine is CPU-only

Sure, running models on the CPU is very much a thing in computer vision (the benchmarked YOLOv8n has about 3.2M params). But this whole announcement feels more like OpenCV catching up to the modern world, not "The Biggest Leap in Years for Computer Vision"

Still great, needing fewer libraries is a good thing, but maybe a bit oversold

Comment by VadimPR 8 days ago

The release post is AI-written with little human oversight and it shows.

Comment by claytongulick 8 days ago

I had to stop reading after: "This is not just another incremental release. OpenCV 5 is a major step forward."

If a human can't be bothered to write a piece, I can't be bothered to read it.

Comment by danjc 8 days ago

It's not just annoying, it's tiring

Comment by VulgarExigency 8 days ago

The endless deluge of AI prose really wears on the soul once you start noticing it.

Comment by thin_carapace 8 days ago

i initially adopted this line of thinking. after exposure to arguably valid cases like translated articles, it now seems to me that the most efficient path forward (after first noting AI prose) is to scan past all language and evaluate whether or not useful content is encoded within. theres no benefit to anyone (except those benefitting from societal atrophy) in wasting brain cycles on unnecessary verbosity, however blanket rejection necessarily involves loss of valuable information.

Comment by kphorn 7 days ago

I think the only thing that the human did was remove the emdash between the two sentence fragments and replace with a period.

Comment by dismantlethesun 7 days ago

I felt that this was an indication that OpenCV had finally discovered SemVer.

Comment by vdfs 8 days ago

The illustrations couldn't be any more generic-ai

Comment by kphorn 7 days ago

my code, my commit - ugh

Comment by trklausss 8 days ago

This is what I hate about AI. Not that people use it, it's great to accelerate specific workflows, make fewer mistakes etc. It's just blindly trusting it and just saying "Make a post about a CV library release, make no mistakes" and calling it a day.

Where is the human creativity in writing release notes gone?

Comment by nnevatie 8 days ago

No one uses ONNXRuntime (nor the new engine in OpenCV 5) in production. For anything performance-sensitive, one would run models under TensorRT, as an example.

Comment by amorroxic 8 days ago

Curious on what backs this assertion. As a counterpoint we’ve been running 200+ models in production for more than 5 years - language models, embedding, classifiers, low tens to hundred M params. Traffic in the order of 1-2M requests/day and everything is enabled by onnx with some cgo (or Rust) plumbing on top. What’s your SLA?

Comment by nnevatie 7 days ago

Ahh, I should have probably added some context around my hyperbole. I was referring to real-time computer vision - think of e.g. segmenting FHD/UHD video.

Comment by snovv_crash 8 days ago

Strong statement to make when I have at least 2 datapoints contradicting it, in SaaS and embedded/robotics.

Comment by antonvs 8 days ago

We use this in production:

https://docs.rs/onnxruntime/latest/onnxruntime/

It’s a Rust wrapper around ONNX Runtime. We currently serve 5+ million inference requests per day for a highly performance-sensitive application, for a long list of major enterprise clients. We don’t use GPUs for inference, because it would be cost-prohibitive. We launch tens of thousands of VMs per day to run these workloads.

Comment by pzo 8 days ago

How are we supposed to use TensorRT on iOS, iPadOS, Android or even Web? Production is not only cloud.

Comment by OvervCW 8 days ago

You can use ONNXRuntime with a TensorRT backend, so one does not exclude the other.

Comment by gunalx 8 days ago

Production doesn't have to be performance sensitive, so devex may still outcompete the performance differences in some scenarios.

Comment by dTal 7 days ago

OpenTrack uses it for its AI headtracking, which works extremely well.

Comment by monster_truck 8 days ago

I've never understood how anyone comes into contact with it and thinks it's anything more than an incredible inconvenience masked as the easy way of doing things. Given it a few good shakes for various uses and regretted the time spent each time

Comment by cik 7 days ago

Ummm embedded robotics is all about this. For years.

Comment by pzo 8 days ago

Quite a good release, although I'm not sure why they invest so much time into their ONNX engine. I don't think they have enough staff or deep enough pockets to compete with ONNXRuntime, CoreAI, ExecuTorch, LiteRT.

I'm happy they added option for ONNXRuntime. I wish their cv.dnn was mostly that unified wrapper around many different backends (ONNXRuntime, Executorch, LiteRT, CoreAI) and maybe just some tooling around it (performance metrics tools, model downloads etc). Transformers(.js) approach looks better for me.

Wish they also invested more time into better production ready Camera I/O (for mobiles, device/format discovery, manual settings, depthmap support, etc) and better Highgui that could use different backends (skia, webgpu) and on mobiles.

Comment by GreenSalem 8 days ago

AI written release post and it shows...

Comment by oceansky 8 days ago

I can't say for sure, but there is a suspicious amount of "it's not x, it's y". At least there are no em-dashes.

Comment by Npovview 7 days ago

I think Technical posts should be written with 3 levels of audiences in mind. Expert, Middle, Beginner. But I guess that is not necessary, since AIs can cut the flab easily.

Comment by _qua 8 days ago

The diagrams definitely look like LLM output as well

Comment by M4v3R 7 days ago

The diagrams were generated with Nano Banana Pro (most probably, or alternatively with ChatGPT Image 2), if you look closely in high contrast areas you'll see artifacts in the background that give it away.

I personally don't mind AI generated content when it's properly reviewed, but unfortunately more often than not the author just glances at the result and decides it's good enough.

Example: https://opencv.org/wp-content/uploads/2026/06/image-1.jpeg

I'm not knowledgeable enough to determine whether this diagram is 100% accurate, but some things look off - the arrows in the bottom left seem superfluous, some arrows are connected in weird ways, the mini diagram in AttentionLayer block doesn't look right (it has two Softmax icons and one MatMul icon, while the "before" diagram is the opposite).

Comment by bl0b 7 days ago

Yeah that diagram is all over the place. The arrows on the left branching from the outline of the diagram itself?

Comment by saberience 7 days ago

Tested one of the diagrams: "Yes, the digital watermark indicates that most or all of this image was generated or edited using Google AI."

Comment by KolmogorovComp 7 days ago

how do you check?

Comment by saberience 7 days ago

Just go on Gemini and paste the photo into the chat and ask, it can use SynthID as a tool.

Comment by xdennis 7 days ago

Yeah, blatant "it's not x, it's y":

> This is not just another incremental release. OpenCV 5 is a major step forward.

Comment by killingtime74 7 days ago

Written by AI for AI?

Comment by jampekka 7 days ago

Indeed. Well written, clear, informative and to the point.

Comment by xpct 7 days ago

As of now, any human effort is still ~= quality. Human-written article signals to me that a certain amount of time was spent on it, which is a proxy for quality. This goes for both text and diagrams.

If someone slapped together an article from an LLM and a few internal documents, that tells me exactly how much they cared about it.

Comment by Aachen 7 days ago

So to-the-point that it comes with a table of contents. Idk if it needs saying that ToCs have legitimate uses, but the number of search results and blog posts having one since ~2022 is, eh, interesting. You come for whatever the headline was and you get a page with thousands of words, split up into five or more chapters, many of them overlapping or a rephrasing of the same question if you've hit a true content farm. This is not that, but I also can't fathom how one could argue that slop is concise as a hallmark

As for being well-written, does that refer to correct use of grammar and no typos, or do you mean that you find that bots write better than humans in any other way?

Comment by marknutter 7 days ago

It could be the best written, most informative article they've ever read, but anti-ai folks would dismiss it as slop the moment someone told them it was written by ai.

Comment by smt88 7 days ago

The problem is that we don’t know if a human fact-checked it before release or if we’re the first humans reading it closely.

Comment by jampekka 7 days ago

We don't really know that about human written text either.

Comment by smt88 7 days ago

Yes we do because a human literally had to write it. That’s at least one human pass and fact-check.

Comment by arcanine 8 days ago

They really improved the performance. I tested yolov8 medium segmentation model on intel i7 11th gen cpu.

Opencv 4.11 : ~255ms

Opencv 5.0.0 : ~185ms

with the same code.

Comment by bobmcnamara 7 days ago

Intel never really improved their memory controller and busses and it shows.

Comment by boredemployee 7 days ago

How can I learn the practical side of computer vision in 2026?

I'm not interested in understanding papers or the math behind it, but rather in how to put a system into production, whether it's object detection, running 20 cameras in parallel on a single computer, like sizing hardware for a specific task, and so on.

Any tips?

Comment by bonoboTP 7 days ago

By doing it. Decide on a small project, like tracking your cat, detecting food items in your fridge, then take it step by step.

Then do a slightly more ambitious project. Start with something very simple.

It also heavily depends on what you already know regarding programming, image processing etc.

Comment by kelvinjps10 7 days ago

Just start cooking, Python is easy and the bindings are not that hard

Comment by yayitswei 7 days ago

Try a coding agent for writing and tuning the OpenCV part, and have it explain its choices. That's probably the most practical path to shipping a working system.

Speaking from experience: never used OpenCV before, recently vibe coded a tool that makes supercuts of pool videos, trimming each clip from the cue ball's first strike to when the motion stops.

Comment by eastof 7 days ago

One of the great things about OpenCV is how ubiquitous it is: there's a ton of samples online and it's well represented in frontier model training data. I recently vibe-coded an object detector for my own personal photo library so I could separate out my pictures with humans in them. Very approachable with Codex + feeding it a sample from Github.

Comment by wolfgangK 6 days ago

OpenCV being in the list of Pyodide modules [0] was the biggest boon for my online teaching experience because remotely dealing with install woes (corporate proxies & co.) was a show stopper for regular Python. I'm hoping that they will package this new version and that it will bring the new neural network engine goodies to the no-install crowd!

[0] https://pyodide.org/en/latest/usage/packages-in-pyodide.html

Comment by globalnode 8 days ago

does this mean im actually able to try object detection in opencv now? i mean i know basic image processing techniques, and i know "in theory" how ML works but ive never really seen a case where i can just say "heres an image now detect all the apples". theres always 1. find a model that has the knowledge, 2. hook it up to an inference engine, 3. do something useful. i always get stuck at 1.

Comment by wongarsu 8 days ago

YOLO has basically solved that for my use cases for a couple years now. If you want labels that are not in the pretrained labels it's also easy to fine-tune, provided you're willing to label 200 or so images

If you need something less restricted to existing labels (say wanting all the red apples, or all cardboard signs) SAM3 is great, as the sibling comment says

Comment by IanCal 8 days ago

> provided you're willing to label 200 or so images

A quick note to say that this is also a task you can hand to things like gemini.

Comment by dekhn 7 days ago

Yep- this is what I do. I use a high quality VLM to generate labelled boxes (in my case, around tardigrades in a microscope image), do some light editing to fix the small number of errors, and then train YOLO26 with it. Works great, saved me tens of hours of labelling. It's a bit scary that there is a VLM that works as well as my fine-tuned model (although much slower).
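A small sketch of the glue step in that workflow: turning pixel-space boxes into YOLO-style label lines (class index, then box center and size normalized to the image dimensions). The input format, corner coordinates in pixels, is an assumption about what the VLM returns, and the function name is made up for illustration.

```python
def to_yolo_line(class_id: int, box: tuple[float, float, float, float],
                 img_w: int, img_h: int) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w   # normalised box centre
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w        # normalised box size
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

if __name__ == "__main__":
    # Hypothetical VLM output for one 400x200 image: class 0, one box.
    print(to_yolo_line(0, (100, 50, 300, 150), img_w=400, img_h=200))
```

One such line per object, written to a text file named after the image, is the usual layout; the "light editing" pass then only has to touch the lines that are wrong.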

Comment by globalnode 7 days ago

thats a fantastic strategy thank you, and thanks to all the other helpful posters as well here. do you have any tips for how to choose the base yolo model? or just any generic one will do?

Comment by IX-103 7 days ago

How do you handle object disambiguation with YOLO? All the examples I've played with have the problem where if two "cars" get too close to each other then the tracking IDs keep switching between them, meaning we'd need an additional kinetic model for disambiguation.
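To make the "additional kinetic model" concrete, here is a minimal sketch, not a description of what any YOLO tracker does internally: predict each track one frame ahead with constant velocity, then greedily pair tracks with the detections closest to the predictions. The track dictionary layout and function names are hypothetical, and a real system would more likely use a Kalman filter with proper assignment.

```python
import math

def predict(track: dict) -> tuple[float, float]:
    """Constant-velocity guess for where the track will be next frame."""
    (x, y), (vx, vy) = track["pos"], track["vel"]
    return (x + vx, y + vy)

def match(tracks: dict[int, dict],
          detections: list[tuple[float, float]]) -> dict[int, int]:
    """Greedily pair track IDs with detection indices by predicted distance."""
    pairs = sorted(
        (math.dist(predict(t), d), tid, j)
        for tid, t in tracks.items()
        for j, d in enumerate(detections)
    )
    assigned, used = {}, set()
    for _, tid, j in pairs:
        if tid not in assigned and j not in used:
            assigned[tid] = j
            used.add(j)
    return assigned

if __name__ == "__main__":
    # Two cars about to pass each other along the x axis.
    tracks = {
        1: {"pos": (0.0, 0.0), "vel": (10.0, 0.0)},
        2: {"pos": (12.0, 0.0), "vel": (-10.0, 0.0)},
    }
    detections = [(2.0, 0.0), (10.0, 0.0)]
    print(match(tracks, detections))
```

In this example, matching on last known position alone would hand track 1 (at x=0) the detection at x=2 and swap the IDs; matching on the predicted position keeps each ID on the car that was actually moving that way.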

Comment by fnands 8 days ago

That seems to be the way things are going.

Large general models have taken over in NLP, and (outside of embedded/low latency applications) it seems like they are coming for CV next.

So you should soon be able to have large generic model that can detect whatever for you.

It's already pretty much possible with open-vocabulary detectors like SAM3, where you could just prompt it with "Apple": https://ai.meta.com/research/sam3/

Comment by Npovview 7 days ago

Roboflow is your friend.

Comment by shenberg 8 days ago

moondream is a beast

Comment by shelled 8 days ago

A few years ago I was using OpenCV in a commercial Android SDK (it might still be in use; partly because iOS provided almost all of those "needs" ready-made and Android just didn't, and neither did Firebase or the Jetpack suites/tools). I was the one who had added it to the SDK. There was a lot I/we could do, but as an Android developer (with barely any exposure to CV or even C/C++), what I felt we lacked was documentation and a community. We struggled even with shaving off the parts that we did not want to ship with our SDK. Speed was such an issue. The problem was that, for someone who just wanted to use the lib (on mobile), a lot of things felt esoteric and out of reach, i.e. difficult. It didn't have to be. Sadly, LLMs weren't at full speed back then: barely usable, not even talked about. Something like this would have been a perfect use case for AI/LLMs: a coder, not from the specific field the tool was made for, but able to take full advantage of its capabilities in a nuanced/selective manner.

Comment by ternaus 6 days ago

Image augmentations library Albumentations is heavily based on OpenCV, which allows it to beat torchvision, Kornia, PIL, and other similar libraries.

But there is still a huge room for improvement in terms of performance, as for some low level operations StringZilla or Numkong are faster, for some, especially for float32 images, numpy is the best.

The most annoying component is that OpenCV is limited to input shapes like (H, W, C), which limits its application to videos and volumes with shapes (X, H, W, C)

Comment by trollbridge 7 days ago

Great to hear.

OpenCV was so easy and smooth to set up for doing tasks like generating thumbnails from arbitrary photo uploads regardless of format (including funky new formats like webp, avif, or heic).

Comment by riazrizvi 7 days ago

I guess I'm the only one blown away by this announcement and super excited to get back into image processing.

Comment by johnAthan_ 6 days ago

judging by the amount of upvotes, it is unlikely you're the only one.

Comment by hbcondo714 10 days ago

> LLMs and VLMs, Running Inside OpenCV…Qwen 2.5, Gemma 3, PaliGemma, and the GPT-2 / GPT-4 family

Why these specific models / versions?

Comment by mkl 7 days ago

Yes, it's weird that they're so old.

Comment by ge96 7 days ago

I remember trying to do photo stitching myself (panoramas) then I failed miserably but it's built into opencv ha. I've used quite a bit of OpenCV features eg. laplace variance for an automatic zoom/focusing mechanical lens camera system (steppers) and contour/blob finding for crude color segmentation.
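For anyone curious what the Laplacian-variance focus measure does, here is a dependency-free sketch on a nested-list grayscale image. With OpenCV the usual form is `cv2.Laplacian(gray, cv2.CV_64F).var()`; the version below skips border pixels, so its numbers will not match that call exactly, and the function name is made up for illustration.

```python
from statistics import pvariance

def laplacian_variance(img: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

if __name__ == "__main__":
    sharp = [[255.0 * ((x + y) % 2) for x in range(8)] for y in range(8)]  # checkerboard
    flat = [[128.0] * 8 for _ in range(8)]                                 # no detail
    print(laplacian_variance(sharp), laplacian_variance(flat))
```

A focusing loop would step the lens, recompute this score on each frame, and keep the position where it peaks, since sharper edges produce larger Laplacian responses.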

Comment by mattcox12 5 days ago

I'm more curious about the hardware acceleration on Snapdragon and ARM. That matters way more for real deployment than any new DNN feature.

Comment by dadachi 6 days ago

Same on mobile. I use Apple's Vision framework on-device to find people in photos for a printing app. Sending users' personal photos to an image-model API is a non-starter on privacy, latency, and per-photo cost alone. Less flexible than a V-LLM, but for "find the people, give me a box" it's instant, free, and works offline.

Comment by maelito 8 days ago

Can it detect the speed of the car without any hand-made measurement ?

Comment by MaxikCZ 8 days ago

In pixels/second? Sure!

Comment by monster_truck 8 days ago

Do you know the focal length/AOV of your webcam?

Comment by brk 7 days ago

That would be pretty hard to do with any level of accuracy without external calibration/input.

Comment by sixothree 7 days ago

Do you even have one known reference?
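The replies above boil down to needing a scale. A minimal sketch of the simplest calibration, assuming one object of known length lies at the same distance as the car, the car moves parallel to the image plane, and lens distortion is negligible. All numbers and function names are hypothetical.

```python
def metres_per_pixel(ref_length_m: float, ref_length_px: float) -> float:
    """Scale from one object of known size lying in the plane of motion."""
    return ref_length_m / ref_length_px

def speed_m_per_s(px_per_s: float, ref_length_m: float, ref_length_px: float) -> float:
    """Convert image-space speed to metres per second using that scale."""
    return px_per_s * metres_per_pixel(ref_length_m, ref_length_px)

if __name__ == "__main__":
    # Hypothetical numbers: a 4.5 m car spans 150 px and moves 200 px/s.
    v = speed_m_per_s(200.0, ref_length_m=4.5, ref_length_px=150.0)
    print(f"{v:.1f} m/s = {v * 3.6:.1f} km/h")
```

Without a reference object, the same scale can be derived from the camera's focal length in pixels and the distance to the road, which is why the focal length question matters.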

Comment by wiradikusuma 7 days ago

Curious how do people usually use OpenCV with CCTV? (Use cases)

Comment by owenpalmer 7 days ago

> This is not just another incremental release. OpenCV 5 is a major step forward.

Am I the only one that finds this sentence very cheesy?

Comment by ternaus 6 days ago

On the question of OpenCV performance: here is a benchmark of JPEG image decoding, both sequentially and as part of a PyTorch DataLoader.

TL;DR

OpenCV is fast, but torchvision is faster.

https://arxiv.org/abs/2605.08731

Comment by hdgvhicv 7 days ago

That page doesn't say “what is computer vision”

Comment by maxdo 7 days ago

Curious how many people this model has killed on the battlefields of Ukraine and Russia.

Comment by Magnets 8 days ago

The announcement itself is pure AI slop

Comment by thunky 8 days ago

What about the post was not up to your standards?

Comment by xdennis 7 days ago

It's not just AI slop, it's a milestone AI post.

gptzero.me rates it 91% AI, 9% mixed, and 0% human. (I've only pasted a portion of the text to fit in the 10000 character free limit.)

Comment by thunky 7 days ago

How is this relevant to my question?

Comment by charankilari 8 days ago

wow its been ages

Comment by leoncos 11 days ago

When I use Codex/Claude to complete a computer vision task, such as extracting assets from an image, OpenCV is their default solution. However, I believe that using YOLO and other methods is outdated. The best solution now is to directly use Nano Banana or other AI image models. A paper has proven that image generation models can perform most CV tasks well. I believe the new OpenCV should become a wrapper for VLM or AI image models.

Comment by nicolailolansen 8 days ago

Whenever you can run a model like Nano Banana or other vision-LLM with the same compute and time performance/restrictions as an OpenCV or YOLO call, you can make that comparison. Until then, I would not call YOLO and OpenCV outdated, it's simply wrong. There's a time and place for big V-LLMs just as there is a time and place for more "traditional" computer vision methods.

Comment by wongarsu 8 days ago

I can get great results from a YOLO model with 30M to maybe 300M params. To get decent CV from a LLM 8B params is the absolute minimum, closer to 30B for interesting tasks

I might be on board about LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit

Comment by IanCal 8 days ago

They can however be extremely useful for curating training data. Also things like SAM and the DINO (/grounding dino) models.

Also if they are better then you can also have a flow that’s cheap model -> marginal cases go to more complex thing (and a chain of these).
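A sketch of that flow under stated assumptions: both models are plain callables returning a label and a confidence, and a fixed threshold decides what counts as a marginal case. The names, the threshold value, and the stand-in models are all hypothetical; they are not real YOLO or VLM APIs.

```python
from typing import Callable

Prediction = tuple[str, float]  # (label, confidence in [0, 1])

def cascade(image: object,
            cheap: Callable[[object], Prediction],
            expensive: Callable[[object], Prediction],
            threshold: float = 0.8) -> tuple[str, str]:
    """Return (label, which_model), escalating only low-confidence cases."""
    label, confidence = cheap(image)
    if confidence >= threshold:
        return label, "cheap"
    label, _ = expensive(image)
    return label, "expensive"

if __name__ == "__main__":
    # Stand-in models: hypothetical callables, not real YOLO or VLM APIs.
    cheap = lambda img: ("apple", 0.95) if img == "clear" else ("apple", 0.40)
    expensive = lambda img: ("pear", 0.99)
    print(cascade("clear", cheap, expensive))
    print(cascade("blurry", cheap, expensive))
```

Chaining more stages is the same pattern repeated: each stage either answers or passes the input along, so the expensive model only sees the fraction of inputs the cheaper ones were unsure about.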

The yolo models are really shockingly good for their cost and how well they can work with not much training data as well.

Comment by charcircuit 8 days ago

>for very limited benefit

Due to how simple they are to work with they will become popular. Compare NLP before and after GPT-3. GPT-3 majorly brought down the complexity and skill needed for doing NLP tasks even if traditional NLP is much much faster. Ultimately ease of development will win out and the industry will work towards optimizing running such LLMs to make it cheap enough to run.

Comment by regularfry 8 days ago

I've built hardware with a pi zero 2 + pi cam running a mildly fine-tuned YOLO doing local-only object detection as a USB-OTG device, in a use case where any off-device API calls would have been totally unacceptable, and where the object detection was part of the human interaction loop with a hard ceiling of 300ms on the total interaction time of which the object detection was only one process among many.

We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.

Comment by Hendrikto 8 days ago

> API calls just aren't on the menu

Even if they were an option, your 300ms latency requirement would exclude them anyway.

Comment by mirsadm 8 days ago

That is a very uninformed view. Real time CV is not going to be doing that anytime soon.

Comment by sebmellen 8 days ago

Great, let me know when those models can run on-server and process/analyze streams of ID images with less than 100ms of latency. You’ll need to make sure you have a massive set of training data including all manner of slightly blurred and slightly distorted ID cards.

Comment by _the_inflator 8 days ago

Exactly, and all on an embedded system with quite restrictive settings, not an overclocked latest-generation Intel combined with NVIDIA's 10k graphics cards.

Comment by charcircuit 8 days ago

Embedded systems can make network calls to powerful, GPU-equipped servers.

Comment by ceejayoz 7 days ago

Sure. Claude does that. "Cogitated for 1m 50s" doesn't work for real-time applications.

Comment by charcircuit 7 days ago

You can submit many queries in parallel to increase throughput. Smaller models and faster hardware can reduce the time per query too.

Comment by ceejayoz 7 days ago

None of that gets you the 100ms response time the parent poster talked about, for something like "who is at my doorbell?" real-time uses.
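
To put numbers on the throughput-vs-latency distinction, here is a back-of-the-envelope sketch. The 2-second per-query latency and the 64 parallel requests are invented figures for illustration only; the 100ms deadline is the one mentioned upthread.

```python
def throughput_per_second(latency_s: float, parallel_requests: int) -> float:
    """Queries completed per second when requests overlap perfectly."""
    return parallel_requests / latency_s


def meets_deadline(latency_s: float, deadline_s: float) -> bool:
    """A single request meets the deadline only if its own latency fits."""
    return latency_s <= deadline_s


# Assumed figures: each query takes 2 s, and 64 run in parallel.
latency_s = 2.0
throughput = throughput_per_second(latency_s, parallel_requests=64)

# Parallelism raised throughput, but every individual answer still takes 2 s.
on_time = meets_deadline(latency_s, deadline_s=0.1)
```

More parallel requests scale `throughput` linearly, while `on_time` stays false until the per-query latency itself drops below the deadline.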

Comment by sebmellen 7 days ago

Ok. Claude will not work for this use case because none of the sample data (weirdly blurry ID images) is in the training data.

Comment by Chu4eeno 7 days ago

They really shouldn't, though.

Comment by charcircuit 7 days ago

It can offer a ton of user value. There is a whole industry built upon this idea, Internet of Things.

Comment by ceejayoz 7 days ago

IoT wasn't built on "send all the data off to a hosted GenAI". It predated them by quite a few years.

Comment by charcircuit 7 days ago

The GPUs were doing video transcoding instead of GenAI.

Comment by ceejayoz 7 days ago

You can run OpenCV on a GPU-less Raspberry Pi or other IoT device just fine.

And most IoT devices aren't doing video transcoding at all. You're making some very odd assertions in this thread.

Comment by charcircuit 7 days ago

>And most IoT devices aren't doing video transcoding at all.

The data gets streamed to the cloud where servers with GPUs transcode it. I'm pointing out that IoT devices historically have reached out to servers with GPUs even before GenAI.

Comment by ceejayoz 7 days ago

Most IoT devices have no camera and communicate with servers that have no need for a GPU at all.

Comment by serf 10 days ago

do you realize how many edge or unconnected nodes do OpenCV work?

some SBC w/ an industrial camera that is doing pick-place or go/no-go operations on a conveyor belt against a singular object type doesn't need a huge image-gen/llm model governing it.

I mean have you even considered the kind of performance an OpenCV function can get w/ just mask-matching? Even with a fancy YOLO model these answers get thrown out in 1.5-50 ms; this is just a wholly different time scale.

Comment by Qhemlomo 8 days ago

100,000 pictures take a lot of time with LLMs.

It's a lot better, faster, and cheaper to use LLMs for initial labeling together with hand fine-tuning, and then train YOLO on the result.

Training YOLO takes a few hours and is then very fast.

Comment by kryptiskt 8 days ago

If I want to identify and measure the size of round things in my orange sorter machine, I shouldn't have to resort to an unnecessarily complicated solution just because some AI bros can't understand that not everything needs to be an AI model.

Like, the AI model tools already exist; all that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.

Comment by _the_inflator 8 days ago

"When I use..."

Dude, in business we think in terms of large numbers; internationally that easily means processing images billions of times. This wouldn't cut it.

Also, do you buy the mega expensive, individually designed shoes from the best shoemaker there is to march through some dirt, or do you simply stick to gumboots?

OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not just an LLM anymore.

Comment by TZubiri 8 days ago

I am confused, how can functions that output images help with functions that should take images as input?

Comment by taneq 8 days ago

They’re multimodal LLMs trained for image generation. Turns out that if you want to generate images you gotta know what things look like.

Comment by TZubiri 8 days ago

That's not helpful my brother. If you have details share them, if not, don't pretend you are more illuminated than me.

Is the image(text) function reversible? Or are they brute-force searching for a nearest neighbor, like word2vec or hash brute forcing?

Comment by sorenjan 8 days ago

Google recently released their paper "Image Generators are Generalist Vision Learners" about exactly this. They fine-tuned Nano Banana Pro into what they call Vision Banana, which can do segmentation etc.

https://arxiv.org/abs/2604.20329

Comment by TZubiri 7 days ago

Very interesting. It seems that they use image(image,text) functions to process/filter images, effectively generating an arbitrary bitmap(image), where the bitmap has the same dimensions as the image.

Comment by oliveiracwb 8 days ago

Computer vision was the formative school for many autodidacts. Although I acquired substantial knowledge from articles translated via Power Translator and Babylon (whose outputs closely mirror those of any 2-million-parameter SLM), it was OpenCV that made concepts like convolutions, softmax, minmax, and others finally click for me. I have consistently viewed OpenCV as an intrinsically open, educational, and adaptable library. Any developer can dissect its codebase to extract a specific filter or algorithmic implementation and tailor it to their requirements. It is certainly not cruising at the velocity of trillion-dollar capital. But it holds its altitude. And it will always be there.