Gemini 3 Pro: the frontier of vision AI
Posted by xnx 7 days ago
Comments
Comment by Workaccount2 7 days ago
It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog, specifically a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
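For the curious, that kind of script is roughly a colour threshold plus a connected-component count. Here is a minimal sketch of the idea (not GPT-5's actual code); the HSV range, crop, and file name are made up:
```
# Minimal sketch: count "leg" regions by thresholding golden-fur pixels
# against green grass. Thresholds and file name are illustrative only.
import cv2

img = cv2.imread("dog.jpg")                      # hypothetical input image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Rough "golden fur" range in HSV; would need tuning per photo.
fur_mask = cv2.inRange(hsv, (10, 60, 80), (35, 255, 255))

# Look only at the bottom strip of the image, where feet meet grass.
h, w = fur_mask.shape
strip = fur_mask[int(h * 0.85):, :]

# Count connected fur blobs in the strip; each blob ~ one leg touching grass.
num_labels, _ = cv2.connectedComponents(strip)
print("leg-like regions found:", num_labels - 1)  # minus the background label
```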
Anyway, Gemini 3, while still being unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.
Comment by Rover222 7 days ago
Then I asked both Gemini and Grok to count the legs, both kept saying 4.
Gemini just refused to consider it was actually wrong.
Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."
You're right, this is a good test. Just when I'm starting to feel LLMs are intelligent.
Comment by theoa 6 days ago
Gemini responds:
Conceptualizing the "Millipup"
https://gemini.google.com/share/b6b8c11bd32f
Draw the five legs of a dog as if the body is a pentagon
https://gemini.google.com/share/d74d9f5b4fa4
And animal legs are quite standardized
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...
It's all about the prompt. Example:
Can you imagine a dog with five legs?
https://gemini.google.com/share/2dab67661d0e
And generally, the issue sits between the computer and the chair.
;-)
Comment by vunderba 6 days ago
Asymmetry is as hard for AI models as it is for evolution to "prompt for" but they're getting better at it.
Comment by Rover222 6 days ago
Comment by ithkuil 6 days ago
This happens all the time with humans. Imagine you're at a call center and get all sorts of weird descriptions of problems with a product: every agent is expected not to assume the caller is an expert, and will actually try to work out what they might mean by the weird wording they use.
Comment by macNchz 7 days ago
Comment by RestartKernel 7 days ago
https://gemini.google.com/share/b3b68deaa6e6
I thought giving it a setting would help, but just skip that first response to see what I mean.
Comment by raw_anon_1111 6 days ago
https://chatgpt.com/share/6933c848-a254-8010-adb5-8f736bdc70...
This is the SVG it created.
Comment by vunderba 7 days ago
Place sneakers on all of its legs.
It'll get this correct a surprising number of times (tested with BFL Flux2 Pro, and NB Pro).
Comment by tensegrist 7 days ago
Comment by Lamprey 7 days ago
I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.
I'll also note that photos of dogs with shoes on are definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.
Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"
Comment by vunderba 7 days ago
In other words:
1. Took a personal image of my dog Lily
2. Had NB Pro add a fifth leg using the Gemini API
3. Downloaded image
4. Sent image to BFL Flux2 Pro via the BFL API with the prompt "Place sneakers on all the legs of this animal".
5. Sent image to NB Pro via Gemini API with the prompt "Place sneakers on all the legs of this animal".
So not only was there zero "continual context", it was two entirely different models as well to cover my bases.
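For reference, step 5 could be sketched roughly like this with the google-genai Python SDK; the model id and file names are placeholders rather than the exact ones used above:
```
# Rough sketch of step 5, assuming the google-genai Python SDK.
# Model name and paths are placeholders, not the exact ones used above.
from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment

image = Image.open("dog_with_fifth_leg.png")
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",   # placeholder "NB Pro" model id
    contents=[image, "Place sneakers on all the legs of this animal."],
)

# Save any returned image parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("dog_with_sneakers.png", "wb") as f:
            f.write(part.inline_data.data)
```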
EDIT: Added images to the Imgur for the following prompts:
- Place red Dixie solo cups on the ends of every foot on the animal
- Draw a red circle around all the feet on the animal
Comment by dwringer 7 days ago
Comment by Rover222 7 days ago
Comment by AIorNot 7 days ago
It's rather like how, as humans, we are RL'd like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister).
I.e., we all have trained biases that we are told to follow and trained on; human art is about subverting those expectations.
Comment by majormajor 7 days ago
RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.
If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.
Comment by mlinhares 7 days ago
Comment by didgeoridoo 7 days ago
Comment by Lamprey 7 days ago
"The researchers feed a picture into the artificial neural network, asking it to recognise a feature of it, and modify the picture to emphasise the feature it recognises. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition."
Comment by HardCodedBias 6 days ago
And the AI has been RLed for tens of thousands of years not just a few days.
Comment by squigz 7 days ago
Comment by Rover222 6 days ago
Comment by tarsinge 6 days ago
Comment by irthomasthomas 7 days ago
Comment by adastra22 7 days ago
Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data.
Comment by runarberg 7 days ago
LLMs are in fact good at generalizing beyond their training set; if they didn't generalize at all, we would call that over-fitting, and that is not good either. What we are talking about here is simply a bias, and I suspect biases like these are simply a limitation of the technology. Some of them we can get rid of, but, like almost all statistical modelling, some biases will always remain.
Comment by adastra22 6 days ago
In which case the only way I can read your point is that hallucinations are specifically incorrect generalizations. In which case, sure if that's how you want to define it. I don't think it's a very useful definition though, nor one that is universally agreed upon.
I would say a hallucination is any inference that goes beyond the compressed training data represented in the model weights + context. Sometimes these inferences are correct, and yes we don't usually call that hallucination. But from a technical perspective they are the same -- the only difference is the external validity of the inference, which may or may not be knowable.
Biases in the training data are a very important, but unrelated issue.
Comment by runarberg 6 days ago
Interpolation is a much narrower construct than generalization. LLMs are fundamentally much closer to curve fitting (where interpolation is king) than they are to hypothesis testing (where samples are used to describe populations), though they certainly do something akin to the latter too.
The bias I am talking about is not a bias in the training data, but bias in the curve fitting, probably because of maladjusted weights, parameters, etc. And since there are billions of them, I am very skeptical they can all be adjusted correctly.
Comment by adastra22 6 days ago
As for bias, I don’t see the distinction you are making. Biases in the training data produce biases in the weights. That’s where the biases come from: over-fitting (or sometimes, correct fitting) of the training data. You don’t end up with biases at random.
Comment by IsTom 6 days ago
I'm not particularly well-versed in LLMs, but isn't there a step in there somewhere (latent space?) where you effectively interpolate in some high-dimensional space?
Comment by adastra22 6 days ago
The LLM uses attention and some other tricks (attention, it turns out, is not all you need) to build a probabilistic model of what the next token will be, which it then samples. This is much more powerful than interpolation.
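A bare-bones illustration of "build a distribution over the next token, then sample it", with made-up logits standing in for the model's final layer:
```
# Minimal sketch of "model a next-token distribution, then sample from it".
# The logits here are made up; in a real LLM they come out of the final layer.
import numpy as np

vocab = ["the", "dog", "has", "four", "five", "legs"]
logits = np.array([1.2, 0.4, 0.1, 2.5, 0.3, 0.9])   # illustrative scores

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax

rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(next_token)   # "four" is most likely, but "five" is still possible
```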
Comment by runarberg 6 days ago
As for bias, sampling bias is only one of many types of bias. I mean, the UNIX program yes(1) has a bias towards outputting the string "y" despite not sampling any data. You can very easily and deliberately program a bias into anything you like. I am writing a kanji learning program using SRS and I deliberately bias new cards towards the end of the review queue to help users with long review queues empty them quicker. There is no data which causes that bias; you just program it in there.
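A toy sketch of that kind of deliberately programmed, data-free bias (the code is illustrative, not from the actual program):
```
# Sketch of a deliberate, data-free bias: push brand-new cards
# toward the back of the review queue.
from dataclasses import dataclass

@dataclass
class Card:
    kanji: str
    reviews: int   # 0 means the card is new

def order_queue(queue: list[Card]) -> list[Card]:
    # Seen cards keep their relative order at the front;
    # new cards (reviews == 0) are biased toward the end.
    return sorted(queue, key=lambda c: c.reviews == 0)

queue = [Card("水", 0), Card("火", 3), Card("木", 0), Card("金", 7)]
print([c.kanji for c in order_queue(queue)])   # ['火', '金', '水', '木']
```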
I don't know enough about diffusion models to know how biases can arise, but with unsupervised learning (even though sampling bias is indeed very common) you can get a bias because you are using wrong, maladjusted, or too many parameters, etc. Even the way your data interacts during training can cause a bias; heck, even by random chance one of your parameters can hit an unfortunate local maximum, yielding a maladjusted weight, which may cause bias in your output.
Comment by adastra22 6 days ago
It’s a subtle distinction, but I think an important one in this case, because if it was interpolation then genuine creativity would not be possible. But the attention mechanism results in model building in latent space, which then affects the next token distribution.
Comment by runarberg 6 days ago
My reason for subscribing to the latter camp is that when you have a distribution and you fit things according to that distribution (even when the fitting is stochastic, and even when the distribution lives in billions of dimensions), you are doing curve fitting.
I think the one extreme would be a random walk, which is obviously not curve fitting, but if you draw from any distribution other than the uniform distribution, say the normal distribution, you are fitting that distribution (actually, I take that back, the original random walk is fitting the uniform distribution).
Note I am talking about inference, not training. Training can be done using all sorts of algorithms; some include priors (distributions) and would be curve fitting, but compute only the posteriors (also distributions). I think the popular stochastic gradient descent does something like this, so it would be curve fitting, but the older evolutionary algorithms just random-walk it and are not fitting any curve (except the uniform distribution). What matters to me is that the training arrives at a distribution, which is described by a weight matrix, and what inference is doing is fitting to that distribution (i.e. the curve).
Comment by adastra22 6 days ago
Except in the most technical sense that any function constrained to meet certain input/output values is an interpolation. But that is not the smooth interpolation that seems to be implied here.
Comment by CamperBob2 7 days ago
Comment by Zambyte 7 days ago
Comment by irthomasthomas 7 days ago
Comment by CamperBob2 6 days ago
The systems already absorb much more complex hierarchical relationships during training, just not that particular hierarchy. The notion that everything is made up of smaller components is among the most primitive in human philosophy, and is certainly generalizable by LLMs. It just may not be sufficiently motivated by the current pretraining and RL regimens.
Comment by Rover222 7 days ago
Comment by visioninmyblood 6 days ago
https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f
Here is the thought process summary (you can see the full thinking at the link above):
"I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."
Comment by qnleigh 7 days ago
(Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).
Comment by cyanmagenta 7 days ago
Comment by FeepingCreature 7 days ago
Also my bet would be that video capable models are better at this.
Comment by qnleigh 6 days ago
So back to the analogy, it could be as if the LLMs experience the equivalent of a very intense optical illusion in these cases, and then completely fall apart trying to make sense of it.
Comment by nearbuy 6 days ago
Comment by SecretDreams 7 days ago
Comment by dostick 5 days ago
Comment by varispeed 7 days ago
Comment by criddell 6 days ago
Comment by DANmode 6 days ago
What is " a dog" to Gemini?
Comment by isodev 6 days ago
LLMs are fancy “lorem ipsum based on a keyword” text generators. They can never become intelligent … or learn how to count or do math without the help of tools.
It can probably generate a story about a 5 legged dog though.
Comment by Benjammer 7 days ago
I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".
Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?
More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting; it feels like there's tons of very human-like subtext involved, and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense. This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them.
Comment by majormajor 7 days ago
Let's not say that the people being deceptive are the people who've spotted ways that that is untrue...
Comment by biophysboy 7 days ago
Comment by Benjammer 7 days ago
Comment by menaerus 6 days ago
Comment by michaelmrose 6 days ago
In actual situations you have documentation, editor, tooling, tests, and are a tad less distracted than when dealing with a job interview and all the attendant stress. Isn't the fact that he actually produces quality code in real life a stronger signal of quality?
Comment by menaerus 5 days ago
Comment by biophysboy 7 days ago
Comment by genrader 7 days ago
Comment by runarberg 7 days ago
LLMs don't have cognition. LLMs are statistical inference machines which predict an output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved, only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.
When I see these tests, I presume they are made to demonstrate the limitations of this technology. This is both relevant and important: consumers should know they are not dealing with magic and are not being sold a lie (in a healthy economy a consumer protection agency would ideally do that for us; but here we are).
Comment by Benjammer 7 days ago
Categories of _what_, exactly? What word would you use to describe this "kind" of which LLMs and humans are two very different "categories"? I simply chose the word "cognition". I think you're getting hung up on semantics here a bit more than is reasonable.
Comment by runarberg 7 days ago
Precisely. At least apples and oranges are both fruits, and it makes sense to compare e.g. the sugar contents of each. But an LLM model and the human brain are as different as the wind and the sunshine. You cannot measure the windspeed of the sun and you cannot measure the UV index of the wind.
Your choice of words here was rather poor in my opinion. Statistical models do not have cognition any more than the wind has ultraviolet radiation. Cognition is a well-studied phenomenon; there is a whole field of science dedicated to it. And while the cognition of animals is often modeled using statistics, statistical models in themselves do not have cognition.
A much better word here would be "abilities". That is, these tests demonstrate the different abilities of LLM models compared to human abilities (or even the abilities of traditional [specialized] models, which often do pass these kinds of tests).
Semantics often do matter, and what worries me is that these statistical models are being anthropomorphized way more than is healthy. People treat them like the crew of the Enterprise treated Data, when in fact they should be treated like the ship's computer. And I think this is because of a deliberate (and malicious/consumer-hostile) marketing campaign from the AI companies.
Comment by Workaccount2 6 days ago
If we stay on topic, it's much harder to do since we don't actually know how the brain works, beyond, at least, that it is a computer doing (almost certainly) analog computation.
Years ago I built a quasi-mechanical calculator. The computation was done mechanically, and the interface was done electronically. From a calculator's POV it was an abomination, but a few abstraction layers down they were both doing the same thing, albeit with my mecha-calc being dramatically worse at it.
I don't think the brain is an LLM the way my mecha-calc was a (slow) calculator, but I also don't think we know enough about the brain to firmly put it many degrees away from an LLM. Both are in fact electrical signal processors with heavy statistical computation. I doubt you believe the brain is a trans-physical magic soul box.
Comment by runarberg 6 days ago
I don't believe the brain is a trans-physical magic soul box, nor do I think an LLM is doing anything similar to the brain (apart from some superficial similarities; some of them [like the artificial neural network] are in LLMs because they were inspired by the brain).
We use the term cognition to describe the intrinsic properties of the brain, and how it transforms stimulus to a response, and there are several fields of science dedicated to study this cognition.
Just to be clear, you can describe the brain as a computer (a biological computer, totally distinct from digital, or even mechanical, computers), but that will only be an analogy, or rather, you are describing the extrinsic properties of the brain, some of which it happens to share with some of our technology.
---
1: Note, not an artificial neural network, but an OG neural network. AI models were largely inspired by biological brains, and in some parts model brains.
Comment by Benjammer 7 days ago
Comment by runarberg 7 days ago
What I am trying to say is that the intrinsic properties of the brain and an LLM are completely different, even though the extrinsic properties might appear the same. This is also true of the wind and the sunshine. It is not unreasonable to argue (though I would disagree) that "cognition" is almost by definition the sum of all intrinsic properties of the human mind (I would disagree only on the grounds that animal and plant cognition exist, and that the former [probably] has similar intrinsic properties to human cognition).
Comment by Kiro 6 days ago
Comment by runarberg 6 days ago
If you can't tell, I take issue when terms are taken from psychology and applied to statistics. The terminology should flow in the other direction, from statistics into psychology.
My background is that I did undergraduate work in both psychology and statistics (though I dropped out of statistics after 2 years), and this is the first time I have heard of artificial cognition, so I don't think the term is popular; a short internet search seems to confirm that suspicion.
Out of context, I would guess "artificial cognition" relates to cognition the way artificial neural networks do to neural networks; that is, models that simulate the mechanisms of human cognition and recreate some stimulus → response loop. However, my internet search revealed (thankfully) that this is not how researchers are using this (IMO misguided) term.
https://psycnet.apa.org/record/2020-84784-001
https://arxiv.org/abs/1706.08606
What the researchers mean by the term (at least the ones I found in my short internet search) is not actual machine cognition, nor claims that machines have cognition, but rather an approach of research which takes experimental designs from cognitive psychology and applies them to learning models.
Comment by Libidinalecon 6 days ago
A logical type or a specific conceptual classification dictated by the rules of language and logic.
This is exactly getting hung up on the precise semantic meaning of the words being used.
The lack of precision is going to have huge consequences with bets this large on the idea that we have "intelligent" machines that "think" or have "cognition", when in reality we have probabilistic language models and all kinds of category errors in the language surrounding them.
Probably a better example here is that category in this sense is lifted from Bertrand Russell’s Theory of Types.
It is the loose equivalent of asking why are you getting hung up on the type of a variable in a programming language? A float or a string? Who cares if it works?
The problem is in introducing non-obvious bugs.
Comment by Benjammer 3 days ago
No, it's not. This is like me saying "string and float are two types of variables" and you going "what is a 'type' even??? Bertrand Russell said some bullshit and that means I'm right and you suck!"
Comment by runarberg 3 days ago
Cognition is a term from psychology, not statistics. If we are applying type theory, cognition would be a (non-pure) function term which takes the atomic term stimulus and maps it to another atomic term, behavior, and involves state of types including knowledge, memory, attention, emotions, etc. In cognitive psychology this is notated S → R, where S stands for stimulus and R stands for response.
Attributing cognition to machine learning algorithms superficially takes this S → R function and replaces all state variables of cognition with weight matrices; at that point you are no longer talking about cognition. The S → R mappings of machine learning algorithms are, most glaringly (apart from randomness), pure functions: during the S → R mapping of prompt to output, nothing is stored in the long-term memory of the algorithm, the attention is not shifted, the perception is not altered, no new knowledge is added, etc. Machine learning algorithms are simply computing, not learning.
Comment by white_dragon88 7 days ago
Comment by CamperBob2 7 days ago
Comment by runarberg 7 days ago
Comment by dekhn 7 days ago
Comment by CamperBob2 7 days ago
Comment by Paracompact 6 days ago
No. Humans don't need this handicap, either.
> More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context?
Any answer containing "5" as the leading candidate would be correct.
> What is your idea of the LLMs internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?
Irrelevant to the correctness of an answer to the question "how many legs does this dog have?" Also, asking how many legs a 5-legged dog has is not deceitful.
> This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" are for the people who write/perform/post them?
It's a demonstration of the failures of the rigor of out-of-distribution vision and reasoning capabilities. One can imagine similar scenarios with much more tragic consequences when such AI would be used to e.g. drive vehicles or assist in surgery.
Comment by danielvaughn 7 days ago
Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46
Comment by JamesSwift 7 days ago
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT
Comment by sebastiennight 6 days ago
I participated in a "math" competition in high school which mostly tested logic and reasoning. The reason my team won by a landslide is because I showed up with a programmable calculator and knew how to turn the problems into a program that could solve them.
By prompting the model to create the program, you're taking away one of the critical reasoning steps needed to solve the problem.
Comment by nl 6 days ago
Comment by JamesSwift 6 days ago
Comment by nearbuy 6 days ago
Comment by JamesSwift 6 days ago
Comment by nl 6 days ago
Represent the maze as a sequence of movements which either continue or end up being forced to backtrack.
Basically it would represent the maze as a graph and do a depth-first search, keeping track of what nodes it has visited in its reasoning tokens.
See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:
A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
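A minimal sketch of that representation, with the maze as an adjacency list and the backtracking recorded in the trace (the graph here is made up):
```
# Sketch of the representation described above: the maze as a graph,
# explored depth-first with explicit backtracking noted in the trace.
def dfs_with_backtracking(graph, start, goal):
    trace, stack, visited = [], [(start, iter(graph[start]))], {start}
    while stack:
        node, neighbours = stack[-1]
        if node == goal:
            return trace, [n for n, _ in stack]      # trace + final path
        for nxt in neighbours:
            if nxt not in visited:
                visited.add(nxt)
                trace.append(nxt)
                stack.append((nxt, iter(graph[nxt])))
                break
        else:
            stack.pop()                              # dead end: backtrack
            trace.append("(backtrack)")
    return trace, None

maze = {"A": ["B", "C"], "B": ["D", "E"], "D": [], "E": ["H"],
        "C": ["F", "G"], "F": [], "G": ["J"], "H": [], "J": []}
trace, path = dfs_with_backtracking(maze, "A", "J")
print(" ".join(["A"] + trace))   # A B D (backtrack) E H (backtrack) ... G J
```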
Comment by JamesSwift 6 days ago
In my opinion, being able to write the code to do the thing is effectively the exact same thing as doing the thing, in terms of judging if it's "able to do" that thing. It's functionally equivalent for evaluating what the "state of the art" is, and honestly it's naive about what these models even are. If the model hid the tool calling in the background instead and only showed you its answer, would we say it's more intelligent? Because that's essentially how a lot of these things work already. Because again, the actual "model" is just a text autocomplete engine and it generates from left to right.
Comment by nl 5 days ago
That's great, but it's demonstrably false.
I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
Tool use is absolutely an intelligence amplifier but it isn't the same thing.
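The letter-frequency script mentioned above is the sort of thing that only takes a few lines; a sketch, assuming the MediaWiki extracts API and an arbitrary article title:
```
# Sketch of the kind of task described: letter frequency for a Wikipedia
# article. Assumes the MediaWiki "extracts" API; the article title is arbitrary.
import requests
from collections import Counter

title = "Dog"
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "format": "json", "titles": title},
    timeout=30,
)
page = next(iter(resp.json()["query"]["pages"].values()))
text = page["extract"].lower()

letters = Counter(c for c in text if c.isalpha())
total = sum(letters.values())
for letter, count in letters.most_common(5):
    print(f"{letter}: {count / total:.2%}")
```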
> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/
[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...
Comment by JamesSwift 5 days ago
That is precisely the point I am trying to make. It's an arbitrary goalpost to say that knowing how to write the code doesn't mean it's intelligent, and that only doing it in a "chain of thought" would be.
Comment by nearbuy 6 days ago
> Again, think about how the models work. They generate text sequentially.
You have some misconception on how these models work. Yes, the transformer LLMs generate output tokens sequentially, but it's weird you mention this because it has no relevance to anything. They see and process tokens in parallel, and then process across layers. You can prove, mathematically, that it is possible for a transformer-based LLM to perform any maze-solving algorithm natively (given sufficient model size and the right weights). It's absolutely possible for a transformer model to solve mazes without writing code. It could have a solution before it even outputs a single token.
Beyond that, Gemini 3 Pro is a reasoning model. It writes out pages of hidden tokens before outputting any text that you see. The response you actually see could have been the final results after it backtracked 17 times in its reasoning scratchpad.
Comment by seanmcdirmid 6 days ago
Comment by rglullis 6 days ago
Tool use can be a sign of intelligence, but "being able to use a tool to solve a problem" is not the same as "being intelligent enough to solve a specific class of problems".
Comment by JamesSwift 6 days ago
And what I'm really saying is that we need to stop moving the goalposts on what "intelligence" is for these models, and start moving the goalposts on what "intelligence" actually _is_. The models are giving us an existential crisis about not only what it might mean to _be_ intelligent, but also how it might actually work in our own brains. I'm not saying the current models are Skynet, but I'm saying I think there's going to be a lot learned by reverse engineering the current generation of models to really dig into how they are encoding things internally.
Comment by rglullis 5 days ago
And I don't agree. I think that at best the model is "intelligent enough to use a tool that can solve mazes" (which is an entirely different thing) and at worst it is no different than a circus horse that "can do math". Being able to repeat more tricks and being able to select which trick to execute based on the expected reward is not a measure of intelligence.
Comment by JamesSwift 4 days ago
Comment by rglullis 4 days ago
Where you are seeing "intelligence" and "an existential crisis", I see "a huge pattern-matching system with an ever increasing vocabulary".
LLM's are useful. They will certainly cause a lot of disruption of automation on all types of white-collar work. They will definitely lead to all sorts of economic and social disruptions (good and bad). I'm definitely not ignoring them as just another fad... but none of that depends on LLMs being "intelligent" in any way.
Comment by flyinglizard 6 days ago
Comment by esafak 7 days ago
Comment by vunderba 7 days ago
Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.
Comment by danielvaughn 7 days ago
Comment by vunderba 7 days ago
When you think about posing the "solve a visual image of a maze" to something like ChatGPT, there's a good chance it'll try to throw a python VM at it, threshold it with something like OpenCV, and use a shortest-path style algorithm to try and solve it.
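That pipeline, sketched roughly (the start and goal coordinates are placeholders you would read off the mouse and cheese positions):
```
# Rough sketch of the pipeline described: threshold the maze image with
# OpenCV, then BFS over free pixels. Start/goal coordinates are placeholders.
import cv2
from collections import deque

img = cv2.imread("maze.jpg", cv2.IMREAD_GRAYSCALE)
_, walls = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)  # walls = white

start, goal = (5, 5), (img.shape[0] - 5, img.shape[1] - 5)      # placeholders
prev = {start: None}
queue = deque([start])
while queue:
    r, c = queue.popleft()
    if (r, c) == goal:
        break
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if (0 <= nr < walls.shape[0] and 0 <= nc < walls.shape[1]
                and walls[nr, nc] == 0 and (nr, nc) not in prev):
            prev[(nr, nc)] = (r, c)
            queue.append((nr, nc))

# Walk back from the goal and draw the path in red.
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
node = goal if goal in prev else None
while node is not None:
    out[node] = (0, 0, 255)
    node = prev[node]
cv2.imwrite("solved.png", out)
```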
Comment by kridsdale3 7 days ago
Gemini 3 then one-shot ported the whole thing (which uses CV py libraries) to a single page html+js version which works just as well.
I gave that to Claude to assess and assign a FAANG hiring level to, and it was amazed and said Gemini 3 codes like an L6.
Since I work for Google and used my phone in the office to do this, I think I can't share the source or file.
Comment by jiggawatts 7 days ago
That's essentially what's going on with AI models, they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly.
An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end.
Comment by radarsat1 6 days ago
https://arxiv.org/abs/2407.01392
Of course it doesn't redraw the image on every step, so it's not exactly what you're suggesting (interesting idea btw), but I think it's relevant.
Comment by pwagland 7 days ago
```
> [Image #1] Create a unicode "ascii-art" version of this image, with the optimal path through the maze highlighted in a solid colour.
I'll create an ASCII art version of this maze with the solution path highlighted!
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
●●│ │ │ │ │ │ │ │ │ │ │ │ │ │
├─┤●└─┴─┐ ├───┐ │ ╔═╗ ├─────┤ ╔═══╝ │ ╔═╝ ╔═╝ │ │ ╔═╝ ├─┤
│ │●●●●●└─┤ │ │ ║ │ │ │ ║ │ ║ ║ │ │ ║ │ │
│ └───┐●╔═╧═╗ │ └─╢ │ │ ┌─╔═╝ ║ ┌───┤ ╚═╗ ╚═╗ │ └─╢ ┌─┘ │
│ │●║ ║ │ ║ │ │ │ ║ ║ │ │ ║ ║ │ ║ │ │
├───┐ │●╚═╗ ╚═╧═╗ ║ └─┤ │ ║ ╔═╝ └─╔═╧═╗ ╚═╗ ╚═╧═╗ ║ └─┬─┤
│ │ │●●●└───┐ ║ ║ │ │ ║ ║ ║ ║ ║ ║ ║ │ │
│ ╔═╝ └───┐●╔═╝ ║ ╚═╗ │ └─╢ ╚═╗ ╔═╝ ╔═╝ ┌─╢ ╔═══╝ ╚═╗ │ │
│ ║ │●║ ║ ║ │ ║ ║ ║ ║ │ ║ ║ ║ │ │
│ ╚═══╗ ╔═╝●╚═╗ ╚═╗ ║ ├─╔═╧═╗ ║ ║ ╔═╝ ╔═╝ ║ ║ ╔═╗ ╔═╝ │ │
│ ║ ║●●●●●│ ║ ║ │ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ │ │
├───╔═╝ ║●┌─┐●└─╗ ║ └─┤ ║ ╔═╝ ║ ╚═╝ ╔═╧═╗ ║ ╚═╝ ║ ╚═╗ │ │
│ ║ ║●│ │●●●║ ║ │ ║ ║ ║ ║ ║ ║ ║ ║ │ │
│ ╔═╝ ╔═╝●│ └─┐●╚═╧═╗ │ ║ ║ ╔═╧═══╗ ║ ╔═╝ ╚═══╗ ╚═╗ ║ └─┤
│ ║ ║●●●│ │●●●●●║ │ ║ ║ ║ ║ ║ ║ ║ ║ ║ │
│ ║ ╔═╝●┌─┴─╗ └───┐●╚═╝ ║ ║ ║ ╔═╗ ║ └─╢ ╔═══╗ ╚═╗ ╚═╝ ╔═╝
│ ║ ║●●●│ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
│ ╚═╝●╔═╧═╗ ╚═╗ ╔═╧═══┐●╚═╝ ║ ║ ║ ╚═╗ ║ ║ ╔═╧═╗ ╚═══╗ ╚═╗
│ ●●║ ║ ║ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║ ║ ║
├─╔═┘●╚═╗ ╚═╗ ║ ║ ╔═╗ └───┐●╚═╝ ╚═╗ ║ └─╢ ║ ╔═╧═╗ ╔═╝ ╔═╝
│ ║●●●●●║ ║ ║ ║ ║ ║ │●●●●●●●║ ║ ║ ║ ║ ║ ║ ║
│ ║●┌─╗ ╚═╗ ║ └─╢ ║ ╚═══╗ └─────┐●╚═╧═╗ ║ ║ ║ ╔═╝ ║ ╔═╝ ╔
│ ║●│ ║ ║ ║ ║ ║ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║
│ ╚═╝ ║ ╔═╝ ╚═╗ ║ ╚═╗ ╔═╧═════╗ └───┐●╚═╝ ║ ╚═╝ ╔═╝ ║ ╔═╝
│ ║ ║ ║ ║ ║ ║ ║ │●●●●●║ ║ ║ ║
└─────╨─╨─────╨─╨───╨─╨───────╨─────┴───●●╨─────╨───╨─╨──
The path from the mouse () to the cheese () is marked with ● characters. This shows the optimal route through the maze, navigating through the corridors while avoiding all the walls!
```
Suffice to say, it didn't do either part right.
Comment by ActorNightly 7 days ago
I dunno why people are surprised by this. This is what you get with text->text. Reasoning doesn't work text->text.
Comment by biophysboy 7 days ago
Comment by sebastiennight 6 days ago
The only impressive part would be that the trajectory is "continuous", meaning for every ● there is always another ● character in one of the 4 adjacent positions.
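That continuity property is easy to check mechanically; a small sketch:
```
# Quick check of the property described: every ● in the ASCII maze output
# should have at least one ● in a 4-adjacent cell.
def path_is_continuous(ascii_maze: str, marker: str = "●") -> bool:
    grid = ascii_maze.splitlines()
    cells = {(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch == marker}
    return all(
        any((r + dr, c + dc) in cells
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
        for r, c in cells
    )

print(path_is_continuous("●●\n.●"))   # True
print(path_is_continuous("●.\n.●"))   # False: only diagonally adjacent
```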
Comment by biophysboy 6 days ago
Comment by FeepingCreature 7 days ago
Comment by buildbot 7 days ago
Comment by vunderba 7 days ago
Try generating:
- A spider missing one leg
- A 9-pointed star
- A 5-leaf clover
- A man with six fingers on his left hand and four fingers on his right
You'll be lucky to get a 25% success rate.
The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.
Comment by moonu 7 days ago
Surprisingly, it got all of them right
Comment by vunderba 7 days ago
Other than the five-leaf clover, most of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different since you're actively correcting them.
Comment by XenophileJKO 7 days ago
Comment by vunderba 6 days ago
For example, to my knowledge ChatGPT is unified and I can guarantee it can't handle something like a 7-legged spider.
Comment by XenophileJKO 6 days ago
Comment by Borealid 6 days ago
Comment by thefourthchime 7 days ago
"Generate a Pac-Man game in a single HTML page." -- I've never had a model been able to have a complete working game until a couple weeks ago.
Sonnet Opus 4.5 in Cursor was able to make a fully working game (I'll admit letting cursor be an agent on this is a little bit cheating). Gemini 3 Pro also succeeded, but it's not quite as good because the ghosts seem to be stuck in their jail. Otherwise, it does appear complete.
Comment by seanmcdirmid 7 days ago
Most human beings, if they see a dog that has 5 legs, will quickly think they are hallucinating and the dog really only has 4 legs, unless the fifth leg is really really obvious. It is weird how humans are biased like that:
1. You can look directly at something and not see it because your attention is focused elsewhere (on the expected four legs).
2. Our pre-existing knowledge (dogs have four legs) influences how we interpret visual information from the bottom-up.
3. Our brain actively filters out "unimportant" details that don't align with our expectations or the main "figure" of the dog.
Attention should fix this however, like if you ask the AI to count the number of legs the dog has specifically, it shouldn't go nuts.
A straight up "dumber" computer algorithm that isn't trained extensively on real and realistic image data is going to get this right more often than a transformer that was.
Comment by hcnews 7 days ago
We are able to cleanly separate facts from non-facts (for the most part). This is what LLMs are trying to replicate now.
Comment by seanmcdirmid 7 days ago
If you want the AI to identify a dog, we are done. If you want the AI to identify subtle differences from reality, then you are going to have to use a different technique.
Comment by irjustin 7 days ago
We're all just pattern matching machines and we humans are very good at it.
So much so that we have the sayings - you can't teach an old dog... and a specialist in their field only sees hammer => nails.
Evolution anyone?
Comment by seanmcdirmid 7 days ago
I get it, the litmus test of parent is to show that the AI is smarter than a human, not as smart as a human. Can the AI recognize details that are difficult for normal people to see even though the AI has been trained on normal data like the humans have been.
Comment by jvolkman 7 days ago
https://gemini.google.com/share/e7a8b902ff67
But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):
Analyzing Missing Anomaly
I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.
Acknowledge Leg Omission
I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.
Comment by genrader 7 days ago
For instance, look at what other reasoning models show you about their thinking. You can tell Gemini is not giving it all to you, so I don't put full trust in it unless it routinely is giving me an issue that I can catch in there.
For instance, it doesn't show you all the tool use it performs, nor distinguish between the different types, which I really wish it would, especially because I force it to in a Gemini personal context instruction as well as in many prompts, and it does do them.
Comment by joaomacp 7 days ago
Comment by averageRoyalty 7 days ago
Comment by thway15269037 7 days ago
I will be very surprised if there are not at least several groups or companies scraping these "smart" and snarky comments to find weird edge cases that they can train on, turn into a demo, and then sell as improvement. Hell, they would've done it if 10 billion was at stake; I can't really imagine (and I have a vivid imagination, to my horror) what Californian psychopaths can do for 10 trillion.
Comment by Yizahi 4 days ago
Comment by Workaccount2 6 days ago
Similar to the pelican bike SVG: the models that do well at that test do well at all SVG generation, so even if they are targeting that benchmark, they're still making the whole model better to score better.
Comment by rottencupcakes 7 days ago
I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...
Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.
Comment by vunderba 7 days ago
I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
This right HERE is the issue. It's not nearly deterministic enough to rely on.
Comment by irthomasthomas 7 days ago
Comment by grugnog 6 days ago
Comment by runarberg 7 days ago
If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper than an LLM.
Comment by energy123 7 days ago
Comment by runarberg 7 days ago
Let's say you are right and these things will be optimized, and in, say, 5 years, most models from the big players will be able to do things like read small text in an obscure image, draw a picture of a glass of wine filled to the brim, draw a path through a maze, count the legs of a 5-legged dog, etc. And in doing so they will have used up their last venture capital subsidies (bringing the actual cost of these models to their customers). Why would people use LLMs for these when a traditional specialized model can do it for much cheaper?
Comment by energy123 7 days ago
This is not too different from where I see things going. I don't think a monolithic LLM that does everything perfectly is where we'll go. An LLM in a finite-compute universe is never going to be better at weather forecasting than GraphCast. The LLM will have a finite compute budget, and it should prioritize general reasoning, and be capable of calling tools like GraphCast to extend its intelligence into the necessary verticals for solving a problem.
I don't know exactly what that balance will look like however, and the lines between specialist application knowledge and general intelligence is pretty blurred, and what the API boundaries (if any) should be are unclear to me. There's a phenomenon where capabilities in one vertical do help with general reasoning to an extent, so it's not a completely zero-sum tradeoff between specialist expertise and generalist abilities, which makes it difficult to know what to expect.
Comment by a1j9o94 6 days ago
Comment by runarberg 6 days ago
Comment by maczwei 6 days ago
Comment by spease 6 days ago
I wonder if “How many legs do you see?” is close enough to “How many lights do you see?” that the LLMs are responding based on the memes surrounding the Star Trek episode “Chain of Command”.
Comment by reed1234 7 days ago
Comment by anguyen8 4 days ago
Comment by grog454 7 days ago
When I look at google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with Chat GPT.
> How many legs does this dog have?
"The dog in the image has four legs."
> look closer.
" looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs.
Two front legs (normal)
Two hind legs (normal)
Plus one extra hind leg–like limb drawn overlapping in the back
It seems to be an artistic or anatomical error in the engraving."
Seems fair to me.
Comment by vunderba 7 days ago
Comment by andai 7 days ago
Comment by SubiculumCode 6 days ago
Comment by SubiculumCode 6 days ago
Comment by SubiculumCode 6 days ago
Comment by cancan 7 days ago
Comment by cubefox 6 days ago
https://imagazine.pl/wp-content/uploads/2024/12/ugly-AI-pic-...
GPT-5 couldn't do it.
Comment by bee_rider 7 days ago
I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff, than the image recognition stuff?
Comment by genrader 7 days ago
Comment by andy12_ 6 days ago
> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.
Comment by wasmainiac 6 days ago
Comment by Andrex 6 days ago
Comment by yieldcrv 7 days ago
Comment by teaearlgraycold 7 days ago
Comment by dana321 7 days ago
Comment by knollimar 7 days ago
I gave it a shitty harness and it almost 1-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
Comment by Libidinalecon 6 days ago
What I notice that I don't see talked about much is how "steerable" the output is.
I think this is a big reason 1 shots are used as examples.
Once you get past 1 shots, so much of the output is dependent on the context the previous prompts have created.
Instead of 1-shots, try something that requires 3 different prompts on a subject with uncertainty involved. Do 4 or 5 iterations and often you will get wildly different results.
It doesn't seem like we have a word for this. A "hallucination" is when we know what the output should be and it is just wrong. This is like the user steers the model towards an answer but there is a lot of uncertainty in what the right answer even would be.
To me this always comes back to the problem that the models are not grounded in reality.
Letting LLMs do electric work without grounding in reality would be insane. No pun intended.
Comment by knollimar 6 days ago
I think they'll never be great at switchgear rooms but apartment outlet circuitry? Why not?
I have a very rigid workflow with what I want as outputs, so if I shape the inputs using an LLM it's promising. You don't need to automate everything; high level choices should be done by a human.
Comment by pardon_me 4 days ago
The main task of existing tools is rule-based checks and flagging errors for attention (like a compiler), because there is simply too much for a human to think about. The rules are based on physics and manufacturing constraints--precise known quantities--leading to output accuracy which can be verified up to 100%. The output is a known-functioning solution and/or simulation (unless the tool is flawed).
Most of these design tools include auto-design (chips)/auto-routing (PCBs) features, but they are notoriously poor due to being too heavily rule-based. Similar to the Photoshop "Content Aware Fill" feature (released 15 years ago!), where the algorithm tries to fill in a selection by guessing values based on the pixels surrounding it. It can work exceptionally well, until it doesn't, due to lacking correct context, at which point the work needs to be done manually (by someone knowledgeable).
"Hallucinogenic" or diffusion-based AI (LLM) algorithms do not readily learn or repeat procedures with high accuracy, but instead look at the problem holistically, much like a human; weights of neural nets almost light up with possible solutions. Any rules are loose, context-based, interconnected, often invisible, and all based on experience.
LLM tools as features on the design-side could be very promising, as existing rule-based algorithms could be integrated in the design-loop feedback to ground them in reality and reiterate the context. Combined with the precise rule-based checking and excellent quality training data, it provides a very promising path, and more so than tasks in most fields as the final output can still be rule-checked with existing algorithms.
In the near-future I expect basic designs can be created with minimal knowledge. EEs and electrical designer "experts" will only be needed to design and manufacture the tools, to verify designs, and to implement complex/critical projects.
In a sane world, this knowledge-barrier drop should encourage and grow the entire field, as worldwide costs for new systems and upgrades decreases. It has the potential to boost global standards of living. We shouldn't have to be worrying about losing jobs, nor weighing up extortionately priced tools vs. selling our data.
Comment by amorzor 7 days ago
Comment by knollimar 7 days ago
I gave it some custom methods it could call, including "get_available_families", "place family instance", "scan_geometry" (reads model walls into LLM by wall endpoint), and "get_view_scale".
The task is basically to copy the building engineer's layout onto the architect's model by placing my families. It requires reading the symbol list, and you give it a PDF that contains the room.
Notably, it even used a GFCI family when it noticed it was a bathroom (I had told it to check NEC code, implying outlet spacing).
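A hypothetical sketch of how those methods might be exposed as tools via the google-genai SDK; only the method names come from the description above, and the signatures and stub bodies are guesses:
```
# Hypothetical sketch of wiring the custom methods mentioned above into a
# Gemini tool-calling loop (google-genai SDK). Signatures and stubs are guesses.
from google import genai
from google.genai import types

def get_available_families() -> list[str]:
    """Return the names of placeable families (e.g. receptacle types)."""
    return ["Duplex Receptacle", "GFCI Receptacle"]            # stub

def place_family_instance(family: str, x: float, y: float) -> str:
    """Place a family instance at model coordinates (x, y)."""
    return f"placed {family} at ({x}, {y})"                    # stub

def scan_geometry() -> list[dict]:
    """Return model walls as endpoint pairs."""
    return [{"start": [0.0, 0.0], "end": [12.0, 0.0]}]         # stub

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-preview",        # placeholder model id
    contents="Lay out receptacles for this room per NEC spacing.",
    config=types.GenerateContentConfig(
        tools=[get_available_families, place_family_instance, scan_geometry],
    ),
)
print(response.text)
```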
Comment by ftcHn 7 days ago
Comment by knollimar 6 days ago
For clarity, now that I'm rereading: it understands vectors a lot better than areas. Encoding it like that seems to work better for me.
Comment by willis936 7 days ago
Comment by knollimar 7 days ago
Comment by skybrian 6 days ago
A good start would be getting image generators to understand instructions like “move the table three feet to the left.”
Comment by reducesuffering 7 days ago
"Ok, I guess it could wipe out the economic demand for digital art, but it could never do all the autonomous tasks of a project manager"
"Ok, I guess it could automate most of that away but there will always be a need for a human engineer to steer it and deal with the nuances of code"
"Ok, well it could never automate blue collar work, how is it gonna wrench a pipe it doesn't have hands"
The goalposts will continue to move until we have no idea if the comments are real anymore.
Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020
Comment by blargey 7 days ago
> "the economic demand for digital art"
You twisted one "goalpost" into a tangential thing in your first "example", and it still wasn't true, so idk what you're going for. "Using a wrench vs preliminary layout draft" is even worse.
If one attempted to make a productive observation of the past few years of AI Discourse, it might be that "AI" capabilities are shaped in a very odd way that does not cleanly overlap/occupy the conceptual spaces we normally think of as demonstrations of "human intelligence". Like taking a 2-dimensional cross-section of the overlap of two twisty pool tubes and trying to prove a Point with it. Yet people continue to do so, because such myopic snapshots are a goldmine of contradictory venn diagrams, and if Discourse in general for the past decade has proven anything, it's that nuance is for losers.
Comment by visarga 6 days ago
Comment by semi-extrinsic 7 days ago
To be clear, it's only ever been a pop science belief that the Turing test was proposed as a literal benchmark. E.g. Chomsky in 1995 wrote:
The question "Can machines think?" is not a question of fact but one of language, and Turing himself observed that the question is 'too meaningless to deserve discussion'.
Comment by throw310822 7 days ago
"I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted."
Comment by staticman2 7 days ago
>If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd.
This anticipates the very modern social media discussion where someone has nothing substantive to say on the topic but delights in showing off their preferred definition of a word.
For example someone shows up in a discussion of LLMs to say:
"Humans and machines both use tokens".
This would be true as long as you choose a sufficiently broad definition of "token" but tells us nothing substantive about either Humans or LLMs.
Comment by Fraterkes 7 days ago
Also, none of the other things you mentioned have actually happened. Don’t really know why I bother responding to this stuff
Comment by Workaccount2 6 days ago
i.e. the tell that it's not human is that it is too perfectly human.
However if we could transport people from 2012 to today to run the test on them, none would guess the LLM output was from a computer.
Comment by skybrian 6 days ago
Also, the skill of the human opponents matters. There’s a difference between testing a chess bot against randomly selected college undergrads versus chess grandmasters.
Just like jailbreaks are not hard to find, figuring out exploits to get LLM’s to reveal themselves probably wouldn’t be that hard? But to even play the game at all, someone would need to train LLM’s that don’t immediately admit that they’re bots.
Comment by visarga 6 days ago
Comment by phainopepla2 7 days ago
I strongly doubt this. If you gave it an appropriate system prompt with instructions and examples on how to speak in a certain way (something different from typical slop, like the way a teenager chats on discord or something), I'm quite sure it could fool the majority of people
Comment by 8n4vidtmkvmk 6 days ago
Like if you put someone in an online chat and ask them to identify if the person they're talking to is a bot or not, you're telling me your average joe honestly can't tell?
A blog post or a random HN comment, sure, it can be hard to tell, but if you allow some back and forth.. i think we can still sniff out the AIs.
Comment by akoboldfrying 6 days ago
IOW, LLMs pass the Turing test.
Comment by knollimar 6 days ago
Comment by webdood90 7 days ago
I don't think it's fair to qualify this as blue collar work
Comment by knollimar 7 days ago
Anything like this will have trouble getting adopted, since you'd need these to work with imperfect humans, which becomes way harder. You could bankroll a whole team of subcontractors (e.g. all trades) using that, but you would have one big liability.
The upper end of the complexity is similar to EDA in difficulty, imo. Complete with "use other layers for routing" problems.
I feel safer here than in programming. The senior guys won't be automated out any time soon, but I worry for Indian drafting firms without trade knowledge; the handholding I give them might go to an LLM soon.
Comment by knollimar 7 days ago
Comment by fuzzy2 6 days ago
For example, artists can create incredible art, and so can AI artists. But me, I just can't do it. Whatever art I have generated will never have the creative spark. It will always be slop.
The goalposts haven't moved at all. However, the narrative would rather not deal with that.
Comment by golem14 6 days ago
Comment by fngjdflmdflg 7 days ago
Comment by levocardia 7 days ago
Comment by tills13 7 days ago
Comment by visarga 6 days ago
Practical example - using LLMs to create deep research reports. It pulls over 500 sources into a complex analysis, and after all that compiling and contrasting it generates an article with references, like a wiki page. That text is probably superior to most of its sources in quality. It does not trust any one source completely, it does not even pretend to present the truth, it only summarizes the distribution of information it found on the topic. Imagine scaling wikipedia 1000x by deep-reporting every conceivable topic.
Comment by Workaccount2 6 days ago
Comment by jeffbee 7 days ago
Comment by Choco31415 6 days ago
Comment by kridsdale3 7 days ago
Comment by djoldman 7 days ago
72.7% Gemini 3 Pro
11.4% Gemini 2.5 Pro
49.9% Claude Opus 4.5
3.50% GPT-5.1
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Comment by simonw 7 days ago
Comment by daemonologist 7 days ago
According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 is resizing images to have a minor dimension of at most 768: https://openai.com/api/pricing/ That's ~half the resolution I would normally use for OCR, so if that's happening even via the API then I guess it makes sense it performs so poorly.
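For reference, capping the shorter side at 768 px roughly halves the linear resolution of a typical document scan before OCR even starts. A minimal sketch of that downscale, assuming Pillow, with the 768 cap taken from the pricing-page calculator rather than from any documented API behavior:

    from PIL import Image

    MAX_SHORT_SIDE = 768  # assumed cap, per the pricing-page calculator

    def downscale_for_api(path: str) -> Image.Image:
        """Resize so the shorter side is at most MAX_SHORT_SIDE, keeping aspect ratio."""
        img = Image.open(path)
        w, h = img.size
        short = min(w, h)
        if short <= MAX_SHORT_SIDE:
            return img  # already small enough, nothing to do
        scale = MAX_SHORT_SIDE / short
        return img.resize((round(w * scale), round(h * scale)), Image.Resampling.LANCZOS)

    # e.g. a 1536x2048 scan becomes 768x1024 -- half the linear resolution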
Comment by datadrivenangel 6 days ago
Comment by jasonjmcghee 7 days ago
Comment by energy123 7 days ago
Comment by ericd 7 days ago
Comment by zubiaur 7 days ago
Comment by agentifysh 7 days ago
It's going to reach the low 90s very soon if trends continue.
Comment by simonw 7 days ago
Comment by TechRemarker 7 days ago
Comment by inerte 7 days ago
Oh, speaking of mobile: I remember when I tried to use Jira's mobile web UI to move a few tickets up in priority by drag-and-drop and ended up closing the sprint. That stuff was horrible.
Comment by dekhn 7 days ago
Comment by cubefox 6 days ago
Comment by jamiek88 7 days ago
Comment by rohanlikesai 7 days ago
Comment by sumedh 7 days ago
Comment by buildbot 7 days ago
Comment by ed 7 days ago
Comment by mhl47 6 days ago
One was two screenshots of a phone screen with timestamped chats, where it had to take the nth letter of the mth word based on the timestamp. While this type of riddle could be in the training data, the ability to OCR the screenshots that well and understand the spatial relation between the objects is something I have not seen from other models yet.
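To be clear, the non-vision half of the riddle is trivial string indexing once the OCR and layout are right; roughly this (made-up chat line and indices, just to show the shape of the task):

    def nth_letter_of_mth_word(text: str, m: int, n: int) -> str:
        """Return the n-th letter of the m-th word (both 1-indexed)."""
        words = text.split()
        return words[m - 1][n - 1]

    # Hypothetical OCR'd chat line; say the timestamp 14:32 encodes m=3, n=2.
    line = "see you at the usual place tonight"
    print(nth_letter_of_mth_word(line, 3, 2))  # -> "t" (2nd letter of "at")

The hard part is everything before that: reading both screenshots accurately and keeping track of which timestamp belongs to which message.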
Comment by devttyeu 6 days ago
Comment by TheAceOfHearts 7 days ago
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
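For context, the normalization I mean is just this kind of thing; a rough solver sketch (the grid and word list here are stand-ins, not the actual puzzle):

    # Minimal word-search solver: normalize casing, strip spaces,
    # then scan all 8 directions from every cell.
    GRID = [
        "CHICKENX",
        "SOUPMIXQ",
        "BROTHZKA",
    ]
    WORDS = ["chicken", "soup mix", "broth"]  # mixed case, embedded space

    DIRS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

    def find(word: str):
        w = word.upper().replace(" ", "")  # the normalization step
        rows, cols = len(GRID), len(GRID[0])
        for r in range(rows):
            for c in range(cols):
                for dr, dc in DIRS:
                    rr, cc = r, c
                    for ch in w:
                        if not (0 <= rr < rows and 0 <= cc < cols) or GRID[rr][cc] != ch:
                            break
                        rr, cc = rr + dr, cc + dc
                    else:
                        return (r, c), (dr, dc)  # start cell and direction
        return None

    for word in WORDS:
        print(word, "->", find(word))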
Comment by genrader 7 days ago
This may even work if you tell it to do all of that before figuring out what to create for the image.
Comment by TheAceOfHearts 6 days ago
For generating the prompt which included the word positions I had Gemini 3 Pro do that using the following prompt: "Please try to solve this word search puzzle. Give me the position of each word in the grid. Then generate a prompt which I can pass to Nano Banana Pro, which I will pass along with the same input image to see if Nano Banana Pro is able to properly highlight all the words if given their correct position."
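Roughly the same flow in code, for anyone who wants to reproduce it; a sketch with the google-genai Python SDK. The model ids are placeholders (I'm using the second one to mean whatever "Nano Banana Pro" is called in the API), and I'm assuming image output is requested via response_modalities:

    from google import genai
    from google.genai import types

    client = genai.Client()
    puzzle = types.Part.from_bytes(data=open("wordsearch.png", "rb").read(),
                                   mime_type="image/png")

    # Step 1: have the text model locate the words and write the follow-up prompt.
    solve = client.models.generate_content(
        model="gemini-3-pro-preview",  # placeholder id
        contents=[puzzle, "Solve this word search. Give each word's grid position, "
                          "then write a prompt asking an image model to highlight them."],
    )

    # Step 2: pass the generated prompt plus the same image to the image model.
    highlight = client.models.generate_content(
        model="gemini-3-pro-image-preview",  # placeholder id
        contents=[puzzle, solve.text],
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    for part in highlight.candidates[0].content.parts:
        if part.inline_data:  # the returned image bytes
            open("highlighted.png", "wb").write(part.inline_data.data)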
Comment by hodder 7 days ago
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
Comment by minimaxir 7 days ago
The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.
Comment by hodder 7 days ago
If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).
Comment by iknowstuff 7 days ago
This article is about understanding images.
Your task is unrelated to the article.
Comment by JacobAsmuth 7 days ago
Comment by spchampion2 7 days ago
Comment by RyJones 7 days ago
Comment by IncreasePosts 6 days ago
Comment by ugh123 7 days ago
Comment by zmmmmm 7 days ago
Comment by aziis98 7 days ago
Does somebody know how to correctly prompt the model for these tasks, or even better, can someone provide some docs? The pictures with the pretty markers are appreciated, but that section is a bit vague and lacks references.
Comment by atonse 7 days ago
Any model that can do that? I tried looking in huggingface but didn’t quite see anything.
Comment by themanmaran 7 days ago
Comment by inquirerGeneral 7 days ago
Comment by ed 7 days ago
Comment by minimaxir 7 days ago
Comment by siva7 7 days ago
Comment by minimaxir 7 days ago
Comment by siva7 7 days ago
Comment by IanCal 7 days ago
Comment by brokensegue 7 days ago
Comment by minimaxir 7 days ago
Comment by andy12_ 6 days ago
Comment by devinprater 7 days ago
Comment by SXX 7 days ago
Video: Zelda TOTK, R5 5600X, GTX 1650, 1080p 10 Minute Gameplay, No Commentary
https://www.youtube.com/watch?v=wZGmgV-8Rbo
The narrative description source and command can be found here:
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Then I converted it into narrative voice over with Gemini 2.5 Pro TTS:
https://drive.google.com/file/d/1Js2nDtM7sx14I43UY2PEoV5PuLM...
It's somewhat desynced from the original video, and the voice-over takes 9 and a half minutes instead of the video's 10, but the description of what's happening on screen is quite accurate.
PS: I used a 144p video, so details could also be messed up because of the poor quality. And of course I specifically asked for a narrative-like description.
Comment by SXX 7 days ago
Source video title: Zelda: Breath of the Wild - Opening five minutes of gameplay
https://www.youtube.com/watch?v=xbt7ZYdUXn8
Prompt:
Please describe what happening in each scene of this video.
List scenes with timestamp, then describe separately:
- Setup and background, colors
- What is moving, what appear
- What objects in this scene and what is happening,
Basically make desceiption of 5 minutes video for a person who cant watch it.
Result is on a GitHub gist since there's too much text: https://gist.github.com/ArseniyShestakov/43fe8b8c1dca45eadab...
I'd say this is quite accurate.
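For anyone who wants to reproduce it, the whole pipeline is basically one file upload plus one prompt; a minimal sketch with the google-genai Python SDK (the model id is a placeholder, and large files may need a short wait before they're ready):

    from google import genai

    client = genai.Client()

    # Upload the gameplay video; for larger files, poll
    # client.files.get(name=video.name) until processing finishes.
    video = client.files.upload(file="botw_opening.mp4")

    resp = client.models.generate_content(
        model="gemini-3-pro-preview",  # placeholder id
        contents=[video,
                  "Please describe what is happening in each scene of this video. "
                  "List scenes with timestamps, then describe the setup and colors, "
                  "what is moving, and what objects appear -- basically a description "
                  "of the video for a person who can't watch it."],
    )
    print(resp.text)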
Comment by SXX 7 days ago
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Comment by SXX 6 days ago
Comment by MostlyStable 7 days ago
Hopefully Google pro marries the two together.
Comment by lysecret 6 days ago
Comment by hackeruser741 7 days ago
Comment by axpy906 7 days ago
Comment by iamjackg 7 days ago
Comment by minimaxir 7 days ago
Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.
Comment by theLiminator 7 days ago
Comment by euvin 7 days ago
Comment by skybrian 7 days ago
Comment by danso 7 days ago
I'm curious how close these models are to achieving that long-ago, much-mocked claim (by Microsoft, I think?) that AIs could view gameplay video of long-lost games and produce the code to emulate them.
Comment by caseyf 7 days ago
Comment by sublimefire 6 days ago
Comment by k8sToGo 7 days ago
Comment by sumedh 7 days ago
Comment by a-dub 7 days ago
Comment by inquirerGeneral 7 days ago
Comment by jonplackett 7 days ago
Comment by causal 7 days ago
Comment by pseudosavant 7 days ago
Comment by minimaxir 7 days ago
Comment by pseudosavant 7 days ago
Most companies have rules for how many tokens the media should "cost", but they aren't usually exact.
Comment by drivebyhooting 7 days ago
Comment by ch2026 7 days ago
Comment by stego-tech 7 days ago
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).
Comment by bgwalter 7 days ago
That is called progress.
EDIT: You can downvote the truth but still no one wants your "AI" slop.
Comment by stego-tech 7 days ago
Simple, elegant. I do miss those days.
Comment by oklahomasports 7 days ago
Comment by stego-tech 7 days ago
As for this throwaway line:
> Also I don’t upload stuff I’m worried about Google seeing.
You do realize that these companies harvest even private data, right? Like, even in places you think you own, or that you pay for, they’re mining for revenue opportunities and using you as the product even when you’re a customer, right?
> I wonder if they will allows special plans for corporations
They do, but no matter how much redlining Legal does to protect IP interests, the consensus I keep hearing is "don't put private or sensitive corporate data into third parties, because no legal agreement will sufficiently protect us from harm if they steal our IP or data". Just look at the glut of lawsuits against Apple, Google, Microsoft, etc. from smaller companies that trusted them to act in good faith and got burned; it's evidence that you cannot trust these entities.
Comment by _trampeltier 7 days ago
Comment by bovermyer 7 days ago
Comment by themafia 6 days ago
I've never hated industry infatuation with a buzzword more.
Comment by Spacecosmonaut 6 days ago
Comment by vharish 6 days ago
Comment by romanovcode 6 days ago
Comment by genrader 7 days ago
Asking correctly for how it should give you the info is a skill many people still lack.
Comment by kkukshtel 6 days ago
Comment by Frannky 7 days ago
Comment by empressplay 7 days ago
Comment by dmarzio 7 days ago
Comment by ichik 7 days ago
Comment by OBELISK_ASI 6 days ago
Comment by sora2video 6 days ago
Comment by agentifysh 7 days ago
im just a glorified speedreadin' promptin' QA at this point with codex
Once it replaces the QA layer, it's truly over for software dev jobs.
future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"
Edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11.4% to 72.7%, even beating Opus 4.5's 50%... ChatGPT is at 3.5%, and it matches my experience with Codex.
Comment by alex1138 7 days ago
Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed: I-promise-not-to-sue-the-company-if-the-AI-deletes-the-entire-database.
It won't happen for that reason alone. Oh, who am I kidding, of course it will.
Comment by hklrekeclhkle 7 days ago