AI will make formal verification go mainstream

Posted by evankhoury 8 hours ago

468 points | 226 comments

Comments

Comment by joegibbs 1 minute ago

I think the problem is that people don't know exactly what it is that they want. You could easily make a formally verified application that is nevertheless completely buggy and doesn't make any sense. Like he says about the challenge moving to defining the specification: I don't think that would help, because there are fewer people who understand formal verification well enough to read a spec and make sense of it than there are people who could write the code.

Comment by QuadrupleA 2 hours ago

I don't think formal verification really addresses most day-to-day programming problems:

    * A user interface is confusing, or the English around it is unclear
    * An API you rely on changes, is deprecated, etc.
    * Users use something in unexpected ways
    * Updates forced by vendors or open source projects cause things to break
    * The customer isn't clear what they want
    * Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)
For some mathematically pure task, sure, it's great. Or a low-level library like a regular expression parser or a compression codec. But I don't think that represents a lot of what most of us are tasked with, and those low-level "mathematically pure" libraries are generally pretty well handled by now.

Comment by est 36 minutes ago

> An API you rely on changes, is deprecated, etc

Formal verification will eventually lead to good, stable API design.

> Users use something in unexpected ways

> Complex behavior between interconnected systems

It happens when there's no formal verification during the design stage.

Formal verification literally means covering 100% of state changes: for every possible input/output, every execution branch should be tested.

Comment by ehnto 3 minutes ago

Knowing 100% of the state changes in business software is impossible on a long horizon; it relies on thoroughly understanding business logic that is often fuzzy rather than discrete and certain.

Comment by Joker_vD 20 minutes ago

> Formal verification will eventually lead to good, stable API design.

Why? Has it ever happened like this? Because to me it would seem that if the system is verified to work, then it works no matter how the API is shaped, so there is no incentive to change it to something better.

Comment by est 2 minutes ago

> if the system is verified to work, then it works no matter how the API is shaped

Integrating with an API for one-off tasks should be easy, but the messy part always comes when the goal changes.

Let's say formal verification could help to avoid some anti-patterns.

Comment by gizmo686 1 hour ago

A limited form of formal verification is already mainstream. It is called type systems. The industry in general has been slowly moving to encode more invariants into the type system, because every invariant that is in the type system is something you can stop thinking about until the type checker yells at you.

A lot of libraries document invariants that are either not checked at all, checked only at runtime, or somewhere in between. For instance, the requirement that a collection not be modified during iteration. Or that two regions of memory do not overlap, or that a variable is not modified without owning a lock. These are all things that, in principle, can be formally verified.
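
To pick just the lock example: in Rust that invariant is already captured by an ordinary type, no separate proof needed. A minimal sketch (the `Account` type is made up, purely for illustration):

```rust
use std::sync::Mutex;

// Hypothetical account type, just to sketch the idea.
struct Account {
    balance: i64,
}

// The balance is only reachable through the lock: Mutex<Account> hands out
// the data exclusively via its guard, so "modified without owning the lock"
// is ruled out at compile time rather than documented and hoped for.
fn deposit(account: &Mutex<Account>, amount: i64) {
    let mut guard = account.lock().expect("lock poisoned"); // acquire the lock
    guard.balance += amount; // access goes through the guard
} // lock is released when `guard` goes out of scope
```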

No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.

For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program, then have an LLM-driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible, although every "declarative" language is an attempt to approximate it.

Comment by devin 1 hour ago

> No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.

They really don’t. How did you arrive at such a conclusion?

Comment by Permik 53 minutes ago

Not that I can answer for the OP, but as a personal anecdote: I've never been more productive than writing in Rust, it's a goddamn delight. Every codebase feels like it could've been my own, and you can get up to speed from 0 to 100 in no time.

Comment by mplewis 57 minutes ago

Through empirical evidence? Do you think that the vast majority of software devs moved to typing for no reason?

Comment by adrianN 2 hours ago

TBH most day-to-day programming problems are barely worth having tests for. But if we had formal specs, and even just hand-wavy correspondences between the specs and the implementation for the low-level things everybody depends on, that would be a huge improvement for the reliability of the whole ecosystem.

Comment by bkettle 7 hours ago

I think formal verification shines in areas where implementation is much more complex than the spec, like when you’re writing incomprehensible bit-level optimizations in a cryptography implementation or compiler optimization phases. I’m not sure that most of us, day-to-day, write code (or have AI write code) that would benefit from formal verification, since to me it seems like high-level programming languages are already close to a specification language. I’m not sure how much easier to read a specification format that didn’t concern itself with implementation could be, especially when we currently use all kinds of frameworks and libraries that already abstract away implementation details.

Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.

Comment by pron 6 hours ago

> to me it seems like high-level programming languages are already close to a specification language

They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).

To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
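
To make the contrived example slightly more concrete, in TLA-style notation the statement would look something like the sketch below (the names are illustrative, and in practice you also need a fairness assumption like the WF conjunct for the theorem to hold):

```latex
\mathit{Spec} \;\triangleq\; \mathit{Init} \,\land\, \Box[\mathit{Next}]_{\mathit{vars}} \,\land\, \mathrm{WF}_{\mathit{vars}}(\mathit{Next})
\qquad
\mathit{Termination} \;\triangleq\; \Diamond\,\mathit{done}
\qquad
\text{and the proof obligation is } \mathit{Spec} \Rightarrow \mathit{Termination}
```

The termination property is stated directly, even though no program could compute it in general.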

Comment by nyrikki 6 hours ago

TLA+ is not a silver bullet, and like all temporal logic, has constraints.

You really have to be able to reduce your models to: "at some point in the future, this will happen," or "it will always be true from now on".

Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.

> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.

TLA+ works for the problems it is suitable for; try to extend past that and it simply fails.

[0] https://learntla.com/core/operators.html

Comment by pron 4 hours ago

> You really have to be able to reduce your models to: "at some point in the future, this will happen," or "it will always be true from now on"

You really don't. It's not LTL. Abstraction/refinement relations are at the core of TLA.

> Or even floats [0]? It becomes challenging, and strings are a mess.

No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.

> TLA+ works for the problems it is suitable for; try to extend past that and it simply fails.

TLA+ can specify anything that could be specified in mathematics. That there is no predefined set of floats is no more a problem than the one physicists face because mathematics has no "built-in" concept for metal or temperature. TLA+ doesn't even have any built in notions of procedures, memory, instructions, threads, IO, variables in the programming sense, or, indeed programs. It is a mathematical framework for describing the behaviour of discrete or hybrid continuous-discrete dynamical systems, just as ODEs describe continuous dynamical systems.

But you're talking about the verification tools you can run on a TLA+ spec, and like all verification tools, they have their limitations. I never claimed otherwise.

You are, however, absolutely right that it's difficult to specify probabilistic properties in TLA+.

Comment by hwayne 3 hours ago

> No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.

I think it's disingenuous to say that TLA+ verifiers "may or may not have limitations" wrt floats when none of the available tools support floats. People should know going in that they won't be able to verify specs with floats!

Comment by nextos 4 hours ago

There are excellent probabilistic extensions to temporal logic out there that are very useful to uncover subtle performance bugs in protocol specifications, see e.g. what PRISM [1] and Storm [2] implement. That is not within the scope of TLA+.

Formal methods are really broad, ranging from lightweight type systems to theorem proving. Some techniques are fantastic for one type of problem but fail at others. This is quite natural, the same thing happens with different programming paradigms.

For example, what is adequate for a hard real-time system (timed automata) is useless for a typical CRUD application.

[1] https://www.prismmodelchecker.org

[2] https://www.stormchecker.org

Comment by hwayne 3 hours ago

I really do wish that PRISM would one day add some quality-of-life features like "strings" and "functions"

(Then again, AIUI it's basically a thin wrapper over stochastic matrices, so maybe that's asking too much...)

Comment by eru 3 hours ago

What you said certainly works, but I'm not sure computability is actually the biggest issue here?

Have a look at how SAT solvers or Mixed Integer Linear Programming solvers are used.

There you specify a clear goal (with your code), and then you let the solvers run. You can, but you don't need to, let the solvers run all the way to optimality. And the solvers are also allowed to use all kinds of heuristics to find their answers, but that doesn't impact the statement of your objective.

Compare that to how many people write code without solvers: the objective of what your code is trying to achieve is seldom clearly spelled out, and is instead mixed up with the how-to-compute bits, including all the compromises and heuristics you make to get a reasonable runtime or to accommodate some changes in the spec your boss asked for at the last minute.

Using a solver ain't formal verification, but it shows the same separation between spec and implementation.

Another benefit of formal verification, which you already imply: your formal spec doesn't have to fully determine the behaviour of your software, and you can have multiple specs simultaneously. But you can only have a single implementation active at a time (even if you use a high-level implementation language).

So you can add 'handling a user request must terminate in finite time' as a (partial) spec. It's an important property, but it tells you almost nothing about the required business logic. In addition you can add "users shouldn't be able to withdraw more than they deposited" (and other more complicated rules), and you only have to review these rules once, and don't have to touch them again, even when you implement a clever new money transfer routine.
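
As a rough sketch of what such a reusable, implementation-agnostic rule can look like in plain Rust (the `Bank` trait and all the names here are made up for illustration):

```rust
use std::collections::HashMap;

// Hypothetical interface; any implementation, old or cleverly new, can plug in.
trait Bank {
    fn deposit(&mut self, user: u32, amount: u64);
    fn withdraw(&mut self, user: u32, amount: u64) -> bool; // false if refused
}

// The partial spec: over any sequence of operations, no user ends up having
// withdrawn more in total than they deposited. It never mentions *how*
// transfers are implemented, so swapping in a new routine doesn't touch it.
fn no_overdraw<B: Bank>(bank: &mut B, ops: &[(u32, i64)]) -> bool {
    let mut deposited: HashMap<u32, u64> = HashMap::new();
    let mut withdrawn: HashMap<u32, u64> = HashMap::new();
    for &(user, amount) in ops {
        if amount >= 0 {
            bank.deposit(user, amount as u64);
            *deposited.entry(user).or_default() += amount as u64;
        } else if bank.withdraw(user, amount.unsigned_abs()) {
            *withdrawn.entry(user).or_default() += amount.unsigned_abs();
        }
    }
    withdrawn
        .iter()
        .all(|(user, w)| *w <= *deposited.get(user).unwrap_or(&0))
}
```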

Comment by anon-3988 6 hours ago

> To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.

Do you really think it is going to be easier for the average developer to write a termination specification for their program

vs

giving them a framework or a language that does not have a for loop?

Edit: If by formal verification you mean type checking, then I very much agree.

Comment by DennisP 5 hours ago

Maybe it's difficult for the average developer to write a formal specification, but the point of the article is that an AI can do it for them.

Comment by socketcluster 7 hours ago

Yes. I feel like people who are trying to push software verification have never worked on typical real-world software projects where the spec is like 100 pages long and still doesn't fully cover all the requirements and you still have to read between the lines and then requirements keep changing mid-way through the project... Implementing software to meet the spec takes a very long time and then you have to invest a lot of effort and deep thought to ensure that what you've produced fits within the spec so that the stakeholder will be satisfied. You need to be a mind-reader.

It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.

Worst part is, after you do all the expensive work of formal verification, you end up proving the 'correctness' of a solution that the client doesn't want.

The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.

Without a hypothetical super-intelligence which understands everything about the world, the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such super-intelligence, we won't need formal verification, because the super-intelligence will be able to code perfectly on the first attempt; no need to verify.

And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.

Comment by DennisP 5 hours ago

Software verification has gotten some use for smart contracts. The code is fairly simple, it's certain to be attacked by sophisticated hackers who know the source, and the consequence of failure is theft of funds, possibly in large amounts. 100% test coverage is no guarantee that an attack can't be found.

People spend gobs of money on human security auditors who don't necessarily catch everything either, so verification easily fits in the budget. And once deployed, the code can't be changed.

Verification has also been used in embedded safety-critical code.

Comment by socketcluster 4 hours ago

If the requirements you have to satisfy arise out of a fixed, deterministic contract (as opposed to a human being), I can see how that's possible in this case.

I think the root problem may be that most software has to adapt to a constantly changing reality. There aren't many businesses which can stay afloat without ever changing anything.

Comment by qingcharles 5 hours ago

I used to work adjacent to a team who worked from closely-defined specs for web sites, and it used to infuriate the living hell out of me. The specs had all sorts of horrible UI choices and bugs and stuff that just plain wouldn't work when coded. I tried my best to get them to implement the intent of the spec, not the actual spec, but they had been trained in one method only and would not deviate at any cost.

Comment by socketcluster 5 hours ago

Yeah, IMO, the spec almost always needs refinement. I've worked for some companies where they tried to write specs with precision down to every word, but what happened is: if the spec was too detailed, it usually had to be adjusted later once it started to conflict with reality (efficiency, costs, security/access restrictions, resource limits, AI limitations)... If it wasn't detailed enough, then we had to read between the lines and fill in a lot of gaps... And we usually had to iterate with the stakeholder to get it right.

At most other companies, it's like the stakeholder doesn't even know what they want until they start seeing things on a screen... Trying to write a formal spec when literally nobody in the universe even knows what is required: that's physically impossible.

In my view, 'Correct code' means code that does what the client needs it to do. This is downstream from it doing what the client thinks they want; which is itself downstream from it doing what the client asked for. Reminds me of this meme: https://www.reddit.com/r/funny/comments/105v2h/what_the_cust...

Software engineers don't get nearly enough credit for how difficult their job is.

Comment by amw-zero 3 hours ago

How do you or the client know that the software is doing what they want?

Comment by robot-wrangler 6 hours ago

The whole perspective of this argument is hard for me to grasp. I don't think anyone is suggesting that formal specs are an alternative to code, they are just an alternative to informal specs. And actually with AI the new spin is that they aren't even a mutually exclusive alternative.

A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.

It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.

Comment by socketcluster 5 hours ago

> A bidirectional bridge that spans multiple representations from informal spec

Amusingly, what I'm hearing is literally "I have a bridge to sell you."

Comment by robot-wrangler 4 hours ago

There's always a bridge, dude. The only question is whether you want to buy one that's described as "a pretty good one, not too old, sold as is" or if you'd maybe prefer "spans X, holds Y, money back guarantee".

Comment by socketcluster 4 hours ago

I get it. Sometimes complexity is justified. I just don't feel this particular bridge is justified for 'mainstream software' which is what the article is about.

Comment by marcosdumay 6 hours ago

There are many really important properties to enforce even on the most basic CRUD system. You can easily say things like "an anonymous user must never edit any data, except for the create account form", or "every user authorized to see a page must be listed on the admin page that lists which users can see that page".

People don't verify those because it's hard, not for lack of value.
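
Here's a tiny Rust sketch of how the first property can be enforced by construction rather than checked after the fact (all the types and names are hypothetical, just to illustrate the pattern):

```rust
// Only an authenticated user can mint an edit capability.
struct AuthenticatedUser { id: u64 }

// An EditToken can only be obtained from an AuthenticatedUser...
struct EditToken<'a> { user: &'a AuthenticatedUser }

impl AuthenticatedUser {
    fn edit_token(&self) -> EditToken<'_> {
        EditToken { user: self }
    }
}

// ...and every mutating operation demands one, so "an anonymous user must
// never edit any data" has no counterexample: anonymous code simply has no
// way to construct the required argument. (The create-account exception
// would be the one endpoint that doesn't take a token.)
fn update_record(record_id: u64, token: EditToken<'_>) {
    println!("user {} edits record {}", token.user.id, record_id);
}

fn main() {
    let alice = AuthenticatedUser { id: 42 };
    update_record(7, alice.edit_token());
}
```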

Comment by nextos 6 hours ago

Yes, in fact there is research on type systems to ensure information flow control, avoiding unauthorized data access by construction.

Concrete Semantics [1] has a little example in §9.2.

[1] http://concrete-semantics.org/concrete-semantics.pdf

Comment by bkettle 6 hours ago

Yeah fair enough. I can definitely see the value of property-based verification like this and agree that useful properties could be easy to express and that LLMs could feasibly verify them. I think full verification that an implementation implements an entire spec and nothing else seems much less practical even with AI, but of course that is just one flavor of verification.

Comment by rmah 5 hours ago

Yes, except their cookie preferences to comply with european law. Oh, and they should be able to change their theme from light/dark but only that. Oh and maybe this other thing. Except in situations where it would conflict with current sales promotions. Unless they're referred by a reseller partner. Unless it's during a demo, of course. etc, etc, etc.

This is the sort of reality that a lot of developers in the business world deal with.

Comment by amw-zero 3 hours ago

Compare the spec with the application here: https://concerningquality.com/model-based-testing/

I think we've become used to the complexity in typical web applications, but there's a difference between familiar and simple (simple vs. easy, as it were). The behavior of most business software can be very simply expressed using simple data structures (sets, lists, maps) and simple logic.

No matter how much we simplify it, via frameworks and libraries or what have you, things like serialization, persistence, asynchrony, concurrency, and performance end up complicating the implementation. Comparing this against a simpler spec is quite nice in practice - and a huge benefit is that now you can consult a simple in-memory spec vs. worrying about distributed system deployments.

Comment by giancarlostoro 4 hours ago

> especially when we currently use all kinds of frameworks and libraries that already abstract away implementation details.

This is my issue with algorithm driven interviewing. Even the creator of Homebrew got denied by Google because he couldn't invert a binary tree on a whiteboard, or whatever it was. He made a tool used by millions of developers, but apparently that's not good enough.

Comment by StilesCrisis 3 hours ago

Google denies qualified people all the time. They would much rather reject a great hire than take a risk on accepting a mediocre one. I feel for him but it's just the nature of the beast. Not everyone will get in.

Comment by UltraSane 2 hours ago

AWS has said that having formal verification of code lets them be more aggressive in optimization while being confident it still adheres to the spec. They claim they were able to double the speed of IAM API auth code this way.

Comment by simonw 8 hours ago

I'm convinced now that the key to getting useful results out of coding agents (Claude Code, Codex CLI etc) is having good mechanisms in place to help those agents exercise and validate the code they are writing.

At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

The next step up from that is a good automated test suite.

Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.

Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.

I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.

If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.

Comment by roadside_picnic 7 hours ago

I very much agree, and believe using languages with powerful type systems could be a big step in this direction. Most people's first experience with Haskell is "wow this is hard to write a program in, but when I do get it to compile, it works". If this works for human developers, it should also work for LLMs (especially if the human doesn't have to worry about the 'hard to write a program' part).

> The next step up from that is a good automated test suite.

And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
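
A minimal sketch of the kind of property I mean, written here with Rust's proptest crate rather than QuickCheck (the nibble codec is a made-up stand-in for any real serialization layer):

```rust
use proptest::prelude::*;

// Toy encoder/decoder: each byte is split into two nibbles and rejoined.
fn encode(data: &[u8]) -> Vec<u8> {
    data.iter().flat_map(|&b| [b >> 4, b & 0x0f]).collect()
}

fn decode(encoded: &[u8]) -> Vec<u8> {
    encoded.chunks(2).map(|pair| (pair[0] << 4) | pair[1]).collect()
}

proptest! {
    // One property, many generated inputs: decoding an encoded value
    // must always give back the original bytes.
    #[test]
    fn round_trip(data in proptest::collection::vec(any::<u8>(), 0..256)) {
        prop_assert_eq!(decode(&encode(&data)), data);
    }
}
```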

The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.

Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.

Comment by nextos 7 hours ago

That's my opinion as well. Some functional language, that can also offer access to imperative features when needed, plus an expressive type system might be the future.

My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.

In fact, there are already serious industrial efforts to generate Dafny using LLMs.

Besides, some of the largest verification efforts have been achieved with this language [1].

[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf

Comment by astrostl 4 hours ago

This is why I use Go as much as reasonably possible with vibe coding: types, plus great quality-checking ecosystem, plus adequate training data, plus great distribution story. Even when something has stuff like JS and Python SDKs, I tend to skip them and go straight to the API with Go.

Comment by justatdotin 3 hours ago

I love ML types, but I've concluded they serve humans more than they do agents. I'm sure it affects the agent, maybe just not as much as other choices.

I've noticed real advantages of functional languages to agents, for disposable code. Which is great, cos we can leverage those without dictating the human's experience.

I think the correct way forward is to choose whatever language the humans on your team agree is most useful. For my personal projects, that means a beautiful language for the bits I'll be touching, and whatever gets the job done elsewhere.

Comment by bcrosby95 6 hours ago

Ada when?

It even lets you separate implementation from specification.

Comment by jaggederest 5 hours ago

Even going beyond Ada into dependently typed languages like (quoth wiki) "Agda, ATS, Rocq (previously known as Coq), F*, Epigram, Idris, and Lean"

I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.

Comment by ManuelKiessling 7 hours ago

Isn't it funny how that's exactly the kind of stuff that helps a human developer be successful and productive, too?

Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers "Well, here's the codebase, that's all I'll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!"

Comment by Wowfunhappy 7 hours ago

> Isn't it funny how that's exactly the kind of stuff that helps a human developer be successful and productive, too?

I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.

Comment by victorbjorklund 7 hours ago

AI can do that too? If you have a web app it can use playwright to test functionality and take screenshots to see if it looks right.

Comment by Wowfunhappy 6 hours ago

Yeah, but it doesn't work nearly as well. The AI frequently misinterprets what it sees. And it isn't as good at actually using the website (or app, or piece of hardware, etc) as a human would.

Comment by UncleEntity 4 hours ago

I've been using Claude to implement an ISO specification, and I have to keep telling it that we're not interested in whether the repl is correct - the test suite is what ensures the implementation correctly follows the spec. But when we're tracking down why a test is failing, it'll go to town using the repl to narrow down what code path is causing the issue. The only reason there even is a repl at this point is so it can do its 'spray and pray' debugging outside the code; Claude constantly tried to use it to debug issues, so I gave in and had it write a pretty basic one.

Horses for courses, I suppose. Back in the day, when I wanted to play with some C(++) library, I'd quite often write a Python C-API extension so I could do the same thing using Python's repl.

Comment by Capricorn2481 4 hours ago

But then the AI would theoretically have to write the playwright code. How does it verify it's getting the right page to begin with?

Comment by simonw 3 hours ago

The recent models are pretty great at this. They read the source code for e.g. a Python web application and use that to derive what the URLs should be. Then they fire up a localhost development server and write Playwright scripts to interact with those pages at the predicted URLs.

The vision models (Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) can even take screenshots via Playwright and then "look at them" with their vision capabilities.

It's a lot of fun to watch. You can tell them to run Playwright not in headless mode at which point a Chrome window will pop up on your computer and you can see them interact with the site via it.

Comment by mccoyb 8 hours ago

Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long running harnesses post.

I don't go into Claude without everything already set up. Codex helps me curate the plan and curate the issue tracker (one instance). Claude gets a command to fire up into context and grab an issue - it implements it, and then Codex and Gemini review independently.

I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.

This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.

I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.

Comment by idiotsecant 7 hours ago

How expensive is incredibly expensive?

Comment by mccoyb 6 hours ago

If you're maxing out the plans across the platforms, that's 600 bucks -- but if you think about your usage and optimize, I'm guessing somewhere between 200-600 dollars per month.

Comment by jazzyjackson 6 hours ago

It's pretty easy to hit a couple hundred dollars a day filling up Opus's context window with files. This is via Anthropic API and Zed.

Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.

I can't bear to go in circles with Sonnet when Opus can just one shot it.

Comment by fragmede 5 hours ago

The $200/month Max plan has limits, but stacking a couple of those seems way cheaper than $50/hr for the ~172 work hours in a month.

Comment by rmah 5 hours ago

I think the main limitation is not code validation but assumption verification. When you ask an LLM to write some code based on a few descriptive lines of text, it is, by necessity, making a ton of assumptions. Oddly, none of the LLMs I've seen ask for clarification when multiple assumptions might all be likely. Moreover, from the behavior I've seen, they don't really backtrack to select a new assumption based on further input (I might be wrong here, it's just a feeling).

What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AIs can't read your mind (yet), their assumptions will probably not precisely match your assumptions unless the task is very limited in scope.

Comment by crazygringo 4 hours ago

> Oddly, none of the LLMs I've seen ask for clarification when multiple assumptions might all be likely.

It's not odd, they've just been trained to give helpful answers straight away.

If you tell them not to make assumptions and to rather first ask you all their questions together with the assumptions they would make because you want to confirm before they write the code, they'll do that too. I do that all the time, and I'll get a list of like 12 things to confirm/change.

That's the great thing about LLMs -- if you want them to change their behavior, all you need to do is ask.

Comment by Davidzheng 6 hours ago

OK, but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can be used as a training signal to improve the agents themselves. So this means the current capability levels are certainly not going to remain for very long (which is also what I expect, but I would like to point out that it's also supported by this).

Comment by oxag3n 7 hours ago

Where would they get training data?

Source code generation is possible due to large training set and effort put into reinforcing better outcomes.

I suspect debugging is not that straightforward to LLM'ize.

It's a non-sequential interaction - when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in logs, and it would have to associate that with another abstraction.

I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried ChatGPT, and it was hopeless - try this, add this print, etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where I learned that specific combination of techniques, each individually well documented, but never in combination - it was learned from other people and my own experience.

Comment by jimmaswell 7 hours ago

How long ago was this? I've had outstandingly impressive results asking Copilot Chat with Sonnet 4.5 or ChatGPT to debug difficult multithreaded C++.

Comment by oxag3n 7 hours ago

Few months back with ChatGPT 5. Multi-threaded Rust & custom async runtime, data integrity bug, reproduced every ~5th run.

Comment by simonw 7 hours ago

Have you tried running gdb from a Claude Code or Codex CLI session?

Comment by oxag3n 6 hours ago

No, I'm in academia and the goal is not code or a product launch. I find the research process struggles a lot once someone solves a problem instead of you.

I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.

Comment by RA_Fisher 5 hours ago

LLMs are okay at bisecting programs and identifying bugs in my experience. Sometimes they require guidance but often enough I can describe the symptom and they identify the code causing the issue (and recommend a fix). They’re fairly methodical, and often ask me to run diagnostic code (or do it themselves).

Comment by anon-3988 6 hours ago

> I suspect debugging is not that straightforward to LLM'ize.

Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.

Comment by christophilus 7 hours ago

Debugging has been excellent for me with Opus 4.5 and Claude Code.

Comment by fragmede 6 hours ago

> Where they'd get training data?

They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.

Comment by simianwords 2 hours ago

Claude code and other AI coding tools must have a * mandatory * hook for verification.

For the front end, the verification is making sure that the UI looks as expected (can be verified by an image model) and that clicking certain buttons results in certain things (can be verified by a ChatGPT agent, but it's not public, I guess).

For back end it will involve firing API requests one by one and verifying the results.

To make this easier, we need to somehow provide an environment for Claude or whatever agent to run these verifications in, and this is the gap that is missing. Claude Code and Codex should now start shipping verification environments that make it easy for them to verify frontend and backend tasks, and I think Antigravity already helps a bit here.

------

The thing about backend verification is that it is different in different companies and requires a custom implementation that can't easily be shared across companies. Each company has its own way to deploy stuff.

Imagine a concrete task like creating a new service that reads from a data stream, runs transformations, puts it in another data stream where another new service consumes the transformed data and puts it into an AWS database like Aurora.

``` stream -> service (transforms) -> stream -> service -> Aurora ```

To one-shot this with Claude Code, it must know everything about the company:

- how does one consume streams in the company? Schema registry?

- how does one create a new service and register dependencies? how does one deploy it to test environment and production?

- how does one even create an Aurora DB? request approvals and IAM roles etc?

My question is: what would it take for Claude Code to one-shot this? At the code level it is not too hard and can fit in the context window easily, but the * main * problem is the fragmented processes for creating the infra and operations behind it, which are human-based now (and need not be!).

-----

My prediction is that companies will make something like a new "agent" environment where all these processes (that used to require a human) can be done by an agent without human intervention.

I'm thinking of other solutions here, but if anyone can figure it out, please tell!

Comment by CobrastanJorji 7 hours ago

I might go further and suggest that the key to getting useful results out of HUMAN coding agents is also to have good mechanisms in place to help them exercise and validate the code.

We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.

Comment by QuercusMax 8 hours ago

I've only done a tiny bit of agent-assisted coding, but without the ability to run tests the AI will really go off the rails super quick, and it's kinda hilarious to watch it say "Aha! I know what the problem is!" over and over as it tries different flavors until it gives up.

Comment by jijijijij 7 hours ago

> At the most basic level this means making sure they can run commands to execute the code

Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...

Comment by rodphil 4 hours ago

> At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus when building a particular service I already have all of the downstream services (the dependencies) built and running so a lot of dependency management and integration challenges disappear. Only trying this out at a small scale as yet, but it's fascinating how the LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.

(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)

My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.

Comment by jcranmer 3 hours ago

> My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.

There is still a cost to having bugs, even if deploying fixes becomes much cheaper. Especially if your plan is to wait until they actually occur in practice to discover that you have a bug in the first place.

Put differently: would you want the app responsible for your payroll to be developed in this manner? Especially considering that the bug in question would be "oops, you didn't get paid."

Comment by pron 6 hours ago

Maybe in the short term, but that doesn't solve some fundamental problems. Consider NP problems: problems whose solutions can be easily verified. But the fact that they can all be easily verified does not (as far as we know) mean they can all be easily solved. If we look at the P subset of NP, the problems that can be easily solved, then easy verification is no longer their key feature. Rather, it is something else that distinguishes them from the harder problems in NP.

So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.

Longer term, I think it's more important to understand what's hard and what's easy for agents.

Comment by thomasfromcdnjs 5 hours ago

shameless plug: I'm working on an open source project https://blocksai.dev/ to attempt to solve this. (and just added a note for me to add formal verification)

Elevator pitch: "Blocks is a semantic linter for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift. Update the code or update the spec, up to human or agent."

(you can add traditional linters to the process if you want but not necessary)

The gist being you define a bunch of validators for a collection of modules you're building (with agentic coding), with a focus on qualifying semantic things:

- domain / business rules/measures

- branding

- data flow invariants — "user data never touches analytics without anonymization"

- accessibility

- anything you can think of

Then you just tell your agentic coder to use the cli tool before committing, so it keeps the code in line with your engineering/business/philosophical values.

(boring) example of it detecting if blog posts have humour in them, running in Claude Code -> https://imgur.com/diKDZ8W

Comment by zahlman 7 hours ago

One objection: all the "don't use --yolo" training in the world is useless if a sufficiently context-poisoned LLM starts putting malware in the codebase and getting to run it under the guise of "unit tests".

Comment by ramoz 7 hours ago

Claude Code hooks is a great way to integrate these things

Comment by htrp 7 hours ago

Better question is which tools at what level

Comment by WhyOhWhyQ 7 hours ago

I've tried getting claude to set up testing frameworks, but what ends up happening is it either creates canned tests, or it forgets about tests, or it outright lies about making tests. It's definitely helpful, but feels very far from a robust thing to rely on. If you're reviewing everything the AI does then it will probably work though.

Comment by simonw 7 hours ago

Something I find helps a lot is having a template for creating a project that includes at least one passing test. That way the agent can run the tests at the start using the correct test harness and then add new tests as it goes along.

I use cookiecutter for this, here's my latest Python library template: https://github.com/simonw/python-lib

Comment by agumonkey 8 hours ago

gemini and claude do that already IIUC, self debugging iterations

Comment by dionian 7 hours ago

you've done some great articles on this topic and my experience aligns with your view completely.

Comment by formerly_proven 8 hours ago

Not so sure about formal verification, though. IME, with Rust, LLM agents tend to struggle with semi-complex ownership or trait issues and will typically reach for unnecessary/dangerous escape hatches ("unsafe impl Send for ..." instead of using the correct locks, for example) fairly quickly. Or just conclude the task is impossible.

> automatic code formatters

I haven't tried this because I assumed it'll destroy agent productivity and massively increase the number of tokens needed, because you're changing the file out from under the LLM and it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with a zero diff under autoformatting?

But in general of course anything that's helpful for human developers to be more productive will also help LLMs be more productive. For largely identical reasons.

Comment by simonw 7 hours ago

I'm finding my agents generate code that conforms to Black quite effectively, I think it's probably because I usually start them in existing projects that were already formatted using Black so they pick up those patterns.

Comment by adverbly 4 hours ago

This smells like a Principia Mathematica take to me...

Reducing the problem to "ya just create a specification to formally verify" doesn't move the needle enough to me.

When it comes to real-world, pragmatic, boots-on-the-ground engineering and design, we are so far from even knowing the right questions to ask. I just don't buy it that we'd see huge mainstream productivity changes even if we had access to a crystal ball.

It's hilarious how close we're getting to The Hitchhiker's Guide to the Galaxy though. We're almost at that phase where we ask what the question is supposed to be.

Comment by amw-zero 3 hours ago

When you go to write a line of code, how do you decide what to write?

Comment by signa11 56 minutes ago

> When you go to write a line of code, how do you decide what to write?

depends of course on what I am writing for: a feature, a bugfix, a refactor...?

Comment by adverbly 1 hour ago

Honestly? I usually look at the previous implementation and try to make some changes to fix an issue that I discovered during testing. Rarely an actual bug - usually we just changed our mind about what the intent should be.

Comment by infruset 5 hours ago

I was waiting for a post like this to hit the front page of Hacker News any day. Ever since Opus 4.5 and GPT 5.2 came out (mere weeks ago), I've been writing tens of thousands of lines of Lean 4 in a software engineering job and I feel like we are on the eve of a revolution. What used to take me 6 months of work when I was doing my PhD in Coq (now Rocq), now takes from a few hours to a few days. Whole programming languages can get formalized executable semantics in little time. Lean 4 already has a gigantic amount of libraries for math but also for computer science; I expect open source projects to sprout with formalizations of every language, protocol, standard, algorithm you can think of.

Even if you have never written formal proofs but are intrigued by them, try asking a coding agent to do some basic verification. You will not regret it.

Formal proof is not just about proving stuff, it's also about disproving stuff, by finding counterexamples. Once you have stated your property, you can let quickcheck/plausible attack it, possibly helped by a suitable generator which does not have to be random: it can be steered by an LLM as well.

Even further, I'm toying with the idea of including LLMs inside the formalization itself. There is an old and rich idea in the domain of formal proof, that of certificates: rather than proving that the algorithm that produces a result is correct, just compute a checkable certificate with untrusted code and verify it is correct. Checkable certificates can be produced by unverified programs, humans, and now LLMs. Properties, invariants, can all be "guessed" without harm by an LLM and would still have to pass a checker. We have truly entered an age of oracles. It's not halting-problem-oracle territory of course, but it sometimes feels pretty close for practical purposes. LLMs are already better at math than most of us and certainly than me, and so any problem I could plausibly solve on my own, they will do faster without my having to wonder if there is a subtle bug in the proof. I still need to look at the definitions and statements, of course, but my role has changed from finding to checking. Exploring the space of possible solutions is now mostly done better and faster by LLMs. And you can run as many in parallel as you can keep up with, in attention and in time (and money).
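
A toy version of the certificate idea, sketched in Rust (the factoring "oracle" stands in for any untrusted producer, LLM included; the numbers and names are made up):

```rust
// Untrusted search: an LLM, a heuristic, a human guess -- we never rely on
// this code being correct, only on the cheap check below.
fn untrusted_find_factor(n: u64) -> Option<u64> {
    (2..n).find(|&d| n % d == 0)
}

// The only trusted code: if this accepts, `n` really is composite,
// no matter how the certificate (the factor) was produced.
fn check_certificate(n: u64, factor: u64) -> bool {
    factor > 1 && factor < n && n % factor == 0
}

fn main() {
    let n = 8051;
    if let Some(factor) = untrusted_find_factor(n) {
        assert!(check_certificate(n, factor));
        println!("{n} = {factor} * {}", n / factor);
    } else {
        println!("{n} appears to be prime (no certificate found)");
    }
}
```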

If anyone else is as excited about all this as I am, feel free to reach out in comments, I'd love to hear about people's projects !

Comment by jrowen 4 hours ago

This is perhaps only tangentially related to formal verification, but it made me wonder - what efforts are there, if any, to use LLMs to help with solving some of the tough questions in math and CS (P=NP, etc)? I'd be curious to know how a mathematician would approach that.

Comment by infruset 4 hours ago

So as for math of that level, (the best) humans are still kings by far. But things are moving quickly and there is very exciting human-machine collaboration, one need only look at recent interviews of Terence Tao!

Comment by qingcharles 4 hours ago

I agree. I think we've gotta get through the rough couple of "AI slop" years of code and we'll come out of it the other side with some incredible tools.

The reason we don't all write code to the level that can operate the Space Shuttle is because we don't have the resources and the projects most of us work on all allow some wiggle room for bugs since lives generally aren't at risk. But we'd all love to check in code that was verifiably bug-free, exploit-free, secure etc if we could get that at a low, low price.

Comment by MobiusHorizons 4 hours ago

at some level it's not really an engineering issue. "bug free" requires that there is some external known goal with sufficient fidelity that it can classify all behaviors as "bug" or "not bug". This really doesn't exist in the vast majority of software projects. It is of course occasionally true that programmers are writing code that explicitly doesn't meet one of the requirements they were given, but most of the time the issue is that nothing was specified for certain cases, so code does whatever was easiest to implement. It is only when encountering those unspecified cases (either via a user report, or product demo, or manual QA) that the behavior is classified as "bug" or "not bug".

I don't see how AI would help with that even if it made writing code completely free. Even if the AI is writing the spec and fully specifies all possible outcomes, the human reviewing it will glance over the spec and approve it, only to change their mind when confronted with the actual behavior or user reports.

Comment by ajcp 3 hours ago

Having known nothing of this field before now I have to say your excitement has me excited!

Comment by nomadygnt 4 hours ago

Where do you work that you get to write Lean? That sounds awesome!

Comment by infruset 4 hours ago

I can't disclose that, but what I can say is no one at my company writes Lean yet. I'm basically experimenting with formalizing in Lean stuff I normally do in other languages, and getting results exciting enough I hope to trigger adoption internally. But this is bigger than any single company!

Comment by the_duke 27 minutes ago

I've been preaching similar thoughts for the last half year.

Most popular programming languages are optimized for human convenience, not for correctness! Even most of the popular typed languages (Java/Kotlin/Go/...) have a wide surface area for misuse that is not caught at compile time.

Case in point: In my experience, LLMs produce correct code way more regularly for Rust than for Js/Ts/Python/... . Rust has a very strict type system. Both the standard library and the whole library ecosystem lean towards strict APIs that enforce correctness, prevent invalid operations, and push towards handling or at least propagating errors.

The AIs will often write code that won't compile initially, but after a few iterations with the compiler the result is often correct. Strong typing also makes it much easier to validate the output when reviewing.
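
A trivial example of what I mean by the ecosystem pushing errors into the type system (nothing project-specific, just std):

```rust
use std::num::ParseIntError;

// The failure mode is part of the signature: a caller gets a Result and
// has to handle or propagate the error (`?`) before it can use the port.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    let port: u16 = raw.trim().parse()?; // junk or out-of-range input becomes an Err
    Ok(port)
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port: {e}"),
    }
}
```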

With AIs being able to do more and more of the implementation, the "feel-good" factor of languages will become much less relevant. Iteration speed is not so important when parallel AI agents do the "grunt work". I'd much rather wait 10 minutes for solid output rather than 2 minutes for something fragile.

We can finally move the industry away from wild-west languages like Python/JS and towards more rigorous standards.

Rust is probably the sweet spot at the moment, thanks to it being semi-popular with a reasonably active ecosystem; sadly, I don't think the right language exists at the moment.

What we really want is a language with a very strict, comprehensive type system with dependent types, maybe linear types, structured concurrency, and a built-in formal proof system.

Something like ADA/Spark, but more modern.

Comment by pron 7 hours ago

> it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.

There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.

Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.

I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.

It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of computational complexity classes.

[1]: https://news.ycombinator.com/item?id=46207505

Comment by thatoneengineer 6 hours ago

I disagree. Right now, feedback on correctness is a major practical limitation on the usefulness of AI coding agents. They can fix compile errors on their own, they can _sometimes_ fix test errors on their own, but fixing functionality / architecture errors takes human intervention. Formal verification basically turns (a subset of) functionality errors into compile errors, making the feedback loop much stronger. "Come up with what software we need better than us in the first place" is much higher on the ladder than that.

TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.

Comment by pron 6 hours ago

I completely agree it's appealing, I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant. In other words, I also want agents to be able to consistently prove software correct, but if they're actually able to accomplish that, then they could just as likely be able to do everything else in the production of that software (including gathering requirements and writing the spec) without me in the loop.

Comment by UncleEntity 5 hours ago

> I also want agents to be able to consistently prove software correct...

I know this is just an imprecision of language thing but they aren't 'proving' the software is correct but writing the proofs instead of C++ (or whatever).

I had a bit of a discussion with one of them about this a while ago to determine the viability of having one generate the proofs and use those to generate the actual code, just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite, but that was (maybe, I don't remember) in the context of proving existing software correct.

I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...

Comment by qingcharles 4 hours ago

It's getting better every day, though, at "closing the loop."

When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.

Comment by 1121redblackgo 7 hours ago

First human robot war is us telling the AI/robots 'no', and them insisting that insert technology here is good for us and is the direction we should take. Probably already been done, but yeah, this seems like the tipping point into something entirely different for humanity.

Comment by pron 7 hours ago

... if it's achievable at all in the near future! But we don't know that. It's just that if we assume AI can do X, why do we assume it cannot, at the same level of capability, also do Y? Maybe the tipping point where it can do both X and Y is near, but maybe in the near future it will be able to do neither.

Comment by BobSonOfBob 24 minutes ago

Formal verification at the “unit test” level seems feasible. At the system level of a modern application, the combinations of possible states will need a quantum computer to finish testing in this lifetime.

Comment by rgmerk 6 hours ago

(sarcasm on)

Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.

(sarcasm off).

Comment by qingcharles 4 hours ago

I mean, I don't disagree. Specs are usually horrible, way off the mark, outdated, and written by folks who don't understand how the rest of the vertical works. But, that's a problem for another day :)

Comment by brookman64k 39 minutes ago

How do you verify that your verification verifies the right thing? Couldn’t the LLM spit out a nice-looking but ultimately useless proof (boiling down to something like 1=1)? Also, in my experience software projects are full of incorrect, incomplete and constantly changing assumptions and requirements.

Comment by chhxdjsj 17 minutes ago

An excellent use case for this is Ethereum smart contract verification. People store millions of dollars in smart contracts that are probably one or two iterations of Claude or Gemini away from being pwned.

Comment by jameslk 7 hours ago

> As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.

Proofs never took off because most software engineering moved away from waterfall development, not just because proofs are difficult. Long formal specifications were abandoned since often those who wrote them misunderstood what the user wanted or the user didn’t know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet the user.

The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.

Comment by bugarela 5 hours ago

My team and I recently ran an experiment [1] that is pretty aligned with this idea. We took a complex change our colleagues wanted to make to a consensus engine and tried a workflow where Quint formal specifications sit in the middle, between prompts and code, and it worked out much better than we imagined. I'm personally very excited about the opportunities for formal methods in this new era.

[1]: https://quint-lang.org/posts/llm_era

Comment by heikkilevanto 5 hours ago

I admit I have never worked with it, but I have a strong feeling that a formal verification can only work if you have a formal specification. Fine and useful for a compiler, or a sorting library. But pretty far from most of the coding jobs I have seen in my career. And even more distant from "vibe coding" where the starting point is some vaguely defined free-text description of what the system might possibly be able to do...

Comment by tkel 2 hours ago

Agreed. Writing formal specifications is going to require more work from people, which is exactly the opposite of why people are excited to use LLMs.

Comment by pedrozieg 8 hours ago

I buy the economics argument, but I’m not sure “mainstream formal verification” looks like everyone suddenly using Lean or Isabelle. The more likely path is that AI smuggles formal-ish checks into workflows people already accept: property checks in CI, model checking around critical state machines, “prove this invariant about this module” buttons in IDEs, etc. The tools can be backed by proof engines without most engineers ever seeing a proof script.

The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.

Comment by mkleczek 41 minutes ago

The article only discusses reasons why formal verification is needed. It does not provide any information on how AI would solve the fundamental issues that make it difficult: https://pron.github.io/posts/correctness-and-complexity

Comment by andy99 7 hours ago

Maybe a stupid question, how do you verify the verification program? If an LLM is writing it too, isn’t it turtles all the way down, especially with the propensity of AI to modify tests so they pass?

Comment by crvdgc 5 hours ago

> how do you verify the verification program?

The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps are just basic logic rules (a tiny Lean sketch follows the list below). People can gain more confidence in its validity by:

- Reading it very carefully (doable since it's very small)

- Having multiple independent implementations and compare the results

- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
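
To make that concrete, here is a tiny Lean 4 example (illustrative only): the kernel accepts it because each step is an application of basic logic rules, nothing else.

    -- A minimal proof the kernel can check step by step: conjunction commutes.
    -- The anonymous constructor ⟨_, _⟩ rebuilds the pair from the hypothesis.
    theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
      fun h => ⟨h.2, h.1⟩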

Comment by pegasus 7 hours ago

The proof is verified mechanically - it's very easy to verify that a proof is correct; what's hard is coming up with the proof in the first place (the same check-easy/find-hard asymmetry as NP problems). There can still be gotchas, especially if the statement proved is complex, but it does help a lot in keeping bugs away.

Comment by worldsayshi 7 hours ago

How often can the hardness be exchanged with tediousness though? Can at least some of those problems be solved by letting the AI try until it succeeds?

Comment by jdub 2 hours ago

To simplify for a moment, consider asking an LLM to come up with tests for a function. The tests pass. But did it come up with exhaustive tests? Did it understand the full intent of the function? How would it know? How would the operator know? (Even if it's wrangling simpler iterative prop/fuzz/etc testing systems underneath...)

Verification is substantially more challenging.

Currently, even for an expert in the domains of the software to be verified and the process of verification, defining a specification (even partial) is both difficult and tedious. Try reading/comparing the specifications of e.g. a pure crypto function, then a storage or clustering algorithm, then seL4.

(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)

Comment by newAccount2025 5 hours ago

There is tons of work on this question. Super recent: https://link.springer.com/article/10.1007/s10817-025-09743-8

Comment by rocqua 8 hours ago

I've been trying to convince others of this, and gotten very little traction. One interesting response I got from someone very familiar with the Tamarin prover was that there just wasn't enough example code out there.

Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.

Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover, the built-in guarantees of Rust seem like a great start for creating more ergonomic proof systems. Here I am both in awe of Kani and disappointed by it. The awe is for the good work put into making things more ergonomic. The disappointment is the use of bounded model checking for formal analysis, which can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning: that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to rule out mutable aliasing. Now you can lean on the borrow checker for that!
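
For readers who haven't used Kani: below is a minimal, hedged sketch of what a bounded harness looks like, assuming the `cargo kani` toolchain. `#[kani::proof]` and `kani::any()` are Kani's actual entry points; the function and the property are invented for illustration.

    // Minimal Kani-style harness: `cargo kani` explores all inputs
    // symbolically instead of sampling them like a unit test would.
    fn clamp_to_u8(x: u32) -> u8 {
        if x > 255 { 255 } else { x as u8 }
    }

    #[cfg(kani)]
    #[kani::proof]
    fn clamp_matches_spec() {
        let x: u32 = kani::any();             // symbolic input: every u32 at once
        let y = clamp_to_u8(x);
        assert_eq!(u32::from(y), x.min(255)); // checked for all inputs
    }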

Comment by nomadygnt 1 hour ago

Another cool tool being developed for Rust is Verus. It's not the same as Kani (it's more of a fork of the Rust compiler), but it lets you write verification proofs that are discharged by the Z3 SMT solver. It's a really nice system for building verified programs.

Comment by jacquesm 7 hours ago

This is a tough thing to do: predictions that are premised on the invention of something that does not exist today, or that does not exist in the required form, hinge on an unknown. Some things you can imagine but they will likely never exist (time travel, ringworlds, space elevators) and some hinge on one thing that still needs to be done before you can have that particular future come true. If the thing you need is 'AGI' then all bets are off. It could be next week, next month, next year or never.

This is - in my opinion - one of those. If an AI is able to formally verify with the same rigor as a system designed specifically for that purpose, I think that would require AGI rather than a simpler version of it. The task is complex enough that present-day AIs would generate as much noise as they would signal.

Comment by hintymad 5 hours ago

Maybe we can define what "mainstream" means? Maybe this is too anecdotal, but my personal experience is that most engineers are tweakers: they love building stuff and are good at it, but they simply are not into math-like rigorous thinking. Heck, it's hard to even motivate them to use basic math like queuing theory and stats in their day-to-day work. I highly doubt they would spend time picking up formal verification, even with the help of AI.

Comment by ozgrakkurt 20 minutes ago

You can’t even compile regular code at a satisfying speed.

Comment by manindahat 6 hours ago

At best, not reading generated code results in a world where no humans are able to understand our software, and we regress back to the stone age after some critical piece breaks, and nobody can fix it.

At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.

Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.

Comment by vlovich123 6 hours ago

If you believe the sentient AI is that capable and intention based to replicate itself, what’s stopping it from trying to engage in underhanded code where it comments and writes everything in ways that look fine but actually have vulnerabilities it can exploit to achieve what it wants? Or altering the system that runs your code so that the code that gets deployed is different.

Comment by inimino 1 hour ago

I both agree and disagree. Yes, AI will democratize access to formal methods and will probably increase the adoption of them in areas where they make sense (e.g. safety-critical systems), but no, it won't increase the areas where formal methods are appropriate (probably < 1% of software).

What will happen instead is a more general application of AI systems to verifying software correctness, which should lead to more reliable software. The bottleneck in software quality is in specifying what the behavior needs to be, not in validating conformance to a known specification.

Comment by arjie 7 hours ago

Interesting prediction. It sort of makes sense. I have noticed that LLMs are very good at solving problems whose solutions are easy to check[0]. It ends up being quite an advantage to be able to work on such problems because rarely does an LLM truly one-shot a solution through token generation. Usually the multi-shot is 'hidden' in the reasoning tokens, or for my use-cases it's usually solved via the verification machine.

A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.

Regardless, I often use coding assistants in that manner:

1. First, I use the assistant to come up with the success condition program

2. Then I use the assistant to solve the original problem by asking it to check with the success condition program

3. Then I check the solution myself

It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.

0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...

Comment by degamad 6 hours ago

> 1. First, I use the assistant to come up with the success condition program

> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program

This sounds a lot like Test-Driven Development. :)

Comment by arjie 6 hours ago

It is the exact same flow. I think a lot of things in programming follow that pattern. The other one I can think of is identifying the commit that introduces a regression: write the test program, git-bisect.

Comment by imiric 7 hours ago

The issue is that many problems aren't easy to verify, and LLMs also excel at producing garbage output that appears correct on the surface. There are fields of science where verification is a long and arduous process, even for content produced by humans. Throwing LLMs at these problems can only produce more work for a human to waste time verifying.

Comment by arjie 6 hours ago

Yes, that is true. And for those problems, those who use LLMs will not get very far.

As for those who use LLMs to impersonate humans - where the verification in question is checking that a solution purported to be built by a human actually works - I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.

Comment by dkga 1 hour ago

As an economist I completely agree with the post. In fact, I assume we will see an explosion of formal proof requirements in code format by journals in a few years time. Not to mention this would make it much easier and faster to publish papers in Economic Theory, where checking the statements and sometimes the proofs simply takes a long time from reviewers.

Comment by rf15 56 minutes ago

Isn't formal verification a "just write it twice" approach with different languages? (and different logical constraints on the way you write the languages)

Comment by 0xWTF 22 minutes ago

And this is the year of the Linux desktop...

Comment by blurbleblurble 5 hours ago

Stuffing this awesome headline through the Curry-Howard isomorphism: LLMs produce better code when the type checker gives more useful feedback.

Comment by zmmmmm 3 hours ago

I dunno about formal verification, but for sure it's brought me back to a much more TDD first style of coding. You get so much mileage from having it first implement tests and then run them after changes. The key is it lowers the friction so much in creating the tests as well. It's like a triple bottom line.

Comment by hedayet 2 hours ago

I can see the inspiration; but then again, how much investment will be required to verify the verifier? (It's still code, and it's generated by a non-deterministic system.)

Comment by agentultra 7 hours ago

You’re still going to need people to guide and understand the proofs.

Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.

Comment by ursAxZA 2 hours ago

The theory is sound, but the practical bridge is missing.

There’s a gap that current LLM architectures simply can’t cross yet.

Comment by throwaway613745 7 hours ago

GPT 5.2 can't even tell me how many rs are in garlic.

Comment by bglazer 7 hours ago

This is a very tiring criticism. Yes, this is true. But, it's an implementation detail (tokenization) that has very little bearing on the practical utility of these tools. How often are you relying on LLM's to count letters in words?

Comment by 1970-01-01 7 hours ago

The implementation detail is that we keep finding them! After this, it couldn't locate a seahorse emoji without freaking out. At some point we need to have a test: there are two drinks before you. One is water, the other is whatever the LLM thought you might like to drink after it completed refactoring the codebase. Choose wisely.

Comment by 101008 7 hours ago

It's an example that shows that if these models aren't trained in a specific problem, they may have a hard time solving it for you.

Comment by altruios 6 hours ago

An analogy is asking someone who is colorblind how many colors are on a sheet of paper. What you are probing isn't reasoning, it's perception. If you can't see the input, you can't reason about the input.

Comment by 9rx 1 hour ago

> What you are probing isn't reasoning, it's perception.

Its both. A colorblind person will admit their shortcomings and, if compelled to be helpful like an LLM is, will reason their way to finding a solution that works around their limitations.

But as LLMs lack a way to reason, you get nonsense instead.

Comment by Uehreka 7 hours ago

No, it’s an example that shows that LLMs still use a tokenizer, which is not an impediment for almost any task (even many where you would expect it to be, like searching a codebase for variants of a variable name in different cases).

Comment by 8note 7 hours ago

the question remains: is the tokenizer going to be a fundamental limit to my task? how do i know ahead of time?

Comment by worldsayshi 6 hours ago

Would it limit a person getting your instructions in Chinese? Tokenisation pretty much means that the LLM is reading symbols instead of phonemes.

This makes me wonder if LLMs work better in Chinese.

Comment by victorbjorklund 7 hours ago

No, it is the issue with the tokenizer.

Comment by andy99 7 hours ago

At this point if I was openAI I wouldn’t bother fixing this to give pedants something to get excited about.

Comment by blazingbanana 7 hours ago

Unless they fixed this in the last 25 minutes (possible?), it correctly counts 1 `r`.

https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...

Comment by iAMkenough 7 hours ago

The criticism would stop if the implementation issue was fixed.

It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?

Comment by desman 7 hours ago

This is like complaining that your screwdriver is bad at measuring weight.

If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.

Comment by bgwalter 7 hours ago

⌴⌴⌴

Comment by worldsayshi 7 hours ago

That's a problem that is at least possible for the LLM to perceive and learn through training, while counting letters is much more like asking a colour blind person to count flowers by colour.

Comment by charcircuit 5 hours ago

Why is it not possible? Do you think it's impossible to count? Do you think it's impossible to learn the letters in each token? Do you think combining the two is not possible?

Comment by 8note 6 hours ago

i could see formal verification become a key part of "the prompt is the code" so that as versions bump and so on, you can have an llm completely regenerate the code from scratch-ish and be sure that the spec is still followed

but i dont think people will suddenly gravitate towards using them because they're cheaper to write - bugs of the form "we had no idea this should be considered" are way more common than "we wrote code that didnt do what we wanted it to"

an alternative guess for LLMs and formal verification is that systems where formal verification is a natural fit - putting code in places that are hard to update and that have well-known conditions - will move faster.

i could also see agent tools embedding formal methods proofs into their tooling, so they write both the code and the spec at the same time, with the spec acting as memory. that kinda ties into the recent post about "why not have the LLM write machine code?"

Comment by lewisjoe 6 hours ago

A while back I did an experiment where I asked ChatGPT to imagine a new language that would be best for AI to write code in, and to list the properties of such a language. Interestingly, it spat out properties similar to today's functional, strongly typed (in fact dependently typed), formally verifiable proof languages, along with reasons why such a language is easier for AI to reason about and generate programs in. I found it fascinating because I expected something like TypeScript or Kotlin.

While I agree formal verification itself has its problems, I think the argument has merit because soon AI-generated code will surpass all human-generated code, and when that happens we at least need a way to prove that the code won't have security issues and that it adheres to compliance/policy.

Comment by WhyOhWhyQ 5 hours ago

AI-generated code is pretty far from passing for professional human-generated code, unless you're talking about snippets. Who should be writing mission-critical code in the next 10 years: the people currently doing it (with AI assistance), or e.g. some random team at Google using AI primarily? The answer is obvious.

Comment by zamadatix 4 hours ago

Some past discussion (excluding 0 comment submissions) to add:

https://news.ycombinator.com/item?id=46216274 - 4 comments, 6 days ago

https://news.ycombinator.com/item?id=46203508 - 1 comment, 7 days ago

https://news.ycombinator.com/item?id=46198874 - 2 comments, 8 days ago

Comment by plainOldText 7 hours ago

I guess it’s time to learn OCaml then.

It appears many of the proof assistants/verification systems can generate OCaml. Or perhaps ADA/Spark?

Regardless of how the software engineering discipline will change in the age of gen AI, we must aim to produce higher not lower quality software than whatever we have today, and formal verification will definitely help.

Comment by bob1029 6 hours ago

I don't know if it's "formal", but I think combining Roslyn analyzers with an LLM and a human in the loop could be game changing. The analyzers can enforce intent over time without any regressions. The LLM can write and modify analyzers. Analyzers can constrain the source code of other analyzers. Hypothetically there is no limit to the # of things that could be encoded this way. I am currently working on a few prototypes in the vicinity of this.

https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/t...

Comment by chamomeal 7 hours ago

Formal methods are very cool (and I know nothing about them lol) but there's still a gap between the proof and the implementation, unless you're using a language with proof-checking built in.

If we're looking to use LLMs to make code absolutely rock-solid, I would say advanced testing practices are a good candidate! Property-based testing, fuzzing, contract testing (for example https://github.com/griffinbank/test.contract) are all fun but extremely tedious to write and maintain. I think that makes them a perfect candidate for LLMs. These kinds of tests are also more easily understandable by regular ol' software developers, and I think we'll have to be auditing LLM output for quite a while.
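
As a hedged illustration of how little code a property-based test needs (using the proptest crate; `normalize` and the idempotence property are invented for the example):

    use proptest::prelude::*;

    // A toy function whose behavior we want to pin down.
    fn normalize(s: &str) -> String {
        s.trim().to_lowercase()
    }

    proptest! {
        // Property: normalization is idempotent. proptest generates many
        // strings matching the pattern and shrinks any counterexample.
        #[test]
        fn normalize_is_idempotent(s in "[A-Za-z ]{0,32}") {
            let once = normalize(&s);
            prop_assert_eq!(normalize(&once), once);
        }
    }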

Comment by measurablefunc 6 hours ago

If you want to see what verified software looks like then Project Everest is a good demonstration: https://project-everest.github.io/

Comment by bluerooibos 5 hours ago

Is there a difference here between a proof and an automated test?

I'm curious what a proof would look like compared to my RSpec unit tests and what the actual implementation would look like.

Comment by pizlonator 5 hours ago

A test tests some - but not all - of the paths through your code. And some - but not all - of the inputs you might see.

A proof establishes that your code works correctly on all paths, inputs, etc.
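
A small Lean 4 illustration of that difference, assuming the standard-library lemma `List.reverse_reverse` is available: a unit test checks `reverse (reverse l) == l` for a handful of lists, while the theorem below covers every list of every element type.

    -- Holds for all lists, not just the cases a test suite happens to exercise.
    theorem reverse_twice_id {α : Type} (l : List α) : l.reverse.reverse = l :=
      List.reverse_reverse l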

Comment by whazor 8 hours ago

Domain Specific Languages + Formal Verification.

With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.

Maybe like React components. You limit the format. You could limit which libraries can be imported, what hooks could be used, how expressive it could be.

Comment by Validark 3 hours ago

"we wouldn’t even need to bother looking at the AI-generated code any more, just like we don’t bother looking at the machine code generated by a compiler."

2020: I don't care how it performs

2030: I don't care why it performs

2040: I don't care what it performs

Comment by AgentOrange1234 6 hours ago

Hard disagree. "I'm an expert" in that I have done tons of proofs on many systems with many provers, both academically and professionally for decades.

Also, I am a novice when it comes to programming with sound, and today I have been dorking with a simple limiter. ChatGPT knows way more than me about what I am doing. It has taught me a ton. And as magical and wonderful as it is, it is incredibly tedious to try to work with it to come up with real specifications of interesting properties.

Instead of banging my head against a theorem prover that won't say QED, I get a confident-sounding stream of words that I often don't even understand. I often don't even have the language to tell it what I am imagining. When I do understand, it's a lot of typing to explain my understanding. And so often, as a teacher, it just utterly fails to communicate to me why I am wrong.

At the end of all of this, I think specification is really hard, intellectually creative and challenging work. An LLM cannot do the work for you. Even to be guided down the right path, you will need perseverance and motivation.

Comment by jkaptur 6 hours ago

I think that there are three relevant artifacts: the code, the specification, and the proof.

I agree with the author that if you have the code (and, with an LLM, you do) and a specification, AI agents could be helpful to generate the proof. This is a huge win!

But it certainly doesn't confront the important problem of writing a spec that captures the properties you actually care about. If the LLM writes that for you, I don't see a reason to trust that any more than you trust anything else it writes.

I'm not an expert here, so I invite correction.

Comment by skeeter2020 6 hours ago

Prediction: AI hypers - both those who are clueless and those who know perfectly well - will love this because it makes their "AI replaces every developer" wet dream come true, by shifting the heavy lifting from this thing called "software development" to the tiny problem of just formally verifying the software product. Your average company can bury this close to QA, where they're probably already skimping, save a bunch of money and get out with the rewards before the full damage is apparent.

Comment by nrhrjrjrjtntbt 8 hours ago

I love HN because HN comments have talked about this a fair bit already. I think on the recent Erdos problem submission.

I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.

Comment by QuercusMax 8 hours ago

Or it will lead you to unsafe workarounds when you get stuck

Comment by ben_w 8 hours ago

Always a risk.

Some students whose work I had to fix (pre AI), was crashing a lot all over the place, due to !'s instead of ?'s followed by guard let … {} and if let … {}

Comment by QuercusMax 7 hours ago

Forums used to be full of ppl using ON ERROR RESUME NEXT in VB

Comment by iwontberude 7 hours ago

The idea that LLMs are steering anything correctly with Rust reference management is hilarious to me, but only due to my experiences.

Comment by nrhrjrjrjtntbt 3 hours ago

Id definitely use LLMs to troubleshoot but not blindly vibecode.

Comment by gaogao 6 hours ago

Topical to my interests: I used Claude Code the other day to formally verify some matrix multiplication in Rust. Writing the spec wasn't too hard, actually; it was done as post-conditions in code, and proving equivalence to a simpler version of the code (e.g. for an optimization) is pretty straightforward. Maybe I should write up a proper post on it.
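
For a flavor of that "prove the optimized version matches the simple one" pattern, here is a hedged sketch with a Kani-style harness and a toy dot product standing in for the matrix code (the parent may well have used a different tool and spec style):

    // Reference implementation: the simple, obviously-correct version.
    fn dot_naive(a: &[i32; 4], b: &[i32; 4]) -> i64 {
        let mut acc: i64 = 0;
        for i in 0..4 {
            acc += (a[i] as i64) * (b[i] as i64);
        }
        acc
    }

    // "Optimized" unrolled variant we want to show is equivalent.
    fn dot_unrolled(a: &[i32; 4], b: &[i32; 4]) -> i64 {
        (a[0] as i64) * (b[0] as i64)
            + (a[1] as i64) * (b[1] as i64)
            + (a[2] as i64) * (b[2] as i64)
            + (a[3] as i64) * (b[3] as i64)
    }

    #[cfg(kani)]
    #[kani::proof]
    fn unrolled_equals_naive() {
        // Symbolic inputs: the check covers every possible pair of arrays.
        let a: [i32; 4] = kani::any();
        let b: [i32; 4] = kani::any();
        assert_eq!(dot_naive(&a, &b), dot_unrolled(&a, &b));
    }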

Comment by robot-wrangler 7 hours ago

I don't like sharing unpolished WIP and my project is still more at a lab-notebook phase than anything else, but the think-pieces are always light on code and maybe someone is looking for a fun holiday hackathon: https://mattvonrocketstein.github.io/py-mcmas/

Comment by Sevii 8 hours ago

If AI is good enough to write formal verification, why wouldn't it be good enough to do QA? Why not just have AI do a full manual test sweep after every change?

Comment by simonw 8 hours ago

You can do that as well, but it's non-deterministic and also expensive to run.

Much better to have AI write deterministic test suites for your project.

Comment by apercu 7 hours ago

I guess I am luddite-ish in that I think people still need to decide what must always be true in a system. Tests should exist to check those rules.

AI can help write test code and suggest edge cases, but it shouldn’t be trusted to decide whether behavior is correct.

When software is hard to test, that’s usually a sign the design is too tightly coupled or full of side effects, or that the architecture is unnecessarily complicated. Not that the testing tools are bad.

Comment by tachim 5 hours ago

I'm Tudor, CEO of Harmonic. We're huge believers in formal verification, and started Harmonic in 2023 to make this technology mainstream. We built a system that achieved gold medal performance at 2025 IMO (https://harmonic.fun/news), and recently opened a public API that got 10/12 problems on this year's Putnam exam (https://aristotle.harmonic.fun/).

If you also think this is the future of programming and are interested in building it, please consider joining us: https://jobs.ashbyhq.com/Harmonic. We have incredibly interesting challenges across the stack, from infra to AI to Lean.

Comment by kerameikos34 3 hours ago

This is utter nonsense. LLMs are dumb. Formal Methods require actual thinking. Specs are 100% reasoning.

I highly dislike the general tone of the article. Formal methods are not fringe; they are used all around the world by good teams building reliable systems. The fact that they are not mainstream has more to do with the poor ergonomics of the old tools and the corporate greed that got rid of the design activities in the software development process to instead bring about the era of agile cowboy coding. They did this just because they wanted to churn out products quickly at the expense of quality. It was never the correct, civilized way of working and never will be.

Comment by robot-wrangler 2 hours ago

If you like the rigor of formal methods and dislike the vaguery of LLMs, shouldn't you be in favor of using the first to improve the second?

Comment by jstummbillig 8 hours ago

Interesting. Here is my ai-powered dev prediction: We'll move toward event-sourced systems, because AI will be able to discover patterns and workflow correlations that are hard or impossible to recover from state-only CRUD. It seems silly to not preserve all that business information, given this analysis machine we have at our hands.

Comment by password-app 5 hours ago

Formal verification for AI is fascinating, but the real challenge is runtime verification of AI agents.

Example: AI browser agents can be exploited via prompt injection (even Google's new "User Alignment Critic" only catches 90% of attacks).

For password management, we solved this with zero-knowledge architecture - the AI navigates websites but never sees credentials. Credentials stay in local Keychain, AI just clicks buttons.

Formal verification would be amazing for proving these isolation guarantees. Has anyone worked on verifying AI agent sandboxes?

Comment by wagwang 7 hours ago

Disagree, the ideal agentic coding workflow for high tolerance programming is to give the agent access to software output and have it iterate in a loop. You can let the agent do TDD, you can give it access to server logs, or even browser access.

Comment by Havoc 7 hours ago

I've been toying with vibecoding rust - hardly formal verification, but it is a step closer than python that's for sure.

So far so good, though the smaller amount of training data is noticeable.

Comment by iwontberude 7 hours ago

vibecoding rust sounds cool, which model are you using? I have tried in the past with GPT4o and Sonnet 4, but they were so bad I thought I should just wait a few years.

Comment by m00dy 1 hour ago

I just write safe Rust; if it compiles, then it is formally verified as far as I'm concerned.

I used Rust in my recent project, Deepwalker [0]. I wrote it once and never looked back.

[0]: https://deepwalker.xyz

Comment by positron26 1 hour ago

The natural can create the formal. An extremely intuitive proof is that humans walked to Greece and created new formalisms out of pre-historic cultures that did not have them.

Gödel's incompleteness theorems are a formal argument that only the natural can create the formal (because no formal system can create all others).

Tarski's undefinability theorem gives us the idea that we need different languages for formalization and the formalisms themselves.

The Curry-Howard correspondence concludes that the formalisms that pop out are indistinguishable from programs.

Altogether we can basically synthesize a proof that AGI means automatic formalization, which absolutely requires strong natural systems employed to create new formal systems.

I ended up staying with some family who were watching The Voice. XG performed Glam, and now that I have spit many other truths, may you discover the truth that motivates my work on swapchain resizing. I wish the world would not waste my time and their own, but bootstrapping is about using the merely sufficient to make the good.

Comment by css_apologist 2 hours ago

i want this to be true, however i am not confident

Comment by p1necone 7 hours ago

Now: AI generates incorrect code.

Future: AI generates incorrect code, and formal verification that proves that the code performs that incorrect behaviour.

Comment by user3939382 1 hour ago

I’ve made huge advances in formal verification with AI; feel free to email me if this is your field.

Comment by 0xbadcafebee 6 hours ago

Unlikely. It's more work than necessary and systemic/cultural change follows the path of least resistance. Formal verification is a new thing a programmer would have to learn, so nobody will want to do it. They'll just accept the resulting problems as "the cost of AI".

Or the alternative will happen: people will stop using AI for programming. It's not actually better than hiring a person, it's just supposedly cheaper (in that you can reduce staff). That's the theory anyway. Yes there will be anecdotes from a few people about how they saved a million dollars in 2 days or something like that, the usual HN clickbait. But an actual study of the impact of using AI for programming will probably find it's only a marginal cost savings and isn't significantly faster.

And this is assuming the gravy train that programmers are using (unsustainable spending/building in unreasonable timeframes for uncertain profits) keeps going indefinitely. Best case, when it all falls apart the govt bails them out. Worst case, you won't have to worry about programming because we'll all be out of work from the AI bust recession.

Comment by tptacek 8 hours ago

Called this a little bit: https://fly.io/blog/semgrep-but-for-real-now/

Semgrep isn't a formal methods tool, but it's in the same space of rigor-improving tooling that sounds great but in practice is painful to use consistently.

Comment by heliumtera 7 hours ago

Will formal verification be a viable and profitable avenue for the middle man to exploit and fuck everybody else? Then yes, it absolutely will become mainstream. If not, them no, thanks. Everything that becomes mainstream for SURE will empower the middleman and cuck the developer, nothing else really matters. This is literally the only important factor.

Comment by goalieca 5 hours ago

Really? Language models are fundamentally not equivalent in scale and scope to formal logical models. People are extrapolating poorly.

Comment by shevy-java 3 hours ago

No - I don't want AI to be "mainstream".

Comment by 8organicbits 6 hours ago

If the engineer doesn't understand the proof system then they cannot validate that the proof describes their problem. The golden rule of LLMs is that they make mistakes and you need to check their output.

> writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof

Nonsense. If the AI hallucinated the proof script then it has no connection to the problem statement.

Comment by justatdotin 3 hours ago

> 2. AI-generated code needs formal verification so that we can skip human review and still be sure that it works;

hahahahaha

Comment by teleforce 6 hours ago

>writing those proofs is both very difficult (requiring PhD-level training) and very laborious.

>For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code – or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.

I think this type of pattern (a genuinely difficult problem domain with a very small number of experts) is the future of AI, not AGI. For example, formal verification (as in this article) and automated ECG interpretation can be AI killer applications, and the former is what I'm currently working on.

For most countries in the world, there are only several hundred to several thousand registered cardiologists per country, making the cardiologist-to-population ratio about 1:100,000.

People expect cardiologists to go through their ECG readings, but reading an ECG is very cumbersome. Say you have 5 minutes of ECG signal, the minimum requirement for AFib detection per the guidelines. A standard ECG is 12-lead, resulting in 12 x 5 x 60 = 3,600 beats to review even at that minimum 5-minute duration (assuming 1 minute of ECG equals 60 beats).

Then of course there is Holter ECG, with typical 24-hour readings that increase the duration considerably, which is why almost all Holter reading is now automated. But current automated ECG detection has very low accuracy, because the accuracy of the detection methods (statistics/AI/ML) is bounded by the beat-detection algorithm, for example the venerable Pan-Tompkins with its limited fiducial, time-domain approach [1].

Cardiologists would rather spend their time on more interesting activities like teaching future cardiologists, performing expensive procedures such as ICD or pacemaker implantation, or taking their once-in-a-blue-moon holidays, instead of reading monotonous patient ECGs.

This is why automating ECG reading with AI/ML is necessary to complement cardiologists. The trick is to push the sensitivity component of accuracy very high, preferably to 100%; we achieved this for both major heart anomalies, namely arrhythmia (irregular heart beats) and ischemia (the heart not regulating blood flow properly), by going beyond the time domain with a non-fiducial detection approach aided by statistics/ML/AI. Missing potential patients (false negatives) is thus minimized in the expert/cardiologist-in-the-loop workflow.

[1] Pan–Tompkins algorithm:

https://en.wikipedia.org/wiki/Pan%E2%80%93Tompkins_algorithm

Comment by bgwalter 7 hours ago

No it won't. People who aren't interested in writing specifications now won't be interested later as well. They want to hit the randomization button on their favorite "AI" jukebox.

If anyone does write a specification, the "AI" won't even get past the termination proof of a moderately complex function, which is the first step toward getting said function accepted in the proof environment, before you can even start the actual proof.

This article is pretty low on evidence, perhaps it is about getting funding by talking about "AI".

Comment by Analemma_ 7 hours ago

I don't really buy it. IMO, the biggest reason formal verification isn't used much in software development right now is that formal verification needs requirements to be fixed in stone, and in the real world requirements are changing constantly: from customer requests, from changes in libraries or external systems, and competitive market pressures. AI and vibe coding will probably accelerate this trend: when people know you can vibe code something, they will feel permitted to demand even faster changes (and your upstream libraries and external systems will change faster too), so formal verification will be less useful than ever.

Comment by henning 8 hours ago

Unless you feed a spec to the LLM and it nitpicks the compiled TLA+ output generated from your PlusCal input, gaslights you into believing that the code you just ran and pasted the output of is invalid, and then generates invalid TLA+ output in response. Which is exactly what happened when I tried coding with Gemini via formal verification.

Comment by yuppiemephisto 7 hours ago

I vibe code extremely extensively with Lean 4, enough to run two Claude Code $200 accounts out of their API limits every day for a week.

I added LSP support for images to get better feedback loops and opus was able to debug https://github.com/alok/LeanPlot. The entire library was vibe coded by older ai.

It also wrote https://github.com/alok/hexluthor (a hex color syntax highlighting extension that uses lean’s metaprogramming and lsp to show you what color a hex literal is) by using feedback and me saying “keep goign” (yes i misspelled it).

It has serious issues with slop and the limitations of small data, but the rate of progress is really really fast. Opus 4.5 and Gemini were a huge step change.

The language is also improving very fast, though not as fast as the AI.

The feedback loop is very real even for ordinary programming. The model really resists it though because it’s super hard, but again this is rapidly improving.

I started vibe coding Lean about 3 years ago and I’ve used Lean 3 (which was far worse). It’s my favorite language after churning through idk 30?

A big aspect of being successful with them is not being all or nothing with proofs. It’s very useful to write down properties as executable code and then just not prove them because they still have to type check and fit together and make sense. github.com/lecopivo/scilean is a good example (search “sorry_proof”).

There’s property testing with “plausible” as a nice 80/20 that can be upgraded to full proof at some point.
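
A minimal sketch of that 80/20 style in plain Lean 4 (the `clamp` definition and property are invented for illustration): the statement still has to typecheck and fit together, but the proof is deferred with `sorry` and can be upgraded, or property-tested with plausible, later.

    -- `clamp` is ordinary executable code; the property about it must still
    -- typecheck even though its proof is postponed with `sorry`.
    def clamp (lo hi x : Nat) : Nat := max lo (min x hi)

    theorem clamp_le_hi (lo hi x : Nat) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
      sorry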

When the model gets to another jump in capacity, I predict it will emergently design better systems from the feedback needed to prove that they are correct in the first place. Formal Verification has a tendency like optimization to flow through the system in an anti-modular way and if you want to claw modularity back, you have to design it really really well. But ai gives a huge intellectual overhang. Why not let them put their capacity towards making better systems?

Even the documentation system for lean (verso) is (dependently!) typed.

Check out my Lean vibe codes at https://github.com/alok?tab=repositories&q=Lean&type=&langua...

Comment by pooooooooooooop 7 hours ago

cool quote about smooth operator ... I notice none of the vibe codes are proofs of anything, but rather frameworks for using Lean. this seems like a waste of tokens - what is your thinking behind this?

Comment by yuppiemephisto 4 hours ago

There’s 4000 lines of nonstandard analysis which are definitely proofs, including equivalence to the standard definitions.

The frameworks are to improve lean’s programming ecosystem and not just its proving. Metaprogramming is pretty well covered already too, but not ordinary programs.

Comment by waterTanuki 6 hours ago

I'm skeptical of formal verification mainly because it's akin to trying to predict the future with a sophisticated crystal ball. You can't formally guarantee that hardware won't fail, or that solar radiation won't flip a bit. What seems to have had much better ROI in terms of safety-critical systems is switching to memory-safe languages that rely less on runtime promises and more on compiler guarantees.

Comment by PunchyHamster 7 hours ago

Prediction: Fuck no.

AI is unreliable as it is. It might make formal verification a bit less work-intensive, but the last place anyone would want AI hallucinations is in verification.

Comment by bkettle 7 hours ago

The whole point I think, though, is that it doesn’t matter. If an LLM hallucinates a proof that passes the proof checker, it’s not a hallucination. Writing and inspecting the spec is unsolved, but for the actual proof checking hallucinations don’t matter at all.

Comment by nicoburns 6 hours ago

Isn't the actual proof checking what your traditional formal verification tool does? I would have thought that 99% of the work of formal verification would be writing the spec and verifying that it correctly models the problem you are trying to model.

Comment by newAccount2025 5 hours ago

The specification task is indeed a lot of work. Driving the tool to complete the proof is often also a lot of work. There are many fully automatic proof tools. Even the simplest like SAT solvers run into very hard computational complexity limits. Many “interactive” provers are more expressive and allow a human to help guide the tool to the proof. That takes intuition, engineering, etc.

Comment by datatrashfire 5 hours ago

lol no it won’t

Comment by CyberDildonics 5 hours ago

When people have to choose between using AI and using Rust because LLMs don't produce workable Rust programs, they leave Rust behind and use something else. The Venn diagram of people who want formal verification and people who think LLM slop is a good idea is two separate circles.

Comment by justatdotin 3 hours ago

two of my current agent projects use rust: agent-generated rust is going very well for me.

Comment by CyberDildonics 2 hours ago

You're missing the point of what I said. If LLM programming and Rust conflict, people don't suddenly learn to program and stay with Rust; they find a new language.

Comment by justatdotin 2 hours ago

my experience is that llm programming and rust can fit together very nicely, thank you.

I am finding agent tooling expands my capacity for multi-language projects

Comment by dboreham 5 hours ago

My $0.02: formal methods are just "fancy tests". Of course tests are good to have, and fancy tests even better. I've found that humans are pretty terrible on average at creating tests (perhaps it would be more accurate to say that their MBA overlords have not been great at supporting their testing endeavors). Meanwhile, I've found that LLMs are pretty good at generating tests and don't have the human tendency to see that kind of activity as resume-limiting. Therefore even a not-amazing LLM writing tests turns out to be a very useful resource if you have any interest in the quality of your product and in mitigating various failure risks.
