Formal methods and the future of programming

Posted by eatonphil 3 days ago

Comments

Comment by Animats 3 days ago

I used to do proof of correctness work, decades ago.[1] We had more proof automation than many of the later systems. The easy stuff was solved by the first SAT solver, the Oppen-Nelson simplifier. The harder stuff used the Boyer-Moore prover, which uses heuristics and previous lemmas to guide the theorem prover. The Boyer-Moore prover had to be helped along by suggesting lemmas, which it would try to prove and which would then be used for later proofs. That was the tough human job.

Later systems seemed to have less automation. What tended to go wrong with formal methods was that the people doing them were too into the formalism. I was working on this for a commercial project that wanted bug-free code, not as an academic. The academic projects tended to get too clever. They had the mathematician's bias of wanting a terse notation and not much case analysis. That's not a good solution. You want lots of automated grinding and don't want to need much insight. The clever people kept inventing new logics - modal logic, temporal logic, etc, - to avoid verbosity in paper and pencil proof work. That wasn't really all that helpful. The basic truth of this business is that most of the theorems are rather banal.

As the Jane Street people point out, there's a big advantage in having control of the language. You want the verification statements integrated into the programming language. In many systems, they're embedded in comments, in a different syntax than the programming language, or even in completely separate files. This adds unnecessary work. It's good to see them pushing this forward.

We were doing this too early, over 40 years ago. It took about 45 minutes of compute back then to build up number theory from a cold start with the Boyer-Moore prover. Now it takes less than a second.

[1] https://archive.org/details/manualzilla-id-5928072/page/n3/m...

Comment by nextos 2 days ago

I've worked in formal methods for quite a long time, and I disagree a bit with your statement that new logics are not helpful. Industrial logics are really practical and allow you to write all sorts of sophisticated properties that your system should satisfy in a very succinct way. Logic is to computer science and software engineering what calculus is to physics and mechanical or civil engineering [1, 2]. Things like LTL or, more recently, separation logic, have been incredible breakthroughs.

TLA+, which has gained quite a lot of popularity, is a testament to that. Model checking is eminently practical. The exciting thing now is that heavier formal methods, in particular theorem proving, might become cheap enough to use in regular systems software. Writing formal specifications for functions and getting them synthesized and proven correct by some SAT/SMT, theorem prover & LLM hybrid may become the norm in the not-too-distant future.

[1] On the Unusual Effectiveness of Logic in Computer Science. https://www.cs.rice.edu/~vardi/papers/aaas99.jsl.pdf

[2] From Philosophical to Industrial Logics. https://www.cs.rice.edu/~vardi/papers/icla09.pdf

Comment by arka2147483647 2 days ago

> Industrial logics are really practical and allow you to write all sorts of sophisticated properties that your system should satisfy in a very succinct way.

It sounds like what you think as positives are exactly the things parent comment thinks as negatives.

Comment by agentcoops 1 day ago

I used to work in this field myself long ago (academically and practically). The difficulty that OP didn't suggest much appreciation for is that the interesting spaces for automated verification -- especially of distributed systems, say -- require exploring the space of formalisms between the trivial proofs of propositional logic (SAT solvers) and the undecidable proofs of first or second-order logic (interactive theorem proving environments such as Isabelle and Lean). This is the incredibly large space occupied most importantly by the various modal logics, which allow for quantification over graph traversals. There are remarkably expressive logics where the decision problem remains tractable, such as the modal mu-calculus.

That is, precisely if one wants automated tools for verification the seemingly obtuse work on novel formal systems is necessary.

Comment by ajb 2 days ago

There's a big difference between wanting a succinct proof and wanting a succinct statement of the requirements. The easier it is to state the requirements, the more likely you have stated them correctly. Whereas succinct proof is not relevant for industrial purposes, as long as the proover has a small trusted kernel, it makes no difference to the reliability.

Comment by jmalicki 2 days ago

"Logic is to computer science and software engineering what calculus is to ... mechanical or civil engineering"

Things professionals rarely use in practice?

Comment by rramadass 16 hours ago

Thanks for the pointers to the papers.

One book which seems not that well known is Arindama Singh's Logics for Computer Science 2nd edition - https://www.phindia.com/Books/BookDetail/9789387472433/LOGIC...

For more details see author's webpage - https://home.iitm.ac.in/asingh/books.html

Comment by YetAnotherNick 1 day ago

> TLA+, which has gained quite a lot of popularity

Why do you think that? Outside of just toy examples, and vague examples of "AWS uses that", I don't know of any actual code which has TLA. Most of the things you can do with TLA, you can do with informal math notation quite easily.

Do you have any real world long term usage examples?

Comment by sunshowers 18 hours ago

We have a number of TLA+ models at Oxide. Here's one:

https://github.com/oxidecomputer/omicron/blob/8b0886dbd02229...

Comment by nextos 1 day ago

See for example https://www.mongodb.com/company/blog/engineering/conformance...

Comment by sgt101 2 days ago

Human interpretebility and human construction is going to be really important for formal specifications.

My fear is that we will get inscrutable maths that will be guarded by tiny coteries of devoties. The different inscrutable notations will be mutually incompatible, understanding one will not really help with the others. Most people will simply write english statements that cannot be verified properly.

Worse even, people will get machines to formalise their statements, and will not understand the formalism or the proofs, but will take part in the theatre of verification and act shocked when everything explodes.

Comment by ebiederm 2 days ago

Have you looked at ADA Spark?

If you have does it match your intuition of how things should be done?

I am slowly working on something where I hope to integrate such a capability for the things that type systems can't handle quickly.

So I would be interested in perspectives of people who have been down this route before.

Comment by Animats 2 days ago

I've generally liked classic approaches which had entry and exit conditions, and loop invariants, all written using the same syntax and operators as the program. The compiler has to know what to ignore, of course. The compiler should syntax and type check all the proof information, even if it can't verify it. It's important to avoid an impedance mismatch between the proof system and the programming language. If programmers are constantly translating between two notations, customer resistance will be high.

If you need to write functions as scaffolding for proofs, they should be written in the programming language. You might need some operators that aren't actually runnable, such as FORALL, but the syntax should be that of the programming language. If your specs look like

    Γ, P || Q, P ==> R, Q ==> R ⊢ R

in a language with none of those operators, you're doing it wrong. That's the disease of falling in love with the formalism.

The user should not have to tell the proof system things the compiler already knows. Whatever overloading and aliasing rules the language enforces should be known and handled in verification, too. We worked off the syntax tree produced by a compiler front end modified to handle the verification statements, not the raw source code. Ada Spark has different aliasing rules than Ada, for example.

One big problem is that object oriented programming came and went in popularity. You really want object invariants. You need some way to attach an invariant to a data structure. You also need a clear boundary where control enters and exits the objects. If the language doesn't have objects, you struggle with trying to express object type invariants in the proof language. You don't get the boundaries as part of the programming language.

A useful interface between the prover and the program is simply to use

    assert(A);
    assert(B);

in the middle of code to encapsulate complex proof goals. A should be provable from the preceding code using a SAT solver. B, when assumed true, should provide enough info for a SAT solver to proceed from that point. Now you need to prove

    A implies B

by external means. That creates a well defined problem which can be given to something AI-like, backed by a proof checker, to chew on.

Ada Spark has some features that violate these rules. Also, it's hard to find anything about Ada Spark newer than ten years old.

There's a fair amount of interest in verifying Rust. There's been some progress using Dafny. But "Dafny is not Rust. Using Dafny requires porting algorithms of interest from Rust to Dafny. This port can miss details and introduce errors." There's the impedance mismatch again. Verus looks more promising. It is more Rust-like in syntax, they get the SAT solver/AI prover distinction, and there's active development.

Comment by int_19h 6 hours ago

Rust proc macros are good enough to have design-by-contract implemented in them, although for type invariants it's not exactly ergonomic (because you have to disallow inner mutability to avoid someone grabbing a reference to a field and then mutating it without any invariant checks).

I don't know if there are any solutions combining this with static analysis. Of course, even the runtime checks are very helpful (doubly so in AI-written code).

Comment by ebiederm 18 hours ago

Thank you. That makes things clearer.

I especially appreciate the trick of asserting the intermediate truth to help the prover along.

As someone who writes software I very much agree that verification of asserts before run time (written in the language itself) is very approachable to a programmer.

In a similar vein I agree with the folks at Jane Street that aiming to rule out specific classes of bugs (as opposed to proving a program is entirely correct) is a very practical goal.

Comment by rramadass 2 days ago

Have you looked at "Verified Design-By-Contract"? See the paper Safe Object-Oriented Software: The Verified Design-By-Contract Paradigm by David Crocker here - https://www.eschertech.com/products/verified_dbc.php

David Crocker's Verification Blog - https://critical.eschertech.com/

Comment by ebiederm 17 hours ago

Not who you asked but I took a quick look.

Typically Design by Contract has meant runtime assertions. I like that they are doing verification before runtime.

At the same time their take on loops (you can't write them and have verification puts me off). Especially when modern c++ has so many prebuilt looks. It would seem to me it should just be a matter of annotating the prebuilt loops and encouraging their use.

I think their approach will fail on modern encryption code because it takes too much control (loops) from the programmer.

Comment by rramadass 16 hours ago

Which paper did you read? The "Safe Object-Oriented ..." one that i pointed to or something else? Crocker's writings explain a lot more especially w.r.t. usage with C/C++.

See for example Can C++ be made as safe as SPARK? - https://www.eschertech.com/papers/index.php which identifies a subset and enhances it with annotations. This can be updated with his later article Contracts arrive in C++26! - https://critical.eschertech.com/2025/09/09/contracts-have-ar...

Also see articles under Proving C and C++ programs correct - https://www.eschertech.com/articles/index.php

Animats was bemoaning that OO has declined and that you needed object/DS invariants. I was pointing to the fact DbC has it all (people should always use the runtime checking approach) and with Verified-DbC you could do it statically too. Formal Methods can be done at various levels and a developer can choose and adopt what he feels comfortable with initially before graduating to fullblown heavyweight methodologies/tools. What is needed is developing Formal Method Thinking. See the paper On Formal Methods Thinking in Computer Science Education linked to here - https://news.ycombinator.com/item?id=46298961

Comment by winwang 2 days ago

Love this. I've shifted in the past few months to using highly expressive types in Scala 3 to have types carry more and more compile-time proofs (without macros, though a couple are warranted). Not only does it help with agentic test "sprawl", it seems to prevent agents from falling into lower-quality modes of operation. One of the more annoying things I've been preventing is what I call "noun accretion", where agents try and make a new monomorphic type for everything, instead of clearly genericizing when sensible. My bet is on formal-method-shaped tooling (including languages with strong type systems) to accelerate decent-quality agentic coding.

When I say "highly expressive types", I mean techniques I'd likely not want to ship in a typical codebase, unless the team was already on the typelevel programming bandwagon (i.e. having HKT and type functions being basic blocks already, not weird). Agents are better at "math" than basically most devs (even category-theory-pilled ones), at least in terms of knowledge. Better yet, they are decent at pedagogy, especially when considering they have "infinite" patience for dialogue.

In a more personal setting, I've had Codex Lean-ify a couple of my hobby proofs, it was extremely easy. Note: not saying it did this 100% "correctly" (gotta learn more Lean 4 to check more thoroughly), but it also seems to check for classic proof gotchas by default. Very excited for the future of formal methods.

Comment by jdw64 3 days ago

In other words, because GEN AI a lot of code, the idea is to shift human value toward verification. Sometimes I think about what programming really is. In fact, learning programming itself is a huge challenge for a non English speaker like me. I have to rely on machine translation to understand English documents that have no translation. The materials in my language are about 5 to 6 years behind.

Now, since it's impossible to code review the tens of thousands of lines of code that AI produces, I see discussions about establishing an absolute rule like mathematical proofs. Reading this reminds me of Rust's borrow checker. In fact, after writing in Rust a few times, it often leads to bad practices where people use tricks to avoid the borrow checker.

Actually, when mathematical rigor goes too far, humans tend to find ways around it. An undereducated programmer like me is especially prone to that.

Looking back at this kind of attempt, it will probably result in writing code only for specific formalized answers. If it becomes that standardized, I'm not sure it will be able to respond to human needs.

I think these defensive programming attempts are fine, but I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship. Believing that over time, it will become good enough. Of course, for established industries where accuracy matters and the scope of work is well defined, like Jane Street, the approach in this article is correct. In other words, because there is enough data to adequately model the market's demands

But for social losers like me trying to make money, constantly moving from place to place looking for gold mines, these kinds of methodologies seem like a luxury.Established businesses with mature modeling need highly educated and specialized personnel to continuously optimize. But I know that realistically, I can't keep up with that demand. So I look for places where modeling is unstructured, but I'm not sure if I can use this approach even then.

Comment by ngruhn 2 days ago

> I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship.

Nice, I like the term too. But the paradigm is absolutely status quo in the industry. The thing is: with Gen AI the cost of "defensive programming" has gone way down, while the cost of (human) verification has gone way up. On the other hand, formal methods make verification cheap but come with massive implementation overhead (writing specs, types, proofs, and generally bending the implementation into a rigid framework). But Gen AI can automate all that laborious work. It's a match made in heaven.

Comment by jdw64 2 days ago

You're right. I think the point we need to discuss here is, to be clear, dividing the contexts in which defensive programming is strong and where it should be aggressive.

I think the overhead of implementation is enormous, but I believe AI can write it. However, before even reaching the 'implementation' stage, that is, at the planning stage, sufficient data must be collected for the implementation to be safe.

In that respect, I think Jane Street already has enough data and modeling capabilities. However, I think it's a bit difficult to say whether this approach will take shape in many other domains.

In that sense, I also think that the reason many industries are doing this kind of fast deployment and experimental tooling might be a preparatory step for that kind of modeling. Have a good day Thanks for the good comment. It helped me think more sharply as well

Comment by majormajor 2 days ago

> I think the overhead of implementation is enormous, but I believe AI can write it. However, before even reaching the 'implementation' stage, that is, at the planning stage, sufficient data must be collected for the implementation to be safe.

I don't really follow here, I think once you know what you want the system to do, you could start to model it formally. What data do you think needs collection for planning here? Is it just for performance profiling of whatever algorithms are decided on? But that seems equally as relevant as if you do your initial implementation straight-up in code w/o any formal approach.

Comment by jdw64 2 days ago

Formal verification is a mathematical proof and requires axioms. However, if real world field data is not collected, you are not comparing theory to reality but forcing reality to fit the theory, like the bed of Procrustes.

This means the object of verification is not reality but the consistency between specification and implementation, and incorrect formal verification only refines a wrong worldview.

A company with sufficient capital can turn a theory into reality through marketing and other means. But for small businesses or in markets where new competitive logic is introduced due to the lifespan of an industry, this can be a fatal problem.

For example, consider the stock market. If you use the Buffett indicator to invest right now, the market looks overheated. But other indicators suggest positive prospects. We cannot know whether the market logic of the AI era aligns with the past or is different, but I believe there are cases where modeling changes due to real world shifts.

In reality, when converting the real world into mathematics or physics, information loss is inevitable. In this process, science may later advance and become more precise, or a value once considered important may turn out to be wrong.

In other words, something may be correct within a given set of axioms, but those axioms themselves may change depending on the context. The history of computer standards has shown this as well(ASCII -> Unicode)[To explain this point, ASCII works perfectly in English speaking countries, but in my country it creates incorrect encoding.]

Programmer's ability may come down to distinguishing between what changes little and lasts long, and what changes quickly

After writing the replt, I realized I was being too critical of formal verification. I think Jane Street's argument holds for invariants inside code. However, I am cautious about whether that logic can be extended to other fields. I haven't studied deeply enough, so my thinking may be shallow. Please understand

Comment by orochimaaru 2 days ago

You have to see it from janestreet's perspective. They're an HFT and trading high volume (millions if not 10's of millions) of stock & instruments. There is no "fix". By the time you understand what's wrong you've lost billions.

So yeah - offensive may work in non-critical areas.

Fwiw - you already use defensive everywhere. Python, Java, etc. come with garbage collectors. It's verified that the code is executing your intent.

I was wondering when we would start seeing formal verification. It makes sense that we would go from worrying about implementation details to a scientific/mathematical description of the problems.

Comment by coderenegade 2 days ago

>Fwiw - you already use defensive everywhere. Python, Java, etc. come with garbage collectors. It's verified that the code is executing your intent.

Sort of. Garbage collectors can be fallible too, especially where release optimization is used.

Comment by rramadass 2 days ago

Your ideas about Formal Methods are not clear nor quite correct.

Read first the paper On Formal Methods Thinking in Computer Science Education to understand the different levels of practice available. Here is a previous comment of mine which explains and links to the paper - https://news.ycombinator.com/item?id=46298961

Carroll Morgan just published his Formal Methods, Informally: How to Write Programs That Work which teaches you how to think in a formal method manner before you start using the heavyweight tools - https://www.cambridge.org/highereducation/books/formal-metho...

Also read Understanding Formal Methods by Jean-Francois Monin to get an overview of some of the tools and the concepts/mathematics embodied in those tools.

With just the basic ideas from the above viz. Set Theory, Predicate Calculus, learning to think of a Program as a trajectory through a state space, you can start enforcing the trajectory using simple asserts(i.e. predicates) for preconditions/postconditions/invariants. Now because of Curry-Howard Isomorphism (https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon...) you can treat "propositions/formulae as types" and thus exploit the type system itself as constraints enforcing the above trajectory.

Once the above is grokked, you can move on to more complex logics (eg. FOL/HOL/Temporal/Separation etc.) and learn about tools/methodologies which implement them (eg. Alloy/B-Method/TLA+ etc.).

Finally, with AI tools, the threshold for the practice of formal methods has dramatically come down. This enables one to do Formal Specification and Verification with guaranteed traceability for AI-generated code which IMO is a necessity.

Comment by jdw64 1 day ago

However, I don't know things as deeply as you do; I only know the basic framework. So you're right that I don't know it deeply. If you have time later, please teach me that part, and I will gladly study it As you said, starting from the light parts and applying them directly might work better for me too. I would appreciate it if you could recommend a few books

Comment by jdw64 1 day ago

I'm sorry, but what you've said are things I've already read.(this:https://research.tue.nl/en/publications/on-formal-methods-th...)

The points you're raising are not what I'm arguing.

What I am arguing is this:

Formal verification only guarantees consistency between the 'specification' and the 'implementation.' It does not guarantee that the 'specification' correctly reflects reality. This is a problem of modeling.

For your logic to hold, it would imply that humans can formalize, to some degree, the amount of loss that occurs when modeling reality. That is not possible.

It only tells you whether the implementation satisfies the specification. That's also what the OP's post is about. The point that 'the cost has been lowered due to AI' is largely no different from simply saying 'the cost of implementation has gone down.' The real question is whether that implementation properly reflects reality.

The OP's post is fundamentally about 'internal invariants.'

But if you look at the beginning of the first post: 'But outside of some special cases (notably, hardware synthesis), our sense has been that formal methods were just not worth the costs for us. And those costs are really high! seL4 is a great example of this. It's a formally verified microkernel, and a profound achievement. But, boy was it expensive to do! It took 25 person-years of effort to verify 8,700 lines of C, and each line of code required something like 23 lines of proof and a half a person-day to verify.' And the post suggests extending existing methodologies. I think a limited case and generalizing it are different problems. The point is that not every special case is general.

That is why I pointed out whether this extension of methodology is even possible. I am skeptical on this point.

If you want to criticize me, rather than talking about such methodologies, you should criticize the essence of my logic: whether even offensive programming properly understands modeling. I am not simply opposing the OP's post; rather, I am asking whether modeling in the age of AI agents is comparable to modeling in the past.

The materials you shared for me are helpful, but I am saying they do not align with my logic.

Your introduction says: '===== The Problem: Realism vs. Idealism =====' Right?

The question is whether the idealism you advocate actually reflects reality. I am simply arguing for modeling. The gap in my argument is that 'then, because modeling also changes frequently, couldn't all methodologies become useless?' That could be a potential flawed skepticism. But what you're saying is not that.

Every argument has a blind spot. But you are not reading the weak points of my argument; you are only reading your own strong points. This only creates friction and leads us to argue about our emotional differences. I think you are intelligent and knowledgeable. But the points you're raising are not the points of criticism I intended, and what you're pointing out is not what I said.

I am not saying that all methodologies are wrong. I am saying that a methodology can go wrong if it steps outside its boundaries.

However, the fact that you linked me to things I could study was, I think, a gesture of goodwill, assuming that I, as a junior programmer, was misunderstanding something and trying to help me learn. But criticism that assumes I just don't understand isn't helpful for either of us.

Comment by rramadass 1 day ago

Let me see whether i can explain things better.

> Formal verification only guarantees consistency between the 'specification' and the 'implementation.' It does not guarantee that the 'specification' correctly reflects reality. This is a problem of modeling.

> For your logic to hold, it would imply that humans can formalize, to some degree, the amount of loss that occurs when modeling reality. That is not possible.

Not quite true.

You model reality by mapping your chosen abstractions from the problem domain onto some solution domain. The language of Mathematics was standardized so that we can communicate our problem domain concepts using a "Formal System" (i.e. set of symbols, syntax, rules etc.) to build a "Mathematical Model" (i.e. the interpretation of the above symbols in some real-world domain). The advantage here is that, the mathematical model is precise/well-defined and thus amenable to mechanical manipulation/checking using the rules of symbolic logic by a human/machine.

So the requirements engineering process identifies real-world problem domain concepts/objects/features/usecasees, expresses them using a formal system notation (eg. The Z notation language) to build a unambiguous self-consistent mathematical model. This is the "Formal Specification" for a system. Because it is unambiguous, there can be no confusion in interpretation between all stakeholders. When requirements change (doesn't matter how often), the model is updated but because it is a formal system any inconsistencies can be immediately detected and fixed to bring the system back to a self-consistent state.

The above is basically a realization of the "Scientific Method" viz. Observe, Measure, Hypothesize, Test, iterate (over all of the previous) so that you converge with arbitrary precision towards a desired mapping of your chosen reality.

In this regard see the classic essay The Unreasonable Effectiveness of Mathematics in the Natural Sciences - https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

> The point that 'the cost has been lowered due to AI' is largely no different from simply saying 'the cost of implementation has gone down.'

There is a subtle but important distinction. AI-generated code is probabilistic i.e. for the same exact specification/prompt every run will produce slightly different code. So verification becomes far more important in assuring system robustness i.e. correctness w.r.t. specification. Using AI-agents you can do implementation(i.e. coding) + verification as easily as implementation only. But of course you have to study formal methods and know what to do so that you can instruct an AI-agent do it.

> The OP's post is fundamentally about 'internal invariants.'

No, the OP is merely pointing to a basic conceptual idea which helps one to do formal verification i.e. invariants -> verification conditions -> proof -> theorem.

> But if you look at the beginning of the first post: 'But outside of some special cases (notably, hardware synthesis), our sense has been that formal methods were just not worth the costs for us. And those costs are really high! seL4 is a great example of this. It's a formally verified microkernel, and a profound achievement. But, boy was it expensive to do! It took 25 person-years of effort to verify 8,700 lines of C, and each line of code required something like 23 lines of proof and a half a person-day to verify.' And the post suggests extending existing methodologies. I think a limited case and generalizing it are different problems. The point is that not every special case is general. ... That is why I pointed out whether this extension of methodology is even possible. I am skeptical on this point.

This refers to difficulties using formal methods pre-AI era, because you had to learn special mathematics, methodologies and tools to use this maths all of which are non-trivial. So people only used it for critical problem domains where the labour was worth the cost. But with AI, this constraint is no longer true and so all problem domains are amenable to formal checking provided of course you get the formal specification right.

> I am asking whether modeling in the age of AI agents is comparable to modeling in the past.

Modeling in the age of AI can be done far more faster i.e. far quicker turnarounds to converge your understanding of requirements and verify it simultaneously.

> The question is whether the idealism you advocate actually reflects reality. I am simply arguing for modeling. The gap in my argument is that 'then, because modeling also changes frequently, couldn't all methodologies become useless?' That could be a potential flawed skepticism. But what you're saying is not that.

> Every argument has a blind spot. But you are not reading the weak points of my argument; you are only reading your own strong points. This only creates friction and leads us to argue about our emotional differences. I think you are intelligent and knowledgeable. But the points you're raising are not the points of criticism I intended, and what you're pointing out is not what I said.

I have already addressed these in the beginning of this post.

PS: I am assuming you wrote your above comment after your sibling comment here - https://news.ycombinator.com/item?id=48548589 I have already mentioned a couple of reference books and more can be suggested if needed. In particular, i highly recommend the "Understanding Formal Methods" book which is a fire-hose of information. It tries to provide an overview of the whole domain with chapters alternating between mathematical theory and methodologies/tools using the theory. Supplement it with wikipedia/google as needed. After this you can pick up a book/documentation on any specific tool you like eg. Dafny, Lean4, TLA+ etc. with the assurance that you understand the basic concepts and can use the tool effectively.

Comment by jdw64 1 day ago

Thank you for your comment. I may have been a bit rude, but that was not my intention. I trust you will understand.

Anyway, there are points in your thinking that I also resonate with. I realize now that some of my arguments were too rough.

I am not going to withdraw the point that formal specifications cannot always be specified with perfect accuracy, but I do think there were some shallow aspects to what I said. I appreciate that a programmer like you took the time to write such a thoughtful reply to me. I have also been reading the books you recommended as PDFs. I have learned a lot about areas and methodologies I was not familiar with. Thank you for that.

For now, I am planning to learn more about formal verification as you mentioned. Up to now, I have only been applying very basic methodologies. I mainly used techniques like Result types, algebraic data type modeling, and guard clause based Design by Contract.

The reason I am doing this is that my work mostly involves repetitive tasks, and I think the methodologies you mentioned would fit me quite well. I hope you are not offended.

Thank you for taking the time to engage with my comments, and I also appreciate you taking the trouble to recommend books. I will study them and try to apply what I learn in my work. Have a good day. I feel a bit sorry for having spoken impolitely to you, a senior programmer, as a junior

Comment by rramadass 1 day ago

That's all right; no need to feel bad. These things happen when communicating in forums like this.

Btw - Another useful way to think about model building (of reality) is analogous to the study of "Accuracy and Precision" usually taught in introductory numerical methods course. Wikipedia has a very nice explanation - https://en.wikipedia.org/wiki/Accuracy_and_precision Formal Methods help with precision and consistency. But of course you could be modeling the wrong requirements and thus it would not be accurate. Accuracy requires human intelligence (i.e. domain knowledge) but AI can help here too. You can structure the domain requirements (formally or not) and engage in socratic dialogue with an AI-agent to refine your understanding of the problem and converge towards the "truth" i.e. a set of requirements acceptable to all stakeholders. There are actually a bunch of startups working on this eg. Entalpa - https://entalpa.com/en

Some additional resources to help one better understand the concepts behind formal methods;

Faultless systems: Yes we can! - https://www.research-collection.ethz.ch/entities/publication...

The Faultless Way of Programming - https://dl.acm.org/doi/fullHtml/10.1145/3698322.3698340

The Digital Maieutic: Socrates and the Art of Prompting - https://news.ycombinator.com/item?id=48342887

Finally, see Industrial-Strength Formal Methods in Practice for case studies and practical advice - https://link.springer.com/book/10.1007/978-1-4471-0523-7

Comment by jdw64 1 day ago

Thanks! I will make sure to read the last book thoroughly. Have a good day

Comment by closeparen 2 days ago

>Now, since it's impossible to code review the tens of thousands of lines of code that AI produces

I would simply not ask the AI to write me more code until I'm satisfied with the code it's already written! But it is true there's more of an opportunity to let the thing rip if you can give the harness the ability to meaningfully verify its own work.

Comment by stephenlf 3 days ago

> these kinds of methodologies seem like a luxury

Absolutely. The article acknowledges this. Jane Street is pretty uniquely equipped to benefit from this.

Comment by teiferer 2 days ago

> I have to rely on machine translation to understand English documents that have no translation. The materials in my language are about 5 to 6 years behind.

Pretty much off topic, but I strongly recommend you learn English. It takes a little bit of effort, but getting rid of that constant translation overhead will be an enormous boost for you.

Comment by curuinor 2 days ago

guy's korean, looks like? korean <-> english is the hardest language pair of two industrialized-nation languages, almost.

Comment by altmanaltman 2 days ago

I don't understand the point of your comment here. Yes of course JaneStreet published that since its relevant to them. Yes, it doesn't generally apply to all programming. How is you being a "social loser" or not following these practices in your career relevant to this?

Comment by addaon 3 days ago

I've been playing with related ideas recently, and can talk about them at length... but one thing that's surprised me is just how effective frontier models (ChatGPT-5.5 in particular) are at completing certain manual proofs in the Roqc (né Coq) proof assistant. The proofs aren't always pretty, but ChatGPT can often prove something in minutes and 10 - 100 iterations that would take me, a human who has limited but non-zero proof assistant experience but significant domain experience in the lemmas being proven, much much longer.

The reason I think this is relevant and interesting in the context of the (short) article is because it challenges basic assumptions about how certain formal method tooling works. Frama-C, Ada/SPARK, etc are focused on generating proof obligations that can be automatically discharged by CVC5, Z3, etc; with a relatively painful fallback of manually proving the obligation in Rocq. The most common workflow seems to be to discover that an obligation is "hard" (not automatically discharged), and to restructure the set of visible lemmas and assertions at the obligation generation point in the code to try to make it "easy"; or even restructuring the code to try to make it easy. Which, in a world where Roqc proofs are doubly expensive (first because they're a challenge for a human to write; second because they tend to require quite a bit of maintenance churn as the code and proof around them evolves), makes sense. But if Roqc proofs are "automatically discharged with AI in the loop", this cost delta goes away -- and it becomes possible to think about writing proofs the same we (usually) write code, with human readability and clarity as the first goal, and [compiler|prover] optimization a distant second.

Which I find quite exciting, although I haven't fully digested the implications yet.

Comment by bluGill 3 days ago

How often will AI look at something that can't be proven because you change requirements and decide to change the code and your requirements to make the proof easier though? I haven't played with proofs at all, but I do know that occasionally when I say, this test failed after making a change, AI will just change the test instead of making the code pass both the old test and the new test which is most often my intent.

Comment by addaon 2 days ago

That's up to the harness. Right now, my harness doesn't have any tools that would allow the agent to change contracts. Also, agents working on a Rocq proof don't have write access to the C+ACSL code; they can review it, and propose structural changes to the coordinator, but all they can do is iterate on the Roqc proof or give up. The weakest part of the harness today is that the agents that do C+ACSL work can modify both C and ACSL, even though they're usually run with the explicit intent to not change the semantics of the C code (although it's totally within bounds, and often required, for them to change e.g. `foo & 3` to `foo % 4`) -- so there's weakness here, but it's bounded, and since the contracts live in header files that they can never write, it's worked so far. It's a starting point, at least; I certainly wouldn't say this is a mature, or even good, tool chain, but still learning.

Comment by rirze 2 days ago

This mirrors my own experience, except I went with Lean. It was more relevant to the features I wanted to build.

Exciting times ahead.

Comment by zahlman 2 days ago

> how effective frontier models (ChatGPT-5.5 in particular) are at completing certain manual proofs in the Roqc (né Coq) proof assistant. The proofs aren't always pretty, but ChatGPT can often prove something in minutes and 10 - 100 iterations that would take me, a human who has limited but non-zero proof assistant experience but significant domain experience in the lemmas being proven, much much longer.

... How do you know that the proofs are themselves correct?

Comment by addaon 2 days ago

With the proof checker.

Comment by bobkb 2 days ago

I assume your idea is, if the spec and the proof is verified the code generated is good enough as well ?

Comment by addaon 2 days ago

Today, I write the code. It’s trivial and takes a lot less time than writing the spec, and since I’m using conventional tooling for WCET and stack sizing it’s nice to get those right up front. The LLMs sometimes tweak the code slightly for provability, but this is usually either direct operator replacement (shift with multiplication, and with modulus, etc) or factoring out a block to a function to tie a contract onto it, both of which I trust my compiler to undo (simple arithmetic operations and inlining, respectively) with zero to minimal impact on the generated binary.

Comment by red75prime 2 days ago

Proof checkers fuel the AI hype by outputting "valid" for a hallucinated text. /s

Comment by brap 3 days ago

Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”.

I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.

I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.

I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.

Can anyone enlighten me?

Comment by jcranmer 2 days ago

There's a famous quote from Dijkstra: "Program testing can be used to show the presence of bugs, but never to show their absence." The flaw of testing is that it can only test the behaviors that you think might be problematic. To actually reach into the category of proactively fixing behaviors that you didn't know could go wrong, you have to reach for more exotic tools in the toolkit. Fuzz testing is a start down this path; formal verification is a different start down this path.

Obviously, the quality of these sorts of tools is dependent on your ability to write a formal model that is comprehensive in allowing behavior you want to be acceptable and disallowing behavior you want to be unacceptable, and we're still surprisingly far from that in many fields.

Comment by pfdietz 23 hours ago

The problem with that mindset is that testing, like verification, becomes extremely powerful as it becomes more automated and you throw more compute at it. And it's arguably easier to automate testing than it is to automate formal verification.

Sure, testing isn't perfect. But is finding 100% of the bugs that much better than finding (say) 99% of them? This is especially the case if the missed bugs tend to be those that happen very rarely.

A formal specification allows automatic generation of tests. So run billions of tests, randomly generated, and see if any violate the specification.

Even theorem proving systems use this sort of thing as a short cut, for example pruning off attempts to prove a universally quantified statement by looking for counterexamples.

Comment by jlebar 3 days ago

> Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”. [...] Can anyone enlighten me?

A big difference is that formal methods allow you to use the "for all" quantifier.

For example, you might write a unit test that says "foo('abc') returns a string with no trailing whitespace".

But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".

This is a trivial example, but you could imagine something more complicated, such as "for any program P, compile(P) has the same behavior as P".

Of course, you have to define what "has the same behavior as" means!

Comment by brap 3 days ago

>Of course, you have to define what "has the same behavior as" means

And that's really my issue, for example when you define "has no trailing whitespace", you are basically writing a piece of the implementation. Cover all behaviors, and you have basically re-implemented the function, no?

In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?

Comment by bluGill 3 days ago

> In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?

In some cases, however quite often, the spec is much simpler. For instance, it's easy to say that after running sort on some list, that the result is sorted. However, it is very hard to come up with the algorithm to do that from the specification. Sometimes that is even a point. Bubble sort, quick sort, tim sort, we can go on and on. There's a huge number of different sorts that computer science have discovered over the years. They all should have the same result and so you should be able to prove they do the same thing. However, in the real world there are often reasons you would prefer one to another despite all meeting the same spec.

Comment by gsnedders 2 days ago

Another obvious example are cryptographic hash functions: if you have a function f(s) = h, you can trivially specify a function inverse_f(h) = s st f(s) = h, and if you can infer a non-brute-force algorithm for that, you’ve just inferred a cryptographical weakness!

Comment by pydry 2 days ago

I did some of this stuff in college and what bugged me was that the spec actually ended up more complex than the code and it had bugs.

That was a long time ago and they said that formal methods were the future back then, too.

Comment by chadgpt3 2 days ago

It's possible for that to happen but probably means either the function is too trivial or you're missing some abstraction in the spec

Comment by pydry 2 days ago

It's also possible it's not the future of programming.

Comment by chadgpt3 2 days ago

It never was. It's useful but extremely slow.

Comment by brap 2 days ago

This makes a lot of sense, thank you!

Comment by nemo1618 2 days ago

I think the key is that, while you may think you have a full formal spec of f(), you actually do not. You have a program written in some language, and that language has its own spec, and the language is compiled to asm which has its own spec, and the asm executes on an architecture that has its own spec, and so on.

So when you write a function like:

  func hypot(x, y):
    return sqrt(x*x + y*y)

You might think you have "fully specified" hypot, but this is far from true! You have said nothing about what registers will be used, for example. This is not a problem; quite the opposite. It's the whole point of using high-level languages: they let you focus on what you care about. A spec is just a program in a very-high-level language.

Comment by ____tom____ 2 days ago

Exactly.

You can prove that f(a,b) always returns a+b.

And then you overflow the int on that machine.

Oops.

Comment by codebje 2 days ago

I am not sure that the perspective you have taken is the same as what I understood from the parent post; what I took from it is that things like registers, memory locations, ways to implement square root, and so on, are all _implementation choices_ that are not important properties of the specification. You specify that the hypotenuse is the square root of the sum of squares, but whether square root is implemented using Newtonian approximation or a fast inverse square root is irrelevant; whether your first argument is passed in a register or on the stack is irrelevant.

Often, things like resource usage are not specified: running time, memory consumption, etc, aren't relevant enough to appear in a behavioural specification.

If your spec says "f(a, b) returns a + b", but it's just a high-level document you can use to help guide your implementation, integer overflow is just one of many ways your implementation might be inconsistent with the specification. It's still likely that the existence of a formal specification you reference during implementation means that more edge cases have been considered ahead of time than if you just had an informal spec.

If, on the other hand, you prove it but it turns out not to be true (ie, you overflow integers), your proof is wrong. If a machine verified your proof and gave you a big thumbs up, your machine verification is wrong.

If, in Idris, I write "f : (a : Nat) -> (b : Nat) -> (c : Nat * c = a + b)", then I cannot compile an implementation for which I can't show a proof that the result is _always_ the addition of a and b, for all a and b, unbounded by anything but the resources at hand with which to run the program. An implementation subject to integer overflow won't compile.

Or, I could write "f : (a : Bits32) -> (b : Bits32) -> (c : Bits32 * c = a + b)" and implement something where , but then modulo arithmetic on overflow is _part of the specification_, because "+" in there is doing the heavy lifting of being defined as addition modulo 2*32 already; by specification, 4 billion plus 4 billion is ~3.7 billion.

Comment by ____tom____ 2 days ago

Perhaps I have misunderstood or was unclear.

Everything the parent comment mentioned were implementation details that did not affect the correctness of the code.

I just wanted to point out that there are implementation details that DO affect the correctness of the code.

And, of course programs need to run on multiple architectures. So it's hard to do what people seemed to be talking about in this thread and verify code just from the source code.

If you have the luxury of proving the correctness of the CPU, compiler and OS, that should be a big win. Otherwise, it seems to just be another type of testing. Still useful, but calling it verified or proven seems a bit much.

From my perspective, it seems more to be writing another, more complicated program, with more opportunities for bugs, and seeing if the results agree

Comment by Joker_vD 2 days ago

Well, ideally you proof should've failed because your axioms should be aware that arithmetic operators in the programming language of your choice are operating modulo some large power of two, and thus don't produce correct results for some particular inputs.

Admittedly, this one peeves a lot. Remember when Java's binary search and mergesort (and implementations in many other languages' standard libraries) turned out to have a bug of this kind [0]? Admittedly, the proof was informal, but if you are trying to prove some properties of a program X written in a certain programming language Y, you can't "just assume" that the semantics of Y are different from what they are, right? The integers do overflow in Java, that's explicitly stated in the Java's spec... and that means that a lot of even the most simplistic code has some very non-obvious correctness preconditions, which most of the times, I believe, are simply hoped to be true.

[0] https://research.google/blog/extra-extra-read-all-about-it-n...

Comment by chadgpt3 2 days ago

Non trailing whitespace means the string doesn't end with a space. But foo is a function that converts an AST to a string, that's totally different. Or it's a function that loops until \0 and changes all spaces to +

The spec should be a summary of what the impl is supposed to do. You'd want more than just doesn't end with whitespace of course.

Comment by ____tom____ 2 days ago

> Non trailing whitespace means the string doesn't end with a space

No. And this is a great example of the problems with specifications. You still have to write a spec. And this, too, is subject to bugs.

What's wrong with the statement above is there are 17 space characters in Unicode and another eight whitespace characters, like newline.

If you try to verify that something ends in whitespace, you have to make sure you have the right definition.

Not picking on parent poster! It's just a great example of the fact that you can verify, but if what you are verifying is wrong, it doesn't help.

Comment by Joker_vD 2 days ago

> there are 17 space characters in Unicode and another eight whitespace characters, like newline.

And of course, those 25 characters don't include ZERO WIDTH {SPACE,NON-JOINER,JOINER,NON-BREAKING SPACE} and WORD JOINER, which gives you yet another 5 arguably "it's kinda space, right" codepoints which definitely should not be trailing in any reasonable text string.

Comment by ____tom____ 2 days ago

Unicode never ceases to amaze me.

Comment by 3 days ago

Comment by parthdesai 2 days ago

> But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".

Isn't that essentially property testing?

Comment by Agingcoder 2 days ago

Formal methods allow you to prove that it works for all inputs, and not just for the small subset that will be sampled by property testing

It’s a proof, not a successful experiment.

Comment by Jtsummers 2 days ago

Property testing is stochastic, which may be fine, but only gives you a statistically (hopefully) high chance of discovering a problem. If you use something like SPARK/Ada, you can actually embed a proof in the code so that you actually know that the code is correct (for what you've proven). PBT scales better than embedding proofs, though, and is highly effective in practice along with fuzzing.

Comment by akoboldfrying 2 days ago

Unless you literally try every possible combination of inputs (which is usually infeasible), property testing can't give you mathematical guarantees about correctness. You can think of it as a halfway house between classic testing and formal verification:

Classic testing: A human comes up with some concrete example inputs for which they know the "right answers" (corresponding outputs). They write code that runs the code under test, gets its actual outputs, and compares them to the desired outputs.

Property testing: A human comes up with a precise way of randomly generating concrete example (input, desired output) pairs. They write some code to describe how to generate the pairs, often using a declarative DSL that describes only constraints on the inputs and outputs, with the understanding that anything not expressly forbidden is permitted, like "The input can be any list of between 0 and 100 integers each between -500 and 500" and "Every integer in the input must appear the same number of times in the output". They then write some more code (often a single line) to ask the computer to use this "spec" to randomly generate, say, 1000 such pairs, or as many pairs as can be checked in 1s. The computer generates the pairs itself, runs the code under test on each input and and checks its output matches the desired output.

Formal verification: A human comes up with a spec that typically describes conditions that must hold for all (input, output) pairs. This may look very similar to, or even exactly like the DSL used for property testing, though in general there are other conditions that can be expressed that cannot be checked with property testing even in principle -- for example, checking that the program always eventually terminates. The main difference is that the code under test is never actually run; instead, the computer analyses the source code itself to attempt mathematically prove that the stated conditions hold. How to actually accomplish this is a field of active research, but one basic approach is called "symbolic execution". To greatly simplify, if we forget about loops and conditionals for a moment, the idea is that we can write down things we know must be true after each statement executes, based on the things we knew must be true before it executed. So for example if x is a variable initially containing any integer (and we ignore overflow) then after the line

    x = x * x

runs, we know that x >= 0. To handle conditionals like

    if x > 50:
      x = 42

    something_afterwards(x)

the prover "forks" into two cases: One in which we know for certain that x > 50, one in which we know for certain that x <= 50. At the end of the if statement it then has the task of recombining what is known about the two cases. In this example, the first case lets us conclude that x = 42 by the end, while the second case lets us conclude that x <= 50 by the end, so it could conclude that x <= 50 either way by the time execution reaches something_afterwards(x). Handling loops is trickier but generally involves looking for invariants.

Comment by y1n0 3 days ago

I don’t write software, but in asic and fpga design, formal method specifications enable the use of things like SAT solvers to determine if the underlying test article meets the specification.

You specify properties of the design and the tooling tests the design in a variety of ways to see if it can violate those properties.

Let’s say you have a system that controls turn signals. You can specify properties that say things like lanes that cross each other can’t both have a green light at the same time or even within some period of time of each other.

The tooling can exhaustively check the design for this and surface code traces that violate it.

Comment by majormajor 2 days ago

The powerful difference is between specific testing and exhaustive proving. If you fail to think of a test for an edge case, whoops. If you have a good model, you can know that it exists—and fix it—before you ship.

This is particularly valuable IMO in concurrent/distributed/multi-threaded situations where race-conditions and deadlocks are extremely hard to test and reason about. "Can A, B, C happen in the order 'A, C, B'" types of things.

I have a rough hypothesis that maturity in this space looks something like:

1) LLMs will allow much faster learning and usage of formal methods, even if initially just for "do it a second time" post-hoc verification

2) From there you move towards automation of LLM-checking "hey, the implementation code changed, does it look like it broke the model" - though this is still not deterministic, but it's likely a lot better than a human review requirement of something that only changed infrequently would've been

3) And then the real jump will be taking tooling for "write just the formal spec, let the implementation get generated" to the next level. There's various mostly-not-fully-baked-for-what-most-companies-would-want projects already out there for codegen like this, what if the LLM tools can accelerate the year-or-two of manual work it would take to make one of them fit your needs?

Comment by jaggederest 3 days ago

It's more than just "write the same tests just in a different way", because each test is largely independent, or grows to unmanageably large, and you have to do the work of figuring out the completeness of your test suite by e.g. branch coverage or other relatively crude overlap methods.

Statically proven type systems let you do this in a way that each contributing piece is checked in advance against all the other pieces, guaranteeing not just "this test passes" but, in combination, "all these tests create a reasonable, coherent whole", and it disallows incoherent models from being implemented in the code. An example of this is like Rust's match, which requires complete coverage of all the possible branches, but writ large across an entire codebase.

You're right that it does nothing for you if you make a fundamental logic or theory error, it can't catch that kind of error, but you'd be surprised how much more robust it is than "merely" e.g. Haskell's type system + ad hoc functional testing + property testing - which I would consider a quite strong system overall, but formally proven e.g. cryptography is a much higher bar.

Comment by mhh__ 2 days ago

Lamport phrases this as "Thinking doesn't guarantee we will be right; not thinking guarantees we will be wrong" - specifications are for reasoning about systems in the deepest sense

Comment by comonoid 3 days ago

The book "Program = Proof" by Samuel Mimram starts with a formula which is true for all n below

n = 15 341 178 777 673 149 429 167 740 440 969 249 338 310 889

I don't think you can catch it with any test suite.

Comment by pdhborges 3 days ago

If you have an int32 or less you can!

Comment by jaggederest 2 days ago

And you'll rapidly return to proofs when your "function input" is something like a sequence of, say, ieee floating point numbers coming over the wire of possibly unbounded length. State machines with proofs that all the cases are handled are great.

Comment by IshKebab 2 days ago

That can definitely be a problem, and I have definitely written formal specs (for hardware in SVA) where I'm like "I'm just implementing this again". And actually that's a totally valid formal verification task to do - formal equivalence checking of an implementation against a model (which is just another implementation). I can feel (and sometimes is) tautological.

However you usually try to not do that. If possible (and it usually is possible) your formal spec will be either:

1. Another implementation, but a much simpler one. For example you can formally verify the equivalence of a pipelined dual issue CPU, with a combinatorial single cycle model.

2. More general properties that much be true about the implemention, rather than exactly what the implementation must be. The classic example is compression: decompress(compress(x)) == x.

Comment by rendaw 2 days ago

The same with unit tests, they're best suited for problems where the specification is much simpler than the implementation. A sorting algorithm can be long and complicated, but a unit test just needs to present unordered input and assert that the output matches ordered output. A sorting algorithm is also well suited for formal verification.

If you're rewriting the implementation, I think it's probably not a good use for unit tests or formal verification.

As another commenter said, unlike unit tests which cover a specific case, and where you need multiple tests for edge cases and both positive and negative results, formal verification will cover all cases.

Comment by chadgpt3 2 days ago

For trivial functions they can be. Function: isPrime. Impl: loop to sqrt(n) and check divisibility. Spec: returns true if nothing divides. Impl: return true at the end if we didn't return false.

It's not exactly the same because your spec says nothing divides, not just up to sqrt(n). It's definitely not a test if you do it right.

Comment by awinter-py 3 days ago

See previous kleppmann post https://martin.kleppmann.com/2025/12/08/ai-formal-verificati..., and yes, obviously anything that you can put in the typesystem or the linter, you should weigh doing so.

Hopefully we get more ergonomic ways to do this? Like of the tools listed in the post, dafny + iris are the closest to being industrial I think. And amzn S3 has a history of TLA use in-house I think. But we probably haven't seen the typescript in this space yet, a zero cost abstraction that drops into existing tools, and people genuinely prefer it to the old way.

(And custom linters are also still pretty bad to write. Like golangci-lint is a painful codebase, haven't tried semgrep but the rules engine seemed intimidating. I've yet to use an AST API that I liked)

Comment by spenrose 3 days ago

As usual, this paean to deductive reasoning (“formal methods”) leaves out its fundamental limit: how closely do the postulates and definitions fit the domain they purport to map? (“In theory, there is no difference between theory and practice. In practice ...”) My guess is that Jane Street maintains large code bodies where the mapping is 1:1, because the purpose of the code is to implement a deterministic algorithm. Many other coders work in such areas. But millions of us don’t: most UIs, most exploratory work, etc.

There is a movement parallel to formal methods to define acceptance criteria at high resolution but not logico-mathematically, which at least grapples with the mapping problem but can’t resolve it where the map isn’t the territory, which is most places. Has Google’s results page, with its extremely evolved internal optimization frameworks really hit an optimum? Could that prototype you whipped up to capture a hazy idea have better illustrated it? These questions are best answered by looking outside the system to what the system serves.

Comment by benlivengood 3 days ago

Formal methods are precisely for the domains where the semantics are well-defined. Logical circuits (a lot of CPU components get formal verification), kernels, protocols, parsers, compilers, cryptography, security frameworks, concurrency primitives, etc. all benefit a lot from verification.

Comment by sn9 2 days ago

Most UIs in practice boil down to state machines which are extremely amenable to formal verification.

Hillel Wayne's writing is a good starting place to learn more: https://www.hillelwayne.com/formally-specifying-uis/

Comment by petra 3 days ago

Google's results page isn't well defined. But probably 90%+ of the code below it is well defined.

And in some cases, where the result isnt well defined, it can be learned, so it's not about sitting and thinking what would make sense.

Comment by reinitctxoffset 3 days ago

I caught the religion on using types in conjunction with LLMs about eighteen months ago, but I only really got serious about `lean4` like six to eight months ago and now I wouldn't even consider using AI assist in software work with a `CIC` proof substrate that has practical C/C++ (and therefore, everything) FFI.

We've banned everything from JSON to Python, rewritten `nix` to have a compiler, and almost everything we write is not only property tested and multiply fuzzed to within an inch of it's life, but we have proofs in `lean4` that at a minimum drive property tests via `.olean` linkage, and when we have the bandwidth we prove exhaustiveness over the domain and property test that.

We skip the whole C++/Rust thing because all of the fast stuff is generated from `lean4` and so it doesn't really matter (C++ has advantages when bugs in it are actually bugs in `lean4` code, but you could go either way).

This is a big departure, and I certainly don't blame anyone who is skeptical: "ban JSON and Python wtf?!?!", but we've done millions of lines (checked) of this stuff and AI + formal systems is a dramatically bigger leap than no AI -> AI and Python. The latter in our experience is not monotone in progress, the former is almost always monotone in progress.

And you can do wild shit, this is a formal proof of the polyhedral tensor calculus that is modeled by things like ISL and CuTe, and using that we can get swizzling and tiling using `mdspan` in C++23 on device and prove it's right (up to some L'Hopital arugment about the coverage which this example doesn't demonstrate well: https://github.com/b7r6/mdspan-cute

That in turn, well, it goes real fast. On the first try.

https://straylight.software/assets/lambda-hierarchy.svg

Comment by skybrian 3 days ago

Ok but what software have you built and why is it useful?

Comment by reinitctxoffset 3 days ago

This is all pre-release but a few highlights:

- `continuity` is a `lean4` metaprogramming system that we use all kinds of ways but the real meat is it allows formal specification of codecs and state machines in ways that make security and performance properties proof amenable, the key trick here is to limit parser power to just what you need and no more. a very cool thing is that we can add targets to it, so when we do Zig for example, Zig will immediately get proven correct and frontier performance support for dozens of protocols that are not all mature right now.

- `libevring-cpp` (bound up into `libevring-hs` and `libevring-rs`) is a Trinity-inspired deterministic event replay system that replaces anything you would have done with `libuv` or whatever, and it's wired into `io_uring` (we're stuck with `kqueue` on darwin, eh). it interfaces with `continuity` machines and codecs (which are generated very carefully for the hardware they run on) and we have yet to find a way of measuring such programs where it doesn't resoundingly shatter the performance of any other asynchronous programming primitive in any language. i'm sure the community will prove us wrong when we release it, but it's real fast. and you get much stronger guarantees than in most such systems (Trinity derived, so if you can repro a bug, you can walk the event trace until it's sitting there in GDB and shit)

- `hyperconsole-cpp` (and `hyperconsole-hs` and `hyperconsole-rs`) is the TUI library idea taken to some deranged extreme on performance and supports everything in `notcurses`, it's pretty wild: https://youtu.be/YqgEtpJ8tGI

- straylight-nix is a complete rewrite of nix that fixes thousands of bugs, hundreds of them security adjacent and dozens of them we only talk to vendors about. it's daemonless, dramatically faster, ground-up WASM-targeting compiler with a formal grammer, uses an extremely fast LSM-based store (it can read the legacy store but we don't write it) that fixes all the problems in floating CA, IFD is too cheap to care about, and recursive nix is no longer and issue (see daemonless). it supports tearing derivations into `REAPI` actions that you can feed to your friendly native NativeLink or whatever, which just goes through them like a woodchipper. KVM-based sandbox with snapshot and restore, really opens the world up on what your builders can be.

- `slide` is the reference implementation of a family of protocols called `SIGIL`: `SIGIL-LLM` is a binary encoding for LLM data that resets on ambiguity and drops the average bytes on the wire from e.g. OpenRouter to your harness from ~hundreds to ~1.5 per token, `SIGIL-API` is a bijection on OpenAPI 3.1.0 and AsyncAPI 3.0.0 that gives comparable improvements, and `SIGIL-SH` is such an encoding for a sensible subset of bash. this does about 1.5 billion tokens per second on a laptop and never emits partial frames, so you don't get speculative execution rollback problems in your harness that tilt agents off.

- `// WEAPON //` is an adversarial, vendor-skeptical, full-take surveilance agemt harness built on `hyperconsole-cpp` and `SIGIL` (so, you could absorb the entire token output of OpenRouter on one machine if you wanted at least on the wire and in the terminal, clearly the bottleneck rapidly becomes whatever the agents are calling, but it's `zmq4` transport underneath `SIGIL` so it's also trivial to full-take all of your data for fine tuning or whatever you want it for into e.g. `parquet` on R2. `// WEAPON //` does a bunch of stuff: the tool call surface is heavily optimized for AST-level edits that miss dramatically less, we intercept and manipulate shell commands (slice off the stupid `| tail -n5` that keeps the droid in a loop not seeing the error, pre-emptively ground using heuristics that have been tuned (defeats the search flinch), and always recovers from any stall, or nag box, or anything else that would serve as an unannounced rate limit, it's fine if vendors rate limit but they need to put it in their ToS. it has a bunch of other primitives, agents run in real KVM sandboxes and speculate out as wide as you want to pay the tokens for. we hyper-manage things like the cache breakpoint geometry of Anthropic so e.g. Opus rarely misses in cache and always hands off edits to specialized tool use models. it's pretty extreme the difference in outcomes relative to all this React jank.

- `s4` is a general-purpose compiler from most any pytorch 2.0 model to `myelin`-level performance on NVIDIA (we only support NVIDIA Blackwell at the moment, that might change) and it's never worse than `myelin` because if we don't out-tune it we invoke it, but we out tune it a lot because we've proven a lot of decideability theorems about tiling and scheduling on both 1CTA and 2CTA, so we can often arrive at a finite, enumerable set of schedule/tile choices. `myelin` mops up the random garbage around the big GEMMs just like in TRT-LLM.

- `sigil-trtllm` is inspired by TensorRT-LLM-Edge but designed from the ground up around Mellanox/ConnectX and in particular GPUDirect, so it can stream `SIGIL-LLM` tokens directly onto the wire whereas something like Dynamo is usually traversing both Python and NATS, which is super weird to us. this uses the `s4` compiler very heavily.

- `straylight-cas` is a geographically distributed content addressable store backed by any R2-compatible (so most any S3-compatible too) object store with multi-level LSM and extreme performance memtables, optimistic hinted handoff over `zmq4` to other geographies, and a really simple operational story, this is kind of the thing that powers the product surface.

... which is the thing i'm less ready to talk about because it's supposed to be a surprise.

Comment by skybrian 3 days ago

It sounds impressive but also rather obscure. Do you have users?

Comment by reinitctxoffset 3 days ago

No we're pre-alpha, I guess most people would call it stealth but it's more like, not quite done and we don't want to waste people's time because our entire thesis is around correct outcomes in AI systems at a level that permits their use in outcome-critical regimes (we sometimes call it "insurable" AI as a north star, would LLoyds of London or Swiss Re stand behind someone who was writing policies on this?).

Now a bunch of that is development tooling that copes with agent-scale software development, and a lot of that might become product surface, so we have a lot of usage denominated by like, bytes and agent hours and stuff because we build this stack in itself, but that's somewhat orthogonal to the north star vision.

We'll make sure to give the HN community the opportunity to see this stuff as early as anyone does if people find it interesting, most of the above will be open source fairly soon. Don't know if it'll make the front page, but the product will be called `ORBITAL`, so if you see that floating around that's us.

Appreciate the interest!

Comment by rirze 2 days ago

I am super interesting in `straylight-nix`. Even a blog post on what you did to improve it would be elucidating.

Comment by gottlobflegel 2 days ago

Listing Agda and Idris 2 under CIC makes your lambda hierarchy diagram misleading at best.

Comment by blueblazin 2 days ago

As someone with a bit of interest in programming languages (design and implementation) this was really interesting to me:

> For most people who work on programming languages, the easy part is coming up with new and better ideas about how to make programming better. The hard part is convincing anyone to actually use those ideas for real work.

Totally agree, there's only so much strangeness you can introduce in a new language[1] regardless of benefit.

But AI agents should not feel much resistance to radically new ideas in PL design. I've been thinking for a while now about how PL design will evolve post agentic AI. I think it will be very interesting to see what new ideas we can come up with to improve programming languages when we worry much less about adoption.

[1] https://steveklabnik.com/writing/the-language-strangeness-bu...

Comment by mark4 3 days ago

Relevant keynote I wrote about

Formal methods won, now what? https://muratbuffalo.blogspot.com/2026/04/bugbash26-keynote....

Comment by xvilka 3 days ago

seL4 despite being formally verified contained gnarly bugs, like [1]. Thus, it's not a silver bullet as some people think. Yes, it improves the quality, but only one of the aspects of hardening software, like ASAN, SAST, fuzzing, strict languages, and so on.

[1] https://github.com/seL4/seL4/issues/1199

Comment by cayley_graph 3 days ago

> Since it is the unverified SMP config of the kernel

I don't disagree with your point (formal verification does not rid you of all bugs), but this is not the subject of the linked issue. This was a bug in an unverified path.

Comment by Syzygies 3 days ago

I nearly adopted OCaml for my current math research, in no small part because of Jane Street contributions. I instead chose Lean 4, as LLMs became able to help code in Lean. I give up a factor of two in performance (a gap likely to close) for what reads to me as the clearest expression of each algorithm. Lean is a beautiful language that makes OCaml read like Old English.

Lean is designed to be optionally verified; proof is its primary (but not only) application. It seems an impedance mismatch for Jane Street to explore this direction without also migrating to Lean 4.

Comment by derdi 2 days ago

You are praising Lean for optional verification. They want to add optional verification to OCaml. Where do you see an impedance mismatch? Just in the fact that you don't find OCaml pretty enough?

Comment by UncleEntity 2 days ago

I was playing around with stuff trying to get Claude produce a JavaCard VM with the idea that the VM was hand written from the spec with a separate, independently produced, spec file used to generate tests for ESBMC to verify so an identical bug would have to exist in both to make it through. Worked out pretty well, found a few bugs in both projects, but my poor laptop can't handle the full 32-bit space so that part never got the full verification -- 16-bit, rock solid though.

Then I really got serious about the yak shaving and, well, am probably in need of an intervention as I don't get Claude to write a VM but to make the tools to generate a VM from an assortment of DSL and that has snowballed a bit as I really liked shaving yaks before the daffy robot revolution.

--edit--

Almost forgot, I tried that with the little less dodgy banned Claude and the wasm standard it wrote a python script to parse the spec pdf, the official bytecode implementation in OCaml (or whatever) and generate a TOML file (Claude loves the TOML) to generate the type headers and for cross-referencing in the other tools. Was so impressed I just let it go on its merry way and it did the deed.

Comment by eddiepete 3 days ago

I wonder how formal methods can help us move faster with GenAI.

Is it that they can help write formulas faster? That they can help ensure formulas match the system they're modeling faster? If the problem you think formal methods will help with is sloppy code, isn't the verification code going to be sloppy as well, unless some (not sloppy) intelligence carefully confirms that the specification matches the target system, which was the labor that previously made formal methods too expensive? I guess I don't understand how the argument works if code was previously less sloppy and verification was too expensive, and now code is more sloppy, and there's more of it, but somehow the sloppy intelligence will make verification move fast enough to make it worthwhile.

Unless we have some non-sloppy intelligence that's less of a bottleneck on verification than humans, how are we in a better place?

Maybe it's that investing that huge amount of labor of verification by human experts is now worth it because so much code will be produced that uses the verification systems that the investment will now pay off. But that requires creating pretty general verification systems, such as type system verification or something (which is what they seem to be aimed at), rather than individually verifying software systems like the micro-kernel example.

In other words, maybe the play is to invest in reusable verification systems that can be run tons of times on new code and systems. If so, it's surprising that this wasn't always the strategy.

Comment by jsenn 3 days ago

> isn't the verification code going to be sloppy as well

The beauty of formal methods is it doesn't matter if your proof is sloppy. As long as it passes verification, it is correct. And unlike in pure math, the proof that a software system is correct is usually a huge mess of special cases, loop invariants, proofs by induction, and boilerplate that requires a large amount of human labour while providing no insight.

Proofs are also brittle: a tiny change in the code can force you to throw your proof away and start from scratch.

To me, the exciting thing about formal methods in the LLM era is it allows humans to offload the difficult and tedious work of writing proofs to a computer. Taken to an extreme, the human could live entirely in the world of a formal specification, and the LLM could generate 100% of the code. The code may be a mess, but if the system proves it satisfies the spec then it can't be wrong.

Comment by odyssey7 3 days ago

So, formal methods produce runnable systems, but communication remains the challenge.

If a formal spec is messy, then it's a proof of ... what, exactly?

A formal specification that bridges tech and product, that lets non-technical contributors read and discuss all the logical nuances, directly as operational code, at product's level of abstraction of interest, would transform a lot.

It's no longer a challenge to create code, it's a challenge to create business requirements and translate them into systems.

Comment by rzmmm 2 days ago

The spec and proof are separate. In this blog article he mentions seL4 formal verification, where they state that the spec was 4900 lines of Isabelle and the proof was 200K lines. Obviously human has to understand the spec deeply.

Comment by jaggederest 2 days ago

There's an information theoretic aspect about generating a proof which is essentially not human readable from 4900 lines of spec. I wonder how much additional signal they're getting out beyond what's in that 4900 lines, and what's the percentage of noise in the 200k lines of proof?

Comment by pron 3 days ago

The problem is that generating either code or proofs with LLMs is very expensive, and generating good proofs (I don't mean elegant, I mean proving the most important properties) is probably not very fast, either. Reducing the verification time of a program from 100 years to 10 years or the cost from $1bn to $100m is still not practical enough to become truly mainstream.

Things can be improved when people help guide and focus the LLMs, but these people still need to be formal methods experts.

Comment by jnwatson 3 days ago

The general idea is that formal methods are self-verifying, up to a point. Sloppy formal methods simply won't prove.

The point up to which formal methods stops is: do the formally encoded requirements actually encode what I want to be true?

One can make the argument that the requirements is a much smaller surface to verify than that of the entire program.

The counter argument is that figuring out what you want the program to do has always been the hardest part of programming, and that programming in itself is a journey to discover latent requirements.

Comment by Veserv 3 days ago

> One can make the argument that the requirements is a much smaller surface to verify than that of the entire program.

This argument is unfortunately empirically false for any program of any meaningful complexity (>1000 lines, probably even as low as >100 lines ignoring well-defined algorithms and data structures) using current formal methods.

Complete formal specifications are usually multiple times larger than the corresponding source code and encode esoteric propertys necessary for the proof, but which are largely even more impenetrable than a undocumented codebase.

So, it is both harder to figure out if you encoded the desired requirements and it is more complex. Your only advantage is confidence that what you wrote down is proven.

Comment by akoboldfrying 2 days ago

> Complete formal specifications are usually multiple times larger than the corresponding source code and encode esoteric propertys necessary for the proof, but which are largely even more impenetrable than a undocumented codebase.

Is it possible that the spec could be factorised into something higher-level and more modular? If not, can you give a flavour of the type of unavoidable esoteric detail? (One area I can see lots of complications is when dealing with versioned data, especially in multiple interacting systems -- naively, correctness needs to be proven for every combination of versions, even if some are never seen.)

Comment by RetroTechie 1 day ago

> I wonder how formal methods can help us move faster with GenAI.

Perhaps better the other way around? How GenAI can help us move faster with formal methods.

Rather than improve the quality of new software, there's a mountain of existing codebases that could benefit from this.

Comment by rzmmm 2 days ago

He mentions the seL4 microkernel. The specification is written in Isabelle, and it's relatively complex: PDF https://sel4.systems/Info/Docs/seL4-spec.pdf

The bottleneck seems to be that clearly it's critical to be certain about the spec.

Maybe not for kernels, but I can see use for cryptographic algorithms, etc?

Comment by seqradev 2 days ago

We are working toward applying formal methods from the application security testing side, but I believe the same kind of approach can be applied to business logic verification as well. For that, we are using the taint analysis technique — a fairly well-established formal methods approach, but still not widely applied in the field because of the complexity of dataflows in real codebases.

Scaling formal methods beyond AST pattern matching and some simple type checking turns out to be a really hard task! It took years of research and development to reach the point where taint analysis enables us to trace interprocedural dataflows in real codebases in minutes and find deeply hidden vulnerabilities.

If this sounds interesting to you, take a look at our project: https://github.com/seqra/opentaint

Comment by smasher164 2 days ago

I think it's a good time to learn Lean. It positions itself as a proof assistant that's also good at practical programming. I'm not sure how mature the ecosystem is for the latter.

Comment by JacobAsmuth 3 days ago

Sounds like Jane Street would be interested in the Lean framework I've been working on for formally verified frontends. https://github.com/JacobAsmuth/qed

Comment by pjmlp 3 days ago

I am already seeing the uptake on Spec-Driven Development as the Rational Unified Process revenge.

Comment by bobkb 3 days ago

I have been testing formal verification methods with multiple products. It will be great to also understand more about what’s tried and how it was done. For example attempting to verify the spec is what I have been trying to implement.

Comment by yawaramin 3 days ago

Quis custodiet ipsos custodes?

Comment by Animats 3 days ago

The proof checker.

You verify a proof system by having it produce a proof trace:

- Step 3185: Apply rule that a * b = b * a to "length * width" in the current formula.

Then you run a proof trace checker which applies each transformation in sequence and checks that the expected result is obtained.

Provers are complicated, but proof trace checkers are really dumb.

Comment by zahlman 2 days ago

But don't you still have to logically connect the validity of the proof to desirability of the output?

Comment by zisa-security 2 days ago

Interesting read. Formal methods seem to matter most when the architectural complexity grows faster than human intuition can track.

Comment by jp0001 3 days ago

Formal methods is like a plan. Everyone has one until they are punched in the face (real world requirements and trust boundaries).

Comment by Paracompact 3 days ago

Formal methods are not meant to replace trust in a system. They are meant to minimize the surface area of trust. To not understand and advertise what surface area still exists is foolhardy, and mistakes logic for magic.

Comment by angry_octet 2 days ago

Incredibly uninformed comment. The formally verified part is the high confidence component that is the anvil to hammer out bugs in the unverified components.

Comment by brainless 2 days ago

I am not a language nerd but I keep on experimenting in my own ways to use the type system to generate code that is more reliable.

I build a coding agent specifically for small models, which makes everything harder. I started this chat with Claude to build the next step: https://claude.ai/share/4264e5f6-b334-426c-afe4-904d233ef946 - how can I go from PRD to a typed representation of the business logic.

The I started building as per https://github.com/brainless/nocodo/blob/feature/praxis_agen.... The praxis crate: https://github.com/brainless/nocodo/tree/feature/praxis_agen... and a sample Todo app: https://github.com/brainless/nocodo_example_todo_app

Generating unit tests for the library functions of any project would be done via a separate agent than the one coding the functions. And then use tree-sitter to statically check code to PRD (provenance graph).

Again, not a language nerd, just enjoying chasing a goal.

Comment by cadamsdotcom 2 days ago

I’ve had huge success with TDD and nothing more formal than that; test frameworks are great for verifying “business correctness”. But they can also verify other aspects - eg. asserting that 3 queries ran instead of 103 for a request, to prove absence of N+1 queries.

Really any technique that lets the agent create its own verification is ideal, as it makes verification scale.

Comment by hanzewei_qk44 2 days ago

[flagged]