Progress in autoformalization experiments
Can today's AI systems generate formally verified code?
Programming note: Our quarterly update (separate from this blog) went out last week; check it out here, or let us know if you'd like those cross-posted here in the future. Also, this post is much shorter than previous ones; let us know if you like this format.
First off, a team update: We’re excited to announce that Jason Gross has spun out of Atlas Computing!
In his (regrettably brief but) eventful time at Atlas Computing, Jason primarily ran experiments using LLMs to transpile from Coq to Lean (repo) while advising work on our specification validation tool (repo; summary slides). These translation efforts served as an important evaluation and demonstration: not simply that libraries and verification tools in one proof system can benefit others, but that today's AI systems are already sufficient to significantly automate various processes related to proof generation and debugging.
Getting this system to work required effort on a number of generally minor issues; for instance, it surfaced a handful of new bug reports in the Coq proof assistant.
The nontrivial part of automating this
There's already a tool that converts from Lean to Coq (not source-to-source, but compiled output to compiled output).
We started with Coq and used an LLM to generate Lean.
The generated Lean was then compiled and sent through the rocq lean importer.
Now we can compare the compilation of the original Coq against the twice-translated Coq.
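As a toy illustration of the round trip (this example and the name `imported_double` are ours, not taken from the repo):

```coq
(* Hypothetical round-trip example, not from the actual repo. *)

(* 1. Original Coq source: *)
Definition double (n : nat) : nat := n + n.

(* 2. LLM-generated Lean 4 translation (shown here as a comment):
     def double (n : Nat) : Nat := n + n
   This file is compiled by Lean, and its compiled output is fed
   through the Lean-to-Coq importer. *)

(* 3. Back in Coq, we can ask whether the original definition and
   the imported one are convertible, along the lines of:
     Goal double = imported_double. Proof. reflexivity. Qed.  *)
```

The useful property is that the comparison happens at the level of compiled terms, so superficial syntactic differences between the two translations don't matter; only convertibility does.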
In Coq, the goals look fiendishly complicated and the proofs look trivial. This is because Lean uses a powerful elaborator on simple primitives while Coq uses a weak elaborator on more powerful primitives, but if you know the structure of how things should reduce, you can basically make it all go away.
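A minimal sketch of that phenomenon, using nothing beyond the standard library: the goal's statement can look indirect (here, wrapped in a higher-order application), yet both sides compute to the same normal form, so conversion makes the proof a one-liner.

```coq
(* Toy illustration: the statement looks complicated, but both
   sides reduce to the same normal form, so the proof is just
   reflexivity up to conversion. *)
Goal (fun f : nat -> nat => f (f 1)) (fun n => n + n) = 4.
Proof. reflexivity. Qed.
```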
That said, the models' capabilities were far better than expected and, we believe, far more useful in practice than most practitioners of formal verification realize. For instance, with some hand-holding, Jason got a frontier model to compose and prove a specification of program equivalence: a couple hundred lines of working Lean code in a couple of hours. We have high confidence this will replicate across important codebases, and scale to larger and more complex tasks as models improve.
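To give a flavor of what a program-equivalence specification looks like in Lean, here is a toy sketch of our own (not Jason's actual code, and untested): a reference implementation, a tail-recursive variant, and a theorem that they agree everywhere.

```lean
-- Toy sketch, not the specification from the experiment.

-- Reference implementation: sum of 1..n by structural recursion.
def sumTo : Nat → Nat
  | 0     => 0
  | n + 1 => (n + 1) + sumTo n

-- Tail-recursive variant with an accumulator.
def sumToTR (acc : Nat) : Nat → Nat
  | 0     => acc
  | n + 1 => sumToTR (acc + (n + 1)) n

-- Generalized invariant relating the accumulator version
-- to the reference implementation.
theorem sumToTR_spec (acc n : Nat) : sumToTR acc n = acc + sumTo n := by
  induction n generalizing acc with
  | zero      => simp [sumToTR, sumTo]
  | succ k ih => simp only [sumToTR, sumTo, ih]; omega

-- Program equivalence: started at 0, the tail-recursive version
-- agrees with the reference implementation on every input.
theorem sumToTR_eq (n : Nat) : sumToTR 0 n = sumTo n := by
  simp [sumToTR_spec]
```

The pattern of generalizing over the accumulator before inducting is the standard move for proofs about tail-recursive functions; the real experiment's specification was of course larger, but of the same shape.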
As a result, we're excited that Jason will be dramatically scaling up this effort. Here's Jason's home page if you're interested: https://jasongross.github.io/