The viral version of this story is simple: an AI system solved a decades-old Erdős problem that had resisted human attack. The careful version is better. The result in question is Erdős Problem #1196, a conjecture on primitive sets recorded on Thomas Bloom’s Erdős Problems site. In April 2026, the site updated the problem’s status to say it was solved by GPT-5.4 Pro, prompted by Liam Price, with the main bound 1 + O(1 / log x) stated for any primitive set supported on integers larger than x.
A follow-up note then reorganized the argument, sharpened the asymptotics, and tied the proof to a cleaner conceptual framework.
That is already extraordinary. This was not an Olympiad trick, not a benchmark puzzle, and not a polished company demo detached from the research record. It landed inside an active mathematical problem database, in a corner of analytic number theory where experts already knew a lot, but not enough. The reason people in mathematics paid attention is not just that the conjecture was old. It is that the proof seems to have added a genuinely useful organizing idea. Jared Duker Lichtman, the leading human expert on this cluster of problems, wrote on the forum that the proof looked like it came “from The Book,” Erdős’s famous metaphor for the most elegant proofs.
The headline is real, but it needs translation
Most readers do not walk around thinking about primitive sets, von Mangoldt weights, or logarithmic tails of number-theoretic sums. So the first job is translation. A primitive set is a set of integers greater than 1 with the property that no member divides another. The primes are the cleanest example. So are the sets of integers with exactly k prime factors. Primitive sets sit in a sweet spot: sparse enough to avoid trivial divisibility structure, but rich enough to hold deep information about how integers are assembled from primes.
Paul Erdős proved in 1935 that if A is primitive, then the sum f(A) = ∑_{a ∈ A} 1/(a log a)
is always bounded. That is the first surprise. A condition that looks purely combinatorial — no element divides another — forces a strong analytic constraint. For decades, mathematicians pushed on a natural next question: which primitive sets come closest to maximizing that sum, and what happens if you force the set to live far out among large integers? Those questions produced a long line of work connecting primitive sets to Mertens-type estimates, almost-primes, Dickman distributions, and the anatomy of integers.
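To make the definitions concrete, here is a small Python sketch (illustrative only; the function names are my own) that checks primitivity and computes partial Erdős sums:

```python
from math import log

def is_primitive(elements):
    """Return True if no element of the set divides another (all elements > 1)."""
    xs = sorted(elements)
    return not any(b % a == 0 for i, a in enumerate(xs) for b in xs[i + 1:])

def erdos_sum(elements):
    """Partial Erdos sum f(A) = sum of 1/(a log a) over a in A."""
    return sum(1.0 / (a * log(a)) for a in elements)

print(is_primitive([2, 3, 5, 7]))   # primes form a primitive set
print(is_primitive([2, 3, 6]))      # 2 divides 6, so not primitive
print(round(erdos_sum([2, 3, 5, 7, 11, 13]), 4))
```

Running the partial sum over more and more primes creeps toward the finite limit f(P) that later results identify as the global maximum.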
Problem #1196 is one of the sharpest versions of that agenda. It asks whether primitive sets supported on large integers satisfy an asymptotically optimal bound with main term 1. That number is not decorative. It is the exact scale suggested by the family of integers with exactly k prime factors, whose Erdős sums were already known to drift toward 1 as k → ∞. In other words, the problem asks whether those k-almost-prime examples are not just suggestive but essentially extremal. That is why the result matters mathematically. It closes the gap between a lower-bound model mathematicians understood and an upper bound they still could not fully prove.
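The pull toward 1 can be glimpsed numerically. The sketch below (my own illustration; the cutoff and function names are arbitrary choices) sums 1/(n log n) over integers up to a cutoff with exactly k prime factors counted with multiplicity. The full sums are known to tend to 1 as k → ∞; truncation keeps these partial values below their true limits.

```python
from math import log

def big_omega_sieve(limit):
    """omega[n] = number of prime factors of n counted with multiplicity."""
    omega = [0] * (limit + 1)
    for p in range(2, limit + 1):
        if omega[p] == 0:  # p is prime: charge each multiple once per power of p
            for n in range(p, limit + 1, p):
                m = n
                while m % p == 0:
                    omega[n] += 1
                    m //= p
    return omega

def partial_almost_prime_sum(k, limit):
    """Sum of 1/(n log n) over 2 <= n <= limit with exactly k prime factors."""
    omega = big_omega_sieve(limit)
    return sum(1.0 / (n * log(n)) for n in range(2, limit + 1) if omega[n] == k)

for k in (1, 2, 3):
    print(k, round(partial_almost_prime_sum(k, 200_000), 3))
```

The k = 1 row is the truncated sum over primes; raising the cutoff pushes it toward f(P) ≈ 1.64, while higher k rows illustrate the families whose limits approach 1.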
The public-facing drama of the story comes from AI. The mathematical drama comes from the gap it closed. Before April 2026, Jared Duker Lichtman had already proved a weaker upper bound of roughly 1.399 plus a vanishing error term. That was strong work. It was not the conjectured answer. The new result reaches the right main term, and the follow-up note then records an even sharper form that starts exposing the second-order behavior as well.
Primitive sets had been building toward this for years
This breakthrough did not arrive in empty space. It arrived at the end of a sequence of results that had been tightening the noose around the conjecture for years. A helpful way to see the landscape is to look at the milestones in order.
A short map of the primitive-set story
| Year | Result | What changed |
|---|---|---|
| 1935 | Erdős proves f(A) is uniformly bounded for primitive sets | The subject gets its core analytic invariant |
| 2020 | Lichtman proves Erdős sums of k-almost-primes tend to 1 | The lower-bound model for the later conjecture becomes concrete |
| 2020 | Chan, Lichtman, and Pomerance settle the conjecture for 2-primitive sets | Stronger variants start yielding to new methods |
| 2023 | Lichtman proves the Erdős primitive set conjecture f(A) ≤ f(P) | The primes are confirmed as global maximizers |
| 2024 | Gorodetsky, Lichtman, and Wong sharpen the k-almost-prime asymptotic | The approach to 1 gets an explicit exponential correction |
| 2026 | Problem #1196 is recorded as solved with main term 1 + O(1 / log x) | The asymptotically optimal upper bound is reached |
This table matters because it shows this was not a random strike on a random conjecture. Problem #1196 sat right behind work already done by Lichtman and others. The 2020 paper on almost-primes proved that the sums over integers with exactly k prime factors tend to 1. The 2023 primitive-set paper proved the older global maximization conjecture: among all primitive sets, the primes maximize the Erdős sum. The 2024 Gorodetsky–Lichtman–Wong paper then sharpened the asymptotic for k-almost-primes, pinning down an explicit exponential rate at which those sums approach 1. By the time the 2026 proof appeared, mathematicians already had a very good picture of where the truth should lie. They still lacked the upper bound that nailed it. The wall was conceptual, not merely numerical.
Quanta’s 2022 account of Lichtman’s earlier breakthrough is useful here because it shows how hard the remaining distance really was. Lichtman and Carl Pomerance had found a clever route to an upper bound around 1.78 by partitioning integers into non-overlapping densities of multiples and feeding that into Mertens’ theorem. James Maynard described it as slick and strong, but still not tight. For a while, that appeared to be close to the limit of what available ideas could deliver. Getting from “bounded by about 1.78” to “asymptotically 1” is not a routine cleanup. It is the whole problem.
The 2023 primitive-set paper made the landscape even more striking. It proved f(A) ≤ f(P), where f(P) = ∑_p 1/(p log p) ≈ 1.64 is the Erdős sum of the primes, and in the same paper restated the Erdős–Sárközy–Szemerédi problem from 1968 as Conjecture 1.4. That paper already contained the crucial backdrop: if the asymptotic upper bound of 1 were true, then the k-almost-prime family would essentially attain the limit. So by 2026 the target was mathematically clear.
The new result is sharper than the headline suggests
News coverage tends to flatten mathematical results into a single sentence: “AI solved an open problem.” The actual result deserves better than that. The problem page and the follow-up note together show three distinct layers. First, the site records a proof of the conjectured upper bound 1 + O(1 / log x). Second, the note reframes the argument in a more transparent way using an invariant zeta-weight. Third, the same note records a sharper correction term involving Euler’s constant γ. That is not just a yes-or-no solve. It is the start of a small theory around the solve.
The note’s abstract is worth paraphrasing because it captures the real shift. The original thread began with a short proof based on a downward divisibility Markov chain and its adjoint sub-Markov chain. Tao, Lichtman, Sawin, Barreto, and others then reformulated the argument around a canonical positive invariant weight, which turns the proof into a hitting-probability statement for a canonical increasing divisibility chain. Once phrased that way, the main theorem follows from the fact that a primitive set is a blocking set for the chain. That is the sort of sentence mathematicians read and immediately recognize as important. It does not just say a bound was proved. It says an awkward estimate has been converted into a structural object.
That conversion matters because hard problems in number theory are often remembered less for the final inequality than for the object that made the inequality obvious after the fact. The note explicitly says the framework may have broader uses and already lists potential applications. One of them is immediate: k-almost-primes are asymptotically extremal at the main-term scale, matching the old lower bounds and the sharper 2024 asymptotic. That does not prove every nearby conjecture, but it changes the local terrain. A good proof does not merely close a door. It lights up the corridor around it.
There is another detail the headline misses. Liam Price wrote on the thread that the solution appeared in what he described as a single run lasting about 80 minutes. That is startling, but it is not the same as saying the whole mathematical event took 80 minutes. Human experts then checked it, challenged it, rewrote parts of it, extracted the cleaner formulation, discussed formalization, and debated what exactly was new in the proof. The machine produced the artifact quickly. The community turned that artifact into mathematics.
The proof idea looks different from the older route
To understand why mathematicians did not shrug and say “fine, another lucky derivation,” it helps to compare the flavor of the new argument with the older literature. The pre-2026 route leaned heavily on Mertens-type smoothing, densities of multiples, and careful partition arguments. Those ideas were already powerful enough to settle the global primitive-set conjecture and to get strong partial progress on the 1968 asymptotic problem. The new proof does not discard that world, but it seems to bypass one of its bottlenecks.
Price’s summary on the thread makes the mechanism plain. The proof considers a downward Markov chain that moves from n to n/d for divisors d of n, with transition weight Λ(d)/log n, where Λ is the von Mangoldt function; the classical identity ∑_{d|n} Λ(d) = log n makes these weights a probability distribution. Reversing this process gives an adjoint chain that is no longer Markov in the ordinary sense, but after truncation becomes sub-Markov. That is enough to upper-bound the combined hitting probability of elements of a primitive set, which is proportional to the Erdős sum f(A). The remaining task is to estimate the initial source mass, which comes out to 1 + O(1 / log x).
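A numerical sanity check of this mechanism is easy to write. The sketch below is my own illustration, not the proof; it assumes the transition weight Λ(d)/log n, normalized by the classical identity ∑_{d|n} Λ(d) = log n. It verifies that the weights at a given integer sum to 1 and samples the downward chain until it hits 1:

```python
import random
from math import log

def mangoldt(d):
    """Von Mangoldt function: log p if d = p^k for a single prime p, else 0."""
    if d < 2:
        return 0.0
    for p in range(2, d + 1):
        if p * p > d:
            return log(d)  # no factor up to sqrt(d): d itself is prime
        if d % p == 0:
            m = d
            while m % p == 0:
                m //= p
            return log(p) if m == 1 else 0.0  # prime power iff nothing remains
    return 0.0

def step_weights(n):
    """Downward-chain transition weights at n: P(n -> n/d) = mangoldt(d)/log n."""
    return {d: mangoldt(d) / log(n)
            for d in range(2, n + 1) if n % d == 0 and mangoldt(d) > 0}

def run_chain(n, rng):
    """Sample the downward divisibility chain from n until it reaches 1."""
    path = [n]
    while n > 1:
        w = step_weights(n)
        divisors = list(w)
        d = rng.choices(divisors, weights=[w[x] for x in divisors])[0]
        n //= d
        path.append(n)
    return path

weights = step_weights(360)
assert abs(sum(weights.values()) - 1.0) < 1e-12  # sum of Λ(d) over d|360 is log 360
print(run_chain(360, random.Random(1)))
```

Every sampled path is a chain of successive divisors ending at 1, which is exactly the structure a primitive set can “block”: the chain must pass through at most one element of any primitive set.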
Lichtman’s reaction is revealing because he knows the previous machinery as well as anyone. He argued that earlier papers already had a probabilistic viewpoint, but singled out the von Mangoldt-weighted arithmetic formulation as the genuinely new step. That point matters. AI stories are full of claims that a model merely rearranged known techniques. Mathematicians are usually right to be skeptical. In this case, the experts on the thread did not say the proof was wholly disconnected from earlier work. They said the crucial reformulation changed the shape of the argument.
Terence Tao’s later comments push the idea further. He suggested that what first looked like a Markov-chain story may be more naturally understood through flow networks and a discrete divergence theorem. He even wrote that the AI-generated proof artifact, together with the mostly human analysis that followed, appeared to reveal a tight link between the anatomy of integers and flow network theory that, to his knowledge, had no explicit precursor in the literature. That is strong language, and it is one reason this episode has traveled so far so quickly.
Mathematics often works like this. The first proof is not the final proof. A rough artifact appears. Experts compress it, generalize it, cut off the awkward parts, and locate the conceptual center. When that process succeeds, people stop talking only about correctness and start talking about taste. That is where this case seems to have landed unusually fast.
The community response is why this story matters
A proof is one thing. A community’s response to it is another. This story became bigger than a forum post because the response from serious mathematicians was not dismissal. It was scrutiny followed by engagement. Tao commented on the problem. Lichtman commented on the problem. Sawin commented on the problem. A joint explanatory note appeared within days. That pattern is hard to fake. When experts invest their own time in clarifying a proof instead of merely debunking it, they are signaling that something real has happened.
Lichtman’s remarks are especially important because he is not a bystander. He proved the 2023 primitive-set conjecture, made the earlier 2020 almost-prime advance, and coauthored the 2024 asymptotic sharpening. When he says the proof may be the first AI “Book proof” for an Erdős problem and suggests polishing it into a joint paper with applications, that is not social-media excitement. It is field-specific judgment from the person best placed to detect whether the proof is shallow, derivative, or wrong-footed.
The site itself also matters. Thomas Bloom’s Erdős Problems project is not an official academy database, but it has become a genuine working hub. The site says it now contains over 1,200 Erdős problems, with solved and unsolved statuses, problem lists, forum threads, and history pages. Bloom also published explicit categories for AI contributions, separating cases where AI did essentially all the work from true collaboration and from ordinary assistance. That emerging taxonomy is one of the quiet achievements of the episode. It gives the mathematical community a vocabulary for credit without pretending the old authorship model still cleanly fits.
The earlier January 2026 blog post about Problem 728 helps show the progression. Kevin Barreto called it the first-credited novel AI solution to an Erdős problem, but even there the story was clearly mixed: prompting, human judgment, formalization, literature checking, and follow-up adaptation to similar problems all played a role. The #1196 event feels different not because it is cleaner, but because the mathematical weight is higher and the proof appears more conceptually fertile.
This does not prove AI can now do mathematics by itself
The wrong lesson from this episode would be triumphalism. It does not follow that AI can now run a mathematical research program end to end. It does not follow that human mathematicians have become proof editors. It does not even follow that the next generation of famous open problems is about to fall in sequence. What it does show is narrower and more serious: frontier models can sometimes generate research-level proof ideas that survive expert checking and change the local state of knowledge.
OpenAI’s own public material points in exactly that direction. The November 2025 science paper and blog post framed GPT-5 as a useful research collaborator, not an autonomous scientist. They stressed that the case studies were curated, that expert oversight remained essential, and that the model could hallucinate citations, mechanisms, or proofs. The same paper included AI-assisted progress on Erdős Problem #848, but described it as a combination of GPT-5 with human authors and outside commenters, with human verification still doing the hard epistemic work.
OpenAI’s own First Proof page is similarly careful. It says the company used limited human supervision, sometimes selected the best of several attempts by human judgment, and did not regard the sprint as a clean controlled evaluation. That is a healthy sign, not a weakness. It shows the labs now understand that math is becoming one of the places where sloppy claims are easiest to expose.
The broader field has moved in the same direction. DeepMind’s 2024 AlphaProof and AlphaGeometry 2 result reached silver-medal level on the IMO. In 2025, Gemini Deep Think officially reached gold-medal level with 35 points, the same threshold the IMO used for human gold medals that year. Reuters reported that OpenAI also claimed a gold-level score on the 2025 IMO with a general-purpose reasoning model, though without official entry. Those are real milestones, but they are still competition results, not original research theorems. Problem #1196 sits in a different category, which is exactly why people care.
Mathematics is already building tougher tests
One reason #1196 landed so forcefully is that mathematicians were already preparing for this moment. The FrontierMath benchmark was built to measure advanced mathematical reasoning on expert-crafted problems, and Epoch later launched FrontierMath Open Problems, a benchmark of genuinely unsolved research problems whose solutions can be verified programmatically. Epoch explicitly said those open problems were meant to contextualize the emerging wave of AI solutions to previously unsolved Erdős problems.
Then came First Proof, which may be the clearest sign that the field has changed. Its organizers describe it as a set of research-level questions designed to test whether AI systems can autonomously solve problems that arise naturally in research. Their FAQ defines autonomy tightly: the AI must produce a proof without human mathematical ideas or help isolating the core of the problem. The second batch is set up with one-shot testing and blind human grading, with referee-style judgments ranging from essentially flawless to rejected. That is not a publicity stunt. It is the beginning of a standards regime.
Scientific American captured the motivation well. Leading mathematicians wanted a better test because the field had grown tired of opaque claims, benchmark theater, and literature-search confusions dressed up as discovery. The article also made a crucial distinction: AI’s near-term impact may come less from solving legendary conjectures than from handling the smaller, stubborn lemmas and proof subroutines that fill real mathematical work. That is the band where #1196 becomes interesting. It is much more than homework, but it is still close enough to existing theory that human experts can audit it carefully.
Seen that way, the story is less “AI has conquered mathematics” than “mathematics has started redesigning itself for a world where AI can occasionally produce serious proof artifacts.” That redesign involves stronger benchmarks, clearer authorship norms, public transcripts, formal proof assistants, and faster expert review. The theorem is important. The institutional response may matter even more.
The deeper shift is cultural as much as technical
Something subtle changed in mathematics over the past year. The old debate used to be whether language models could do real math at all. That question is dying. The live questions are harder. How much scaffolding counts as autonomy? How should credit be assigned? Which proofs deserve trust before formalization? What kinds of novelty matter most — a new theorem, a new framework, or a new compression of old ideas? Those questions are everywhere in the #1196 thread, in Bloom’s authorship notes, in First Proof’s rules, and in the official lab write-ups.
The 2026 problem is a good lens because it sits between extremes. It is not a giant public conjecture on the scale of the Riemann hypothesis. It is not a toy benchmark either. It belongs to a mature research program with active human experts, a meaningful literature trail, and enough surrounding structure that a proof can be inspected from several angles. That makes it unusually diagnostic. If AI had produced nonsense here, experts would have noticed. If it had produced only a clumsy search over known references, experts would have noticed that too. Instead, they found something worth reorganizing, extending, and absorbing.
There is also a lesson about elegance. In mathematics, difficulty is often misread from the outside. A problem can be open for decades and then collapse under a short proof that looks almost inevitable in retrospect. Tao himself noted on the thread that some problems turn out to have much lower a posteriori difficulty than their reputation suggested. That does not diminish the achievement. It often marks the arrival of the right viewpoint. A proof that makes an old problem look easy is often the strongest proof of all.
That is why this episode will last longer than the headline cycle. Even if sharper proofs appear, even if formalization exposes gaps that later get repaired, even if other models soon solve other Erdős problems, #1196 will still be remembered as the moment the debate changed register. Before it, many people could still pretend that AI-math success stories were mostly benchmark wins, prompt tricks, or carefully framed assistance. After it, that position is much harder to hold without ignoring the evidence.
The real significance sits beyond one theorem
A great deal of nonsense will be written about this result. Some people will treat it as proof that general mathematical research has been automated. Others will insist it proves almost nothing because the human community still checked and reformulated the argument. Both reactions miss the point. The significance of #1196 is that it occupies the narrow middle where the future is actually being decided. A frontier model produced a serious proof artifact on a real research problem. Domain experts engaged with it rather than dismissing it. The proof seems to have carried a fresh organizing idea. And the surrounding institutions — forums, notes, formalizers, benchmarks, public archives — were already mature enough to process the event quickly.
That is what makes the story bigger than the theorem. Mathematics is learning how to metabolize AI. The process is messy, public, and still unstable. It involves brilliance, skepticism, vanity, careful checking, and a lot of people arguing over what should count. That is exactly how mathematics usually absorbs new tools. The only difference is speed. A year ago, the question was whether AI could move beyond clever assistance. Now the better question is which parts of research will remain most stubbornly human, once proof generation itself is no longer a hard boundary.
If there is a sober takeaway, it is this: AI did not solve mathematics. It solved a real mathematical problem, in a way that forced mathematicians to respond on mathematical grounds. That is more important than hype, and more unsettling than hype, because it is specific, verifiable, and hard to wave away. The argument over what comes next will not be settled by slogans. It will be settled the old-fashioned way — theorem by theorem, proof by proof, with the standards getting sharper every time.
FAQ
What was the Erdős problem that AI solved?
It was Erdős Problem #1196, a conjecture about primitive sets and the asymptotic size of the Erdős sum over primitive sets supported on large integers. The Erdős Problems site now lists it as solved and attributes the result to GPT-5.4 Pro, prompted by Liam Price.
What is a primitive set?
A primitive set is a set of integers greater than 1 such that no element divides another. Primes are the standard example, but sets of integers with exactly k prime factors are primitive too.
What does the new theorem actually prove?
The recorded result proves that for any primitive set A, the tail sum over elements larger than x is at most 1 + O(1 / log x). A later note also records a sharper correction term involving Euler’s constant γ.
Why is the number 1 so important in this problem?
Because earlier work on k-almost-primes showed their Erdős sums tend to 1, making 1 the natural candidate for the optimal asymptotic upper bound. Problem #1196 asked whether that lower-bound model was in fact asymptotically extremal.
How old was the problem?
The asymptotic conjecture behind Problem #1196 goes back to 1968, when Erdős, Sárközy, and Szemerédi posed it. By April 2026, that made it roughly a six-decade-old problem.
Had humans already made partial progress before the AI result?
Yes. Jared Duker Lichtman had already proved a weaker upper bound of roughly 1.399 plus a vanishing error term, and earlier work on primitive sets and almost-primes had narrowed the target significantly.
Was this the same as the Erdős primitive set conjecture?
No. That older conjecture asked whether the primes maximize the Erdős sum among all primitive sets, and Lichtman proved it in 2023. Problem #1196 is a different, asymptotic question about primitive sets pushed far out into large integers.
What was new about the proof idea?
The discussion around the proof centered on a divisibility-chain viewpoint using von Mangoldt weights, first phrased in terms of a downward Markov chain and then reframed using a canonical positive invariant weight. Experts on the thread treated that reformulation as the key conceptual step.
Did mathematicians think the proof was elegant?
Yes. Jared Duker Lichtman wrote that, aside from rough prose, the proof looked like it came “from The Book,” Erdős’s phrase for an ideally elegant proof.
Was the proof fully autonomous?
The cleanest answer is no, not by the strictest research-benchmark standards. The site credits GPT-5.4 Pro for the solution, but human experts checked it, discussed it, re-expressed its core idea, and developed the follow-up note. First Proof’s autonomy standard is stricter than ordinary forum attribution.
Did the model really do it in one attempt?
Liam Price said on the thread that the solution came in a single run of about 80 minutes. That describes the generation of the proof artifact, not the full human process of checking and reformulation that followed.
Why are mathematicians taking this more seriously than many earlier AI math claims?
Because it landed on a real research problem with active experts, received sustained scrutiny from mathematicians including Tao and Lichtman, and quickly produced a follow-up note that extracted a broader conceptual framework from the original argument.
Was this the first AI-assisted success on an Erdős problem?
No. The Erdős Problems blog had already described Problem 728 as the first credited novel AI solution to an Erdős problem. What makes #1196 stand out is its weight inside the existing primitive-set program and the perceived quality of the proof.
How does this compare with IMO gold-medal AI results?
It is a different kind of achievement. IMO success shows elite competition-level reasoning. Problem #1196 is a research result on an open question. DeepMind officially reached gold-medal level on the 2025 IMO, and Reuters reported OpenAI also claimed a gold-level score, but those were still contest problems rather than new mathematics.
What do official AI labs themselves say about using these models in research?
OpenAI’s public science materials say the models can accelerate parts of research, but expert oversight remains essential. Their own published case studies describe human verification, curated examples, and limits such as hallucinated proofs or citations.
What is First Proof, and why does it matter here?
First Proof is a research-level math evaluation project designed to test whether AI systems can autonomously solve problems that arise naturally in research. It matters because it gives the field a tighter standard for autonomy, rigor, and grading than most forum-based success stories.
What is FrontierMath Open Problems?
It is Epoch AI’s benchmark of genuinely unsolved research problems with programmatically checkable solutions. Epoch introduced it partly to help contextualize the growing number of AI-assisted or AI-generated solutions to open mathematical problems, including Erdős problems.
Does this mean AI will soon solve famous mega-problems like the Riemann hypothesis?
Nothing in the #1196 result proves that. The evidence supports a narrower claim: frontier systems can now sometimes contribute or generate serious proof ideas on research-level problems in areas where the surrounding theory is already rich and checkable.
What is the biggest lasting consequence of this episode?
Probably not the headline, but the change in standards. Mathematics is building new norms for attribution, verification, autonomy, and formalization because AI proof artifacts are no longer hypothetical. Problem #1196 accelerated that shift.
Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below
Erdős Problem #1196
The current problem page recording the statement, the prior 1.399 bound, and the April 2026 update marking the problem as solved.
Revision history of 1196
Useful for tracking when the solution attribution and updated statement were added to the database.
Erdős Problem #1196 discussion thread
The main forum discussion containing Price’s summary of the argument, expert reactions, and follow-up technical comments.
A note on Erdős Problem #1196: primitive sets, divisibility chains, and an invariant zeta-weight
The clearest short technical account of the proof’s reformulation, the main theorem, and the sharper correction term.
A proof of the Erdős primitive set conjecture
Jared Duker Lichtman’s 2023 paper proving the older global primitive-set conjecture and restating the 1968 asymptotic problem as Conjecture 1.4.
A generalization of primitive sets and a conjecture of Erdős
A 2020 paper that sharpens the background on primitive sets and proves the Erdős question for 2-primitive sets.
Graduate Student’s Side Project Proves Prime Number Conjecture
A readable account of the older 1.78 and 1.64 bounds, and why the remaining gap was hard.
Almost primes and the Banks–Martin conjecture
Lichtman’s 2020 paper proving the Erdős sums of k-almost-primes tend to 1 and clarifying the larger Banks–Martin picture.
On Erdős sums of almost primes
The 2024 paper by Gorodetsky, Lichtman, and Wong that gives the sharper asymptotic with explicit exponential correction.
Mertens’ prime product formula, dissected
Background on one of the analytic structures that sits behind earlier primitive-set methods.
Translated sums of primitive sets
A short paper showing how delicate related primitive-set variants can be once the weighting is altered.
Erdős Problems FAQ
Explains the site’s origin and confirms Thomas Bloom’s role as creator and maintainer.
Problem 728 and the use of AI on Erdős problems
A first-person account of the earlier AI-assisted wave of Erdős problem solving and the role of Lean formalization.
AI Contributions discussion
Bloom’s public notes on emerging authorship categories for AI-generated, collaborative, and lightly assisted solutions.
Early experiments in accelerating science with GPT-5
OpenAI’s official overview of research case studies, including their stated limits and the role of expert oversight.
Early science acceleration experiments with GPT-5
The detailed paper behind the OpenAI science post, including the Erdős Problem #848 case study and explicit caveats.
Advancing science and math with GPT-5.2
OpenAI’s later public write-up on stronger benchmark performance and a direct-open-problem case study.
How GPT-5 helped mathematician Ernest Ryu solve a 40-year-old open problem
An official case study showing how OpenAI frames mathematical discovery as collaboration rather than pure autonomy.
Our First Proof submissions
OpenAI’s description of its First Proof attempts, including limited human supervision and selection effects.
AI achieves silver-medal standard solving International Mathematical Olympiad problems
DeepMind’s official account of the 2024 silver-medal milestone on IMO problems.
Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad
DeepMind’s official report on the 2025 gold-level IMO result and the move to natural-language proof generation.
66th IMO 2025
The official IMO page giving the 2025 award thresholds, including the 35-point gold-medal cutoff.
Google clinches milestone gold at global math competition, while OpenAI also claims win
A neutral news report comparing DeepMind’s certified 2025 IMO result with OpenAI’s gold-level claim.
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI
Epoch’s formal introduction to FrontierMath and why it was built.
FrontierMath Open Problems
Epoch’s explanation of its open-problem benchmark and why AI solutions to real research questions need better context.
FrontierMath
The benchmark hub describing both challenge problems and open research problems as part of one evaluation ecosystem.
Mathematicians launch First Proof, a first-of-its-kind math exam for AI
A concise independent overview of why mathematicians wanted a stricter standard for research-level AI proof claims.
First Proof FAQ
The clearest public statement of the project’s autonomy standard for AI-produced proofs.
First Proof Project
The project homepage describing one-shot testing, blind grading, and the second-batch evaluation process.
First Proof
The preprint explaining the scope of First Proof and the distinction between proving known-form questions and doing higher-level mathematical research.