OpenAI’s GPT-5.5 release is really a bet on agentic work

OpenAI has now officially released GPT-5.5, and that alone settles the basic question behind weeks of speculation. What the company published on April 23, 2026, is not a rumor, not a leak, and not a vague “coming soon” teaser. It is a live product launch with rollout details, benchmark claims, pricing, a system card, a Bio Bug Bounty, and product placement inside ChatGPT and Codex. The more interesting part is what OpenAI thinks this model is for. The company is not selling GPT-5.5 as a prettier chatbot. It is selling it as a model for coding, computer use, knowledge work, and research workflows that unfold across tools and over time.

The “Spud” name belongs to the pre-release story, not the public launch brand. OpenAI’s own materials present the model simply as GPT-5.5 and GPT-5.5 Pro. The codename shows up in outside reporting, most clearly in Axios, which said OpenAI released GPT-5.5 under the internal codename “Spud.” That distinction matters because it separates two different narratives. One is internet speculation about what the next model might be. The other is the product OpenAI actually shipped and how it wants customers to think about it. The codename makes the release memorable. The product framing tells you what OpenAI thinks will make money.

The rumor phase ended the moment OpenAI published the rollout

OpenAI’s launch post is unusually direct about availability. GPT-5.5 is rolling out in ChatGPT and Codex for paid tiers, while GPT-5.5 Pro is rolling out to higher-end ChatGPT tiers. The API is not there yet, which is a notable part of the story rather than a missing footnote. OpenAI says API deployment requires different safeguards and that GPT-5.5 and GPT-5.5 Pro will come to the Responses API and Chat Completions API “very soon.” The Help Center adds a second layer of realism: rollout is gradual, and GPT-5.5 Thinking and Pro may not appear to every user immediately. That is the language of a live release under controlled deployment, not a lab preview.
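
For developers planning the migration, the shape of the eventual integration is already predictable even though the endpoint is not live. A minimal sketch, assuming the model ships to the Responses API under the id “gpt-5.5” (the model id and its availability are assumptions at launch time; the call pattern is the standard OpenAI Python SDK):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical: "gpt-5.5" is not a live API model at launch time.
# This sketches what the Responses API call would look like once
# OpenAI's "very soon" API rollout lands.
response = client.responses.create(
    model="gpt-5.5",
    input="Summarize the open action items in these meeting notes.",
)
print(response.output_text)
```

Nothing about that call is exotic, which is the point of the staged rollout: what OpenAI says still needs work is the safeguards around the endpoint, not the interface itself.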

That staged launch also tells you who OpenAI wants first. ChatGPT Plus, Pro, Business, and Enterprise get GPT-5.5 in the chat product, while Codex availability is broader across Plus, Pro, Business, Enterprise, Edu, and Go. API developers, by contrast, are asked to wait. Companies that live inside ChatGPT workspaces and Codex get first access because OpenAI believes the model’s strongest argument is visible inside managed workflows with tools, files, and permissions, not in raw token endpoints alone. Axios reported the same directional picture: paid ChatGPT and Codex users first, API later once more security work is in place.

Spud is a codename, but GPT-5.5 is the product

The easiest mistake to make around this launch is to overread the codename. “Spud” is colorful, funny, and almost guaranteed to travel faster than a sober model number. But OpenAI did not build the public launch around it. The official page, community announcement, and support documents all speak in the language of GPT-5.5, GPT-5.5 Pro, ChatGPT, and Codex. That is not cosmetic. It shows that OpenAI wants this release understood as part of a product line and a workflow stack, not as a mysterious internal research milestone breaking loose into the wild.

Codename culture still matters, just not for the reason enthusiasts think. Internal names often become a way for the market to narrate a model before the company does. They carry a sense of secrecy and inevitability, and they turn a coming release into a kind of fandom event. But once the model ships, those names tend to matter only if they reveal something about the company’s direction. Here, the more durable signal is not the potato joke. It is the fact that OpenAI chose to present GPT-5.5 as “a new class of intelligence for real work” and tied it tightly to Codex, workspace agents, enterprise workflows, and guarded rollout. That is where the real commercial story sits.

This is not a routine point release

OpenAI’s own comparison point makes the shift easier to see. When GPT-5 launched in August 2025, the company described it as a unified system with routing between a faster model and a deeper reasoning mode. GPT-5.5 sounds different. The public language is about getting to the user’s real goal faster, planning multi-step work, using tools with less micromanagement, checking work, and carrying tasks through to completion. The release post keeps returning to the same promise: less hand-holding, more follow-through.

That is why the “5.5” label can be misleading. A half-step version number usually suggests a refinement release. The material OpenAI published reads more like a product reorientation. GPT-5.5 is pitched as a model that closes the gap between asking for help and delegating work. The model may or may not feel revolutionary in a one-off chat exchange. OpenAI is not betting on that reaction. It is betting that in coding sessions, spreadsheet work, document workflows, software navigation, and research loops, the model will save enough supervision to feel qualitatively different, even when the improvement looks incremental on paper.

Coding is where the release makes its clearest case

If you strip away the launch theater, coding is still the sharpest part of the argument. OpenAI says GPT-5.5 is its strongest agentic coding model to date, reporting 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, and 73.1% on its internal Expert-SWE evaluation. The public comparison table also places GPT-5.5 ahead of GPT-5.4 on those coding measures and ahead of Claude Opus 4.7 on Terminal-Bench 2.0, though Anthropic still leads on SWE-Bench Pro in OpenAI’s own table. That is not a clean knockout. It is a credible lead in the category OpenAI most wants to own: long-horizon engineering work with tools.

The surrounding product documents reinforce that emphasis. Codex’s model page says GPT-5.5 is the recommended choice for complex coding, computer use, knowledge work, and research workflows when available. The CLI docs say the same, and the pricing page argues that GPT-5.5 gets comparable or better results with fewer tokens than GPT-5.4 in Codex tasks. OpenAI also gives unusually vivid operational examples: more than 85% of the company uses Codex every week, its finance team used the system on tens of thousands of tax forms, and internal users automated recurring business reporting with time savings measured in hours per week. That collection of details is deliberate. It tries to prove that the model is already embedded in work, not just benchmarked in isolation.

Knowledge work is the second front OpenAI is targeting

The most commercially important numbers in the launch may not be the coding scores at all. OpenAI reports 84.9% on GDPval, 60.0% on FinanceAgent, and 54.1% on OfficeQA Pro. Those names are less famous than SWE-Bench, but they point toward a far larger market: analysts, operators, finance teams, researchers, internal communications staff, and everyone whose job lives inside documents, spreadsheets, portals, and messy inputs. GPT-5.5 is being sold as a model that can take rough material, determine what matters, use tools, verify output, and turn that into something usable. That is office automation with a reasoning model, not search with better prose.

The benchmark context makes that more concrete. GDPval was introduced by OpenAI as a way to measure model performance on economically valuable tasks across 44 occupations. FinanceAgent v1.1, from Vals AI, tests entry-level financial analyst work across 537 questions. OfficeQA Pro is deliberately brutal: its corpus spans nearly 100 years of U.S. Treasury Bulletins, 89,000 pages, and more than 26 million numerical values, and frontier agents still struggle badly even with direct document access. Against that backdrop, a score in the mid-50s on OfficeQA Pro looks less modest than it sounds. It suggests the model is starting to matter on document-heavy tasks where fluent prose alone is useless without retrieval, parsing, and arithmetic discipline.

Computer use has moved closer to a real product category

One of the most important lines in the launch sits almost quietly among the benchmarks: 78.7% on OSWorld-Verified. That benchmark exists to test whether agents can operate real computer environments. In plain English, it asks whether a model can handle software the way a person does: looking at what is on screen, moving through interfaces, using applications, and finishing a task inside a real operating context. OpenAI frames GPT-5.5 as stronger on that kind of work, and its Codex materials push the same message. The company is no longer describing computer use as a special demo. It is describing it as part of the normal product surface.

The timing around workspace agents makes that harder to ignore. On April 22, 2026, one day before the GPT-5.5 launch, OpenAI introduced workspace agents in ChatGPT as Codex-powered cloud agents that can take on long-running workflows, use connected apps, remember what they learned, and keep working when the user is away. GPT-5.5 lands straight into that setup. The model is not the whole story. It is the new brain inside a product category OpenAI has already started defining. That matters because agentic work becomes real only when model capability, permissions, tools, and product design line up. GPT-5.5 looks designed to make that stack feel coherent.

Scientific workflows are the boldest part of the launch

OpenAI is making an unusually ambitious claim for GPT-5.5 in research settings. The launch post says the model is better not just at answering hard questions, but at persisting through research loops: exploring an idea, gathering evidence, testing assumptions, interpreting results, and deciding what to try next. The company highlights gains on GeneBench and BixBench, reports 25.0% on GeneBench and 80.5% on BixBench in its detailed table, and adds a striking example: an internal version of GPT-5.5 helped discover a new proof about off-diagonal Ramsey numbers that was later verified in Lean. That is a much bolder pitch than “useful for students.” It is a claim that the model has become materially interesting for expert technical work.

The benchmarks behind those claims are not toy tasks. A new GeneBench preprint describes realistic multi-stage scientific data analysis in genetics and quantitative biology. BixBench was introduced as a benchmark for bioinformatics tasks requiring open-ended analytical work over real datasets, long trajectories, and interpretation. The published BixBench paper is helpful here because it reminds readers how hard the space was just a year ago: frontier models in the authors’ original setup performed badly enough to expose deep limitations. That makes OpenAI’s new numbers feel notable, but it also argues for restraint. A model that scores well on scientific benchmarks is not automatically a reliable autonomous scientist. It is better read as a stronger research copilot whose value depends heavily on human framing, verification, and domain judgment.

The benchmark map behind GPT-5.5 is unusually broad

One reason this launch will land harder with serious users is that the benchmark spread is not limited to one narrow lane. The release post covers agentic coding, professional work, computer use, academic reasoning, biology, and cyber-oriented evaluations. OpenAI’s table includes Terminal-Bench 2.0, GDPval, OSWorld-Verified, Tau2-bench Telecom, FinanceAgent, OfficeQA Pro, GeneBench, BixBench, FrontierMath, GPQA Diamond, and CyberGym. That breadth is not proof of dominance everywhere. It is proof that the company wants the model judged on task completion across many environments, not on conversational elegance alone.

There is also a subtler point here. The benchmark authors themselves often built these tests because older evaluation habits were too forgiving. GDPval was built around real work across occupations. OSWorld-Verified exists because real computer tasks are harder than sandbox demos. Tau2-bench focuses on collaboration between agent and user in shared environments, which is closer to technical support and operational work than standard tool-calling tests. OfficeQA Pro punishes shallow document handling. Taken together, they form a better lens for this release than legacy trivia-heavy leaderboards. GPT-5.5 matters if it reduces friction on messy, verifiable work. These benchmarks at least try to measure that.

A compact view of the benchmarks that matter most

Benchmark | What it is testing | Why it matters for GPT-5.5
GDPval | Economically valuable tasks across 44 occupations | Measures whether the model is useful on real professional work
OSWorld-Verified | Open-ended tasks in real computer environments | Tests whether “computer use” is actually operational
Tau2-bench Telecom | Agent and user coordination in customer-service workflows | Checks tool use, guidance, and shared-state reasoning
OfficeQA Pro | Grounded reasoning over a huge enterprise document corpus | Exposes whether document-heavy office work is truly improving
GeneBench | Multi-stage analysis in genetics and quantitative biology | Probes scientific workflow quality rather than rote recall
BixBench | Real-world bioinformatics analysis | Tests long analytical trajectories in a demanding research domain

These benchmarks do not settle every debate, and no serious team should buy a model because one launch table looked strong. They are still useful because they point in the same direction: OpenAI wants GPT-5.5 judged on work that can be checked, repeated, and compared across products, not just on witty answers or synthetic demos.

Availability says a lot about OpenAI’s priorities

The release matrix is revealing. In ChatGPT, GPT-5.5 Thinking goes to Plus, Pro, Business, and Enterprise users, while GPT-5.5 Pro is reserved for Pro, Business, and Enterprise. In Codex, GPT-5.5 is available across Plus, Pro, Business, Enterprise, Edu, and Go, with a 400K context window and a faster mode that trades higher cost for higher token speed. In the API, OpenAI says GPT-5.5 is coming soon with a 1M context window. The immediate pattern is obvious: OpenAI wants the model to prove itself inside its own work products before it becomes just another endpoint in someone else’s stack.

That fits the broader reshuffling of models inside ChatGPT. The Help Center says GPT-5.3 is now the default auto-switching experience for logged-in users and notes that older ChatGPT model choices such as GPT-4o, GPT-4.1, and even GPT-5 Instant and Thinking were retired from ChatGPT earlier in 2026, while API access remained separate. The result is a two-speed ecosystem. ChatGPT users live in a managed model world shaped by product decisions. Developers live in a more gradual migration path. That split can be frustrating, but it also shows where OpenAI now sees product leverage: bundled workflows, managed permissions, and cloud execution, not a pure model marketplace.

Pricing reveals the economic argument behind the launch

OpenAI’s pricing language around GPT-5.5 is blunt. The model costs more per token than GPT-5.4, but OpenAI argues it is more intelligent and much more token efficient. For API developers, GPT-5.5 will be priced at $5 per million input tokens and $30 per million output tokens, with GPT-5.5 Pro at $30 and $180. Codex adds another layer: GPT-5.5 Fast mode generates tokens 1.5 times faster at 2.5 times the cost, and the Codex pricing page repeatedly stresses that GPT-5.5 uses significantly fewer tokens to achieve comparable or better outcomes than GPT-5.4. The sales pitch is no longer cheap tokens. It is cheaper finished work.

That shift matters because frontier model economics are becoming less intuitive. If a stronger model takes fewer retries, makes fewer tool errors, uses fewer tokens to finish a job, and needs less human supervision, then sticker price per token stops being the right headline. OpenAI knows enterprise buyers are starting to think that way. Axios captured the same line of thinking around enterprise adoption and the economics of longer, more complex tasks. Once a model becomes an actor in a workflow rather than a generator of text, the meaningful unit is not the token. It is the cost of successful completion under real-world constraints. GPT-5.5 is clearly designed to win that argument.
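
To make that argument concrete, here is a back-of-the-envelope sketch in Python. Only the GPT-5.5 prices ($5 input, $30 output per million tokens) come from the launch materials; the GPT-5.4 prices, token counts, and success rates below are illustrative assumptions, not published figures:

```python
# Cost per *successfully completed* task, not cost per token.
# GPT-5.5 prices are from the launch post; everything else is assumed.

def cost_per_completion(input_tokens, output_tokens,
                        input_price, output_price, success_rate):
    """Expected spend to get one successful completion.

    Failed attempts still burn tokens, so expected cost scales with
    1 / success_rate (a simple geometric-retry assumption).
    """
    per_attempt = (input_tokens / 1e6) * input_price \
                + (output_tokens / 1e6) * output_price
    return per_attempt / success_rate

# GPT-5.4-style run: assumed cheaper tokens, but more of them and more retries.
old = cost_per_completion(120_000, 40_000, 3.50, 20.00, success_rate=0.70)

# GPT-5.5-style run: pricier tokens, assumed fewer of them and fewer retries.
new = cost_per_completion(80_000, 25_000, 5.00, 30.00, success_rate=0.90)

print(f"GPT-5.4-style run: ${old:.2f} per completed task")  # ~$1.74
print(f"GPT-5.5-style run: ${new:.2f} per completed task")  # ~$1.28
```

Under those assumptions, the nominally more expensive model comes out cheaper per finished job, which is exactly the trade OpenAI is asking buyers to evaluate.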

OpenAI tied performance claims to infrastructure and compute

The launch is also notable for how openly it connects model quality to serving infrastructure. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving despite being more capable, and says the model was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. The release does not treat infrastructure as background plumbing. It treats it as part of the product story. That is a signal that frontier releases are starting to depend as much on systems engineering, inference efficiency, and traffic shaping as on the raw model weights.

That framing aligns with recent commentary from both OpenAI and Nvidia. The launch post includes praise from Nvidia’s Justin Boitano, and Axios reports Greg Brockman describing a move toward a “compute-powered economy.” Strip away the slogan and the idea is plain enough: when AI systems take on bigger chunks of work, compute cost and serving architecture become core business questions. OpenAI is not just saying GPT-5.5 is smarter. It is saying it has found a way to serve a stronger model at usable speed, which is often the difference between a benchmark win and something a company will actually deploy.

Safety is not sitting in the appendix on this release

OpenAI went out of its way to make the safety material visible on day one. The launch post says GPT-5.5 shipped with its strongest safeguards to date, after full safety and preparedness evaluation, external and internal red-teaming, targeted testing for advanced cyber and biology capabilities, and feedback from nearly 200 early-access partners. The company also published a full GPT-5.5 system card on the same day and launched a GPT-5.5 Bio Bug Bounty program. That combination matters because it changes the cadence of a frontier release. Safety is no longer a PDF that appears days later if people complain loudly enough. It is part of the release package.

The system card is broad and unusually operational. It covers disallowed content, prompt injection against connectors, destructive actions in computer-use settings, user confirmations during high-risk computer actions, jailbreak resistance, health performance, hallucinations, alignment evaluations, biological preparedness, cyber preparedness, and trust-based access for higher-risk cyber capability. That scope reflects what GPT-5.5 is supposed to do. A model that acts across tools and interfaces creates different risks than a model that only talks. Once an agent can click, delete, retrieve, and persist, safety becomes entangled with workflow design, not just refusal style.

Factuality and health performance got real attention

Two of the more grounded safety-adjacent claims in the system card deserve attention because they speak to ordinary use, not just catastrophic risk. First, OpenAI says that on de-identified ChatGPT conversations flagged by users for factual errors, GPT-5.5’s individual claims are 23% more likely to be factually correct and its responses contain a factual error 3% less often than GPT-5.4. Second, on HealthBench and HealthBench Professional, GPT-5.5 improves over GPT-5.4 on length-adjusted scoring, including a 51.8% score on HealthBench Professional. Those are not perfect numbers. They are useful because they point to reliability in the places where people actually notice failure.

OpenAI also built more specific guardrails for agentic behavior. The system card describes training around avoiding accidental data-destructive actions and following configurable confirmation policies during computer use, including high-risk actions. That may sound minor next to biosecurity and cyber risk, but it is central to whether agentic products feel trustworthy in daily use. A model that drafts code brilliantly but reverts the wrong files or pushes through an unsafe action without asking is not enterprise-ready. The card’s focus on confirmations, data preservation, and prompt injection shows OpenAI understands that everyday failure modes can kill adoption long before frontier risk does.
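
The system card describes a policy, not an API, so any implementation detail here is speculative. But the underlying pattern is familiar from existing agent frameworks: high-risk tool calls pass through a configurable confirmation gate before they execute. A minimal, entirely hypothetical sketch of that pattern in Python:

```python
# Hypothetical sketch of a confirmation policy for agentic tool calls.
# It mirrors the behavior the system card describes; it is not
# OpenAI's implementation or API.

HIGH_RISK_ACTIONS = {"delete_file", "drop_table", "send_payment"}

def execute_with_policy(action, args, confirm):
    """Run a tool call, pausing for human confirmation on high-risk actions.

    `confirm` is a callback that asks the operator and returns a bool.
    """
    if action in HIGH_RISK_ACTIONS:
        if not confirm(f"Agent wants to run {action} with {args}. Allow?"):
            return {"status": "blocked", "action": action}
    # ... dispatch to the real tool here ...
    return {"status": "executed", "action": action, "args": args}

# Example: a console-based confirmation callback.
result = execute_with_policy(
    "delete_file",
    {"path": "/tmp/report-draft.docx"},
    confirm=lambda prompt: input(prompt + " [y/N] ").strip().lower() == "y",
)
print(result)
```

The interesting product question is who configures the high-risk list and where the confirmations surface, which is exactly the kind of workflow design the card keeps circling.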

Cybersecurity changed the release playbook

One of the strongest tells in the GPT-5.5 release is how much of the safety apparatus is built around cybersecurity. The system card says OpenAI classifies GPT-5.5 as High capability in the cybersecurity domain, below Critical. It says the model could not produce functional critical-severity exploits in the tested hardened software projects under the company’s evaluation setup, but OpenAI still deployed expanded safeguards and encourages legitimate defenders to use Trusted Access for Cyber for more permissive capabilities. That is a careful line: not publicly unrestricted, not frozen in the lab, and not treated as ordinary consumer software either.

The Trust-based access section makes the strategy explicit. OpenAI says TAC is an identity-gated pathway for higher-risk dual-use cyber capabilities, designed for enterprise customers, verified defenders, and other legitimate users. The company couples that with actor-level enforcement and monitoring for cyber misuse signals. In other words, the release is built on differentiated access rather than a single universal policy. Anthropic is moving in a similar direction from the other side of the market: its Mythos Preview materials focus heavily on cyber capability, and Project Glasswing was launched specifically to use that model for critical software defense. The broader pattern is hard to miss. Frontier labs now see advanced cyber ability as one of the first domains demanding product-tiered access, not just generic safety refusals.

The timing says almost as much as the benchmarks

OpenAI did not release GPT-5.5 into a quiet market. Anthropic introduced Claude Opus 4.7 on April 16, 2026, exactly a week earlier, with its own emphasis on long-running software work, tool reliability, and multi-step autonomy. Anthropic also spent April talking publicly about Mythos Preview’s cyber capabilities. OpenAI launched workspace agents on April 22, then GPT-5.5 on April 23. Taken together, those dates suggest a market that is no longer arguing about whether models are useful. It is arguing about which company can turn model capability into dependable work products first.

That is also why the launch copy keeps circling back to business, legal, education, data science, and professional work. OpenAI is trying to own the category where budgets live. The home page language is about “real work.” The benchmarks lean toward valuable tasks. The examples feature tax forms, reporting workflows, operational planning, and research assistance. Even the day-before workspace-agent launch shows the same intent. The market GPT-5.5 is chasing is not “people who enjoy AI.” It is teams that will pay for work to move faster without hiring at the same pace.

Enterprises will read GPT-5.5 differently from ordinary users

For a consumer, GPT-5.5 may feel like a better answerer, a steadier coder, or a nicer research companion. For an enterprise, the release reads differently. The key questions are not about style. They are about supervision load, permission boundaries, auditability, failure recovery, and how much of a workflow the model can hold together before a human must step back in. OpenAI’s examples were chosen with that lens in mind. Reviewing 24,771 K-1 tax forms across 71,637 pages, building risk frameworks for inbound speaking requests, generating weekly business reports, and using shared workspace agents in Slack are all organizational tasks with traceable value.

That does not make the release a guaranteed enterprise win. Benchmarks do not automatically transfer to production, and agentic systems fail in idiosyncratic ways. OfficeQA Pro itself is a good warning label here: enterprise document reasoning remains hard even with direct corpus access. OpenAI’s own rollout caution around API access also shows the company knows the sharp edges are not gone. But the release is still notable because it shows where the company thinks the enterprise threshold now sits. GPT-5.5 is being offered as something you trust with bounded slices of work, not just something you consult. That is a meaningful commercial step.

Developers are getting a staged transition, not a clean handoff

For developers, the release is slightly awkward in a familiar frontier-model way. GPT-5.5 is recommended in Codex when available, but if it is not in your picker yet, OpenAI says to keep using GPT-5.4. The Codex CLI already supports gpt-5.5 as a selectable model, and Codex docs note that it is currently available when signing in with ChatGPT rather than API-key authentication. Meanwhile the general API story is still “soon.” That is enough to start evaluating the model inside OpenAI’s own workflow surface, but not enough for a full-stack migration plan on day one.
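
Until the rollout completes, the sensible developer posture is a fallback check rather than a hard dependency. A small sketch, assuming the standard OpenAI Python SDK and assuming the model eventually surfaces in the models list under the id “gpt-5.5”:

```python
from openai import OpenAI

client = OpenAI()

# Prefer GPT-5.5 when the account can see it; otherwise stay on GPT-5.4,
# which matches OpenAI's own guidance for accounts still mid-rollout.
# (Both model ids are assumptions based on the launch naming.)
available = {m.id for m in client.models.list()}
model = "gpt-5.5" if "gpt-5.5" in available else "gpt-5.4"
print(f"Routing agentic coding requests to: {model}")
```

That is boring code by design. The real migration work is re-validating prompts, tool permissions, and supervision thresholds against the new model’s behavior.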

This is not necessarily a weakness. It may be the only sensible way to launch a model whose value depends heavily on tools and whose risk profile changes when it is exposed as a raw API. Still, the split matters. It means the first people to learn GPT-5.5 deeply will often be ChatGPT and Codex users rather than API-first builders. That slightly reverses the old pattern where the API led and the product caught up. It also reinforces the sense that OpenAI is trying to turn its own applications into the default place where frontier capabilities are experienced and monetized.

Claude Opus 4.7 is the comparison that keeps this release honest

No serious reading of GPT-5.5 should ignore Anthropic’s position. Claude Opus 4.7 launched on April 16 with its own strong case in advanced software engineering, long-running tasks, and agent reliability. Anthropic’s release is filled with customer testimony about fewer tool errors, better follow-through, stronger loop resistance, and better coding outcomes. OpenAI’s own launch tables show a mixed competitive picture: GPT-5.5 leads on some tasks, while Claude Opus 4.7 stays highly competitive and even ahead on SWE-Bench Pro in OpenAI’s table. The market is tight enough that one lab’s release copy no longer settles the question.

That competitive pressure is healthy for readers trying to make sense of the hype. It forces a cleaner interpretation of GPT-5.5. The release does not matter because OpenAI suddenly left everyone behind. It matters because OpenAI is pushing a very specific product thesis faster and more coherently: Codex as the work surface, workspace agents as the organizational layer, GPT-5.5 as the model tuned for messy delegation, and differentiated safety access for cyber-sensitive use. Whether that beats Anthropic across the board is still an open commercial question. Whether OpenAI has made its strategy easier to understand is not. GPT-5.5 makes that strategy unusually plain.

What GPT-5.5 changes right now

The cleanest way to read this release is not as a dramatic leap into autonomous labor and not as a trivial half-step either. GPT-5.5 looks like a model that raises the floor on delegation. It appears better at staying on task, using tools, moving through interfaces, and converting ambiguous requests into completed work products. That does not remove the need for human review, especially in finance, research, medicine, or security. It does make the old pattern of spoon-feeding every intermediate step look less necessary in some domains. That is a real product change, even if it arrives wearing a modest version number.

The larger point is where this leaves the industry. GPT-5.5 was released under the shadow of a codename, but the codename is the least important part of the story. What matters is that OpenAI used the launch to say, plainly, that the next contest is about who can supply dependable AI labor inside software, documents, codebases, and organizational workflows. GPT-5.5 is not the finish line for that race. It is a clear sign that the race has changed.

OpenAI GPT-5.5 FAQ

Did OpenAI actually release GPT-5.5 on April 23, 2026?

Yes. OpenAI published an official GPT-5.5 launch page on April 23, 2026, and also posted rollout details in its community and Help Center materials.

Is “Spud” the official product name?

No. OpenAI’s official materials use GPT-5.5 and GPT-5.5 Pro. “Spud” appears in outside reporting as the internal codename attached to the release.

Where is GPT-5.5 available right now?

OpenAI says GPT-5.5 is rolling out in ChatGPT and Codex for paid plans, with GPT-5.5 Pro rolling out to Pro, Business, and Enterprise users in ChatGPT. The Help Center says the rollout is gradual.

Is GPT-5.5 already in the API?

Not yet at launch time. OpenAI says API access is coming soon and that serving the model in the API requires additional safeguards.

What is GPT-5.5 Pro supposed to be?

OpenAI describes GPT-5.5 Pro as the higher-accuracy version for harder questions and more demanding work. The system card says GPT-5.5 Pro uses the same underlying model with additional parallel test-time compute, applied in the cases where that extra effort pays off.

What kinds of work is OpenAI emphasizing most with GPT-5.5?

The official launch focuses on coding, computer use, knowledge work, and early scientific research. Codex docs repeat that framing and recommend GPT-5.5 for those categories when available.

What are the headline benchmark numbers from the launch?

OpenAI reports 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, 78.7% on OSWorld-Verified, 98.0% on Tau2-bench Telecom, 60.0% on FinanceAgent, 54.1% on OfficeQA Pro, 25.0% on GeneBench, and 80.5% on BixBench.

What does GDPval actually measure?

OpenAI introduced GDPval as an evaluation of economically valuable real-world tasks across 44 occupations. It is meant to reflect practical work rather than trivia-style knowledge tests.

What does OSWorld-Verified test?

OSWorld-Verified measures agent performance in real computer environments. It is designed to test open-ended computer tasks rather than simplified interface demos.

Why does Tau2-bench Telecom matter?

Tau2-bench is a framework for evaluating customer-service agents that must coordinate with users and tools in shared environments. That makes it relevant for real support and workflow automation.

Why is OfficeQA Pro such a useful benchmark for enterprise work?

OfficeQA Pro is built around a huge document corpus of U.S. Treasury Bulletins covering nearly 100 years, 89,000 pages, and over 26 million numerical values. It is hard precisely because enterprise document work is hard.

Does GPT-5.5 look meaningfully stronger for research tasks?

OpenAI is clearly making that claim, especially through GeneBench, BixBench, and its Ramsey-number proof example. The surrounding benchmark literature suggests real progress, though human verification still matters enormously.

What safety work accompanied the launch?

OpenAI published a GPT-5.5 system card, said the model went through full predeployment safety and preparedness evaluations, and launched a GPT-5.5 Bio Bug Bounty on the same day.

What is the GPT-5.5 Bio Bug Bounty for?

It is a red-teaming program that invites selected researchers to try to find a universal jailbreak that defeats GPT-5.5’s bio safety challenge. OpenAI says the goal is to test whether a reproducible universal jailbreak exists after deployment.

What is Trusted Access for Cyber?

Trusted Access for Cyber is OpenAI’s identity-gated pathway for giving verified defenders and legitimate users access to more permissive cyber capabilities while keeping broader safeguards in place. GPT-5.5’s system card says that pathway has been expanded for this model.

How does GPT-5.5 compare with Claude Opus 4.7?

The public picture is mixed. OpenAI’s release tables place GPT-5.5 ahead on some evaluations such as Terminal-Bench 2.0 and GDPval, while Claude Opus 4.7 remains highly competitive and ahead on SWE-Bench Pro in OpenAI’s own table.

Why does this launch focus so heavily on Codex?

Because OpenAI is tying GPT-5.5 to agentic work rather than plain chat. Codex is the product surface where coding, computer use, file handling, and longer workflows come together.

What is the biggest takeaway from the Spud release?

The main takeaway is not the codename. It is that OpenAI is now openly selling frontier models as tools for delegated work across software, documents, and organizational workflows, with safety and access controls built around that goal.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency


This article is an original analysis supported by the sources cited below

Introducing GPT-5.5
OpenAI’s official launch post covering rollout, benchmarks, product framing, and pricing.

GPT-5.5 System Card
OpenAI’s deployment safety document for GPT-5.5, covering evaluations, safeguards, and preparedness.

GPT-5.3 and GPT-5.5 in ChatGPT
OpenAI Help Center article with rollout notes and current ChatGPT model availability.

Codex Models
OpenAI’s Codex model guide showing where GPT-5.5 fits in the coding stack.

Pricing – Codex
OpenAI pricing page explaining GPT-5.5 usage limits and efficiency claims inside Codex.

Features – Codex CLI
OpenAI documentation recommending GPT-5.5 for complex coding and computer-use tasks.

Changelog – Codex
OpenAI changelog documenting the expanding role of Codex as a broader work surface.

Introducing workspace agents in ChatGPT
OpenAI product announcement showing how Codex-powered agents fit into team workflows.

Trusted access for the next era of cyber defense
OpenAI’s explanation of its expanded cyber-defense access framework and safety posture.

GPT-5.5 Bio Bug Bounty
OpenAI’s launch note for the public bio red-teaming program tied to GPT-5.5.

Measuring the performance of our models on real-world tasks
OpenAI’s introduction to GDPval and why it uses economically valuable work as an evaluation target.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
The official OSWorld site describing the benchmark behind OpenAI’s computer-use claims.

τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
The tau2-bench repository explaining the benchmark’s agent-user coordination setting.

Finance Agent v1.1
Vals AI’s benchmark page for financial analyst-style agent tasks.

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
The OfficeQA Pro paper describing a demanding enterprise document-reasoning benchmark.

OfficeQA
Databricks’ benchmark repository for OfficeQA and OfficeQA Pro.

GeneBench: Assessing AI Agents for Multi-Stage Inference Problems in Genomics and Quantitative Biology
A recent preprint describing the genetics and quantitative biology benchmark referenced by OpenAI.

Announcing BixBench: A Benchmark to Evaluate AI Agents on Bioinformatics Tasks
FutureHouse’s announcement explaining the purpose and structure of BixBench.

BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
The BixBench paper giving the technical background for the benchmark’s design and difficulty.

Introducing Claude Opus 4.7
Anthropic’s official release note for the rival model launched one week before GPT-5.5.

Claude Mythos Preview
Anthropic’s technical post on Mythos Preview and its unusually strong cyber capabilities.

OpenAI releases “Spud” GPT-5.5 model
Axios’s concise external report tying the “Spud” codename to the GPT-5.5 launch.

Introducing GPT-5
OpenAI’s earlier GPT-5 launch post, useful for comparing product framing and strategy over time.