Google’s Gemini 3 Outperforms GPT‑5.2 in Tests

Gemini 3 vs GPT-5.2

The Gemini 3 vs GPT‑5.2 contest is tightening as Google rolls out Gemini 3 Flash worldwide and highlights benchmark results where it edges GPT‑5.2, while OpenAI says GPT‑5.2 sets new highs in knowledge work, long-context reasoning, and agentic tool use.

What happened

Google expanded its Gemini 3 family in December with the release of Gemini 3 Flash, positioning it as frontier intelligence built for speed and rolling it out across the Gemini app, Search’s AI Mode, and developer and enterprise channels such as the Gemini API, AI Studio, Vertex AI, and Gemini Enterprise.
Google said Flash is designed to keep Pro-grade reasoning while reducing latency and cost, targeting high-frequency workflows such as iterative development and agentic applications.

Earlier, Google launched Gemini 3 Pro (preview) and introduced Gemini 3 Deep Think as an enhanced reasoning mode, describing Gemini 3 as a new generation focused on reasoning depth, multimodality, and agentic coding.
OpenAI then introduced GPT‑5.2, rolling it out in ChatGPT (starting with paid plans) and making it available to developers via its API, describing the model as optimized for professional work and long-running agents.

Release timeline

| Date (2025) | Update | Key details |
| --- | --- | --- |
| Nov 17 | Google introduces Gemini 3 | Gemini 3 Pro preview and Gemini 3 Deep Think announced; Gemini 3 Pro highlights include 1501 Elo on LMArena, 37.5% on Humanity’s Last Exam (no tools), and 91.9% on GPQA Diamond. |
| Dec 10 | OpenAI introduces GPT‑5.2 | OpenAI reports GPT‑5.2 Thinking scores of 70.9% on GDPval (wins or ties), 80% on SWE-bench Verified, 34.5% on Humanity’s Last Exam (no tools), and 92.4% on GPQA Diamond (no tools). |
| Dec 16 | Google releases Gemini 3 Flash | Google reports Gemini 3 Flash scores of 33.7% on Humanity’s Last Exam (no tools), 90.4% on GPQA Diamond, 81.2% on MMMU Pro, and 78% on SWE-bench Verified. |

Where Gemini 3 outperforms GPT-5.2 in tests

Google’s headline performance claim for Flash is that it reaches 81.2% on MMMU Pro, higher than OpenAI’s reported 79.5% for GPT‑5.2 on MMMU Pro (no tools).
That matters because MMMU Pro assesses multimodal understanding (handling mixed inputs such as text and images), which is central to product use cases such as visual Q&A, content analysis, and UI understanding.

Google also emphasized speed and efficiency, saying Flash is 3x faster than Gemini 2.5 Pro based on third-party benchmarking and uses 30% fewer tokens on average than 2.5 Pro on typical traffic while maintaining stronger performance.
On coding-agent tasks, Google reported that Gemini 3 Flash scores 78% on SWE-bench Verified, outperforming not only the 2.5 series but also Gemini 3 Pro on that benchmark.

At the flagship end, Google reported that Gemini 3 Pro’s benchmark highlights include 37.5% on Humanity’s Last Exam (no tools), 91.9% on GPQA Diamond, 81% on MMMU Pro, and 76.2% on SWE-bench Verified.
Google also said Gemini 3 Pro has a 1 million-token context window, aimed at long-document and multi-file workflows.

Where GPT-5.2 still leads (and why it matters)

OpenAI framed GPT‑5.2 as a professional-productivity model family, reporting that GPT‑5.2 sets a new high on GDPval, an evaluation spanning well-specified knowledge-work tasks across 44 occupations, by beating or tying top professionals 70.9% of the time according to expert judges.
OpenAI also reported large gains in tool-driven workflows, including 98.7% on Tau2-bench Telecom for GPT‑5.2 Thinking, which it described as demonstrating reliable tool use across long, multi-turn tasks.

On core academic-style evaluations, OpenAI reported that GPT‑5.2 Thinking posts 92.4% on GPQA Diamond (no tools) and 34.5% on Humanity’s Last Exam (no tools).
On coding benchmarks, OpenAI reported 80% on SWE-bench Verified for GPT‑5.2 Thinking and 55.6% on SWE-Bench Pro (Public), describing SWE-Bench Pro as more rigorous and multi-language than SWE-bench Verified.

OpenAI also said GPT‑5.2 reduces hallucinations compared with GPT‑5.1 on a set of de-identified ChatGPT queries, reporting that responses containing errors were 30% less common.
For long-context work, OpenAI reported strong performance on its MRCRv2 evaluation and said GPT‑5.2 Thinking reaches near-100% accuracy on a 4-needle variant out to 256k tokens.
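The MRCRv2 numbers describe a multi-needle retrieval setup: several short facts ("needles") are buried in a very long distractor context, and the model must recover all of them. The exact MRCRv2 protocol is OpenAI's own; the harness below is only a hypothetical sketch of the general technique, with the model call stubbed out by a made-up answer.

```python
import random

def build_haystack(needles, filler_count=10_000, seed=0):
    """Scatter needle sentences among filler lines at random positions."""
    rng = random.Random(seed)
    lines = [f"Background sentence {i}." for i in range(filler_count)]
    for key, value in needles.items():
        pos = rng.randrange(len(lines))
        lines.insert(pos, f"The secret code for {key} is {value}.")
    return "\n".join(lines)

def score_retrieval(answer, needles):
    """Fraction of needle values the model's answer actually contains."""
    hits = sum(1 for value in needles.values() if value in answer)
    return hits / len(needles)

needles = {"alpha": "738291", "beta": "402177", "gamma": "995310", "delta": "118264"}
haystack = build_haystack(needles)

# In a real eval, `haystack` plus a retrieval question would be sent to the
# model; here we score a fabricated answer that recovered 3 of the 4 needles.
fake_answer = "alpha=738291, beta=402177, gamma=995310, delta=unknown"
print(score_retrieval(fake_answer, needles))  # 0.75
```

Scaling `filler_count` up until the context approaches 256k tokens is what turns this toy into a long-context stress test.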

Benchmark snapshot: Gemini 3 vs GPT-5.2

The figures below are the vendors’ reported results (and, in one case, an independent code-quality study), so they should be read as directional indicators rather than as a single standardized leaderboard.

| Benchmark / metric | Gemini 3 Pro | Gemini 3 Flash | GPT‑5.2 Thinking |
| --- | --- | --- | --- |
| MMMU Pro (multimodal) | 81% | 81.2% | 79.5% (no tools) |
| Humanity’s Last Exam | 37.5% (no tools) | 33.7% (no tools) | 34.5% (no tools) |
| GPQA Diamond | 91.9% | 90.4% | 92.4% (no tools) |
| SWE-bench Verified | 76.2% | 78% | 80% |
| GDPval (knowledge work) | Not disclosed in Google post | Not disclosed in Google post | 70.9% (wins or ties) |
| Tau2-bench Telecom (tool use) | Not disclosed in Google post | Not disclosed in Google post | 98.7% |

Cost and rollout pressure

Google priced Gemini 3 Flash at $0.50 per 1M input tokens and $3 per 1M output tokens (with separate audio input pricing), explicitly framing it as a fraction of the cost while targeting production-scale usage.
OpenAI priced GPT‑5.2 at $1.75 per 1M input tokens and $14 per 1M output tokens (with cached-input discounts), while arguing that token efficiency can reduce the cost required to reach a given quality level on agentic evaluations.

| Model | Input price (per 1M tokens) | Output price (per 1M tokens) |
| --- | --- | --- |
| Gemini 3 Flash | $0.50 | $3.00 |
| GPT‑5.2 | $1.75 | $14.00 |
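The listed rates make per-workload comparisons straightforward. The sketch below applies the vendors' stated per-million-token prices to a hypothetical workload; the token counts are illustrative, not from either announcement.

```python
# Vendor-listed prices in USD per 1M tokens: (input rate, output rate).
PRICES = {
    "gemini-3-flash": (0.50, 3.00),
    "gpt-5.2": (1.75, 14.00),
}

def workload_cost(model, input_tokens, output_tokens):
    """Cost in USD for a given token volume at the listed per-1M rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Hypothetical workload: 2M input tokens and 0.5M output tokens.
print(workload_cost("gemini-3-flash", 2_000_000, 500_000))  # 2.5
print(workload_cost("gpt-5.2", 2_000_000, 500_000))         # 10.5
```

Note that raw rates ignore token efficiency: if one model needs fewer output tokens to reach the same quality, as OpenAI argues for GPT‑5.2, the effective gap narrows.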

Google said Gemini 3 Flash is rolling out broadly across consumer and enterprise surfaces, including becoming the default model in the Gemini app, which pushes its new capability set to a large installed base quickly.
OpenAI said GPT‑5.2 (Instant, Thinking, and Pro) is rolling out in ChatGPT starting with paid tiers, with the API versions available to all developers, a standard pattern for managing reliability and capacity during launches.

Independent signal: code quality vs code correctness

A separate analysis from Sonar, based on its SonarQube evaluations across thousands of Java assignments, argued that pass-rate benchmarks alone can miss maintainability, security, and complexity trade-offs in AI-generated code.
In Sonar’s reported results, Gemini 3 Pro achieved an 81.72% pass rate while keeping cognitive complexity and verbosity low, while GPT‑5.2 High recorded an 80.66% pass rate but generated the highest code volume among the compared models.

Sonar also reported major differences in issue types, including 470 concurrency issues per million lines of code (MLOC) for GPT‑5.2 High versus 69 for Gemini 3 Pro in its dataset.
On security posture in the same evaluation, Sonar reported that GPT‑5.2 High had 16 blocker vulnerabilities per MLOC, compared with 66 for Gemini 3 Pro.
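Sonar's per-MLOC figures are a simple rate normalization: a raw issue count divided by the lines of code analyzed, scaled to one million lines, so results from differently sized codebases are comparable. A minimal sketch (the raw counts below are illustrative, not Sonar's data):

```python
def issues_per_mloc(issue_count, lines_of_code):
    """Normalize a raw issue count to issues per million lines of code."""
    return issue_count / (lines_of_code / 1_000_000)

# Illustrative: 47 concurrency issues in 100k generated lines
# corresponds to a rate of 470 per MLOC.
print(issues_per_mloc(47, 100_000))  # 470.0
```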

Final thoughts

Gemini 3 vs GPT‑5.2 is no longer a single-winner story: Google is using Gemini 3 Flash to claim leadership on at least one visible multimodal benchmark (MMMU Pro) while competing aggressively on speed and price for production use.
OpenAI is countering by emphasizing professional knowledge-work evaluations, long-context reasoning, and high tool-use reliability, areas that matter for enterprise workflows where agents must execute multi-step tasks end to end.
For buyers and builders, the practical decision increasingly looks like model-by-model selection across multimodal accuracy, coding, tool reliability, context length, and cost, rather than loyalty to a single vendor’s best overall claim.

