Schuyler Stacy | 2026-07-02

MiniMax M3 for Coding: Benchmarks, Real Pricing, and How to Call It via API (2026)

Is MiniMax M3 good for coding? Independent benchmarks vs vendor claims, the real API cost with its 512K price cliff explained, and runnable GPTProto code.

Is MiniMax M3 good for coding? The short answer: yes for agentic and multi-file work, with two caveats I'll be upfront about before you read another word. Most of the headline coding scores were run by MiniMax on its own infrastructure, and the "1 million token context" has a price cliff at 512K that hits coding agents in particular. Both are manageable once you know they're there. Neither shows up clearly in most of the launch coverage.

I'm writing this because the coding pitch around M3 got flattened into one number — 59% on SWE-Bench Pro — and that number is doing a lot of unexamined work. What follows is what the model actually is, where the independent measurements land, what it costs on a real coding workload, and how to call it through the GPTProto API. If you just want a verdict: an independent reviewer who runs the same battery on every serious model put M3 "close to GPT and Opus on real coding, not quite past them." That matches where the neutral benchmarks put it too.

What MiniMax M3 is, from a coding angle

M3 launched on June 1, 2026. It's a Mixture-of-Experts model — community trackers reading the released checkpoint put it at roughly 428B total parameters with about 23B active per token, though MiniMax hasn't published a full parameter breakdown itself, so treat the exact figures as secondary. It's natively multimodal (text, image, and video in; text out) and it's a reasoning model with a thinking mode you can toggle per request.

Before the mechanism, the motivation: why does a coding model care about context length at all? Because real engineering work isn't a single file. It's a repo, a stack trace, the test output, and the three files you'd have to touch to fix the thing — held in one place long enough to reason across them. M3's answer is a 1M-token context window with a guaranteed usable floor of 512K tokens. The engine underneath is MiniMax Sparse Attention (MSA), which selects blocks of the key-value cache instead of attending over every token pair. MiniMax reports that at 1M tokens this cuts per-token compute to roughly one-twentieth of the previous generation, with about 9× faster prefill and 15× faster decode. Those are vendor figures for the architecture, not independent measurements, but the direction is consistent with what sparse attention is supposed to buy you.
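To make "selecting blocks of the key-value cache" concrete, here is a toy sketch of generic top-k block selection in pure Python. It is not MiniMax's MSA implementation, whose internals aren't described here. The scoring rule (query dotted with each block's mean key), the block size, and every function name are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(query, keys, block_size, top_k):
    """Return indices of the top_k key blocks, scored by query . mean(block keys)."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(idx for _, idx in best)


def sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Attend only over tokens inside the selected blocks (toy: no 1/sqrt(d) scaling)."""
    token_ids = [
        i
        for b in select_blocks(query, keys, block_size, top_k)
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    weights = softmax([dot(query, keys[i]) for i in token_ids])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids))
        for d in range(dim)
    ]
    return output, token_ids
```

In this sketch the softmax runs over `top_k * block_size` tokens rather than the full sequence, which is the general shape of the saving that block-sparse attention is meant to deliver.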

There's also a first-party coding surface — MiniMax Code, their own agent built on the model. Useful to know it exists; not what this piece is about, since you're here to call the model from your own code.

One line to keep: M3 is built for sustained, multi-step work across a large context, not for one-shot snippets. That framing explains almost every trade-off below.

The coding benchmarks: what MiniMax reports vs. what's independently measured

Here's the split that most write-ups collapse. The first six rows are MiniMax's own reported coding and agentic scores. The last row is what a neutral third party measured.

Source	Metric	Score
MiniMax (vendor-run, own infra, Claude Code scaffolding)	SWE-Bench Pro	59.0%
MiniMax	SWE-Bench Verified	80.5%
MiniMax	Terminal-Bench 2.1	66.0%
MiniMax	SWE-fficiency	34.8%
MiniMax	KernelBench Hard	28.8%
MiniMax	MCP Atlas (tool orchestration)	74.2%
Artificial Analysis (independent)	Intelligence Index (composite)	55 — #1 in its open-weight class

Two things the vendor table won't tell you. First, the fact that these are vendor-run matters more than usual here: the SWE-Bench Pro figure was produced on MiniMax's own setup using Claude Code as the harness, and independent replication is still catching up. Take 59% as a strong signal of the tier M3 competes in, not a settled result. Second — and this is the detail I haven't seen in a single coding-focused article — when Artificial Analysis broke down M3 against its predecessor, most evaluations improved (Humanity's Last Exam 28→37, GPQA Diamond 87→93, long-context reasoning 69→74), but SciCode, the coding evaluation in that suite, slipped slightly from 47 to 45. That's a small regression and I wouldn't over-read it. But it's the one data point that complicates the clean "much better at coding" story, and it's telling that it went unmentioned everywhere.

My read: M3 is genuinely frontier-adjacent on applied software engineering — writing patches, multi-file edits, terminal work — and the independent index backing (55, top of its class) is real, not marketing. It is not a step-change over the last generation on every coding axis, and the abstract-reasoning gap is real (more on that below). Takeaway: trust the tier, verify the exact number against your own tasks.

What it actually costs on a coding workload

Sticker price first, because the honest comparison isn't the one you'll see in most posts. Through the GPTProto model page, M3 runs $0.48 per million input tokens and $0.96 per million output tokens on the standard tier.

For reference, MiniMax's own effective rate, after the permanent 50%-off it applies to the list price, is about $0.30 per million input tokens and $1.20 per million output tokens. So be precise about the comparison rather than waving at "cheaper": routed through GPTProto, input runs higher than calling MiniMax direct, output runs lower. Which one wins depends entirely on your workload's read-to-write ratio. A coding agent that ingests a large repo and emits a small diff is input-heavy, so the input rate dominates; a generation-heavy job tilts the other way. The reason to route M3 through an aggregator isn't a headline discount. It's operational: one key and one OpenAI-compatible surface for M3 alongside the rest of the catalog, instead of standing up a separate MiniMax account, region endpoint, and subscription key.
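The read-to-write trade-off can be put in numbers. The sketch below uses only the four rates quoted above; the function names are mine, and it deliberately ignores caching and any long-context surcharge, so treat it as a first-pass comparison rather than a billing model.

```python
# Per-million-token rates quoted in the text (USD).
GPTPROTO = {"input": 0.48, "output": 0.96}
MINIMAX_DIRECT = {"input": 0.30, "output": 1.20}  # "about", per the text


def cost(rates, input_tokens, output_tokens):
    """Cost in USD of one request at flat per-million rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


def break_even_ratio(a, b):
    """Input:output token ratio at which routes a and b cost the same."""
    return (b["output"] - a["output"]) / (a["input"] - b["input"])


def cheaper_route(input_tokens, output_tokens):
    """Name the cheaper route for a given request size."""
    via_proto = cost(GPTPROTO, input_tokens, output_tokens)
    direct = cost(MINIMAX_DIRECT, input_tokens, output_tokens)
    return "gptproto" if via_proto < direct else "direct"


print(break_even_ratio(GPTPROTO, MINIMAX_DIRECT))
print(cheaper_route(400_000, 100_000))
print(cheaper_route(100_000, 100_000))
```

With these rates the break-even sits at about 4 input tokens for every 3 output tokens. Above that ratio the lower direct input rate wins; below it, the lower GPTProto output rate wins.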

Now the part that actually sets coding bills, and the reason "1M context" deserves an asterisk. Pricing is flat only up to 512K input tokens. Cross that line and the whole request — input, output, and cache reads — bills at 2×. It's a step function, not a gradient. Walk through a normal agent loop: you start at 400K of input and 100K of output, comfortably under the line. But agent loops append. By turn ten or fifteen, nothing got pruned, and one turn quietly crosses 512K — at which point that entire turn pays double, on everything, not just the tokens above the threshold. A 20% bump in input can more than double the cost of a call.
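Here is the step function as a minimal sketch, using the standard-tier rates above. Two assumptions to flag: I read "512K" as 512,000 tokens (it could be 524,288), and cache reads are left out because their rate isn't given here. The function names are hypothetical.

```python
INPUT_RATE = 0.48        # USD per million input tokens (standard tier, from the text)
OUTPUT_RATE = 0.96       # USD per million output tokens
CLIFF_TOKENS = 512_000   # assumption: "512K" read as 512,000; could be 524,288
CLIFF_MULTIPLIER = 2.0   # the whole request bills at 2x above the threshold


def request_cost(input_tokens, output_tokens):
    """Cost in USD of a single request, doubling everything past the cliff."""
    base = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000
    if input_tokens > CLIFF_TOKENS:
        return base * CLIFF_MULTIPLIER
    return base


def first_turn_over_cliff(start_input, appended_per_turn, max_turns=100):
    """Return the first 1-indexed turn whose input exceeds the cliff, or None."""
    for turn in range(1, max_turns + 1):
        if start_input + appended_per_turn * (turn - 1) > CLIFF_TOKENS:
            return turn
    return None


before = request_cost(430_000, 100_000)
after = request_cost(516_000, 100_000)   # 20% more input
print(f"{before:.4f} -> {after:.4f} ({after / before:.2f}x)")
print(first_turn_over_cliff(400_000, 10_000))
```

Going from 430K to 516K input with 100K output takes the call from roughly $0.30 to roughly $0.69. An agent that starts at 400K and appends 10K per turn crosses the line on turn 13.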

The lever that pulls the other way is caching: repeated input (your system prompt, the stable parts of the codebase) reads back at a fraction of the standard rate. In agent loops, a large share of input is cacheable, so this is worth wiring in early. Takeaway: on M3, your coding bill is decided by how much context you drag along and how well you cache it, not by the per-token number on the pricing card. Budget the workload, not the sticker.
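One way to control how much context you drag along is to prune history before each call. The sketch below keeps the first message (typically the stable system prompt) and as many of the newest turns as fit a budget. The four-characters-per-token estimate is a crude assumption, not M3's tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text):
    # Crude assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages, budget_tokens, count=estimate_tokens):
    """Keep the first message plus the newest messages that fit in budget_tokens."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    used = count(head["content"])
    kept = []
    for msg in reversed(tail):          # walk from newest to oldest
        size = count(msg["content"])
        if used + size > budget_tokens:
            break                       # everything older is dropped
        kept.append(msg)
        used += size
    return [head] + kept[::-1]          # restore chronological order


history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
]
trimmed = trim_history(history, budget_tokens=250)
print([m["content"][0] for m in trimmed])
```

In practice you would set the budget somewhat below the 512K threshold to leave headroom for the estimate's error. Dropping old turns changes the prompt after the first message, so how much of it still reads from cache depends on how the provider keys its cache.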

How to call M3 through the GPTProto API

The endpoint is the OpenAI-compatible chat surface. Authentication is a raw API key in the Authorization header — no Bearer prefix, which trips people up coming from other providers. Swap the model string to MiniMax-M3 and you're running.

import glob
from pathlib import Path

import requests

# Pull a few source files into one long-context prompt.
# Capped at 20 files to stay well under the 512K-token price cliff;
# check the real token count if your files are large.
files = sorted(glob.glob("src/**/*.py", recursive=True))[:20]
codebase = "\n\n".join(
    f"# ---- {p} ----\n{Path(p).read_text(encoding='utf-8')}" for p in files
)

prompt = (
    "Here is part of a Python service. Find every place a database connection "
    "can leak on an exception path, and return the fix as a unified diff.\n\n"
    + codebase
)

resp = requests.post(
    "https://gptproto.com/v1/chat/completions",
    headers={
        "Authorization": "sk-your-gptproto-key",  # raw key, NO "Bearer" prefix
    },
    # json= serializes the body and sets Content-Type: application/json
    json={
        "model": "MiniMax-M3",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,   # default is 0.95; lower it for deterministic code edits
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()  # fail loudly on HTTP errors before parsing the body

print(resp.json()["choices"][0]["message"]["content"])

Two practical notes. The default temperature on this surface is 0.95, which is high for code; I'd drop it to 0.2 or lower when you want reproducible diffs rather than creative variation. And a disclosure: this endpoint and the raw-key auth pattern are confirmed from GPTProto's live MiniMax model documentation, with the model string swapped to MiniMax-M3; I have not smoke-tested this exact call against M3 myself, so run one live request and confirm the response shape before you wire it into CI.
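Since the response shape is the unverified part, a small guard makes a mismatch fail with a readable message instead of a bare KeyError. This is a sketch that assumes the OpenAI-style `choices[0].message.content` layout used in the call above. The top-level `error` key is my assumption about how failures might be reported, and `extract_content` is a hypothetical helper name.

```python
def extract_content(payload):
    """Return the assistant text from an OpenAI-style chat completion dict, or raise."""
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    if "error" in payload:  # assumption: errors arrive under a top-level "error" key
        raise ValueError(f"API returned an error: {payload['error']}")
    try:
        content = payload["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape, keys: {sorted(payload)}") from exc
    if not isinstance(content, str):
        raise ValueError("message content is not a string")
    return content


sample = {"choices": [{"message": {"role": "assistant", "content": "--- a/db.py"}}]}
print(extract_content(sample))
```

With a live response you would call `extract_content(resp.json())` in place of the direct indexing.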

"Supports 1M" is not "should run at 1M"

Worth separating two claims that get conflated. M3 supports a 1M-token window. Whether you should fill it is a different question, and for coding the answer is usually no. Long-context models have a documented tendency to attend well to the beginning and end of a prompt and lose things buried in the middle; MiniMax says M3 was trained specifically against this, and early reports suggest retrieval holds up across most of the window, but "most of" is doing real work in that sentence and I'd verify it at the extreme end on your own retrieval-sensitive tasks. Stack that on the 512K price cliff and the guidance writes itself: use the long context deliberately — whole-repo reasoning when you truly need cross-file awareness — not as a default dumping ground for every file you have lying around.
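To check retrieval across the window on your own material, one common approach is to plant a known fact at several depths in a long prompt and see whether the model returns it. The sketch below only builds the probes and scores an answer; sending them to the model is left to the API call shown earlier. The helper names and the default depths are assumptions, and repeated filler lines stand in for real repo content.

```python
def build_probe(filler_line, needle, depth, total_lines):
    """Place `needle` at a fractional depth (0.0 = start, 1.0 = end) among filler lines."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * total_lines)
    lines = [filler_line] * total_lines
    lines.insert(position, needle)
    return "\n".join(lines)


def probe_suite(filler_line, needle, total_lines, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one probe per depth, keyed by depth."""
    return {d: build_probe(filler_line, needle, d, total_lines) for d in depths}


def found(answer, expected):
    """Case-insensitive check that the model's answer contains the planted fact."""
    return expected.lower() in answer.lower()


suite = probe_suite("def noop(): pass", "SECRET_PORT = 8471", total_lines=1000)
print({depth: len(text) for depth, text in suite.items()})
```

Scale `total_lines` until the probe approaches the context length you actually plan to use, and keep an eye on the 512K threshold while you do.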

M3 vs. DeepSeek V4 Pro for coding

If you're choosing between the two frontier-class Chinese open-weight models for coding, the axis is clean. M3 brings native multimodality and the 1M window — it can take a screenshot of a broken UI alongside the code. DeepSeek V4 Pro is text-only and priced lower, and it's strong on the verified software-engineering suites. My rough framing: reach for M3 when multimodal input or very long context earns its keep, and for DeepSeek when you want the cheapest capable pure-text coder and don't need vision. The head-to-head numbers deserve their own treatment, so I've kept this to the decision axis and put the detailed comparison in MiniMax M3 vs. DeepSeek V4 Pro.

Who should use M3 for coding — and who shouldn't

Use it if your work is agentic and multi-file: patches across a codebase, terminal-driven tasks, long debugging sessions where history matters, or workflows where feeding a UI screenshot back to the model is genuinely useful. That's the profile M3 was trained for, and it shows.

Skip it, or at least test hard first, in three cases. If you need the absolute cheapest text-only coder and never touch images or huge context, a leaner text model will cost less per token. If your problem needs genuine novel abstract reasoning rather than competent execution — the independent reviewer who liked M3's applied coding also flagged that it lags on abstract-reasoning benchmarks — M3 is not where its strength lives. And if your plan is to self-host the weights for commercial use, read the license before you commit: M3 ships under the MiniMax Community License, which is more restrictive than the MIT or Apache terms some competitors use, and its open-weight status has been a moving target since launch. For API usage through the model page, none of that licensing friction applies — you're renting access, not redistributing weights.

FAQ

Is MiniMax M3 good for coding?

For agentic, multi-file, and long-context coding, yes — it sits at the top of its open-weight class on the independent Artificial Analysis index and near the closed frontier on applied software-engineering tasks. It's less suited to abstract-reasoning problems and to the very cheapest text-only workloads.

How much does M3 cost via API?

Through GPTProto, $0.48 per million input tokens and $0.96 per million output tokens on the standard tier. Watch the 512K threshold — requests above it bill at 2× on the entire call. See current rates on the model page.

Does M3 have a free way to try it?

MiniMax has offered trial credits directly; availability and any promotional access change often, so check the current terms on the provider you plan to use rather than trusting a number in a blog post.

Is MiniMax M3 open weights?

MiniMax announced open weights at launch and released them under the MiniMax Community License, but third-party trackers have not uniformly listed the weights as public, and the situation has shifted since June. If self-hosting is your plan, verify the current weight availability and license terms directly before building around them.

M3 or DeepSeek V4 Pro for coding?

M3 for multimodal input and very long context; DeepSeek for the cheapest capable text-only coding. Full breakdown: MiniMax M3 vs. DeepSeek V4 Pro.
