What MiniMax M3 is, from a coding angle
M3 launched on June 1, 2026. It's a Mixture-of-Experts model. Community trackers reading the released checkpoint put it at roughly 428B total parameters with about 23B active per token, but MiniMax hasn't published a full parameter breakdown itself, so treat the exact figures as unofficial. It's natively multimodal (text, image, and video in; text out), and it's a reasoning model with a thinking mode you can toggle per request.
Before the mechanism, the motivation: why does a coding model care about context length at all? Because real engineering work isn't a single file. It's a repo, a stack trace, the test output, and the three files you'd have to touch to fix the thing — held in one place long enough to reason across them. M3's answer is a 1M-token context window with a guaranteed usable floor of 512K tokens. The engine underneath is MiniMax Sparse Attention (MSA), which selects blocks of the key-value cache instead of attending over every token pair. MiniMax reports that at 1M tokens this cuts per-token compute to roughly one-twentieth of the previous generation, with about 9× faster prefill and 15× faster decode. Those are vendor figures for the architecture, not independent measurements, but the direction is consistent with what sparse attention is supposed to buy you.
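To make "selects blocks of the key-value cache" concrete, here is a toy sketch of generic block selection in plain Python. It is not MiniMax's MSA implementation, which this piece doesn't describe in detail. The scoring rule (query dotted with each block's mean key), the block size, and all function names are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Rank key blocks by query . mean(block keys) and keep the top_k block indices."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(index for _, index in best)

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens inside the selected blocks."""
    positions = []
    for index in select_blocks(query, keys, block_size, top_k):
        start = index * block_size
        positions.extend(range(start, min(start + block_size, len(keys))))
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[p]) / scale for p in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    output = [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(len(values[0]))
    ]
    return output, positions

# Twelve cached tokens in three blocks of four; only the middle block aligns with the query.
keys = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4 + [[-1.0, 0.0]] * 4
values = [[float(i)] for i in range(12)]
output, positions = sparse_attention([1.0, 0.0], keys, values, block_size=4, top_k=1)
print(positions)  # the query attends to 4 of the 12 cached tokens
```

The saving comes from the last step: the softmax runs over `top_k * block_size` positions rather than the whole cache, so per-token work stops growing linearly with context length once the block scores are computed.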
There's also a first-party coding surface — MiniMax Code, their own agent built on the model. Useful to know it exists; not what this piece is about, since you're here to call the model from your own code.
One line to keep: M3 is built for sustained, multi-step work across a large context, not for one-shot snippets. That framing explains almost every trade-off below.
The coding benchmarks: what MiniMax reports vs. what's independently measured
Here's the split that most write-ups collapse. The MiniMax rows are the vendor's own reported coding and agentic scores; the Artificial Analysis row is what a neutral third party measured.
| Source | Metric | Score |
| --- | --- | --- |
| MiniMax (vendor-run, own infra, Claude Code scaffolding) | SWE-Bench Pro | 59.0% |
| MiniMax | SWE-Bench Verified | 80.5% |
| MiniMax | Terminal-Bench 2.1 | 66.0% |
| MiniMax | SWE-fficiency | 34.8% |
| MiniMax | KernelBench Hard | 28.8% |
| MiniMax | MCP Atlas (tool orchestration) | 74.2% |
| Artificial Analysis (independent) | Intelligence Index (composite) | 55 (#1 in its open-weight class) |
Two things the vendor table won't tell you. First, the fact that these are vendor-run matters more than usual here: the SWE-Bench Pro figure was produced on MiniMax's own setup using Claude Code as the harness, and independent replication is still catching up. Take 59% as a strong signal of the tier M3 competes in, not a settled result. Second — and this is the detail I haven't seen in a single coding-focused article — when Artificial Analysis broke down M3 against its predecessor, most evaluations improved (Humanity's Last Exam 28→37, GPQA Diamond 87→93, long-context reasoning 69→74), but SciCode, the coding evaluation in that suite, slipped slightly from 47 to 45. That's a small regression and I wouldn't over-read it. But it's the one data point that complicates the clean "much better at coding" story, and it's telling that it went unmentioned everywhere.
My read: M3 is genuinely frontier-adjacent on applied software engineering — writing patches, multi-file edits, terminal work — and the independent index backing (55, top of its class) is real, not marketing. It is not a step-change over the last generation on every coding axis, and the abstract-reasoning gap is real (more on that below). Takeaway: trust the tier, verify the exact number against your own tasks.
What it actually costs on a coding workload
Sticker price first, because the honest comparison isn't the one you'll see in most posts. Through the GPT Proto model page, M3 runs $0.48 per million input tokens and $0.96 per million output tokens on the standard tier.
For reference, MiniMax's own effective rate — after the permanent 50%-off it applies to the list price — is about $0.30 input and $1.20 output. So be precise about the comparison rather than waving at "cheaper": routed through GPT Proto, input runs higher than calling MiniMax direct, output runs lower. Which one wins depends entirely on your workload's read-to-write ratio. A coding agent that ingests a large repo and emits a small diff is input-heavy, so the input rate dominates; a generation-heavy job tilts the other way. The reason to route M3 through an aggregator isn't a headline discount — it's operational: one key and one OpenAI-compatible surface for M3 alongside the rest of the catalog, instead of standing up a separate MiniMax account, region endpoint, and subscription key.
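The read-to-write point can be made numeric. The sketch below uses only the four rates quoted above and solves for the input-to-output ratio where the two routes cost the same. It ignores caching and any long-context surcharge, and the function names and example token counts are illustrative.

```python
# USD per million tokens, as quoted in this article.
GPTPROTO = {"input": 0.48, "output": 0.96}
MINIMAX_DIRECT = {"input": 0.30, "output": 1.20}

def request_cost(rates, input_tokens, output_tokens):
    """Cost in USD for one request at flat per-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

def break_even_ratio(a, b):
    """Input-to-output token ratio at which routes a and b cost the same.

    Solves a_in * r + a_out = b_in * r + b_out for r = input / output.
    """
    return (b["output"] - a["output"]) / (a["input"] - b["input"])

ratio = break_even_ratio(GPTPROTO, MINIMAX_DIRECT)
print(f"break-even input:output ratio is about {ratio:.2f}")

# An input-heavy agent turn versus a generation-heavy job (hypothetical token counts).
for label, tokens_in, tokens_out in [("repo in, diff out", 400_000, 5_000),
                                     ("short prompt, long output", 10_000, 50_000)]:
    via_proto = request_cost(GPTPROTO, tokens_in, tokens_out)
    direct = request_cost(MINIMAX_DIRECT, tokens_in, tokens_out)
    print(f"{label}: GPT Proto ${via_proto:.4f} vs direct ${direct:.4f}")
```

With these rates the break-even sits at roughly 4 input tokens for every 3 output tokens: send more input than that per output token and the direct rate is cheaper, send less and the GPT Proto rate is.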
Now the part that actually sets coding bills, and the reason "1M context" deserves an asterisk. Pricing is flat only up to 512K input tokens. Cross that line and the whole request — input, output, and cache reads — bills at 2×. It's a step function, not a gradient. Walk through a normal agent loop: you start at 400K of input and 100K of output, comfortably under the line. But agent loops append. By turn ten or fifteen, nothing got pruned, and one turn quietly crosses 512K — at which point that entire turn pays double, on everything, not just the tokens above the threshold. A 20% bump in input can more than double the cost of a call.
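Here is the step function as a small sketch. The assumptions: "512K" is read as 512,000 tokens, the rates are the GPT Proto figures quoted earlier, cache reads are left out, and the 10K-tokens-per-turn growth is a made-up number for illustration. Check the live pricing page for the exact threshold before relying on it.

```python
THRESHOLD = 512_000  # assumption: "512K" taken as 512,000 input tokens

def turn_cost(input_tokens, output_tokens, in_rate=0.48, out_rate=0.96):
    """USD cost of one request; the whole request doubles once input crosses the threshold."""
    base = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    multiplier = 2 if input_tokens > THRESHOLD else 1
    return base * multiplier

def first_turn_over(start_tokens, growth_per_turn, max_turns=100):
    """First turn (1-indexed) whose input exceeds the threshold, or None if it never does."""
    tokens = start_tokens
    for turn in range(1, max_turns + 1):
        if tokens > THRESHOLD:
            return turn
        tokens += growth_per_turn
    return None

under = turn_cost(500_000, 10_000)
over = turn_cost(600_000, 10_000)  # 20% more input
print(f"${under:.4f} -> ${over:.4f} ({over / under:.2f}x)")
print("crosses the line on turn", first_turn_over(400_000, 10_000))
```

In this example a 20% increase in input raises the bill by about 2.4×, and a loop that starts at 400K and appends 10K per turn crosses the line on turn 13.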
The lever that pulls the other way is caching: repeated input (your system prompt, the stable parts of the codebase) reads back at a fraction of the standard rate. In agent loops, a large share of input is cacheable, so this is worth wiring in early. Takeaway: on M3, your coding bill is decided by how much context you drag along and how well you cache it, not by the per-token number on the pricing card. Budget the workload, not the sticker.
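To budget for caching, it helps to think in terms of a blended input rate. The sketch below assumes a simple model in which a fraction of input tokens is billed at a cache-read rate and the rest at the standard rate. The cache-read rate used in the example is a placeholder, not a quoted price, so substitute the figure from the pricing page.

```python
def blended_input_rate(standard_rate, cache_read_rate, cached_fraction):
    """Effective USD per million input tokens when part of the input is served from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return cached_fraction * cache_read_rate + (1.0 - cached_fraction) * standard_rate

STANDARD = 0.48             # GPT Proto input rate quoted in this article
PLACEHOLDER_CACHE = 0.048   # hypothetical: 10% of standard, for illustration only

for fraction in (0.0, 0.5, 0.8):
    rate = blended_input_rate(STANDARD, PLACEHOLDER_CACHE, fraction)
    print(f"{fraction:.0%} cached -> ${rate:.4f} per million input tokens")
```

Whatever the real cache rate turns out to be, the shape is the same: the blended rate falls linearly with the cached fraction, which is why a stable system prompt and a stable file ordering pay off in long loops.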
How to call M3 through the GPT Proto API
The endpoint is the OpenAI-compatible chat surface. Authentication is a raw API key in the Authorization header — no Bearer prefix, which trips people up coming from other providers. Swap the model string to MiniMax-M3 and you're running.
import glob
import json
from pathlib import Path

import requests

# Pull a few source files into one long-context prompt.
# M3's usable floor is 512K tokens and pricing is flat up to that point,
# so 20 ordinary source files should fit without hitting the price cliff.
files = glob.glob("src/**/*.py", recursive=True)[:20]
codebase = "\n\n".join(
    f"# ---- {p} ----\n{Path(p).read_text(encoding='utf-8')}" for p in files
)

prompt = (
    "Here is part of a Python service. Find every place a database connection "
    "can leak on an exception path, and return the fix as a unified diff.\n\n"
    + codebase
)

resp = requests.post(
    "https://gptproto.com/v1/chat/completions",
    headers={
        "Authorization": "sk-your-gptproto-key",  # raw key, NO "Bearer" prefix
        "Content-Type": "application/json",
    },
    data=json.dumps({
        "model": "MiniMax-M3",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # default is 0.95; lower it for more deterministic code edits
        "stream": False,
    }),
    timeout=120,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
print(resp.json()["choices"][0]["message"]["content"])
Two practical notes. The default temperature on this surface is 0.95, which is high for code; I'd drop it to 0.2 or lower when you want reproducible diffs rather than creative variation. And a disclosure: this endpoint and the raw-key auth pattern are confirmed from GPT Proto's live MiniMax model documentation, with the model string swapped to MiniMax-M3; I have not smoke-tested this exact call against M3 myself, so run one live request and confirm the response shape before you wire it into CI.
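One way to do that shape check is a small guard that runs on the parsed JSON before anything downstream touches it. This is a sketch that assumes the standard OpenAI-style layout (`choices[0].message.content`). The top-level `error` key is an assumption about how failures are reported, and `extract_content` is a hypothetical helper name.

```python
def extract_content(payload):
    """Return the assistant text from an OpenAI-style chat completion, or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("response body is not a JSON object")
    if "error" in payload:  # assumed error shape; confirm against a real failure
        raise ValueError(f"API returned an error: {payload['error']}")
    choices = payload.get("choices")
    if not isinstance(choices, list) or not choices:
        raise ValueError("response has no 'choices' list")
    first = choices[0]
    message = first.get("message") if isinstance(first, dict) else None
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, str):
        raise ValueError("choices[0].message.content is missing or not a string")
    return content

# Offline check against a hand-written payload in the expected shape.
sample = {"choices": [{"message": {"role": "assistant", "content": "--- a/db.py"}}]}
print(extract_content(sample))
```

In the request example above, this would replace the final `print` line with `print(extract_content(resp.json()))`, so a changed or unexpected response fails with a readable message rather than a `KeyError` deep in a CI log.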
"Supports 1M" is not "should run at 1M"
Worth separating two claims that get conflated. M3 supports a 1M-token window. Whether you should fill it is a different question, and for coding the answer is usually no. Long-context models have a documented tendency to attend well to the beginning and end of a prompt and lose things buried in the middle; MiniMax says M3 was trained specifically against this, and early reports suggest retrieval holds up across most of the window, but "most of" is doing real work in that sentence and I'd verify it at the extreme end on your own retrieval-sensitive tasks. Stack that on the 512K price cliff and the guidance writes itself: use the long context deliberately — whole-repo reasoning when you truly need cross-file awareness — not as a default dumping ground for every file you have lying around.
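Using the window deliberately can be as simple as packing files against an explicit budget instead of globbing everything. The sketch below is one way to do it. The four-characters-per-token estimate is a rough heuristic rather than M3's tokenizer, the 512,000 budget is an assumed reading of "512K", and `pack_files` is a hypothetical helper.

```python
import math

INPUT_BUDGET = 512_000  # assumption: stay at or under the flat-price threshold

def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate; a real tokenizer will give different counts."""
    return math.ceil(len(text) / chars_per_token)

def pack_files(files, budget_tokens=INPUT_BUDGET, chars_per_token=4):
    """Take (path, text) pairs in priority order, skipping any file that would exceed the budget."""
    selected, used = [], 0
    for path, text in files:
        cost = estimate_tokens(text, chars_per_token)
        if used + cost > budget_tokens:
            continue  # too big for the remaining room; a smaller file later may still fit
        selected.append(path)
        used += cost
    return selected, used

# Toy example with a tiny budget so the skipping behaviour is visible.
demo = [("a.py", "x" * 400), ("b.py", "y" * 400), ("c.py", "z" * 400), ("d.py", "w" * 100)]
chosen, used = pack_files(demo, budget_tokens=250)
print(chosen, used)
```

Because the estimate is approximate, leave headroom below the real threshold, and put the files most relevant to the task first so they are the ones that survive the cut.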
M3 vs. DeepSeek V4 Pro for coding
If you're choosing between the two frontier-class Chinese open-weight models for coding, the axis is clean. M3 brings native multimodality and the 1M window — it can take a screenshot of a broken UI alongside the code. DeepSeek V4 Pro is text-only and priced lower, and it's strong on the verified software-engineering suites. My rough framing: reach for M3 when multimodal input or very long context earns its keep, and for DeepSeek when you want the cheapest capable pure-text coder and don't need vision. The head-to-head numbers deserve their own treatment, so I've kept this to the decision axis and put the detailed comparison in MiniMax M3 vs. DeepSeek V4 Pro.
Who should use M3 for coding — and who shouldn't
Use it if your work is agentic and multi-file: patches across a codebase, terminal-driven tasks, long debugging sessions where history matters, or workflows where feeding a UI screenshot back to the model is genuinely useful. That's the profile M3 was trained for, and it shows.
Skip it, or at least test hard first, in three cases. If you need the absolute cheapest text-only coder and never touch images or huge context, a leaner text model will cost less per token. If your problem needs genuine novel abstract reasoning rather than competent execution — the independent reviewer who liked M3's applied coding also flagged that it lags on abstract-reasoning benchmarks — M3 is not where its strength lives. And if your plan is to self-host the weights for commercial use, read the license before you commit: M3 ships under the MiniMax Community License, which is more restrictive than the MIT or Apache terms some competitors use, and its open-weight status has been a moving target since launch. For API usage through the model page, none of that licensing friction applies — you're renting access, not redistributing weights.