Two models, two philosophies
Most head-to-head write-ups line these two up like they're the same product at different price points. They aren't. DeepSeek shipped V4 Pro on April 24, 2026 under an MIT license, and it's the deeper specialist — a text-only mixture-of-experts model tuned hard for agentic coding and STEM reasoning. MiniMax shipped M3 on June 1, 2026, and it's the broader generalist — the first open-weight model to fold frontier coding, a million-token context, and native image-and-video input into one system.
That single difference — multimodal versus text-only — decides more of the choice than any benchmark does. So it's worth stating plainly before the numbers start: you are not picking the "better model." You're picking which shape of model fits the job. The rest of this comparison is about making that call on real data instead of a leaderboard screenshot.
Side-by-side: specs and price
Here's the ground truth on paper, using GPT Proto's actual per-million-token rates rather than a headline figure from someone's launch post.
| | MiniMax M3 | DeepSeek V4 Pro |
|---|---|---|
| Released | June 1, 2026 | April 24, 2026 |
| Architecture | MoE, 428B total / 23B active | MoE, 1.6T total / 49B active |
| Attention design | MiniMax Sparse Attention (MSA) | DeepSeek Sparse Attention (DSA) |
| Context window | 1M tokens | 1M tokens (384K max output) |
| Input modalities | text, image, video | text only |
| Output | text | text |
| License / weights | Open weights (Hugging Face) | MIT, open weights (Hugging Face) |
| GPT Proto input price | $0.48 / 1M tokens | $1.3914 / 1M tokens |
| GPT Proto output price | $0.96 / 1M tokens | $2.7838 / 1M tokens |
Two things in that table matter more than the rest. M3 takes image and video input; DeepSeek doesn't. And DeepSeek activates roughly twice the parameters per token (49B vs 23B) out of a total pool nearly four times larger (1.6T vs 428B). It's the heavier model doing more compute on each token, even though it activates a smaller share of its pool (about 3% versus about 5% for M3), and that shows up in its deep-reasoning scores and, on most hosts, in its price.
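To put numbers on the parameter comparison, here's a small sketch that uses only the figures from the spec table. The dictionary layout and the `active_fraction` name are illustrative choices, not anything either vendor publishes.

```python
# Parameter counts from the spec table, in billions.
MODELS = {
    "MiniMax M3": {"total_b": 428, "active_b": 23},
    "DeepSeek V4 Pro": {"total_b": 1600, "active_b": 49},
}

def active_fraction(spec):
    """Share of the total parameter pool used for a single token."""
    return spec["active_b"] / spec["total_b"]

for name, spec in MODELS.items():
    print(f"{name}: {active_fraction(spec):.1%} of parameters active per token")

m3, v4 = MODELS["MiniMax M3"], MODELS["DeepSeek V4 Pro"]
print(f"Active-parameter ratio (V4 Pro / M3): {v4['active_b'] / m3['active_b']:.2f}x")
print(f"Total-parameter ratio (V4 Pro / M3): {v4['total_b'] / m3['total_b']:.2f}x")
```

V4 Pro does more work per token in absolute terms, while M3 uses a larger slice of a much smaller pool.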
Coding and agentic performance
This is where the comparison usually goes wrong, so read the numbers carefully.
DeepSeek V4 Pro, in its maximum reasoning mode, scores 80.6% on SWE-bench Verified, the highest of any open-weight model and level with Gemini 3.1 Pro. It also posts 93.5 on LiveCodeBench and a 3206 Codeforces rating. Those are algorithmic and competitive-programming strengths, and DeepSeek's scores have been re-run independently, which matters for trust.
MiniMax M3's official coding numbers are 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, 34.8% on SWE-fficiency, 28.8% on KernelBench Hard, and 74.2% on MCP Atlas. On Artificial Analysis's independent Intelligence Index — a cross-model score, not a vendor benchmark — M3 lands at 44, second in the peer group it tracks, against a category median around 25.
Now the trap. You'll see a dozen pages put M3's "59%" next to DeepSeek's "80.6%" and declare DeepSeek the runaway coding winner. That comparison is invalid. SWE-Bench Pro and SWE-bench Verified are two different benchmarks with different problem sets and difficulty; Pro is the harder, newer variant. Comparing a Pro score to a Verified score tells you nothing about which model is better. It's a units error dressed up as a conclusion. The two labs simply reported different benchmarks, and neither published a clean head-to-head on the same one.
My read: on independently measured general intelligence, they're close; on published deep-reasoning and competitive-coding scores, DeepSeek's are higher and better verified; on any task that involves seeing something, the comparison doesn't start, because M3 is the only one that can.
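If you keep benchmark numbers in a script or a spreadsheet export, one way to avoid this mistake is to make the benchmark name part of the score and refuse to compare across names. The sketch below is only an illustration of that idea; the `Score` class and `compare` function are hypothetical helpers, and the two values are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    value: float

def compare(a: Score, b: Score) -> str:
    """Name the higher scorer, but only when both scores share a benchmark."""
    if a.benchmark != b.benchmark:
        raise ValueError(
            f"Cannot compare {a.benchmark!r} with {b.benchmark!r}: different benchmarks"
        )
    if a.value == b.value:
        return "tie"
    return a.model if a.value > b.value else b.model

m3 = Score("MiniMax M3", "SWE-Bench Pro", 59.0)
v4 = Score("DeepSeek V4 Pro", "SWE-bench Verified", 80.6)

try:
    compare(m3, v4)
except ValueError as err:
    # The mismatch is reported instead of silently producing a "winner".
    print(err)
```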
The one capability that isn't a tie
DeepSeek V4 Pro is text-only. MiniMax M3 was built multimodal from the first training step, and it accepts images and video alongside text on the same endpoint. That's not a spec-sheet footnote — it's a category difference.
If you're building an agent that debugs from a screenshot, turns a Figma mock into a component, reads a chart, or watches a screen recording of a reproduction to find the bug, M3 can do it and DeepSeek cannot. There is no prompt, no price, and no fine-tune that gives a text-only model eyes. So for any workflow where the model is part of what the user sees and interacts with — UI work, visual QA, document-with-diagrams parsing — the choice is made before you look at a single benchmark. Conversely, if nothing in your pipeline is ever an image, you're paying for a capability you'll never call, and DeepSeek's text specialization is the better-targeted buy.
Cost, honestly
On GPT Proto, running both models off one balance, MiniMax M3 is the cheaper of the two — $0.48 input and $0.96 output per million tokens, against $1.3914 and $2.7838 for DeepSeek V4 Pro. At GPT Proto's rates, M3 costs roughly a third of V4 Pro per token in both directions.
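As a rough sketch of what those rates mean per request, the snippet below prices one agent turn on each model. The rates are the GPT Proto figures from the table; the 40K-in / 2K-out turn size is a made-up example, so substitute your own token counts.

```python
# GPT Proto list prices quoted in this article, in USD per 1M tokens.
RATES = {
    "MiniMax-M3": {"input": 0.48, "output": 0.96},
    "deepseek-v4-pro": {"input": 1.3914, "output": 2.7838},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token rates."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

# Hypothetical agent turn: 40K tokens of context in, 2K tokens out.
for model in RATES:
    print(f"{model}: ${request_cost(model, 40_000, 2_000):.4f} per turn")

ratio = request_cost("MiniMax-M3", 40_000, 2_000) / request_cost("deepseek-v4-pro", 40_000, 2_000)
print(f"M3 cost as a share of V4 Pro: {ratio:.1%}")
```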
But I'd be misleading you if I stopped there, because "which is cheaper" depends heavily on where you run each model. DeepSeek's own native economics for V4 Pro are aggressive in a way that doesn't always survive being hosted elsewhere: on DeepSeek's first-party API the model lists around $0.435 input and $0.87 output per million tokens, and a cache hit (the part that actually moves bills) costs about $0.003625 per million input tokens, roughly 120 times cheaper than a cache miss. Agentic coding loops resend the same system prompt and file context on every turn, so most of their input lands in cache. If you're pushing high volumes of pure text and you're willing to run DeepSeek natively, that cache pricing is genuinely hard to beat, and it's the strongest single argument in V4 Pro's favor.
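To see how much the cache rate moves the effective input price, here's a minimal sketch using the first-party figures quoted above. The hit rates in the loop are illustrative placeholders, not measurements; your real hit rate depends on how stable your prompts are between turns.

```python
# DeepSeek first-party figures quoted above, in USD per 1M input tokens.
CACHE_MISS = 0.435
CACHE_HIT = 0.003625

def blended_input_price(hit_rate):
    """Effective input price per 1M tokens when `hit_rate` of input tokens hit cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_HIT + (1.0 - hit_rate) * CACHE_MISS

print(f"Miss/hit price ratio: {CACHE_MISS / CACHE_HIT:.0f}x")
for hit_rate in (0.0, 0.5, 0.9):  # illustrative hit rates, not measurements
    print(f"{hit_rate:.0%} cached: ${blended_input_price(hit_rate):.4f} per 1M input tokens")
```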
So the honest read on cost has two layers. On one aggregated key through GPT Proto, M3 is the lower per-token line item. For raw, high-volume text throughput where you'll optimize around DeepSeek's native cache rate, V4 Pro's economics pull ahead. And underneath both: per-token price is not per-task price. A model that costs less per token but needs three tries to land a working patch is not the cheap option — it just moved the cost into your debugging time. Benchmark the two on your own tasks before you let a pricing table decide.
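The per-token versus per-task distinction can be made concrete with a simple model: if retries are independent and each attempt succeeds with a fixed probability, the expected number of attempts is one over that probability. The sketch below assumes exactly that, which is a simplification, and every number in it is hypothetical rather than a measurement of either model.

```python
def cost_per_solved_task(cost_per_attempt, success_rate):
    """Expected spend for one success, assuming independent retries with a
    fixed per-attempt success rate (expected attempts = 1 / success_rate)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical numbers for illustration only.
cheap_but_flaky = cost_per_solved_task(cost_per_attempt=0.02, success_rate=0.30)
pricier_but_reliable = cost_per_solved_task(cost_per_attempt=0.05, success_rate=0.90)

print(f"Cheaper per attempt: ${cheap_but_flaky:.4f} per solved task")
print(f"Pricier per attempt: ${pricier_but_reliable:.4f} per solved task")
```

With these made-up inputs the model that costs more per attempt is cheaper per solved task, and that is before counting the human time spent reviewing failed attempts.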
Context and efficiency
Both models run a 1M-token context window, and both got there by throwing out standard dense attention for a sparse design — but by different routes, and the difference is real rather than cosmetic.
DeepSeek's DSA leans on heavy compression: in the 1M-token setting, V4 Pro needs only about 27% of the single-token inference compute and 10% of the KV cache of its own V3.2 predecessor. MiniMax's MSA does block-level selection on uncompressed key-values instead, which MiniMax argues avoids the precision cost that compression-based schemes pay at long range; at 1M context it cuts per-token compute to roughly 1/20 of the prior M2 model, with prefilling more than 9× faster and decoding more than 15× faster. This is one place where I'd flag the claims as vendor-framed on both sides — each lab describes its own approach as the one without the tradeoff. What you can take to the bank is that both are engineered specifically for long-context work, and both are cheap enough per token at length that a full-repository or long-document workload is practical rather than aspirational.
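For readers who want a feel for what "block-level selection on uncompressed key-values" could mean, here is a toy sketch of the general idea in plain Python. It is not MiniMax's MSA, whose internals this article doesn't describe; ranking blocks by the query's dot product with the block's mean key is an assumption chosen for simplicity, and the function name and parameters are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring blocks of keys.

    Blocks are ranked by the query's dot product with each block's mean key;
    the keys and values inside the chosen blocks are then used as-is.
    """
    blocks = [(s, min(s + block_size, len(keys))) for s in range(0, len(keys), block_size)]

    def block_score(span):
        start, end = span
        mean_key = [
            sum(k[d] for k in keys[start:end]) / (end - start)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_k]
    indices = sorted(i for start, end in chosen for i in range(start, end))

    # Standard scaled dot-product softmax, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in indices]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    output = [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(len(values[0]))
    ]
    return output, indices

# Eight positions in two blocks of four; the query matches the second block.
keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 4
values = [[float(i)] for i in range(8)]
output, indices = block_sparse_attention([1.0, 0.0], keys, values, block_size=4, top_k=1)
print(indices, output)
```

The point of the toy is the cost shape: the softmax runs over `top_k * block_size` positions instead of the whole sequence, while the selected keys and values are never compressed.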
What the community is actually scrutinizing
If you go looking for reactions to these two models — the "MiniMax M3 vs DeepSeek V4 Pro reddit" search that a lot of people run before committing — two themes come up more than any benchmark argument, and both are worth taking seriously.
The first is verification. M3's launch scores were run on MiniMax's own infrastructure with its own agent scaffolding, which is normal for a launch but is exactly the kind of thing developers discount until independent numbers land. Those numbers have started to: M3's open weights shipped on Hugging Face on June 7, and Artificial Analysis's independent index now corroborates that it's a genuinely top-tier model rather than a benchmark-day artifact. DeepSeek came in with the advantage here — its scores were re-run by independent evaluators early, and its MIT-licensed weights were available from day one for anyone to check. If independently verified performance is a hard requirement, DeepSeek still has the longer track record, even though M3 has now closed most of that gap.
The second is the point that "cheapest per token" and "cheapest to finish the job" are different numbers. A model that writes plausible code and misses a failing test isn't low-cost; it's a model that pushed its cost downstream into your review. This is why the practitioner consensus keeps landing on the same advice: pick by capability fit and reliability on your workload, and let the token price break ties rather than make the decision.
Run either one with the same key
The practical upside of calling both through GPT Proto is that switching models is a one-line change: same key, same OpenAI-compatible request shape, different model string. Here's a text chat completion against V4 Pro, followed by an image request that only M3 can serve:
    from openai import OpenAI

    client = OpenAI(
        api_key="sk-your-key-here",          # one key reaches both models
        base_url="https://gptproto.com/v1",  # OpenAI-compatible gateway
    )

    def ask(model, prompt):
        """Send a single-turn text prompt and return the reply text."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Text-only reasoning: either model handles this.
    # Swap the model string to "MiniMax-M3" to send the same prompt to M3.
    print(ask("deepseek-v4-pro", "Refactor this function for readability:\n<paste code>"))

    # Image input: only M3 can take this; DeepSeek is text-only.
    def ask_with_image(image_url, prompt):
        """Send a text prompt plus one image URL to M3 and return the reply text."""
        resp = client.chat.completions.create(
            model="MiniMax-M3",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        return resp.choices[0].message.content

    print(ask_with_image(
        "https://example.com/ui-bug-screenshot.png",
        "This screen renders wrong on mobile. What's the likely CSS cause?",
    ))
The same first call in cURL:
    curl https://gptproto.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk-your-key-here" \
      -d '{
        "model": "deepseek-v4-pro",
        "messages": [
          {"role": "user", "content": "Refactor this function for readability."}
        ]
      }'
To move a text job from one model to the other you change one string — MiniMax-M3 or deepseek-v4-pro — and the same key reaches both plus 200-odd other models on one balance. If you don't have a key yet, create one from the GPT Proto dashboard, and check the pricing page for the exact current rate on each before you run a batch.
Which should you use?
If your workload is text, code, logs, and structured output — backend agents, high-volume extraction, competitive-grade algorithmic problems — use DeepSeek V4 Pro. It has the higher verified deep-reasoning scores, the deeper independent track record, and native economics that reward high-volume text through its cache pricing.
If anything in your pipeline is an image or a video — UI debugging, design-to-code, visual QA, diagram-heavy documents — use MiniMax M3, because it's the only one of the two that can see, and on GPT Proto it's also the cheaper per-token option.
And if you're building something real, the answer is often both: route text and pure-reasoning turns to V4 Pro, hand the visual turns to M3, and run them off one key so there's no second integration to maintain. "MiniMax M3 or DeepSeek V4 Pro" is the wrong framing for most teams — they're specialists in different things, and the strongest setup uses each where it wins.
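The routing described above can be sketched as a small function that inspects an OpenAI-style message list and picks a model string. This is a minimal illustration, not a GPT Proto feature: `pick_model` and `has_visual_input` are hypothetical helpers, and the check only looks for the `image_url` content part used earlier in this article, since the request shape for video isn't shown here.

```python
TEXT_MODEL = "deepseek-v4-pro"
VISION_MODEL = "MiniMax-M3"

# Content part types treated as visual. Only "image_url" appears in this
# article; add the video part type your gateway documents, if any.
VISUAL_PART_TYPES = {"image_url"}

def has_visual_input(messages):
    """Return True if any message carries a visual content part."""
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content is text-only
        for part in content:
            if isinstance(part, dict) and part.get("type") in VISUAL_PART_TYPES:
                return True
    return False

def pick_model(messages):
    """Route visual turns to M3 and everything else to V4 Pro."""
    return VISION_MODEL if has_visual_input(messages) else TEXT_MODEL

text_turn = [{"role": "user", "content": "Summarize this stack trace."}]
image_turn = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Why is this layout broken?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/ui-bug-screenshot.png"}},
    ],
}]

print(pick_model(text_turn))
print(pick_model(image_turn))
```

The returned string can be passed straight to the `model` argument of a chat completion call, so the routing decision stays in one place.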