Michael Johnson | 2026-07-03

Introducing Gemini Omni Flash: Google's New Video Model You Edit by Talking

Gemini Omni Flash is Google's new any-input video model you edit by chatting — $0.10/sec, ~10s clips. What it does, real API code, and Omni vs Veo.

Most AI video models give you exactly one shot. You write a prompt, you get a clip, and if the back half is wrong you start over — and pay again. Gemini Omni Flash is Google's attempt to break that loop. You generate a clip, then keep reshaping it by talking to it: swap the character, relight the scene, change the camera angle, and the model builds each edit on the last one instead of regenerating from zero.

It's the first model in Google's new Gemini Omni family — which Google calls its first "any-to-any" multimodal model — and it went into public preview around July 1, 2026. What follows is a plain-English rundown of what it is, what it costs, how to actually call it, and where it still falls short.

What is Gemini Omni Flash?

In one sentence: Gemini Omni Flash takes text, images, audio, and video as input and produces high-resolution video with synchronized audio as output — and it lets you edit that video through a back-and-forth conversation.

That last clause is the entire pitch, so it's worth separating from the demo reel. A model like Veo or Sora is built to render one strong clip from one prompt. Omni Flash is built on the assumption that your first clip is a draft. Google's framing is that the model "reasons about what should happen next" rather than only painting pixels — it leans on Gemini's world knowledge and an "intuitive understanding of forces like gravity, kinetic energy and fluid dynamics" to decide how a scene should continue. Read those as Google's claims, not independent findings: the model card lists benchmarks as forthcoming, so nobody has scored the reasoning yet.

Why it exists, before how it works

The problem Omni Flash targets isn't "generate a prettier video." It's the editing tax. On a one-shot generator, every change is a fresh render — get 90% of a clip right, ask for one different gesture, and you re-roll the whole thing and hope the good 90% survives. That's slow, expensive, and non-deterministic.

Omni Flash's answer is a stateful, multi-turn loop. You generate, inspect, and then say "make the background a sunset" or "have her turn around" — and the model carries the previous scene forward instead of rebuilding it. Google exposes this to developers with a previous_interaction_id you thread through each turn, so the edits chain. Inside the Gemini app, this conversational surface is the thing Veo and Sora simply don't ship.
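The chaining pattern can be sketched in a few lines. This is a minimal illustration, not Google's client: `send` is a hypothetical callable standing in for whatever request function you use, and the `id` field on its response is an assumed name. Only `previous_interaction_id` comes from the description above.

```python
from typing import Callable, Optional


def chain_edits(send: Callable[[str, Optional[str]], dict], prompts: list[str]) -> list[str]:
    """Run prompts in order, threading each turn's id into the next turn.

    `send(prompt, previous_interaction_id)` is a hypothetical stand-in for a real
    API call; it is assumed to return a dict with an "id" key for the new turn.
    """
    previous_id: Optional[str] = None
    ids: list[str] = []
    for prompt in prompts:
        response = send(prompt, previous_id)
        previous_id = response["id"]  # assumed field name
        ids.append(previous_id)
    return ids


def fake_send(prompt: str, previous_interaction_id: Optional[str]) -> dict:
    """Offline stub so the sketch runs without network access."""
    if previous_interaction_id is None:
        turn = 1
    else:
        turn = int(previous_interaction_id.split("-")[1]) + 1
    return {"id": f"turn-{turn}", "prompt": prompt, "parent": previous_interaction_id}


turn_ids = chain_edits(fake_send, [
    "A woman walks through a market at noon.",
    "Make the background a sunset.",
    "Have her turn around.",
])
print(turn_ids)
```

The first turn passes no id, and every later turn passes the id returned by the turn before it, which is what lets the edits build on each other.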

One line to hold onto: Omni Flash is less a "better generator" and more an "editing layer" — it gets interesting once a base clip already exists.

What it can do (with the numbers that are actually public)

Here's what's confirmed from Google's model card and API docs, kept to the specifics that matter:

Inputs: text, images, audio, and video. Audio input is limited to voice references at launch; more audio types are planned.
Output: high-resolution video with native, synchronized audio generated in the same pass — not stitched on afterward.
Clip length: roughly 10 seconds at launch. Google frames the cap as a deployment choice rather than a hard model limit, and says longer durations are coming — so treat 10s as "today," not "forever."
Aspect ratios: 9:16 and 16:9 (16:9 default).
Consistency: the model is meant to hold character, object, and style identity across cuts and edits, and to sync on-screen text and graphics with motion.
Provenance: every clip carries an imperceptible SynthID watermark, with C2PA content credentials on by default. There's no opt-out.

Resolution is the one spec Google is coy about — the model card only says "high-resolution" with no pixel number, and higher-resolution support via the enterprise API is listed as "available soon." If exact resolution matters to your pipeline, that's a gap to watch, not a number to assume.
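The published limits are simple enough to check before a request goes out. The sketch below is a local pre-flight check only. The function name and the idea of passing a target duration are illustrative assumptions, since the docs quoted here do not describe a duration parameter.

```python
ALLOWED_ASPECT_RATIOS = ("16:9", "9:16")  # from the model card; 16:9 is the default
MAX_CLIP_SECONDS = 10                     # launch-time cap, expected to change


def preflight(aspect_ratio: str = "16:9", target_seconds: float = 10) -> list[str]:
    """Return a list of problems; an empty list means the plan fits launch limits."""
    problems = []
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"aspect ratio {aspect_ratio!r} is not one of {ALLOWED_ASPECT_RATIOS}")
    if target_seconds <= 0:
        problems.append("target duration must be positive")
    elif target_seconds > MAX_CLIP_SECONDS:
        problems.append(f"{target_seconds}s exceeds the ~{MAX_CLIP_SECONDS}s launch cap")
    return problems


print(preflight("9:16", 8))   # no problems
print(preflight("1:1", 15))   # two problems: ratio and duration
```

Keeping the cap in one constant makes it a one-line change when Google raises it.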

Pricing

Gemini Omni Flash is priced at $0.10 per second of video output. That's the one hard number Google published, and it's clean: a 10-second clip costs about a dollar, and you can cost out a batch job without guessing.
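The arithmetic is worth writing down once. This sketch uses the published $0.10 per second rate and Python's `Decimal` type to avoid floating-point drift in money math; the function names are illustrative.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.10")  # published output price in USD


def clip_cost(seconds: int) -> Decimal:
    """Cost of one clip of the given length."""
    return RATE_PER_SECOND * seconds


def batch_cost(clip_lengths: list[int]) -> Decimal:
    """Total cost of a batch of clips, one render each."""
    return sum((clip_cost(s) for s in clip_lengths), Decimal("0"))


print(clip_cost(10))          # one 10-second clip
print(batch_cost([10] * 50))  # fifty 10-second clips
```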

The catch developers spotted immediately is what that rate means during editing. One commenter on the model's Product Hunt page asked: if you generate a 20-second clip and then say "make the second shot slower," are you re-billed for the full clip on each turn, or only for the part that changed from the previous render? Google hasn't spelled this out publicly, and on a conversational model that's exactly where costs can quietly balloon. My read: budget as if each edit turn is a fresh render until Google documents otherwise.
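Until the billing behavior is documented, one way to plan is to compute a range. The sketch below treats "edits are free" as an optimistic floor and "every edit turn re-bills the full clip" as the conservative ceiling. Both ends are budgeting assumptions, not documented Google behavior.

```python
def edit_budget(clip_seconds: int, edit_turns: int, rate_cents_per_second: int = 10) -> tuple[int, int]:
    """Return (floor_cents, ceiling_cents) for one clip plus its edit turns."""
    if clip_seconds < 0 or edit_turns < 0:
        raise ValueError("clip_seconds and edit_turns must be non-negative")
    one_render = clip_seconds * rate_cents_per_second
    floor = one_render                       # assumption: only the first render is billed
    ceiling = one_render * (1 + edit_turns)  # assumption: each edit is a full re-render
    return floor, ceiling


low, high = edit_budget(clip_seconds=10, edit_turns=5)
print(f"${low / 100:.2f} to ${high / 100:.2f}")
```

For a 10-second clip with five edit turns the range is $1.00 to $6.00, so the undocumented behavior is the difference between a rounding error and most of the bill.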

For reference, that $0.10/sec reportedly matches Veo 3.1 Fast — a comparison Google's own team has drawn, though I'd file it as reported rather than a published side-by-side.

How to use it: a real image-to-video call via GPT Proto

We're adding Gemini Omni Flash to GPT Proto so you can call it with the same key and billing you already use for the rest of the catalog — access is rolling out now. GPT Proto follows Google's Vertex-style request format, so an image-to-video generation is a single POST, then you poll the returned operation and download the result. Here's the full Python flow (swap in your own key from the GPT Proto dashboard):

import base64
import time

import requests

# Submit and poll on the same host; confirm the exact base URL in your GPT Proto dashboard.
API_ROOT = "https://gptproto.com/v1beta"
MODEL = "gemini-omni-flash-preview"
HEADERS = {"x-goog-api-key": "GPTPROTO_API_KEY", "Content-Type": "application/json"}

# 1) Turn a still image into a moving clip
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "instances": [{
        "prompt": "The subject slowly turns to face the camera as warm sunset light fills the room.",
        "image": {"mimeType": "image/png", "bytesBase64Encoded": image_b64}
    }],
    "parameters": {"aspectRatio": "9:16"}
}
submit_url = f"{API_ROOT}/models/{MODEL}:predictLongRunning"
resp = requests.post(submit_url, headers=HEADERS, json=payload, timeout=60)
resp.raise_for_status()                # surface 401 / 403 / 429 instead of a KeyError below
operation_name = resp.json()["name"]   # e.g. models/gemini-omni-flash-preview/operations/abc123

# 2) Poll until the render is done
poll_url = f"{API_ROOT}/{operation_name}"
while True:
    poll = requests.get(poll_url, headers=HEADERS, timeout=60)
    poll.raise_for_status()
    result = poll.json()
    if result.get("done"):
        break
    time.sleep(10)

# Google-style long-running operations report failures in an "error" field;
# this assumes GPT Proto passes that field through.
if "error" in result:
    raise RuntimeError(result["error"])

# 3) Download the finished video
uri = result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
video = requests.get(uri, headers=HEADERS, timeout=300)
video.raise_for_status()
with open("output.mp4", "wb") as out:
    out.write(video.content)
print("Saved output.mp4")

Or the same first step in cURL:

curl --location 'https://gptproto.com/v1beta/models/gemini-omni-flash-preview:predictLongRunning' \
  --header 'x-goog-api-key: GPTPROTO_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "instances": [{
      "prompt": "The subject slowly turns to face the camera as warm sunset light fills the room.",
      "image": {"mimeType": "image/png", "bytesBase64Encoded": "BASE64_ENCODED_IMAGE"}
    }],
    "parameters": {"aspectRatio": "9:16"}
  }'

A few practical notes from the docs: video generation is a long-running operation, so the poll step is not optional; generated files are retained for a couple of days, so download promptly; and errors follow standard HTTP codes — 401 for a bad key, 403 for insufficient balance or permission, 429 for rate limits.
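Those notes translate into two small helpers: a poll loop with a deadline, and a lookup for the error codes listed above. This is a sketch under assumptions. `fetch` is a hypothetical callable that returns the operation as a dict, the `done` key matches the Python flow earlier, and the `error` key is an assumed field borrowed from Google-style long-running operations.

```python
import time
from typing import Callable

ERROR_HINTS = {
    401: "bad or missing API key",
    403: "insufficient balance or permission",
    429: "rate limited; slow down and retry",
}


def explain_status(status_code: int) -> str:
    """Map an HTTP status code to the hint from the docs notes."""
    return ERROR_HINTS.get(status_code, f"unexpected HTTP status {status_code}")


def poll_until_done(fetch: Callable[[], dict], interval: float = 10.0, timeout: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch` until the operation reports done, or raise after `timeout` seconds."""
    waited = 0.0
    while True:
        operation = fetch()
        if operation.get("done"):
            if "error" in operation:  # assumed field name
                raise RuntimeError(f"render failed: {operation['error']}")
            return operation
        if waited >= timeout:
            raise TimeoutError(f"operation not done after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Offline demo: a fake operation that finishes on the third poll.
states = iter([{"done": False}, {"done": False}, {"done": True, "response": {}}])
finished = poll_until_done(lambda: next(states), interval=10, sleep=lambda s: None)
print(finished["done"], explain_status(429))
```

Passing `sleep` in as a parameter keeps the loop testable without waiting, and the deadline stops a stuck operation from polling forever.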

A quick prompt guide

Omni Flash is deliberately spare on knobs — the API doesn't support negative prompts, temperature, top-p, or system instructions. That changes how you steer it:

Put the control in the prompt, not the parameters. Describe motion and physics explicitly ("marble rolls fast, continuous smooth shot") rather than leaning on settings. Don't cram every requirement into one mega-prompt — generate a solid base clip first, then make surgical changes conversationally, which is the workflow the model is actually built for. For character continuity, feed a reference image rather than describing a face in words. And keep expectations calibrated: Google's own card admits that complex motion, perfectly accurate text, and complete consistency across edits are still weak spots, so prompts that hinge on all three at once are the ones most likely to disappoint.
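One way to apply this is to assemble the prompt from explicit parts and to drop settings the API ignores. The helper names and the parameter key spellings (`negativePrompt`, `temperature`, `topP`, `systemInstruction`) are assumptions for illustration. The source only says these controls are unsupported, not how they would be named.

```python
UNSUPPORTED_KEYS = {"negativePrompt", "temperature", "topP", "systemInstruction"}  # assumed spellings


def build_prompt(subject: str, motion: str, camera: str = "", lighting: str = "") -> str:
    """Join the non-empty parts into one sentence-style prompt."""
    parts = [subject, motion, camera, lighting]
    return ". ".join(p.strip().rstrip(".") for p in parts if p.strip()) + "."


def strip_unsupported(parameters: dict) -> tuple[dict, list[str]]:
    """Split parameters into (kept, dropped_keys)."""
    kept = {k: v for k, v in parameters.items() if k not in UNSUPPORTED_KEYS}
    dropped = sorted(k for k in parameters if k in UNSUPPORTED_KEYS)
    return kept, dropped


prompt = build_prompt("A glass marble on a wooden ramp", "marble rolls fast", "continuous smooth shot")
kept, dropped = strip_unsupported({"aspectRatio": "16:9", "temperature": 0.7})
print(prompt)
print(kept, dropped)
```

Reporting the dropped keys, rather than silently discarding them, makes it obvious when a pipeline ported from another model is relying on a knob that does nothing here.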

The honest review: where it wins and where it wobbles

The genuinely new thing here is the conversational editing loop. Iterating by chatting with a clip is meaningfully faster than re-prompting, and it's a surface Veo and Sora don't offer. That's the reason to reach for it.

But early reaction has been measured, not hyped, and it's worth relaying honestly. The recurring community verdict is that Omni Flash is a strong editor and a merely acceptable pure generator — several hands-on testers said it shines when reshaping an already-good clip but is less convincing when creating the original from scratch, with physics, motion, and temporal consistency called out as the soft spots. That tracks with Google's own admissions. There are harder limits, too: in the API, video references up to three seconds are "accepted by the schema but not correctly processed," multi-video referencing isn't supported, and audio/speech editing is deliberately withheld at launch — Google cites deepfake risk. So this is a prototype-with-it release, not a wire-it-into-production-untested one.
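The launch limits on reference inputs can be caught before a request goes out. This sketch is a local lint, not part of any API. The `kind` labels and the function name are hypothetical, and the rules encode only the limits described above.

```python
def lint_references(references: list[dict], wants_audio_edit: bool = False) -> list[str]:
    """Return warnings for inputs that hit a launch limitation.

    Each reference is a dict like {"kind": "image"}; the "kind" labels are
    illustrative, not API field names.
    """
    warnings = []
    video_count = sum(1 for ref in references if ref.get("kind") == "video")
    if video_count > 1:
        warnings.append("multi-video referencing is not supported")
    if video_count >= 1:
        warnings.append("video references may be accepted but not correctly processed")
    if wants_audio_edit:
        warnings.append("audio and speech editing is withheld at launch")
    return warnings


print(lint_references([{"kind": "image"}]))
print(lint_references([{"kind": "video"}, {"kind": "video"}], wants_audio_edit=True))
```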

Reception on the business side has been warmer: the launch pointed to early production adopters (Figma and WPP among the reported names), which suggests the price-per-second math already clears the bar for real creative workflows. I'd weight that as a signal about pricing, not about output quality.

Gemini Omni Flash vs Veo 3.1: which should you use?

This is the question most searches are really asking, and the consensus, including from Google, is that the two are complementary rather than one replacing the other. Veo has only been swapped out for Omni inside the Gemini app; developers still call Veo 3.1 directly via API for high-fidelity work.

| | Gemini Omni Flash | Veo 3.1 / 3.1 Fast |
| --- | --- | --- |
| Best for | Conversational editing, multi-input remixing | High-fidelity one-shot generation |
| Inputs | Text, image, audio (voice ref), video | Text, image (ingredients), video |
| Clip length | ~10s at launch (Google calls it a deployment choice) | 4 / 6 / 8s, plus extension |
| Resolution | "High-resolution" (no public number yet) | 720p / 1080p / 4K (1080p & 4K at 8s only) |
| Native audio | Yes, every clip | Yes |
| Editing | Multi-turn conversational edits | Prompt / ingredients, extension |
| Price | $0.10 / sec output | $0.10 / sec (Fast, reported) |
| Watermark | SynthID + C2PA | SynthID |

The practical rule: choose Veo 3.1 when resolution and clean cinematic output matter most; choose Omni Flash when changing, refining, and remixing a clip through conversation matters more. In a lot of real pipelines the answer is both — generate the base with the sharper generator, then hand it to Omni Flash for the edit loop.
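The rule of thumb can be written as a small routing function. The inputs are illustrative, the returned strings are plain labels rather than API model IDs, and the logic is one reading of the comparison above, not guidance from Google.

```python
def pick_model(needs_known_resolution: bool, planned_edit_turns: int) -> str:
    """Suggest a model from the two factors the comparison turns on."""
    if planned_edit_turns <= 0:
        return "veo-3.1"  # one-shot work: the high-fidelity generator
    if needs_known_resolution:
        return "veo-3.1 base, then gemini-omni-flash edits"
    return "gemini-omni-flash"


print(pick_model(needs_known_resolution=True, planned_edit_turns=0))
print(pick_model(needs_known_resolution=True, planned_edit_turns=4))
print(pick_model(needs_known_resolution=False, planned_edit_turns=4))
```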

Who it's for — and who should wait

Reach for Omni Flash if your work is edit-heavy: VFX-style tweaks, avatar and reference-based transformations, or anything where you'll iterate on a clip several times. Wait if you need guaranteed high resolution today, single clips longer than ~10 seconds, reliable video-reference inputs, or any audio/voice editing. At launch those are, respectively, unspecified, capped, unreliable, and switched off.

Ready to build with it? Grab a key and browse the live catalog on GPT Proto — Gemini Omni Flash access is rolling out now.

FAQ

How do I get the Gemini Omni Flash API?

Sign up at GPT Proto, generate an API key in your dashboard, and call the gemini-omni-flash-preview endpoint shown above. Access is rolling out, and you'll be able to use it with the same key and billing as the rest of the catalog. Browse the full model library to see what's live now.

How much does Gemini Omni Flash cost?

$0.10 per second of video output — about $1 for a 10-second clip. Whether iterative edits re-bill the full clip each turn isn't documented yet, so budget conservatively.

How long can Gemini Omni Flash videos be?

Around 10 seconds at launch. Google describes this as a deployment choice, not a model limit, and says longer durations are coming.

Gemini Omni Flash vs Veo 3.1 — which is better?

Neither, exactly. Veo 3.1 wins on resolution and polished one-shot output; Omni Flash wins on conversational, multi-input editing. Many teams use both.

Can Gemini Omni Flash edit voices or generate uncensored audio?

No. Audio and speech editing is deliberately withheld at launch over deepfake concerns, and every clip carries an imperceptible SynthID watermark plus C2PA credentials, with no opt-out.
