What is Doubao Seedance 2.0
Released in February 2026 by ByteDance's Seed team, Doubao Seedance 2.0 (also written doubao-seedance-2.0 or seedance 2.0 doubao) is built on a unified audio-video architecture. The change that matters relative to version 1.5 Pro is not a bigger resolution number; it is that the model treats every input as a control signal. A single generation can take a text brief plus reference images, a video clip for camera motion, and an audio bed, and reason over all of them together before the first frame is produced. On this text-to-video endpoint you work from text only, but the same model ID powers the fuller reference workflow elsewhere.
| Spec | Value |
|---|---|
| Provider | ByteDance (Seed team) |
| Model string | doubao-seedance-2-0-260128 |
| This endpoint | Text-to-video |
| Input modalities (model) | Text, image, video, audio |
| Reference capacity (model) | Up to 9 images, 3 video clips, 3 audio clips per call |
| Native audio | Yes — generated jointly with video; multilingual lip-sync |
| Duration | 4–15s (model); 4–14s priced on GPTProto |
| Resolution | Up to 1080p (480p / 720p / 1080p tiers) |
| Aspect ratios | 16:9, 9:16, 4:3, 3:4, 21:9, 1:1 |
| Editing / extension | Targeted clip/character/action edits; clip extension |
| Image-to-video | Yes — separate endpoint at /image-to-video |
| Variants | Full + Fast (doubao-seedance-2-0-fast-260128 on GPTProto) |
| Open source | No — proprietary |
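The limits in the table can be checked client-side before a request is sent. The following is a minimal sketch, not an official SDK: the function and constant names are hypothetical, and it assumes the API accepts resolution as strings such as "1080p" and duration as whole seconds, which the table does not confirm.

```python
# Values copied from the spec table above. Names here are illustrative only.
ASPECT_RATIOS = {"16:9", "9:16", "4:3", "3:4", "21:9", "1:1"}
RESOLUTIONS = {"480p", "720p", "1080p"}
MIN_DURATION_S = 4
MAX_DURATION_S = 15  # model limit; the table lists 4-14s as the range priced on GPTProto


def check_request(aspect_ratio: str, resolution: str, duration_s: int) -> list[str]:
    """Return a list of problems; an empty list means the values fit the table."""
    problems = []
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect_ratio {aspect_ratio!r} is not in the spec table")
    if resolution not in RESOLUTIONS:
        problems.append(f"resolution {resolution!r} is not one of the listed tiers")
    if not MIN_DURATION_S <= duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration {duration_s}s is outside {MIN_DURATION_S}-{MAX_DURATION_S}s"
        )
    return problems


if __name__ == "__main__":
    print(check_request("16:9", "1080p", 8))
    print(check_request("2:1", "4k", 20))
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid field at once.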
How the model handles audio and motion
Two things separate Seedance 2.0 from a standard text-to-video model, and both affect how you prompt it. First, audio and video come out of one pass, rather than a silent clip with a soundtrack bolted on afterward. Because sound and image are generated together, dialogue lands on the mouth and ambient effects follow the action, so a prompt that names the sound ("crowd noise builds as the car passes") tends to produce tighter sync than one that describes visuals alone. Second, the model holds a subject's identity across movement and angle changes more reliably than earlier versions, which is what makes multi-shot and recurring-character work practical rather than hit-or-miss. If you have written off AI video because a face drifts between cuts, this is the axis that improved most.
Doubao Seedance 2.0 vs Sora 2
Both models are live on GPTProto, so this is a like-for-like choice rather than a marketing claim.
| | Doubao Seedance 2.0 | Sora 2 |
|---|---|---|
| Inputs | Text, image, video, audio | Text, image |
| Native audio | Yes, joint generation | Yes, synced |
| Max duration | 15s | 12s (standard) / 25s (Pro) |
| Max resolution | Up to 1080p | 720p (standard) / 1080p (Pro) |
| API roadmap | Active | OpenAI is sunsetting the Sora 2 API on 2026-09-24 |
| GPTProto price | from $0.2957/run (scales with res × duration) | $0.40/run (flat) |
Doubao Seedance 2.0 vs Seedance 1.5 Pro
If you already run Seedance 1.5 Pro, the upgrade is about input control, not a resolution bump. Version 1.5 Pro already generated audio and video together and followed multi-shot instructions. Seedance 2.0 adds the reference stack on top — up to 9 images, 3 video clips, and 3 audio clips in a single call — plus targeted editing and clip extension. In third-party motion and physics testing it scores higher than 1.5 Pro, with the largest gain in physical accuracy.
The cost gap is large: 1.5 Pro starts at $0.0408/run versus $0.2957 for 2.0. Practical rule: stay on 1.5 Pro for plain, high-volume text-to-video where budget dominates; move to 2.0 when you need reference-driven consistency, editing, or audio-led scenes. Both run from the same balance, so you can compare them on one prompt.
| | Seedance 2.0 | Seedance 1.5 Pro |
|---|---|---|
| Reference inputs | Text, image, video, audio (9 img / 3 clip / 3 audio) | Text, image |
| Native audio | Yes | Yes |
| Editing / extension | Yes | No |
| GPTProto price | from $0.2957/run | from $0.0408/run |
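To put the cost gap in budget terms, the sketch below multiplies the starting prices from the table by a run count. Both prices are "from" figures, and the 2.0 price scales with resolution and duration, so the results are lower bounds rather than quotes. The dictionary keys and function names are made up for this illustration.

```python
# Starting prices per run, copied from the table above (USD).
PRICE_FROM_USD = {
    "seedance-2.0": 0.2957,
    "seedance-1.5-pro": 0.0408,
}


def min_batch_cost(model: str, runs: int) -> float:
    """Lower-bound cost of a batch at the model's starting price."""
    return round(PRICE_FROM_USD[model] * runs, 2)


def starting_price_ratio() -> float:
    """How many times more a 2.0 run costs than a 1.5 Pro run, at starting prices."""
    return PRICE_FROM_USD["seedance-2.0"] / PRICE_FROM_USD["seedance-1.5-pro"]


if __name__ == "__main__":
    for model in PRICE_FROM_USD:
        print(model, min_batch_cost(model, 1000))
    print(round(starting_price_ratio(), 2))
```

At starting prices a 2.0 run costs a little over seven times a 1.5 Pro run, which is why the practical rule reserves 2.0 for work that needs its reference, editing, or audio features.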
Switching from Doubao, Jimeng, or Volcano Ark
If you already prototype Seedance 2.0 in ByteDance's own surfaces, GPTProto gives you the same model over one REST API: POST your prompt and parameters to the text-to-video endpoint, then poll /api/v3/predictions/{result_id}/result for the output. The parameter names (prompt, aspect_ratio, duration, resolution, generate_audio, camera_fixed, seed) map directly to what you already use, so porting a prompt is mostly copy-paste. The payoff beyond access is that the same key reaches the Fast variant, Seedance 1.5 Pro, and Sora 2, so you can A/B them on identical prompts from one balance instead of opening an account per provider.
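The submit-then-poll flow described above can be sketched as follows. The parameter names and the result path come from the text; everything else is an assumption: the default values, the value formats, and the shape of the response (which is why the completion check is passed in as a caller-supplied function). The HTTP call itself is left to a `fetch` callable so the sketch stays independent of any particular client or auth scheme.

```python
import time
from typing import Any, Callable, Optional

# Result path quoted in the text above; the host is supplied by the caller.
RESULT_PATH = "/api/v3/predictions/{result_id}/result"


def build_payload(prompt: str, *, aspect_ratio: str = "16:9", duration: int = 8,
                  resolution: str = "720p", generate_audio: bool = True,
                  camera_fixed: bool = False,
                  seed: Optional[int] = None) -> dict[str, Any]:
    """Assemble a request body from the documented parameter names.

    Defaults and value formats are assumptions made for this sketch.
    """
    payload: dict[str, Any] = {
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "resolution": resolution,
        "generate_audio": generate_audio,
        "camera_fixed": camera_fixed,
    }
    if seed is not None:
        payload["seed"] = seed  # only sent when the caller wants a fixed seed
    return payload


def result_url(base_url: str, result_id: str) -> str:
    """Build the polling URL for a submitted job."""
    return base_url.rstrip("/") + RESULT_PATH.format(result_id=result_id)


def poll(fetch: Callable[[str], dict], url: str, is_done: Callable[[dict], bool],
         interval_s: float = 5.0, max_attempts: int = 60) -> dict:
    """Call fetch(url) until is_done(response) is true or attempts run out."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if is_done(response):
            return response
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"no finished result after {max_attempts} attempts")
```

In use, `fetch` would wrap an authenticated GET with whatever HTTP client you prefer, and `is_done` would inspect whichever status field the actual response carries.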
What developers build with it
The features point at a few jobs it does well: short-form social and ad clips where native audio saves a separate dubbing step; character-consistent series and brand mascots that have to stay recognizable across shots; dialogue scenes that need lip-sync; and music- or beat-driven edits where sound and motion have to line up. For high-volume ideation, draft on the Fast variant and re-render keepers on the full model.
Doubao Seedance 2.0 prompt recipes
Seedance 2.0 rewards prompts that name sound and camera, not just visuals — because audio is generated jointly and camera language is a control signal. These are paste-and-edit starting points; tune duration and ratio to your shot.
1. Audio-led scene (use the joint audio pass)
A rainy Tokyo alley at night, neon reflections on wet asphalt. A figure in a dark coat walks toward camera. Diegetic sound: steady rain, distant traffic, footsteps in puddles. Slow dolly-in. 16:9.
Naming the sound lets the model sync ambience to the motion in one pass instead of you scoring it afterward. Keep generate_audio on.
2. Locked-off product shot (camera_fixed)
A ceramic coffee cup on a marble counter, morning light from the left, steam rising. Static camera, no movement. Shallow depth of field. 1:1.
Pair this with camera_fixed: true when you want the subject to move but the frame to hold — useful for e-commerce loops and packshots.
3. Character-consistent action
A young woman in a red jacket runs across a rooftop, jumps a gap, lands and keeps running. Keep her face, hair, and jacket identical across the whole clip. Handheld camera tracking from the side. Crowd and wind audio. 16:9, 8s.
Stating the consistency requirement in words leans on the model's strongest upgrade — holding a subject through movement.
4. Multi-shot in one generation
Shot 1: wide of a chef plating a dish. Shot 2: close-up of the garnish being placed. Shot 3: the chef looks up and smiles. Warm kitchen ambience, light sizzling. Smooth cuts. 16:9, 10s.
Numbering shots gives the model an explicit structure to follow, which holds better than one run-on description.
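If you generate multi-shot prompts programmatically, a small helper keeps the numbering consistent. This is only a string-formatting sketch with a hypothetical function name; it reproduces the layout of recipe 4 and makes no claim about how the model parses the text.

```python
def multi_shot_prompt(shots: list[str], audio: str, *, transition: str = "Smooth cuts",
                      aspect_ratio: str = "16:9", duration_s: int = 10) -> str:
    """Join shot descriptions into a numbered 'Shot N: ...' prompt."""
    if not shots:
        raise ValueError("at least one shot description is required")
    # Number shots from 1 and make sure each ends with exactly one period.
    numbered = " ".join(
        f"Shot {i}: {shot.strip().rstrip('.')}." for i, shot in enumerate(shots, start=1)
    )
    return (
        f"{numbered} {audio.strip().rstrip('.')}. "
        f"{transition.strip().rstrip('.')}. {aspect_ratio}, {duration_s}s."
    )


if __name__ == "__main__":
    print(multi_shot_prompt(
        ["wide of a chef plating a dish",
         "close-up of the garnish being placed",
         "the chef looks up and smiles"],
        "Warm kitchen ambience, light sizzling",
    ))
```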
Prompting tips
- Name diegetic sound explicitly when it matters; the audio is generated with the video.
- Describe the camera move (dolly, pan, handheld, static) — it is treated as direction.
- For recurring subjects, state the consistency requirement in words.
- Draft on the Fast variant, then re-render the keeper on the full model.
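The draft-then-re-render tip can be expressed as a pair of request bodies that share a prompt and seed. The two model strings come from the spec table. Treat the rest as assumptions: that the model is selected with a `model` field in the body, that lower resolution suits drafts, and that reusing a seed across variants yields a similar composition (the source does not say it does).

```python
# Model strings as listed in the spec table.
FULL_MODEL = "doubao-seedance-2-0-260128"
FAST_MODEL = "doubao-seedance-2-0-fast-260128"


def draft_and_final(prompt: str, seed: int, *, draft_resolution: str = "480p",
                    final_resolution: str = "1080p") -> tuple[dict, dict]:
    """Return (draft, final) request bodies that differ only in model and resolution."""
    shared = {"prompt": prompt, "seed": seed}
    # "model" as a body field is an assumption of this sketch.
    draft = {**shared, "model": FAST_MODEL, "resolution": draft_resolution}
    final = {**shared, "model": FULL_MODEL, "resolution": final_resolution}
    return draft, final


if __name__ == "__main__":
    draft, final = draft_and_final("A ceramic coffee cup on a marble counter", seed=42)
    print(draft["model"], draft["resolution"])
    print(final["model"], final["resolution"])
```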







