speech-2.5-turbo-preview

The Speech 2.5 API by MiniMax provides a high-fidelity, low-latency audio-native experience. It supports native speech-to-speech processing and 3-second zero-shot voice cloning, making it ideal for responsive, emotionally intelligent AI agents.

text: $ 0

audio: $ 36 / $ 60

Input

Text*

Voice_id*

Speed

Volume

Pitch

Emotion

English_normalization

This parameter supports English text normalization, which improves performance in number-reading scenarios.

Sample_rate

Bitrate

Channel

Format

Language_boost

Enable_sync_mode

If set to true, the function will wait for the result to be generated and uploaded before returning the response. It allows you to get the result directly in the response. This property is only available through the API.
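As a minimal sketch of how the inputs listed above might be assembled into a request body: the lowercase field names are an assumption based on the labels on this page, and `build_tts_payload` is a hypothetical helper, not part of any official SDK. Value ranges and defaults are not stated here, so the sketch does not validate them.

```python
import json

# Optional field names mirror the Input list above, lowercased (an assumption).
OPTIONAL_FIELDS = (
    "speed", "volume", "pitch", "emotion", "english_normalization",
    "sample_rate", "bitrate", "channel", "format", "language_boost",
    "enable_sync_mode",
)


def build_tts_payload(text, voice_id, **options):
    """Return a JSON-serializable dict with the two required fields
    (text, voice_id) plus any recognized optional fields that were set."""
    if not text or not voice_id:
        raise ValueError("text and voice_id are required")
    unknown = set(options) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unrecognized parameters: {sorted(unknown)}")
    payload = {"text": text, "voice_id": voice_id}
    # Omit unset options so the service can apply its own defaults.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


if __name__ == "__main__":
    example = build_tts_payload(
        "Order 1024 ships on 3/14.",
        "example-voice-id",  # placeholder, not a real voice ID
        english_normalization=True,
        enable_sync_mode=True,
    )
    print(json.dumps(example, indent=2))
```

Rejecting unknown keys locally catches typos such as `sampel_rate` before a request is sent.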

Related Models

speech-2.5-turbo-preview-voice-clone

speech-2.5-hd-preview-voice-clone

$ 0.5003

$ 0.8338

MiniMax

speech-2.5-hd-preview

$ 60

$ 100

Key Speech 2.5 API Features

Explore the technical capabilities that make the Speech 2.5 API the industry leader for emotional, real-time audio.

Zero-Shot Voice Cloning

Replicate any voice with high fidelity using only a 3-second sample for instant personalization.

Dynamic Emotional Prosody

Generate laughter, sighs, and breathing sounds to create an incredibly realistic human presence.

48kHz HD Audio Output

Supports high-definition, studio-quality audio delivery suitable for professional production.

Native Speech-to-Speech

Direct audio-to-audio processing keeps latency under 300ms in optimal conditions, supporting natural, fluid conversations.

How to Get a speech-2.5-turbo-preview API Key

Getting a speech-2.5-turbo-preview API key takes four steps and a few minutes: create a free GPT Proto account, add credits, generate your key, and make your first call. At $0 / $36, access through GPT Proto is priced below going direct, and one key works across every model on the platform. Full speech-2.5-turbo-preview documentation is available in the docs.

Create an account

Create your free GPT Proto account to begin. You can set up an organization for your team at any time.

Top up

Your balance can be used across all models on the platform, including speech-2.5-turbo-preview, giving you the flexibility to experiment and scale as needed.

Generate your API key

In your dashboard, create an API key. You'll need it to authenticate requests to speech-2.5-turbo-preview.

Make your first API call

Use your API key with our sample code to send a request to speech-2.5-turbo-preview via GPT Proto and see instant AI-powered results.
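The final step can be sketched as follows, using only the Python standard library. The endpoint URL is a placeholder and the Bearer-token header is an assumed authentication scheme; take the real values from the GPT Proto docs. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL given in the GPT Proto docs.
API_URL = "https://api.example.com/v1/speech"


def make_request(api_key, payload, url=API_URL):
    """Build (but do not send) a POST request carrying a JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = make_request(
        "YOUR_API_KEY",
        {"text": "Hello", "voice_id": "example-voice-id"},  # placeholder values
    )
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes the headers and body easy to inspect before any credits are spent.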

Speech 2.5 API Frequently Asked Questions

How does the Speech 2.5 API handle latency?

The Speech 2.5 API uses a native speech-to-speech (S2S) architecture that removes the intermediate text-transcription layer. This reduces the Time-To-First-Audio to approximately 280ms in optimal conditions. For the lowest possible latency in interactive applications, we recommend using the WebSocket streaming interface rather than standard chunked HTTP.
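To check the latency figure in your own environment, a small timing helper such as the one below can wrap whatever streaming client you use. It is a sketch, not part of the API: `time_to_first_audio` and `fake_stream` are hypothetical names, and the stream is simulated so the example runs without network access.

```python
import time


def time_to_first_audio(chunks, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk).

    `chunks` is any iterable of bytes objects. Returns (None, None) if
    the stream ends without producing audio.
    """
    start = clock()
    for chunk in chunks:
        if chunk:
            return clock() - start, chunk
    return None, None


def fake_stream(delay_s=0.05):
    """Stand-in for a real audio stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    ttfa, first = time_to_first_audio(fake_stream())
    print(f"TTFA: {ttfa * 1000:.0f} ms, first chunk {len(first)} bytes")
```

Start the clock before the request is issued so that connection setup is included in the measurement.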

Does this 2.5 model support voice cloning?

Yes, the 2.5 turbo model features zero-shot voice cloning. You can replicate a speaker's voice using just a 3-second audio sample. This process requires no secondary training or fine-tuning, allowing you to generate speech with a specific timbre or accent instantly via the /v1/audio/voices endpoint.
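Before uploading a reference clip, it can help to confirm its length locally. The sketch below treats the 3-second figure as a minimum, which is an assumption, and handles only uncompressed WAV input. The helper names are hypothetical, and the request format of the /v1/audio/voices endpoint is not shown because it is not documented on this page.

```python
import io
import wave

# Taken from the 3-second figure above; treating it as a minimum is an assumption.
MIN_SAMPLE_SECONDS = 3.0


def wav_duration_seconds(wav_bytes):
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_bytes, minimum=MIN_SAMPLE_SECONDS):
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(wav_bytes) >= minimum


def make_silent_wav(seconds, sample_rate=16000):
    """Generate mono 16-bit silence, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(3.5)
    print(f"{wav_duration_seconds(clip):.1f} s, usable: {is_long_enough(clip)}")
```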

What are the Speech 2.5 API pricing rates?

Pricing is based on token count for text input and on audio duration for audio input and output. Input text is priced at $2.00 per 1M tokens, while input audio costs $0.015 per minute. Generated output audio is billed at $0.030 per minute. GPTProto.com users benefit from unified billing and 50% discounts on cached audio prompts for repeat sessions.
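As a worked example of the arithmetic, the sketch below applies the three rates quoted in this answer. `estimate_cost_usd` is a hypothetical helper, and it ignores the cached-prompt discount because the answer does not say how that discount is applied.

```python
# Rates as quoted in the answer above.
TEXT_INPUT_USD_PER_MILLION_TOKENS = 2.00
AUDIO_INPUT_USD_PER_MINUTE = 0.015
AUDIO_OUTPUT_USD_PER_MINUTE = 0.030


def estimate_cost_usd(text_tokens=0, input_audio_minutes=0.0,
                      output_audio_minutes=0.0):
    """Estimate the cost of one workload in US dollars, before discounts."""
    text_cost = text_tokens / 1_000_000 * TEXT_INPUT_USD_PER_MILLION_TOKENS
    audio_in_cost = input_audio_minutes * AUDIO_INPUT_USD_PER_MINUTE
    audio_out_cost = output_audio_minutes * AUDIO_OUTPUT_USD_PER_MINUTE
    return text_cost + audio_in_cost + audio_out_cost


if __name__ == "__main__":
    # 50,000 text tokens, 10 minutes of input audio, 20 minutes of output audio
    total = estimate_cost_usd(50_000, 10, 20)
    print(f"Estimated cost: ${total:.2f}")
```

For that workload the terms are $0.10 for text, $0.15 for input audio, and $0.60 for output audio, about $0.85 in total.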

Is my speech data used for model training?

No. Under our Enterprise Privacy Policy, no audio or text data processed through the Speech 2.5 API is used for training by us or by MiniMax. Your intellectual property and voice clones remain private and secure within your project environment.

How does it compare to GPT-4o audio mode?

While GPT-4o has a slightly faster TTFA (~240ms), the Speech 2.5 API offers a higher sampling rate of 48kHz compared to GPT-4o's 24kHz. Additionally, Speech 2.5 provides superior non-verbal emotional cues, such as audible sighs and laughter, which are often limited in other low-latency models.

Can I use the API for bilingual speech?

Absolutely. The model is optimized for bilingual code-switching across more than 25 languages. It can transition between languages like English and Chinese within a single sentence while keeping the speaker's identity and emotional delivery perfectly consistent.

Further Reading

Master GPT-4o Transcribe: Speech to Text

Minimax Speech 02: Realism & API Latency

GPT-4o Mini TTS: OpenAI's Text-to-Speech Technology

Kling 2.6: Breaking New Ground with Synchronized Audio-Visual AI Generation