speech-02-hd

Text Speech 02 is MiniMax's flagship HD audio model. It delivers ultra-high-fidelity 48kHz output with natural emotional cues like breaths and laughter. Ideal for real-time conversational AI, it bridges the gap between text and human-like speech.

$ 0.0082

$ 0.0137

text

audio

Input

Text*

Voice_id*

Speed

Volume

Pitch

Emotion

English_normalization

This parameter supports English text normalization, which improves performance in number-reading scenarios.

Sample_rate

Bitrate

Channel

Format

Language_boost

Enable_sync_mode

If set to true, the function will wait for the result to be generated and uploaded before returning the response. It allows you to get the result directly in the response. This property is only available through the API.
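The parameters above can be assembled into a request body along these lines. This is a minimal sketch, not the documented schema: the key names are simply the labels above in lowercase, the required fields are the two marked with an asterisk, and `build_payload` is a hypothetical helper.

```python
import json

# Key names mirror the parameter labels listed above, lowercased.
# Whether the API expects exactly these keys is an assumption.
OPTIONAL_FIELDS = (
    "speed", "volume", "pitch", "emotion", "english_normalization",
    "sample_rate", "bitrate", "channel", "format", "language_boost",
    "enable_sync_mode",
)


def build_payload(text, voice_id, **options):
    """Return a JSON-serializable dict for a speech-02-hd request."""
    if not text:
        raise ValueError("text is required")
    if not voice_id:
        raise ValueError("voice_id is required")
    unknown = set(options) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"text": text, "voice_id": voice_id}
    # Leave out options set to None so the service applies its own defaults.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


example = build_payload(
    "Your order total is 42 dollars.",
    voice_id="example-voice",  # placeholder, not a real voice ID
    english_normalization=True,
    enable_sync_mode=True,
)
print(json.dumps(example, indent=2))
```

Rejecting unknown keys early catches typos such as `sample_rat` before a request is ever sent.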

Text Speech 02 Key Features

Advanced technical capabilities that make speech-02-hd a leader in the audio generation market.

Real-time 210ms Latency

Achieve near-instant response times. The speech-02-hd model is optimized for low-latency text processing, ensuring your conversational AI feels responsive and human.

Native Emotional Prosody

The 02 model adds realistic breaths, laughter, and sighs. It transforms flat text into emotionally nuanced speech, achieving an industry-leading MOS of 4.62.

3-Second Voice Cloning

Replicate any voice with just a 3-second sample. Text Speech 02 maintains the original speaker's timbre and rhythm across all supported languages and text inputs.
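Before uploading a reference clip, it can help to confirm it meets the 3-second figure quoted above. The sketch below checks a WAV file's duration with Python's standard `wave` module. The helper names are hypothetical, and it assumes the sample is an uncompressed WAV, since the page does not say which formats are accepted.

```python
import io
import wave

MIN_SAMPLE_SECONDS = 3.0  # the 3-second figure quoted above


def wav_duration_seconds(file_obj):
    """Duration of a WAV file: frame count divided by frame rate."""
    with wave.open(file_obj, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(file_obj, minimum=MIN_SAMPLE_SECONDS):
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(file_obj) >= minimum


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


print(is_long_enough(make_silent_wav(3.5)))
print(is_long_enough(make_silent_wav(1.0)))
```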

48kHz HD Audio Output

Experience superior clarity with 48kHz sampling. A 48kHz sample rate represents frequencies up to 24kHz, which covers the full range of human hearing and makes this high-definition text-to-speech output well suited to professional dubbing and media.
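The arithmetic behind the 48kHz figure can be checked directly. A sample rate can represent frequencies up to half its value (the Nyquist limit), and uncompressed size grows linearly with sample rate. The sketch below assumes 16-bit mono PCM for the size estimate, which may differ from what the API returns for a given `Format` and `Bitrate`.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def pcm_bytes(sample_rate_hz, seconds, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio, excluding any container header."""
    return int(sample_rate_hz * seconds * (bits_per_sample // 8) * channels)


print(nyquist_hz(48_000))      # 24000.0
print(pcm_bytes(48_000, 1))    # 96000 bytes per second of mono 16-bit audio
print(pcm_bytes(48_000, 600))  # 57600000 bytes for ten minutes
```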

How to Get a speech-02-hd API Key

Getting a speech-02-hd API key takes four steps and a few minutes: create a free GPT Proto account, add credits, generate your key, and make your first call. At $0.0082, access through GPT Proto is priced below going direct, and one key works across every model on the platform. Full speech-02-hd documentation is available in the docs.

Create an account

Create your free GPT Proto account to begin. You can set up an organization for your team at any time.

Top up

Your balance can be used across all models on the platform, including speech-02-hd, giving you the flexibility to experiment and scale as needed.

Generate your API key

In your dashboard, create an API key. You'll need it to authenticate when making requests to speech-02-hd.

Make your first API call

Use your API key with our sample code to send a request to speech-02-hd via GPT Proto and see instant AI-powered results.
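A first call might look roughly like the sketch below, which uses only Python's standard library. The endpoint URL, the bearer-token header, the `GPTPROTO_API_KEY` environment variable, and the JSON field names are all assumptions made for illustration, so take the real values from the GPT Proto docs. The example builds the request and prints it without sending anything.

```python
import json
import os
import urllib.request

# Hypothetical endpoint: replace with the URL given in the GPT Proto docs.
API_URL = "https://api.example.com/v1/speech-02-hd"


def make_request(api_key, text, voice_id):
    """Build (but do not send) an authenticated POST request."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=30):
    """Send the request and return the raw response bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Read the key from the environment rather than hard-coding it.
api_key = os.environ.get("GPTPROTO_API_KEY", "sk-placeholder")
request = make_request(api_key, "Hello there.", "example-voice")
print(request.get_method(), request.full_url)
```

Calling `send(request)` performs the actual network call once the URL and key are real.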

Text Speech 02 Frequently Asked Questions

How does Text Speech 02 achieve such high fidelity?

The model uses a native 48kHz sampling rate, which is double the standard of most competitors. By processing text and audio natively within its weights, it avoids the digital artifacts common in traditional pipelines. This ensures that every speech segment generated is studio-quality and ready for professional use cases like podcasting or high-end virtual assistants that require the highest possible sonic clarity.

Can the 02 model handle real-time interruptions?

Yes. With a latency of approximately 210ms, it is significantly faster than many alternatives. This ultra-low latency allows the speech system to respond almost instantly to text inputs, making it perfect for conversational AI where human-like timing and the ability to handle interruptions are critical for a natural user experience. It ensures your text-to-voice transition feels like a real conversation.
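The 210ms figure can be compared against your own setup by timing how long the first audio chunk takes to arrive. The sketch below measures that for any iterable of byte chunks. `fake_stream` is a stand-in for a real response stream, and what you measure will also include network and client overhead.

```python
import time


def time_to_first_chunk(chunks, clock=time.perf_counter):
    """Return (seconds until the first chunk, first chunk) for an iterable of bytes."""
    start = clock()
    for chunk in chunks:
        return clock() - start, chunk
    raise RuntimeError("stream ended before any audio arrived")


def fake_stream(delay_s=0.05):
    """Stand-in for a real audio stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield b"\x00" * 960
    yield b"\x00" * 960


elapsed, first = time_to_first_chunk(fake_stream())
print(f"first chunk of {len(first)} bytes after {elapsed * 1000:.0f} ms")
```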

Is voice cloning possible with this speech engine?

Absolutely. Text Speech 02 can clone a specific voice using a mere 3-second audio sample. Unlike other models that require minutes of data, this 02-hd variant captures emotional range and timbre quickly, allowing for highly personalized text-to-audio transitions that maintain the original speaker's unique vocal characteristics across different scripts and text inputs without losing the emotional core of the performance.

Which languages does the 02 model support?

It supports over 30 languages, including English, Chinese, Japanese, and several European languages. The 02 architecture allows for seamless bilingual switching within a single sentence. This ensures that the text being read maintains consistent voice quality and accent even when navigating between different linguistic structures, making it an ideal choice for global text localization and accessibility projects.

Does the API support streaming for long text?

Yes, it supports native streaming via WebSocket and chunked HTTP. This is ideal for synthesizing long-form text or documents. While individual requests can handle up to 10 minutes of audio, the streaming capability allows for virtually unlimited output duration, so the speech remains stable and rhythmic during extended playback sessions without degradation in audio quality.
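On the client side, a streamed response can be consumed chunk by chunk instead of being held in memory all at once. The sketch below writes chunks to any writable binary sink as they arrive. It assumes the transport (WebSocket or chunked HTTP) has already been wrapped in an iterable of `bytes`, and `save_stream` is a hypothetical helper.

```python
import io


def save_stream(chunks, sink):
    """Write each audio chunk to sink as it arrives; return total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:  # skip empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


# Demonstration with in-memory data in place of a network stream.
buffer = io.BytesIO()
written = save_stream([b"abc", b"", b"de"], buffer)
print(written)  # 5
```

In practice the sink would be a file opened with `open(path, "wb")`, so long outputs go straight to disk.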

How is my data handled by the speech-02-hd model?

Privacy is a priority. All audio data processed through the Text Speech 02 model on our platform is ephemeral. We do not use your inputs or generated outputs for training purposes. This makes it a trustworthy choice for enterprise-level applications where sensitive text must be converted into high-quality speech without risk of data leakage or unauthorized access to your proprietary text-based content.
