Explore the Best AI Models Online

Browse a curated directory of cutting-edge AI models for text, images, and more. Compare capabilities, features, and pricing to find the right model for your projects.

OpenAI
Kling
Google
Grok
MiniMax
Qwen
Bytedance
NovelAI
Claude
Tripo3d
Gptproto
DeepSeek
Higgsfield
Flux
Ideogram
Midjourney
Models
Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

The gpt-latest model represents the newest generation of text-to-text intelligence from OpenAI, available through the robust GPT Proto infrastructure. This version introduces significant architectural improvements over previous iterations such as GPT-4o, offering deeper reasoning and greater logical consistency in long-form outputs. Designed for developers who require high precision, gpt-latest excels at complex instruction following and structured JSON generation. By using the Responses API on GPT Proto, users can leverage the full intelligence of this model with lower latency and higher reliability. Whether you are building automated research tools or sophisticated coding agents, gpt-latest provides the speed and depth required by modern enterprise applications.
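As an illustration of the workflow described above, the sketch below assembles a Responses-API-style request body for structured JSON output and estimates per-call cost from the listed rates. The payload field names and the `json_object` format flag are assumptions in the style of OpenAI-compatible APIs, not confirmed GPT Proto specifics.

```python
# Hypothetical sketch: build a Responses-API-style request body for gpt-latest
# and estimate call cost from the listed per-1M-token rates. Field names are
# assumptions, not documented GPT Proto specifics.

INPUT_RATE = 1.05   # USD per 1M input tokens (listed price)
OUTPUT_RATE = 8.4   # USD per 1M output tokens (listed price)

def build_request(prompt: str) -> dict:
    """Assemble a minimal structured-output request payload."""
    return {
        "model": "gpt-latest",
        "input": prompt,
        "text": {"format": {"type": "json_object"}},  # ask for JSON output
    }

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one call at the listed rates."""
    return input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE

req = build_request("Summarize the attached research notes as JSON.")
cost = estimate_cost(2_000, 500)  # 2k prompt tokens, 500 completion tokens
```

At the listed rates, a 2,000-token prompt with a 500-token completion costs well under a cent, which is why per-1M-token pricing is the standard unit of comparison.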

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-latest/image-to-text is a state-of-the-art multimodal vision model designed to bridge the gap between visual perception and textual understanding. As the latest iteration in the flagship series, it excels at analyzing complex scenes, extracting high-density text through advanced OCR, and reasoning about spatial relationships within images. It is significantly faster and more accurate than previous generations, giving developers a robust tool for automation and accessibility. Deployed on GPT Proto, it provides a stable environment for building applications that require real-time visual data processing and reliable multimodal outputs at enterprise scale.
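A vision request like the one described above typically pairs an image reference with a text instruction in a single user turn. The sketch below builds such a message in the common OpenAI-compatible chat format; the field names and the example URL are illustrative assumptions, not confirmed GPT Proto specifics.

```python
# Hypothetical sketch: a chat-style multimodal message combining an image and
# a text instruction for gpt-latest/image-to-text. Field names follow the
# common OpenAI-compatible convention; they are assumptions, not confirmed.

def build_vision_message(image_url: str, question: str) -> list:
    """One user turn carrying both an image reference and a text prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }]

messages = build_vision_message(
    "https://example.com/invoice.png",  # placeholder image URL
    "Extract all line items and totals as plain text.",
)
```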

$0 per generation

kling-image-o1/text-to-image is a sophisticated generative AI model designed for professional-grade visual synthesis. Developed as part of the Kling AI ecosystem, this model specializes in transforming complex text descriptions into high-fidelity, photorealistic images with remarkable detail. It excels in diverse creative scenarios, from cinematic concept art to commercial photography and intricate digital illustrations. Compared to standard generative models, kling-image-o1/text-to-image offers superior understanding of spatial relationships and lighting, ensuring consistent and aesthetic results. Its architecture is optimized for speed and quality, making it a premier choice for developers and creators seeking reliable API-driven image generation.

$0 per generation

The kling-image-o1/image-to-image model represents the pinnacle of Kling AI's generative capabilities, specifically engineered for high-fidelity image transformations. As a state-of-the-art AIGC tool, it excels in maintaining structural integrity while applying radical style changes or subtle enhancements. Unlike base models, the o1 version offers superior reasoning for complex textures and lighting, ensuring that modified outputs feel natural and professional. It is ideal for fashion design, architectural visualization, and digital art, providing a faster and more intuitive workflow for creators who need to evolve existing visual concepts into masterpiece-quality assets with unmatched consistency and detail on GPT Proto.

$0.448 · $0.56 per generation

kling-video-o1-pro/text-to-video represents the pinnacle of Kling AI's generative video technology, specifically engineered for professional-grade output. As an evolution within the Kling family, this model introduces enhanced reasoning capabilities to interpret complex prompts with high temporal consistency and realistic physical interactions. It excels in generating high-definition 1080p content with cinematic aesthetics and fluid motion. Compared to standard generative video models, kling-video-o1-pro offers superior detail preservation over longer sequences. It is the ideal choice for marketing agencies, game developers, and film professionals requiring precise control over AI-generated visual narratives through a stable API integration.
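Text-to-video APIs like this are usually asynchronous: a job is submitted, then polled until the clip is ready. The sketch below mocks that submit-then-poll pattern offline; the job fields and status names are illustrative assumptions, not documented kling-video-o1-pro behavior.

```python
# Hypothetical sketch of the submit-then-poll pattern typical of async
# text-to-video APIs. Job fields and status strings are illustrative
# assumptions, not documented kling-video-o1-pro behavior.

def submit_job(prompt: str, resolution: str = "1080p") -> dict:
    """Mock submission: returns a job record as a real API might."""
    return {"id": "job-001", "status": "queued",
            "prompt": prompt, "resolution": resolution}

def poll(job: dict, states=("queued", "running", "succeeded")) -> str:
    """Walk the job through illustrative states until it finishes."""
    for state in states:
        job["status"] = state  # a real client would sleep between requests
    return job["status"]

job = submit_job("A slow dolly shot across a rain-soaked neon street")
final = poll(job)
```

In production the polling loop would hit a status endpoint with backoff rather than iterating a fixed tuple, but the control flow is the same.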

$0.448 · $0.56 per generation

kling-video-o1-pro is a premium generative AI model specialized in image-to-video synthesis. As a professional tier within the Kling AI ecosystem, this model is engineered to provide superior motion consistency and temporal stability compared to standard versions. It excels at interpreting static visual cues to generate fluid, realistic movement that adheres to the laws of physics. Whether for cinematic storyboarding, dynamic marketing content, or digital art, kling-video-o1-pro delivers high-definition outputs with intricate detail. By leveraging advanced diffusion transformers, it ensures that every frame maintains character and background integrity, making it a top choice for developers seeking enterprise-grade video generation capabilities.

$0.2688 · $0.336 per generation

kling-video-o1-pro/reference-to-video is a flagship video generation model designed for high-end cinematic production and creative storytelling. As part of the prestigious o1-pro series, this specific mode focuses on Reference-to-Video (image-to-video) capabilities, ensuring that static reference images are brought to life with incredible temporal consistency and physical accuracy. Compared to standard Kling models, the o1-pro variant offers superior resolution, longer duration possibilities, and a deeper understanding of complex motion prompts. It is optimized for professionals in advertising, filmmaking, and game development who require strict adherence to visual references while demanding fluid, realistic movement. By leveraging advanced diffusion transformers, it delivers industry-leading video quality that bridges the gap between AI generation and professional cinematography.

$0.2688 · $0.336 per generation

kling-video-o1-pro/video-to-video is a high-performance AI model specifically engineered for professional-grade video transformation and style transfer. As the pro tier of the Kling video family, it offers significantly enhanced motion stability and visual fidelity compared to the standard versions. The model excels at taking source footage and reimagining it through text prompts while maintaining the original temporal structure. It is ideal for filmmakers, marketing agencies, and developers who require consistent, high-resolution video outputs for commercial use. By leveraging advanced diffusion techniques, it keeps characters and backgrounds stable across frames, providing a seamless bridge between raw footage and creative vision.

$0.336 · $0.42 per generation

kling-video-o1-std/text-to-video is a state-of-the-art generative video model designed to transform complex textual descriptions into high-quality cinematic footage. As the standard version within the acclaimed Kling AI family, it balances computational efficiency with breathtaking visual realism. It specializes in simulating real-world physics, maintaining character consistency, and producing fluid motion that rivals professional cinematography. Whether you are creating short-form social media clips or conceptualizing large-scale film projects, kling-video-o1-std/text-to-video provides the reliability and creative depth needed for modern digital storytelling. Its architecture is optimized for high-resolution output, ensuring that every frame remains sharp and logically coherent throughout the generated sequence.

$0.336 · $0.42 per generation

kling-video-o1-std/image-to-video represents the pinnacle of Kuaishou's generative video technology. As part of the sophisticated o1 series, this model specializes in transforming static images into fluid, high-fidelity videos with exceptional motion consistency. It bridges the gap between static digital art and cinematic storytelling by applying advanced physical reasoning to every pixel. Unlike standard animation tools, this model preserves complex textures and lighting while introducing realistic camera movements and character dynamics. It is perfectly suited for professional creators requiring precise control over visual narratives, offering a significant upgrade in temporal stability compared to previous generations of AI video models.

$0.2016 · $0.252 per generation

kling-video-o1-std/video-to-video is a specialized AI model designed for high precision video transformation. Developed as part of the innovative Kling family, this model focuses on the Video-to-Video (V2V) modality, allowing users to take original footage and restyle or modify it while maintaining impeccable temporal consistency. Unlike basic text-to-video models, the kling-video-o1-std/video-to-video variant excels in preserving motion structures and subject identity throughout the generation process. It is the perfect tool for VFX artists and developers who require predictable yet creative control over existing video assets. By leveraging standard o1 optimization, it balances processing speed with cinematic quality, making it an industry leader for scalable video production workflows.

$0.2016 · $0.252 per generation

kling-video-o1-std/reference-to-video is a high-performance AI video generation model designed to convert static images into fluid, cinematic video sequences with exceptional temporal consistency. As part of the prestigious Kling family, the o1-std variant introduces enhanced motion reasoning, ensuring that complex physical interactions and camera movements remain realistic throughout the clip. This model excels in 'reference-to-video' tasks, where a provided image serves as the structural and aesthetic foundation for the generated content. Ideal for filmmakers, advertisers, and developers, it offers a significant leap in quality over baseline models by maintaining strict character and environmental fidelity. By utilizing this model on GPT Proto, professionals can access a stable, scalable API for high-end visual storytelling.

$0.28 · $0.35 per generation

kling-v2.6-pro/text-to-video is a flagship generative video model designed for professional-grade visual storytelling. Building upon the core Kling architecture, this Pro version introduces significantly enhanced motion dynamics and temporal consistency, capable of producing full HD 1080p sequences with cinematic fluid movements. It excels in simulating complex physical laws and lifelike human expressions, making it a superior choice for advertising, film pre-visualization, and high-end digital marketing. Compared to standard models, kling-v2.6-pro/text-to-video offers more precise prompt adherence and sophisticated camera control, ensuring every generated clip meets the rigorous standards of modern content creators demanding excellence and efficiency in AIGC.

$0.28 · $0.35 per generation

kling-v2.6-pro/image-to-video is a top-tier generative AI model specifically designed for high-resolution video synthesis from static images. As part of the prestigious Kling AI family, the Pro version enhances temporal consistency and physical realism beyond standard releases. It enables developers to generate cinematic sequences up to 10 seconds with complex motion paths and high structural integrity. This model stands out by maintaining the fine details of the input image while applying sophisticated diffusion-based animation. Whether for marketing, film pre-visualization, or social media content, kling-v2.6-pro/image-to-video provides professional-grade stability and creative flexibility for demanding AIGC workflows.

Input: $0.3 · $0.5 per 1M tokens
Output: $6 · $10 per 1M tokens

gemini-2.5-flash-preview-tts/text-to-audio is Google’s latest Gemini family model specializing in efficient text-to-speech and audio synthesis. Designed for rapid, natural voice output, it delivers high-quality results for conversational AI, accessibility solutions, and real-time multimedia apps. Compared to earlier generations, gemini-2.5-flash-preview-tts/text-to-audio provides improved speech nuance, faster response times, and seamless multimodal integration. Its streamlined API makes deployment easy for developers, while its robust architecture ensures scalable performance in demanding contexts.
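To make the TTS workflow concrete, the sketch below assembles a request body in the style of the Gemini API's speech generation, asking for audio output with a named prebuilt voice. The voice name and nested field names are assumptions modeled on public Gemini conventions, not verified here.

```python
# Hypothetical sketch: a text-to-speech request body in the style of the
# Gemini API. Voice name and nested field names are illustrative assumptions.

def build_tts_request(text: str, voice: str = "Kore") -> dict:
    """Assemble a minimal speech-generation payload."""
    return {
        "model": "gemini-2.5-flash-preview-tts",
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],  # request audio, not text
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }

req = build_tts_request("Welcome to the demo.")
```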

Input: $0.6 · $1 per 1M tokens
Output: $12 · $20 per 1M tokens

gemini-2.5-pro-preview-tts/text-to-audio is a multimodal AI model specializing in text-to-speech conversion. Built on Gemini’s latest architectural advancements, it transforms written content into natural-sounding audio. This model distinguishes itself with high accuracy, rapid processing, and customizable voice outputs. Suited for developers seeking scalable, real-time speech synthesis, gemini-2.5-pro-preview-tts/text-to-audio ensures smooth integration into apps, accessibility platforms, customer support, and multimedia solutions. Compared to standard Gemini or previous generation models, it offers enhanced audio fidelity and expanded language support.

Input: $0.12 · $0.2 per 1M tokens
Output: $0.9 · $1.5 per 1M tokens

grok-code-fast-1/text-to-text is a high-speed AI model tailored for rapid code generation and text-to-text transformation tasks. It delivers efficient, context-driven coding outputs and is optimized for developer productivity. Compared to mainstream models like GPT, grok-code-fast-1/text-to-text prioritizes minimal latency and workflow adaptability, particularly for software engineering scenarios. Its fast response and streamlined design make it a reliable choice for professionals needing accurate, quick code suggestions or refactoring. The model supports complex programming tasks, robust error handling, and seamless integration into dev environments.

Input: $1.8 · $3 per 1M tokens
Output: $9 · $15 per 1M tokens

grok-4-0709/text-to-text is an advanced text generation AI model from xAI’s Grok family, optimized for speed and precision in handling natural language tasks. It efficiently supports writing, programming, and data summarization workflows. Compared to earlier Grok iterations, grok-4-0709/text-to-text provides enhanced reasoning abilities and consistent outputs, making it suitable for professionals requiring reliable and context-aware responses. Its foundation on the Grok architecture ensures rapid processing and integration for scalable solutions across diverse industries.

Input: $1.8 · $3 per 1M tokens
Output: $9 · $15 per 1M tokens

grok-4-0709/image-to-text is an advanced multimodal AI model by Grok, part of the 4-0709 family. Tailored for accurate image interpretation and text generation, it bridges visual analysis and language, excelling in extracting structured information from images. Compared to foundational Grok models, image-to-text expands multimodal capabilities, making it ideal for developers needing image comprehension, OCR tasks, or seamless image-to-text workflows in real-time environments.

Input: $0 per 1M tokens
Output: $60 · $100 per 1M tokens

speech-2.6-hd/text-to-audio is a state-of-the-art AI model for converting text into high-definition audio. Designed for speed and natural language handling, it generates clear, expressive speech in various styles. As part of the speech-2.6-hd family, it improves latency and natural prosody versus earlier generations. This model stands out for realistic synthesis, multi-language support, and seamless API integration. It is ideal for applications in media production, accessible technology, customer service, and educational tools. It enables developers to build scalable voice solutions with excellent audio quality and robust customization options.

$0.45 · $0.5 per generation

wan-2.6/text-to-video is a cutting-edge AI model designed for rapid and flexible text-to-video synthesis. Developed as part of the wan model family, it excels in generating dynamic video content directly from textual prompts, empowering developers and creators in media, marketing, and education. Compared to earlier generations, wan-2.6/text-to-video offers faster rendering speeds, improved visual coherence, and support for a wide variety of styles. Its multimodal architecture and powerful context processing set it apart from text-only models, making it ideal for modern multimedia workflows and innovation-driven production teams.

$0.45 · $0.5 per generation

wan-2.6/image-to-video is a leading-edge AI model designed for fast, automated conversion of static images into dynamic video clips. From the WAN model family, it leverages advanced generation algorithms to produce seamless transitions and high fidelity visuals. This generation supports enhanced speed and adaptability, making it suitable for creative industries, marketing, education, and social media content production. Unlike basic image-to-video tools or foundational models, wan-2.6/image-to-video provides superior scene continuity, customization options, and precise temporal control, offering developers a scalable, reliable solution for synthetic media pipelines.

$0.9 · $1 per generation

wan-2.6/reference-to-video is an advanced AI model engineered for video reference tasks such as semantic video search, temporal localization, and content analysis. As a member of the wan-2.6 family, this model offers scalable video understanding, combining multi-modal input capabilities and efficient retrieval. It differs from base models by focusing on video-specific features, supporting accurate cross-modal scene matching and real-time video analytics. Ideal for media, education, and security industries, wan-2.6/reference-to-video provides developers robust tools for integrating video understanding into modern workflows.

$0.0384 · $0.048 per generation

doubao-seedance-1-5-pro-251215/text-to-video is a next-gen multimodal AI model designed for transforming textual input into high-quality videos within seconds. Developed as part of the advanced doubao-seedance family, this model leverages accelerated generation speed and precise scene synthesis. Compared to basic models, it features improved temporal consistency, enhanced visual fidelity, and customizable output options. Ideal for marketing, education, creative production, and business prototyping, it empowers developers to automate video workflows with scalable API support. Its unique processing pipeline offers fast, reliable video creation from contextual prompts, setting it apart from traditional text or image-focused models.

$0.0384 · $0.048 per generation

doubao-seedance-1-5-pro-251215/image-to-video is an advanced multimodal AI model designed for generating videos from images with high fidelity and technical precision. Built on the Seedance model family, it supports creative video synthesis and animation production from static visual input. Compared to foundational models, doubao-seedance-1-5-pro-251215/image-to-video provides optimized processing speed, enhanced temporal consistency, and greater flexibility for creative industries and developers. Its core strengths lie in its multimodal capability, efficient video rendering, and automatic context adaptation, making it ideal for media, entertainment, design, and AI video research.

$0.0384 · $0.048 per generation

seedance-1-5-pro-251215 is a next-generation text-to-video AI model designed for rapid and efficient multimedia content creation. Supporting the conversion of written prompts into dynamic videos, it enables developers, marketers, and educators to generate tailored visual content with ease. Compared to previous iterations, seedance-1-5-pro-251215 offers faster rendering speed, improved video quality, and more reliable scene interpretation. Its foundation model powers seamless context adaptation, making it ideal for industry-specific visual storytelling across digital platforms, advertising, training, and social media campaigns.

Input: $0.3 · $0.5 per 1M tokens
Output: $1.8 · $3 per 1M tokens

gemini-3-flash-preview/text-to-text is a high-speed AI language model from Google’s Gemini family, built for text generation, coding, and automation. It stands out for rapid inference, efficient resource usage, and strong task specialization. Optimized for enterprise and developer workflows, its architecture refines context handling compared to core Gemini models, enabling precise outputs and robust API integration. gemini-3-flash-preview/text-to-text is ideal for teams needing dependable, scalable solutions in content creation, code analysis, and real-time operations.

Input: $0.3 · $0.5 per 1M tokens
Output: $1.8 · $3 per 1M tokens

gemini-3-flash-preview/image-to-text is a Google Gemini 3 family multimodal AI model engineered for efficient image-to-text transformation. It delivers exceptionally fast inference, high accuracy, and robust image understanding for technical and enterprise scenarios. Unlike generic models, it is optimized for processing visual data and extracting contextual information, making it ideal for rapid tagging, accessibility workflows, and precise document analysis. Its core differentiator is speed without compromising on detail or versatility, which sets it apart from broader Gemini models as well as other competitors such as GPT-4V. Developers and businesses can leverage this model for streamlined image data integration and scalable automation solutions.

$0.05 per generation

gpt-image-1.5-plus/text-to-image is an advanced multimodal AI model designed for generating high-quality images from natural language prompts. Built upon the GPT family, it extends multimodal capabilities with superior text-to-image synthesis, realistic visual output, and rapid generation speed. It stands out for industry-level reliability, flexible deployment, and seamless integration with creative workflows. Compared with previous GPT image models, it delivers enhanced image fidelity and context understanding, making it ideal for creative professionals and technical teams.

$0.05 per generation

gpt-image-1.5-plus/image-edit is an advanced generative AI model from OpenAI, designed for detailed image editing and multimodal tasks. Building on the GPT-4 architecture, this model supports image understanding alongside editing via natural language prompts. Developers can utilize it for creative, technical, and educational image workflows. Compared to pure text-based models, it uniquely integrates image context for robust editing functionality and more intuitive multimedia outputs, making it ideal for professionals seeking precise, high-quality image transformations.

Input: $4.8 · $8 per 1M tokens
Output: $19.2 · $32 per 1M tokens

gpt-image-1.5/text-to-image is an advanced multimodal AI model built for accurate and fast text-to-image generation. Part of the GPT family, it leverages foundational GPT technology but is uniquely optimized for visual synthesis. Developers use it for rapid prototyping, creative design workflows, and automated image generation tasks. Compared to standard GPT models, it adds robust image processing, visual creativity, and seamless integration with multimodal workflows, making it a powerful tool for digital content creators, marketers, and product teams operating in diverse industries.

Input: $4.8 · $8 per 1M tokens
Output: $19.2 · $32 per 1M tokens

gpt-image-1.5/image-edit is an advanced multimodal AI model by OpenAI designed for image manipulation, creative editing, and text-image fusion tasks. Part of the GPT Proto platform, it combines image understanding with precise editing workflows. Compared to base GPT language models, gpt-image-1.5/image-edit enables context-aware image changes, making it ideal for designers, developers, and marketing teams seeking scalable, creative, and reliable AI-driven imaging solutions. Its fast processing, robust architecture, and intuitive controls provide a unique edge for image-centric tasks and seamless pipeline integrations.

Input: $12.6 · $21 per 1M tokens
Output: $100.8 · $168 per 1M tokens

gpt-5.2-pro-2025-12-11 is a state-of-the-art AI language model designed for developers and enterprises needing robust text generation, code assistance, and data analysis. As part of the GPT-5 series, it offers enhanced speed, improved context management, and multimodal support. Compared to its predecessors, gpt-5.2-pro-2025-12-11 delivers superior accuracy, creative flexibility, and scalable API performance, making it ideal for demanding business and technical applications.

Input: $12.6 · $21 per 1M tokens
Output: $100.8 · $168 per 1M tokens

gpt-5.2-pro-2025-12-11/image-to-text is a state-of-the-art vision-language AI in the GPT-5.2 Pro family, designed for high-accuracy image-to-text conversion. Ideal for professionals in document processing, content extraction, and accessibility, this model delivers fast, reliable OCR and contextual scene understanding. With enhanced multimodal capabilities beyond its base, gpt-5.2-pro-2025-12-11/image-to-text stands out for rich semantic analysis and flexible API deployment, making it a preferred choice for enterprise automation and developer workflows.

Input: $12.6 · $21 per 1M tokens
Output: $100.8 · $168 per 1M tokens

gpt-5.2-pro-2025-12-11/web-search is a cutting-edge AI model from OpenAI, designed for advanced natural language processing, code generation, and real-time web search integration. It delivers rapid and precise responses, robust multi-modal understanding, and enhanced security features. Compared to previous generations, the pro variant offers higher throughput, improved accuracy, and seamless access to web data. Ideal for developers, enterprises, and research teams needing context-rich, scalable AI solutions.
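For the web-search integration mentioned above, an OpenAI-compatible request typically declares a search tool in the `tools` array. The sketch below builds such a payload offline; the tool type string is an assumption, not a confirmed identifier for this model.

```python
# Hypothetical sketch: enabling a web-search tool on a Responses-API-style
# request. The "web_search" tool type string is an assumption, not a
# confirmed identifier for gpt-5.2-pro-2025-12-11.

def build_search_request(query: str) -> dict:
    """Assemble a request that lets the model call a web-search tool."""
    return {
        "model": "gpt-5.2-pro-2025-12-11",
        "input": query,
        "tools": [{"type": "web_search"}],  # assumed tool identifier
    }

req = build_search_request(
    "Latest stable Rust release and its headline features"
)
```

The same pattern applies to the other web-search variants listed below; only the `model` string changes.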

Input: $12.6 · $21 per 1M tokens
Output: $100.8 · $168 per 1M tokens

gpt-5.2-pro-2025-12-11/file-analysis is a next-generation AI model from the GPT-5.2 Pro series, designed for detailed file analysis, rapid code review, and handling structured data workloads. It supports multimodal input, advanced parsing features, and robust content safety checks, making it ideal for developers, analysts, and enterprise teams handling complex documents and code. Compared to base GPT-5.2, the file-analysis variant offers specialized file processing capabilities, improved speed, and integration-friendly APIs for large-scale automated workflows.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-2025-12-11/text-to-text is a state-of-the-art AI language model from OpenAI’s fifth generation, designed for high-speed and precise text generation. Built on enhanced transformer technology, it supports advanced creative writing, programming help, summarization, and technical content. Improving on prior GPT models, it delivers faster responses, better accuracy, and more context-aware outputs, making it ideal for developers, enterprises, researchers, and writers demanding reliable performance. Its specialized text-to-text focus ensures consistent, logical, and human-like output for modern AI-powered applications.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-2025-12-11/image-to-text is a cutting-edge AI model from the GPT-5.2 generation. Specializing in image-to-text conversion, it enables accurate text extraction and comprehensive image interpretation for various tasks. Unlike basic GPT-5.2 models, this variant is optimized for multimodal processing, delivering precise outputs in scenarios such as document digitization and visual data analysis. Its robust architecture ensures fast performance, high reliability, and seamless integration, making it ideal for industries that require efficient image-to-text solutions.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-2025-12-11/file-analysis is a specialized version of the GPT-5.2 model family, engineered to deliver advanced file and data analysis. Building on the core strengths of GPT-5.2, this model processes documents, code files, and structured data with enhanced precision and speed. It is particularly effective in programming, legal review, academic research, and enterprise automation, standing out for its contextual awareness and robust handling of complex file formats. Compared to basic GPT-5.2, this variant offers optimized parsing, deeper document insights, and workflow integrations, making it ideal for developers and data professionals.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-2025-12-11/web-search is a state-of-the-art AI model from the GPT-5 family, optimized for advanced text generation, coding, web-integrated tasks, and multi-modal analysis. Unlike the GPT-5 base, this model features fast web search capabilities and enhanced retrieval-augmented generation. It delivers precise, context-rich outputs for diverse professional scenarios. Its adaptability and robust APIs make it ideal for developers and enterprises requiring reliable, current AI solutions.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-chat-latest/text-to-text is a cutting-edge text modality AI model from OpenAI, designed for developers needing fast, accurate, context-driven output in chat, writing, programming, and analytics. Building on the GPT-5 family, it offers improved response speed and logic over previous versions. This model delivers stable, creative, and scalable text processing, making it ideal for applications in content generation, automated support, technical writing, and data analysis. Compared to earlier GPT models, it features deeper contextual reasoning and better adaptation for professional workflows, setting it apart in quality and efficiency for technical users across industries.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-chat-latest/image-to-text is a cutting-edge multimodal AI model from the GPT-5.2 family, specialized in converting images to detailed, context-aware text descriptions. Unlike pure text models, it excels at visual understanding tasks, offering fast, accurate image captioning and recognition for technical, creative, or accessibility scenarios. Enhanced by the latest GPT-5.2 advancements, it delivers optimized performance, stable outputs, and scalable integration for developers needing reliable image-to-text solutions.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-chat-latest/web-search is a cutting-edge AI language model from the GPT-5 family, designed specifically for efficient chat and conversational search tasks. It excels in natural language understanding, coding support, and dynamic content generation. Compared with earlier GPT models, it offers faster responses, improved web-integrated knowledge, and enhanced context handling. Its flexibility and robust architecture empower developers to create advanced applications for customer support, data extraction, technical assistance, and more. This model is ideal for technical users seeking real-time information retrieval and seamless integration into modern workflows.

Input: $1.05 · $1.75 per 1M tokens
Output: $8.4 · $14 per 1M tokens

gpt-5.2-chat-latest/file-analysis is a cutting-edge AI model focused on both advanced conversational AI and sophisticated file analysis. It supports high-speed, multi-modal file processing, code understanding, and deep document insights. As an extension of the GPT-5.2 core, this variant is tailored for developers, analysts, and enterprises seeking robust, reliable file-driven AI solutions. Compared to standard GPT models, it delivers faster, more accurate document parsing and workflow-centric automation, making it indispensable for businesses requiring secure, scalable file and data handling.

Input:$12.6/1M tokens$21/1M tokens
Output:$100.8/1M tokens$168/1M tokens

gpt-5.2-pro/text-to-text is a powerful generative AI model from the fifth-generation GPT family designed for advanced text-only tasks. It excels in text creation, code support, and extended enterprise scenarios requiring high reliability and accuracy. Compared to earlier GPT versions, gpt-5.2-pro/text-to-text delivers faster, more context-rich outputs, precise response handling, and improved creative reasoning. It is ideal for developers and professionals needing scalable, efficient text workflow automation and robust language capabilities for critical projects.

Input:$12.6/1M tokens$21/1M tokens
Output:$100.8/1M tokens$168/1M tokens

gpt-5.2-pro/image-to-text is OpenAI’s state-of-the-art multi-modal model in the GPT-5 family, optimized for fast, accurate image-to-text conversion. It excels at extracting information from visuals and supports complex natural language understanding. Compared to standard GPT-5.2 Pro, its distinguishing feature is seamless integration of visual inputs and robust contextual analysis. Ideal for developers, businesses, and educators needing reliable visual data processing, gpt-5.2-pro/image-to-text delivers improved response speed, high scalability, and detailed outputs for workflows such as OCR, document analysis, and accessibility solutions.

Input:$12.6/1M tokens$21/1M tokens
Output:$100.8/1M tokens$168/1M tokens

gpt-5.2-pro/web-search is an advanced AI model from the GPT-5 family, designed by OpenAI for high-speed, scalable text generation and real-time web search capabilities. It offers improved context handling, multimodal support, and integration for live data retrieval. With accurate, fast outputs and flexible workflows, gpt-5.2-pro/web-search excels in professional contexts where up-to-date information and superior language understanding are crucial, distinguishing itself from base GPT-5 by offering built-in web search and enhanced customization.

Input:$12.6/1M tokens$21/1M tokens
Output:$100.8/1M tokens$168/1M tokens

gpt-5.2-pro/file-analysis is a specialized AI language model based on the GPT-5.2 family. It offers advanced capabilities for file processing, document understanding, and code analysis. Designed for technical users, it delivers high accuracy, fast performance, and strong multimodal support. Unlike the base GPT-5.2 model, gpt-5.2-pro/file-analysis is optimized for structured document tasks and file-based workflows, providing developers and enterprises with efficient, reliable analysis tools for textual data and code review.

Input:$1.05/1M tokens$1.75/1M tokens
Output:$8.4/1M tokens$14/1M tokens

gpt-5.2/text-to-text is a next-generation AI language model designed for rapid, precise text-based tasks such as writing, summarizing, code generation, and data analysis. As a part of the advanced GPT-5 family, it integrates improved text understanding with higher speed and accuracy compared to previous models. Its specialized architecture supports scalable performance, robust context management, and reliable results in professional settings. Developers, analysts, and educators benefit from its focused text-to-text processing, making it ideal for demanding workflows and seamless API integration. Compared to generic models, gpt-5.2/text-to-text offers enhanced analytic strength and optimized experience for enterprise applications.

Input:$1.05/1M tokens$1.75/1M tokens
Output:$8.4/1M tokens$14/1M tokens

gpt-5.2/image-to-text is a next-generation multimodal AI model from OpenAI's GPT family, designed to convert visual content into precise textual descriptions and data. It supports fast, accurate image-to-text processing, making it ideal for developers needing robust automation, accessibility solutions, and workflow integration. Unlike base GPT-5.2, it includes a superior image understanding module, enabling seamless cross-modal tasks, efficient extraction, and contextual outputs for various industries. Its differentiators include advanced speed, reliability, and scalable processing capacities.

Input:$1.05/1M tokens$1.75/1M tokens
Output:$8.4/1M tokens$14/1M tokens

gpt-5.2/file-analysis is a specialized AI model from the GPT-5.2 family, designed for fast and precise file analysis tasks. It excels at extracting, interpreting, and summarizing data from various file formats including text, code, and spreadsheets. Compared to its base GPT-5.2 model, gpt-5.2/file-analysis offers enhanced capabilities for structured data workflows, improved accuracy on complex file types, and optimized performance for developers. Its multi-modal processing, robust context handling, and tailored modules make it ideal for industries requiring reliable file intelligence at scale.

Input:$1.05/1M tokens$1.75/1M tokens
Output:$8.4/1M tokens$14/1M tokens

gpt-5.2/web-search is an advanced AI model in the GPT-5 series, designed for fast, accurate language processing with seamless web search integration. It supports text generation, code tasks, and real-time content research, providing up-to-date answers directly from the web. Its difference from standard GPT-5.2 lies in its direct web-enabled processing, making it ideal for developers and researchers seeking both powerful text generation and instant online data retrieval.

$0.027/per time

nai-diffusion-4-5-curated is an advanced text-to-image AI model designed for fast and high-quality visual content generation. Built upon the latest diffusion techniques, it delivers detailed artwork, vibrant illustrations, and customized imagery from text prompts. Distinct from earlier nai models, the 4-5-curated release improves output consistency, style fidelity, and prompt responsiveness, benefiting creative professionals and developers. Its optimized pipeline ensures rapid inference and seamless integration, making it ideal for digital art, design, game development, marketing campaigns, and social media visuals.

$0.027/per time

nai-diffusion-4-5-curated/image-to-image is a next-generation diffusion-based AI model designed for advanced image-to-image transformations. Built on the stable diffusion framework, this curated model specializes in artwork enhancement, style transfer, and precise image modification. Unique to the nai-diffusion family, it offers greater creativity, fine control, and faster inference compared to standard diffusion models. Ideal for digital artists, game developers, and creative professionals, it is optimized for high-quality visual output and adaptive workflows in production environments.

$0.034/per time$0.04/per time

seedream-4-5-251128/text-to-image is a modern, high-performance multimodal AI model that converts text instructions into detailed and accurate images. Designed as part of the Seedream model family, it delivers reliable, creative, and context-aware results for commercial and research scenarios. Compared to its foundational base, seedream-4-5-251128/text-to-image optimizes speed and accuracy for image generation tasks, supporting seamless integration for developers and businesses. Its advanced architecture ensures fast processing, flexible input handling, and consistent output, distinguishing it from other mainstream models with robust, scalable multimodal workflows.

$0.034/per time$0.04/per time

Try seedream-4-5-251128/image-edit on GPT Proto. Edit images for inpainting, background removal, restoration, and creative modifications with detail preservation, all through a more affordable AI API.

$0.0303/per time$0.0357/per time

doubao-seedream-4-5-251128/text-to-image is an API model identifier for ByteDance’s Doubao Seedream 4.5, a high-quality text-to-image generator for creating detailed, styled visuals from natural language prompts, typically used for marketing creatives, concept art, and educational or product illustrations via programmatic image generation workflows.

$0.0303/per time$0.0357/per time

doubao-seedream-4-5-251128/image-edit is an API variant of ByteDance’s Seedream 4.5 image model that edits existing images using a prompt and optional masks, handling tasks like inpainting, object removal or addition, background changes, style and lighting adjustments, and detailed retouching while preserving subject identity and producing high‑resolution, production‑ready visual results suitable for e‑commerce, creative work, and photo restoration workflows.
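A prompt-plus-mask edit of the kind described above can be sketched as a single request body. The field names here ("image", "mask", "size") are assumptions modeled on common image-edit APIs, not the confirmed Seedream schema; check the provider's documentation before use.

```python
# Hypothetical image-edit request. Every field name below is an assumption
# patterned on typical image-edit APIs -- verify against the real schema.
edit_request = {
    "model": "doubao-seedream-4-5-251128/image-edit",
    "prompt": "Remove the mannequin stand and replace the background with plain white",
    "image": "https://example.com/product-shot.png",   # source image (placeholder URL)
    "mask": "https://example.com/product-mask.png",    # optional: limits the edited region
    "size": "2048x2048",
}

print(sorted(edit_request))
```

Omitting the mask lets the model decide which regions to change from the prompt alone; including it keeps edits confined to the masked area, which matters for e-commerce shots where the subject must stay pixel-identical.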

$0.027/per time

NovelAI Diffusion V4.5 Full is a state-of-the-art diffusion model for generating high-resolution images from text prompts. It excels in creative automation, delivering vivid, contextually accurate visuals with a high degree of control and customization. Compared to earlier diffusion models, it offers faster inference, stronger prompt adherence, and broader stylistic flexibility. Its robust architecture supports easy integration into creative and production workflows, making it ideal for concept art, advertising, illustration, and rapid design development.

$0.027/per time

nai-diffusion-4-5-full/image-to-image is an advanced AI model specializing in image-to-image conversion and enhancement. Developed by NovelAI, it is part of the powerful nai-diffusion 4.5 family, offering fast, accurate, and creative transformations across diverse visual styles. The model stands out for its reliable processing speed, customizability, and robust multi-modal capabilities. Compared to prior NaiDiffusion generations, it delivers superior resolution and flexibility for professional workflows, making it ideal for creative teams, designers, animators, and developers seeking state-of-the-art image generation.

$0.135/per time

Grok Imagine v0.9 is xAI's advanced text-to-video AI model powered by the Aurora engine, generating 6-15 second HD videos with native synchronized audio, lip-sync dialogue, music, and cinematic effects at 24 FPS. It supports image-to-video, voice prompts, and rapid rendering (<15 seconds) for marketing, storytelling, and prototyping via X Premium+ or API.

$0.135/per time

Grok Imagine v0.9 image-to-video transforms static images into 5-15 second HD clips (480p-1080p) with synchronized audio, lip-sync, music, and cinematic motion in under 30 seconds. Features 4 modes (Normal, Fun, Custom, Spicy), natural animations, camera effects, and optional soundtracks—ideal for social media, marketing, and rapid prototyping.

Input:$3.5/1M tokens$5/1M tokens
Output:$17.5/1M tokens$25/1M tokens

claude-opus-4-5-20251101 is an advanced AI language model from Anthropic’s Claude family. Designed for rapid, high-quality text generation and code, it supports broad use cases from content creation to complex analysis. Compared to previous Claude models, it brings improved reasoning, greater reliability, and more control over context windows and task-specific outputs. Professionals choose claude-opus-4-5-20251101 for its balance of speed, creativity, and precision across enterprise, research, and general productivity applications.

Input:$3.5/1M tokens$5/1M tokens
Output:$17.5/1M tokens$25/1M tokens

claude-opus-4-5-20251101/file-analysis is an Anthropic Claude Opus family model focused on robust file analysis, document parsing, and code review tasks. It delivers high-speed, accurate text and code interpretation, setting itself apart from general-purpose models through specialized workflow optimizations. It features advanced multi-file handling and context retention, making it an excellent choice for developers, data analysts, and researchers seeking scalable, reliable file-centric AI solutions.

Input:$3.5/1M tokens$5/1M tokens
Output:$17.5/1M tokens$25/1M tokens

claude-opus-4-5-20251101 is a flagship AI model from Anthropic, built for complex language comprehension, fluent generation, and programmatic reasoning. It outperforms earlier Claude models with faster responses, higher accuracy, and extensive context capabilities. Tailored for professionals in research, coding, data analysis, and customer support, it offers reliable, nuanced outputs. Compared to GPT-4 and Gemini, this release delivers strong alignment, safety, and advanced reasoning while maintaining competitive speed. Developers appreciate its scalable performance and robust integration support for modern workflows.

Input:$0.12/1M tokens$0.2/1M tokens
Output:$0.3/1M tokens$0.5/1M tokens

Grok-4-1-fast-non-reasoning is a fast and efficient AI language model designed primarily for high-speed content generation and automation. Part of the Grok family, this model emphasizes throughput and reliability over complex reasoning, making it ideal for large-scale workflows, batch processing, and scenarios where rapid responses are critical. Compared to foundational Grok models, grok-4-1-fast-non-reasoning trades deeper reasoning for optimized speed, supporting tasks such as templated copywriting, straightforward summarization, and auto-messaging. It is ideal for developers and enterprises demanding maximum efficiency and scalable performance.

Input:$0.12/1M tokens$0.2/1M tokens
Output:$0.3/1M tokens$0.5/1M tokens

Grok-4-1-fast-non-reasoning/image-to-text is a specialized AI model designed for ultra-fast image-to-text conversion. As part of the Grok 4.1 fast series, it focuses on quick and accurate extraction of textual information from images, without complex reasoning modules. Distinctively, it prioritizes response speed and throughput, making it ideal for large-scale OCR tasks, rapid document digitization, and developer pipelines needing high-efficiency vision processing. Compared to standard multimodal models, this variant trades deeper semantic interpretation for unmatched speed, making it a practical choice for direct image text extraction.

Input:$0.12/1M tokens$0.2/1M tokens
Output:$0.3/1M tokens$0.5/1M tokens

Grok-4-1-fast-reasoning is a next-generation AI language model developed by xAI, engineered for high-speed reasoning and rapid response in text-based tasks. It excels at fast, context-rich outputs in scenarios including code generation, analytics, and technical writing. Compared to standard Grok models, grok-4-1-fast-reasoning provides accelerated processing and enhanced performance for real-time applications. This model is ideal for developers and technical professionals seeking reliable and efficient AI for fast-paced workflows and dynamic environments.

Input:$0.12/1M tokens$0.2/1M tokens
Output:$0.3/1M tokens$0.5/1M tokens

Grok-4-1-fast-reasoning/image-to-text is a next-generation multimodal AI model from Grok, engineered for rapid image-to-text conversion, robust context handling, and fast reasoning. It enables seamless workflows for professionals who require precise visual content analysis alongside rapid textual interpretation. Compared to the base Grok-4-1 model, this variant uniquely integrates visual understanding with advanced natural language reasoning for efficient feedback. Its optimized speed and cross-modal logic empower developers, data scientists, and analysts to extract structured information from images while maintaining reliable response quality across integrated tasks.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5.1-Codex is an advanced coding model from OpenAI optimized for sustained, long-horizon software engineering tasks. It features a unique context compaction mechanism that preserves critical information across multiple sessions to handle large projects coherently. The model offers higher token efficiency, long-duration agentic coding workflows, and improved quality in debugging, refactoring, and CI/CD automation, making it ideal for complex, multi-file codebase management.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5.1-Codex image-to-text is a multimodal capability of the GPT-5.1 family that enables extracting and interpreting text directly from images. It uses advanced AI to analyze layout, fonts, and stylized or handwritten text beyond traditional OCR, supporting complex document structures and multiple languages. This capability is useful for digitizing documents and UI designs, and for extracting code or other information embedded in images with high accuracy and contextual understanding.

$0.0335/per time$0.134/per time

Gemini-3-Pro-Image-Preview, or Nano Banana Pro (nano banana 2), is Google's advanced AI image model built on Gemini 3 Pro. It generates high-fidelity 1K–4K images with accurate text, deep reasoning, and enhanced editing features like 3D object control and localized changes. It enables professional-grade visuals with fast production, watermarking for authenticity, and supports complex multi-step prompts and compositions.

$0.0335/per time$0.134/per time

Gemini-3-Pro-Image-Preview image edit, also known as Nano Banana Pro (nano banana 2) image editing, provides studio-quality control for creating and modifying images. It supports localized edits like adjusting lighting, camera angles, focus, and color grading. Users can transform scenes, blend multiple images, and maintain character consistency. This model excels at generating clear, accurate text in images and supports multi-turn conversational editing by preserving visual context with thought signatures. Advanced AI reasoning and grounding with Google Search improve real-world accuracy in edits.

$0.96/per time$1.2/per time

Veo-3.1-Fast-Generate-Preview is a rapid video generation model from Google DeepMind that enables real-time creation of short, cinematic videos from text, images, or video frames, prioritizing speed and lower latency over maximum fidelity. It supports text-to-video, image-to-video, and video-to-video generation workflows with native audio and is optimized for rapid previews and iterative creative processes.

$0.96/per time$1.2/per time

Veo-3.1-fast-generate-preview image-to-video is a fast AI model that converts static images into high-quality, smooth videos with synchronized audio. It supports resolutions up to 1080p and offers quick generation within seconds, enabling creators to animate images for social media, storytelling, and prototypes with cinematic realism.

$0.96/per time$1.2/per time

Veo-3.1-fast-generate-preview video-to-video creates seamless video transitions by generating intermediate frames between given first and last video frames. It produces short, high-quality video clips with native audio, supporting 1080p and 24fps, ideal for extending scenes, creative video morphing, and rapid video production workflows.

Input:$1.2/1M tokens$2/1M tokens
Output:$7.2/1M tokens$12/1M tokens

Gemini 3 Pro was officially released by Google on November 18, 2025. It is the company’s most advanced multimodal AI model, excelling in complex reasoning, long-context understanding, and processing text, images, audio, and video. Gemini 3 Pro powers Google Search, Workspace, and developer tools, setting new standards on AI benchmarks at launch with broad enterprise and consumer integration.

Input:$1.2/1M tokens$2/1M tokens
Output:$7.2/1M tokens$12/1M tokens

Gemini 3 Pro’s image-to-text model excels at accurately interpreting and describing images. It processes complex visuals, including photos and documents, to generate precise textual descriptions and extract structured data. This enables superior OCR, video analysis, and content understanding in multilingual, real-world scenarios, making it powerful for enterprise applications requiring high-fidelity vision-to-text conversion.

Input:$1.2/1M tokens$2/1M tokens
Output:$7.2/1M tokens$12/1M tokens

gemini-3-pro-preview/file-analysis is a cutting-edge AI model from Google’s Gemini 3 family, focused on robust file and document analysis. It stands out with multimodal capabilities, efficiently processing diverse formats such as text, code, images, and PDFs. Compared to core Gemini models, it adds enhanced document handling and context-aware extraction, making it ideal for technical workflows. Its high processing speed, accuracy and adaptability help developers automate code reviews, analyze reports, and unlock insights from complex files—perfect for those seeking advanced, scalable AI file analysis.
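A document-analysis request for a model like this is usually a content list mixing a text instruction with a file part. The sketch below follows the general Gemini generateContent shape; the file URI is a placeholder upload handle and the exact part field names are assumptions to confirm against Google's current API reference.

```python
# Hypothetical file-analysis request body in the Gemini generateContent style.
# The file URI is a placeholder; part field names are assumptions.
analysis_request = {
    "model": "gemini-3-pro-preview/file-analysis",
    "contents": [
        {
            "role": "user",
            "parts": [
                # Instruction part.
                {"text": "Summarize this report and list every action item."},
                # Attached document part, referenced by an upload handle.
                {
                    "file_data": {
                        "mime_type": "application/pdf",
                        "file_uri": "files/quarterly-report",  # placeholder
                    }
                },
            ],
        }
    ],
}

print(len(analysis_request["contents"][0]["parts"]))
```

The same parts list can carry images or source files alongside the PDF, which is how multi-format workflows like combined code review and report analysis are expressed in one call.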

$2.56/per time$3.2/per time

Veo-3.1-generate-preview is an advanced AI video generator by Google offering three main modes: text-to-video, image-to-video, and video-to-video. It creates high-quality 4-8 second videos in 720p/1080p with synchronized audio and realistic visuals. Key features include using up to 3 reference images for consistency, smooth transitions between start/end frames, and video extensions for longer sequences.

$2.56/per time$3.2/per time

Veo-3.1-generate-preview image-to-video lets you input one or more images (up to three reference images) to guide video content, animating objects or scenes from the image and preserving subject consistency across frames. This modality uses the input image as the initial frame to generate smooth video transitions.

$2.56/per time$3.2/per time

Veo-3.1-generate-preview video-to-video supports extending or editing existing videos by specifying first and last frames to generate seamless transitions and continuity. It enhances videos by adding realistic audiovisual elements and narrative control while maintaining coherent scene evolution.

$0.0244/per time$0.0375/per time

Qwen-Image-LoRA is an advanced AI image editing and generation model based on the Qwen-Image foundational model. It supports precise editing of images, including complex bilingual text edits in Chinese and English, multi-image batch processing, and style preservation. It allows custom LoRA models for flexible style control, enabling professionals to perform high-quality, detailed, and customizable image modifications efficiently.

$0.0244/per time$0.0375/per time

Qwen-Image-Plus-Lora extends the Qwen-Image family with LoRA (Low-Rank Adaptation) technology, enabling rapid fine-tuning or customization on specific styles or subjects using LoRA adapters. Developed by Alibaba Cloud’s Qwen team, it maintains core Qwen-Image editing and generation capabilities while supporting efficient, lightweight model adaptation for branded content, stylistic transfers, and specialized creative tasks.

$0.0195/per time$0.03/per time

Qwen-Image-Plus (also known as Qwen-Image-Edit-2509) is an advanced AI image editing model by Alibaba Cloud’s Qwen team. It supports multi-image editing, enhanced consistency in preserving identities of people and products, advanced text editing, and native ControlNet support for precise image manipulation. It excels in semantic and appearance editing, creative generation, and dynamic pose creation, enabling versatile, high-quality image edits.

Input:$0.09/1M tokens$0.15/1M tokens
Output:$0.36/1M tokens$0.6/1M tokens

gpt-4o-mini-2024-07-18 is an optimized AI language model from OpenAI’s GPT-4o family, built for fast, scalable natural language understanding and generation. It delivers multichannel support, reliable coding assistance, content creation, and data tasks in a lighter, efficient form. Compared with larger GPT-4o models, gpt-4o-mini-2024-07-18 offers reduced latency and resource demands, making it ideal for developers seeking balance between capability and responsiveness across business, education, and creative applications.

Input:$3/1M tokens$5/1M tokens
Output:$9/1M tokens$15/1M tokens

ChatGPT-4o-latest is the most recent update of OpenAI’s GPT-4 Omni (4o) model, integrated into ChatGPT as of early 2025. This version emphasizes increased creativity, clearer and more natural communication, better code handling, and more concise, focused responses. It improves instruction following, readability, and reduces clutter in outputs, available both for ChatGPT users and via the API as the current flagship multimodal chat model.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5.1 is OpenAI's newest GPT-5 series model, designed for developers. It uses adaptive reasoning to dynamically adjust thinking time, speeding up simple tasks by 2-3x without sacrificing intelligence. New features like "reasoning-free" mode, 24-hour caching, and apply_patch/shell tools significantly boost code editing and programming efficiency. This release delivers a powerful and optimized AI experience.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5.1 image-to-text refers to OpenAI’s GPT-5.1 release with enhanced multimodal capabilities that can process images and text together to generate descriptive text, captions, summaries, or structured data from visual content. It emphasizes improved image understanding, better OCR-like text extraction, and more context-aware reasoning for image inputs, along with customizable output styles and longer context handling.

$0.042/per time$0.07/per time

Grok-4-image extends Grok 4’s abilities to visual understanding and reasoning. It can interpret and analyze images, supporting multimodal interaction that combines text and vision. Future developments aim to include image generation, enabling rich AI-assisted workflows that unify text, vision, and code capabilities in one powerful system.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$4.8/1M tokens$8/1M tokens

GPT-image-1-mini is OpenAI’s lightweight model for creating new images directly from textual prompts. It provides fast and affordable image generation up to 1536×1024 resolution, with adjustable quality and fidelity. It’s ideal for bulk creative applications, though its micro-detail and photorealism fall short of premium models.

$1.12/per time$1.4/per time

Kling-v2.1-master is Kuaishou's premium text-to-video and image-to-video AI model, generating 1080p cinematic clips (5-10s) with realistic physics, smooth motion, and temporal consistency. It supports 16:9/9:16/1:1 ratios via API, excels in complex prompts/camera controls, but lacks audio. Ideal for professional storytelling/marketing; costs ~$1.40-$2.80 per clip.

$1.12/per time$1.4/per time

Kling-v2.1-master text-to-video is Kuaishou's premium AI model that generates 1080p cinematic video clips (5-10s) from text prompts. It delivers smooth motion dynamics, realistic physics, temporal consistency, and precise prompt adherence across 16:9/9:16/1:1 ratios. Ideal for storytelling/marketing; no audio support; ~$1.40-$2.80 per clip via API.

$0.392/per time$0.49/per time

Kling-v2.1-pro is Kuaishou's professional-grade image-to-video AI model, generating 1080p clips (5-10s) from static images with enhanced visual fidelity, precise camera movements (pan/zoom/tilt), and smooth motion dynamics. It preserves details/textures, supports motion brush controls, and excels in cinematic storytelling for marketing/product demos. API pricing ~$0.32-$1.40 per clip.

$0.392/per time$0.49/per time

Kling-v2.1-pro "start-end-framed" refers to its Start/End Frame Conditioning feature, allowing users to upload images for the video's first and last frames. The AI generates smooth 1080p transitions (5-10s clips) between them, ensuring precise continuity, cinematic motion, and loop effects (same image for both). Ideal for product reveals, narrative beats, and seamless multi-clip workflows via API.
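The start/end-frame conditioning described above can be sketched as one request: supply the opening and closing frames and let the model interpolate between them. The field names here ("image", "image_tail") follow commonly documented Kling API naming, but treat every field as an assumption and verify against the current API reference.

```python
# Start/end-frame request sketch. Field names are assumptions patterned on
# Kling's documented API; verify before use.
loop_frame = "hero_frame.png"  # placeholder local file name

request = {
    "model_name": "kling-v2-1",
    "mode": "pro",
    "image": loop_frame,        # video's first frame
    "image_tail": loop_frame,   # video's last frame; same image => loop effect
    "duration": "5",
    "prompt": "slow cinematic push-in on the product",
}

print(request["image"] == request["image_tail"])
```

Using two different images instead produces a smooth transition clip between them, which is how multi-clip sequences are chained: each clip's end frame becomes the next clip's start frame.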

$0.224/per time$0.28/per time

Kling-v2.1-standard is Kuaishou's entry-level image-to-video and text-to-video AI model, producing 720p clips (5-10s) with reliable motion, prompt adherence, and basic camera controls. More affordable (~$0.18-$0.25 per clip) than Pro/Master tiers, it's suited for social media, previews, and casual content creation via API.

$0.171/per time$0.19/per time

Hailuo 2.3 Fast is a high-speed AI video generation model focused on image-to-video creation. It produces smooth, realistic videos with dynamic motion at 2.5 times the speed of standard models and lower cost. The model supports 768p resolution clips around 6 seconds long, ideal for rapid video creation and iterative testing while maintaining good visual quality and motion fluidity.

$0.171/per time$0.19/per time

hailuo-2.3-fast/text-to-video represents the pinnacle of high-speed video synthesis within the MiniMax Hailuo family. This model is specifically optimized for developers who require rapid inference without compromising the fluid motion and cinematic aesthetic of the base 2.3 architecture. Unlike standard versions, the fast variant leverages advanced quantization and hardware-specific optimizations to deliver 1080P visuals with significantly reduced latency. It excels in generating complex human movements and environmental transformations from simple text prompts. By prioritizing throughput, it becomes the ideal solution for real-time creative tools and large-scale automated content pipelines that demand professional-grade temporal consistency and visual fidelity.

$0.441/per time$0.49/per time

Hailuo-2.3-Pro image to video is a MiniMax-developed AI model that converts static images into smooth animated videos. It maintains image composition and color fidelity while adding fluid motion, camera transitions, and scene coherence. This model supports multi-aspect ratios and rapid generation speeds, serving creators who need high-quality video output from images efficiently.

$0.441/per time$0.49/per time

Hailuo-2.3-Pro text to video is an AI video generator developed by MiniMax, a Shanghai-based AI foundation model company. It produces cinematic 6 to 10-second 1080p videos with realistic human motions, detailed facial expressions, and dynamic camera work. The model excels in choreography, artistic style stability, and is optimized for commercial marketing and storytelling use.

$0.252/per time$0.28/per time

Hailuo-2.3-Standard image to video is a MiniMax AI model designed to animate static images into smooth, cinematic 768p videos lasting up to 10 seconds. It maintains image composition, lighting, and character details while adding realistic motion, camera movements, and scene transitions. The model balances quality and cost-effectiveness for fast, high-fidelity video production.

$0.252/per time$0.28/per time

Hailuo-2.3-Standard text to video is an AI model from MiniMax that generates 6 to 10-second videos in 1080p resolution based on text prompts. It features improved motion capture, realistic facial expressions, dynamic camera angles, and artistic style control, making it suitable for marketing, entertainment, and professional storytelling.

$0.252/per time$0.28/per time

Hailuo-02-Standard is a version of MiniMax's AI video generation model designed for producing high-quality videos from images or text prompts. It typically generates videos at 768p resolution (compared to 1080p for the Pro version) with 6 or 10 second lengths at 25 frames per second. The model excels in natural motion synthesis, advanced camera controls, and deep prompt understanding for creating cinematic videos with realistic physics. It balances fast generation times (around 4 minutes) and professional visual quality, making it suitable for social media, marketing, and creative content production.

$0.252/per time$0.28/per time

Hailuo-02-Standard image-to-video is an AI video generation model by MiniMax designed to convert static images into dynamic videos at 768p resolution with 25 frames per second. It features natural motion synthesis that preserves the integrity of the original image while creating smooth, lifelike animations. Processing time is around 4 minutes, supporting various image formats like JPG, PNG, GIF, and AVIF. The model is suitable for social media content, marketing, and creative applications, and provides consistent output quality with fast generation speed. It supports user prompts to guide the video motion and style.

$0.441/per time$0.49/per time

Hailuo-02-Pro is a state-of-the-art AI video generation model developed by MiniMax. It produces professional-grade, high-definition 1080p videos up to 10 seconds long from text or image prompts. The model excels in realistic physics simulation, cinematic motions, and director-level controls such as camera angles and timing. It maintains visual and semantic consistency with low hallucination rates and is widely used for marketing, social media content, education, and prototyping.

$0.441/per time$0.49/per time

Hailuo-02-Pro image-to-video is an advanced AI video generation model by MiniMax that creates high-definition 1080p videos from a single input image combined with text prompts. It specializes in producing realistic cinematic motion with physics-based animation, including natural hair, water, and material interactions. The model supports detailed director controls such as camera movement and scene timing for professional-grade videos up to 10 seconds long. It delivers smooth, visually rich video with stable characters and accurate prompt interpretation, ideal for social media, marketing, and creative content. The workflow includes uploading an image, providing a descriptive prompt, choosing motion styles, and adjusting video length and settings for the final output.
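The upload-prompt-configure workflow described above can be sketched as a request payload. The endpoint schema and field names below ("model", "image_url", "duration", "resolution") are illustrative assumptions for this sketch, not MiniMax's documented API:

```python
# Sketch of an image-to-video request body; field names are assumptions
# for illustration, not MiniMax's published schema.

def build_i2v_request(image_url: str, prompt: str,
                      duration: int = 6, resolution: str = "1080p") -> dict:
    """Assemble a request body for a hypothetical image-to-video endpoint."""
    if duration not in (6, 10):
        raise ValueError("clips are 6 or 10 seconds long")
    return {
        "model": "hailuo-02-pro",
        "image_url": image_url,
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
    }

payload = build_i2v_request(
    "https://example.com/portrait.png",
    "slow dolly-in, hair moving in a light breeze",
)
```

A real integration would POST this payload to the provider's generation endpoint and poll for the finished clip.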

$0.09/per time$0.1/per time

Hailuo-02-fast is MiniMax’s advanced AI video generation model producing 1080p cinematic-quality videos up to 10 seconds from text or images. It features ultra-realistic physics simulation (fluid dynamics, collision, lighting), precise director-level camera control (pan, zoom, tracking), and consistent character rendering. Ranked #2 globally, it excels in fast, professional-grade video creation with rich motion and visual effects.

$0.09/per time$0.1/per time

WAN-2.2-Plus Text-to-Video is an advanced AI model that transforms text descriptions into professional, cinematic-quality videos. It uses a 5 billion parameter architecture to generate 720p videos at 24 frames per second. The model features sophisticated controls over lighting, camera angles, and motion dynamics to create visually rich, realistic, and fluid animations. It is fast, user-friendly, and designed for creators and commercial use.

$0.09/per time$0.1/per time

WAN-2.2-Plus Image-to-Video uses similar technology to animate static images, turning them into dynamic videos with natural, smooth motion. It supports complex camera movements and transitions, maintaining visual consistency and stability. The model outputs high-resolution videos and is optimized for consumer GPUs, making it accessible for both creative and professional applications. It enhances images by adding cinematic motion while preserving style and detail.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5-Chat is OpenAI's flagship multimodal chatbot interface powered by GPT-5, featuring adaptive reasoning (instant or chain-of-thought), 400K token context, reduced hallucinations (~45-80% fewer), and preset personalities (Cynic, Robot, Listener, Nerd). Excels in coding, writing, multilingual support, and real-world tasks via ChatGPT or API.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5-Chat image-to-text capability enables the model to process and analyze images alongside text inputs, generating accurate and detailed textual descriptions, visual question answers, document analysis, and multimodal reasoning. It supports various image formats and can interpret complex visuals such as charts, screenshots, and photos for use in applications like content creation, accessibility, and interactive assistants.
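A multimodal request of this kind pairs a text question with an image content part. The sketch below follows the widely used OpenAI Chat Completions message convention; only the message structure is assembled here, since actually sending it requires an API client and key:

```python
# Sketch of a multimodal chat message in the OpenAI Chat Completions style:
# one user turn combining a text question with an image content part.

def build_vision_messages(question: str, image_url: str) -> list[dict]:
    """Build a single user turn with text plus an image URL."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

messages = build_vision_messages(
    "Summarize the trend shown in this chart.",
    "https://example.com/revenue-chart.png",
)
```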

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5-Codex is developed by OpenAI. It is a version of GPT-5 specifically optimized for software engineering tasks, featuring advanced capabilities like dynamic thinking time adjustment, autonomous coding, and deep integration with developer tools. The model is a key part of OpenAI's efforts to enhance AI-assisted programming and software development.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-5-Codex/image-to-text is an advanced multimodal model from OpenAI, tailored for robust image-to-text conversion. Optimized for developers, it enables high-quality code generation and visual data extraction. Compared to the core GPT-5, this model offers specialized image understanding, faster processing, and enhanced accuracy for technical tasks. Key application areas include code review, documentation automation, and data analysis. Its speed and multimodal capability create unique value for tech-driven workflows.

$0.3/per time

Tripo3D v2.5 is an advanced AI-powered 3D modeling tool that generates high-quality 3D assets from single images and text prompts. It features improved geometric precision with sharper edges, enhanced PBR rendering for realistic materials, and seamless integration with tools like Blender and ComfyUI. It supports customizable styles, quad mesh topology, and efficient workflows for designers and game developers.

$0.01/per time

image-watermark-remover/image-to-image is a specialized deep learning AI model designed for removing watermarks from digital images. Leveraging advanced image-to-image translation techniques, it processes visual inputs to produce clean, watermark-free outputs. The model stands apart from baseline image models by its trained ability to detect and remove visible watermarks, making it essential for media restoration tasks, digital asset management, and visual quality enhancement in both professional and technical sectors.

$0.02/per time

The image-zoom/image-to-image model is an advanced AI generative tool specialized for transforming and enhancing images. Differing from base image models, it supports high-resolution processing with versatile image-to-image transfer capabilities. Ideal for creative, technical, and professional applications, the model focuses on speed, accuracy, and flexible API integration, making it especially attractive for developers and designers seeking adaptive image solutions.

$0.01/per time

image-upscaler/image-to-image is a modern AI model designed for image enhancement and transformation. Built by reputable AI teams, this model excels at converting low-resolution or noisy images into cleaner, higher-quality versions. Compared to basic upscaling models, it offers advanced processing, faster speeds, and reliable output consistency. It is ideal for developers working in imaging, creative industries, and technical workflows requiring fast, accurate results.

$0.001/per time

image-background-remover/image-to-image is an advanced AI model designed for fast and precise background removal from images. It specializes in image-to-image transformation, making it distinct from text-based or multi-task models. Developed to support creative, commercial, and automation workflows, it delivers high-speed processing and reliable output quality for developers. Compared to basic background removal tools, this model provides optimized accuracy, multi-format compatibility, and seamless API integration. Ideal for content creators, e-commerce, and digital design industries.

$0.02/per time$0.05/per time

Gemini 2.5 Flash Image HD is an advanced AI image generation and editing model with enhanced resolution and creative control. It supports blending multiple images, maintaining character consistency, and precise local edits through natural language prompts. The model enables users to perform tasks like background blurring, object removal, pose alteration, and colorization with real-world understanding.

$0.02/per time$0.05/per time

Gemini 2.5 Flash Image HD is a powerful image editing feature allowing precise, targeted transformations and local edits via natural language. It enables blending multiple images, maintaining character consistency, altering poses, removing objects, and colorizing photos with fast, high-quality output and real-world understanding for creative workflows.

Input:$0.7/1M tokens$1/1M tokens
Output:$3.5/1M tokens$5/1M tokens

Claude Haiku 4.5 is Anthropic’s fastest, most cost-effective small AI model, offering near-frontier reasoning and coding, 200K-token context, and extended “thinking” for deep logic. It excels in real-time applications, supports text/image input, and delivers rapid, reliable output at one-third the cost of larger frontier models.

Input:$0.7/1M tokens$1/1M tokens
Output:$3.5/1M tokens$5/1M tokens

Claude Haiku 4.5 features advanced file analysis capabilities, processing both text and images with a 200,000-token context window. It supports extended thinking for deeper reasoning, context awareness for sustained coherence in multi-session tasks, and the ability to interact with software interfaces. This makes it powerful for analyzing, summarizing, and extracting information from large documents and complex workflows seamlessly. It balances speed, cost, and near-frontier intelligence effectively.

Input:$0.7/1M tokens$1/1M tokens
Output:$3.5/1M tokens$5/1M tokens

claude-haiku-4-5-20251001 is a highly efficient AI language model from Anthropic’s Claude family. It is optimized for rapid and cost-effective text generation, coding, summarization, and professional workflows. Compared to its larger siblings like Claude Opus, it offers much faster response times with lower compute requirements, making it ideal for scalable chatbot experiences, customer support, and creative writing. Skilled at concise reasoning and dialogue, claude-haiku-4-5-20251001 is designed for developers and businesses who value agility, precise output, and seamless integration into high-volume applications.

$0.5/per time

Veo 3.1 generates smooth, high-quality videos by transforming a single image or multiple reference images into video sequences. It supports start-and-end frame control for seamless transitions, maintaining consistent characters and styles. Videos can be created in 720p or 1080p with synchronized audio, ideal for storytelling, marketing, and social media content creation.

$0.5/per time

Veo 3.1 converts detailed text prompts into vivid videos, demonstrating strong prompt understanding and cinematic style control. It produces realistic motion, character consistency, and audio synchronization with natural sounds and dialogue. This tool empowers creators to quickly generate professional, narrative-driven video content, supporting popular aspect ratios for various platforms.

$0.5/per time

veo3.1/reference-to-video generates videos guided by up to 3 reference images, keeping a subject's appearance and style consistent throughout the clip. Its robust architecture and faster processing set it apart from earlier models, making it well suited to storytelling, branded content, and generative video pipelines where characters or objects must stay visually coherent across shots.

$2.5/per time

Veo 3.1 Pro is Google's latest advanced AI video generation model designed for creating high-quality 8-second videos at 720p or 1080p with natively synchronized audio. It offers enhanced scene and shot control with features like multi-shot sequencing, reference-image guidance, and cinematic presets including lighting and camera effects. The model supports longer seamless video extensions, richer native audio including dialogue and environmental sounds, and precise editing tools for inserting or removing objects. Veo 3.1 Pro enables creators and enterprises to produce realistic, immersive, and consistent video content efficiently, perfect for media, marketing, and storytelling applications.

$2.5/per time

Veo 3.1 Pro image-to-video is an advanced feature of Google DeepMind’s Veo 3.1 AI video generation model that transforms single still images or pairs of start and end frames into high-fidelity, cinematic 1080p videos with native synchronized audio. It supports up to 3 reference images for maintaining visual consistency and offers rich creative controls including multi-shot sequencing, realistic camera motion, lighting effects, and voice-synced dialogue. This capability is designed for content creators and enterprises needing professional-quality video production with flexible scene management and enhanced prompt adherence.

$0.5/per time

Veo 3.1 Fast is a fast and cost-effective version of Google's Veo 3.1 AI video generation model that produces 4-8 second 1080p videos with synchronized native audio in under 60 seconds. It supports both text-to-video and image-to-video workflows for rapid content creation with cinematic motion and ambient sounds.

$0.5/per time

Veo 3.1 Fast image-to-video enables converting a single image into a dynamic video clip with guided motion, narrative, and sound through optional text prompts, delivering smooth and realistic audiovisual experiences quickly.

$0.5/per time

Veo 3.1 Fast reference-to-video allows using 1-3 reference images to maintain subject consistency and appearance throughout the video, ensuring continuity for characters or objects in complex scenes. This is ideal for storytelling and content requiring visual coherence across frames.

$0.0384/per time$0.048/per time

Seedance-1-0-pro-250528 is ByteDance's pro-grade Seedance 1.0 video generation model variant, supporting text-to-video (T2V) and image-to-video (I2V) for 5-10s clips at up to 1080p resolution and 24 FPS. It excels in multi-shot cinematic sequences with smooth motion, camera control (pan/zoom/drone), style diversity, and temporal consistency.

$0.0384/per time$0.048/per time

Seedance-1-0-pro-250528 image-to-video is a ByteDance AI model that converts images into high-quality 1080p videos with smooth, natural motion and cinematic camera effects like panning and zooming. It supports multi-shot sequences, dynamic scene transitions, and diverse visual styles, ideal for storytelling, branded content, and complex narratives. It offers fine-grained control over motion intensity, video length, and resolution.

$0.042/per time$0.07/per time

Grok-2-image is xAI's multimodal vision model for image analysis, text descriptions, visual Q&A, and content creation. It processes 4K images (JPG/PNG/PDF) with low latency (<500ms), supports real-time apps, and integrates with the X platform. It outperforms GPT-4 Vision in efficiency for e-commerce, healthcare, and marketing.

$1.2/per time

Sora-2-Pro is OpenAI’s most advanced AI video generation model that produces short videos with synchronized visuals and sound from text or image prompts. It enhances realism, motion physics, and audio-video coherence—delivering narrative-driven clips with accurate lip-sync, ambient sound, and expressive motion, making it ideal for creative professionals and content creators.

$1.2/per time

Sora-2-Pro image-to-video is an advanced AI model by OpenAI that generates high-quality videos with synchronized audio from single images and text prompts. It supports resolutions up to 1792x1024 and produces clips up to 25 seconds long. This model excels in realistic motion, physics, lip-sync, and cohesive sound, making it ideal for professional cinematic, marketing, and storytelling uses.

$0.0156/per time$0.039/per time

Gemini 2.5 Flash Image, also known as Nano Banana, is Google’s advanced AI model for fast, high-quality image generation and editing. It supports blending multiple images, consistent character rendering, and precise natural language editing. The model leverages real-world knowledge for context-aware visuals and offers various aspect ratios. It is cost-effective and production-ready.

$0.0156/per time$0.039/per time

Gemini-2.5-flash-image / image-edit enables precise modifications using natural language. It supports object removal, background changes, pose adjustments, and multi-image blending while maintaining character consistency. The model integrates real-world knowledge for context-aware edits and delivers fast, high-quality results.

$0.4/per time

Sora 2 text-to-video is OpenAI’s flagship AI model that generates high-fidelity, realistic videos directly from natural language prompts. It understands and simulates complex scenes, follows script-level instructions, and creates synchronized audio and persistent characters. Sora 2 excels in physical realism, cinematic quality, and multi-shot continuity for rapid content production and storytelling.

$0.4/per time

Sora 2 image-to-video transforms a single image into a dynamic, animated video sequence. It brings still images to life with realistic motion, scene continuity, and sophisticated effects, supporting advanced editing like inpainting or style transfers. The model preserves subject and background while animating the original content for engaging marketing, entertainment, and creative projects.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-5-20250929-thinking/text-to-text is a versatile AI language model from Anthropic, designed for high-quality text understanding and generation. It supports advanced reasoning, creative writing, and code assistance at high speed. Compared to legacy Claude models, it improves context handling, reasoning capability, and accuracy for professional workflows. Its reliability and focused text-to-text processing make it a robust choice for developers, data analysts, and content creators seeking safe, ethical AI assistance.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-5-20250929-thinking/file-analysis is an advanced AI model in the Claude Sonnet family by Anthropic. Designed for multi-modal file analysis, it supports robust natural language processing, code interpretation, document summarization, and contextual reasoning. Its strengths include fast file parsing, accurate data extraction, and seamless integration with complex workflows. Compared to the baseline Claude Sonnet 4.5, this variant emphasizes enhanced file analytic capabilities and developer-centric features. Its ability to process varied formats makes it ideal for technical teams requiring speed, reliability, and depth in business, legal, or research settings.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-5-20250929-thinking is a state-of-the-art AI model from the Claude family by Anthropic. It excels in natural language understanding, code generation, and advanced reasoning. This version stands out for its improved speed, higher context window, and robust multimodal abilities over earlier Sonnet variants. Designed for enterprise-grade scalability, it optimizes task-specific output for technical, creative, and analytical workflows. Its differences from base Claude models include larger input capacity and more consistent logic handling, making it an efficient tool for developers, businesses, and educators needing accurate, reliable AI solutions.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

Claude Sonnet 4.5 is Anthropic's top AI for coding, reasoning, and complex tasks, sustaining more than 30 hours of focused work with a 200K-token context (1M in beta). It excels in coding accuracy, finance, law, medicine, and computer use, with strong safety and alignment improvements.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

Claude Sonnet 4.5 file-analysis excels at creating and refining professional work deliverables like presentations, spreadsheets, and documents. It improves formula accuracy and logic as well as layout consistency in spreadsheets. It autonomously interprets, edits, and summarizes complex files, accelerating tasks like vulnerability detection and legal or financial document review with high accuracy and reliability.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-5-20250929/web-search is a cutting-edge AI model from Anthropic's Sonnet family. It provides fast, scalable performance for developers, blending advanced language understanding, code generation, and contextual search capabilities. Compared to base Sonnet models, it supports deep web context integration for richer, real-time outputs across technical and creative tasks. Ideal for businesses and professionals, it stands out in speed, accuracy, and context-driven intelligence.

Input:$10.5/1M tokens$15/1M tokens
Output:$52.5/1M tokens$75/1M tokens

claude-opus-4-1-20250805-thinking is a next-generation AI language model in the Claude family developed by Anthropic. It offers advanced performance for text generation, programming help, and analytical tasks. Compared to its predecessors, this model brings improved context understanding, increased speed, and enhanced multi-turn reasoning. Developers appreciate its reliability, safety-centric design, and scalability. Its strengths make it ideal for creative writing, intelligent automation, and knowledge-based solutions across various industries.

Input:$10.5/1M tokens$15/1M tokens
Output:$52.5/1M tokens$75/1M tokens

claude-opus-4-1-20250805-thinking/file-analysis is a state-of-the-art AI language model built for detailed file analysis, coding workflows, and structured data interpretation. Developed by Anthropic, this model advances the Claude family with faster multi-file processing and improved reasoning. It features robust context understanding and precise content extraction, making it ideal for professionals handling technical documentation, codebases, or large datasets. Compared to previous Claude models, claude-opus-4-1-20250805-thinking/file-analysis delivers enhanced speed and accuracy in file-oriented scenarios, as well as scalable support for complex files and multi-modal data.


$0.024/per time$0.03/per time

Seedream-4-0-250828 is ByteDance’s advanced text-to-image generation model capable of producing highly detailed, ultra-high-resolution (up to 4K) images by interpreting text prompts. It features fast processing, strong prompt adherence, and supports editing and multi-image blending, making it ideal for creative, commercial, and professional visual workflows.

$0.024/per time$0.03/per time

Seedream-4-0-250828 image-edit refers to the model’s advanced image editing capability, powered by natural language instructions. Users can upload an image and describe modifications, such as background replacement, object addition or removal, style changes, or attribute adjustments, and Seedream 4.0 applies these edits at professional quality with high feature retention and strong prompt adherence, all within seconds and up to 4K resolution.

$0.027/per time$0.03/per time

Wan 2.5 Text-to-Image generates high-quality, detailed images from text prompts, supporting artistic and realistic styles with resolutions up to 1440x1440. It offers flexible aspect ratios and prompt expansions, catering to creative, commercial, and multimedia applications.

$0.027/per time$0.03/per time

Wan 2.5 Image Edit allows instruction-based interactive editing of images or videos, enabling object removal, addition, or repositioning with natural language commands. This AI-powered editing integrates visual reasoning for refined and adaptive modifications.

$0.225/per time$0.25/per time

Wan 2.5 Text to Video creates cinematic videos up to 10 seconds long at 1080p from textual descriptions, with realistic motion, lighting, and rich temporal details. It also generates synchronized audio including voice and ambient sound, ideal for storytelling and marketing.

$0.135/per time$0.15/per time

Wan 2.5 Image to Video dynamically animates still images into videos, preserving scene structure, lighting, and perspective. It produces smooth, natural camera movements and transitions with audio synchronization, supporting diverse aspect ratios and high visual fidelity.

$0.28/per time$0.35/per time

Kling-v2.5-turbo-pro is a state-of-the-art AI video generator delivering high-quality, cinematic videos with realistic motion, advanced physics, and smooth transitions. It supports up to 10-second HD videos in multiple aspect ratios with up to 2500-character prompts, ideal for marketing, entertainment, education, and professional use.

$0.28/per time$0.35/per time

Kling-v2.5-turbo-pro text-to-video converts detailed text descriptions into dynamic videos featuring lifelike character expressions, natural movements, and advanced camera control. It offers rapid generation with professional-level output, supporting complex multi-step prompts and creative customization, suitable for social media, advertising, and storytelling applications.


Input:$0/1M tokens
Output:$36/1M tokens$60/1M tokens

Speech-2.5-turbo-preview is a high-definition text-to-speech model supporting 40 languages with natural, expressive voices. It offers fast, real-time streaming, precise voice replication, customizable parameters, and is suitable for conversational AI, content creation, and global applications requiring emotional nuance and low latency.

$0.5003/per time$0.8338/per time

Speech-2.5-turbo-preview-voice-clone is MiniMax's fast text-to-speech variant with integrated voice cloning, enabling realistic replication from 6-second audio samples across 40+ languages. It preserves accents, styles, and emotions with ultra-low latency streaming, ideal for real-time apps like personalized assistants and multilingual content.

$0.5003/per time$0.8338/per time

speech-2.5-turbo-preview-voice-clone is a state-of-the-art AI voice model designed for rapid, realistic speech synthesis and precise voice cloning. Built upon the Turbo family’s fast generation engine, this model achieves low-latency performance ideal for real-time applications. Unlike standard speech AI, it features advanced voice reproduction and customization capabilities, making it optimal for customer service, accessibility tools, and interactive media. With robust support for multi-speaker and dynamic modulation, it enables seamless integration into production workflows.

$0.0021/per time$0.0034/per time

Speech-02-turbo is MiniMax's real-time text-to-speech (TTS) AI model designed for ultra-low latency and high-speed audio generation. It supports 100+ voices across 30+ languages with customizable parameters such as pitch, speed, volume, and emotional expression. Ideal for interactive apps like gaming, virtual meetings, and live assistants, it delivers smooth, natural voice output with advanced voice cloning features.
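The customizable parameters mentioned above can be sketched as a small settings builder. The parameter names ("speed", "pitch", "vol", "emotion") and their ranges are assumptions for illustration, not MiniMax's documented schema:

```python
# Sketch of voice-setting parameters for a hypothetical TTS request.
# Names and ranges are illustrative assumptions, not a documented schema.

def build_tts_settings(speed: float = 1.0, pitch: int = 0,
                       vol: float = 1.0, emotion: str = "neutral") -> dict:
    """Validate and collect voice parameters for a TTS call."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed outside supported range")
    if not 0.0 < vol <= 2.0:
        raise ValueError("volume outside supported range")
    return {"speed": speed, "pitch": pitch, "vol": vol, "emotion": emotion}

settings = build_tts_settings(speed=1.2, emotion="happy")
```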

$0.0082/per time$0.0137/per time

Speech-02-HD is a high-definition text-to-speech (TTS) AI model developed by MiniMax, designed for producing natural, human-like voice output with studio-grade clarity. It supports over 30 languages and 300+ voices, offers advanced features like emotion and pitch control, voice cloning from short samples, and real-time streaming with low latency. It is ideal for professional voiceovers, audiobooks, and interactive applications requiring high-quality, expressive speech synthesis.

$0.5003/per time$0.8338/per time

Speech-2.5-hd-preview-voice-clone is an advanced AI speech model by MiniMax that offers ultra-realistic, high-definition voice cloning and text-to-speech synthesis. It can clone a person's voice from just seconds of audio and generate natural-sounding speech in 40+ languages, preserving accent and emotion even across languages. It supports detailed voice customization, real-time synthesis, and produces studio-quality expressive audio for applications like narrations, voiceovers, and interactive voice systems.

$0.5003/per time$0.8338/per time

speech-2.5-hd-preview-voice-clone is an advanced AI model specializing in high-definition voice cloning and speech synthesis. It delivers lifelike, expressive audio outputs suited for entertainment, customer interaction, accessibility, and more. Compared to foundational speech-2.5-hd models, the voice-clone variant offers more nuanced cloning, richer prosody, and flexible adaptation to user voice samples. Its efficient processing supports real-time deployment and precise control, standing out for professionals seeking reliable, high-quality voice generation across multimedia and service applications.

Input:$0/1M tokens
Output:$60/1M tokens$100/1M tokens

Speech-2.5-hd-preview is MiniMax's high-definition text-to-speech (TTS) model preview, featuring ultra-realistic voices, enhanced multilingual support (40+ languages), precise voice cloning (6-second clips), and real-time streaming. It offers customizable pitch, speed, emotion, and natural pronunciation for professional audio generation up to 5000 characters.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

Gemini-2.5-flash-nothinking is a version of Google’s Gemini 2.5 Flash model with the reasoning ("thinking") feature turned off to prioritize speed and low latency. It offers fast, efficient responses suitable for simpler or high-throughput tasks where deep reasoning is unnecessary. Developers can control the "thinking budget" via API to balance quality, cost, and latency, with non-thinking mode delivering quicker outputs at a lower cost.
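Disabling thinking comes down to setting the thinking budget to zero in the request. The sketch below builds a Gemini REST-style request body; the `generationConfig.thinkingConfig.thinkingBudget` field follows Google's published REST schema, but verify against the current API docs before relying on it:

```python
# Request body for a Gemini REST-style call with thinking disabled:
# generationConfig.thinkingConfig.thinkingBudget = 0 skips the reasoning
# phase for lower latency (field names per Google's REST schema; verify
# against current documentation).

def build_request(prompt: str, thinking_budget: int = 0) -> dict:
    """Assemble a generateContent request body with a thinking budget."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }

body = build_request("Classify this support ticket as billing or technical.")
```

Passing a larger budget re-enables reasoning, trading latency and cost for answer quality.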

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

Gemini-2.5-flash-nothinking image-to-text is a mode of the Google DeepMind Gemini 2.5 Flash model that supports fast image understanding and optical character recognition (OCR), disabling deep reasoning to prioritize speed. It excels at quickly extracting readable text from images for real-time applications. This model balances multimodal capabilities with low latency, suiting automation, technical analysis, and integrated developer workflows that require rapid visual text extraction.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

gemini-2.5-flash-nothinking/file-analysis is a next-generation AI model from Google Gemini’s 2.5 family, specialized in fast, multimodal file and text analysis. It delivers rapid, context-aware processing of documents and images, making it ideal for file-heavy workflows and enterprise data tasks. Compared to other Gemini models, it focuses on speed and minimal latency with streamlined reasoning, enabling seamless integration in real-time applications. Its high accuracy and multimodal engine distinguish it from GPT and Claude, supporting scenarios requiring efficient analysis of large or complex files across industries.

$0.0242/per time$0.0285/per time

Doubao Seedream 4.0-250828 is a high-speed, multimodal AI image generator from ByteDance’s Doubao team that produces ultra-high-resolution (up to 4K) images from text and image prompts in seconds. With advanced editing features, multi-image input support, and strong consistency, it is well suited to professional artwork, advertising, and commercial design workflows.

$0.0242/per time$0.0285/per time

doubao-seedream-4-0-250828/image-edit is a cutting-edge multimodal AI model developed by ByteDance’s Doubao team. It specializes in automated image editing, creative enhancement, and visual content transformation. Featuring advanced neural architectures, the model integrates fast processing and high-fidelity output, making it ideal for design, marketing, and web applications. Compared to standard Seedream models, this variant offers optimized workflows for image inputs, more control over output styles, and extended compatibility with creative pipelines. Its differentiated features and scalability empower developers and creative professionals seeking reliable, adaptable AI-powered image solutions.

Input:$9/1M tokens$15/1M tokens
Output:$72/1M tokens$120/1M tokens

GPT-5 Pro is an advanced variant of GPT-5 designed for the most challenging and complex tasks. It features extended reasoning capabilities, allowing it to think longer and produce more comprehensive and accurate answers than the standard GPT-5. GPT-5 Pro achieves state-of-the-art performance on difficult benchmarks, reduces major errors by 22%, and is aimed at professional and enterprise users requiring maximum AI performance and precision.

Input:$9/1M tokens$15/1M tokens
Output:$72/1M tokens$120/1M tokens

GPT-5 Pro supports image-to-text capabilities, allowing it to analyze and interpret visual content comprehensively. It can generate detailed, descriptive text from images, recognizing objects, scenes, and textual information within images. This feature enables applications like visual content analysis, enhanced image understanding, and multimodal interaction, making GPT-5 Pro highly effective for complex visual and textual tasks.

Input:$0.2432/1M tokens$0.2703/1M tokens
Output:$0.973/1M tokens$1.0811/1M tokens

DeepSeek-V3 is an open-source AI language model with 671 billion parameters and 37 billion activated per token. It uses a Mixture-of-Experts architecture and Multi-head Latent Attention for efficient, cost-effective inference and training. Supporting a 128,000-token context window, it excels in natural language understanding, reasoning, coding, and multilingual tasks, offering fast, accurate, and scalable performance for diverse applications.
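
Since DeepSeek serves its models through an OpenAI-compatible chat completions endpoint, a request body can be built like any OpenAI-style call. This is a hypothetical sketch: the base URL and the `deepseek-chat` model alias are assumptions drawn from DeepSeek's public documentation, not from this listing.

```python
DEEPSEEK_BASE_URL = "https://api.deepseek.com"  # assumed OpenAI-compatible endpoint

# Standard chat-completions payload; DeepSeek-V3 is addressed here via the
# assumed "deepseek-chat" alias.
request_body = {
    "model": "deepseek-chat",
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain Mixture-of-Experts in two sentences."},
    ],
    "max_tokens": 256,
}
```

Because the wire format is OpenAI-compatible, existing OpenAI SDK clients can be pointed at the DeepSeek base URL without changing the payload shape.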

$0.0315/per time$0.035/per time

Qwen-Image is a 20 billion parameter multimodal foundation model by Alibaba's Tongyi Qianwen team, specializing in high-quality image generation and precise image editing. It excels at complex text rendering, including multi-line layouts and fine details, supports multilingual input, and offers advanced editing features like style transfer, object replacement, and background generation. Qwen-Image is widely used for creative visual AI applications and available through APIs and open-source platforms.

Input:$0.495/1M tokens$0.55/1M tokens
Output:$1.9703/1M tokens$2.1892/1M tokens

DeepSeek-R1 is an advanced open-source AI model by DeepSeek designed for high-speed logical reasoning, problem-solving, and mathematical tasks. It uses a Mixture of Experts architecture combined with reinforcement learning and supervised fine-tuning to achieve powerful chain-of-thought reasoning, self-verification, and high accuracy. It excels in software development, complex reasoning, data analysis, and educational support across multiple languages.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o-2024-08-06/text-to-text is a modern OpenAI language model in the GPT-4o family, designed for high-speed, accurate text generation and processing. It supports advanced content creation, code assistance, and information retrieval for developers. Compared to previous GPT models, it offers improved context handling, faster response times, and robust reliability, making it ideal for technical, business, and educational tasks.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o-2024-08-06/image-to-text is OpenAI’s state-of-the-art multimodal model designed for fast and accurate image-to-text (OCR and captioning) tasks. Based on the GPT-4o architecture, it offers lightning-fast processing, robust recognition capabilities, and contextual understanding. Ideal for developers needing scalable solutions for document automation, accessibility, and data extraction. Compared to prior GPT models, it introduces native image handling and enhanced performance for mixed-modality workflows, making it a leading choice for modern multimodal applications.
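
A minimal sketch of the native image handling described above: a chat completions request that mixes a text instruction with an `image_url` content part, following the public OpenAI multimodal message schema. The image URL is a placeholder, not a real resource.

```python
# Hypothetical OCR/captioning request for gpt-4o-2024-08-06; the content list
# interleaves text and image parts in a single user message.
request_body = {
    "model": "gpt-4o-2024-08-06",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe any text visible in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/receipt.png"}},  # placeholder
        ],
    }],
}
```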

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o-2024-08-06/web-search is OpenAI’s latest GPT-4o variant optimized for multi-modal tasks including web search, text generation, coding, and image understanding. Its core upgrade lies in enhanced speed and context handling, integrating more accurate web results and image-to-text capabilities. Compared to prior GPT-4 models, it delivers quicker and richer outputs for developers and professionals across industries seeking powerful, scalable, and flexible AI solutions.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o-2024-08-06/file-analysis is a cutting-edge AI model from OpenAI, based on the GPT-4o family, tailored for in-depth, multimodal file analysis. It processes text, code, and image files efficiently. Developers rely on its fast, context-aware output and advanced reasoning to streamline workflows like content extraction, vulnerability identification, and document parsing. Compared to the base GPT-4o, it features enhanced file handling and robust analytical abilities. Ideal for technical teams needing powerful, scalable solutions for diverse input types across industries.

Input:$0.03/1M tokens$0.05/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-5-nano/text-to-text is an efficient, compact AI language model built for fast and accurate text processing. As part of the GPT-5 family, it is designed for developers and teams seeking high-throughput natural language tasks, like coding, content generation, and summarization. Compared to larger models, gpt-5-nano/text-to-text delivers optimized resource usage, predictable output, and faster response times, making it ideal for real-world applications where scalability and cost-efficiency are critical.

Input:$0.03/1M tokens$0.05/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-5-nano/web-search is a high-performance AI language model in the GPT-5 family, designed to combine fast, accurate text generation with real-time web search capabilities. Tailored for developers and technical professionals, it excels in coding tasks, data retrieval, and contextual responses using up-to-date web information. Compared to its base GPT-5 models, gpt-5-nano/web-search offers enhanced efficiency, smaller deployment footprint, and superior web integration, making it ideal for dynamic workflows that require seamless access to current data sources.

Input:$0.03/1M tokens$0.05/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-5-nano/file-analysis is a lightweight AI model specialized in fast file parsing, intelligent data extraction, and efficient document analysis. Building on the GPT-5-nano foundation, this version introduces optimized algorithms for rapid, scalable, and accurate file-oriented tasks. It is highly suitable for developers needing batch document processing, structured data extraction, and workflow automation. Its differentiating features include superior speed, minimal resource requirements, and seamless integration for enterprise and developer use cases, setting it apart from other large language models.

Input:$0.03/1M tokens$0.05/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-5-nano/image-to-text is a fast, compact multimodal AI model from the GPT-5 family, specialized in converting visual data to accurate text descriptions. Designed for developers needing speed and reliability, it blends efficient processing with high output quality. Compared to base GPT-5 models, it offers focused image understanding, faster inference, and optimized resource use. Ideal for document digitization, accessibility, and media workflows, its architecture enables stable API integration and scalable image-to-text conversion across industries.

Input:$0.15/1M tokens$0.25/1M tokens
Output:$1.2/1M tokens$2/1M tokens

gpt-5-mini/text-to-text is a streamlined AI model from the GPT-5 family, designed for quick text generation and code-oriented workflows. Its compact architecture offers faster response times and lower resource requirements than standard GPT-5 models. Ideal for developers, educators, and businesses needing scalable, lightweight solutions for everyday text tasks, it delivers reliable results with reduced infrastructure costs. gpt-5-mini/text-to-text bridges the gap between advanced AI and practical deployment at scale.

Input:$0.15/1M tokens$0.25/1M tokens
Output:$1.2/1M tokens$2/1M tokens

gpt-5-mini/file-analysis is a focused AI model designed for rapid and accurate file analysis tasks. Derived from the GPT-5-mini architecture, it offers optimized performance for text extraction, code review, and structured data parsing. Compared to the full GPT-5 line, gpt-5-mini/file-analysis provides lighter, faster processing and is ideal for situations demanding quick insights from documents, logs, or code files. Its unique differentiation lies in efficient context handling and specialized algorithms for file-based workflows. Suitable for IT, legal, finance, and research applications, it empowers developers and analysts with reliable file-driven AI capabilities.

Input:$0.15/1M tokens$0.25/1M tokens
Output:$1.2/1M tokens$2/1M tokens

gpt-5-mini/web-search is an efficient AI language model designed for high-speed web search, text generation, code help, and data analysis. Part of the GPT-5 family, it stands out for streamlined performance and real-time web integration. Unlike larger models such as GPT-5 or Gemini, gpt-5-mini/web-search specializes in fast queries and lightweight deployments. Its core strengths include quick information retrieval, accurate answers, and contextual web reasoning, making it a reliable solution for developers, researchers, and teams needing instant results. It is highly optimized for modern workflows where speed and relevance matter.

Input:$0.15/1M tokens$0.25/1M tokens
Output:$1.2/1M tokens$2/1M tokens

gpt-5-mini/image-to-text is a specialized AI model from the GPT-5-mini family, designed for rapid image-to-text conversion. Built on GPT's robust architecture, it focuses on delivering concise and accurate text outputs from images, supporting multimodal tasks. Compared to the base GPT-5-mini, this variant offers optimized image processing workflows and a streamlined API for faster performance. Industry professionals value its speed, reliability, and precise extraction—especially in document automation, data entry, and accessibility solutions.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-5/text-to-text is OpenAI’s latest-generation language model, optimized for multilingual text transformation, code assistance, and advanced analysis. Faster, smarter, and more context-aware than prior GPT models, it excels in generating accurate, reliable, and creative textual outputs. With improved reasoning and customization features, gpt-5/text-to-text is ideal for developers, enterprises, and researchers seeking scalable, AI-driven solutions. Unlike GPT-4, it offers more precise context handling and enhanced workflow integration for professional use.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-5/file-analysis is an advanced AI model tailored for in-depth file content understanding, code analysis, and structured data extraction. As a specialized variant of the GPT-5 model family, it stands out with optimized processing of large documents, accurate code interpretation, and robust data parsing capabilities. Unlike base GPT-5, gpt-5/file-analysis emphasizes targeted file workflows, making it ideal for developers, data analysts, and businesses that demand high-precision document or file-driven automation. It delivers scalable, reliable, and context-aware results across a spectrum of technical and business environments.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-5/web-search is an advanced AI model from the fifth-generation GPT family, optimized for real-time web information retrieval and multimodal tasks. It blends state-of-the-art language understanding with the ability to process textual and online data, offering rapid, accurate results for complex queries. Unlike GPT-4 and Claude, it stands out with native web search integration, enhanced speed, and superior context handling. Developers and enterprises use gpt-5/web-search for next-level code generation, business analysis, and dynamic content creation, benefiting from its reliability, scalability, and multi-modal input processing.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-5/image-to-text is a next-generation AI model built by OpenAI, focused on converting images into accurate, detailed textual descriptions. As an extension of the GPT-5 family, it merges multi-modal understanding with advanced vision capabilities. It excels in accessibility, content moderation, data labeling, and automated reporting. Unlike standard GPT-5, gpt-5/image-to-text specializes in visual context extraction and structured text generation from image inputs, offering faster inference, expanded compatibility, and robust accuracy for developers seeking seamless integration of multimodal intelligence.

$0.2842/per time$0.406/per time

Higgsfield Turbo is a speed-optimized version of the Higgsfield AI video generation platform. It offers approximately 1.5 times faster rendering speeds and around 30% cost savings compared to standard models. Turbo includes seven new motion styles for enhanced creative flexibility and priority queue access, making it ideal for rapid video creation, quick iterations, and exploring multiple styles efficiently. It maintains high-quality cinematic video outputs with professional camera movements and effects.

$0.0875/per time$0.125/per time

Higgsfield-lite is an advanced AI video generation model by Higgsfield AI, designed to quickly transform static images and text prompts into short, cinematic video clips with lifelike motion and professional-grade camera effects. It enables creators to produce visually engaging videos with sophisticated lighting, smooth transitions, and dynamic animations, all through an intuitive platform that requires no advanced technical skills. Higgsfield-lite emphasizes fast video creation, realistic character animation, and flexible format support optimized for social media and marketing content.

$0.3941/per time$0.563/per time

Higgsfield-Standard is an AI video generation model producing 3–5 second cinematic clips with lifelike movement and professional camera effects. It features over 50 motion presets, style filters, and prompt enhancement via large language models. Designed for creators and marketers, it balances speed and quality, enabling easy video creation from text or images without advanced skills or editing software.

Input:$0.09/1M tokens$0.15/1M tokens
Output:$0.36/1M tokens$0.6/1M tokens

GPT-4o-mini is OpenAI's cost-efficient small model, outperforming GPT-3.5 Turbo on benchmarks. It offers a 128K-token context window, text and image inputs, and up to 16K output tokens. It excels in reasoning, coding, multilingual tasks, and function calling at $0.15/M input and $0.60/M output tokens, making it ideal for chatbots, real-time apps, and high-volume use.
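
The function-calling capability mentioned above can be sketched as a chat completions payload carrying a `tools` array, per the public OpenAI schema. The weather tool here is an invented example, not part of any real API.

```python
# Hypothetical tool definition: the model can elect to "call" this function
# by returning its name and JSON arguments instead of plain text.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```

On a real call, the response would contain a `tool_calls` entry with arguments like `{"city": "Paris"}` for the application to execute.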

Input:$0.09/1M tokens$0.15/1M tokens
Output:$0.36/1M tokens$0.6/1M tokens

GPT-4o-mini supports image-to-text capabilities as part of its multimodal features. It can process image inputs to provide detailed textual descriptions, perform OCR, extract information, and interpret visual content in various applications like document analysis and data extraction. It offers 128K token context with strong accuracy and cost-efficiency for vision-language tasks.

Input:$10.5/1M tokens$15/1M tokens
Output:$52.5/1M tokens$75/1M tokens

claude-opus-4-1-20250805 is an advanced AI language model from Anthropic designed for precise text generation, coding, and data analysis. Building on the Claude 4 architecture, it delivers improved speed, accuracy, and understanding for complex developer workflows. This model stands out through strong reasoning, safe outputs, and adaptive capabilities—making it ideal for business, research, and technical teams requiring context-rich, reliable AI performance. Compared to previous Claude models, the opus-4-1 generation offers enhanced multi-step logic and broader integration support.

Input:$10.5/1M tokens$15/1M tokens
Output:$52.5/1M tokens$75/1M tokens

claude-opus-4-1-20250805/file-analysis is a cutting-edge model in the Claude Opus AI family, specialized for deep file analysis, document parsing, and advanced text processing. Building on Claude Opus 4’s multimodal and scalable architecture, this variant boosts accuracy and speed for complex file-driven workflows. It is engineered for developers and data professionals needing robust solutions for bulk document extraction, code review, and context-aware content analysis. Differentiated by its optimized file handling, this model excels in enterprise, legal, research, and engineering settings, delivering reliable, secure, and detailed outputs even on challenging datasets.

Input:$10.5/1M tokens$15/1M tokens
Output:$52.5/1M tokens$75/1M tokens

claude-opus-4-1-20250805/web-search is a state-of-the-art AI model from Anthropic’s Claude series, engineered for advanced natural language tasks with integrated real-time web search. It blends large-scale reasoning, coding, and enterprise security with rapid access to the latest online data, setting it apart from earlier Claude or GPT generations. The model is designed for developers and professionals seeking highly reliable, up-to-date AI analysis, automated research, and context-enriched content generation.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.9706/1M tokens$1.1419/1M tokens

Doubao-seed-1-6-thinking-250715 is a ByteDance ARK multimodal LLM variant from the Seed 1.6 series, optimized for deep thinking across reasoning, coding, and math. It supports a 256K context window (max 224K input), 32K output tokens, text/image/video inputs, and JSON outputs via the /v1/chat/completions API.
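
Since the listing names an OpenAI-style /v1/chat/completions route and JSON outputs, a request can be sketched as below. The `response_format` field and the exact payload shape are assumptions based on that OpenAI-compatible convention.

```python
# Path named in the model card; the host it is served from is deployment-specific.
ARK_CHAT_ENDPOINT = "/v1/chat/completions"

# Hypothetical request asking the thinking model for structured JSON output.
request_body = {
    "model": "doubao-seed-1-6-thinking-250715",
    "messages": [
        {"role": "user",
         "content": "Summarize this contract as JSON with keys: parties, term, fees."},
    ],
    "response_format": {"type": "json_object"},  # request machine-readable output
    "max_tokens": 4096,  # well under the model's stated 32K output ceiling
}
```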

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.9706/1M tokens$1.1419/1M tokens

Doubao-seed-1-6-thinking-250715 image-to-text accepts multimodal inputs (text, images, video) and generates text outputs such as descriptions, OCR, visual reasoning, and chart analysis via the /v1/chat/completions API. With 256K context and a step-by-step thinking mode, it excels at complex visual tasks such as document processing and exam problem-solving.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.9706/1M tokens$1.1419/1M tokens

Doubao-seed-1-6-thinking-250615 is an advanced ByteDance multimodal model variant optimized for deep reasoning and complex problem-solving. It supports 256K-token context, handling text, images, and video inputs with up to 16K tokens output. Key features include a hybrid sparse attention mechanism, enhanced embedding spaces, and extensive multimodal training, enabling superior understanding, logical deduction, and real-time efficiency.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.9706/1M tokens$1.1419/1M tokens

Doubao-seed-1-6-thinking-250615 image-to-text leverages its native vision-language model (VLM) integration for accurate visual understanding, including detailed descriptions, OCR on high-res images, chart/diagram reasoning, and multimodal chain-of-thought deduction. It processes images with 256K text context for complex queries.

Input:$0.0172/1M tokens$0.0203/1M tokens
Output:$0.1815/1M tokens$0.2135/1M tokens

Doubao-seed-1.6-flash is a high-speed multimodal deep-thinking model supporting low-latency inference (around 10ms) with strong text and image understanding. It handles image-to-text and text-to-text tasks efficiently, with a 256K-token context window and up to 16K output tokens. It's designed for real-time interaction and complex visual/text reasoning.

Input:$0.0172/1M tokens$0.0203/1M tokens
Output:$0.1815/1M tokens$0.2135/1M tokens

Doubao-seed-1.6-flash image-to-text processes images alongside text prompts to generate detailed descriptions, visual reasoning, OCR, chart analysis, and object recognition at ultra-low latency (10ms TPOT). Its visual capabilities match pro-series competitors while supporting 256K context for complex multimodal queries.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.2424/1M tokens$0.2851/1M tokens

Doubao-seed-1.6 is ByteDance's multimodal deep-thinking LLM family with 256K context, supporting text/images/video inputs and up to 16K outputs. Variants include seed-1.6 (all-round), -thinking (coding/math/logic boost), and -flash (low-latency). Excels in reasoning, tool-calling, and agentic tasks at reduced cost.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.2424/1M tokens$0.2851/1M tokens

Doubao-seed-1.6 text-to-text capability means the model can understand and generate high-quality text responses from text inputs. It supports long contexts (up to 256K tokens), advanced deep reasoning, complex problem-solving, and multi-turn conversations. It excels in language tasks like question answering, summarization, code generation, and insights across diverse topics.

Input:$0.36/1M tokens$0.6/1M tokens
Output:$7.2/1M tokens$12/1M tokens

GPT-4o-mini-tts is OpenAI's text-to-speech model built on GPT-4o mini, generating natural, expressive speech from text with customizable voices, emotions, accents, and multilingual support (50+ languages). It supports real-time streaming, up to 2,000 tokens, and prompt-based styling for audiobooks, voice agents, and interactive apps via API.
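
The prompt-based styling described above can be sketched as a speech request payload. The `instructions` field carries the styling prompt per the public OpenAI audio API; the voice name and output format below are assumptions.

```python
# Hypothetical text-to-speech request for gpt-4o-mini-tts.
tts_request = {
    "model": "gpt-4o-mini-tts",
    "voice": "alloy",  # assumed built-in voice name
    "input": "Chapter one. The directory of models begins here.",
    # Prompt-based styling: tone, pacing, and accent are steered in plain language.
    "instructions": "Read warmly, at a relaxed audiobook pace.",
    "response_format": "mp3",
}
```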

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

Gemini 2.5 Pro excels in complex text generation and understanding, with a massive context window of up to 1 million tokens. It supports nuanced conversation, multi-step reasoning, and API tool integration for dynamic data access. The model is optimized for expressive, coherent interactions across 24+ languages, making it ideal for advanced question answering, writing, summarization, and coding assistance.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

Gemini 2.5 Pro enables high-quality image generation from text prompts with detailed control over style, composition, and content. It maintains character consistency and supports multi-image blending and precise edits. The model’s real-world knowledge integration ensures context-aware visuals. Available through Gemini API and Google AI Studio, it suits creative tasks and commercial applications needing fast, accurate image rendering.

Input:$0.75/1M tokens$1.25/1M tokens
Output:$6/1M tokens$10/1M tokens

Gemini 2.5 Pro offers powerful file analysis capabilities using an extensive token context window. It can interpret and summarize large documents, extract insights from images, code, and video, and understand multimodal inputs. Its reasoning extends across diverse data types, enabling complex workflows involving research, data mining, and content synthesis. This multimodal understanding enhances productivity in enterprise and research environments.

Input:$3.6/1M tokens$6/1M tokens
Output:$6/1M tokens$10/1M tokens

GPT-4o-transcribe is OpenAI's advanced speech-to-text model leveraging GPT-4o for superior audio transcription, outperforming Whisper v3 with lower word error rates across 50+ languages. It features a 16K-token context, a 2K-token output limit, real-time WebSocket streaming, noise cancellation, speaker separation, and semantic understanding for meetings, voice agents, and live captioning via API.

Input:$3.6/1M tokens$6/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o-transcribe/audio-to-text is a high-performance audio transcription model by OpenAI, designed to convert speech to text with remarkable accuracy in real time. Built on the GPT-4o architecture, it extends core text understanding with advanced audio handling. The model supports multiple languages, fast response, and robust diarization, making it ideal for industries such as media, education, legal, and healthcare. Compared to standard GPT family models, gpt-4o-transcribe/audio-to-text delivers specialized audio recognition, optimized workflows, and scalable deployment for developers seeking seamless multimodal integration and reliable transcription solutions.
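
As a minimal sketch of calling this model, the dict below mirrors the form fields a client would send alongside an audio upload in a multipart transcription request, following the public OpenAI audio transcription schema. The filename and language hint are stand-ins.

```python
# Hypothetical multipart form fields for a transcription request; the binary
# audio stream itself (e.g. open("meeting.wav", "rb")) would be attached as
# the "file" part by the HTTP client.
transcription_fields = {
    "model": "gpt-4o-transcribe",
    "language": "en",          # optional hint; the model detects language otherwise
    "response_format": "text", # plain transcript rather than structured JSON
}
```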

Input:$1.8/1M tokens$3/1M tokens
Output:$9/1M tokens$15/1M tokens

Grok 4 is xAI’s most advanced AI language model with 1.7 trillion parameters, offering highly improved reasoning, a massive 130,000-token context window, and multimodal capabilities including text and images. It excels in complex tasks such as scientific research, coding, and real-time data analysis, integrating live data from platforms like X to provide dynamic, accurate responses.

Input:$1.8/1M tokens$3/1M tokens
Output:$9/1M tokens$15/1M tokens

grok-4/image-to-text is a fourth-generation multimodal AI model from the Grok family, specialized in fast and reliable image-to-text conversion. It supports automated content extraction, object recognition, and enhanced accessibility. Unlike previous Grok models, grok-4/image-to-text delivers improved processing speed and better contextual understanding for visual inputs. Its distinct multimodal capabilities and focus on image interpretation set it apart from text-only models like GPT-4 or Claude, making it a robust choice for developers seeking scalable solutions across media analysis, digital archiving, and workflow automation.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1-2025-04-14/text-to-text is an advanced natural language AI model from OpenAI’s latest GPT-4.1 generation, specializing in complex text generation, intelligent code assistance, and nuanced data processing. Designed for enterprise reliability and developer productivity, it delivers more precise outputs, faster inference, and improved context understanding compared to earlier versions. Tailored for text-to-text tasks, it outperforms many general models in structured content creation, professional communication, and scalable document workflows.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1-2025-04-14/image-to-text is a state-of-the-art multimodal AI model by OpenAI, designed for fast and accurate image-to-text conversion. Building on the GPT-4 foundation, it features optimized image understanding and detailed textual output, making it ideal for technical, educational, and enterprise workflows. Its efficiency, multi-format support, and robust performance set it apart from traditional language-only models, offering developers superior flexibility and advanced vision-language capabilities.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1-2025-04-14/web-search is a next-generation large language model from OpenAI, built for advanced tasks such as dynamic text generation, coding assistance, and in-depth research. Leveraging the GPT-4.1 architecture, it seamlessly integrates up-to-date web search, enabling precise answers with real-time references. This model stands out due to its improved speed, enhanced accuracy, and robust comprehension of complex queries, making it ideal for developers, enterprises, and technical teams seeking accurate, scalable AI-powered insights.

Input:$0.0965/1M tokens$0.1135/1M tokens
Output:$0.2424/1M tokens$0.2851/1M tokens

Doubao-1-5-pro-32k-250115 is a specific version of ByteDance’s Doubao 1.5 Pro large language model with a 32K-token context window, tuned for strong reasoning and enterprise use. It uses a sparse Mixture-of-Experts architecture for high performance and efficiency, and the “250115” suffix denotes a particular dated build/release of this 32K variant for stable deployment tracking.

Input:$0.3641/1M tokens$0.4284/1M tokens
Output:$1.0924/1M tokens$1.2851/1M tokens

Doubao-1-5-vision-pro-32k-250115 is a multimodal Doubao 1.5 Vision Pro model variant from ByteDance that supports both text and image input with a 32K-token context window. It is optimized for visual reasoning, document understanding, and detailed image analysis.

Input:$0.3641/1M tokens$0.4284/1M tokens
Output:$1.0924/1M tokens$1.2851/1M tokens

Doubao-1.5-Vision-Pro-32K-250115 is a multimodal model supporting image-to-text, visual reasoning, and OCR. It analyzes images, generates precise descriptions, interprets charts, and answers visual questions. With a 32K context window and advanced vision–language fusion, it delivers reliable professional-grade understanding for captioning, document reading, and complex visual analysis.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

Gemini 2.5 Flash is Google’s lightweight, ultra-fast AI model optimized for real-time, high-volume tasks with up to 1 million tokens context. It prioritizes speed and efficiency while maintaining strong reasoning capabilities and tool integration, making it ideal for quick writing, summarizing, and data extraction.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

Gemini 2.5 Flash Image-to-Text processes images to generate detailed, analytical descriptions, enabling advanced vision-language workflows with fast, precise responses. It supports tasks like multi-image fusion, targeted edits, and reading hand-drawn diagrams, leveraging world knowledge for real-world understanding.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$1.5/1M tokens$2.5/1M tokens

Gemini 2.5 Flash File Analysis specializes in parsing and summarizing complex documents and datasets, accelerating legal, financial, and vulnerability reviews by providing clear, actionable insights with high accuracy and efficiency.

$1.28/per time$3.2/per time

Veo 3 Pro is a subscription tier of Google's Veo 3 AI video generation model that offers up to 100 video generations per month at 720p resolution and 24 FPS with 8-second clip duration. It includes native synchronized audio generation, advanced prompt adherence for cinematic control, and realistic physics-based motion and lighting. Pro users may need third-party tools for watermark removal and video hosting, and it’s optimized for professional-quality short videos.

$1.28/per time$3.2/per time

Veo 3 Pro image-to-video is a premium feature of Google’s Veo 3 AI video generation model that transforms still images into dynamic 8-second videos with synchronized native audio and cinematic motion. It offers advanced creative controls to guide motion, narrative, and sound from the input image and optional text prompts. This capability supports professional-quality video creation with realistic animations, voiceovers, and sound effects via the Gemini API, aimed at creators seeking high fidelity and artistic control.

$0.48/per time$1.2/per time

Veo 3 Fast is a streamlined, speed-optimized version of Google's Veo 3 AI video generation model. It produces high-fidelity, 8-second video clips at 1080p with synchronized native audio in under one minute, significantly faster than the standard Veo 3. Veo 3 Fast supports both text-to-video and image-to-video workflows and is designed for rapid content iteration, enterprise use, and scalable video production. It features embedded SynthID watermarking and legal indemnity for enterprise users.

$0.48/per time$1.2/per time

Veo 3 Fast image-to-video is a rapid, cost-effective AI feature from Google's Veo 3 model that creates high-quality videos from a single still image with synchronized audio. It supports guiding the motion, narrative, and sound via text prompts alongside the image. Veo 3 Fast delivers smooth, cinematic motion sequences in under a minute, ideal for quick iterations and scalable video production through the Gemini API.

$0.032/per time$0.04/per time

Flux Kontext Pro is an advanced AI image editing tool designed for precise, context-aware editing using natural language instructions. It supports both local and large-scale scene changes while preserving character consistency and visual quality. Users can modify text, change backgrounds, adjust styles, and perform multi-turn iterative edits. It offers fast, high-quality results with compatibility for various image formats and workflows.

$0.032/per time$0.04/per time

flux-kontext-pro/text-to-image is a next-generation AI model for text-to-image synthesis. Developed by the Flux research team, it specializes in converting textual prompts into detailed visual outputs with high fidelity and speed. It supports scalable workflows and API integration for tech-oriented use cases. The model stands out for its precise rendering, interpretability controls, and flexible deployment options, differing from base models by improved context retention and output quality. Ideal for creative, engineering, and research application scenarios.

$0.064/per time$0.08/per time

Flux Kontext Max (FLUX.1 Kontext [max]) is an advanced AI model by Black Forest Labs for high-resolution, precise image generation and editing. It delivers superior prompt adherence, detailed rendering, and advanced typography control. It supports complex scene transformations, maintains character consistency, and enables high-quality automated creative workflows in enterprise and design applications.

$0.064/per time$0.08/per time

flux-kontext-max/text-to-image is a state-of-the-art model for generating high-quality images from textual input. Built by the Flux AI team, it focuses on speed, multimodal integration, and advanced control. Compared to its foundational variants, flux-kontext-max delivers faster rendering and improved fidelity, making it ideal for creative design, prototyping, and visual content development. It suits industries needing reliable text-to-image capabilities, offering flexible API support and scalable deployment.

Input:$1.8/1M tokens$3/1M tokens
Output:$9/1M tokens$15/1M tokens

Grok-3-reasoner-r is an enhanced reasoning variant of xAI’s Grok 3 model that emphasizes robust, multi-step problem solving with an extended reasoning budget. It dynamically allocates compute to deeply analyze and refine answers, providing highly accurate step-by-step solutions for complex tasks in mathematics, science, and programming. This version offers improved reliability and transparency through detailed reasoning traces and error correction.

Input:$0.18/1M tokens$0.3/1M tokens
Output:$0.3/1M tokens$0.5/1M tokens

Grok-3-mini is a lightweight, cost-effective reasoning model developed by xAI. It supports text-only input and offers a large context window of up to 131,072 tokens. Grok-3-mini excels at logic-based tasks that don't require deep domain knowledge and provides accessible reasoning traces for transparency. It supports function calling, structured outputs, and adjustable "thinking effort" for simple to complex queries, making it ideal for high-volume, cost-sensitive applications requiring scalable reasoning.
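The adjustable "thinking effort" mentioned above maps to a request parameter on xAI's OpenAI-compatible chat endpoint. A minimal sketch of the request body, assuming the `reasoning_effort` field and its `"low"`/`"high"` values as documented by xAI (treat both as assumptions to verify):

```python
import json

def build_chat_request(question: str, effort: str = "low") -> dict:
    """Build an OpenAI-compatible chat body for grok-3-mini with a chosen
    reasoning budget: "low" for quick answers, "high" for deeper multi-step work."""
    assert effort in ("low", "high")
    return {
        "model": "grok-3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": question}],
    }

req = build_chat_request("Is 1009 prime?", effort="high")
print(json.dumps(req, indent=2))
```

Because the endpoint is OpenAI-compatible, the same body works with any OpenAI-style client pointed at xAI's base URL.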

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514 is the latest generation AI model from Anthropic's Claude family, offering balanced performance between speed and advanced reasoning. It supports both text and multi-modal inputs, provides reliable outputs for coding, data analysis, and business automation, and stands out with improved context windows and creative capabilities over previous Claude models. Designed for developers and enterprises, claude-sonnet-4-20250514 excels in complex tasks, scalable integration, and enhanced content safety. This model delivers a unique combination of fast responses and high accuracy, making it ideal for real-world, professional scenarios.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514/file-analysis is a specialized version of Anthropic’s Claude Sonnet 4 family focused on advanced file understanding, content extraction, and natural language response. This model excels at parsing complex documents, source code, and structured data efficiently, delivering context-rich, high-quality outputs. It stands out from general-purpose models by offering rapid file-specific insights and deeper contextual accuracy. Designed for professionals in data-driven fields, it merges Claude’s core strengths with tailored file analysis capabilities, enabling streamlined workflows for developers, researchers, and analysts seeking precise, scalable AI solutions.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514/web-search is a next-generation AI language model from Anthropic's Claude family, designed for advanced text understanding, coding, content generation, and enhanced real-time information retrieval through web search. It delivers high-speed, context-aware responses with a balanced focus on creativity, ethical alignment, and factual accuracy. Compared to previous Sonnet or Claude models, this version features updated training, broader knowledge integration, and more robust support for web-augmented queries, making it a top choice for professionals requiring dependable AI for research, coding, writing, and complex problem solving.
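To ground the web-search capability described above, here is a sketch of an Anthropic Messages API body that enables the hosted web search tool. The tool type string and field names follow Anthropic's public documentation at the time of writing; confirm them against the current API reference before relying on them.

```python
import json

def build_search_request(query: str, max_uses: int = 3) -> dict:
    """Build a Messages API body with the hosted web_search tool enabled.
    The versioned tool type string may change between API releases."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": [{
            "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": max_uses,  # cap on how many searches the model may run
        }],
        "messages": [{"role": "user", "content": query}],
    }

print(json.dumps(build_search_request("Latest stable Python release?"))[:80])
```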

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514-thinking is a state-of-the-art AI language model from Anthropic's Claude Sonnet series, designed for deep reasoning, creative writing, and advanced code understanding. It features fast, scalable performance, improved context retention, and strong multimodal support. Compared to previous Claude Sonnet and base Claude iterations, this version delivers enhanced logic and accuracy for complex tasks, making it a smart choice for developers, analysts, and enterprise teams tackling intricate workflows.
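The "-thinking" suffix corresponds to Anthropic's extended-thinking mode, which is switched on per request. A minimal sketch of the Messages API body, assuming the documented `thinking` block shape (the budget value here is purely illustrative):

```python
def build_thinking_request(prompt: str, budget_tokens: int = 10_000) -> dict:
    """Build a Messages API body with extended thinking enabled.
    max_tokens must exceed the thinking budget, since the budget is
    drawn from the same output allowance."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": budget_tokens + 2048,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_thinking_request("Prove that the square root of 2 is irrational.")
```

Responses then interleave thinking blocks (the reasoning trace) with the final text blocks, which is what makes the step-by-step logic inspectable.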

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514-thinking/file-analysis is a powerful AI model from the Claude Sonnet 4 series, built to analyze and understand complex files efficiently. Tailored for developers and technical professionals, it offers rapid document parsing, precise code analysis, and advanced reasoning over structured data. Compared to other large language models like GPT, it excels in file-specific processing and delivers contextual, reliable outputs for demanding workflows. Its optimized architecture ensures balanced speed, accuracy, and creativity, making it a top choice for tasks like automated reporting, code generation, and technical research.

Input:$2.1/1M tokens$3/1M tokens
Output:$10.5/1M tokens$15/1M tokens

claude-sonnet-4-20250514-thinking is an advanced AI language model from Anthropic’s Claude family, designed for versatile tasks such as text generation, coding, and data analysis. Compared to base Claude models, it offers improved reasoning, speed, and context management. Its robust architecture delivers stable and creative outputs, making it ideal for developers, enterprises, and content professionals who prioritize reliable and scalable AI solutions.

Input:$1.8/1M tokens$2/1M tokens
Output:$7.2/1M tokens$8/1M tokens

o3/text-to-text is a next-generation AI language model specialized in converting prompts to high-quality text outputs. Developed for speed, versatility, and precision, it supports core tasks like content generation, programming help, and structured data transformation. Compared with other foundational models, o3/text-to-text emphasizes efficiency in workflow automation, stronger task-specific adaptation, and reliable output stability. It's ideal for developers and teams who prioritize seamless integration, scalable performance, and reliable linguistic intelligence within digital applications.

Input:$1.8/1M tokens$2/1M tokens
Output:$7.2/1M tokens$8/1M tokens

o3/image-to-text is a next-generation AI vision model specialized in converting image content to structured text. Engineered for rapid and accurate Optical Character Recognition (OCR), it enables seamless automation, accessibility, and real-time information extraction across industries. Unlike traditional OCR solutions or generic multimodal models, o3/image-to-text emphasizes speed, reliability, and adaptability, making it ideal for developers seeking robust image-to-text capabilities. It uses advanced neural architectures that excel in diverse scenarios, including document processing, automated workflows, and AI-powered accessibility tools.
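An OCR call like the one described above is expressed as a multimodal Chat Completions request: one text part carrying the instruction, one image part carrying the picture as a data URL. The content-part shape follows OpenAI's public API; the prompt wording is illustrative.

```python
import base64

def build_ocr_request(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a Chat Completions body asking o3 to transcribe an image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "o3",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

req = build_ocr_request(b"\x89PNG...")
```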

Input:$1.8/1M tokens$2/1M tokens
Output:$7.2/1M tokens$8/1M tokens

o3/file-analysis is a state-of-the-art AI model designed for efficient and accurate file analysis across diverse formats. Built on OpenAI's o3 foundation, it brings advanced data extraction and interpretation capabilities suited for developers and enterprises. Unlike general text models, o3/file-analysis is optimized to handle structured files, metadata, and complex documents, providing faster insights and higher accuracy for workflow integration, audit, and compliance tasks.

Input:$1.8/1M tokens$2/1M tokens
Output:$7.2/1M tokens$8/1M tokens

o3/web-search is an advanced AI model tailored for enhanced web search scenarios and intelligent content creation. It combines AI-driven natural language understanding with real-time web data retrieval, delivering fact-based, up-to-date responses. Compared to standard models, o3/web-search incorporates web integration natively for accuracy and relevancy, making it ideal for research, customer support, technical writing, and SEO-focused applications. Its robust modal capabilities and responsive design optimize data extraction, answer generation, and workflow automation for developers, businesses, and content professionals.

Input:$0.99/1M tokens$1.1/1M tokens
Output:$3.96/1M tokens$4.4/1M tokens

o4-mini/text-to-text is a compact AI language model tailored for rapid and efficient text-based tasks. With a lightweight architecture, it delivers fast inference and reliable outputs, making it suitable for real-time applications such as automated writing, coding assistance, and conversational bots. Compared to its base o4 models, o4-mini/text-to-text focuses on speed and resource savings while maintaining high output quality for most standard use cases. It's particularly valuable for developers and businesses seeking scalable, low-latency AI solutions without extensive hardware requirements.

Input:$0.99/1M tokens$1.1/1M tokens
Output:$3.96/1M tokens$4.4/1M tokens

o4-mini/image-to-text is a fast, compact AI vision model engineered for converting images into descriptive text. Belonging to the o4-mini family, this model focuses on image captioning and visual content description with improved speed and lightweight architecture. It delivers reliable performance for image analysis tasks in real time, distinguishing itself from larger multimodal models through efficiency and lower resource consumption. Its text output is precise and context-aware, making o4-mini/image-to-text ideal for applications in accessibility, content moderation, and automated media annotation. Compared to its base model, o4-mini/image-to-text is optimized for rapid deployment and use on resource-constrained environments.

Input:$0.99/1M tokens$1.1/1M tokens
Output:$3.96/1M tokens$4.4/1M tokens

o4-mini/file-analysis is a focused AI model designed for automated file analysis, data extraction, and document understanding across industries. As part of the o4-mini model family, it is optimized for speed, lightweight deployment, and specialized processing of files such as PDFs, spreadsheets, and text documents. It stands apart from base o4-mini models by offering enhanced structure recognition, smarter data parsing, and better support for enterprise workflows. Developers use it to streamline document review, compliance checks, and file-driven automation, benefiting from its precision and efficient operation, especially in technical and business scenarios.

Input:$0.99/1M tokens$1.1/1M tokens
Output:$3.96/1M tokens$4.4/1M tokens

o4-mini/web-search is a lightweight AI language model specifically optimized for web search, data extraction, and information retrieval tasks. Designed for speed and efficiency, it is well-suited for real-time indexing, summarization, and knowledge graph building. Compared to its o4-mini family base model, o4-mini/web-search introduces enhanced relevance ranking, faster query resolution, and domain-specific accuracy. Its compact architecture ensures rapid deployment for developers and seamless integration into search-driven workflows.

$0.0081/per time$0.0135/per time

Grok-3-reasoner is a specialized reasoning variant of xAI’s Grok 3 model designed for deep, multi-step problem solving. It uses test-time compute to dynamically allocate resources, allowing extended reflection, error correction, and exploration of alternative solutions. This mode excels in complex reasoning tasks like advanced mathematics, scientific research, and coding, providing transparent step-by-step thought processes and significantly improved accuracy over generalist models.

$0.048/per time$0.06/per time

ideogram-replace-background-v3/text-to-image is an advanced generative AI model specialized in transforming text prompts into high-quality images with seamless background manipulation. Building on the Ideogram family, it offers enhanced background replacement, fast processing, and precise scene adaptation. Designed for media, design, and digital marketing, it stands out for its flexibility in complex workflows and integration with enterprise imaging pipelines. Compared to standard text-to-image models, it delivers superior control over scene elements and background context.

$0.048/per time$0.06/per time

ideogram-remix-v3/text-to-image is an advanced text-to-image AI model designed for high-quality visual content generation. Leveraging diffusion-based architectures, it transforms textual prompts into coherent and detailed images. This model excels in versatility, supporting various creative workflows such as design prototyping, ad visuals, and educational illustration. Compared to its base model, ideogram-remix-v3/text-to-image introduces improvements in rendering speed, prompt adherence, and style consistency. It is ideal for developers, artists, marketers, and educators who require scalable and reliable generative imagery.

$0.048/per time$0.06/per time

Ideogram Edit v3 is an advanced AI image generation and editing model from Ideogram, focused on producing highly realistic and textually accurate images. It includes powerful editing tools like Magic Fill for adding or changing image areas, and Extend for expanding image boundaries. The model features enhanced text rendering within images, supports style reference images, and allows fine control over image composition, texture, and lighting for professional-quality visuals. It is widely used for marketing content, social media graphics, and creative design workflows.

$0.048/per time$0.06/per time

Ideogram-Reframe-V3 is an advanced image-to-image AI model designed to extend and adapt existing images to different resolutions and aspect ratios while preserving key visual elements. It enables creative expansions and modifications for various formats like JPEG, PNG, and WebP. The Reframe feature is ideal for responsive design, digital media, and creative automation, allowing developers to efficiently repurpose visuals across platforms with prompt-driven control and style customization.

$0.048/per time$0.06/per time

Ideogram-Generate-V3 is an advanced AI text-to-image generation model known for high visual fidelity, photorealism, and excellent text rendering within images. Released in 2024, it supports multiple artistic styles and custom aspect ratios, enabling creation of logos, marketing visuals, and creative designs with readable text and detailed compositions. It delivers fast, high-quality images suitable for professional and creative workflows.

$0.0608/per time$0.1014/per time

Midjourney is an AI-based image generation service that transforms natural language prompts into detailed, artistic images using advanced machine learning models. Its API allows developers to integrate this capability into applications, offering features like image generation, upscaling, inpainting, and blending.

$0.0608/per time$0.1014/per time

Midjourney Image-to-Image API enables users to submit an existing image as input to generate variations, enhancements, or stylistic changes. This feature facilitates creative editing such as style transfer, background alteration, or generating image continuations, all leveraging powerful AI models to tailor outputs to user needs.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o/text-to-text is OpenAI’s latest-generation language model designed for high-performance text generation and understanding. It combines optimized speed, improved logic, and multi-turn conversational skills. Ideal for real-time writing, code generation, and data analysis, gpt-4o/text-to-text stands apart from previous models like GPT-4 because of its scalable throughput and context-aware accuracy. Developers rely on it for reliable automation and productivity across business, tech, and education sectors.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o/image-to-text is OpenAI’s advanced multimodal AI model designed for fast and accurate image-to-text conversion. It excels at extracting information from images, enabling high-quality OCR and contextual visual analysis. Compared to the core GPT-4o model, gpt-4o/image-to-text optimizes workflows for visual content understanding, making it ideal for technical, business, and accessibility applications. Its scalable inference, robust architecture, and multimodal capabilities support rapid integration, helping developers automate document extraction and enhance user experiences with reliable image analysis.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o/web-search is a next-generation multimodal AI model from OpenAI designed for fast, accurate web-based queries, code generation, and knowledge retrieval. It improves on the GPT foundation with enhanced real-time web search integration, efficient multi-modal processing for text and images, and superior task adaptability. gpt-4o/web-search is optimized for workflows requiring up-to-date data, context-rich outputs, and high-speed interaction, making it ideal for developers, analysts, and researchers who demand reliable AI-driven solutions with scalable performance.

Input:$1.5/1M tokens$2.5/1M tokens
Output:$6/1M tokens$10/1M tokens

gpt-4o/file-analysis is a cutting-edge multimodal AI model based on the GPT-4o family, designed to analyze, interpret, and generate insights from diverse file types including text, code, and images. Building upon the speed and accuracy of GPT-4o, this model uniquely integrates file understanding, enabling developers to extract structured information and automate document-heavy workflows. Compared to standard GPT-4o, it further streamlines file-centric tasks, making it indispensable for software engineering, research, and business automation.

Input:$6/1M tokens$10/1M tokens
Output:$24/1M tokens$40/1M tokens

GPT Image-1 image-edit is a feature of OpenAI's GPT Image-1 model that allows precise editing of images using text prompts and optional masks. Users can modify specific areas by adding or removing elements, adjusting styles, or correcting details, leveraging GPT Image-1's understanding of visual and textual cues for seamless image modifications.
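A masked edit like the one described above goes through OpenAI's image-edit endpoint (`POST /v1/images/edits`) as a multipart form. A sketch of the non-file form fields, following the public Images API (file handling is simplified for illustration):

```python
def build_edit_fields(prompt: str, size: str = "1024x1024") -> dict:
    """Build the plain form fields for an image-edit request.
    The "image" and "mask" files are attached as separate multipart parts;
    the fully transparent region of the mask marks the area to be repainted."""
    return {
        "model": "gpt-image-1",
        "prompt": prompt,  # describes the desired change
        "size": size,
    }

fields = build_edit_fields("Replace the sky with a pink sunset")
```

Omitting the mask asks the model to decide itself which regions to alter based on the prompt.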

Input:$6/1M tokens$10/1M tokens
Output:$24/1M tokens$40/1M tokens

gpt-image-1/text-to-image is a multimodal AI model designed for fast and accurate text-to-image synthesis. Developed by OpenAI as part of the GPT image model family, it brings advanced generative capabilities to image creation. This model stands out by combining the reliable architecture of GPT with adaptation for image generation, supporting industry use in digital media, creative tasks, and automation. Its optimized speed and multimodal input make it a preferred choice for developers and teams seeking robust text-to-image solutions.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1 represents a refined evolution within the GPT-4 family, specifically engineered to provide developers with enhanced instruction following and superior reasoning stability. As a premium text to text model, it bridges the gap between the speed of previous iterations and the deep intelligence of the latest frontier models. Developed by OpenAI, gpt-4.1 excels in complex logic tasks, high density coding, and nuanced prose generation. When accessed via GPT Proto, users benefit from optimized latency and a streamlined environment tailored for enterprise scale production. It offers a distinct advantage in reliability, ensuring consistent outputs for high stakes automation and creative content strategies.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1/file-analysis is a specialized AI model from the GPT-4.1 family, designed for advanced file interpretation, code review, and data extraction. It excels at automated file processing, supporting varied codebases, document types, and complex workflows. Unlike general GPT-4 engines, gpt-4.1/file-analysis integrates unique file parsing capabilities and high-speed performance suited for technical, developer-focused environments. Its adaptable model architecture ensures reliability and efficiency in file-centric automation, making it a go-to choice for software engineers, data analysts, and IT professionals needing robust, accurate analytics in diverse file formats.

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

gpt-4.1/web-search represents a significant leap in functional AI, combining the deep reasoning of the 4.1 generation with integrated live internet access. This model is specifically tuned to perform searches before generating responses, ensuring that information is current and backed by clickable citations. Unlike static base models, gpt-4.1/web-search offers dynamic tool usage, domain filtering, and location-aware results. It is ideal for developers building research agents, market analysis tools, or news aggregators. By bridging the gap between historical training data and live web content, it provides a reliable foundation for enterprise applications requiring high factual integrity and real-time relevance.
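The search-before-answer and location-aware behavior described above is configured through the Responses API's hosted web search tool. A minimal sketch of the request body; the tool type string and `user_location` fields follow OpenAI's public documentation at the time of writing and should be treated as assumptions to verify.

```python
def build_web_search_request(query: str, country: str = "US") -> dict:
    """Build a Responses API body with hosted web search and an
    approximate user location for geographically relevant results."""
    return {
        "model": "gpt-4.1",
        "tools": [{
            "type": "web_search_preview",
            "user_location": {"type": "approximate", "country": country},
        }],
        "input": query,
    }

req = build_web_search_request("Current UK bank holiday dates", country="GB")
```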

Input:$1.2/1M tokens$2/1M tokens
Output:$4.8/1M tokens$8/1M tokens

GPT-4.1/image-to-text represents the pinnacle of multimodal language modeling, specifically designed to bridge visual perception and linguistic understanding. This model processes image inputs with extreme precision, offering developers the ability to extract text, identify objects, and reason about complex visual scenes. Built upon the robust foundation of the latest GPT architecture, GPT-4.1/image-to-text introduces optimized tokenization for images, allowing for cost-effective analysis in both low and high-resolution modes. Whether you are building accessibility tools or automated content moderation, this model provides the reliable, structured output necessary for enterprise applications. Experience the fastest and most stable integration of this vision powerhouse on the GPT Proto platform today.

Input:$0.24/1M tokens$0.4/1M tokens
Output:$0.96/1M tokens$1.6/1M tokens

gpt-4.1-mini/text-to-text is a lightweight, high-speed AI model purpose-built for rapid and efficient text processing. As a member of the GPT-4.1 family, it inherits core natural language understanding from the base model but optimizes for minimal latency and resource usage. Suitable for real-time chatbots, summarization, and drafting tasks, it serves developers needing prompt, reliable, and cost-effective solutions. Its main differentiator is its size-to-performance ratio, delivering quality outputs in environments where speed and efficiency are critical, outpacing larger models in throughput while remaining accurate and context-aware.

Input:$0.24/1M tokens$0.4/1M tokens
Output:$0.96/1M tokens$1.6/1M tokens

gpt-4.1-mini/image-to-text is a compact multimodal AI model focusing on converting images to accurate text. As part of the GPT-4.1-mini family, it offers efficient visual data extraction and advanced OCR capability while maintaining fast inference speeds. Unlike general-purpose models, gpt-4.1-mini/image-to-text is optimized for real-time document processing, receipts recognition, and visual content parsing, making it highly relevant for developers building solutions in finance, logistics, and automation. Its precision, efficiency, and cost-effective deployment set it apart for teams needing scalable image-to-text workflows.

Input:$0.24/1M tokens$0.4/1M tokens
Output:$0.96/1M tokens$1.6/1M tokens

gpt-4.1-mini/file-analysis is a compact AI language model specialized in efficient file analysis, code review, and structured text extraction. Part of the GPT-4.1-mini family, it focuses on delivering fast response times, low resource usage, and high accuracy, making it ideal for developers and teams needing lightweight, reliable AI-powered file intelligence. Its core strengths include advanced code understanding, robust document processing, and seamless integration into automation pipelines—providing an efficient alternative to larger, general-purpose models.

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-4.1-nano/text-to-text is an efficient AI text generation model built for speed and resource efficiency. Designed on the GPT-4.1 family, it bridges core NLP capabilities and fast deployment. Its differentiator is rapid inference with reduced compute needs, making it an ideal solution for edge devices, quick-response systems, or lightweight applications. Compared to larger GPT variants, it offers faster results with lower overhead, suitable for developers needing reliable summarization, generation, or everyday language processing under strict resource constraints.

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-4.1-nano/image-to-text is a compact multimodal AI model by OpenAI based on the GPT-4.1-nano architecture. Designed for fast and accurate image-to-text conversion, it excels in optical character recognition, document parsing, and extracting textual content from images. Compared to full-scale GPT-4, this version offers rapid processing and lower resource usage, making it optimal for applications needing real-time results or high deployment scalability. Its speed and focused modality make it ideal for developers and businesses automating image analysis pipelines, digital archiving, accessibility, or mobile scenarios.

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gpt-4.1-nano/file-analysis is a compact, next-generation AI model designed for fast, precise document and code file analysis. As part of the GPT-4.1 family, it leverages efficient architecture and lightweight deployment, making it highly suitable for workflow automation, file auditing, and technical review scenarios. Unlike larger GPT models, nano/file-analysis emphasizes speed and resource efficiency, supporting developers and businesses needing reliable file-centric AI capabilities for seamless integration. Its specialized processing mode covers text, code, and structured file formats, ensuring consistent results with minimal overhead.

Input:$1.8/1M tokens$3/1M tokens
Output:$9/1M tokens$15/1M tokens

Grok-3 is the third-generation AI language model developed by xAI, designed to compete with leading models like GPT-4 and Gemini 2.0. It features enhanced reasoning, advanced real-time data integration, and a massive 128,000-token context window for deep understanding. Grok-3 offers specialized modes like "Big Brain" for complex problem-solving and "DeepSearch" for real-time information synthesis, excelling in coding, research, and multitask AI applications.

$0.0203/per time$0.0338/per time

GPT-4o-image-vip is a premium variant of OpenAI's GPT-4o multimodal model, specialized for advanced image-to-text and image generation tasks. It offers enhanced image understanding, detailed visual description, and precise text extraction from images. GPT-4o-image-vip supports high fidelity, multi-image processing, and iterative image editing, making it ideal for complex visual workflows in creative design, technical analysis, and interactive applications. It integrates seamlessly with text inputs for rich multimodal conversations and is optimized for both latency and output quality.

$0.0203/per time$0.0338/per time

GPT-4o-image-vip image-to-image refers to the advanced image generation and editing capabilities of the GPT-4o-image-vip model by OpenAI. This model enables uploading an existing image and providing precise instructions to modify, enhance, or creatively transform it while maintaining coherence in style and details. It supports high-resolution outputs, multi-turn conversational refinement, and accurate rendering of text within images. Ideal for creative workflows, design prototyping, and interactive media, GPT-4o-image-vip excels in generating production-ready visuals with realism and flexibility through natural language commands.

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

Gemini 2.0 Flash is an advanced AI model by Google designed for fast, accurate text processing, with support for complex reasoning, an extended context window of up to 1 million tokens, and native tool integrations. It excels in multilingual, real-time text generation as well as advanced coding, research, and conversational applications.

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

Gemini 2.0 Flash Image-to-Text processes images natively to extract and generate descriptive, analytical text, enabling multimodal input for tasks like image analysis, captioning, and combined vision-language workflows. It is part of Gemini 2.0's multimodal, high-speed AI platform, with ongoing API and tool enhancements.
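The native image input described above is sent as an inline part in a `generateContent` request. A sketch of the JSON body, with field names following the public Gemini REST reference (`inline_data`, `mime_type`, `data`); the prompt is illustrative.

```python
import base64

def build_caption_request(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a Gemini generateContent body pairing an image with a
    caption request. The image travels base64-encoded inside the JSON."""
    return {
        "contents": [{
            "parts": [
                {"text": "Describe this image in one sentence."},
                {"inline_data": {
                    "mime_type": mime,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }],
    }

req = build_caption_request(b"\xff\xd8\xff...")
```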

Input:$0.06/1M tokens$0.1/1M tokens
Output:$0.24/1M tokens$0.4/1M tokens

gemini-2.0-flash/file-analysis is a highly optimized, multimodal AI model built for fast and accurate file content analysis. Part of the Gemini 2.0 family, it leverages advanced architecture to deliver rapid processing speeds, efficient text and document evaluation, and robust performance. Unlike core Gemini models, it specializes in file input workflows, making it ideal for developers and businesses needing reliable, scalable, and secure file-based AI solutions. Its precision and flexibility drive innovation in sectors like legal, education, and enterprise document management.

$0.48/per time$1.2/per time

Veo 3 is Google DeepMind's advanced AI video generation model that creates high-definition, realistic videos with synchronized native audio from simple text or image prompts. It combines three specialized systems for visuals, audio, and timing to produce cohesive audiovisual content including dialogue, ambient sounds, and music. Veo 3 supports complex scenes with realistic motion, lighting, and physics, making it a versatile tool for cinematic-quality video creation.

$0.48/per time$1.2/per time

Veo 3 image-to-video is an AI capability that transforms a single still image into a dynamic, high-quality video clip with consistent motion and native audio. It allows users to guide the generated video’s motion, narrative, and sound by providing an initial image plus optional text prompts. Veo 3 and its faster variant, Veo 3 Fast, power this feature with realistic animation, seamless transitions, and synchronized sound effects, making it ideal for creative video production workflows.
