
Gemma 3 vs. MiniCPM vs. Qwen 2.5 VL

Introduction

Vision-Language Models (VLMs) are quickly becoming the core of many generative AI applications, from multimodal chatbots and agentic systems to automated content analysis tools. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.

However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It is often a balancing act between output quality, latency, throughput, context length, and infrastructure cost.

This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. All benchmarks were run using Clarifai’s Compute Orchestration, our own inference engine, to ensure consistent conditions and reliable comparisons across models.

Before diving into the results, here’s a quick breakdown of the key metrics used in the benchmarks. All results were generated using Clarifai’s Compute Orchestration on NVIDIA L40S GPUs, with input tokens set to 500 and output tokens set to 150.

  1. Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, which is especially important for chat-like experiences.
  2. Time to First Token (TTFT): How quickly the model produces the first token after receiving the input. It drives perceived responsiveness in streaming generation tasks.
  3. End-to-End Throughput: The number of tokens the model generates per second for a single request, counting the full request processing time. Higher end-to-end throughput means the model generates output efficiently while keeping latency low.
  4. Overall Throughput: The total number of tokens generated per second across all concurrent requests. This reflects the model’s ability to scale and maintain performance under load. (The sketch after this list shows how these metrics fall out of raw request timings.)
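
To make these definitions concrete, here is a minimal sketch of how the first three metrics can be derived from the raw timings of one streaming response. The `stream` argument is a hypothetical stand-in for whatever streaming client you use; the arithmetic is the point.

```python
import time

def measure_stream(stream):
    """Derive per-request metrics from a token stream.

    `stream` is any iterable that yields output tokens as the model
    generates them (a hypothetical stand-in for your client library).
    """
    start = time.perf_counter()
    arrivals = [time.perf_counter() for _ in stream]  # timestamp each token

    n_tokens = len(arrivals)
    total_time = arrivals[-1] - start                 # full request duration

    ttft = arrivals[0] - start                        # Time to First Token
    latency_per_token = total_time / n_tokens         # avg time per output token
    end_to_end_throughput = n_tokens / total_time     # tokens/sec, one request
    return ttft, latency_per_token, end_to_end_throughput
```

Overall throughput is the aggregate version of the last figure: the total tokens produced by all concurrent requests divided by the wall-clock time of the run.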

Now, let’s dive into the details of each model, starting with Gemma-3-4B.

Gemma-3-4B

Gemma-3-4B, part of Google’s latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it is built for production-grade applications across diverse use cases.
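
For orientation before the numbers, here is roughly how Gemma-3-4B can be queried with an image and a prompt through the Hugging Face `image-text-to-text` pipeline. This is a sketch assuming a recent `transformers` release with Gemma 3 support; the image URL is a placeholder.

```python
from transformers import pipeline

# Instruction-tuned Gemma 3 4B checkpoint
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "Summarize what this chart shows."},
    ],
}]

output = pipe(text=messages, max_new_tokens=150)  # 150 matches the benchmark output size
print(output[0]["generated_text"][-1]["content"])
```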

Benchmark Summary: Performance on L40S GPU

Gemma-3-4B shows strong performance across both text and image tasks, with consistent behavior under varying concurrency levels. All benchmarks were run using Clarifai’s Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. Gemma-3-4B is optimized for low-latency text processing and handles image inputs up to 512px with stable throughput across concurrency levels.
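
For reference, a concurrency sweep like the one behind these numbers can be approximated with a small async load generator. The endpoint, model name, and prompt below are placeholders (not Clarifai’s actual API); the structure is what matters: a fixed output size, a semaphore to cap in-flight requests, and wall-clock aggregation for RPM and overall throughput.

```python
import asyncio
import time

import httpx  # pip install httpx

ENDPOINT = "http://localhost:8000/v1/completions"   # placeholder OpenAI-style endpoint
PAYLOAD = {
    "model": "gemma-3-4b-it",      # placeholder model name
    "prompt": "...",               # padded to ~500 input tokens in the real runs
    "max_tokens": 150,             # matches the benchmark's output size
}

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(ENDPOINT, json=PAYLOAD, timeout=120.0)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def sweep(concurrency: int, total_requests: int = 64) -> None:
    async with httpx.AsyncClient() as client:
        sem = asyncio.Semaphore(concurrency)   # cap in-flight requests

        async def bounded() -> int:
            async with sem:
                return await one_request(client)

        start = time.perf_counter()
        tokens = await asyncio.gather(*(bounded() for _ in range(total_requests)))
        elapsed = time.perf_counter() - start

    print(f"concurrency={concurrency}: "
          f"overall {sum(tokens) / elapsed:.1f} tokens/sec, "
          f"{total_requests / elapsed * 60:.1f} RPM")

if __name__ == "__main__":
    for c in (1, 2, 8, 16, 32):
        asyncio.run(sweep(c))
```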

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.135 sec
  • End-to-end throughput: 202.25 tokens/sec
  • Requests per minute (RPM): up to 329.90 at 32 concurrent requests
  • Overall throughput: 942.57 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 718.63 tokens/sec, 252.16 RPM at 32 concurrency
  • 512px images: 688.21 tokens/sec, 242.04 RPM

Scales with Concurrency (End-to-End Throughput): per-request throughput was measured at 2, 8, 16, and 32 concurrent requests; see the chart below.

Overall Insight:

Gemma-3-4B provides fast and reliable performance for text-heavy and structured vision-language tasks. For large image inputs (512px), performance remains stable, but you may need to scale compute resources to maintain low latency and high throughput.

If you’re evaluating GPU performance for serving this model, we’ve published a separate comparison of A10 vs. L40S to help you choose the best hardware for your needs.

[Figure: Gemma-3-4B throughput vs. concurrency]

MiniCPM-o 2.6

MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.

With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model has a total of 8 billion parameters. MiniCPM-o 2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and greater efficiency in token processing.

Benchmark Summary: Performance on L40S GPU

All benchmarks were run using Clarifai’s Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. MiniCPM-o 2.6 performs exceptionally well across both text and image workloads, scaling smoothly across concurrency levels. Shared vLLM serving provides significant gains in overall throughput while maintaining low latency.
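
Since these numbers come from shared vLLM serving, a multimodal request against a vLLM OpenAI-compatible server looks roughly like this. It is a sketch: the server address and image URL are placeholders, and the model is assumed to be launched with something like `vllm serve openbmb/MiniCPM-o-2_6 --trust-remote-code`.

```python
from openai import OpenAI  # pip install openai

# Placeholder address for a local vLLM OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-512px.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }],
    max_tokens=150,   # matches the benchmark's output size
    stream=True,      # streaming makes TTFT directly observable
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```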

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.087 sec
  • End-to-end throughput: 213.23 tokens/sec
  • Requests per minute (RPM): up to 362.83 at 32 concurrent requests
  • Overall throughput: 1075.28 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 1039.60 tokens/sec, 353.19 RPM at 32 concurrency
  • 512px images: 957.37 tokens/sec, 324.66 RPM

Scales with Concurrency (End-to-End Throughput): per-request throughput was measured at 2, 8, 16, and 32 concurrent requests; see the chart below.

Overall Insight:

MiniCPM-o 2.6 performs reliably across a range of tasks and input sizes. It maintains low latency, scales linearly with concurrency, and stays performant even with 512px image inputs. This makes it a solid choice for real-time applications running on modern GPUs like the L40S. These results reflect performance on that specific hardware configuration and may vary depending on the environment or GPU tier.

[Figure: MiniCPM-o 2.6 throughput vs. concurrency]

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.

Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations like SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, enhancing overall performance and consistency.
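
For illustration, a structured-extraction call with Qwen2.5-VL via `transformers` looks roughly like this. The sketch follows the pattern from the model’s documentation and assumes a recent `transformers` plus the `qwen-vl-utils` helper package; the image URL is a placeholder.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/invoice.png"},  # placeholder
        {"type": "text", "text": "Extract the invoice number and total as JSON."},
    ],
}]

# Build the chat prompt and vision inputs, then generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=150)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]  # drop the prompt tokens
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```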

Benchmark Summary: Performance on L40S GPU

Qwen2.5-VL-7B-Instruct delivers consistent performance across both text- and image-based tasks. Benchmarks from Clarifai’s Compute Orchestration highlight its ability to handle multimodal inputs at scale, with strong throughput and responsiveness under varying concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.089 sec
  • End-to-end throughput: 205.67 tokens/sec
  • Requests per minute (RPM): up to 353.78 at 32 concurrent requests
  • Overall throughput: 1017.16 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 854.53 tokens/sec, 318.64 RPM at 32 concurrency
  • 512px images: 832.28 tokens/sec, 345.98 RPM

Scales with Concurrency (End-to-End Throughput): per-request throughput was measured at 2, 8, 16, and 32 concurrent requests; see the chart below.

Overall Insight:

Qwen2.5-VL-7B-Instruct is well suited for both text and multimodal tasks. While larger images introduce latency and throughput trade-offs, the model performs reliably with small to medium-sized inputs even at high concurrency. It’s a strong choice for scalable vision-language pipelines that prioritize throughput and moderate latency.

[Figure: Qwen2.5-VL-7B-Instruct throughput vs. concurrency]

Which VLM is Right for You?

Choosing the right Vision-Language Model (VLM) depends on your workload type, input modality, and concurrency requirements. All benchmarks in this report were generated using NVIDIA L40S GPUs via Clarifai’s Compute Orchestration.

These results reflect performance on enterprise-grade infrastructure. If you’re using lower-end hardware or targeting larger batch sizes or ultra-low latency, actual performance may differ. It’s important to evaluate based on your specific deployment setup.

MiniCPM-o 2.6
MiniCPM offers consistent performance across both text and image tasks, especially when deployed with shared vLLM. It scales well up to 32 concurrent requests, maintaining high throughput and low latency even with 1024px image inputs.

If your application requires stable performance under load and flexibility across modalities, MiniCPM is the most well-rounded choice in this group.

Gemma-3-4B
Gemma performs best on text-heavy workloads with occasional image input. It handles concurrency well up to 16 requests but begins to dip at 32, particularly with large images such as 2048px.

If your use case is primarily focused on fast, high-quality text generation with small to medium image inputs, Gemma provides strong performance without requiring high-end scaling.

Qwen2.5-VL-7B-Instruct
Qwen2.5 is optimized for structured vision-language tasks such as document parsing, OCR, and multimodal reasoning, making it a strong choice for applications that require precise visual and textual understanding.

If your priority is accurate visual reasoning and multimodal understanding, Qwen2.5 is a strong fit, especially when output quality matters more than peak throughput.

To help you compare at a glance, here’s a summary of the key performance metrics for all three models at 32 concurrent requests across text and image inputs.

Vision-Language Model Benchmark Summary (32 Concurrent Requests, L40S GPU)

| Metric | Model | Text Only | 256px Image | 512px Image |
|---|---|---|---|---|
| Latency per Token (sec) | Gemma-3-4B | 0.027 | 0.036 | 0.037 |
| | MiniCPM-o 2.6 | 0.024 | 0.026 | 0.028 |
| | Qwen2.5-VL-7B-Instruct | 0.025 | 0.032 | 0.032 |
| Time to First Token (sec) | Gemma-3-4B | 0.236 | 1.034 | 1.164 |
| | MiniCPM-o 2.6 | 0.120 | 0.347 | 0.786 |
| | Qwen2.5-VL-7B-Instruct | 0.121 | 0.364 | 0.341 |
| End-to-End Throughput (tokens/s) | Gemma-3-4B | 168.45 | 124.56 | 120.01 |
| | MiniCPM-o 2.6 | 188.86 | 176.29 | 160.14 |
| | Qwen2.5-VL-7B-Instruct | 186.91 | 179.69 | 191.94 |
| Overall Throughput (tokens/s) | Gemma-3-4B | 942.58 | 718.63 | 688.21 |
| | MiniCPM-o 2.6 | 1075.28 | 1039.60 | 957.37 |
| | Qwen2.5-VL-7B-Instruct | 1017.16 | 854.53 | 832.28 |
| Requests per Minute (RPM) | Gemma-3-4B | 329.90 | 252.16 | 242.04 |
| | MiniCPM-o 2.6 | 362.84 | 353.19 | 324.66 |
| | Qwen2.5-VL-7B-Instruct | 353.78 | 318.64 | 345.98 |


Note: These benchmarks were run on L40S GPUs. Results may vary depending on GPU class (such as A100 or H100), CPU limitations, or runtime configuration, including batching, quantization, or model variants.

Conclusion

We have seen the benchmarks across MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering their performance on latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.

If you want to try out these models, we have launched a new AI Playground where you can explore them directly. We will continue adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.

If you are also looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly get started by setting up your own node pool and running inference efficiently. Check out the tutorial below to get started.
