Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It is positioned for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.


But what is actually new? A unified backbone with disentangled audio I/O
LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.
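As a rough illustration of the continuous input path, the sketch below splits a mono waveform into fixed ~80 ms chunks like those the projector would embed. The 16 kHz sample rate and the `chunk_waveform` helper are assumptions for illustration, not part of liquid-audio.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # assumed; not stated in the article
CHUNK_MS = 80                             # ~80 ms chunks per the article
CHUNK = SAMPLE_RATE * CHUNK_MS // 1000    # 1280 samples per chunk

def chunk_waveform(wave: np.ndarray) -> np.ndarray:
    """Split a mono waveform into fixed ~80 ms frames, zero-padding the tail."""
    pad = (-len(wave)) % CHUNK
    wave = np.pad(wave, (0, pad))
    return wave.reshape(-1, CHUNK)

# 1 second of audio -> 12.5 chunks, padded up to 13
frames = chunk_waveform(np.zeros(SAMPLE_RATE))
print(frames.shape)  # (13, 1280)
```

Each such frame would then be projected to a continuous embedding and fed to the backbone alongside text tokens, with no input-side quantization step.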
On the implementation side, the released checkpoint uses:
- Backbone: LFM2 (hybrid conv + attention), 1.2B params (LM only)
- Audio encoder: FastConformer (~115M, canary-180m-flash)
- Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)
- Context: 32,768 tokens; vocab: 65,536 (text) / 2049×8 (audio)
- Precision: bfloat16; license: LFM Open License v1.0; languages: English
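Some back-of-envelope token arithmetic puts these numbers in perspective. The ~12.5 codec frames per second figure below is an assumption based on how the Mimi codec is commonly configured, not a spec from the article; only the 8 codebooks and the 32,768-token context come from the list above.

```python
# Assumption: Mimi emits ~12.5 codec frames/s; 8 codebooks per frame (from specs above).
FRAMES_PER_S = 12.5
CODEBOOKS = 8
CONTEXT = 32_768

audio_tokens_per_s = FRAMES_PER_S * CODEBOOKS      # audio tokens generated per second
max_audio_seconds = CONTEXT / audio_tokens_per_s   # if the context held only audio
print(audio_tokens_per_s, max_audio_seconds)       # 100.0 327.68
```

Under these assumptions the context window could hold on the order of five minutes of pure audio, before accounting for any interleaved text.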
Two generation modes for real-time agents
- Interleaved generation for live, speech-to-speech chat, where the model alternates text and audio tokens to minimize perceived latency.
- Sequential generation for ASR/TTS (switching modalities turn by turn).
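A toy scheduler makes the difference between the two modes concrete. This is an illustrative sketch of the interleaving idea only; the real model emits both modalities from one autoregressive loop rather than from two separate streams.

```python
from itertools import islice
from typing import Iterable, Iterator

def interleaved_stream(text: Iterable[str], audio: Iterable[str],
                       n_text: int = 2, n_audio: int = 4) -> Iterator[str]:
    """Emit short runs of text tokens and audio codes in alternation,
    so audio playback can begin before the full reply is decoded."""
    t, a = iter(text), iter(audio)
    while True:
        t_run = list(islice(t, n_text))
        a_run = list(islice(a, n_audio))
        if not t_run and not a_run:
            return
        yield from t_run
        yield from a_run

# Interleaved: text and audio tokens alternate in small runs.
tokens = list(interleaved_stream(["Hi", "there"], ["a1", "a2", "a3"], 1, 2))
print(tokens)  # ['Hi', 'a1', 'a2', 'there', 'a3']
```

Sequential mode is the degenerate case: emit all tokens of one modality, then switch, which is what an ASR or TTS turn looks like.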
Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
Latency: <100 ms to first audio
The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response, a proxy for perceived responsiveness in interactive use, stating it is faster than models smaller than 1.5B parameters under their setup.
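Time-to-first-audio is straightforward to measure on any streaming interface. The harness below is a hypothetical sketch against a stand-in generator, not the liquid-audio API; the `("text"/"audio", payload)` tagging convention is an assumption.

```python
import time
from typing import Iterator, Optional, Tuple

def time_to_first_audio(stream: Iterator[Tuple[str, object]]) -> Optional[float]:
    """Seconds from call time to the first audio item in a token stream,
    a proxy for perceived responsiveness."""
    start = time.perf_counter()
    for kind, _payload in stream:
        if kind == "audio":
            return time.perf_counter() - start
    return None  # stream ended without any audio

def fake_model_stream():
    # Stand-in for an interleaved decoder: a little text, then audio.
    yield ("text", "Sure,")
    time.sleep(0.01)               # pretend decode work
    yield ("audio", b"\x00" * 1280)

latency = time_to_first_audio(fake_model_stream())
print(latency is not None and latency > 0)  # True
```

In an interleaved decoder the first audio tokens arrive after only a few decode steps, which is what keeps this number small.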
Benchmarks: VoiceBench and ASR results
On VoiceBench, a suite of nine audio-assistant evaluations, Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog's chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models such as Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark released in late 2024 for LLM-based voice assistants.)
The model card on Hugging Face provides an additional VoiceBench table (with closely related, but not identical, per-task values) and includes classical ASR WERs, where LFM2-Audio matches or improves on Whisper-large-v3-turbo on some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.
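For readers new to the metric, word error rate is edit distance over words, normalized by reference length. A minimal self-contained implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed via Levenshtein distance over whitespace-split tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(r)

# 1 substitution + 1 deletion over 6 reference words ≈ 0.33
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

So a reported WER of 2.03 on LibriSpeech-clean means roughly two word-level errors per hundred reference words.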


Alright, but why does it really matter for voice AI?
Most "omni" stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio's single-backbone design, with continuous input embeddings and discrete output codes, reduces glue logic and enables interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.
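The latency argument against cascades can be put in back-of-envelope terms: each stage must finish (or at least start producing output) before the next begins, so first-audio latency is roughly the sum of the stage latencies. The numbers below are made-up placeholders purely for illustration, not measurements.

```python
# Hypothetical per-stage latencies (ms) for a cascaded ASR -> LLM -> TTS stack.
cascade_ms = {"asr_finalize": 300, "llm_first_token": 200, "tts_first_audio": 150}

first_audio_cascade = sum(cascade_ms.values())
print(first_audio_cascade)  # 650 ms before any audio in this made-up cascade
```

A unified interleaved decoder collapses those handoffs into one autoregressive loop, which is how sub-100 ms first-audio figures become plausible.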
Check out the GitHub page, the Hugging Face model card, and the technical details.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.