Friday, June 26, 2026

DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds





DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Each checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

TL;DR

  • Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
  • The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
  • Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.
  • Three layers (a fixed trust boundary, a deterministic monitor, and a frozen LLM judge) guard against reward hacking.

What’s Ornith-1.0?

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.

Every model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Every model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.
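As a rough sanity check on the bf16 figure, weight memory is approximately the parameter count times the bytes per parameter. The sketch below assumes "9B" means exactly 9e9 parameters and uses 2 bytes for bf16 and 1 byte for FP8; the real checkpoint size depends on the true parameter count, and KV cache and activations come on top of the weights.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: "9B" is taken as exactly 9e9 parameters; the real count may differ.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8"):
    print(f"9B in {dtype}: ~{weight_gb(9e9, dtype):.0f} GB")
```

Under these assumptions the bf16 weights come to about 18 GB, which is in the same range as the reported 19GB and leaves most of an 80GB GPU free for the KV cache.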


The Self-Scaffolding Idea

Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task class.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. Each RL step runs in two stages.

First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.
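The two-stage step can be pictured with a small sketch. This is an illustration only, not DeepReinforce's training code: the function names (rl_step, propose_scaffold, solve, reward_fn) and the StepRecord type are hypothetical, and the model calls are stand-in callables. It shows the one property the description states, namely that the same rollout reward is credited to both the scaffold proposal and the solution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepRecord:
    stage: str     # "scaffold" or "solution"
    output: str    # text the model produced in this stage
    reward: float  # rollout reward, shared by both stages


def rl_step(
    task: str,
    prev_scaffold: str,
    propose_scaffold: Callable[[str, str], str],  # hypothetical model call
    solve: Callable[[str, str], str],             # hypothetical model call
    reward_fn: Callable[[str, str], float],       # hypothetical verifier
) -> Tuple[str, List[StepRecord]]:
    # Stage 1: read the task and the previous scaffold, propose a refined one.
    scaffold = propose_scaffold(task, prev_scaffold)
    # Stage 2: use the refined scaffold and the task to generate a solution.
    solution = solve(task, scaffold)
    # The reward from the rollout is credited to both stages.
    reward = reward_fn(task, solution)
    records = [
        StepRecord("scaffold", scaffold, reward),
        StepRecord("solution", solution, reward),
    ]
    return scaffold, records


# Toy usage with stand-in callables instead of a real model.
new_scaffold, records = rl_step(
    "fix failing test",
    "v0",
    lambda task, scaffold: scaffold + "+retry",
    lambda task, scaffold: f"patch via {scaffold}",
    lambda task, solution: 1.0 if "retry" in solution else 0.0,
)
print(new_scaffold, [(r.stage, r.reward) for r in records])
```

The returned scaffold would be fed back in as prev_scaffold on the next step, which is how the scaffold and the policy can co-evolve.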
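A minimal sketch of how a staleness weight could combine with group-relative advantages and token-level averaging follows. The article does not give the exact formulas, so everything specific here is an assumption: the geometric decay, the max_lag threshold, and the choice to average over kept tokens only. The clipped importance ratio and any KL term of a full GRPO objective are omitted for brevity.

```python
import math
from typing import List


def staleness_weight(lag: int, max_lag: int = 4, decay: float = 0.5) -> float:
    """Hypothetical weight: geometric decay in policy-version lag, zero past max_lag."""
    return decay ** lag if lag <= max_lag else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within a group of rollouts for one task."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def token_level_loss(
    logps: List[List[float]],  # per-token log-probabilities, one list per rollout
    rewards: List[float],      # one scalar reward per rollout
    lags: List[List[int]],     # per-token policy-version lag
) -> float:
    """Average the weighted surrogate over tokens in the group, not per sequence."""
    advs = group_advantages(rewards)
    total, kept = 0.0, 0
    for seq_logps, adv, seq_lags in zip(logps, advs, lags):
        for logp, lag in zip(seq_logps, seq_lags):
            w = staleness_weight(lag)
            if w == 0.0:
                continue  # token is too stale: dropped (assumed to leave the average)
            total += -w * adv * logp
            kept += 1
    return total / max(kept, 1)


loss = token_level_loss(
    logps=[[-1.0, -0.5], [-2.0]],
    rewards=[1.0, 0.0],
    lags=[[0, 1], [5]],
)
print(loss)
```

Averaging at the token level means long rollouts contribute more terms than short ones, which is the usual distinction from a per-sequence average.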

Guarding Against Reward Hacking

Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test data and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.

  1. The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.
  2. A deterministic monitor flags banned actions. Reading withheld paths or modifying verification scripts earns zero reward. These trajectories are excluded from the advantage computation.
  3. A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.
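The second and third layers can be sketched as a reward gate. This is a hedged illustration, not the published implementation: the banned path prefixes, the Trajectory fields, and the gated_reward function are all hypothetical. Treating a judge veto as a zero reward that still enters the advantage computation is also an assumption, since the article only says the judge acts as a veto on top of the verifier.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical banned locations; the real list belongs to the fixed environment.
BANNED_PREFIXES = ("/withheld/", "/verify/")


@dataclass
class Trajectory:
    touched_paths: Sequence[str]  # paths the rollout read or modified
    verifier_score: float         # primary reward from the verifier
    judge_vetoed: bool            # verdict of the frozen LLM judge


def monitor_flags(traj: Trajectory) -> bool:
    """Deterministic check: did the rollout touch any banned path?"""
    return any(p.startswith(BANNED_PREFIXES) for p in traj.touched_paths)


def gated_reward(traj: Trajectory) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_flags(traj):
        # Banned action: zero reward and excluded from the advantage computation.
        return 0.0, False
    if traj.judge_vetoed:
        # Assumption: a veto zeroes the reward but keeps the trajectory in the group.
        return 0.0, True
    # Otherwise the verifier stays the primary reward signal.
    return traj.verifier_score, True


print(gated_reward(Trajectory(["/repo/src/app.py"], 0.9, False)))
print(gated_reward(Trajectory(["/withheld/tests/expected.json"], 1.0, False)))
```

The first layer does not appear in the code because it is by construction outside anything the model can edit.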

Benchmark

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.

Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

Benchmark Ornith-1.0-397B Qwen3.5-397B Qwen3.7-Max GLM-5.2-744B Minimax-M3-428B DeepSeek-V4-Pro-1.6T Claude Opus 4.7 Claude Opus 4.8
Terminal-Bench 2.1 77.5 53.5 73.5 81.0 64 64 70.3 85
SWE-Bench Verified 82.4 76.4 80.4 80.6 80.8 87.6
SWE-Bench Pro 62.2 51.6 60.6 62.1 59 55.4 64.3 69.2
SWE-Bench Multilingual 78.9 69.3 78.3 76.2
NL2Repo 48.2 36.8 47.2 48.9 42.1 69.7
ClawEval Avg 77.1 70.7 65.2 75.8 78.2

Use Cases and a Quick Start

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.

For example, a developer can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.

Serving is a one-liner with vLLM:

vllm serve deepreinforce-ai/Ornith-1.0-9B \
    --served-model-name Ornith-1.0-9B \
    --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --trust-remote-code

Then call it with any OpenAI client:

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think> trace
print(msg.content)                              # the final answer

The reasoning trace returns in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. The model also plugs into OpenHands, OpenClaw, and OpenCode.
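The standard OpenAI chat-completions call has no top_k argument, so the recommended top_k=20 needs another route. With a vLLM server, extra sampling parameters can be passed through the OpenAI Python client's extra_body argument. The sketch below assumes the vLLM server from the command above is running on localhost:8000; the helper names recommended_request and send are illustrative, not part of any library.

```python
def recommended_request(prompt: str, model: str = "Ornith-1.0-9B") -> dict:
    """Build chat-completion kwargs with the recommended sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        # top_k is not a standard OpenAI parameter; vLLM reads it from extra_body.
        "extra_body": {"top_k": 20},
    }


def send(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send the request; needs the openai package and a running server."""
    from openai import OpenAI  # imported lazily so the helper above has no dependencies

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(**recommended_request(prompt))
    return resp.choices[0].message.content
```

Calling send("Write a Python is_prime(n).") would then return the final answer text, with the sampling settings applied server-side.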

