Wednesday, June 10, 2026

Local Agentic Programming on the Cheap: Claude Code + Ollama + Gemma 4

 

Introduction

 
Picture this: a multi-agent workflow that reads files, writes patches, runs tests, and iterates across four providers, making 400 API calls in a single afternoon. The notification arrives. You have crossed the soft limit again. Every token costs money, every prompt sends your proprietary code to a third-party server, and rate limits interrupt long-running sessions. The only answer on offer is paying more.

Gemma 4 26B MoE activates only 3.8 billion of its 26 billion parameters per forward pass. It scores 77.1% on LiveCodeBench v6, and the Gemma 4 family reaches 86.4% on τ2-bench agentic tool use (the 31B Dense figure; the table below lists the 26B MoE at about 79%). τ2-bench is the benchmark that specifically tests what happens when a model has to call tools, execute steps, and handle errors across a multi-step workflow. The previous generation, Gemma 3 27B, scored 6.6% on that same benchmark. That is not a small upgrade. It is the difference between a model that cannot reliably call tools and one that can run a Claude Code agentic loop without constantly malforming its function call parameters.

This article builds the full stack: Ollama serving Gemma 4 locally, the Modelfile that prevents context window failures in agentic sessions, the settings.json that wires Claude Code to the local endpoint, a verification script that confirms everything is working before you apply it to real code, and an honest rundown of what breaks and how to fix it. The audience is engineers who already understand what large language models (LLMs) are and what agentic loops cost. No hand-holding on the basics.

 

Why Gemma 4?

 
Released on April 2, 2026 under Apache 2.0, Gemma 4 is Google DeepMind's most capable open-weight model family to date. Four variants shipped: E2B (2B effective), E4B (4B effective), 26B MoE, and 31B Dense. The 26B MoE uses 128 small experts and activates only 8 per token plus one shared expert, delivering near-31B quality at dramatically lower compute cost.
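As a quick sanity check on those figures, the arithmetic below works out how much of the model is active per token. The parameter and expert counts are the ones quoted in this article; the helper name is illustrative only, and this is not an official breakdown of the architecture.

```python
# Rough arithmetic on the MoE figures quoted in the article.

def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per forward pass."""
    return active_params_b / total_params_b

TOTAL_EXPERTS = 128      # routed experts, per the article
ROUTED_PER_TOKEN = 8     # experts selected per token, per the article

param_share = active_fraction(3.8, 26.0)
expert_share = ROUTED_PER_TOKEN / TOTAL_EXPERTS

print(f"Active parameters per token:   {param_share:.1%}")   # 14.6%
print(f"Routed experts used per token: {expert_share:.2%}")  # 6.25%
```

The parameter share is larger than the expert share, presumably because components such as attention layers, embeddings, and the shared expert run for every token regardless of routing.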

Earlier Gemma versions used a custom Google license with commercial use restrictions ambiguous enough that enterprise legal teams routinely flagged it as a blocker. Gemma 4 is Apache 2.0, a first for the Gemma family. If your team wants to embed this in internal tooling, ship products on top of it, or run it in production pipelines without legal review overhead, that change matters operationally.

 

// The Numbers That Matter for Coding Agents

 

Benchmark                   | Gemma 3 27B | Gemma 4 26B MoE | Gemma 4 31B Dense
τ2-bench (agentic tool use) | 6.6%        | ~79%            | 86.4%
LiveCodeBench v6            | 29.1%       | 77.1%           | 80.0%
GPQA Diamond                | 42.4%       | 82.3%           | 84.3%
AIME 2026 (math)            | 20.8%       | 88.3%           | 89.2%
Arena AI Elo                | 1365        | 1441            | 1452

 

// Hardware Requirements

Before pulling an 18 GB model, know what you are actually working with. The Gemma 4 family was designed to span edge devices through workstations, and the four variants reflect that range.

 

Variant   | Ollama tag | Active params | VRAM at Q4 | Context window
Edge 4B   | gemma4:e4b | 4B            | ~6 GB      | 128K
26B MoE   | gemma4:26b | 3.8B          | ~16–18 GB  | 256K
31B Dense | gemma4:31b | 31B           | ~24–32 GB  | 256K
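To see roughly where VRAM figures like these come from, the sketch below estimates the size of the quantized weights alone. The 4.5 bits-per-weight value is an assumption (a typical effective size for 4-bit quantization schemes), not a figure from this article, and the function name is illustrative. Note that an MoE model has to hold all of its parameters in memory even though only a fraction is active per token.

```python
# Back-of-the-envelope weight size for a quantized model.
# ASSUMPTION: ~4.5 effective bits per weight for a 4-bit quantization.

def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

ASSUMED_Q4_BITS = 4.5

for tag, params_b in [("gemma4:26b", 26.0), ("gemma4:31b", 31.0)]:
    size = weight_memory_gb(params_b, ASSUMED_Q4_BITS)
    print(f"{tag}: roughly {size:.1f} GB of weights before KV cache and runtime overhead")
```

The gap between an estimate like this and the table's figures is the KV cache, activations, and runtime overhead, which grow with the context window you configure.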

 

// Installing Ollama, Gemma 4, and Claude Code

Step 1: Install Ollama

# macOS and Linux -- one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify version -- must be 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was added in January 2026
ollama --version
# Expected: ollama version is 0.22.x or higher (as of May 2026)

# Windows: download the native installer from https://ollama.com
# WSL2 is recommended if you want GPU passthrough on Windows

 

After installation, Ollama starts as a background service on port 11434. Verify it is up:

curl http://localhost:11434
# Expected response: Ollama is running

 

Step 2: Pull Gemma 4

# The 26B MoE -- recommended for this setup (~18 GB download)
ollama pull gemma4:26b

# ollama pull prints its own progress while downloading.
# To see which models are currently loaded in memory, use:
ollama ps

# Optional: also pull the 31B for comparison on capable hardware
ollama pull gemma4:31b

# Confirm the pull completed
ollama list
# Should show gemma4:26b with size and modification date

 

Step 3: Install Claude Code

# Prerequisites: Node.js 18 or later
node --version   # Confirm you are on 18+

# Install Claude Code CLI globally
npm install -g @anthropic-ai/claude-code

# Verify the installation
claude --version

 

With Ollama running and Gemma 4 pulled, the natural next instinct is to export the environment variables and launch Claude Code immediately. Hold off: the default context window needs fixing first.

 

The Modelfile

 
Ollama's default context window for Gemma 4 is 4K tokens. Gemma 4's actual context window is 128K–256K. That 4K default is not a suggestion; it is what Ollama will use unless you override it. In a Claude Code agentic session that reads source files, holds conversation history, and maintains tool call results across multiple turns, 4K tokens is exhausted in seconds.

Without the context override, Claude Code loses track of file contents mid-edit, forgets earlier instructions, and produces fragmented changes. Specifically: when an agent tries to refactor a 200-line service class, it cleanly forgets the second half exists. The agent does not raise an error. It just silently works on an incomplete view of the file and produces partially correct output that breaks downstream.
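To make the 4K problem concrete, here is a minimal sketch of a context budget check. It uses a crude four-characters-per-token heuristic, which is an assumption (real tokenizers vary with language and content), and the function names and sample strings are illustrative, not part of Ollama or Claude Code.

```python
# Rough context-budget check using a characters-per-token heuristic.
# ASSUMPTION: ~4 characters per token; use a real tokenizer for accurate counts.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate for a piece of text."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(prompt_parts: list[str], num_ctx: int,
                    reserve_for_output: int = 1024) -> bool:
    """True if the prompt plus a reserved output budget fits in num_ctx tokens."""
    used = sum(estimate_tokens(part) for part in prompt_parts)
    return used + reserve_for_output <= num_ctx

# Stand-ins for one 200-line source file plus system prompt and tool definitions.
source_file = "x = compute(value)  # example line of code\n" * 200
scaffolding = "system prompt and tool definitions " * 300

print(fits_in_context([source_file, scaffolding], num_ctx=4096))    # False
print(fits_in_context([source_file, scaffolding], num_ctx=65536))   # True
```

Even this single-file stand-in overflows a 4K window before any conversation history or tool results are added, which is the silent-truncation scenario described above.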

The fix is a Modelfile that bakes the correct context size and other inference parameters into a named model variant. Create this file:

# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant tuned for Claude Code agentic sessions.
# Bakes context window, temperature, and system prompt into the model
# so every Claude Code session starts with the correct configuration.
#
# Build with:
#   mkdir -p ~/.ollama/Modelfiles
#   ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

FROM gemma4:26b

# Context window -- 65536 tokens (64K) is the tested-safe floor for real
# codebases without triggering swap on 16-18 GB VRAM systems.
# Increase to 131072 (128K) if you have headroom on 24 GB+ systems.
# Do not go above 131072 unless you have profiled your memory usage
# under load -- Ollama pre-allocates the full KV cache upfront.
PARAMETER num_ctx 65536

# Temperature -- 0.2 is deliberately low for agentic coding.
# Higher temperature introduces variability in tool call parameter
# formatting that causes Claude Code's tool validator to reject calls.
# For creative tasks, you would set this higher. For agentic loops: low.
PARAMETER temperature 0.2

# top_p -- nucleus sampling threshold. 0.9 keeps generation focused
# while avoiding the repetition loops that top_p=1.0 can produce on
# long agentic sessions.
PARAMETER top_p 0.9

# repeat_penalty -- penalizes the model for repeating tokens.
# 1.15 helps prevent tool call loops where Gemma 4 retries the same
# failed tool call with nearly identical parameters indefinitely.
PARAMETER repeat_penalty 1.15

# num_predict -- maximum tokens per response. 4096 is sufficient for
# most code patches. Increase to 8192 if you regularly generate
# large files in a single generation.
PARAMETER num_predict 4096

# System prompt -- reinforces coding agent behavior and explicit
# tool use discipline. Gemma 4 benefits from being reminded to
# commit to tool calls rather than describing what it would do.
SYSTEM """You are a senior software engineer working as a coding agent.

When working with code:
- Read files before modifying them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
  Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are unsure about the codebase structure, read more files
  rather than guessing.

Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""

 

Build the variant:

# Create the Modelfiles directory if it does not exist
mkdir -p ~/.ollama/Modelfiles

# Save the Modelfile content from above to this path, then build:
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Verify the variant was created
ollama list
# Should show gemma4-claude alongside gemma4:26b

# Quick smoke test -- confirm it loads and responds
ollama run gemma4-claude "What is the time complexity of binary search and why?"
# Expect a clear, concise technical response within a few seconds
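Beyond the smoke test, it can be worth confirming that the variant actually carries the overrides. The sketch below asks Ollama's /api/show endpoint for the model's parameters and parses them. It assumes the response contains a "parameters" string with one "name value" pair per line; the function names are illustrative, and only the standard library is used.

```python
import json
import urllib.request

def parse_parameters(parameters_text: str) -> dict[str, str]:
    """Parse whitespace-separated 'name value' lines into a dict."""
    params: dict[str, str] = {}
    for line in parameters_text.splitlines():
        parts = line.split(None, 1)   # split on the first run of whitespace
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params

def fetch_parameters(model: str, base_url: str = "http://localhost:11434") -> dict[str, str]:
    """Query /api/show for a model's baked-in parameters (assumed response shape)."""
    body = json.dumps({"model": model}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_parameters(json.load(response).get("parameters", ""))

# Offline demonstration of the parser on text shaped like the assumed response.
sample = "num_ctx                        65536\ntemperature                    0.2"
print(parse_parameters(sample))   # {'num_ctx': '65536', 'temperature': '0.2'}
```

With Ollama running, `fetch_parameters("gemma4-claude").get("num_ctx")` should then come back as "65536" if the Modelfile above was applied.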

 

Wiring Claude Code to the Local Model

 
With the model variant built, the configuration layer connects Claude Code to Ollama. Two environment variables are the core of this, and the remaining variables prevent the most common failure modes.

Point Claude Code at http://localhost:11434, not http://localhost:11434/v1. Claude Code speaks the Anthropic Messages API protocol and adds the /v1/messages path itself, so the base URL must be the root. The /v1 prefix is also where Ollama's OpenAI-compatible layer lives, and putting it in the base URL will produce authentication errors or unexpected behavior.
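The base URL rule is easier to see with the path composition written out. This sketch assumes the client appends /v1/messages to whatever base URL it is given, which matches how the verification script later in this article calls the root endpoint; the helper name is hypothetical.

```python
# Illustrates why the base URL should not already end in /v1.
# ASSUMPTION: the client appends "/v1/messages" to the configured base URL.

def messages_url(base_url: str) -> str:
    """Compose the Messages API URL from a configured base URL."""
    return base_url.rstrip("/") + "/v1/messages"

print(messages_url("http://localhost:11434"))     # http://localhost:11434/v1/messages
print(messages_url("http://localhost:11434/v1"))  # http://localhost:11434/v1/v1/messages
```

Under that assumption, the second form produces a doubled prefix that no endpoint serves.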

 

// Global Settings: ~/.claude/settings.json

This configuration applies to every Claude Code session across all projects. It is the right choice unless you are switching between local and cloud models frequently per project.

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",

    "ANTHROPIC_AUTH_TOKEN": "ollama",

    "ANTHROPIC_API_KEY": "",

    "ANTHROPIC_MODEL": "gemma4-claude",

    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",

    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

 

Why each variable matters:

  • ANTHROPIC_BASE_URL redirects all Claude Code API calls from Anthropic's servers to your local Ollama instance.
  • ANTHROPIC_AUTH_TOKEN must be set to any non-empty string; Ollama ignores the value but Claude Code requires the header to be present.
  • ANTHROPIC_API_KEY: "" explicitly empties the key so Claude Code cannot fall back to a real Anthropic API key if one happens to be set in your shell environment. Without this, a misconfigured ANTHROPIC_BASE_URL could silently fail over to the paid API.
  • ANTHROPIC_MODEL is the primary model name Claude Code sends in requests. Set this to your custom Modelfile variant, gemma4-claude, not gemma4:26b. The raw model tag does not carry the context window override.
  • ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally routes different task types to different model tiers. Setting all three to the same local model ensures every request lands at your Ollama instance regardless of which tier Claude Code internally selects.
  • CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" strips the Anthropic-specific beta headers that Claude Code adds to requests. Local inference servers do not recognize these headers and reject requests that include them. Setting this variable prevents that error without affecting any core Claude Code functionality.
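If you already have a settings.json with other keys, overwriting it by hand is error-prone. The following is a minimal sketch of merging the env block above into an existing file while leaving everything else untouched. The helper names and the sample "someOtherSetting" key are hypothetical; only the variable names and values come from this article.

```python
import json
from pathlib import Path

# The env block from the article.
LOCAL_ENV = {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
}

def merge_env(settings: dict, env: dict) -> dict:
    """Return a copy of settings with env merged into its 'env' key."""
    merged = dict(settings)
    merged["env"] = {**settings.get("env", {}), **env}
    return merged

def update_settings_file(path: Path, env: dict) -> None:
    """Merge env into the JSON file at path, creating the file if needed."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_env(existing, env), indent=2) + "\n")

# Offline demonstration: keys that were already present survive the merge.
existing = {"someOtherSetting": True, "env": {"EDITOR": "vim"}}
print(json.dumps(merge_env(existing, LOCAL_ENV), indent=2))
```

To apply it for real, call `update_settings_file(Path.home() / ".claude" / "settings.json", LOCAL_ENV)`, ideally after backing up the existing file.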

 

// Per-Project Configuration: .claude/settings.json

For projects where you want local inference isolated from your global setup (private repositories, sensitive codebases, or projects with specific model requirements), use a project-level settings file instead:

# In your project root
mkdir -p .claude

cat > .claude/settings.json << 'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
EOF

 

Claude Code reads the project-level .claude/settings.json when it exists, overriding global settings for that project. Add .claude/settings.json to your .gitignore if the settings contain anything environment-specific, or commit it if you want your entire team running local inference on that project.

 

// Verifying the Setup

Before running Claude Code against a real codebase, verify three things: Ollama is serving correctly, the model responds to API calls in the Anthropic Messages format, and tool calling specifically works. The third point is non-negotiable: tool calling is how Claude Code reads files, writes patches, and executes commands. A model that cannot format tool calls correctly will loop and fail on basic agentic tasks.

Prerequisites:

pip install httpx   # HTTP client used by the verification script

 

The full verification script:


#!/usr/bin/env python3
"""
verify_local_setup.py

Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
  1. Ollama health
  2. Model availability
  3. Basic Anthropic Messages API call
  4. Tool calling round-trip

Prerequisites:
  pip install httpx

How to run:
  python verify_local_setup.py

Expected output on a working setup (abridged):
  [PASS] Ollama is running on http://localhost:11434
  [PASS] Model 'gemma4-claude' is available
  [PASS] Anthropic Messages API call successful
  [PASS] Tool calling: model produced a valid tool_use block
  All 4 checks passed.
  Claude Code + Ollama + Gemma 4 is ready.
"""

import httpx
import json
import sys

# ── Configuration ─────────────────────────────────────────────────────────────
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME      = "gemma4-claude"   # Must match your Modelfile variant name
TIMEOUT         = 120.0             # Seconds -- generation can be slow on first call


def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responding.
    Hits the root endpoint, which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f"  [PASS] Ollama is running on {OLLAMA_BASE_URL}")
            return True
        else:
            print(f"  [FAIL] Unexpected response: {response.text[:100]}")
            return False
    except httpx.ConnectError:
        print(f"  [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print("         Is Ollama running? Try: ollama serve")
        return False
    except httpx.HTTPError as e:
        # Covers timeouts and other transport-level failures
        print(f"  [FAIL] Request to {OLLAMA_BASE_URL} failed: {e}")
        return False


def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint, which lists all local models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data     = response.json()
        models   = [m["name"] for m in data.get("models", [])]

        # Normalize: Ollama may add ":latest" if no tag was specified
        normalized = [m.split(":")[0] for m in models]

        if MODEL_NAME in models or MODEL_NAME in normalized:
            print(f"  [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f"  [FAIL] Model '{MODEL_NAME}' not found")
            print(f"         Available models: {', '.join(models) or 'none'}")
            print(f"         Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f"  [FAIL] Error checking model list: {e}")
        return False


def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies that the request format, model routing, and basic generation work.
    Uses the same /v1/messages path and request schema that Claude Code uses.
    Note: ANTHROPIC_BASE_URL stays at the root (http://localhost:11434);
    the /v1/messages path is appended to it for each request.
    """
    print("\nCheck 3: Anthropic Messages API call")

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",            # Required by the API spec; value ignored locally
        "anthropic-version": "2023-06-01"         # Required version header
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()

        # Anthropic Messages API response structure:
        # { "content": [{"type": "text", "text": "..."}], "stop_reason": "..." }
        content_blocks = data.get("content", [])
        text_blocks    = [b for b in content_blocks if b.get("type") == "text"]

        if not text_blocks:
            print(f"  [FAIL] No text content in response: {json.dumps(data, indent=2)}")
            return False

        response_text = text_blocks[0].get("text", "")
        print("  [PASS] Anthropic Messages API call successful")
        print(f"         Model response: {response_text[:80]}")
        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False


def check_tool_calling() -> bool:
    """
    Check 4: Verify tool calling works end-to-end.
    This is the most important check for Claude Code agentic use.
    Claude Code relies on the model correctly producing tool_use blocks
    for every file operation, shell command, and code execution.

    Sends a simple tool definition and a prompt that should trigger it.
    Verifies the model returns a tool_use block (not just text describing the call).
    """
    print("\nCheck 4: Tool calling verification")

    # A minimal tool definition using the Anthropic function calling schema
    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The absolute or relative file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    ]

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 256,
        "tools": tools,
        # Force the model to call a tool rather than answer in text.
        # tool_choice: {"type": "any"} requires some tool to be used.
        # Remove this if testing whether the model self-selects tools.
        "tool_choice": {"type": "any"},
        "messages": [
            {
                "role": "user",
                "content": "Read the file at /tmp/test.py and show me its contents."
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",
        "anthropic-version": "2023-06-01"
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data           = response.json()
        content_blocks = data.get("content", [])
        tool_blocks    = [b for b in content_blocks if b.get("type") == "tool_use"]

        if not tool_blocks:
            print("  [FAIL] Model did not produce a tool_use block")
            print("         This means tool calling is not working correctly.")
            print("         Agentic Claude Code sessions will fail on file operations.")
            print(f"         Full response: {json.dumps(data, indent=2)}")
            return False

        tool_call  = tool_blocks[0]
        tool_name  = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})

        print("  [PASS] Tool calling: model produced a valid tool_use block")
        print(f"         Tool called: {tool_name}")
        print(f"         Parameters:  {json.dumps(tool_input)}")

        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print("         Tool name and parameter are correct.")
        else:
            print("         WARNING: Unexpected tool name or missing 'path' parameter.")
            print("         The model called a tool but not the expected one.")

        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False


def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)

    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,
        check_tool_calling,
    ]

    # Run every check, even if an earlier one fails, to report all problems at once
    results = [check() for check in checks]

    print("\n" + "=" * 60)
    passed = sum(results)
    total  = len(results)

    if all(results):
        print(f"All {total} checks passed.")
        print("Claude Code + Ollama + Gemma 4 is ready.")
        print("\nLaunch with: claude")
        sys.exit(0)
    else:
        failed_checks = [i + 1 for i, r in enumerate(results) if not r]
        print(f"{passed}/{total} checks passed. Failed: {failed_checks}")
        print("Resolve the failures above before using Claude Code locally.")
        sys.exit(1)


if __name__ == "__main__":
    main()

 

How to run:

pip install httpx
python verify_local_setup.py

 

Agentic Task Walkthrough

 
With verification passing, here is what a real agentic session looks like. The task: take an existing Python module with no tests, analyze it, write a test suite, run the tests, and fix any failures.

# Navigate to a project directory
cd ~/projects/my-service

# Confirm Claude Code sees the local configuration
claude --version
# Verify it does not prompt for an Anthropic API key -- if it does,
# the settings.json is not being read correctly

# Start an agentic session
claude

# Inside Claude Code, give the agent a concrete task:
# > Analyze the UserService class in src/user_service.py.
# > Write a pytest test suite covering all public methods.
# > Run the tests and fix any failures.
# > The goal is a clean pytest run with no skips.

 

What the Claude Code tool call trace looks like during this session:

→ read_file("src/user_service.py")
  Reading 247 lines...

→ list_files("src/")
  Found: user_service.py, models.py, db.py, exceptions.py

→ read_file("src/models.py")
  Reading 89 lines...

→ write_file("tests/test_user_service.py", [test content])
  Written: 312 lines

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  Running 14 tests...
  FAILED tests/test_user_service.py::test_update_email_invalid
    AssertionError: Expected ValidationError, got None

→ read_file("src/user_service.py")  [targeted re-read of update_email method]
  ...

→ write_file("tests/test_user_service.py", [corrected test])
  Patched test_update_email_invalid assertion

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  14 passed in 1.23s

 

Gemma 4 handles this pattern reliably: reading files before modifying, running tests after changes, and diagnosing failures from error output rather than retrying blindly. Complex architectural decisions across many files are where cloud models still have an edge. For the task above (analysis, test generation, and targeted fixes), the local setup is fully capable.

What to watch for: If you see the agent produce "Invalid tool parameters" errors and then retry with the same parameters repeatedly, the temperature is too high, or the model is not using the gemma4-claude Modelfile variant. Both the temperature and the context window override are baked into the variant; the raw gemma4:26b tag does not carry them.
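When these errors show up, it helps to check a captured tool call against its schema yourself and see exactly what is malformed. The sketch below is a simplified stand-in, not Claude Code's actual validator: it checks only required fields, unknown fields, and top-level JSON types, using the read_file schema from the verification script. The function name is hypothetical.

```python
# Minimal tool-input check against a JSON-schema-style input_schema.
# NOTE: a simplified illustration, not Claude Code's real validation logic.

JSON_TYPES = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "object": dict, "array": list,
}

def validate_tool_input(tool_input: dict, input_schema: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the input looks valid."""
    problems = []
    properties = input_schema.get("properties", {})
    for field in input_schema.get("required", []):
        if field not in tool_input:
            problems.append(f"missing required field: {field}")
    for field, value in tool_input.items():
        if field not in properties:
            problems.append(f"unexpected field: {field}")
            continue
        type_name = properties[field].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected and not isinstance(value, expected):
            problems.append(f"{field}: expected {type_name}, got {type(value).__name__}")
    return problems

READ_FILE_SCHEMA = {
    "type": "object",
    "properties": {"path": {"type": "string"}},
    "required": ["path"],
}

print(validate_tool_input({"path": "/tmp/test.py"}, READ_FILE_SCHEMA))  # []
print(validate_tool_input({"file": "/tmp/test.py"}, READ_FILE_SCHEMA))  # wrong field name
print(validate_tool_input({"path": {"value": "x"}}, READ_FILE_SCHEMA))  # nested object, not a string
```

If the same problem appears on every retry, the model is repeating itself rather than reading the error, which points at the temperature or the missing Modelfile variant.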

 

// What Breaks and How to Fix It

  1. Tool Parameter Formatting Errors

    • Symptom: Claude Code reports Invalid tool parameters repeatedly. The agent apologizes and retries with identical or nearly identical parameters, then loops.
    • Cause: This is documented in the Ollama GitHub issues. The model produces tool call JSON that does not match the schema Claude Code expects. Most commonly: wrong field names, missing required fields, or nested objects where scalars are expected.
    • Fix: Confirm you are running gemma4-claude (the Modelfile variant), not gemma4:26b directly. The temperature of 0.2 and the system prompt in the Modelfile significantly reduce this. If the issue persists, drop the temperature to 0.1 in the Modelfile and rebuild.
  2. Context Window Swapping to Disk

    • Symptom: Generation slows to a crawl after several turns. ollama ps shows GPU utilization dropping. The OS is paging the KV cache to disk.
    • Fix:
      # Option 1: Reduce the context window in the Modelfile
      # Edit ~/.ollama/Modelfiles/gemma4-claude
      # Change: PARAMETER num_ctx 65536
      # To:     PARAMETER num_ctx 32768
      # Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

      # Option 2: Enable KV cache quantization to reduce the memory footprint
      # KV cache quantization requires flash attention; enable it if it is not already on
      export OLLAMA_FLASH_ATTENTION=1
      export OLLAMA_KV_CACHE_TYPE=q8_0
      # This quantizes the KV cache itself, reducing memory at a small quality cost
      # Restart Ollama after setting this: pkill ollama && ollama serve

  3. Model Unloading Between Agent Turns

    • Symptom: Noticeable cold-start delay at the beginning of each Claude Code message. Ollama is unloading the model after an inactivity timeout and reloading it for each request.
    • Fix:
      # Keep the model loaded indefinitely during your work session
      # (set this in the environment of the Ollama server process, then restart it)
      export OLLAMA_KEEP_ALIVE=-1

      # Or set it in your shell profile for permanent effect
      echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc

      # Alternatively, use the Ollama API to pin the model
      curl http://localhost:11434/api/generate \
        -d '{"model": "gemma4-claude", "keep_alive": -1}'
      # This pins the model until you explicitly unload it or restart Ollama

  4. Beta Header Rejection Errors

    • Symptom: Claude Code produces Unexpected value(s) for the anthropic-beta header errors on launch or mid-session.
    • Fix: Confirm CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" is in your settings.json. If you set it via shell export instead of settings.json, verify it is exported in the same shell session where claude is running:
      echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
      # Must print: 1
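For the context-window swapping case (item 2), a rough KV cache estimate shows why halving num_ctx or quantizing the cache helps. The formula is the standard one for a plain transformer: keys and values, per layer, per KV head, per position. The architecture numbers below are hypothetical placeholders, not Gemma 4's real configuration, and models that use sliding-window attention on some layers will need less than this estimate.

```python
# KV cache size estimate for a plain transformer.
# ASSUMPTION: the layer/head/dim values are hypothetical, NOT Gemma 4's real config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_element: float) -> float:
    """Size of the key + value cache in GiB for a full context of n_ctx tokens."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element
    return total_bytes / 2**30

HYPOTHETICAL_ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

scenarios = [
    ("64K context, f16 cache", 65536, 2),
    ("32K context, f16 cache", 32768, 2),
    ("64K context, ~8-bit cache", 65536, 1),   # q8_0 is roughly one byte per element
]
for label, n_ctx, bytes_per_element in scenarios:
    size = kv_cache_gib(n_ctx=n_ctx, bytes_per_element=bytes_per_element, **HYPOTHETICAL_ARCH)
    print(f"{label}: {size:.1f} GiB")
# Prints 12.0, 6.0, and 6.0 GiB for these placeholder numbers.
```

Because the cache scales linearly with both context length and bytes per element, either fix from item 2 roughly halves it, and applying both together would roughly quarter it.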

       

Wrapping Up

 
The stack described in this article is not a proof of concept. It is a working production configuration that engineers have been running daily since Ollama added Anthropic Messages API support in January 2026. The Modelfile is not optional; it is the difference between a tool that works and one that silently produces incomplete outputs on multi-file tasks. The verification script catches configuration issues before they surface mid-session as confusing agent failures.

The setup built in this article is a private, zero-per-token-cost coding agent that handles the majority of daily engineering tasks (code analysis, test generation, targeted refactoring, and debugging) at generation speeds that are usable on modern hardware.

This setup is not a replacement for cloud inference on complex architectural reasoning across large codebases or SWE-bench class tasks that require deep repository understanding at scale.
 
 

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.

