Wednesday, July 1, 2026

Building Local AI Systems: Qwen3.6 + MCPs

 

Introducing MCP

 
Every developer building with local AI hits the same wall eventually. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.

The Model Context Protocol (MCP) was designed to solve exactly this. It’s an open standard by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.

Qwen3.6-35B-A3B is the most capable local model for this kind of work right now. It has a 262,144-token context window, a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass (which is why it fits on hardware that shouldn’t be able to run a 35B model), and was explicitly trained and evaluated on MCP-based agentic tasks.

This article builds a local GitHub developer assistant: an agent that reads a repository’s open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.

 

Understanding Qwen3.6-35B-A3B

 
Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.

The name encodes the key fact: 35B total parameters, with A3B meaning 3B activated per forward pass. It’s an MoE model with 256 experts per layer, routing each token to 8 experts plus 1 shared expert. You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model. That trade-off is why it fits on hardware that would collapse under a dense 35B.
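
To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token in plain Python. The expert counts (256 routed, 8 selected, 1 shared) come from the paragraph above. The random router scores, the route_token helper, and the softmax over only the selected experts are illustrative assumptions, not Qwen3.6's actual router.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per layer (from the article)
TOP_K = 8           # routed experts activated per token (from the article)
NUM_SHARED = 1      # shared expert that is always active (from the article)


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k highest-scoring experts and normalise their weights.

    Hypothetical helper: a softmax over only the selected logits is one
    common convention and is assumed here purely for illustration.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)          # subtract max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


rng = random.Random(0)                                     # fake router scores for the demo
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_token(logits)

# Fraction of expert blocks that do any work for this token
active_fraction = (TOP_K + NUM_SHARED) / (NUM_EXPERTS + NUM_SHARED)
print(f"experts used for this token: {sorted(weights)}")
print(f"fraction of expert blocks active: {active_fraction:.1%}")
```

Only the selected experts run for a given token, which is where the compute saving comes from; every expert still has to be held in memory.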

The hidden architecture is where Qwen3.6 diverges most from other MoE models. Each block in the 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.

The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it’s an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Most 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
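
One way to treat context as an operational constraint is to track a budget explicitly. The sketch below is a rough illustration: the ContextBudget class and estimate_tokens helper are hypothetical names, and the 4-characters-per-token ratio is a crude assumption, so a real agent would count tokens with the model's tokenizer instead.

```python
from dataclasses import dataclass, field

NATIVE_CONTEXT = 262_144   # tokens, from the article
YARN_CONTEXT = 1_010_000   # tokens with YaRN scaling, from the article


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough estimate; the chars-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


@dataclass
class ContextBudget:
    limit: int = NATIVE_CONTEXT
    reserve_for_output: int = 4_096          # headroom kept free for the next reply
    used: int = 0
    log: list[tuple[str, int]] = field(default_factory=list)

    def add(self, label: str, text: str) -> bool:
        """Record a message; return False if it would not fit in the window."""
        cost = estimate_tokens(text)
        if self.used + cost + self.reserve_for_output > self.limit:
            return False
        self.used += cost
        self.log.append((label, cost))
        return True

    @property
    def remaining(self) -> int:
        return self.limit - self.used - self.reserve_for_output


budget = ContextBudget()
budget.add("system", "You are a senior software engineer." * 10)
budget.add("tool:read_file", "x" * 40_000)   # stands in for a large source file
print(f"used={budget.used} remaining={budget.remaining}")
print(f"YaRN extension factor: {YARN_CONTEXT / NATIVE_CONTEXT:.2f}x")
```

When add returns False, the agent can truncate or summarise the tool result rather than overflow the window and lose earlier history.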

Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:

  • Agentic Coding. Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.
  • Thinking Preservation. A preserve_thinking flag that keeps reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, preserve_thinking=True keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.
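
At the message-history level, thinking preservation comes down to whether earlier reasoning blocks are kept or stripped when the next prompt is assembled. The sketch below illustrates that difference on the client side only. It assumes reasoning is wrapped in <think>...</think> tags, and build_history is a hypothetical helper, not the model's chat template or a library API.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def build_history(messages: list[dict], preserve_thinking: bool) -> list[dict]:
    """Return the history to send on the next turn.

    With preserve_thinking=False, reasoning blocks are stripped from earlier
    assistant turns; with True they are passed through unchanged.
    """
    if preserve_thinking:
        return [dict(m) for m in messages]
    cleaned = []
    for message in messages:
        message = dict(message)              # copy so the caller's history is untouched
        if message["role"] == "assistant":
            message["content"] = THINK_RE.sub("", message["content"]).strip()
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "Fix the login bug."},
    {"role": "assistant",
     "content": "<think>Plan: read auth.py, then the tests.</think>Reading auth.py first."},
]

kept = build_history(history, preserve_thinking=True)
stripped = build_history(history, preserve_thinking=False)
print(kept[1]["content"])
print(stripped[1]["content"])
```

Keeping the prefix of the conversation byte-identical across turns is also what lets a prefix cache reuse earlier work.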

 

System Requirements

 
There are three realistic deployment paths, and which one you use depends entirely on your hardware.

  • GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires roughly 70 GB VRAM. In Q4 quantization, it fits in roughly 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
  • CPU/Hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to GPU when available and runs the rest on CPU. With 64 GB system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.
  • Smaller models for tutorial testing. The entire MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use Qwen/Qwen2.5-7B-Instruct via Ollama (ollama pull qwen2.5:7b) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware allows.
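
The VRAM figures above follow from simple arithmetic on the weight count. The sketch below reproduces them. The 4.5 effective bits per parameter for Q4 is an assumption (4-bit values plus per-group scales), and weight_memory_gb is a hypothetical helper that covers weights only.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# All 35B parameters must be resident, even though only ~3B are active per token.
TOTAL_PARAMS_B = 35.0

bf16 = weight_memory_gb(TOTAL_PARAMS_B, 16)
q4 = weight_memory_gb(TOTAL_PARAMS_B, 4.5)   # assumed effective bit width for a Q4 scheme

print(f"bfloat16 weights: {bf16:.1f} GB")
print(f"Q4 weights:       {q4:.1f} GB")
```

The KV cache and activations come on top of the weights, which is why the practical Q4 footprint lands in the 20–24 GB range rather than at the bare weight size.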

Software requirements:

# Python 3.11+ required
python --version

python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate    # macOS / Linux
qwen-mcp-env\Scripts\activate       # Windows

# Core packages
pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"

# Serving framework -- choose one
pip install "vllm>=0.19.0"       # NVIDIA GPU
pip install "sglang>=0.5.10"     # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"      # CPU/hybrid

# Node.js 18+ is required for pre-built MCP servers installed via npx
node --version

 

 

Serving Qwen3.6 Locally with an OpenAI-Compatible API

 
Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to: the same API surface, just pointed at localhost instead of api.openai.com.

 

// SGLang (Recommended for Long-Context Agent Workloads)

# Install SGLang with full dependencies
pip install "sglang[all]>=0.5.10"

# Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
# --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
# --tool-call-parser qwen3_coder routes tool call outputs to the correct format.
# --enable-prefix-caching is critical for agent workloads -- enables KV cache reuse
#   across turns, which is what makes preserve_thinking efficient in practice.

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 30000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --tp 2    # tensor parallel across 2 GPUs; remove if using a single GPU

 

// vLLM

pip install "vllm>=0.19.0"

# vLLM equivalent with the same critical flags
vllm serve Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching-v2 \
    --tensor-parallel-size 2

 

// Smaller Model via Ollama

ollama pull qwen2.5:7b
ollama serve
# Ollama's API is OpenAI-compatible at http://localhost:11434/v1

 

Once the server is running, verify it before going any further:

# Health check -- should return {"status": "ok"} or similar
curl http://localhost:30000/health

# Test the chat completions endpoint with a simple query
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'

 

If you get a JSON response with a choices array, the server is ready. Don’t proceed to MCP setup until this works. Every integration failure you’ll encounter later is easier to debug when you know the serving layer is solid.
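
The same readiness check can be scripted so it runs before the agent starts. This is a minimal sketch using only the standard library; build_probe, looks_ready, and probe_server are hypothetical helper names, and the check assumes the usual OpenAI-style response shape with a choices array.

```python
import json
import urllib.request


def build_probe(model: str) -> dict:
    """Request body for a minimal chat-completions probe."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }


def looks_ready(payload: dict) -> bool:
    """True if the response has a non-empty choices array containing a message."""
    choices = payload.get("choices")
    return isinstance(choices, list) and len(choices) > 0 and "message" in choices[0]


def probe_server(base_url: str, model: str, timeout: float = 30.0) -> bool:
    """POST the probe to {base_url}/chat/completions and validate the reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_probe(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return looks_ready(json.load(resp))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body
        return False
```

Calling probe_server("http://localhost:30000/v1", "Qwen/Qwen3.6-35B-A3B") returns True only when the server answers with a well-formed completion, so the agent script can fail fast otherwise.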

 

Understanding MCP and Why It Changes the Agent Architecture

 
Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a class of bugs that come from treating MCP as just a fancier function-calling API.

MCP is a JSON-RPC 2.0 protocol running over stdio or HTTP transport. When an MCP client connects to a server, the first thing it does after the initialize handshake is call tools/list to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It’s the model’s contract with the tool.
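
The discovery exchange is easier to picture as data. The sketch below shows the shape of a tools/list request and response as Python dicts, plus a conversion into the OpenAI-style function tool format the serving API expects. The read_file tool is an invented example, and to_openai_tools is a hypothetical helper.

```python
import json

# JSON-RPC 2.0 request the client sends to discover tools.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Shape of a tools/list response. The read_file tool is a made-up example.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "read_file",
                "description": "Read a UTF-8 text file and return its contents.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            }
        ]
    },
}


def to_openai_tools(response: dict) -> list[dict]:
    """Convert MCP tool descriptors into OpenAI-style function tool entries."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool.get("description", ""),
                "parameters": tool["inputSchema"],   # JSON Schema passes through unchanged
            },
        }
        for tool in response["result"]["tools"]
    ]


print(json.dumps(to_openai_tools(list_response), indent=2))
```

The input schema is passed through untouched, which is why a vague or wrong schema on the server side shows up directly as malformed tool calls from the model.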

When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, actually executes the call by sending a tools/call request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a tool role message. The model reads the result and decides the next step.

This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
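
The three roles can be shown without any network or model at all. In the toy sketch below, SERVER_TOOLS plays the server, fake_model plays the model, and client_loop plays the MCP client. All three names and both demo tools are invented for illustration; the point is only who decides, who executes, and who does the work.

```python
from typing import Callable

# Server role: owns the real work. These two tools are invented for the demo.
SERVER_TOOLS: dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
    "shout": lambda text: text.upper(),
}


def fake_model(messages: list[dict]) -> dict:
    """Model role: decide what to call. Requests one tool call, then answers."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool_call": {"id": "call_1", "name": "add", "arguments": {"a": 2, "b": 3}}}
    return {"content": f"The sum is {tool_results[-1]['content']}."}


def client_loop(task: str, max_turns: int = 5) -> str:
    """Client role: run tool calls against the server and feed results back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]                      # no tool requested: final answer
        handler = SERVER_TOOLS.get(call["name"])
        if handler is None:
            result = f"Error: unknown tool {call['name']}"
        else:
            result = handler(**call["arguments"])
        messages.append({"role": "assistant", "content": "", "tool_call": call})
        messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    return "max_turns reached"


print(client_loop("What is 2 + 3?"))
```

Swapping fake_model for a real chat-completions call and SERVER_TOOLS for an MCP session changes the plumbing but not the division of responsibility.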

There are two ways to use MCP with Qwen3.6:

  • Via Qwen-Agent: the official qwen_agent library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
  • Via the MCP Python SDK directly: you handle the agentic loop yourself using mcp.ClientSession. More code, full visibility into every message, full control over error handling and retry logic. Right for production systems where you need to monitor every step.

This article covers both, starting with Qwen-Agent.

 

Building the Local GitHub Developer Assistant

 
The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.

 

// Part 1: Environment and MCP Server Setup

# Set your GitHub personal access token
# Required by the GitHub MCP server for API calls
export GITHUB_TOKEN=ghp_your_token_here

# Pre-built MCP servers install via npx -- no separate install step
# npx handles this on first use when the agent starts the servers
# Verify npx is available:
npx --version

 

Create a project directory:

mkdir qwen-github-agent
cd qwen-github-agent

 

// Part 2: Qwen-Agent Implementation

The fastest path to a working agent. Qwen-Agent handles the full loop automatically.

# github_agent_qwenagent.py
# Prerequisites: pip install qwen-agent openai
#   npm / npx must be installed for the MCP servers
#   GITHUB_TOKEN env var must be set
#   Local serving endpoint must be running (see previous section)
#
# How to run:
#   python github_agent_qwenagent.py

from qwen_agent.agents import Assistant

# ── Server configuration ──────────────────────────────────────────────────────

# Point at your local serving endpoint.
# Change model_server to match whichever server you started:
#   SGLang:  http://localhost:30000/v1
#   vLLM:    http://localhost:8000/v1
#   Ollama:  http://localhost:11434/v1
LLM_CONFIG = {
    "model":        "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key":      "EMPTY",           # Local servers don't require a real key

    # Thinking mode sampling params (from the official model card best practices)
    "generate_cfg": {
        "temperature":       0.6,
        "top_p":             0.95,
        "top_k":             20,
        "min_p":             0.0,
        "thought_in_history": True,   # This is the preserve_thinking flag in Qwen-Agent
    },
}

# ── MCP server configuration ──────────────────────────────────────────────────
# Each server key names the server; the value is the stdio launch command.
# Qwen-Agent starts each server as a subprocess and manages the MCP sessions.

import os  # used below to read the token from the environment

MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                # Grant the agent access to the current working directory
                # In production, restrict to the specific repository path
                "."
            ]
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {
                # A "${GITHUB_TOKEN}" string is not expanded inside a Python dict,
                # so read the value explicitly. The reference GitHub server reads
                # GITHUB_PERSONAL_ACCESS_TOKEN; GITHUB_TOKEN is passed as well.
                "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
                "GITHUB_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
            }
        },
    }
}

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository
via MCP tools.

When given a repository and task:
1. List open issues to understand what needs fixing
2. Use filesystem tools to read relevant source files and tests
3. Identify the root cause based on the code and the issue description
4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
5. Create a pull request with a clear title and description referencing the issue

Always explain your reasoning at each step. Think through edge cases before writing code.
If you are unsure about a file's purpose, read it before modifying it."""

# ── Agent setup ───────────────────────────────────────────────────────────────

agent = Assistant(
    llm=LLM_CONFIG,
    name="GitHub Developer Assistant",
    description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
    system_message=SYSTEM_PROMPT,
    function_list=[MCP_SERVERS],   # Qwen-Agent takes MCP configs as function_list entries
)

# ── Run the agent ─────────────────────────────────────────────────────────────

def run_agent(task: str):
    """
    Run the agent on a task description and stream the output.
    The agent will make tool calls automatically; Qwen-Agent handles
    the full loop including tool execution and result injection.
    """
    import re  # local import keeps this snippet self-contained

    messages = [{"role": "user", "content": task}]

    print(f"Task: {task}\n{'─' * 70}")

    # Qwen-Agent's run() is a generator that yields intermediate steps
    # Each yielded message shows a tool call, a tool result, or the final answer
    for response in agent.run(messages=messages):
        # response is a list of messages representing the conversation so far
        # The last message contains the latest output
        last    = response[-1]
        role    = last.get("role", "")
        content = last.get("content", "")

        if role == "assistant" and content:
            # Strip and display the thinking block separately for readability
            thinking = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
            if thinking:
                print(f"[thinking] {thinking.group(1).strip()[:200]}...")
            clean = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
            if clean:
                print(f"[agent] {clean}")

        elif role in ("tool", "function"):   # Qwen-Agent labels tool results "function"
            tool_name = last.get("name", "unknown_tool")
            print(f"[tool:{tool_name}] result received")


if __name__ == "__main__":
    run_agent(
        "In the repository myorg/my-api-project, find the open issue about "
        "the login endpoint returning 200 for invalid tokens. Read the relevant "
        "code and tests, fix the bug, and open a pull request."
    )

 

How to run:

python github_agent_qwenagent.py

 

// Part 3: Raw MCP SDK Implementation

For teams who need full control over every protocol message, custom error handling, per-tool retry logic, and audit logging of every tool call and result:

# github_agent_raw.py
# Prerequisites: pip install mcp openai httpx
#   GITHUB_TOKEN env var must be set, local server must be running
#
# How to run:
#   python github_agent_raw.py

import asyncio
import json
import os
import re
from openai import AsyncOpenAI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# ── Local serving client ──────────────────────────────────────────────────────
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

MODEL = "Qwen/Qwen3.6-35B-A3B"

# ── Response processing ───────────────────────────────────────────────────────

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks. Used when we only need the action."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def extract_thinking(text: str) -> str:
    """Extract the content of the thinking block for logging."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def process_response(response, preserve_thinking: bool = True) -> dict:
    """
    Process a chat completion response from Qwen3.6.

    Handles two output formats:
    1. Tool call via the API's function_call / tool_calls field (when --tool-call-parser is active)
    2. Tool call embedded in the message content as JSON

    Args:
        response:          The OpenAI-compatible completion response
        preserve_thinking: If True, keep thinking content in output for
                           the next turn's KV cache benefit

    Returns:
        dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
    """
    choice  = response.choices[0]
    message = choice.message

    # Path 1: Tool calls in the structured field (preferred -- requires tool-call-parser flag)
    if message.tool_calls:
        tool_calls = [
            {
                "name":      tc.function.name,
                "arguments": json.loads(tc.function.arguments),
                "call_id":   tc.id,
            }
            for tc in message.tool_calls
        ]
        thinking = extract_thinking(message.content or "")
        return {
            "thinking":       thinking if preserve_thinking else "",
            "tool_calls":     tool_calls,
            "final_answer":   "",
            "has_tool_calls": True,
            "is_terminal":    False,
        }

    # Path 2: Tool calls embedded in content text (fallback)
    content = message.content or ""
    tag_matches = re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL)
    tool_calls = []
    for m in tag_matches:
        try:
            tool_calls.append(json.loads(m.strip()))
        except json.JSONDecodeError:
            pass

    thinking     = extract_thinking(content)
    final_answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    final_answer = re.sub(r"<tool_call>.*?</tool_call>", "", final_answer, flags=re.DOTALL).strip()

    return {
        "thinking":       thinking if preserve_thinking else "",
        "tool_calls":     tool_calls,
        "final_answer":   final_answer,
        "has_tool_calls": len(tool_calls) > 0,
        "is_terminal":    len(tool_calls) == 0 and bool(final_answer),
    }

# ── Core agent loop ───────────────────────────────────────────────────────────

async def run_github_agent(task: str, repo: str, max_turns: int = 20):
    """
    Run the GitHub developer assistant agent.

    Connects to filesystem and GitHub MCP servers, discovers their tools,
    and runs the Qwen3.6 agent loop until the task is complete or max_turns reached.
    """
    # Start both MCP servers and establish sessions
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "."],
    )
    github_token = os.environ.get("GITHUB_TOKEN", "")
    gh_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
        # The reference GitHub server reads GITHUB_PERSONAL_ACCESS_TOKEN
        env={**os.environ,
             "GITHUB_TOKEN": github_token,
             "GITHUB_PERSONAL_ACCESS_TOKEN": github_token},
    )

    async with stdio_client(fs_params) as (fs_read, fs_write), \
               ClientSession(fs_read, fs_write) as fs_session, \
               stdio_client(gh_params) as (gh_read, gh_write), \
               ClientSession(gh_read, gh_write) as gh_session:

        # Initialize both sessions
        await fs_session.initialize()
        await gh_session.initialize()

        # Discover all available tools from both servers
        fs_tools_result = await fs_session.list_tools()
        gh_tools_result = await gh_session.list_tools()

        # Build the OpenAI-format tool list for the model
        all_tools = []
        tool_to_session = {}   # Maps tool name to the MCP session that owns it

        for tool in fs_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = fs_session

        for tool in gh_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = gh_session

        print(f"Tools available: {len(all_tools)} ({len(fs_tools_result.tools)} filesystem, "
              f"{len(gh_tools_result.tools)} GitHub)")

        # Build conversation history
        system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
Use the available tools to investigate issues, read code, write fixes, and create pull requests.
Think step by step. Read before you modify. Minimal changes only."""

        messages = [
            {"role": "system",  "content": system_prompt},
            {"role": "user",    "content": task},
        ]

        # ── Agent loop ─────────────────────────────────────────────────────────
        for turn in range(max_turns):
            print(f"\n[Turn {turn + 1}]")

            # Call the model
            response = await client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=all_tools if all_tools else None,
                tool_choice="auto",
                # Thinking mode sampling params from the official best practices
                temperature=0.6,
                top_p=0.95,
                max_tokens=4096,
                extra_body={
                    # top_k and min_p are not OpenAI SDK arguments, so they are
                    # passed through extra_body to the local server
                    "top_k": 20,
                    "min_p": 0.0,
                    # preserve_thinking keeps reasoning context across turns
                    # for KV cache efficiency on long agent sessions
                    "preserve_thinking": True,
                }
            )

            result = process_response(response, preserve_thinking=True)

            if result["thinking"]:
                print(f"[thinking] {result['thinking'][:200]}...")

            # Terminal state -- agent has produced a final answer
            if result["is_terminal"]:
                print(f"\n[DONE]\n{result['final_answer']}")
                return result["final_answer"]

            # Tool call state -- execute each tool and inject results
            if result["has_tool_calls"]:
                # Append the assistant's message with tool calls to history
                messages.append({
                    "role":       "assistant",
                    "content":    response.choices[0].message.content or "",
                    "tool_calls": response.choices[0].message.tool_calls or [],
                })

                for call in result["tool_calls"]:
                    tool_name = call["name"]
                    tool_args = call.get("arguments", {})
                    call_id   = call.get("call_id", "")

                    print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")

                    session = tool_to_session.get(tool_name)
                    if not session:
                        result_content = f"Error: tool '{tool_name}' not found"
                    else:
                        try:
                            tool_result = await session.call_tool(tool_name, tool_args)
                            # content is a list of blocks; join text blocks into one string
                            result_content = "\n".join(
                                getattr(block, "text", str(block))
                                for block in tool_result.content
                            )
                            # Truncate very long results to protect context budget
                            if len(result_content) > 12000:
                                result_content = result_content[:12000] + "\n...[truncated]"
                        except Exception as e:
                            result_content = f"Error: {e}"

                    print(f"[result] {result_content[:150]}...")

                    messages.append({
                        "role":         "tool",
                        "content":      result_content,
                        "tool_call_id": call_id,
                        "name":         tool_name,
                    })

        print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")


# ── Entry point ───────────────────────────────────────────────────────────────

if __name__ == "__main__":
    asyncio.run(run_github_agent(
        task=(
            "Find the open issue about the login endpoint returning 200 for invalid tokens. "
            "Read src/auth.py and tests/test_auth.py to understand the bug. "
            "Fix the verify_token function and open a pull request with your changes."
        ),
        repo="myorg/my-api-project",
    ))

 

How to run:

python github_agent_raw.py

 

The raw SDK path gives you what Qwen-Agent abstracts: you can see every tool call, every result, and every message injected into the conversation history. The tool_to_session routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.
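
One caveat with a flat routing dict keyed by tool name is that two servers exposing the same name would silently overwrite each other. The sketch below shows one possible way to guard against that by prefixing colliding names with the server name. build_routing_table and the "server__tool" naming scheme are hypothetical, and the tool names are illustrative rather than the real servers' tool lists.

```python
def build_routing_table(servers: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Map an exposed tool name to (server_name, original_tool_name).

    If a tool name is already taken by an earlier server, the later one is
    exposed as "<server>__<tool>" so that no entry is overwritten.
    """
    table: dict[str, tuple[str, str]] = {}
    for server_name, tool_names in servers.items():
        for tool_name in tool_names:
            exposed = tool_name if tool_name not in table else f"{server_name}__{tool_name}"
            table[exposed] = (server_name, tool_name)
    return table


# Tool names below are illustrative, not the real servers' tool lists.
routing = build_routing_table({
    "filesystem": ["read_file", "search"],
    "github": ["create_pull_request", "search"],
})

for exposed, (server, original) in routing.items():
    print(f"{exposed:28s} -> {server}.{original}")
```

The exposed name is what goes into the tool list sent to the model; the stored original name is what gets passed to the owning session when the call is executed.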

 

Writing a Custom MCP Server

 
Pre-built MCP servers handle the filesystem and GitHub. When you need something that doesn’t exist (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a full code_quality server that exposes ruff and pytest as MCP tools.

# code_quality_server.py
# A custom MCP server exposing code quality tools to Qwen3.6.
#
# Prerequisites:
#   pip install mcp ruff pytest pytest-json-report
#
# How to run standalone (for testing):
#   python code_quality_server.py
#
# To add to the Qwen-Agent config:
#   "code_quality": {
#       "command": "python",
#       "args": ["/absolute/path/to/code_quality_server.py"]
#   }

import asyncio
import json
import subprocess
import sys
from mcp.server.fastmcp import FastMCP

# FastMCP is a high-level MCP server framework -- reduces boilerplate considerably
mcp = FastMCP("code_quality")


@mcp.tool()
def run_linter(file_path: str, fix: bool = False) -> str:
    """
    Run ruff linter on a Python file and return structured lint results.
    Use this before modifying a file to understand its current quality state,
    and after making changes to verify the fix didn't introduce new issues.

    Args:
        file_path: Absolute or relative path to the Python file to lint.
        fix:       If true, automatically fix safe issues in place.

    Returns:
        JSON string with issues list, issue count, and files changed.
    """
    # sys.executable runs ruff with the same interpreter as this server
    cmd = [sys.executable, "-m", "ruff", "check", file_path, "--output-format=json"]
    if fix:
        cmd.append("--fix")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        # ruff returns exit code 1 when issues are found -- not an error
        output = result.stdout or result.stderr

        # Parse ruff's JSON output
        try:
            issues = json.loads(output) if output.strip() else []
        except json.JSONDecodeError:
            issues = []

        formatted = [
            {
                "line":    issue.get("location", {}).get("row", 0),
                "col":     issue.get("location", {}).get("column", 0),
                "code":    issue.get("code", ""),
                "message": issue.get("message", ""),
                "fix_available": issue.get("fix") is not None,
            }
            for issue in issues
            if isinstance(issue, dict)
        ]

        return json.dumps({
            "file":         file_path,
            "issues":       formatted,
            "total_issues": len(formatted),
            "fixed":        "auto-fix applied" if fix else "no auto-fix",
        }, indent=2)

    except subprocess.TimeoutExpired:
        return json.dumps({"error": "Linter timed out after 30s", "file": file_path})
    except FileNotFoundError:
        return json.dumps({"error": "ruff not found -- install with: pip install ruff"})


@mcp.tool()
def run_tests(target: str, verbose: bool = False) -> str:
    """
    Run pytest on a module or directory and return structured pass/fail results.
    Use this after writing a fix to verify the fix makes failing tests pass
    without breaking other tests.

    Args:
        target:  Path to the test file or directory to run (e.g. tests/, tests/test_auth.py)
        verbose: If true, include full pytest output in the result.

    Returns:
        JSON string with pass count, fail count, failure details, and duration.
    """
    import os
    import tempfile

    # pytest-json-report writes its report to a file, so use a temporary path
    fd, report_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    cmd = [sys.executable, "-m", "pytest", target, "--json-report",
           f"--json-report-file={report_path}", "-q"]
    if verbose:
        cmd.append("-v")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)

        # Parse pytest-json-report output if available
        try:
            with open(report_path, encoding="utf-8") as fh:
                report = json.load(fh)
            summary  = report.get("summary", {})
            failures = [
                {
                    "test":    t["nodeid"],
                    "message": str(t.get("call", {}).get("longrepr", ""))[:500],
                }
                for t in report.get("tests", [])
                if t.get("outcome") == "failed"
            ]
            return json.dumps({
                "target":   target,
                "passed":   summary.get("passed", 0),
                "failed":   summary.get("failed", 0),
                "errors":   summary.get("error", 0),
                "total":    summary.get("total", 0),
                "duration": report.get("duration", 0),
                "failures": failures,
                "stdout":   result.stdout[:2000] if verbose else "",
            }, indent=2)

        except (json.JSONDecodeError, OSError):
            # Fallback: return raw output if JSON report not available
            return json.dumps({
                "target":  target,
                "stdout":  result.stdout[:3000],
                "stderr":  result.stderr[:1000],
                "exit_code": result.returncode,
            })

    except subprocess.TimeoutExpired:
        return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})
    except FileNotFoundError:
        return json.dumps({"error": "pytest not found -- install with: pip install pytest"})
    finally:
        # Clean up the temporary report file
        if os.path.exists(report_path):
            os.remove(report_path)


if __name__ == "__main__":
    mcp.run(transport="stdio")
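The report-parsing step inside run_tests can be checked on its own, without launching pytest. The sketch below is a minimal version of that step, assuming the report follows the pytest-json-report layout (a `summary` object plus a `tests` list whose entries carry `nodeid`, `outcome`, and a `call` stage). The `summarize_report` helper and the sample report are hypothetical and exist only for illustration.

```python
import json


def summarize_report(report: dict, max_msg_len: int = 500) -> dict:
    """Reduce a pytest-json-report style dict to counts plus failure details."""
    summary = report.get("summary", {})
    failures = [
        {
            "test": t["nodeid"],
            # Truncate long tracebacks so they do not flood the model's context
            "message": str(t.get("call", {}).get("longrepr", ""))[:max_msg_len],
        }
        for t in report.get("tests", [])
        if t.get("outcome") == "failed"
    ]
    return {
        "passed": summary.get("passed", 0),
        "failed": summary.get("failed", 0),
        "total": summary.get("total", 0),
        "failures": failures,
    }


# Hypothetical report, shaped like pytest-json-report output
sample_report = {
    "summary": {"passed": 1, "failed": 1, "total": 2},
    "tests": [
        {"nodeid": "tests/test_auth.py::test_login", "outcome": "passed"},
        {
            "nodeid": "tests/test_auth.py::test_logout",
            "outcome": "failed",
            "call": {"longrepr": "AssertionError: expected 200"},
        },
    ],
}

print(json.dumps(summarize_report(sample_report), indent=2))
```

Keeping the parsing in a pure function like this makes it easy to unit-test the tool's output shape before wiring it into the MCP server.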

 

Add the server to either agent implementation's server config:

# In Qwen-Agent MCP_SERVERS dict:
"code_quality": {
    "command": "python",
    "args": ["/absolute/path/to/code_quality_server.py"]
}

# In the raw SDK, add a third StdioServerParameters:
cq_params = StdioServerParameters(
    command="python",
    args=["/absolute/path/to/code_quality_server.py"],
)

 

Test the server standalone before connecting the agent:

# Test the server with the MCP Inspector
npx @modelcontextprotocol/inspector python code_quality_server.py
# Opens a browser UI where you can call run_linter and run_tests directly

 

Tuning Thinking Mode and Preserving Reasoning

 
The thinking mode decision affects latency enough that it is worth treating as an explicit architecture choice, not an afterthought.

In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside <think>...</think> tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. These tokens take time to generate and consume context budget.
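A quick back-of-the-envelope calculation puts those numbers in perspective. The sketch below multiplies the per-turn range quoted above by the number of turns and compares the upper bound with the model's 262,144-token context window. The `trace_overhead` helper is hypothetical, and the estimate ignores tool outputs and the rest of the conversation.

```python
CONTEXT_WINDOW = 262_144  # Qwen3.6-35B-A3B context length in tokens


def trace_overhead(turns: int, low: int = 1_000, high: int = 5_000) -> tuple[int, int]:
    """Return the (min, max) reasoning tokens added across `turns` thinking turns."""
    return turns * low, turns * high


low, high = trace_overhead(5)
print(low, high)                        # 5000 25000
print(f"{high / CONTEXT_WINDOW:.1%}")   # 9.5%
```

Under these assumptions, a 5-step task in thinking mode spends up to roughly a tenth of the window on reasoning alone, before any file contents or tool results are counted.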

When that cost is worth paying:

  • Planning steps where the agent decides what to do next.
  • Debugging sessions where the problem is genuinely ambiguous.
  • Multi-file refactoring where the agent needs to reason about side effects across files.

In these cases, the reasoning trace catches errors before they become tool calls with incorrect arguments. The cost is not worth paying for mechanical tool-call loops where each step is unambiguous: list directory → read file → write file → commit. The model does not need to think hard about these steps, and non-thinking mode is faster while producing output of the same quality.

Switch modes per request, not globally:

# Thinking mode (planning, debugging, complex multi-file tasks)
THINKING_PARAMS = {
    "temperature": 0.6,
    "top_p":       0.95,
    "top_k":       20,
    "min_p":       0.0,
}

# Non-thinking mode (mechanical loops, quick status checks)
# Pass enable_thinking=False in the chat template, or use the system prompt:
# Add "/no_think" to the system prompt to suppress thinking mode.
NON_THINKING_PARAMS = {
    "temperature": 0.7,
    "top_p":       0.8,
    "top_k":       20,
    "min_p":       0.0,
}
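One way to apply the per-request switch is a small helper that picks the sampling parameters and the thinking flag from the kind of step the agent is about to run. This is a minimal sketch, not a definitive implementation: the step labels and the `request_kwargs` helper are hypothetical, and it assumes an OpenAI-compatible server that accepts `top_k`, `min_p`, and `chat_template_kwargs` as extra request-body fields.

```python
THINKING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}

# Hypothetical step labels: the kinds of steps that benefit from a reasoning trace
REASONING_STEPS = {"plan", "debug", "refactor"}


def request_kwargs(step_kind: str) -> dict:
    """Build per-request keyword arguments for a chat completion call."""
    thinking = step_kind in REASONING_STEPS
    params = THINKING_PARAMS if thinking else NON_THINKING_PARAMS
    return {
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Assumption: the server reads these non-standard fields from the request body
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


print(request_kwargs("plan")["extra_body"]["chat_template_kwargs"])  # {'enable_thinking': True}
print(request_kwargs("read_file")["temperature"])                    # 0.7
```

With the openai Python client, the result could be unpacked into `client.chat.completions.create(model=..., messages=..., **request_kwargs("plan"))`, so the agent loop decides the mode one call at a time.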

 

The preserve_thinking flag, the Qwen3.6-specific capability that retains reasoning context across turns, directly affects inference efficiency when prefix caching is active. Here is why it matters in practice: in a 10-turn agent session, each turn shares a prefix of the conversation history. When preserve_thinking=True, the full reasoning trace from prior turns stays in the history. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it. The effective tokens-per-second rate for long sessions is meaningfully higher than without it, particularly when the serving stack has prefix caching active, such as vLLM with --enable-prefix-caching or SGLang with its radix cache (on by default).

The practical rule: use preserve_thinking=True for agent sessions that will run for more than 5 turns. Use preserve_thinking=False (or non-thinking mode) for single-turn queries and short pipelines where the overhead is a waste.
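The prefix-sharing argument can be made concrete with a toy example. The sketch below does not use the real Qwen chat template or a real KV cache: `render` is a hypothetical one-line-per-message renderer, and the shared prefix is measured in characters rather than tokens. It only illustrates why dropping earlier reasoning makes the next prompt diverge from what the server has already processed.

```python
import os
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def render(history, preserve_thinking):
    """Toy prompt renderer: one line per message, optionally dropping old reasoning."""
    lines = []
    for role, text in history:
        if role == "assistant" and not preserve_thinking:
            text = THINK_RE.sub("", text)
        lines.append(f"[{role}] {text}")
    return "\n".join(lines)


def shared_prefix_len(cached, prompt):
    """Length of the common leading substring, a stand-in for reusable cache."""
    return len(os.path.commonprefix([cached, prompt]))


turn_1 = [
    ("user", "List the open issues."),
    ("assistant", "<think>I should call list_issues first.</think>Calling list_issues."),
]
# What the server has processed after turn 1: the prompt plus everything it generated
cached = render(turn_1, preserve_thinking=True)

turn_2 = turn_1 + [("user", "Now read the file mentioned in the first issue.")]

kept = shared_prefix_len(cached, render(turn_2, preserve_thinking=True))
stripped = shared_prefix_len(cached, render(turn_2, preserve_thinking=False))

print(kept == len(cached))      # True: the whole cached sequence is reused
print(stripped < len(cached))   # True: reuse stops where the reasoning was removed
```

In the preserved case the new prompt simply extends the cached sequence. In the stripped case the two diverge at the start of the last assistant message, so everything after that point has to be recomputed.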

 

Conclusion

 
Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just call them.

MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the MCP ecosystem. The custom code_quality server shows the pattern for anything that does not already exist.

The GitHub developer assistant in this article is one application of the pattern. The same architecture (local model, MCP tools, and agentic loop) works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The MCP ecosystem is growing fast. The local model capability is already there.
 
 

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.

