Saturday, May 2, 2026

A Coding Implementation for Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset

In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agent-based models think, use tools, and generate responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversational format to get a clear picture of the available information. We then build simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. Next, we analyze patterns such as tool usage frequency, conversation length, and error rates to better understand agent behavior. We also create visualizations to highlight these trends and make the analysis more intuitive. Finally, we prepare the dataset for training by converting it into a model-friendly format, making it suitable for tasks like supervised fine-tuning.

!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

We install all required libraries and import the necessary modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We also optionally combine multiple dataset configurations and examine a sample to understand the conversational format.

THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw:
        return {"raw": value}
    body = raw.group(1)
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
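As a quick sanity check, the same parsing approach can be exercised on a hand-written message in the Hermes trace style. The `<think>`/`<tool_call>` tag names follow the common Hermes convention, and the sample string below is illustrative rather than taken from the dataset:

```python
import json
import re

# Regexes mirroring the parser above (tag names assumed from the Hermes format)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)

def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    # Whatever remains after stripping thoughts and calls is the final answer
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}

# Hypothetical assistant turn: one thought, one tool call, one final message
msg = (
    "<think>I should look this up.</think>\n"
    '<tool_call>{"name": "search", "arguments": {"query": "weather"}}</tool_call>\n'
    "Let me check that for you."
)
parsed = parse_assistant(msg)
print(parsed["thoughts"])    # ['I should look this up.']
print(parsed["tool_calls"])  # [{'name': 'search', 'arguments': {'query': 'weather'}}]
print(parsed["final"])       # Let me check that for you.
```

Because `re.DOTALL` lets `.` match newlines, multi-line thoughts and pretty-printed JSON payloads are captured as well.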

N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)


print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e > 0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k, v in parallel_widths.items() if k > 1)/max(1, sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t, _ in top][::-1], [c for _, c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in a single turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

We perform dataset-wide analytics to measure tool usage, conversation lengths, and error patterns. We aggregate statistics across many samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
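The core of that aggregation, stripped of plotting, can be shown on two tiny synthetic trajectories. The tuples below are a hypothetical stand-in for already-parsed turns, used only to demonstrate how parallel-call share and error rate fall out of a couple of `Counter`s:

```python
from collections import Counter

# Synthetic trajectories for illustration: each turn is
# (role, number_of_tool_calls, tool_response_had_error)
trajectories = [
    [("gpt", 1, False), ("tool", 0, False), ("gpt", 2, False)],
    [("gpt", 1, False), ("tool", 0, True),  ("gpt", 0, False)],
]

parallel_widths = Counter()  # how many tool calls each assistant turn issued
errors_per_traj = []
for traj in trajectories:
    n_err = 0
    for role, n_calls, had_error in traj:
        if role == "gpt" and n_calls:
            parallel_widths[n_calls] += 1
        elif role == "tool" and had_error:
            n_err += 1
    errors_per_traj.append(n_err)

# Share of tool-calling turns that issued more than one call in parallel
parallel_share = sum(v for k, v in parallel_widths.items() if k > 1) / sum(parallel_widths.values())
# Fraction of trajectories that hit at least one tool error
error_rate = sum(e > 0 for e in errors_per_traj) / len(errors_per_traj)
print(f"parallel share: {parallel_share:.2f}, error rate: {error_rate:.2f}")
```

Here one of three tool-calling turns is parallel and one of two trajectories contains an error, so the script prints a parallel share of 0.33 and an error rate of 0.50.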

def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("=" * 72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
    try:
        return json.loads(ex["tools"])
    except (json.JSONDecodeError, TypeError):
        return []


schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

We build utilities to render full conversation traces in a readable format for deeper inspection. We also extract tool schemas and convert the dataset into OpenAI-style message format for compatibility with training pipelines. This helps us better understand both the structure of tools and how conversations can be standardized.
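The role conversion is a pure mapping, so it is easy to verify in isolation on a toy conversation. The three-turn conversation below is made up for illustration; the ShareGPT-style `from`/`value` keys and the OpenAI-style `role`/`content` keys mirror the formats used above:

```python
# ShareGPT role names → OpenAI role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_openai_messages(conv):
    """Convert ShareGPT-style turns to OpenAI-style chat messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

# Hypothetical three-turn conversation
conv = [
    {"from": "system", "value": "You are a helpful agent."},
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 = 4."},
]
msgs = to_openai_messages(conv)
print([m["role"] for m in msgs])  # ['system', 'user', 'assistant']
```

Because `ROLE_MAP` is consulted with plain indexing, an unexpected role name raises a `KeyError` immediately rather than silently producing a malformed message, which is a useful property when sweeping a large dataset.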

from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt":
            continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]:
            think_lens.append(len(th))
        for c in p["tool_calls"]:
            call_lens.append(len(json.dumps(c)))
        if p["final"]:
            ans_lens.append(len(p["final"]))


plt.figure(figsize=(10, 4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["think", "tool_call", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)


TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig


    train_subset = ds.select(range(200))


    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch


    train_subset = train_subset.map(to_text)


    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )


    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tuning demo finished.")


print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
      "tokenized + label-masked SFT examples, and an optional training hook.")

We tokenize the conversations and apply label masking so only assistant responses contribute to training. We analyze the length distributions of reasoning, tool calls, and answers to gain further insight. We also implement a trace replayer to step through agent behavior and optionally run a small fine-tuning loop.
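The label-masking idea can be demonstrated without a real tokenizer. The sketch below uses made-up integer token IDs; the only substantive convention is `-100`, the ignore index that PyTorch's cross-entropy loss (and hence Hugging Face trainers) skips by default:

```python
IGNORE_INDEX = -100  # labels with this value are ignored by cross-entropy loss

def mask_labels(turns):
    """turns: list of (role, token_ids) pairs. Returns flat input_ids and
    labels in which only assistant tokens remain trainable."""
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

# Hypothetical tokenized turns: user prompt, assistant reply, tool output
turns = [("user", [11, 12]), ("assistant", [21, 22, 23]), ("tool", [31])]
ids, lbls = mask_labels(turns)
trainable = sum(1 for x in lbls if x != IGNORE_INDEX)
print(ids)        # [11, 12, 21, 22, 23, 31]
print(lbls)       # [-100, -100, 21, 22, 23, -100]
print(trainable)  # 3
```

The model still attends to the masked user and tool tokens as context; masking only removes them from the loss, so gradients flow exclusively from the assistant's own outputs.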

In conclusion, we developed a structured workflow to parse, analyze, and work effectively with agent reasoning traces. We broke conversations down into meaningful components, examined how agents reason step by step, and measured how they interact with tools during problem solving. Using the visualizations and analytics, we gained insight into common patterns and behaviors across the dataset. In addition, we converted the data into a format suitable for training language models, including handling tokenization and label masking for assistant responses. This process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable way.

