
# Introduction
In a current article on Machine Studying Mastery, we constructed a tool-calling agent that reached outward, that’s pulling climate, information, foreign money charges, and time from public APIs. That article coated the synthesis half of the sample properly, but it surely left the extra attention-grabbing half on the desk: an agent that causes about its personal surroundings, inspects its personal machine, and offloads logic it would not belief itself to carry out. It might be argued that that is nearer to really “agentic.”
This text picks up the place that one left off. We are going to give Gemma 4 two new instruments — a sandboxed native filesystem explorer and a restricted Python interpreter — and watch the mannequin resolve, by itself, when to go searching and when to compute.
Matters we are going to cowl embody:
- Why “agentic” device calling wants greater than net APIs to be attention-grabbing
- The best way to construct a filesystem inspection device with arduous path-traversal guards
- The best way to wire a Python interpreter device to the mannequin with out handing it the keys to your machine
- How the identical orchestration loop from earlier than generalizes to those new capabilities
I extremely suggest that you just first learn this text earlier than persevering with on.
# From Dialog to Company
When the one instruments you give a language mannequin are read-only net APIs, primarily you continue to actually have a chatbot, albeit one with potential entry to higher info. The mannequin receives a immediate, decides which API to ping, and stitches the JSON response right into a paragraph. There isn’t a actual notion of surroundings, no state to examine, no consequence to purpose about; it is a state of affairs extra akin to retrieval augmented era than true company.
Company, within the sensible sense practitioners use the phrase, reveals up when a mannequin begins interacting with the system it’s operating on. That may imply studying from a neighborhood filesystem, executing code, modifying information, calling different processes, or any mixture of these. The second a device can do one thing aside from return a clear string from a distant service, the mannequin has to start out asking about itself: what information exist, what does this quantity really equal, what’s on this folder earlier than I declare it incorporates something.
The Gemma 4 household, and particularly the gemma4:e2b edge variant now we have been utilizing, is sufficiently small to run regionally on a laptop computer whereas being competent sufficient at structured output to drive this sort of loop reliably. That mixture is what makes the local-agentic sample attention-grabbing within the first place. The entire code for this tutorial may be discovered right here.
# The Architectural Reuse
The orchestration loop from the earlier tutorial doesn’t change. We outline Python capabilities, expose them through JSON schema, cross the registry to Ollama alongside the person immediate, intercept any tool_calls block on the response, execute the requested perform regionally, append the end result as a device-role message, and re-query the mannequin so it might probably synthesize a remaining reply. The identical call_ollama helper, the identical TOOL_FUNCTIONS dictionary, the identical available_tools schema array from the earlier tutorial all make appearances.
What adjustments is the character of the instruments themselves. The place the earlier batch had been all skinny purchasers over distant APIs, these we are going to construct now each run code on the machine. That shifts the design drawback from “how do I parse this response” to “how do I make certain the mannequin can not, even unintentionally, do one thing it shouldn’t be allowed to do.”
# Software 1: A Sandboxed Filesystem Explorer
The primary device, list_directory_contents, offers the mannequin the flexibility to see what information exist in a given folder. This sounds trivial till you do not forget that os.listdir accepts any string, together with /, ~, and ../../and many others. A naive implementation may fortunately stroll the mannequin’s “curiosity” straight to your API keys.
The design alternative right here is to pin a secure base listing at script begin and reject any request that resolves exterior of it:
# Safety: confine list_directory_contents to this base listing and its descendants
# Set to the present working listing when the script begins
SAFE_BASE_DIR = os.path.abspath(os.getcwd())
def list_directory_contents(path: str = ".") -> str:
"""Lists information and directories inside a path, constrained to the secure base listing."""
attempt:
# Resolve to an absolute path and confirm it sits inside SAFE_BASE_DIR
# This blocks traversal makes an attempt like '../../and many others' or absolute paths like "https://www.kdnuggets.com/"
requested = os.path.abspath(os.path.be a part of(SAFE_BASE_DIR, path))
if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
return (
f"Error: Entry denied. The trail '{path}' resolves exterior the "
f"permitted workspace ({SAFE_BASE_DIR})."
)
...
The sample is straightforward however price contemplating additional. We by no means belief the string the mannequin produced. We be a part of it onto the bottom listing, resolve it completely (so .. will get normalized away), after which confirm the resolved path nonetheless begins with the bottom. Each /and many others/passwd and ../../someplace collapse into paths that fail that prefix examine and are rejected earlier than os.listdir is ever known as.
The remainder of the perform is housekeeping: verify the trail exists and is a listing, record its contents, and format every entry as both [DIR] or [FILE] with a byte dimension. The returned string is apparent English with construction the mannequin can parse on the second cross:
entries = sorted(os.listdir(requested))
if not entries:
return f"The listing '{path}' is empty."
strains = [f"Contents of '{path}' ({len(entries)} item(s)):"]
for title in entries:
full = os.path.be a part of(requested, title)
if os.path.isdir(full):
strains.append(f" [DIR] {title}/")
else:
attempt:
dimension = os.path.getsize(full)
strains.append(f" [FILE] {title} ({dimension} bytes)")
besides OSError:
strains.append(f" [FILE] {title}")
return "n".be a part of(strains)
The JSON schema we hand to the mannequin is intentionally permissive on the parameter facet — path is non-obligatory, defaulting to the workspace root, as a result of most helpful first questions are in regards to the present folder:
{
"kind": "perform",
"perform": {
"title": "list_directory_contents",
"description": (
"Lists information and subdirectories inside a path throughout the person's workspace. "
"Use this to examine the surroundings earlier than answering questions on native information."
),
"parameters": {
"kind": "object",
"properties": {
"path": {
"kind": "string",
"description": (
"A relative path contained in the workspace, e.g. '.', 'information', or 'src/utils'. "
"Defaults to the workspace root."
)
}
},
"required": []
}
}
}
Notice the outline does a small quantity of immediate engineering: “Use this to examine the surroundings earlier than answering questions on native information.” That sentence pushes Gemma 4 towards calling the device when the person asks a imprecise query about “my information” moderately than guessing at what may be there.
# Software 2: A Restricted Python Interpreter
The second device, execute_python_code, is the extra harmful and the extra pedagogically attention-grabbing of the 2. The premise is that language fashions, particularly small ones, are unreliable at exact arithmetic, actual string manipulation, and something involving greater than a few steps of branching logic. A device that lets the mannequin write and run a deterministic snippet is a significantly better reply to these issues than asking it to purpose by way of them in pure language.
The implementation makes use of exec() with a intentionally stripped-down builtins namespace:
def execute_python_code(code: str) -> str:
"""Executes a snippet of Python code and returns no matter was printed to stdout.
This can be a learning-only sandbox. exec() is basically unsafe; don't expose this device
to untrusted customers or networks. The restrictions beneath cease the informal instances, not a
decided attacker.
"""
attempt:
# A minimal restricted surroundings. We strip __builtins__ right down to a small
# whitelist in order that, e.g., open(), eval(), and __import__ should not immediately
# obtainable from the snippet's international scope.
safe_builtins = {
"abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
"divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
"int": int, "len": len, "record": record, "map": map, "max": max, "min": min,
"pow": pow, "print": print, "vary": vary, "repr": repr, "reversed": reversed,
"spherical": spherical, "set": set, "sorted": sorted, "str": str, "sum": sum,
"tuple": tuple, "zip": zip,
}
# Pre-import a few secure, helpful modules so the mannequin would not must.
import math, statistics
restricted_globals = {
"__builtins__": safe_builtins,
"math": math,
"statistics": statistics,
}
A couple of selections price calling out. We exchange __builtins__ totally moderately than blacklisting particular person capabilities, which suggests open, eval, exec, compile, __import__, enter, and anything not in our whitelist merely doesn’t exist contained in the snippet. We pre-import math and statistics into the snippet’s globals as a result of the mannequin will attain for them always and we might moderately not pressure it to struggle __import__ restrictions. We seize stdout with contextlib.redirect_stdout so the mannequin will get again precisely what its snippet printed:
# Seize stdout so we are able to hand the printed output again to the mannequin
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
exec(code, restricted_globals, {})
output = buffer.getvalue().strip()
if not output:
return "Code executed efficiently however produced no output. Use print() to return a worth."
return f"Output:n{output}"
The empty-output department issues greater than it seems. Small fashions will routinely write expressions like x = sum(vary(101)) and overlook the print(x). Returning a particular error telling them to make use of print() offers the orchestration loop the choice to retry; with out it, the mannequin would synthesize a remaining reply primarily based on an empty string and confidently invent a worth.
A remaining phrase on security, for the reason that script’s docstring is blunt about it: this can be a studying sandbox, not a hardened one. A decided adversary can escape of a Python exec sandbox in a dozen methods, most of them involving object introspection by way of ().__class__.__mro__. For a single-user agent operating by yourself laptop computer by yourself prompts, the whitelist is loads. For anything, you’d need an actual isolation layer — a subprocess with seccomp, a container, or RestrictedPython.
# The Orchestration Loop
The principle loop is unchanged in construction from the earlier tutorial. The mannequin is queried with the person immediate and the device registry, and if it responds with tool_calls, every name is dispatched towards TOOL_FUNCTIONS:
if "tool_calls" in message and message["tool_calls"]:
print("[TOOL EXECUTION]")
messages.append(message)
num_tools = len(message["tool_calls"])
for i, tool_call in enumerate(message["tool_calls"]):
function_name = tool_call["function"]["name"]
arguments = tool_call["function"]["arguments"]
...
if function_name in TOOL_FUNCTIONS:
func = TOOL_FUNCTIONS[function_name]
attempt:
end result = func(**arguments)
...
messages.append({
"function": "device",
"content material": str(end result),
"title": function_name
})
The CLI formatting is price a small tweak for this script. The execute_python_code device’s code argument is usually a multi-line string with newlines in it, which can wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the show solely; the mannequin nonetheless receives the complete string when the perform runs:
def _short(v):
if isinstance(v, str):
flat = v.exchange("n", "n")
if len(flat) > 60:
flat = flat[:57] + "..."
return f"'{flat}'"
return str(v)
args_str = ", ".be a part of(f"{ok}={_short(v)}" for ok, v in arguments.objects())
As soon as every device result’s appended again into the message historical past as a "function": "device" entry, we re-call Ollama with the enriched payload and the mannequin produces its grounded remaining reply. Similar two-pass sample, similar logic.
# Testing the Instruments
And now we take a look at our device calling. Pull gemma4:e2b with ollama pull gemma4:e2b you probably have not already, then run the script from a folder you don’t thoughts the mannequin peeking at.
Let’s begin with the filesystem device. From the mission listing:
What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?
Consequence:
[SYSTEM]
○ Software: execute_python_code......................[LOADED]
○ Software: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]
[PROMPT]
What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: list_directory_contents
├─ Args: path="."
└─ Consequence: Contents of '.' (5 merchandise(s)):
[FILE] README.md (412 bytes)
[FILE] csv_cleaner.py (1834 bytes)
[FILE] predominant.py (10786 bytes)
[FILE] notes.txt (88 bytes)
[FILE] sales_report.py (2210 bytes)
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
Your present folder incorporates 5 information. The one that appears supposed for CSV
processing is csv_cleaner.py — its title strongly suggests it handles CSV enter.
sales_report.py might also contact CSV information, however its title is extra about output than
ingestion.
The mannequin known as the device, appeared on the precise filenames, and made an affordable inference grounded within the itemizing moderately than in its weights. That’s the distinction between hallucination and statement.
Subsequent, the Python interpreter. A small job that small fashions reliably get flawed if requested to do it of their head:
What’s the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
Consequence:
[SYSTEM]
○ Software: execute_python_code......................[LOADED]
○ Software: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]
[PROMPT]
What's the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: execute_python_code
├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
└─ Consequence: Output:
11.4659
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The usual deviation of these numbers, rounded to 4 decimal locations, is 11.4659.
The mannequin offloaded the calculation totally; it wrote a snippet, known as statistics.stdev, rounded the end result, and reported what the interpreter mentioned. No psychological arithmetic, no approximation, no fabricated important digits.
Lastly, the extra attention-grabbing case: a immediate that requires each instruments in sequence. The mannequin has to examine the folder and compute one thing about what it finds:
Have a look at the information within the present folder and inform me the whole dimension in kilobytes, rounded to 2 decimal locations.
Output:
[SYSTEM]
○ Software: execute_python_code......................[LOADED]
○ Software: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]
[PROMPT]
Have a look at the information within the present folder and inform me the whole dimension in kilobytes, rounded to 2 decimal locations.
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
┌── Calling: list_directory_contents
│ ├─ Args: path="."
│ └─ Consequence: Contents of '.' (5 merchandise(s)):
│ [FILE] README.md (412 bytes)
│ [FILE] csv_cleaner.py (1834 bytes)
│ [FILE] predominant.py (10786 bytes)
│ [FILE] notes.txt (88 bytes)
│ [FILE] sales_report.py (2210 bytes)
│
└── Calling: execute_python_code
├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(spherical(sum(siz..."
└─ Consequence: Output:
15.33
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The 5 information within the present folder complete 15.33 KB.
Two instruments, in the proper order, with the output of 1 feeding the argument of the opposite — produced by a 2-billion-parameter mannequin operating on a laptop computer with no GPU. The filesystem device grounds the mannequin in what is definitely there; the interpreter device grounds the reply in what is definitely true. The mannequin contributes the half it’s genuinely good at, which is deciding which query to ask of which device.
It’s price poking on the security guards too, simply to substantiate they maintain. Asking the mannequin “record the contents of /and many others” produces the anticipated denial message within the device end result, which the mannequin then studies again gracefully moderately than fabricating a listing itemizing. Asking it to run open('/and many others/passwd').learn() contained in the interpreter produces a NameError, since open isn’t within the whitelisted builtins. Each failures degrade into helpful error strings as an alternative of silent compromises, which is precisely what you need at this layer.
# Conclusion
The sooner tutorial confirmed that Gemma 4 can attain throughout the web in your behalf. This one reveals it might probably attain into the machine you might be sitting at, rigorously, when you could have constructed the carefulness in. Upon getting a working tool-calling loop, the attention-grabbing query stops being “can the mannequin name a perform” and begins being “what ought to I let it contact.”
A filesystem-aware device and a code-execution device collectively get you a lot of the option to one thing that genuinely earns the time period agent: it might probably observe its surroundings, resolve what calculation issues, and run that calculation deterministically moderately than guessing. The sample generalizes from there. Database queries, shell instructions, git operations, doc parsing; every one in all these is similar JSON schema, the identical dispatch desk, the identical two-pass synthesis, with no matter security perimeter is acceptable for the blast radius of the underlying name.
Construct the perimeter first. Then hand the mannequin the keys to no matter sits inside it.
Matthew Mayo (@mattmayo13) holds a grasp’s diploma in laptop science and a graduate diploma in information mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Studying Mastery, Matthew goals to make advanced information science ideas accessible. His skilled pursuits embody pure language processing, language fashions, machine studying algorithms, and exploring rising AI. He’s pushed by a mission to democratize data within the information science group. Matthew has been coding since he was 6 years outdated.
