Agents and Tool Use
The Idea: LLMs That Act, Not Just Answerโ
What if instead of asking an LLM a question and getting a single answer, you gave it tools โ a calculator, a web browser, a database โ and let it decide what to do next?
That's the core idea behind agentic LLMs. The model doesn't just generate text; it reasons about what needs to be done, calls a tool to do it, observes the result, and decides what to do next. This loop continues until the task is complete โ or until something goes wrong.
The Agent Loop: Reason, Act, Observeโ
The basic structure is a loop:
- Reason: Given the goal and current context, what's the next step?
- Act: Call a tool with specific arguments (a web search, a calculation, a database query)
- Observe: Read the tool's output and add it to the context
- Repeat until the goal is achieved or the model decides it's done
Each iteration extends the context with new observations. The model always has access to the full conversation history including all previous tool calls and their results.
How Tool Calling Worksโ
The LLM doesn't execute tools directly. It outputs a structured JSON object naming which tool to call and what arguments to pass. The host system โ your application code โ executes the actual function and sends the result back as a new message in the conversation.
The model never runs code. It requests execution, then reasons about the result.
This matters because your application controls which tools are available, what they can do, and whether to actually execute them. You can intercept, log, or reject any tool call before it runs.
The ReAct Patternโ
The ReAct pattern (Yao et al., ICLR 2023) interleaves explicit reasoning traces with tool actions. Rather than jumping straight to a tool call, the model first writes out its reasoning: "I need to look up today's price to answer this question." Then it acts, observes, and reasons again.
This seemingly small addition โ generating reasoning text before and after each tool use โ substantially improves performance on knowledge-intensive tasks compared to using either reasoning alone (chain-of-thought) or tool use alone. The reasoning traces also make it much easier to debug what went wrong when an agent fails.
What Goes Wrongโ
Agents fail in distinct ways that differ from single-call LLM failures:
Infinite loops: The model keeps calling tools without ever deciding the task is done. It finds a new sub-question to answer each time. You need explicit termination logic and iteration limits.
Hallucinated tool calls: The model invents a function name that doesn't exist or passes arguments that don't match the function signature. Robust parsing and validation of tool call outputs is required on the host side.
Wrong tool selection: The model has multiple tools available and picks the wrong one. This gets worse as the number of available tools increases.
Context window exhaustion: Long multi-step tasks accumulate tool calls and results in the context window. Eventually the context fills up and early parts of the conversation are lost โ the parts that may contain the original goal or critical intermediate results.
Cascading errors: The model uses an incorrect intermediate result from step 2 as input to step 4. The final answer is wrong in a way that's hard to trace back because each individual step appeared to succeed.
What This Means for Product Designโ
The fact that agents can call tools that change state โ sending emails, writing to databases, making purchases โ means the failure modes are qualitatively different from a search or summarization feature. A hallucinated answer in a Q&A system is bad. A hallucinated tool call that sends an incorrect email to a customer is worse.
Design the guardrails before the capabilities. Which tools should be available? Which actions are reversible? What requires a human checkpoint before execution? What happens when the agent exceeds its iteration limit? These are product decisions, not just engineering ones.
The answer to "should we build this as an agent?" is often "no โ or at least not without clear bounds on what it can do."
An agent is not magic โ it's an LLM in a loop with tools. The hard problems are which tools to expose, how to handle failures, and how to know when the agent has gone off the rails. Design the guardrails before you design the features.