Agents and Tool Use

PM: Read in full — 20 min

The Idea: LLMs That Act, Not Just Answer

What if instead of asking an LLM a question and getting a single answer, you gave it tools — a calculator, a web browser, a database — and let it decide what to do next?

That's the core idea behind agentic LLMs. The model doesn't just generate text; it reasons about what needs to be done, calls a tool to do it, observes the result, and decides what to do next. This loop continues until the task is complete — or until something goes wrong.

The Agent Loop: Reason, Act, Observe

The basic structure is a loop:

Reason: Given the goal and current context, what's the next step?
Act: Call a tool with specific arguments (a web search, a calculation, a database query)
Observe: Read the tool's output and add it to the context
Repeat until the goal is achieved or the model decides it's done

Each iteration extends the context with new observations. The model always has access to the full conversation history including all previous tool calls and their results.

How Tool Calling Works

The LLM doesn't execute tools directly. It outputs a structured JSON object naming which tool to call and what arguments to pass. The host system — your application code — executes the actual function and sends the result back as a new message in the conversation.

The model never runs code. It requests execution, then reasons about the result.

This matters because your application controls which tools are available, what they can do, and whether to actually execute them. You can intercept, log, or reject any tool call before it runs.

The ReAct Pattern

The ReAct pattern (Yao et al., 2023) interleaves explicit reasoning traces with tool actions. Rather than jumping straight to a tool call, the model first writes out its reasoning: "I need to look up today's price to answer this question." Then it acts, observes, and reasons again.

This seemingly small addition — generating reasoning text before and after each tool use — substantially improves performance on knowledge-intensive tasks compared to using either reasoning alone (chain-of-thought) or tool use alone. The reasoning traces also make it much easier to debug what went wrong when an agent fails.

What Goes Wrong

Agents fail in distinct ways that differ from single-call LLM failures:

Infinite loops: The model keeps calling tools without ever deciding the task is done. It finds a new sub-question to answer each time. You need explicit termination logic and iteration limits.

Hallucinated tool calls: The model invents a function name that doesn't exist or passes arguments that don't match the function signature. Robust parsing and validation of tool call outputs is required on the host side.

Wrong tool selection: The model has multiple tools available and picks the wrong one. This gets worse as the number of available tools increases.

Context window exhaustion: Long multi-step tasks accumulate tool calls and results in the context window. Eventually the context fills up and early parts of the conversation are lost — the parts that may contain the original goal or critical intermediate results.

Cascading errors: The model uses an incorrect intermediate result from step 2 as input to step 4. The final answer is wrong in a way that's hard to trace back because each individual step appeared to succeed.

What This Means for Product Design

The fact that agents can call tools that change state — sending emails, writing to databases, making purchases — means the failure modes are qualitatively different from a search or summarization feature. A hallucinated answer in a Q&A system is bad. A hallucinated tool call that sends an incorrect email to a customer is worse.

Design the guardrails before the capabilities. Which tools should be available? Which actions are reversible? What requires a human checkpoint before execution? What happens when the agent exceeds its iteration limit? These are product decisions, not just engineering ones.

The answer to "should we build this as an agent?" is often "no — or at least not without clear bounds on what it can do."

Benchmarked reality: The Remote Labor Index (Mazeika et al., 2025) evaluated AI agents on 240 real tasks drawn from the Upwork marketplace — the kind of work humans routinely get paid to do. The best-performing agent completed only 2.5% of tasks; most completed under 1%. A parallel benchmark, HCAST (Rein et al., 2025), found agents succeed on 70–80% of tasks taking under one hour but below 20% on tasks requiring four or more hours of sustained work.

These numbers reflect the hardest end of the spectrum: unstructured, open-ended work with no scaffolding. For bounded tasks with clear success criteria — structured data extraction, code execution in a defined environment, form completion — agent success rates are substantially higher. The gap between benchmark performance and real-world performance grows as task scope, duration, and ambiguity increase.

PM Takeaway

An agent is not magic — it's an LLM in a loop with tools. The hard problems are which tools to expose, how to handle failures, and how to know when the agent has gone off the rails. Design the guardrails before you design the features.

The Idea: LLMs That Act, Not Just Answer​

The Agent Loop: Reason, Act, Observe​

How Tool Calling Works​

The ReAct Pattern​

What Goes Wrong​

What This Means for Product Design​

Further Reading​