Diagram showing an advanced agentic AI system with planning, tool calling, memory, and self-critique workflow. — A visual breakdown of how an advanced agentic AI system combines planning, tools, memory, and self-critique to automate tasks.

Building an advanced agentic AI system is now within reach for any developer with access to the OpenAI API — and the payoff is dramatic. Instead of a single-shot text generator, you get an autonomous pipeline that plans its own approach, calls real tools, remembers past context, and critiques its own output before delivering a final answer.

This guide walks you through every layer of that architecture, from environment setup to a working end-to-end demo, using Python and the OpenAI API.

What Is an Agentic AI System?

Definition: An agentic AI system is an AI pipeline in which a language model acts as an autonomous agent — it breaks down a goal, takes sequential actions, uses external tools, and refines its output through iterative self-evaluation, all without requiring a human to guide each step.

The key distinction from a standard LLM call is autonomy over time. A basic chatbot responds to one prompt. An agentic AI system reasons about what to do next, calls a calculator or database, stores intermediate results in memory, and applies a self-critique loop before producing the final deliverable.

Agentic frameworks have accelerated rapidly in 2026. Microsoft’s ARTIST, Google’s multi-agent RAG systems, and OpenAI‘s own function-calling infrastructure all point to the same insight: LLMs become dramatically more useful when they can act, not just generate. (advanced agentic ai ai tool calling ai memory systems self-critique ai architecture

)

The Three-Role Architecture: Planner, Executor, and Critic

The most reliable way to build an advanced agentic AI system is to separate responsibilities across three specialized roles. Each role uses a distinct system prompt and interacts with the model in a distinct way.

This separation of concerns prevents a single context from becoming cluttered with conflicting instructions and makes each role independently testable and replaceable.

The Planner Role

What does the Planner do? The Planner receives the user’s goal and returns a structured execution plan — typically a JSON object with an objective, a list of steps, and checkpoints where tools should be used.

Using a strict JSON schema instruction for the Planner is critical. It keeps the downstream executor from having to parse prose instructions, which reduces errors in long-horizon tasks.

python

PLANNER_SYS = """You are a senior planner.
Return STRICT JSON with keys:
objective (string), steps (array of strings), tool_checkpoints (array of strings)."""

The Planner should use a low temperature (0.1) to make its outputs deterministic and consistent. The goal here is strategy, not creativity.

The Executor Role (Tool-Calling Layer)

What does the Executor do? The Executor receives the plan and loops through model calls, detecting tool invocations, running them in Python, and feeding results back into the conversation.

This is where tool calling becomes central to the agentic AI system. The Executor doesn’t just generate text — it actively invokes functions like a safe calculator, a knowledge-base search, a JSON extractor, or a file writer. Each tool returns a structured dictionary that the model can interpret and reason over.

python

EXECUTOR_SYS = """You are a tool-using executor.
Use tools when needed. Keep intermediate notes short.
When done, return:
1) DRAFT output
2) Verification checklist"""

The Executor loop runs up to a configurable iteration limit (e.g., 12 turns), catching tool calls, executing them locally, and appending results as tool role messages. This creates a genuine feedback loop rather than a static generation.

The Critic Role (Self-Critique)

What does the Critic do? The Critic receives the original goal, the draft output, and the full tool-call trace, then returns a structured critique — identifying issues, proposing fixes, and producing an improved final answer.

Self-critique is what separates a robust agentic AI system from a fragile one. Without it, errors introduced during tool execution accumulate silently. The Critic acts as an integrated quality-control layer, reviewing the full chain of evidence before the user sees anything.

python

CRITIC_SYS = """You are a critic.
Given goal + draft, return:
- Issues (bullets)
- Fixes (bullets)
- Improved final answer (clean)"""

By including the tool trace in the Critic’s context, the system can catch numeric errors, missing action items, and structural gaps that wouldn’t be visible from the draft text alone.

Setting Up Your Environment with the OpenAI API

Before building the agent pipeline, configure your environment cleanly. Using getpass() to collect the API key prevents it from appearing in notebook outputs or version-controlled files.

python

import os, json, re, math, hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List
from getpass import getpass
from openai import OpenAI

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()

client = OpenAI()
MODEL = "gpt-4o"  # or your preferred model

Setting the model string once at the top and referencing it throughout the codebase avoids silent mismatches, especially if you need to swap to a different model during development.

Building Structured Tool Calling in Python

Tool calling is the mechanism that turns a language model into an agentic AI system capable of affecting the world. Rather than hallucinating numeric results or inventing facts, the model calls a declared function and receives a real answer.

Defining Tool Schemas

Each tool requires a JSON schema that describes its name, purpose, and expected parameters. The OpenAI API uses these schemas to guide the model in formatting its tool calls correctly.

python

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "calc",
            "description": "Safely compute a numeric expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "kb_search",
            "description": "Search internal knowledge base.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "k": {"type": "integer", "default": 3}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file path.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                },
                "required": ["path", "content"]
            }
        }
    }
]

Well-designed tool schemas do two things: they constrain what the model can call (preventing hallucinated function names) and they document the agent’s capabilities in a machine-readable format that can be audited later.

Executing Tools Safely

On the Python side, each tool maps to a callable function. A dispatcher handles routing by name and returns a structured dictionary — always including an ok field — so the model can detect and reason about failures.

python

TOOLS = {
    "calc": lambda expression: _safe_calc(expression),
    "kb_search": lambda query, k=3: _kb_search(query, int(k)),
    "write_file": lambda path, content: _write_file(path, content),
}

def run_tool(name, args):
    fn = TOOLS.get(name)
    if not fn:
        return {"ok": False, "error": f"Unknown tool: {name}"}
    try:
        return fn(**args)
    except Exception as e:
        return {"ok": False, "error": str(e), "args": args}

The _safe_calc function is worth emphasizing: it whitelists only numeric characters and operators before calling eval, preventing arbitrary code execution. This kind of defensive implementation is essential in any agentic AI system that processes untrusted input.

Implementing Agent Memory

Memory is what makes an agentic AI system capable of learning within a session and coordinating across long-horizon tasks. Without it, every turn starts from scratch and the agent cannot accumulate context.

Short-Term vs. Long-Term Memory in AI Agents

Memory Type	Scope	Implementation	Use Case
Short-term (in-context)	Current session only	Appended to message history	Passing tool results forward
Working memory	Current task	`AgentState.memory` list	Persisting intermediate findings
Long-term (external)	Cross-session	Vector DB or key-value store	Remembering user preferences
Episodic trace	Current session	`AgentState.trace` list	Debugging and critique

For the pipeline in this guide, working memory and an episodic trace are sufficient. Both are stored in the AgentState dataclass.

python

@dataclass
class AgentState:
    goal: str
    memory: List[str] = field(default_factory=list)
    trace: List[Dict[str, Any]] = field(default_factory=list)

The memory list holds short directives the executor should keep in mind (e.g., “use kb_search for formatting guidance”). The trace records every tool call, its arguments, and its output — giving the Critic full visibility into how the answer was produced.

For production deployments, extending this pattern with a vector store like Chroma or Pinecone enables persistent memory across sessions, turning a single-task agentic AI system into a persistent AI assistant with evolving context.

Agentic AI Architecture Comparison: Simple vs. Advanced

Understanding the upgrade from a basic LLM call to a full agentic AI system helps clarify which components to prioritize for your use case.

Feature	Basic LLM Call	Advanced Agentic AI System
Goal handling	Single prompt, single response	Multi-step planning with structured JSON
Tool access	None	Calculator, KB search, file I/O, custom APIs
Memory	None (stateless)	In-context working memory + episodic trace
Self-correction	None	Dedicated Critic role reviews and improves draft
Output format	Unstructured text	Structured deliverables (JSON, Markdown, files)
Iteration	1 LLM call	Up to N turns with tool feedback loops
Debuggability	Low (no trace)	High (full tool trace for auditing)
Best for	Q&A, summarization	Multi-step workflows, automated pipelines

The right architecture depends on task complexity. For simple retrieval or summarization, a basic LLM call is faster and cheaper. Once a task involves calculation, structured outputs, knowledge retrieval, or saving artifacts, the full agentic AI system pattern pays for itself immediately.

Running the Full Agentic AI Pipeline End-to-End

Putting all three roles together, the run_agent() function orchestrates the complete pipeline in five stages.

python

def run_agent(goal: str):
    # 1. Initialize state
    state = AgentState(goal=goal)
    state.memory.append(
        "Use kb_search if you need internal guidance or formatting playbooks."
    )

    # 2. Plan
    plan_obj = plan(state)

    # 3. Execute with tool calling
    draft = execute(state, plan_obj)

    # 4. Critique and refine
    final = critique(state, draft)

    # 5. Return structured result
    return {
        "plan": plan_obj,
        "draft": draft,
        "final": final,
        "trace": state.trace
    }

Each stage feeds into the next. The Planner’s structured JSON becomes the Executor’s roadmap. The Executor’s tool trace becomes the Critic’s evidence. The Critic’s improved output becomes the user’s deliverable.

Demo goal — meeting follow-up pipeline:

python

demo_goal = """
From this transcript, produce:
A) Concise meeting summary
B) Action items as JSON array with fields: owner, action, due_date
C) Follow-up email (subject + body)
D) Save output to /content/meeting_followup.md using write_file

Transcript:
- Decision: Ship v2 dashboard on March 15.
- Risk: Data latency might spike; Priya will run load tests.
- Amir will update the KPI definitions doc and share with finance.
- Next check-in: Tuesday. Owner: Nikhil.
"""

result = run_agent(demo_goal)
print(result["final"])

Running this demo triggers the following sequence in the agentic AI system:

The Planner returns a JSON plan with four steps and a write_file checkpoint.
The Executor calls kb_search for formatting guidance, generates the summary and action items JSON, drafts the email, then calls write_file to save the output.
The Critic reviews the draft against the tool trace, checks that all action items have owners and due dates, and produces a polished final response.

The result includes a Markdown file on disk, a verified JSON array, and a follow-up email — all from a single run_agent() call.

Common Pitfalls When Building an Agentic AI System

Even well-designed pipelines hit avoidable errors. These are the most common failure modes.

Missing tool_choice guard: Passing tool_choice="auto" without a tools list raises a 400 error from the OpenAI API. Always wrap tool parameters in a conditional: only include them when tools are actually provided.
Unguarded eval() in tool functions: Passing user input directly to Python’s eval() is a critical security vulnerability. Whitelist allowed characters explicitly, as shown in _safe_calc.
No iteration limit in the Executor loop: Without a maximum turn count, a stuck tool call can run indefinitely. Set a hard ceiling (e.g., 12 iterations) and return a graceful error message if it’s reached.
Planner JSON parse failure with no fallback: If the model returns a plan with prose instead of valid JSON, the pipeline should degrade gracefully — either retrying with a stricter prompt or falling back to a minimal default plan rather than crashing.
Ignoring the tool trace in the Critic: Passing only the draft to the Critic loses the most valuable debugging signal. The trace shows whether a calculation actually ran, what the KB returned, and whether file I/O succeeded — all evidence the Critic needs to catch real errors.
Shallow memory (no working state): An agentic AI system without persistent memory between sub-tasks forces the Executor to re-derive context on every turn. Maintain a memory list and inject it into every Executor prompt.
Oversized tool outputs in context: Tools like file writers or KB searches can return large payloads. Truncate tool outputs before adding them to the message history to avoid context overflow on long tasks.

Extending the Pipeline: What Comes Next

The architecture described here is deliberately minimal — just enough to be production-credible for real workflows. Natural next steps include:

Parallel sub-agents: Decompose complex goals into concurrent tasks handled by specialized executor instances, then merge results in a synthesis step.
Vector memory: Replace the in-memory KB with a proper embedding-based retrieval store for cross-session recall.
Tool retry policies: Wrap run_tool() in an exponential backoff decorator so transient failures don’t cascade.
Evaluation harnesses: Log plan quality, tool utilization rate, and Critic improvement scores across runs to benchmark the agentic AI system over time.
Human-in-the-loop approval: Add a checkpointing mechanism that pauses the Executor before irreversible actions (file writes, API calls, emails) and requests explicit confirmation.

Each of these extensions preserves the three-role structure — they add capability without changing the core logic. That’s the real benefit of the Planner/Executor/Critic pattern: it scales horizontally by adding roles, not vertically by bloating a single prompt.

Conclusion

An advanced agentic AI system built on the OpenAI API isn’t a research artifact — it’s a practical pattern you can implement today. By separating planning, tool-augmented execution, and self-critique into distinct roles, you get a pipeline that is debuggable, extensible, and genuinely useful for real-world tasks like meeting follow-ups, structured data extraction, and automated report generation.

The core insight is architectural: once you treat the LLM as an orchestrator rather than a generator, every component — memory, tools, critique — has a natural home in the pipeline. That is what distinguishes a true agentic AI system from a clever prompt.

kalinga.ai

How to Build an Advanced Agentic AI System with Planning, Tool Calling, Memory, and Self-Critique

What Is an Agentic AI System?

The Three-Role Architecture: Planner, Executor, and Critic

The Planner Role

The Executor Role (Tool-Calling Layer)

The Critic Role (Self-Critique)

Setting Up Your Environment with the OpenAI API

Building Structured Tool Calling in Python

Defining Tool Schemas

Executing Tools Safely

Implementing Agent Memory

Short-Term vs. Long-Term Memory in AI Agents

Agentic AI Architecture Comparison: Simple vs. Advanced

Running the Full Agentic AI Pipeline End-to-End

Common Pitfalls When Building an Agentic AI System

Extending the Pipeline: What Comes Next

Conclusion

Leave a Comment Cancel Reply

Kalinga .ai