Figure: The Plan-Act-Verify framework: how autonomous agents manage 25+ hour engineering tasks using Codex.

Long-horizon tasks with Codex represent the new frontier of AI-driven development, shifting the focus from simple code completion to multi-hour autonomous problem-solving. While standard LLMs excel at “one-shot” snippets, a long-horizon task requires an agent to plan, execute, validate, and repair code over hundreds of steps without human intervention.

In this guide, we will break down the techniques OpenAI describes for keeping Codex coherent during a 25-hour autonomous run, and how you can apply the same principles to your own agentic workflows.

The Shift from Completion to Autonomy

The primary challenge of long-horizon tasks with Codex is “state drift.” In a typical chat session, the model eventually loses the thread of the original goal as the context window fills up. However, by treating Codex as an agent rather than a chatbot, we can enable it to handle complex migrations, feature builds, and bug-hunting missions that span entire days.

The real shift here is the time horizon. Instead of asking “How do I write this function?”, we are now asking “Build this entire interactive dashboard and ensure it passes all CI/CD checks.”

The Core Framework: Durable Project Memory

To successfully execute long-horizon tasks with Codex, the model cannot rely on its internal memory alone. It needs “externalized state”—a set of Markdown files that act as the agent’s short-term and long-term memory.

1. The Spec (Prompt.md)

Every long-running task starts with a clear source of truth. This file defines:

Goals & Non-goals: What the agent must and must not do.
Hard Constraints: Performance requirements, specific libraries (e.g., Tailwind, React), and platform limits.
“Done When” Criteria: A checklist of demo flows or test cases that signal completion.
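As a rough illustration of the three parts listed above, the sketch below renders a spec to Markdown text that could be saved as Prompt.md. The `Spec` class, the `render_spec` function, the section titles, and the example values are all hypothetical; the article does not prescribe a file layout, so treat this as one possible shape rather than a required format.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical in-memory form of a Prompt.md spec (not an official schema)."""
    goals: list[str]
    non_goals: list[str] = field(default_factory=list)
    hard_constraints: list[str] = field(default_factory=list)
    done_when: list[str] = field(default_factory=list)


def render_spec(spec: Spec) -> str:
    """Render the spec as Markdown; "Done when" items become unchecked boxes."""
    sections = [
        ("Goals", spec.goals, "- "),
        ("Non-goals", spec.non_goals, "- "),
        ("Hard constraints", spec.hard_constraints, "- "),
        ("Done when", spec.done_when, "- [ ] "),
    ]
    lines = ["# Spec"]
    for title, items, bullet in sections:
        lines.append("")
        lines.append(f"## {title}")
        lines.extend(f"{bullet}{item}" for item in items)
    return "\n".join(lines) + "\n"


# Illustrative values only.
example = Spec(
    goals=["Build an interactive dashboard"],
    non_goals=["Rewrite the backend"],
    hard_constraints=["Use React and Tailwind"],
    done_when=["npm test passes"],
)
print(render_spec(example))
```

Keeping the "Done when" items as checkboxes gives both the agent and a human reviewer the same completion checklist to tick off.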

2. The Execution Plan (Plan.md)

When tackling long-horizon tasks with Codex, the agent must never “wing it.” It creates a Plan.md file that breaks the massive goal into tiny, verifiable milestones. Each milestone must include a validation command (like npm test or a specific lint check). If validation fails, the agent is instructed to stop and repair before moving to the next step.
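The stop-on-failure rule can be sketched as a small runner. This is a minimal sketch under assumptions: `Milestone` and `run_plan` are hypothetical names, and the example uses trivial Python one-liners in place of real commands such as npm test so that it runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class Milestone:
    """Hypothetical plan entry: a name plus the command that validates it."""
    name: str
    validate: list[str]  # argv list, for example ["npm", "test"]


def run_plan(milestones: list[Milestone]) -> tuple[list[str], Optional[str]]:
    """Validate milestones in order and stop at the first failure.

    Returns (names that passed, name of the failing milestone or None).
    """
    completed: list[str] = []
    for milestone in milestones:
        result = subprocess.run(milestone.validate, capture_output=True, text=True)
        if result.returncode != 0:
            # Do not move on: the failing milestone must be repaired first.
            return completed, milestone.name
        completed.append(milestone.name)
    return completed, None


# Stand-in commands: exit code 0 means "validation passed", 1 means "failed".
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

done, failed = run_plan([
    Milestone("scaffold project", passing),
    Milestone("add chart component", failing),
    Milestone("wire up filters", passing),
])
print(done, failed)
```

The third milestone is never attempted here, which is the point: a failed validation blocks everything after it.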

3. The Status Log (Documentation.md)

Because these tasks can run for 25+ hours, you need a way to audit progress. Codex maintains a live status log that records every major decision, the “why” behind architectural choices, and a list of known issues. This allows a human to step in, review the “audit trail,” and provide a course correction if necessary.
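One way to keep such a log is as an append-only Markdown file. The sketch below assumes a simple entry layout (timestamp heading, decision, rationale, known issues); the function names and the layout are hypothetical, not a format taken from OpenAI's tooling.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Sequence


def format_entry(
    decision: str,
    why: str,
    known_issues: Sequence[str] = (),
    when: Optional[datetime] = None,
) -> str:
    """Format one log entry as Markdown (the layout is an assumption)."""
    when = when or datetime.now(timezone.utc)
    lines = [
        f"## {when.isoformat(timespec='seconds')}",
        f"- Decision: {decision}",
        f"- Why: {why}",
    ]
    lines.extend(f"- Known issue: {issue}" for issue in known_issues)
    return "\n".join(lines) + "\n\n"


def append_entry(log_path: Path, entry: str) -> None:
    """Append to the log; earlier entries are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)


# Illustrative entry; call append_entry(Path("Documentation.md"), entry) to persist it.
entry = format_entry(
    decision="Use client-side filtering",
    why="The dataset is small enough to hold in memory",
    known_issues=["No pagination yet"],
)
print(entry)
```

Appending rather than overwriting preserves the audit trail, so a reviewer can see what was decided and in what order.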

The “Plan-Act-Verify” Loop

The core of long-horizon tasks with Codex is an iterative loop. Codex does not just write code; it operates in a continuous cycle in which Edit is the “act” phase and Observe and Repair together form the “verify” phase:

Plan: Identify the next small milestone in Plan.md.
Edit: Use apply_patch or other file-writing tools to modify the codebase.
Observe: Run the build or test suite and capture the raw terminal output.
Repair: If a test fails, Codex analyzes the error and loops back to the Edit step.

This loop ensures that errors are caught early, preventing the “hallucination debt” that usually kills long AI sessions.
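The control flow of one milestone can be sketched in a few lines. This is not Codex's internal implementation: `run_milestone`, the `edit` and `observe` callables, and the `max_repairs` cap are assumptions made for illustration, and the "agent" in the example is a stub that fails twice before passing.

```python
from typing import Callable


def run_milestone(
    edit: Callable[[str], None],
    observe: Callable[[], tuple[bool, str]],
    max_repairs: int = 3,
) -> bool:
    """Edit, observe, and repair until validation passes or attempts run out.

    edit(feedback) changes the codebase; feedback is "" on the first attempt
    and the previous raw test output on repair attempts.
    observe() runs validation and returns (passed, raw_output).
    """
    feedback = ""
    for _ in range(max_repairs + 1):
        edit(feedback)
        passed, output = observe()
        if passed:
            return True
        feedback = output  # feed the failure back into the next edit
    return False


# Stub agent for demonstration: validation passes on the third edit.
state = {"edits": 0}


def fake_edit(feedback: str) -> None:
    state["edits"] += 1


def fake_observe() -> tuple[bool, str]:
    return state["edits"] >= 3, f"failure after edit {state['edits']}"


print(run_milestone(fake_edit, fake_observe))
```

Capping the number of repair attempts is a design choice added here; without some limit, a milestone that cannot pass would loop indefinitely instead of surfacing the problem to a human.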

Key Features of GPT-5.3-Codex for Long Tasks

OpenAI recently updated the Codex-tuned models (specifically gpt-5.3-codex) with features aimed at long-running work:

Native Compaction: The model can now “compact” its own history, summarizing previous steps to fit more relevant information into the context window without losing the original goal.
High Reasoning Effort: For the hardest logic puzzles, users can toggle “xhigh” reasoning, allowing the model to think deeper before committing to a code change.
Mid-flight Steerability: You can now give the agent a “nudge” while it’s running. If you see it heading toward a sub-optimal architecture, you can update the Prompt.md, and Codex will adapt its plan in the next loop.
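To make the idea of compaction concrete, here is a toy illustration. It is not how the model's built-in compaction works, which this article does not describe; the `compact` function and its truncation-based "summary" are assumptions that only show the general shape: keep the goal, condense older steps, and keep the most recent steps verbatim.

```python
def compact(goal: str, steps: list[str], keep_recent: int = 3, max_chars: int = 40) -> list[str]:
    """Toy history compaction: goal first, older steps condensed, recent steps intact.

    A real system would summarize with a model; truncating each older step
    to max_chars is a placeholder for that summary.
    """
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    older = steps[:-keep_recent] if keep_recent else steps
    recent = steps[-keep_recent:] if keep_recent else []
    history = [f"GOAL: {goal}"]
    if older:
        condensed = "; ".join(step[:max_chars] for step in older)
        history.append(f"SUMMARY of {len(older)} earlier steps: {condensed}")
    history.extend(recent)
    return history


steps = [f"step {i}" for i in range(1, 6)]
for line in compact("ship the dashboard", steps, keep_recent=2):
    print(line)
```

The goal line is always emitted first, which mirrors the stated aim of compacting history without losing the original objective.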

Practical Takeaways for Developers

If you are ready to start implementing long-horizon tasks with Codex, follow these three rules:

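Isolate the Environment: Use Git worktrees. A worktree gives Codex its own working directory and branch, so failed experiments stay off your main branch. Note that this isolates the checkout, not the machine; it is not a security sandbox.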
Exhaustive Testing: The agent is only as good as its feedback. If you don’t have tests, Codex has no way of knowing if it broke your app.
Small Milestones: Never let the agent try to do too much at once. A milestone should be something that can be completed and verified in under 5 minutes.
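For the first rule, a dedicated worktree can be created with the standard `git worktree add -b <branch> <path>` command. The wrapper below is a minimal sketch; the function names, the branch name, and the paths are hypothetical, and `create_worktree` assumes Git is installed and `repo` is an existing repository.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, path: Path, branch: str) -> list[str]:
    """Build the argv for creating a new worktree on a new branch."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, path: Path, branch: str) -> None:
    """Run the command; raises CalledProcessError if Git reports a failure."""
    subprocess.run(worktree_add_command(repo, path, branch), check=True)


# Illustrative names only; printing the command instead of running it.
command = worktree_add_command(Path("my-app"), Path("my-app-agent"), "agent/dashboard")
print(" ".join(command))
```

When the run is finished, the branch can be reviewed and merged like any other, and the worktree removed with `git worktree remove <path>`.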

The Future of Autonomous Coding

We are moving toward a world where “coding” becomes “delegation.” By mastering long-horizon tasks with Codex, developers can stop babysitting every line of syntax and start acting as architects who manage a fleet of autonomous agents.

The 25-hour run showcased in the OpenAI Cookbook, which generated over 30,000 lines of code across 13 million tokens, is just the beginning. As the tools for long-horizon tasks with Codex improve, the scope of what an individual developer can delegate and ship will keep growing.

kalinga.ai

Long-Horizon Tasks with Codex: The Ultimate Guide to Autonomous AI