My Experience with OpenAI's Codex CLI (o4-mini)

Executive Summary

This article details first-hand experience using OpenAI's Codex CLI with the o4-mini model, highlighting its operational nuances, challenges with specific tasks such as list formatting, and its perceived token consumption. It also offers a comparative perspective against tools such as Cursor IDE and Gemini 2.5 Pro, particularly regarding workflow, document handling, and problem-solving capability.

Key observations include the need for explicit instructions with o4-mini, its context retention, usability differences compared with GUI-based tools, and notes on the current AI landscape, including reported model 'IQ' scores and a possible acceleration in release cycles.

While I usually try to avoid purely anecdotal content, this article, even though it will be reviewed and updated by AI, contains my recollections of using OpenAI's Codex CLI with the o4-mini model. Evaluating developer tools is crucial for productivity, and understanding the practical strengths and weaknesses of emerging AI coding assistants like o4-mini matters for individuals and teams navigating a rapidly evolving AI development landscape.

Setting Up and Initial Interactions

My setup relied on the interoperability between Windows 11 and WSL (Windows Subsystem for Linux): the CLI ran inside Ubuntu under WSL, yet it could still communicate with processes running directly on Windows 11. I noticed that I needed to be very explicit with the instructions I gave it; however, once the context window contained the details of what I was working on, I didn't need to constantly reiterate the subject matter. I primarily used it in its 'full-auto' mode, and it performed the tasks I instructed it to do.
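
For context, a minimal sketch of how I launched the tool from the Ubuntu (WSL) shell is shown below. The package name, environment variable, and flag names reflect my recollection of the Codex CLI's documented options at the time and should be verified against `codex --help`; the prompt and file path are purely illustrative.

```bash
# Install the Codex CLI inside the Ubuntu (WSL) environment
# (package name as published in OpenAI's repository at the time; verify before use).
npm install -g @openai/codex

# The CLI reads the API key from the environment.
export OPENAI_API_KEY="<your-key-here>"

# Run a task in 'full-auto' mode with the o4-mini model.
# Flag names are my best recollection; confirm with `codex --help`.
codex --model o4-mini --approval-mode full-auto \
  "Fix the duplicate numbering in the ordered list in docs/setup.md"  # illustrative prompt
```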

Professional Insight: Instruction Specificity

The need for highly explicit instructions is a common characteristic of many current AI models, especially in complex or multi-step tasks. While context retention helps, clarity in initial prompts remains key to achieving desired outcomes efficiently and minimizing corrective iterations.

Challenges and Workflow Observations

One challenge arose when working with numbered lists: the AI would sometimes add a second set of numbers to the existing list items, and it took about three or four prompts to get o4-mini to consolidate these into a single, correctly ordered list. Token consumption also felt quite high; the tool seemed to burn through tokens rather quickly.

Comparing CLI vs. IDE Experience

Comparing tools like Cursor IDE and OpenAI's Codex CLI (o4-mini) is complex, as they serve somewhat different purposes, but let's consider them for argument's sake. Token usage in the CLI seemed better optimized for o4-mini than in Cursor; perhaps with better system prompts in Cursor, I could achieve similarly efficient results. Nevertheless, working directly with the o4-mini CLI alongside my code felt refreshing: I could issue a command roughly every 400 seconds, whereas in Cursor I find myself manually approving changes approximately every 60 seconds.

CLI vs. GUI Workflow

The difference in interaction frequency (400s vs. 60s) highlights a potential trade-off between CLI automation and IDE integration. CLIs might offer longer autonomous runs ('full-auto' mode) but potentially require more setup and specific commands, while IDEs offer tighter integration but may involve more frequent user interventions (approvals).

Use Cases: Document Refactoring and Problem Solving

One way I used it was to refactor project documentation for an external project. I tasked it with reviewing five documents, and it successfully consolidated them into three documents containing all the necessary information. For document work, however, Cursor felt easier thanks to its drag-and-drop interface for adding files, compared with manually specifying file paths for the Codex CLI. They are distinct tools, but I wanted to share my experience comparing them.
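
To illustrate what "manually specifying file paths" looked like in practice, a consolidation request from the CLI might resemble the sketch below. Every file name is hypothetical, and the flags are the same assumed options as in the earlier example.

```bash
# Hypothetical invocation: consolidate five docs into three.
# All file names are illustrative, not the actual project files.
codex --model o4-mini --approval-mode full-auto \
  "Read docs/overview.md, docs/setup.md, docs/usage.md, docs/api.md and docs/faq.md, \
   then consolidate them into three documents (getting-started.md, reference.md, \
   troubleshooting.md) without losing any information."
```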

Troubleshooting with Different Models

Another interesting comparison point emerged while tackling a PHP issue. Comparing o4-mini with Gemini 2.5 Pro (experimental), Gemini successfully resolved a PHP problem that o4-mini could not: after three attempts, o4-mini failed to identify the issue. While documenting my troubleshooting process, I provided the relevant details to Gemini 2.5 Pro. It devised a plan based on the error output displayed when accessing the `.php` file directly, which ultimately resolved the problem.
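
The fix hinged on reading the errors PHP emitted when the script was requested directly. A quick, general-purpose way to surface those errors during debugging, independent of which assistant you use, is sketched below; the file path is illustrative.

```bash
# Check the file for parse errors without executing it.
php -l public/index.php

# Execute it from the CLI with error display enabled and all error levels reported,
# overriding the INI settings for this run only.
php -d display_errors=1 -d error_reporting=-1 public/index.php
```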

Model Capabilities & Specialization

The differing success rates in resolving the PHP issue underscore that different models, even contemporary ones, can have varying strengths and weaknesses in specific domains or problem types. This highlights the potential value of having access to multiple models for diverse development tasks.

Broader AI Landscape Observations

Separately, I found an interesting website, trackingai.org/home, which presents AI model 'IQ' scores (using the 'Show Mensa Norway' option). It's quite intriguing to look at. At the time I checked, the site showed 'o3-full' (OpenAI's full o3 reasoning model) at an IQ of approximately 132, Gemini 2.5 Pro (experimental) around 128, and o4-mini (the model discussed in this article) around 118.

Based on these scores, I believe these models are now capable of handling most coding tasks, potentially coding better than I can at this point. It would be interesting to see a detailed report on each model's proficiency in specific programming languages, although I assume they all possess sufficient programming knowledge to create complex applications.

The Future of AI in Coding

Regarding the future of coding, Anthropic CEO Dario Amodei suggested in March 2025 that AI might handle 90-95% of coding within the next few years (Source: Entrepreneur, March 2025). Given the capabilities of current models, I now believe AI can write nearly 100% of the code required for most developers today. The main exception might be the core, proprietary algorithms developed by the AI companies themselves, although I've seen reports suggesting that even OpenAI uses AI for 20-30% of its own internal coding. A quick Google search on this topic seemed primarily to surface results related to Salesforce's use of AI in coding.

Accelerated Release Cycles?

I've also noticed a potential acceleration in model release cycles. Previously, it seemed to take companies roughly five months to test and release new models. However, we received o3-mini (with its 'high' reasoning setting) in January, and now, in late April, o4-mini has been released – a gap of only about three months. Is OpenAI setting a precedent for faster model releases? I recall news from January mentioning that OpenAI planned to open a data center in May, so I wasn't expecting another model release from them before then.

Perspectives on AI and Knowledge Work

Finally, former Google CEO Eric Schmidt has shared some interesting predictions about the near future of AI. I encourage you to watch this YouTube video; it's about two and a half minutes long: https://www.youtube.com/watch?v=GoKb5nQzgKY

Based on his prediction, one might expect that within two years AI could handle 80% of knowledge work, though this remains speculation.

Key Takeaways

  • OpenAI's Codex CLI (running o4-mini) requires explicit instructions but retains context effectively.
  • Specific tasks (like list reformatting) may require iterative prompting.
  • CLI workflows can differ significantly from IDE integrations (e.g., interaction frequency).
  • Usability for specific tasks (like document handling) can vary between tools (CLI vs. GUI).
  • Different AI models exhibit varying strengths in problem-solving (e.g., o4-mini vs. Gemini 2.5 Pro for PHP).
  • AI coding capabilities are rapidly advancing, potentially handling a vast majority of coding tasks.
  • Model release cycles might be accelerating, impacting the pace of AI tool evolution.

Business Implications

  • Tool Selection: Evaluating AI coding tools requires considering not just features but also workflow fit, instruction style, and specific task performance.
  • Developer Productivity: Understanding the nuances of different AI tools can help optimize developer workflows, balancing automation with necessary oversight.
  • Skill Adaptation: Developers need to adapt to working *with* AI, focusing on effective prompting, verification, and integrating AI outputs.
  • Rapid Evolution: Businesses must stay informed about the accelerating pace of AI model releases and capability improvements to leverage the best tools.
  • Strategic Implementation: Choosing the right mix of AI tools (CLI, IDE plugins, different models) can impact project efficiency and success.

Article published on April 22, 2025