Real-World Use: Language Model Selection & Orchestration
There’s a growing emphasis on multi-model orchestration to tackle more complex AI implementations, complemented by compelling research into how enterprises are deploying Language Models.
In Short…what is happening in enterprise AI?
Research from Menlo Ventures shows that enterprise spending on inference doubled in the last six months.
Code generation has become AI’s first breakout use case.
Beyond pre-training, foundation models are now scaling along a second axis…reinforcement learning with verifiers (this aligns with recent research from OpenAI on best practices).
This drives additional inference spend and the introduction of more models into workflows.
Enterprise dollars are consolidating around a few high-performing, closed-source models, with Anthropic as the new market leader.
By the end of 2023, OpenAI commanded 50% of the enterprise LLM market, but its early lead has eroded.
Today, it captures just 25% of enterprise usage — half of what it held two years ago.
Code generation became AI’s first killer app…
Reinforcement learning with verifiers (other models) is the new path to scaling intelligence.
Training models as “AI Agents” to use tools makes them far more useful.
OpenAI has been sharing the underlying architecture of ChatGPT, illustrating how multiple models are orchestrated under the hood.
This is also the case with OpenAI’s deep research API, where it was shown that they orchestrate multiple models behind the API.
Model Selection and Market Growth
In a remarkably short timeframe, model API expenditures have more than doubled, rising from $3.5 billion — part of the $13.8 billion total generative AI spending estimated last year — to $8.4 billion.
This growth highlights a notable transition among enterprises, with increased emphasis on production inference over mere model development, diverging from trends in prior years.
OpenAI held a commanding 50% share of the enterprise LLM market by the end of 2023, but that dominance has significantly diminished, now standing at just 25% of enterprise usage — half its level from two years ago.
Anthropic has risen to the forefront of enterprise AI adoption with a 32% share, ahead of OpenAI (25%) and Google (20%, bolstered by recent momentum).
Meta’s Llama secures 9%, while DeepSeek, despite a buzzworthy launch early in the year, claims only 1%.
The three growth drivers for Anthropic are code generation, reinforcement learning and Agent training…
Enterprises Switch Models for Performance, Not Price
Switching between vendors is relatively easy, but increasingly rare.
Most development teams stay loyal to their existing AI providers: once committed to a platform, builders tend to remain, but they readily move to newer, higher-performing models as those are released.
According to the survey, 66% of builders upgraded models within their current provider, while 23% made no model switches at all in the past year.
Only 11% opted to change vendors entirely.
OpenAI Real-World Use Cases
Model orchestration coordinates multiple AI models to handle tasks.
OpenAI uses models like o4-mini, o3, gpt-4.1, and gpt-4.1-mini in workflows.
This approach assigns roles based on model strengths, such as ideation, routing, synthesis, critique & verification.
OpenAI’s method includes:
Multi-Agent Collaboration
o4-mini for ideation or routing; o3 for reasoning and critique.
Hierarchical Navigation
Models process data recursively with a reasoning scratchpad.
Tool Integration
Use tools like chemical lookups or cost estimators for grounded decisions.
Escalation
Switch models dynamically, escalate to o3 for analysis or use gpt-4.1-mini for efficiency.
Human-in-the-loop
Include validation and learning loops for refinement.
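The escalation pattern above can be sketched as a simple router: try an efficient model first, then hand off to a deeper reasoning model when confidence is low. This is a minimal illustration, not OpenAI's actual implementation; the `call_model` stub and the confidence heuristic are assumptions (a real system would call the OpenAI API and use a model-reported or judge-scored confidence).

```python
# Minimal sketch of dynamic model escalation (illustrative only).
# call_model is a stub; the length-based confidence heuristic is an
# assumption standing in for a real confidence signal.

def call_model(model: str, prompt: str) -> dict:
    """Stub standing in for an actual API call."""
    # Pretend the cheap model reports low confidence on long, complex prompts.
    confidence = 0.9 if len(prompt) < 80 else 0.4
    return {"model": model, "answer": f"[{model}] draft answer", "confidence": confidence}

def route(prompt: str, threshold: float = 0.7) -> dict:
    """Try the efficient model first; escalate to o3 when confidence is low."""
    result = call_model("gpt-4.1-mini", prompt)
    if result["confidence"] < threshold:
        result = call_model("o3", prompt)  # escalate to the deeper reasoning model
    return result

print(route("Summarise this clause.")["model"])   # handled by the efficient model
print(route("Analyse the interactions between these fifty contractual clauses across two distinct jurisdictions.")["model"])  # escalated
```

The same routing function can sit behind a "Fast" vs "Thorough" toggle by simply forcing one branch or the other.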
Examples from OpenAI
Long-Context RAG for Legal Q&A
OpenAI models like gpt-4.5 (used via tools like ChatGPT or the API) can route queries across legal documents such as the Trademark Trial and Appeal Board Manual. A larger model like gpt-4-turbo can synthesize answers with citations. Smaller models (e.g., gpt-4o-mini) can act as verification agents or LLM judges to cross-check outputs.
This architecture enables low-latency querying of newly uploaded documents and handles paraphrased questions robustly.
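The retrieve-synthesize-verify chain can be sketched end to end. Everything here is an assumption for illustration: the two TBMP snippets are paraphrased placeholders, keyword overlap stands in for embedding retrieval, and `synthesize`/`verify` are stubs where a larger model and a small LLM judge would be called.

```python
# Illustrative long-context RAG pipeline with an LLM-judge step.
# DOCS contents, the retrieval scoring, and the synthesize/verify stubs
# are assumptions, not real TBMP text or real model calls.

DOCS = {
    "TBMP Sec. 309": "A notice of opposition must include a concise statement of the grounds.",
    "TBMP Sec. 503": "A motion for summary judgment may be filed before trial opens.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval (stand-in for embedding search)."""
    overlap = lambda d: len(set(query.lower().split()) & set(DOCS[d].lower().split()))
    return sorted(DOCS, key=overlap, reverse=True)[:k]

def synthesize(query: str, sources: list[str]) -> str:
    """Stand-in for a larger model producing a cited answer."""
    return f"Answer based on {', '.join(sources)}: {DOCS[sources[0]]}"

def verify(answer: str, sources: list[str]) -> bool:
    """Stand-in for a small model cross-checking that every citation appears."""
    return all(s in answer for s in sources)

sources = retrieve("What must a notice of opposition include?")
answer = synthesize("What must a notice of opposition include?", sources)
assert verify(answer, sources)
print(answer)
```

Because retrieval runs over whatever is currently in the document store, newly uploaded documents become queryable immediately, which is the low-latency property the pattern targets.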
AI Co-Scientist for Pharmaceutical R&D
Using agent workflows, models like gpt-4o-mini can propose synthesis protocols, while others like gpt-3.5 or gpt-4-turbo critique for rigor or feasibility.
Safety checks can be integrated via predefined tools or prompts.
Databases can be updated iteratively for continual learning, helping reduce development time and cost.
Insurance Claim Processing
OCR and vision tasks (reading scanned forms or photos) can be handled by gpt-4o, which supports multimodal input. Smaller models (e.g., gpt-4o-mini) can resolve ambiguous entries or use web-enhanced search tools when integrated with external systems.
High-throughput processing (1,000 pages) can be done for under $16 using cost-efficient model configurations.
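A back-of-envelope cost model shows how a figure in that range comes about. The token counts and per-million-token prices below are assumptions for illustration, not published OpenAI pricing; substitute current rates for a real estimate.

```python
# Back-of-envelope cost model for high-throughput claim processing.
# ALL numbers here are assumptions, not actual OpenAI pricing.

PAGES = 1_000
TOKENS_IN_PER_PAGE = 1_500    # assumed: one scanned form page as input tokens
TOKENS_OUT_PER_PAGE = 300     # assumed: structured extraction output per page

PRICE_IN_PER_M = 2.50         # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 10.00       # assumed $ per 1M output tokens

cost = PAGES * (TOKENS_IN_PER_PAGE * PRICE_IN_PER_M
                + TOKENS_OUT_PER_PAGE * PRICE_OUT_PER_M) / 1e6
print(f"${cost:.2f} for {PAGES} pages")  # → $6.75 for 1000 pages
```

Even with generous assumptions, per-page costs stay at fractions of a cent, which is why routing the bulk of pages to cost-efficient models keeps a 1,000-page batch under the quoted $16.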
Benefits
Hybrid systems allow trade-offs between speed and depth using model modes like “Fast” vs “Thorough.”
Verifiability and reliability improve with model cross-checking.
Systems scale well via function routing and agent coordination.
OpenAI provides example code and workflows (e.g., routing functions, safety checks, RAG patterns) in its Cookbook to help developers implement such multi-agent architectures.
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. Language Models, AI Agents, Agentic Apps, Dev Frameworks & Data-Driven Tools shaping tomorrow.