If you were building an app on top of a large language model in early 2023, you were paying meaningfully more per token than you would today — often ten times more, sometimes more than that. The drop has been steep enough that products which weren't economically viable two years ago now are. That shift is worth understanding in concrete terms rather than just accepting it as "AI is getting cheaper."
Three things are happening simultaneously: GPU hardware is getting more efficient, model architectures have been heavily optimized for inference, and intense provider competition has compressed margins. They reinforce each other, but they're distinct.
Hardware efficiency is compounding
The H100 GPU, which became the primary workhorse for LLM inference at scale in 2023–2024, offered significant throughput improvements over the A100 generation. NVIDIA's Blackwell architecture pushed that further. More importantly, inference stresses hardware differently from training: generating tokens one at a time tends to be limited by memory bandwidth rather than raw compute. Providers have gotten better at matching hardware configurations to the specific demands of serving tokens rather than computing gradients.
Speculative decoding has also had a real impact. The technique uses a smaller "draft" model to propose several candidate tokens, which the main model then verifies in a single parallel pass instead of generating each token sequentially. Because the main model keeps only the candidates it agrees with, output quality doesn't change; the speedup depends on how often the draft guesses right. For typical conversational workloads, this can roughly double throughput. It's been widely deployed across providers because the implementation cost is relatively low and the gains are immediate.
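The draft-and-verify loop can be sketched in a few lines. This is a toy under stated assumptions, not any provider's implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to the next token greedily, and the inner verification loop only simulates what a real system would do in one batched forward pass of the main model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, target_passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target checks every proposed position. A real system does
        #    this in ONE batched pass; this loop only simulates that pass.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(seq + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # keep the target's own token
                break
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target(seq + proposal))
        seq.extend(accepted)
    return seq[:len(prompt) + max_new], passes


# Hypothetical toy models (assumptions for illustration only).
def target_model(ctx: List[Token]) -> Token:
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_model(ctx: List[Token]) -> Token:
    guess = target_model(ctx)
    # The draft is wrong whenever the context length is a multiple of 3.
    return guess if len(ctx) % 3 else (guess + 1) % 11


tokens, passes = speculative_decode(target_model, draft_model, [1], max_new=12)
print(tokens, passes)
```

In this toy, the output matches what the target model would produce decoding one token at a time, while the number of simulated target passes is lower than the number of tokens generated. How much lower depends entirely on how often the draft agrees with the target.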
Model architecture changes matter more than raw size
The prevailing assumption in 2022 was that better performance meant bigger models. That's been complicated by results from models like DeepSeek-R1 and the Mistral family, which showed that careful architecture decisions and training data curation could get you close to frontier performance while using a fraction of the parameters per token. (DeepSeek-R1 has a very large total parameter count, but it activates only a small share of its weights for each token.) All else equal, a model with 70 billion parameters running on optimized inference infrastructure is simply cheaper to serve than a dense 540-billion-parameter model, even before other optimizations.
Quantization, which reduces the numerical precision of model weights from 16- or 32-bit floating point to 8-bit or 4-bit integers, has matured significantly. Early quantization approaches degraded model quality noticeably. Techniques like GPTQ and AWQ (Activation-aware Weight Quantization) preserve output quality much better, and most production deployments now use some form of quantization. A model that fits in less GPU memory leaves more room for concurrent requests on the same hardware.
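To make the memory argument concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest quantization with one scale per tensor. It is deliberately not GPTQ or AWQ, which choose scales and rounding far more carefully; the function names are illustrative, and the footprint helper counts weights only, ignoring KV cache and activations.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so every value fits in one signed byte.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst_error)

for bits in (32, 16, 8, 4):
    print(bits, "bits:", weight_memory_gb(70e9, bits), "GB")
```

For a 70-billion-parameter model, the weights alone take 280 GB at 32 bits, 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits. The rounding error per weight is bounded by half the scale, which is why the choice of scale matters so much in the more careful methods.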
Mixture-of-experts (MoE) architectures take a different approach: instead of activating all model parameters for every token, they route each token through a small subset of specialized subnetworks ("experts"). GPT-4 was widely reported to use this approach. The result is a model with a large total parameter count that activates only a fraction of those parameters per token, reducing compute requirements substantially.
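The routing idea can be illustrated with a small top-k sketch. Everything here is a simplifying assumption: in a real MoE layer the gate scores come from a learned router applied to the token's hidden state and the experts are feed-forward networks, whereas this toy passes the scores in directly and uses plain functions on a single number.

```python
import math
from typing import Callable, List, Sequence

Expert = Callable[[float], float]


def moe_forward(x: float, gate_scores: Sequence[float],
                experts: Sequence[Expert], top_k: int = 2) -> float:
    """Route one token through only the top_k highest-scoring experts."""
    # Indices of the top_k experts by gate score.
    chosen = sorted(range(len(experts)),
                    key=lambda i: gate_scores[i], reverse=True)[:top_k]
    # Softmax over the chosen scores so the mixing weights sum to 1.
    peak = max(gate_scores[i] for i in chosen)
    exps = [math.exp(gate_scores[i] - peak) for i in chosen]
    total = sum(exps)
    # Only the chosen experts run; the others cost nothing for this token.
    return sum((e / total) * experts[i](x) for e, i in zip(exps, chosen))


calls: List[int] = []


def make_expert(index: int) -> Expert:
    def expert(x: float) -> float:
        calls.append(index)  # record which experts actually ran
        return x * (index + 1)
    return expert


experts = [make_expert(i) for i in range(8)]
scores = [0.1, 2.0, 0.3, 2.0, -1.0, 0.0, 0.5, 0.2]
output = moe_forward(1.0, scores, experts, top_k=2)
print(output, sorted(calls))
```

With eight experts and `top_k=2`, only two expert functions are called for the token, even though all eight exist in memory. That gap between total and active parameters is where the compute saving comes from.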
Competition has done the rest
By mid-2025, enterprise spending on LLM APIs had risen to roughly $8.4 billion — up from about $3.5 billion in late 2024. That growth attracted competitors. OpenAI still commands around 74–75% of chatbot market share, but the API market is more fragmented. Anthropic, Google (Gemini), Meta (Llama-based providers), and dozens of smaller inference providers are competing for the same enterprise budgets.
When providers can offer comparable quality at lower prices — partly because open-weight models like Llama allow anyone to run their own inference — the market pressure on pricing is real. Some providers have cut prices multiple times in 18 months. The cost per million tokens for models comparable to GPT-3.5-era capability has fallen to a point where it's no longer a significant line item for most applications.
What this actually changes for developers
The more interesting consequence isn't just that existing applications get cheaper to run. It's that use cases that were previously cost-prohibitive become viable. Running an LLM call on every user interaction (summarizing, classifying, extracting structure) was expensive enough in 2023 to require careful justification. At 2025 prices, the per-call cost in many of those use cases is too small to be worth optimizing around.
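A back-of-the-envelope calculation shows why per-interaction calls stop needing justification once prices fall. The prices below are hypothetical placeholders, not quotes from any provider, and the token counts are assumed values for a short summarization or classification call.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(calls_per_day: int, per_call: float, days: int = 30) -> float:
    """Dollar cost of running the same call at a fixed daily volume."""
    return calls_per_day * per_call * days


# Hypothetical prices in USD per million tokens (assumptions, not real quotes).
per_call = cost_per_call(1_000, 200,
                         input_price_per_m=0.50, output_price_per_m=1.50)
print(f"per call: ${per_call:.4f}")
print(f"per month at 100k calls/day: ${monthly_cost(100_000, per_call):,.2f}")
```

With these made-up numbers, a call costs $0.0008, or about $2,400 a month at 100,000 calls a day. The same workload at ten times the price would be $24,000 a month, which is the kind of difference that decides whether a feature ships.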
This is also why the number of LLM-powered applications has grown so sharply. By some estimates, the total count was on track to reach 750 million globally by the end of 2025. That figure almost certainly includes a lot of low-quality products, but the underlying trend is real: cheap inference means the barrier to shipping something LLM-based is now mostly about product judgment, not infrastructure cost.
The trend isn't expected to stop. Inference costs have dropped roughly 10x per year for equivalent capability since 2022. Whether that rate continues depends partly on how much headroom remains in hardware and model architecture, and partly on whether competitive dynamics hold. Both are uncertain, but there's no strong reason to expect the trajectory to reverse in the near term.
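Compounding is easy to misjudge, so it helps to work the arithmetic. The sketch below assumes a constant yearly factor, which is a simplification; the function names and the starting price are illustrative only.

```python
def price_after(price_now: float, annual_drop: float, years: float) -> float:
    """Price after `years` if it falls by a factor of `annual_drop` each year."""
    return price_now / annual_drop ** years


def implied_annual_drop(total_drop: float, years: float) -> float:
    """Constant yearly factor that compounds to `total_drop` over `years`."""
    return total_drop ** (1 / years)


# A hypothetical $10 per million tokens, falling 10x per year for two years.
print(price_after(10.0, annual_drop=10, years=2))

# A 10x total drop spread over two years implies a smaller yearly factor.
print(implied_annual_drop(10, years=2))
```

A sustained 10x per year compounds to 100x over two years, while a 10x drop spread across two years works out to roughly 3.2x per year. Which figure applies depends on which capability tier and time window you measure.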