The Great Model Downgrade: Why Tech Companies Are Ditching Expensive AI
As inference costs soar, enterprises are discovering that smaller models handle 80% of workloads just fine—and the economics could reshape OpenAI and Anthropic's path to IPO.
A GitHub research project claims to compile LLM computation graphs into single CUDA kernels with formal correctness guarantees, but lacks published benchmarks or third-party validation.
A Hugging Face hackathon project demonstrates that serving four different small models in a single agent economy is tractable when infrastructure abstracts tokenizer variance.
Organizations are exploring on-premise language models as pre-filters to reduce API spend on commercial LLMs, though cost savings remain context-dependent.
The AI chip startup is shifting focus to its inference-as-a-service platform following its $20B partial exit with Nvidia.
The AI chip startup is pivoting toward inference-as-a-service, backed by existing investors including Disruptive and Infinitium.
The Korean chip startup raises Series B at $570M valuation, targeting the data-movement bottleneck that GPUs can't solve alone.
A new multi-part guide demystifies torch.profiler traces, starting with matrix operations and scaling to large language model optimization.
A new inference cloud startup backed by FUSE VC is deploying specialized chips to undercut GPU-heavy competitors in the race for AI inference capacity.
A new TRL protocol reduces per-step model synchronization from terabytes to tens of megabytes by shipping only changed parameters across distributed training pipelines.
AI gateway OpenRouter raises $113M Series B from CapitalG, doubling its valuation in 12 months as enterprises increasingly avoid vendor lock-in.
NVIDIA releases diffusion language models at 3B, 8B, and 14B scales that generate and refine tokens in parallel, offering latency improvements for GPU-constrained inference workloads.
Cerebras Systems priced its IPO at $185/share and opened at $385, closing day one at $311 with a $66B market cap.
Hugging Face's engineering blog details how asynchronous continuous batching eliminates CPU-GPU idle gaps that waste nearly a quarter of LLM inference runtime.
Oracle has staked its enterprise future on a $300B compute deal with OpenAI, betting that AI's real profits live in inference — not model training.