Topic

#Inference

9 articles on Inference — news, releases, guides and analysis from the SourceFeed engine.

Popping the CPU-GPU Latency Bubble in Inference

Pipelined decoding techniques show that software optimization, not just raw hardware scaling, is the key to maximizing GPU utilization.

Emeka Okafor

OpenAI Jalapeño and the Shift to Custom Inference Silicon

Custom ASICs are replacing general-purpose GPUs for running large language models as providers try to survive the crushing cost of scale.

Article · 3d ago

The LLM Cost Cliff Your Budget Isn't Ready For

Per-token prices are collapsing, yet AI bills keep exploding. The two facts aren't a contradiction, and confusing them will wreck your business case.

Article · 4d ago

OpenAI's Jalapeño Chip Is a Bet on Inference Economics

A custom Broadcom-built ASIC for LLM inference puts OpenAI on the same vertical-integration path Google and Amazon paved years ago.

News · 6d ago

How OpenAI's Jalapeño Chip Changes Production LLM Serving

The custom silicon shift signals a move away from general-purpose GPUs toward highly specialized, memory-optimized inference architectures.

Article · 6d ago

Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.

Tutorial · 1w ago

Running 70B Models on 4GB VRAM: The AirLLM Layer-Swap Hack

AirLLM trades disk I/O for VRAM, letting developers run massive models locally without renting enterprise GPU clusters.

Article · 1w ago

Unified x86 AI Acceleration: Inside the New ACE Specification

The x86 Ecosystem Advisory Group's new spec brings standardized matrix multiplication and tile registers to modern CPU architectures.

Article · 1w ago

Xiaomi's MiMo-V2.5-Pro-UltraSpeed Pushes a 1T Model Past 1000 Tokens/Sec on Commodity GPUs

Through FP4 quantization, block-level speculative decoding, and the TileRT system stack, Xiaomi claims trillion-parameter decode speeds normally reserved for custom silicon — on a single 8-GPU node.

News · 3w ago
