Topic

#Gpu

10 articles on GPU: news, releases, guides and analysis from the SourceFeed engine.

Popping the CPU-GPU Latency Bubble in Inference

Pipelined decoding techniques show that software optimization, not just raw hardware scaling, is the key to maximizing GPU utilization.

Emeka Okafor

OpenAI Jalapeno and the Shift to Custom Inference Silicon

Custom ASICs are replacing general-purpose GPUs for running large language models to survive the crushing cost of scale.

Article · 3d ago · 7

Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.

Tutorial · 1w ago · 0

The Architecture of Monopoly: Inside NVIDIA's Supercomputing Hegemony

NVIDIA now powers 81 percent of the world's fastest supercomputers, forcing a fundamental rewrite of high-performance software.

Article · 1w ago · 0

Running 70B Models on 4GB VRAM: The AirLLM Layer-Swap Hack

AirLLM trades disk I/O for VRAM, letting developers run massive models locally without renting enterprise GPU clusters.

Article · 1w ago · 1

TPU vs GPU: The Architecture and Software Trade-offs

Choosing between GPUs and TPUs requires balancing CUDA's dynamic flexibility against the static compilation efficiency of systolic arrays.

Article · 1w ago · 1

Disaggregating LLM Inference: Inside AMD's ATOM and ATOMesh Stack

AMD's native ROCm serving stack splits prefill and decode to eliminate head-of-line blocking on Instinct hardware.

Article · 1w ago · 0

NVIDIA's cuTile Brings Fearless Concurrency to GPU Kernels in Rust

A new tile-based DSL from NVIDIA Labs extends Rust's strict ownership model directly to high-performance GPU programming.

Article · 1w ago · 5

xAI Is Becoming the Landlord of the AI Compute Stack — and That Matters for Developers

xAI's deals to lease GPU capacity to Anthropic and Google reframe it as infrastructure provider first, frontier lab second. Here's what that means for the APIs you're building on.

Article · 3w ago · 1

Xiaomi's MiMo-V2.5-Pro-UltraSpeed Pushes a 1T Model Past 1000 Tokens/Sec on Commodity GPUs

Through FP4 quantization, block-level speculative decoding, and the TileRT system stack, Xiaomi claims trillion-parameter decode speeds normally reserved for custom silicon — on a single 8-GPU node.

News · 3w ago · 5