vLLM traces long-running memory leak to low-level system calls in disaggregated AI setup
Engineers found a slow but steady rise in memory use during production-like tests, eventually linking the problem to how data was moved between AI components rather than to Python code or the model itself.
vLLM engineers have detailed how they identified the cause of a memory leak that gradually exhausted system memory in a production-style artificial intelligence deployment, despite standard debugging tools initially showing no obvious problem.
According to the vLLM team, the issue appeared in a disaggregated “prefill/decode” setup, an architecture increasingly used to run large language models more efficiently. In this design, one set of workers processes the incoming prompt (prefill), while another generates the output text token by token (decode). Under realistic traffic, memory usage on the decode side rose by roughly 400 megabytes per minute, eventually triggering out-of-memory errors after several hours.
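To put that growth rate in perspective, a short back-of-the-envelope calculation shows why such a leak only becomes fatal after hours of operation. The free-memory figure below is purely illustrative; the write-up does not state the decode node's actual capacity.

```python
# Rough arithmetic only: the free-memory figure is a hypothetical placeholder,
# not a number from the vLLM investigation.
leak_rate_mb_per_min = 400          # observed growth on the decode side
free_memory_gb = 128                # assumed headroom after the model loads
minutes_to_oom = free_memory_gb * 1024 / leak_rate_mb_per_min
print(f"OOM after roughly {minutes_to_oom / 60:.1f} hours")  # ~5.5 hours
```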
The problem was first seen during pre-production testing with the Mistral Medium 3.1 model and graph compilation enabled. Importantly, the leak appeared only in a specific configuration: a disaggregated deployment using NIXL to transfer key-value (KV) cache data between components over UCX and InfiniBand, technologies commonly used for high-performance networking.
For a lay reader, the key point is that nothing was “crashing” immediately. Instead, memory was being consumed very slowly and steadily, making the issue hard to spot and harder to diagnose. Standard Python memory profiling tools such as Memray and Guppy 3 showed no obvious leak, and attempts to attach a traditional debugger caused the process to crash, limiting visibility.
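For readers unfamiliar with those tools, the sketch below shows the kind of high-level check that came back clean. guppy3 (imported as `guppy`) only measures objects tracked by Python's allocator, so memory mapped directly by native libraries never appears in its totals; the workload in the middle is left as a placeholder.

```python
# A minimal guppy3 snapshot of the Python-level heap. In the leak described
# above, this number stayed roughly flat even as process memory climbed,
# because the growth happened outside Python's allocator.
from guppy import hpy   # pip install guppy3

h = hpy()
baseline = h.heap().size            # bytes held by tracked Python objects
# ... serve traffic for a while (placeholder for the real workload) ...
growth_mb = (h.heap().size - baseline) / 1e6
print(f"Python-level heap growth: {growth_mb:.1f} MB")
```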
Engineers then took a different approach. Running Heaptrack, a heap memory profiler for native code, showed that the program’s heap, where most application memory normally lives, remained stable. Yet the total physical memory occupied by the process, its resident set size (RSS), continued to grow. This pointed to memory being allocated outside the heap, beyond what Python’s own tools could easily see.
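The symptom itself can be watched with a few lines of standard-library Python that poll a process's RSS from /proc, independent of any profiler. This is a Linux-only sketch, and the one-minute polling interval is an arbitrary choice.

```python
#!/usr/bin/env python3
"""Watch a process's RSS for slow, steady growth (Linux only)."""
import sys
import time

def rss_mib(pid: int) -> float:
    # VmRSS in /proc/<pid>/status reports the resident set size in kibibytes.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    raise RuntimeError("VmRSS not found")

if __name__ == "__main__":
    pid = int(sys.argv[1])
    baseline = rss_mib(pid)
    while True:
        time.sleep(60)
        print(f"RSS growth since start: {rss_mib(pid) - baseline:.0f} MiB")
```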
Using system utilities such as pmap and the Linux /proc interface, the team monitored how the process’s memory regions changed over time. They observed anonymous memory mappings, regions of memory not backed by any file, growing steadily and shifting addresses as the program ran.
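A simplified version of that /proc-based tracking looks like the sketch below: it sums the size of mappings that have no backing file, which is where the growth showed up. The team's actual tooling is not public, so this is only an approximation of the approach; run it periodically against the decode process and compare the totals.

```python
#!/usr/bin/env python3
"""Total the anonymous mappings of a PID by parsing /proc/<pid>/maps."""
import sys

def anon_mib(pid: int) -> float:
    total = 0
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split()
            # Plain anonymous mmap regions have no pathname column at all.
            if len(fields) < 6:
                start, end = (int(x, 16) for x in fields[0].split("-"))
                total += end - start
    return total / 2**20

if __name__ == "__main__":
    print(f"anonymous mappings: {anon_mib(int(sys.argv[1])):.1f} MiB")
```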
To understand where those allocations were coming from, the engineers traced system calls with bpftrace, a tool that can observe what a running program asks the operating system to do. They also automated GDB, a low-level debugger, to break execution at key points. This combination revealed repeated mmap calls, the system call a program uses to request new regions of memory from the kernel.
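GDB’s built-in Python API makes that kind of automation straightforward. The sketch below is an illustration rather than the team’s actual script: it sets a breakpoint on the libc mmap wrapper, prints a short backtrace each time it fires, and resumes immediately. Calls issued as raw syscalls would need GDB’s `catch syscall mmap` instead.

```python
# trace_mmap.py -- load inside GDB, e.g.  gdb -p <PID> -x trace_mmap.py
# A sketch of scripted breakpoints; the symbol name and backtrace depth
# are illustrative choices, not taken from the vLLM write-up.
import gdb

class MmapLogger(gdb.Breakpoint):
    """Print the callers of mmap without pausing the target for long."""
    def stop(self):
        gdb.execute("bt 8")   # show the top frames above the mmap call
        return False          # returning False lets the program keep running

MmapLogger("mmap")
gdb.execute("continue")
```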
Crucially, those calls did not originate from typical application code. bpftrace showed they came from the raw syscall wrapper, and GDB stack traces linked them to libucm, a library involved in high-performance communication, and to Python’s PyMem_ArenaAlloc, which manages memory arenas for the Python runtime.
By matching the memory addresses returned by these calls with the growing anonymous regions seen earlier, the team was able to connect the steady RSS growth directly to this interaction between Python’s memory management and the networking layer used to move data between AI components.
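That correlation step can be sketched as a small helper that takes addresses recorded from the syscall trace and checks whether they fall inside the process’s current anonymous mappings. The trace-file format assumed here, one hexadecimal address per line, is an assumption made for the example.

```python
#!/usr/bin/env python3
"""Check traced mmap return addresses against a PID's anonymous mappings."""
import sys

def anon_ranges(pid: int):
    ranges = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 6:                       # no backing file
                start, end = (int(x, 16) for x in fields[0].split("-"))
                ranges.append((start, end))
    return ranges

if __name__ == "__main__":
    pid, trace_path = int(sys.argv[1]), sys.argv[2]
    ranges = anon_ranges(pid)
    with open(trace_path) as f:
        for raw in f:
            addr = int(raw.strip(), 16)
            hit = any(lo <= addr < hi for lo, hi in ranges)
            print(f"{hex(addr)}: {'inside an anonymous mapping' if hit else 'no match'}")
```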
In practical terms, the finding suggests that the leak was not caused by the AI model itself or by user-level Python code, but by how low-level libraries handled memory in this specific distributed configuration. That distinction matters for operators running large AI systems, because it changes where fixes need to be applied and which configurations may be risky at scale.
The episode highlights a broader challenge in modern AI infrastructure. As systems become more modular and rely on specialised networking and acceleration technologies, problems can surface far below the level most developers usually inspect. Slow leaks that appear only after hours of operation can be especially costly in production, underlining the importance of long-running tests and deep observability.
vLLM said the investigation underscores the limits of high-level profiling tools for diagnosing issues in complex AI stacks, and the need to combine them with system-level tracing when behaviour does not match expectations.
The Recap
- Memory leak appeared in vLLM disaggregated prefill/decode setup.
- Decode-side memory rose by about 400 MB per minute.
- Engineers opened an issue on vLLM’s GitHub repository.