Why We Chose Rust to Accelerate Python AI Infrastructure
Python AI libraries like LiteLLM, LangGraph, and CrewAI hit severe performance bottlenecks at production scale due to the GIL, memory overhead, and single-threaded serialization. Neul Labs rewrites the hot paths in Rust using PyO3, shipping drop-in replacements that achieve 3.2x-737x speedups with a single import and zero code changes.
Quick Reference
| Library | What it accelerates | Key speedup | Install |
|---|---|---|---|
| fast-litellm | LiteLLM connection pooling & rate limiting | 3.2x faster, 42x less memory | pip install fast-litellm |
| fast-langgraph | LangGraph checkpoint serialization | Up to 737x faster | pip install fast-langgraph |
| fast-crewai | CrewAI serialization | 34x faster | pip install fast-crewai |
| fast-axolotl | Axolotl large dataset training | Eliminates OOM errors | pip install fast-axolotl |
All libraries are MIT licensed, require no Rust installation, and ship prebuilt wheels for Linux, macOS, and Windows.
The Problem: Python AI Frameworks Hit a Concurrency Ceiling
The AI agent ecosystem runs on Python. LiteLLM, LangGraph, CrewAI, Axolotl — these are the libraries that power production AI systems worldwide. But they share a common problem: they were never designed for production concurrency.
When you deploy AI agents at scale, three specific bottlenecks emerge:
1. Connection Pooling Under the GIL
LiteLLM’s HTTP connection management uses Python’s threading primitives. Under the Global Interpreter Lock (GIL), these threads cannot execute Python bytecode in parallel. At high concurrency (50+ simultaneous agent requests), connection acquisition latency spikes from microseconds to milliseconds.
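This ceiling is easy to reproduce with nothing but the standard library: two threads running a CPU-bound loop take roughly as long as running the loop twice serially, because only one thread can hold the GIL at a time. (Timings vary by machine; free-threaded CPython builds behave differently. The busy loop is a synthetic stand-in, not LiteLLM's pooling code.)

```python
import threading
import time

def count_down(n):
    # Pure-Python CPU-bound loop; holds the GIL for its entire run.
    while n:
        n -= 1

N = 5_000_000

# Serial: two back-to-back runs.
t0 = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - t0

# Threaded: two threads, but under the GIL only one executes bytecode at a time.
t0 = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```

On a stock CPython build the threaded run shows little or no improvement over serial, which is exactly the contention that hits connection acquisition under load.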
2. Checkpoint Serialization Bottleneck
LangGraph serializes agent state to storage on every graph node completion using json.dumps with custom encoders. For a typical agent state (50 messages, tool results, metadata), each checkpoint takes 50-100ms in pure Python. A 10-step agent execution spends 500ms-1s on serialization alone — often exceeding the LLM API call time.
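To make the cost concrete, here is a rough pure-Python measurement of the pattern described above. The state shape, message sizes, and encoder are synthetic stand-ins, not LangGraph's actual checkpoint schema:

```python
import json
import time

# Synthetic agent state: 50 messages plus tool results and metadata.
# Shapes and sizes are illustrative only.
state = {
    "messages": [
        {
            "role": "assistant" if i % 2 else "user",
            "content": "token " * 200,               # ~1 KB of text per message
            "metadata": {"step": i, "ts": 1700000000.0 + i},
        }
        for i in range(50)
    ],
    "tool_results": [
        {"tool": "search", "output": "result " * 100} for _ in range(10)
    ],
}

class StateEncoder(json.JSONEncoder):
    """Stand-in for a custom encoder; stringifies anything json can't handle."""
    def default(self, o):
        return str(o)

start = time.perf_counter()
payload = json.dumps(state, cls=StateEncoder)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"checkpoint: {len(payload):,} bytes serialized in {elapsed_ms:.2f} ms")
```

Multiply the per-checkpoint cost by the number of graph nodes in a run to estimate how much wall-clock time an agent spends on serialization alone.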
3. Memory Overhead at Scale
Python’s object model adds ~232 bytes of overhead per rate-limiting entry in LiteLLM. At high cardinality (thousands of API keys x models x endpoints), this compounds to gigabytes of memory for what is fundamentally a simple lookup table.
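A quick way to see this per-object overhead, using only the standard library. The entry layout below is a hypothetical stand-in for a rate-limit record, and the exact byte counts are CPython implementation details that vary by version and platform:

```python
import sys

# One rate-limit entry modeled as typical Python objects: a small dict
# of counters, each key and value a separately allocated heap object.
entry = {"requests": 42, "tokens": 18750, "window_start": 1700000000.0}

per_entry = sys.getsizeof(entry) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in entry.items()
)
print(f"approximate bytes for one entry's objects: {per_entry}")

# The equivalent Rust struct is three plain 8-byte fields (u64, u64, f64)
# stored contiguously, with no per-object headers and no boxed keys.
```

Note that `sys.getsizeof` undercounts shared overheads (allocator slack, hash-table slots), so the real footprint per entry is typically higher still.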
Why Not Just Optimize the Python Code?
We tried. There’s a hard ceiling.
Python’s memory model, its GIL, and the per-object overhead of its type system create a performance floor that no amount of algorithmic optimization can break through. Even with Python’s C-accelerated json module, LangGraph’s checkpoint serialization is bound by object allocation and traversal in Python’s managed heap.
The only way to break through is to bypass these constraints entirely.
Why Rust + PyO3 Is the Answer
Rust provides three capabilities that Python structurally cannot:
Zero-cost abstractions. Lock-free data structures like DashMap eliminate GIL contention entirely. Multiple threads can read and write concurrently without acquiring any Python-level lock.
Memory efficiency. A rate-limiting entry in Rust uses ~48 bytes vs ~232 bytes in Python — roughly a 5x per-entry reduction. End to end, our benchmarks measure a 42x drop in memory consumption for high-cardinality rate limiting.
True parallelism. Rust threads run independently of the GIL. Connection pooling, serialization, and state management all execute genuinely in parallel — not time-sliced under Python's GIL.
PyO3 is the bridge. It provides low-overhead Rust-Python bindings, exposing Rust types as native Python classes so hot-path calls cross the boundary without serialization. From Python's perspective, the Rust objects are indistinguishable from pure Python ones.
How Drop-In Replacement Works
Every Neul Labs accelerator follows the same integration pattern:
```python
# Step 1: Install (no Rust toolchain needed)
#   pip install fast-litellm

# Step 2: Import before the library you're accelerating
import fast_litellm
import litellm

# Step 3: There is no step 3.
# Your existing code, tests, and config all continue to work.
```
The import fast_litellm statement triggers automatic monkeypatching that replaces LiteLLM’s hot-path implementations (connection pooling, rate limiting, token counting) with Rust equivalents. The library detects the installed version of LiteLLM and patches only compatible code paths, with automatic fallback to the original Python implementation if any incompatibility is detected.
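The general shape of this patch-with-fallback mechanism can be sketched in a few lines. `Pool`, `accelerated_acquire`, and `patch` are illustrative stand-ins, not fast-litellm's internal API:

```python
class Pool:
    """Stand-in for a library class whose hot path we want to replace."""
    def acquire(self):
        return "python-impl"

def accelerated_acquire(self):
    """Stand-in for the Rust-backed replacement (would call into PyO3)."""
    return "rust-impl"

def patch(cls, name, replacement):
    """Swap cls.<name> for replacement, keeping the original for fallback."""
    original = getattr(cls, name, None)
    if original is None:
        return False  # incompatible version detected: leave the class untouched
    setattr(cls, f"_orig_{name}", original)  # preserved for silent fallback
    setattr(cls, name, replacement)
    return True

patch(Pool, "acquire", accelerated_acquire)
print(Pool().acquire())        # now routed to the accelerated implementation
print(Pool()._orig_acquire())  # original stays reachable for fallback
```

The version check in the real library is necessarily more involved (it must match method signatures across releases), but the keep-the-original-and-swap pattern is the same.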
Benchmark Results
All benchmarks measured on production-representative workloads. Source code and reproduction instructions available in each project’s GitHub repository.
| Library | Operation | Python Baseline | Rust Accelerated | Speedup |
|---|---|---|---|---|
| fast-litellm | Connection pool acquisition | 3.1ms | 0.97ms | 3.2x |
| fast-litellm | Rate limiting (memory, 10K entries) | 2.3MB | 0.055MB | 42x less |
| fast-litellm | Token counting (large doc) | 12.4ms | 7.3ms | 1.7x |
| fast-langgraph | Checkpoint serialization | 73.5ms | 0.1ms | 737x |
| fast-langgraph | State deserialization | 45.2ms | 0.3ms | 151x |
| fast-langgraph | Full state update cycle | 92.0ms | 2.0ms | 46x |
| fast-crewai | Agent serialization | 340ms | 10ms | 34x |
Platform Compatibility
| Platform | Python Versions | Prebuilt Wheels |
|---|---|---|
| Linux (x86_64, aarch64) | 3.8 - 3.13 | Yes |
| macOS (x86_64, Apple Silicon) | 3.8 - 3.13 | Yes |
| Windows (x86_64) | 3.8 - 3.13 | Yes |
Beyond Accelerators: The Full Agent Infrastructure Stack
The Rust+PyO3 acceleration approach is just the foundation. Neul Labs is building the complete infrastructure layer for production AI agents:
- brat — Multi-agent orchestration harness for coordinating Claude Code, Aider, Codex, and other coding agents in parallel
- m9m — Go-based workflow automation, 5-10x faster than n8n with 70% less memory
- ormai — Policy-first database access layer that gives AI agents safe, auditable ORM operations
- fastworker — Brokerless Python task queue eliminating Redis/RabbitMQ for moderate-scale workloads
All projects are MIT licensed and available at github.com/neul-labs.
Frequently Asked Questions
Do I need to install Rust to use fast-litellm or fast-langgraph?
No. All Neul Labs accelerators ship prebuilt binary wheels for Linux, macOS, and Windows. Install with pip install fast-litellm — no Rust toolchain required.
Will fast-litellm break my existing code?
No. The accelerators use automatic monkeypatching with version detection and fallback. If an incompatibility is detected, the library falls back to the original Python implementation silently. Your tests, configuration, and application code remain unchanged.
How is a 737x speedup possible?
The 737x figure is for checkpoint serialization specifically. Python’s json.dumps with custom encoders allocates Python objects for every value in the state tree. Rust’s serde with SIMD-accelerated JSON processes raw bytes without any Python object allocation, and can parallelize across CPU cores without GIL contention.
What’s the difference between automatic and manual integration?
Automatic integration (one import line) gives 2-3x speedups by patching the most impactful hot paths. Manual integration (replacing specific components like the checkpointer) gives access to the full Rust implementations for maximum speedup (up to 737x).
Is this the same as Cython or Numba?
No. Cython and Numba compile Python code to C or LLVM IR, which accelerates what the Python code already expresses. Neul Labs accelerators are complete Rust reimplementations of specific components, built on lock-free concurrent data structures and SIMD-accelerated parsing that have no direct Python-level equivalent to compile.
Where can I see the benchmark methodology?
Each accelerator’s GitHub repository contains benchmark scripts with full reproduction instructions, including hardware specs, Python/library versions, and dataset descriptions.