Why We Chose Rust to Accelerate Python AI Infrastructure
Python AI libraries like LiteLLM, LangGraph, and CrewAI hit severe performance bottlenecks at production scale due to the GIL, memory overhead, and single-threaded serialization. Neul Labs rewrites the hot paths in Rust using PyO3, shipping drop-in replacements that achieve 3.2x-737x speedups with a single import and zero code changes.
Quick Reference
| Library | What it accelerates | Key speedup | Install |
|---|---|---|---|
| fast-litellm | LiteLLM connection pooling & rate limiting | 3.2x faster, 42x less memory | pip install fast-litellm |
| fast-langgraph | LangGraph checkpoint serialization | Up to 737x faster | pip install fast-langgraph |
| fast-crewai | CrewAI serialization | 34x faster | pip install fast-crewai |
| fast-axolotl | Axolotl large dataset training | Eliminates OOM errors | pip install fast-axolotl |
All libraries are MIT licensed, require no Rust installation, and ship prebuilt wheels for Linux, macOS, and Windows.
The Problem: Python AI Frameworks Hit a Concurrency Ceiling
The AI agent ecosystem runs on Python. LiteLLM, LangGraph, CrewAI, Axolotl — these are the libraries that power production AI systems worldwide. But they share a common problem: they were never designed for production concurrency.
When you deploy AI agents at scale, three specific bottlenecks emerge:
1. Connection Pooling Under the GIL
LiteLLM’s HTTP connection management uses Python’s threading primitives. Under the Global Interpreter Lock (GIL), these threads cannot execute Python bytecode in parallel. At high concurrency (50+ simultaneous agent requests), connection acquisition latency spikes from microseconds to milliseconds.
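This ceiling is easy to reproduce with nothing but the standard library: two threads running a CPU-bound loop take roughly as long as running the loop twice serially, because only one thread can hold the GIL at a time. (Timings vary by machine; free-threaded CPython builds behave differently. The busy loop is a synthetic stand-in, not LiteLLM's pooling code.)

```python
import threading
import time

def count_down(n):
    # Pure-Python CPU-bound loop; holds the GIL for its entire run.
    while n:
        n -= 1

N = 5_000_000

# Serial: two back-to-back runs.
t0 = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - t0

# Threaded: two threads, but under the GIL only one executes bytecode at a time.
t0 = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```

On a stock CPython build the threaded run shows little or no improvement over serial, which is exactly the contention that hits connection acquisition under load.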
2. Checkpoint Serialization Bottleneck
LangGraph serializes agent state to storage on every graph node completion using json.dumps with custom encoders. For a typical agent state (50 messages, tool results, metadata), each checkpoint takes 50-100ms in pure Python. A 10-step agent execution spends 500ms-1s on serialization alone — often exceeding the LLM API call time.
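To make the cost concrete, here is a rough pure-Python measurement of the pattern described above. The state shape, message sizes, and encoder are synthetic stand-ins, not LangGraph's actual checkpoint schema:

```python
import json
import time

# Synthetic agent state: 50 messages plus tool results and metadata.
# Shapes and sizes are illustrative only.
state = {
    "messages": [
        {
            "role": "assistant" if i % 2 else "user",
            "content": "token " * 200,               # ~1 KB of text per message
            "metadata": {"step": i, "ts": 1700000000.0 + i},
        }
        for i in range(50)
    ],
    "tool_results": [
        {"tool": "search", "output": "result " * 100} for _ in range(10)
    ],
}

class StateEncoder(json.JSONEncoder):
    """Stand-in for a custom encoder; stringifies anything json can't handle."""
    def default(self, o):
        return str(o)

start = time.perf_counter()
payload = json.dumps(state, cls=StateEncoder)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"checkpoint: {len(payload):,} bytes serialized in {elapsed_ms:.2f} ms")
```

Multiply the per-checkpoint cost by the number of graph nodes in a run to estimate how much wall-clock time an agent spends on serialization alone.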
3. Memory Overhead at Scale
Python’s object model adds ~232 bytes of overhead per rate-limiting entry in LiteLLM. At high cardinality (thousands of API keys x models x endpoints), this compounds to gigabytes of memory for what is fundamentally a simple lookup table.
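A quick way to see this per-object overhead, using only the standard library. The entry layout below is a hypothetical stand-in for a rate-limit record, and the exact byte counts are CPython implementation details that vary by version and platform:

```python
import sys

# One rate-limit entry modeled as typical Python objects: a small dict
# of counters, each key and value a separately allocated heap object.
entry = {"requests": 42, "tokens": 18750, "window_start": 1700000000.0}

per_entry = sys.getsizeof(entry) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in entry.items()
)
print(f"approximate bytes for one entry's objects: {per_entry}")

# The equivalent Rust struct is three plain 8-byte fields (u64, u64, f64)
# stored contiguously, with no per-object headers and no boxed keys.
```

Note that `sys.getsizeof` undercounts shared overheads (allocator slack, hash-table slots), so the real footprint per entry is typically higher still.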
Why Not Just Optimize the Python Code?
We tried. There’s a hard ceiling.
Python’s memory model, its GIL, and the per-object overhead of its type system create a performance floor that no amount of algorithmic optimization can break through. Even with Python’s C-accelerated json module, LangGraph’s checkpoint serialization is bound by object allocation and traversal in Python’s managed heap.
The only way to break through is to bypass these constraints entirely.
Why Rust + PyO3 Is the Answer
Rust provides three capabilities that Python structurally cannot:
Zero-cost abstractions. Lock-free data structures like DashMap eliminate GIL contention entirely. Multiple threads can read and write concurrently without acquiring any Python-level lock.
Memory efficiency. A rate-limiting entry in Rust uses ~48 bytes vs ~232 bytes in Python — roughly a 5x per-entry reduction. End to end, our benchmarks measure a 42x drop in memory consumption for high-cardinality rate limiting.
True parallelism. Rust threads run independently of the GIL. Connection pooling, serialization, and state management all execute genuinely in parallel — not time-sliced under Python's GIL.
PyO3 is the bridge. It provides low-overhead Rust-Python bindings, exposing Rust types as native Python classes so hot-path calls cross the boundary without serialization. From Python's perspective, the Rust objects are indistinguishable from pure Python ones.
How Drop-In Replacement Works
Every Neul Labs accelerator follows the same integration pattern:
```python
# Step 1: Install (no Rust toolchain needed)
#   pip install fast-litellm

# Step 2: Import before the library you're accelerating
import fast_litellm
import litellm

# Step 3: There is no step 3.
# Your existing code, tests, and config all continue to work.
```
The import fast_litellm statement triggers automatic monkeypatching that replaces LiteLLM’s hot-path implementations (connection pooling, rate limiting, token counting) with Rust equivalents. The library detects the installed version of LiteLLM and patches only compatible code paths, with automatic fallback to the original Python implementation if any incompatibility is detected.
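The general shape of this patch-with-fallback mechanism can be sketched in a few lines. `Pool`, `accelerated_acquire`, and `patch` are illustrative stand-ins, not fast-litellm's internal API:

```python
class Pool:
    """Stand-in for a library class whose hot path we want to replace."""
    def acquire(self):
        return "python-impl"

def accelerated_acquire(self):
    """Stand-in for the Rust-backed replacement (would call into PyO3)."""
    return "rust-impl"

def patch(cls, name, replacement):
    """Swap cls.<name> for replacement, keeping the original for fallback."""
    original = getattr(cls, name, None)
    if original is None:
        return False  # incompatible version detected: leave the class untouched
    setattr(cls, f"_orig_{name}", original)  # preserved for silent fallback
    setattr(cls, name, replacement)
    return True

patch(Pool, "acquire", accelerated_acquire)
print(Pool().acquire())        # now routed to the accelerated implementation
print(Pool()._orig_acquire())  # original stays reachable for fallback
```

The version check in the real library is necessarily more involved (it must match method signatures across releases), but the keep-the-original-and-swap pattern is the same.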
Benchmark Results
All benchmarks measured on production-representative workloads. Source code and reproduction instructions available in each project’s GitHub repository.
| Library | Operation | Python Baseline | Rust Accelerated | Speedup |
|---|---|---|---|---|
| fast-litellm | Connection pool acquisition | 3.1ms | 0.97ms | 3.2x |
| fast-litellm | Rate limiting (memory, 10K entries) | 2.3MB | 0.055MB | 42x less |
| fast-litellm | Token counting (large doc) | 12.4ms | 7.3ms | 1.7x |
| fast-langgraph | Checkpoint serialization | 73.5ms | 0.1ms | 737x |
| fast-langgraph | State deserialization | 45.2ms | 0.3ms | 151x |
| fast-langgraph | Full state update cycle | 92.0ms | 2.0ms | 46x |
| fast-crewai | Agent serialization | 340ms | 10ms | 34x |
Platform Compatibility
| Platform | Python Versions | Prebuilt Wheels |
|---|---|---|
| Linux (x86_64, aarch64) | 3.8 - 3.13 | Yes |
| macOS (x86_64, Apple Silicon) | 3.8 - 3.13 | Yes |
| Windows (x86_64) | 3.8 - 3.13 | Yes |
Beyond Accelerators: The Full Agent Infrastructure Stack
The Rust+PyO3 acceleration approach is just the foundation. Neul Labs is building the complete infrastructure layer for production AI agents:
- brat — Multi-agent orchestration harness for coordinating Claude Code, Aider, Codex, and other coding agents in parallel
- m9m — Go-based workflow automation, 5-10x faster than n8n with 70% less memory
- ormai — Policy-first database access layer that gives AI agents safe, auditable ORM operations
- fastworker — Brokerless Python task queue eliminating Redis/RabbitMQ for moderate-scale workloads
All projects are MIT licensed and available at github.com/neul-labs.
Frequently Asked Questions
Do I need to install Rust to use fast-litellm or fast-langgraph?
No. All Neul Labs accelerators ship prebuilt binary wheels for Linux, macOS, and Windows. Install with pip install fast-litellm — no Rust toolchain required.
Will fast-litellm break my existing code?
No. The accelerators use automatic monkeypatching with version detection and fallback. If an incompatibility is detected, the library falls back to the original Python implementation silently. Your tests, configuration, and application code remain unchanged.
How is a 737x speedup possible?
The 737x figure is for checkpoint serialization specifically. Python’s json.dumps with custom encoders allocates Python objects for every value in the state tree. Rust’s serde with SIMD-accelerated JSON processes raw bytes without any Python object allocation, and can parallelize across CPU cores without GIL contention.
What’s the difference between automatic and manual integration?
Automatic integration (one import line) gives 2-3x speedups by patching the most impactful hot paths. Manual integration (replacing specific components like the checkpointer) gives access to the full Rust implementations for maximum speedup (up to 737x).
Is this the same as Cython or Numba?
No. Cython and Numba compile Python code to C or LLVM IR, which accelerates what the Python code already expresses. Neul Labs accelerators are complete Rust reimplementations of specific components, built on lock-free concurrent data structures and SIMD-accelerated parsing that have no direct Python-level equivalent to compile.
Where can I see the benchmark methodology?
Each accelerator’s GitHub repository contains benchmark scripts with full reproduction instructions, including hardware specs, Python/library versions, and dataset descriptions.