rust · python · performance · pyo3 · ai-agents

Why We Chose Rust to Accelerate Python AI Infrastructure

Dipankar Sarkar
TL;DR

Python AI libraries like LiteLLM, LangGraph, and CrewAI hit severe performance bottlenecks at production scale due to the GIL, memory overhead, and single-threaded serialization. Neul Labs rewrites the hot paths in Rust using PyO3, shipping drop-in replacements that achieve 3.2x-737x speedups with a single import and zero code changes.

Quick Reference

Library          What it accelerates                          Key speedup                     Install
fast-litellm     LiteLLM connection pooling & rate limiting   3.2x faster, 42x less memory    pip install fast-litellm
fast-langgraph   LangGraph checkpoint serialization           Up to 737x faster               pip install fast-langgraph
fast-crewai      CrewAI serialization                         34x faster                      pip install fast-crewai
fast-axolotl     Axolotl large dataset training               Eliminates OOM errors           pip install fast-axolotl

All libraries are MIT licensed, require no Rust installation, and ship prebuilt wheels for Linux, macOS, and Windows.


The Problem: Python AI Frameworks Hit a Concurrency Ceiling

The AI agent ecosystem runs on Python. LiteLLM, LangGraph, CrewAI, Axolotl — these are the libraries that power production AI systems worldwide. But they share a common problem: they were never designed for production concurrency.

When you deploy AI agents at scale, three specific bottlenecks emerge:

1. Connection Pooling Under the GIL

LiteLLM’s HTTP connection management uses Python’s threading primitives. Under the Global Interpreter Lock (GIL), these threads cannot execute Python bytecode in parallel. At high concurrency (50+ simultaneous agent requests), connection acquisition latency spikes from microseconds to milliseconds.
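The contention pattern can be sketched in pure Python. Under the GIL plus a pool lock, only one thread makes progress at a time, so total time grows roughly linearly with thread count even though the work per thread is constant. The pool structure below is illustrative, not LiteLLM's actual implementation:

```python
# Minimal sketch of lock-protected pool acquisition under the GIL.
# Names (pool, acquire_release) are illustrative, not LiteLLM's API.
import threading
import time

pool_lock = threading.Lock()
pool = list(range(10))  # stand-in for pooled connections

def acquire_release(iterations):
    for _ in range(iterations):
        with pool_lock:          # GIL + lock: one thread proceeds at a time
            conn = pool.pop()
            pool.append(conn)

def run(n_threads, iterations=10_000):
    threads = [threading.Thread(target=acquire_release, args=(iterations,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Adding threads adds contention, not parallelism: run(8) takes roughly
# as long as eight sequential run(1) calls.
```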

2. Checkpoint Serialization Bottleneck

LangGraph serializes agent state to storage on every graph node completion using json.dumps with custom encoders. For a typical agent state (50 messages, tool results, metadata), each checkpoint takes 50-100ms in pure Python. A 10-step agent execution spends 500ms-1s on serialization alone — often exceeding the LLM API call time.
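A minimal sketch of this serialization pattern, using a hypothetical state shape and encoder rather than LangGraph's actual schema:

```python
# Illustrative checkpoint: json.dumps with a custom encoder over an
# agent-state dict. The Message class and state layout are hypothetical.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Message:
    role: str
    content: str

class StateEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Message):
            return asdict(obj)  # allocates one Python dict per message
        return super().default(obj)

state = {
    "messages": [Message("user", "x" * 500) for _ in range(50)],
    "tool_results": [{"name": "search", "output": "y" * 1000}] * 10,
    "metadata": {"step": 7},
}

start = time.perf_counter()
checkpoint = json.dumps(state, cls=StateEncoder)
elapsed = time.perf_counter() - start
# Every value in the state tree is converted, walked, and encoded in
# Python's managed heap; this traversal dominates the checkpoint cost.
```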

3. Memory Overhead at Scale

Python’s object model adds ~232 bytes of overhead per rate-limiting entry in LiteLLM. At high cardinality (thousands of API keys × models × endpoints), this compounds to gigabytes of memory for what is fundamentally a simple lookup table.
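The per-entry overhead can be approximated with sys.getsizeof. The key format and entry fields below are hypothetical, and exact figures vary by Python version and platform:

```python
# Rough illustration of Python's per-entry overhead in a rate-limit
# table: each object carries its own header and refcount.
import sys

key = "api-key-123:gpt-4:us-east-1"  # hypothetical cardinality key
entry = {"count": 42, "window_start": 1700000000.0}

overhead = (
    sys.getsizeof(key)
    + sys.getsizeof(entry)
    + sys.getsizeof(entry["count"])
    + sys.getsizeof(entry["window_start"])
)
# A Rust struct holding the same data needs only the raw fields plus a
# hashmap slot: no per-object headers, no refcounts.
```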

Why Not Just Optimize the Python Code?

We tried. There’s a hard ceiling.

Python’s memory model, its GIL, and the per-object overhead of its type system create a performance floor that no amount of algorithmic optimization can break through. Even with Python’s C-accelerated json module, LangGraph’s checkpoint serialization is bound by object allocation and traversal in Python’s managed heap.

The only way to break through is to bypass these constraints entirely.

Why Rust + PyO3 Is the Answer

Rust provides three capabilities that Python structurally cannot:

Zero-cost abstractions. Sharded concurrent maps like DashMap eliminate GIL contention entirely. Multiple threads can read and write concurrently without acquiring any Python-level lock.

Memory efficiency. A rate-limiting entry in Rust uses ~48 bytes vs ~232 bytes in Python, and in the 10K-entry benchmark the rate-limit table shrinks from 2.3 MB to 0.055 MB: a 42x reduction in memory for high-cardinality rate limiting.

True parallelism. Rust threads run independently of the GIL. Connection pooling, serialization, and state management all execute in genuine parallel — not interleaved via Python’s cooperative threading.

PyO3 is the bridge. It provides zero-overhead Rust-Python bindings, exposing Rust types as native Python objects with no serialization cost. From Python’s perspective, the Rust objects are indistinguishable from pure Python ones.

How Drop-In Replacement Works

Every Neul Labs accelerator follows the same integration pattern:

# Step 1: Install (no Rust toolchain needed)
# pip install fast-litellm

# Step 2: Import before the library you're accelerating
import fast_litellm
import litellm

# Step 3: There is no step 3
# Your existing code, tests, and config all continue to work

The import fast_litellm statement triggers automatic monkeypatching that replaces LiteLLM’s hot-path implementations (connection pooling, rate limiting, token counting) with Rust equivalents. The library detects the installed version of LiteLLM and patches only compatible code paths, with automatic fallback to the original Python implementation if any incompatibility is detected.
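The patch-with-fallback pattern described above can be sketched as follows. The names (SUPPORTED, is_compatible, patch) are illustrative, not fast_litellm's actual internals:

```python
# Hypothetical sketch of version-checked monkeypatching with silent
# fallback to the original Python implementation.
import importlib.metadata

SUPPORTED = ("1.",)  # hypothetical compatible version prefixes

def is_compatible(package="litellm"):
    """Return True if the installed package version is one we know how to patch."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return False
    return version.startswith(SUPPORTED)

def patch(module, attr, rust_impl):
    """Replace module.attr with rust_impl, keeping the original as a fallback."""
    original = getattr(module, attr)

    def wrapper(*args, **kwargs):
        try:
            return rust_impl(*args, **kwargs)
        except Exception:
            # Any incompatibility: silently fall back to pure Python.
            return original(*args, **kwargs)

    setattr(module, attr, wrapper)
    return original
```

Because the wrapper preserves the original callable, a failed Rust path degrades to baseline performance rather than to an error.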

Benchmark Results

All benchmarks measured on production-representative workloads. Source code and reproduction instructions available in each project’s GitHub repository.

Library          Operation                              Python Baseline   Rust Accelerated   Speedup
fast-litellm     Connection pool acquisition            3.1ms             0.97ms             3.2x
fast-litellm     Rate limiting (memory, 10K entries)    2.3MB             0.055MB            42x less
fast-litellm     Token counting (large doc)             12.4ms            7.3ms              1.7x
fast-langgraph   Checkpoint serialization               73.5ms            0.1ms              737x
fast-langgraph   State deserialization                  45.2ms            0.3ms              151x
fast-langgraph   Full state update cycle                92.0ms            2.0ms              46x
fast-crewai      Agent serialization                    340ms             10ms               34x

Platform Compatibility

Platform                          Python Versions   Prebuilt Wheels
Linux (x86_64, aarch64)           3.8 - 3.13        Yes
macOS (x86_64, Apple Silicon)     3.8 - 3.13        Yes
Windows (x86_64)                  3.8 - 3.13        Yes

Beyond Accelerators: The Full Agent Infrastructure Stack

The Rust+PyO3 acceleration approach is just the foundation. Neul Labs is building the complete infrastructure layer for production AI agents:

  • brat — Multi-agent orchestration harness for coordinating Claude Code, Aider, Codex, and other coding agents in parallel
  • m9m — Go-based workflow automation, 5-10x faster than n8n with 70% less memory
  • ormai — Policy-first database access layer that gives AI agents safe, auditable ORM operations
  • fastworker — Brokerless Python task queue eliminating Redis/RabbitMQ for moderate-scale workloads

All projects are MIT licensed and available at github.com/neul-labs.


Frequently Asked Questions

Do I need to install Rust to use fast-litellm or fast-langgraph?

No. All Neul Labs accelerators ship prebuilt binary wheels for Linux, macOS, and Windows. Install with pip install fast-litellm — no Rust toolchain required.

Will fast-litellm break my existing code?

No. The accelerators use automatic monkeypatching with version detection and fallback. If an incompatibility is detected, the library falls back to the original Python implementation silently. Your tests, configuration, and application code remain unchanged.

How is a 737x speedup possible?

The 737x figure is for checkpoint serialization specifically. Python’s json.dumps with custom encoders allocates Python objects for every value in the state tree. Rust’s serde with SIMD-accelerated JSON processes raw bytes without any Python object allocation, and can parallelize across CPU cores without GIL contention.
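The same principle can be illustrated with orjson, a Rust/serde-backed JSON library for Python: encoding runs entirely in native code over the object tree rather than through Python's encoder call machinery. This is an analogy for the technique, not fast-langgraph's actual implementation:

```python
# Compare stdlib json (pure-Python encoder dispatch) with a Rust-backed
# encoder on the same state. orjson is an optional dependency here.
import json
import timeit

state = {"messages": [{"role": "user", "content": "x" * 500}] * 50,
         "metadata": {"step": 7}}

py_time = timeit.timeit(lambda: json.dumps(state), number=200)

try:
    import orjson  # Rust serde-based JSON library
    rust_time = timeit.timeit(lambda: orjson.dumps(state), number=200)
    print(f"json: {py_time:.4f}s  orjson: {rust_time:.4f}s")
except ImportError:
    print(f"json: {py_time:.4f}s  (orjson not installed)")
```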

What’s the difference between automatic and manual integration?

Automatic integration (one import line) gives 2-3x speedups by patching the most impactful hot paths. Manual integration (replacing specific components like the checkpointer) gives access to the full Rust implementations for maximum speedup (up to 737x).

Is this the same as Cython or Numba?

No. Cython and Numba optimize Python code by compiling it to C or LLVM IR. Neul Labs accelerators are complete Rust reimplementations of specific components, using lock-free data structures and SIMD instructions that aren’t available through Python compilation approaches.

Where can I see the benchmark methodology?

Each accelerator’s GitHub repository contains benchmark scripts with full reproduction instructions, including hardware specs, Python/library versions, and dataset descriptions.