High-performance local AI inference server for Apple Silicon with batch processing
MLX Batch Server is a production-grade inference server optimized for Apple Silicon, featuring concurrent batch processing, the OpenAI Responses API, and a streaming Harmony parser for GPT-OSS models.
This project is a standalone fork of mlx-batch-server by @madroidmaq, whose excellent work laid the foundation for local MLX inference with OpenAI/Anthropic API compatibility.
VetCoders extended the original project with:
- Batch inference coordinator (10+ concurrent requests)
- Full OpenAI Responses API (`/v1/responses`)
- Streaming Harmony parser for GPT-OSS models
- Production hardening for 24/7 operation
We maintain this as a separate project due to significant architectural divergence, while continuing to contribute improvements back to the upstream project where applicable.
| Feature | Description |
|---|---|
| Batch Processing | Handle 10+ concurrent requests via mlx-lm BatchGenerator |
| Responses API | Full OpenAI /v1/responses with SSE streaming |
| Harmony Parser | Native GPT-OSS model support with channel parsing |
| Dual API | Compatible with OpenAI and Anthropic SDKs |
| Model Management | Dynamic load/unload endpoints |
| Privacy-First | All processing happens locally on your Mac |
├── Batch Coordinator → Concurrent request batching (NEW)
├── /v1/responses → OpenAI Responses API (NEW)
├── Harmony Streaming → GPT-OSS channel parser (NEW)
├── /v1/models/load → Dynamic model loading (NEW)
├── /v1/models/unload → Model unloading (NEW)
└── Production Config → Environment-based settings (NEW)
# Clone
git clone https://github.com/VetCoders/mlx-batch-server.git
cd mlx-batch-server
# Install with uv (recommended)
uv sync
# Or with pip
pip install -e .

# Default (port 10240)
mlx-batch-server
# Custom port
mlx-batch-server --port 8080
# With debug logging
MLX_BATCH_LOG_LEVEL=debug mlx-batch-server

# Chat completion
curl http://localhost:10240/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-0.6B-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Responses API (streaming)
curl http://localhost:10240/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-0.6B-4bit",
"input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}],
"stream": true
}'

| Endpoint | Description | Status |
|---|---|---|
| `POST /v1/responses` | Responses API with SSE streaming | Stable |
| `POST /v1/chat/completions` | Chat with tools, streaming, structured output | Stable |
| `GET /v1/batch/stats` | Batch coordinator statistics | Stable |
| `POST /v1/models/load` | Dynamic model loading | Stable |
| `POST /v1/models/unload` | Model unloading | Stable |
| `POST /v1/audio/speech` | Text-to-Speech | Stable |
| `POST /v1/audio/transcriptions` | Speech-to-Text (Whisper) | Stable |
| `POST /v1/images/generations` | Image Generation | Stable |
| `POST /v1/embeddings` | Text Embeddings | Stable |
| `GET /v1/models` | List available models | Stable |
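Because these routes mirror the OpenAI REST surface, the official `openai` Python SDK can be pointed at the local server by overriding its base URL. A minimal sketch, assuming the default port and the Quick Start model, and assuming the server does not enforce an API key (the key below is a placeholder):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# The api_key is a placeholder; it is assumed the server ignores it,
# but the SDK requires a non-empty string.
client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Qwen3-0.6B-4bit",  # any model the server can load
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Streaming works the same way: pass `stream=True` and iterate over the returned chunks.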
| Endpoint | Description | Status |
|---|---|---|
| `POST /anthropic/v1/messages` | Messages with tools, streaming, thinking | Stable |
| `GET /anthropic/v1/models` | Model listing with pagination | Stable |
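The Anthropic-compatible routes can be exercised with the official `anthropic` SDK by setting its base URL to the `/anthropic` prefix, so that the SDK's `/v1/messages` path resolves to `/anthropic/v1/messages`. A minimal sketch under the same port, model, and no-auth assumptions as above:

```python
from anthropic import Anthropic

# Base URL includes the /anthropic prefix; the SDK appends /v1/messages.
# The api_key is a placeholder and assumed to be ignored by the local server.
client = Anthropic(base_url="http://localhost:10240/anthropic", api_key="not-needed")

message = client.messages.create(
    model="mlx-community/Qwen3-0.6B-4bit",  # any model the server can load
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```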
| Variable | Description | Default |
|---|---|---|
| `MLX_BATCH_LOG_LEVEL` | Logging level (debug, info, warning) | `info` |
| `MLX_BATCH_CORS` | CORS origins (comma-separated) | `*` |
| `MLX_BATCH_ENABLE_BATCH` | Enable batch inference | `true` |
| `MLX_BATCH_BATCH_WINDOW_MS` | Batch collection window (ms) | `50` |
| `MLX_BATCH_MAX_BATCH_SIZE` | Maximum concurrent requests | `10` |
| `MLX_BATCH_DEFAULT_MODEL` | Model to load on startup | - |
Batch processing collects incoming requests within a time window and processes them together, significantly improving throughput on Apple Silicon:
# Tune for your workload
MLX_BATCH_BATCH_WINDOW_MS=100 \
MLX_BATCH_MAX_BATCH_SIZE=16 \
mlx-batch-server

Performance (M3 Ultra, 512GB):
- Single request: ~50 tok/s
- Batched (10 requests): ~35 tok/s per request = 350 tok/s total
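To observe the effect of batching, fire several requests concurrently so they arrive within the same collection window. A sketch using the async OpenAI client, under the same port, model, and no-auth assumptions as above:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mlx-community/Qwen3-0.6B-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Ten concurrent requests; those arriving inside one batch window
    # should be grouped into a single batch by the coordinator.
    prompts = [f"Summarize the number {i} in one sentence." for i in range(10)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```

Throughput per request and in aggregate can then be compared against the single-request baseline via `GET /v1/batch/stats` (or `make batch-stats`).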
# Setup
make setup # Install deps + pre-commit hooks
# Run
make dev # Start with hot-reload
make dev PORT=8080 # Custom port
# Test
make test # All tests
make test-responses # Responses API tests
make test-fast # Skip slow tests
# Quality
make lint # Run linters
make format # Format code
make check # Full CI check
# Model management
make load MODEL=mlx-community/Qwen3-0.6B-4bit
make unload
make ps # List loaded models
make batch-stats # Coordinator stats

| Resource | Description |
|---|---|
| Responses API Guide | Full Responses API reference |
| Batch Processing Guide | Batch inference configuration |
| Harmony Parser | GPT-OSS channel parsing |
| OpenAI API Guide | OpenAI compatibility reference |
| Anthropic API Guide | Anthropic compatibility reference |
| Examples | Practical usage examples |
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.11+
- MLX framework (auto-installed)
git clone https://github.com/VetCoders/mlx-batch-server.git
cd mlx-batch-server
make setup && make test

Pull requests welcome! For major changes, please open an issue first.
Original project: mlx-batch-server by @madroidmaq
Fork maintained by: VetCoders — M&K (c)2026
Built with MLX by Apple • FastAPI • MLX-LM
Not affiliated with OpenAI, Anthropic, or Apple