High-performance local AI inference server for Apple Silicon with batch processing
MLX Batch Server is a production-grade inference server optimized for Apple Silicon, featuring concurrent batch processing, the OpenAI Responses API, and a streaming Harmony parser for GPT-OSS models.
This project is a standalone fork of mlx-batch-server by @madroidmaq, whose excellent work laid the foundation for local MLX inference with OpenAI/Anthropic API compatibility.
VetCoders extended the original project with:
- Batch inference coordinator (10+ concurrent requests)
- Full OpenAI Responses API (`/v1/responses`)
- Streaming Harmony parser for GPT-OSS models
- Production hardening for 24/7 operation
We maintain this as a separate project due to significant architectural divergence, while continuing to contribute improvements back to the upstream project where applicable.
| Feature | Description |
|---|---|
| Batch Processing | Handle 10+ concurrent requests via mlx-lm BatchGenerator |
| Responses API | Full OpenAI /v1/responses with SSE streaming |
| Harmony Parser | Native GPT-OSS model support with channel parsing |
| Dual API | Compatible with OpenAI and Anthropic SDKs |
| Model Management | Dynamic load/unload endpoints |
| Privacy-First | All processing happens locally on your Mac |
├── Batch Coordinator → Concurrent request batching (NEW)
├── /v1/responses → OpenAI Responses API (NEW)
├── Harmony Streaming → GPT-OSS channel parser (NEW)
├── /v1/models/load → Dynamic model loading (NEW)
├── /v1/models/unload → Model unloading (NEW)
└── Production Config → Environment-based settings (NEW)
# Clone
git clone https://github.com/VetCoders/mlx-batch-server.git
cd mlx-batch-server
# Install with uv (recommended)
uv sync
# Or with pip
pip install -e .

# Default (port 10240)
mlx-batch-server
# Custom port
mlx-batch-server --port 8080
# With debug logging
MLX_BATCH_LOG_LEVEL=debug mlx-batch-server

# Chat completion
curl http://localhost:10240/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-0.6B-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Responses API (streaming)
curl http://localhost:10240/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-0.6B-4bit",
"input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}],
"stream": true
}'

| Endpoint | Description | Status |
|---|---|---|
| `POST /v1/responses` | Responses API with SSE streaming | Stable |
| `POST /v1/chat/completions` | Chat with tools, streaming, structured output | Stable |
| `GET /v1/batch/stats` | Batch coordinator statistics | Stable |
| `POST /v1/models/load` | Dynamic model loading | Stable |
| `POST /v1/models/unload` | Model unloading | Stable |
| `POST /v1/audio/speech` | Text-to-Speech | Stable |
| `POST /v1/audio/transcriptions` | Speech-to-Text (Whisper) | Stable |
| `POST /v1/images/generations` | Image Generation | Stable |
| `POST /v1/embeddings` | Text Embeddings | Stable |
| `GET /v1/models` | List available models | Stable |
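Because these routes mirror the OpenAI REST surface, the official `openai` Python SDK can be pointed at the local server by overriding its base URL. A minimal sketch, assuming the default port and the Quick Start model, and assuming the server does not enforce an API key (the key below is a placeholder):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# The api_key is a placeholder; it is assumed the server ignores it,
# but the SDK requires a non-empty string.
client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Qwen3-0.6B-4bit",  # any model the server can load
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Streaming works the same way: pass `stream=True` and iterate over the returned chunks.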
| Endpoint | Description | Status |
|---|---|---|
| `POST /anthropic/v1/messages` | Messages with tools, streaming, thinking | Stable |
| `GET /anthropic/v1/models` | Model listing with pagination | Stable |
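The Anthropic-compatible routes can be exercised with the official `anthropic` SDK by setting its base URL to the `/anthropic` prefix, so that the SDK's `/v1/messages` path resolves to `/anthropic/v1/messages`. A minimal sketch under the same port, model, and no-auth assumptions as above:

```python
from anthropic import Anthropic

# Base URL includes the /anthropic prefix; the SDK appends /v1/messages.
# The api_key is a placeholder and assumed to be ignored by the local server.
client = Anthropic(base_url="http://localhost:10240/anthropic", api_key="not-needed")

message = client.messages.create(
    model="mlx-community/Qwen3-0.6B-4bit",  # any model the server can load
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```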
| Variable | Description | Default |
|---|---|---|
| `MLX_BATCH_LOG_LEVEL` | Logging level (debug, info, warning) | `info` |
| `MLX_BATCH_CORS` | CORS origins (comma-separated) | `*` |
| `MLX_BATCH_ENABLE_BATCH` | Enable batch inference | `true` |
| `MLX_BATCH_BATCH_WINDOW_MS` | Batch collection window (ms) | `50` |
| `MLX_BATCH_MAX_BATCH_SIZE` | Maximum concurrent requests | `10` |
| `MLX_BATCH_DEFAULT_MODEL` | Model to load on startup | - |
Batch processing collects incoming requests within a time window and processes them together, significantly improving throughput on Apple Silicon:
# Tune for your workload
MLX_BATCH_BATCH_WINDOW_MS=100 \
MLX_BATCH_MAX_BATCH_SIZE=16 \
mlx-batch-server

Performance (M3 Ultra, 512GB):
- Single request: ~50 tok/s
- Batched (10 requests): ~35 tok/s per request = 350 tok/s total
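To observe the effect of batching, fire several requests concurrently so they arrive within the same collection window. A sketch using the async OpenAI client, under the same port, model, and no-auth assumptions as above:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mlx-community/Qwen3-0.6B-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Ten concurrent requests; those arriving inside one batch window
    # should be grouped into a single batch by the coordinator.
    prompts = [f"Summarize the number {i} in one sentence." for i in range(10)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```

Throughput per request and in aggregate can then be compared against the single-request baseline via `GET /v1/batch/stats` (or `make batch-stats`).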
# Setup
make setup # Install deps + pre-commit hooks
# Run
make dev # Start with hot-reload
make dev PORT=8080 # Custom port
# Test
make test # All tests
make test-responses # Responses API tests
make test-fast # Skip slow tests
# Quality
make lint # Run linters
make format # Format code
make check # Full CI check
# Model management
make load MODEL=mlx-community/Qwen3-0.6B-4bit
make unload
make ps # List loaded models
make batch-stats # Coordinator stats

| Resource | Description |
|---|---|
| Responses API Guide | Full Responses API reference |
| Batch Processing Guide | Batch inference configuration |
| Harmony Parser | GPT-OSS channel parsing |
| OpenAI API Guide | OpenAI compatibility reference |
| Anthropic API Guide | Anthropic compatibility reference |
| Examples | Practical usage examples |
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.11+
- MLX framework (auto-installed)
git clone https://github.com/VetCoders/mlx-batch-server.git
cd mlx-batch-server
make setup && make test

Pull requests welcome! For major changes, please open an issue first.
Original project: mlx-batch-server by @madroidmaq
Fork maintained by: VetCoders — M&K (c)2026
Built with MLX by Apple • FastAPI • MLX-LM
Not affiliated with OpenAI, Anthropic, or Apple