## What Got Done
### 1. Project Scaffold
Set up the Rust project with Cargo.

Read: tonic-helloworld-tutorial
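For reference, a minimal dependency sketch for a tonic project like this one (crate versions and feature flags here are illustrative assumptions, not copied from the repo):

```toml
[dependencies]
tonic = "0.12"
prost = "0.13"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
tokio-stream = "0.1"

[build-dependencies]
tonic-build = "0.12"
```

`tonic-build` goes in `[build-dependencies]` because the `.proto` file is compiled to Rust in `build.rs` at build time.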
### 2. Proto Contract — inference.proto

Defined the core service contract under the `nanoinfer` package:
- `Health` RPC — Simple health check returning a status string.
- `Infer` RPC — Server-streaming RPC. The client sends an `InferRequest` (prompt, max_tokens, temperature, top_p) and receives a stream of `InferResponse` tokens.
- `FinishReason` enum — Captures why generation stopped.
- `UsageMetrics` — Only sent with the final token to avoid per-token overhead. Reports prompt tokens, generated tokens, and total.
- `request_id` — Client-provided correlation ID for tracing requests.
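A sketch of what that contract could look like in proto3. The service and package names come from this log; the exact message names for the health RPC, the enum variants beyond `STOP_SEQUENCE`, and all field numbers are assumptions:

```protobuf
syntax = "proto3";

package nanoinfer;

service InferService {
  rpc Health(HealthRequest) returns (HealthResponse);
  // Server-streaming: one request in, a stream of tokens out.
  rpc Infer(InferRequest) returns (stream InferResponse);
}

message HealthRequest {}

message HealthResponse {
  string status = 1;
}

message InferRequest {
  string prompt = 1;
  uint32 max_tokens = 2;
  float temperature = 3;
  float top_p = 4;
  string request_id = 5;  // client-provided correlation ID
}

enum FinishReason {
  FINISH_REASON_UNSPECIFIED = 0;
  STOP_SEQUENCE = 1;
  MAX_TOKENS = 2;
}

message UsageMetrics {
  uint32 prompt_tokens = 1;
  uint32 generated_tokens = 2;
  uint32 total_tokens = 3;
}

message InferResponse {
  string token = 1;
  FinishReason finish_reason = 2;
  UsageMetrics usage = 3;  // only populated on the final token
}
```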
### 3. gRPC Service Implementation (grpc.rs)

Implemented `InferServiceImpl` with a simulated inference loop:
- Spawns a background `tokio` task that sends 5 tokens through a bounded `mpsc` channel (capacity 64).
- Each token is sent with a 100ms delay to simulate a ggml forward pass.
- The final token includes `FinishReason::StopSequence` and aggregated `UsageMetrics`.
- Gracefully handles client disconnects — if `tx.send()` fails, the worker logs and exits.
The server binds to `0.0.0.0:50052` and is ready to accept connections.
### 4. Test Script (test.sh)

Wrote a zsh test script using `grpcurl` for manual smoke testing:

```shell
./test.sh health                    # → {"status": "OK"}
./test.sh infer                     # → streams 5 tokens with usage stats
./test.sh infer "Custom prompt" 20
```

## Challenges
- Streaming response design — Chose `mpsc::channel` over `tokio_stream::iter` because it naturally supports backpressure and allows the inference worker to run independently of the gRPC send loop.
## Architecture So Far

```
┌─────────────┐      gRPC/HTTP2         ┌──────────────────┐
│   Client    │ ──────────────────────► │    nanoinfer     │
│  (grpcurl)  │ ◄── stream of tokens ── │     :50052       │
└─────────────┘                         │                  │
                                        │  InferService    │
                                        │  ├─ Health()     │
                                        │  └─ Infer()      │
                                        │     └─ tokio::   │
                                        │        spawn()   │
                                        └──────────────────┘
```