When we set out to build Distributed Task Engine, the goal was simple on paper: process millions of background jobs reliably without affecting user-facing APIs.
Simple on paper. Complex in production.
## The naive approach and why it breaks
Most teams start with a simple database-backed queue. You insert a job row, a worker polls for it, marks it in-progress, does the work, marks it done. Works fine at 10k jobs/day. Starts to crumble at 500k. At 10M it's a disaster.
The pathologies are predictable:
- Lock contention on the status column under concurrent workers
- Row bloat from historical job records that nobody wants to delete but can't safely archive
- Polling overhead — constant `SELECT FOR UPDATE SKIP LOCKED` queries chew CPU
- Observability blindness — you can't tell in real time how backed up you are
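For reference, the claim query behind that polling bullet typically looks something like this — a PostgreSQL-flavored sketch against a hypothetical `jobs` table, not our actual schema:

```go
package main

import "strings"

// claimJobSQL atomically claims the oldest queued job. SKIP LOCKED lets
// concurrent workers pass over rows another transaction has already
// locked instead of blocking on them, which is exactly the contention
// source once worker counts grow. The jobs table here is hypothetical.
const claimJobSQL = `
UPDATE jobs
SET status = 'in_progress', claimed_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload`

// hasSkipLocked is a trivial sanity check on the query text.
func hasSkipLocked(q string) bool { return strings.Contains(q, "SKIP LOCKED") }
```

Even with `SKIP LOCKED` doing its job, every idle worker still has to issue this query on a loop, which is where the CPU cost comes from.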
## Redis Streams: the right primitive
Redis Streams gave us the primitives we needed without building them from scratch:
```go
// Enqueue a job
err := rdb.XAdd(ctx, &redis.XAddArgs{
    Stream: "jobs:payment-reconciliation",
    Values: map[string]interface{}{
        "payload":    string(payloadJSON),
        "created_at": time.Now().Unix(),
        "retries":    0,
    },
}).Err()
```
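The `payloadJSON` above is just a marshaled job struct. A minimal sketch — the `ReconcileJob` fields are hypothetical, not the actual schema:

```go
package main

import (
    "encoding/json"
    "time"
)

// ReconcileJob is a hypothetical payload type; field names are illustrative.
type ReconcileJob struct {
    AccountID string    `json:"account_id"`
    Window    time.Time `json:"window"`
}

// encodeJob produces the payloadJSON string handed to XAdd.
func encodeJob(j ReconcileJob) (string, error) {
    b, err := json.Marshal(j)
    if err != nil {
        return "", err
    }
    return string(b), nil
}
```

Keeping the payload as a single opaque JSON field, with metadata like `retries` as separate stream values, means workers can inspect retry state without unmarshaling the payload.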
The consumer group model means each worker pulls a unique message and acknowledges it on success; on failure the message stays in the pending entries list (PEL), from which it can be reclaimed for redelivery once it has sat idle too long.
## The dead-letter queue pattern
Any job that fails MAX_RETRIES times moves to a DLQ stream with full context:
```go
func (w *Worker) handleFailure(ctx context.Context, msg redis.XMessage, err error) {
    retries, _ := strconv.Atoi(msg.Values["retries"].(string))
    if retries >= MaxRetries {
        // Move to DLQ with error context
        w.rdb.XAdd(ctx, &redis.XAddArgs{
            Stream: "jobs:dlq",
            Values: map[string]interface{}{
                "original_stream": w.stream,
                "payload":         msg.Values["payload"],
                "error":           err.Error(),
                "failed_at":       time.Now().Unix(),
            },
        })
        w.rdb.XAck(ctx, w.stream, w.group, msg.ID)
        return
    }
    // Exponential backoff requeue
    delay := time.Duration(math.Pow(2, float64(retries))) * time.Second
    w.requeueWithDelay(ctx, msg, delay, retries+1)
}
```
## Horizontal scaling: Kubernetes HPA
The beauty of stateless Go workers is trivial horizontal scaling. With a Kubernetes HPA watching a custom metric (queue depth, exported via Prometheus), we scale workers up within 30 seconds of detecting lag.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_stream_length
        selector:
          matchLabels:
            stream: payment-reconciliation
      target:
        type: AverageValue
        averageValue: "500" # Scale up if avg queue depth > 500/worker
```
## Key takeaways
- Model your job types explicitly — not all jobs are equal. Priority queues matter.
- Observability from day one — you will be blind without Prometheus metrics on queue depth, processing rate, and error rate
- Test failure modes first — write tests that simulate worker crashes, network partitions, and clock skew
- The DLQ is your safety net — instrument it heavily, alert on it aggressively
The full source is on GitHub if you want to dig in.