systems-design · distributed-systems · golang · redis

Designing for 10M+ Tasks/Day: Lessons from a Distributed Job Queue

What I learned building a production-grade distributed task engine — pitfalls, architectural decisions, and hard-won lessons from running at scale.

Shehan Induwara · 8 min read

When we set out to build Distributed Task Engine, the goal was simple on paper: process millions of background jobs reliably, without affecting user-facing APIs.

Simple on paper. Complex in production.

The naive approach and why it breaks

Most teams start with a simple database-backed queue. You insert a job row, a worker polls for it, marks it in-progress, does the work, marks it done. Works fine at 10k jobs/day. Starts to crumble at 500k. At 10M it's a disaster.

The pathologies are predictable:

  • Lock contention on the status column under concurrent workers
  • Row bloat from historical job records that nobody wants to delete but can't safely archive
  • Polling overhead — constant SELECT FOR UPDATE SKIP LOCKED chews CPU
  • Observer blindness — you can't tell in real time how backed up you are
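For concreteness, the polling query at the heart of the database-backed version usually looks something like this (Postgres syntax; table and column names are illustrative):

```sql
-- Claim one pending job. SKIP LOCKED avoids blocking on rows
-- that other workers have already locked, but every worker still
-- runs this query in a tight loop.
UPDATE jobs
SET status = 'in_progress', claimed_at = now()
WHERE id = (
  SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING id, payload;
```

Each of the pathologies above falls out of this shape: the hot status column, the ever-growing jobs table, and a fleet of workers hammering the same index.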

Redis Streams: the right primitive

Redis Streams gave us the primitives we needed without building them from scratch:

// Enqueue a job
err := rdb.XAdd(ctx, &redis.XAddArgs{
  Stream: "jobs:payment-reconciliation",
  Values: map[string]interface{}{
    "payload":    string(payloadJSON),
    "created_at": time.Now().Unix(),
    "retries":    0,
  },
}).Err()

The consumer group model means each worker pulls a unique message, acknowledges it on success, or leaves it in the pending entries list (PEL) for redelivery on timeout.

The dead-letter queue pattern

Any job that fails MAX_RETRIES times moves to a DLQ stream with full context:

func (w *Worker) handleFailure(ctx context.Context, msg redis.XMessage, err error) {
  retries, _ := strconv.Atoi(msg.Values["retries"].(string))

  if retries >= MaxRetries {
    // Move to DLQ with error context. Only ack once the DLQ write
    // succeeds — otherwise the job would vanish.
    dlqErr := w.rdb.XAdd(ctx, &redis.XAddArgs{
      Stream: "jobs:dlq",
      Values: map[string]interface{}{
        "original_stream": w.stream,
        "payload":         msg.Values["payload"],
        "error":           err.Error(),
        "failed_at":       time.Now().Unix(),
      },
    }).Err()
    if dlqErr != nil {
      return // leave the entry in the PEL; we'll retry the move
    }
    w.rdb.XAck(ctx, w.stream, w.group, msg.ID)
    return
  }

  // Exponential backoff requeue: 2^retries seconds
  delay := time.Duration(math.Pow(2, float64(retries))) * time.Second
  w.requeueWithDelay(ctx, msg, delay, retries+1)
}

Horizontal scaling: Kubernetes HPA

The beauty of stateless Go workers is trivial horizontal scaling. With Kubernetes HPA watching a custom metric (queue depth, exported to Prometheus), we scale workers up within about 30 seconds of detecting a backlog.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_stream_length
        selector:
          matchLabels:
            stream: payment-reconciliation
      target:
        type: AverageValue
        averageValue: "500"  # Scale up if avg queue depth > 500/worker

Key takeaways

  1. Model your job types explicitly — not all jobs are equal, and priority queues matter.
  2. Observability from day one — you will be blind without Prometheus metrics on queue depth, processing rate, and error rate.
  3. Test failure modes first — write tests that simulate worker crashes, network partitions, and clock skew.
  4. The DLQ is your safety net — instrument it heavily, alert on it aggressively.
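On that last point: since queue depth is already in Prometheus (the same redis_stream_length metric the HPA consumes), a DLQ alert is cheap insurance. Something in this spirit, with the label value and thresholds adjusted to your own setup:

```yaml
groups:
- name: dlq
  rules:
  - alert: DeadLetterQueueNotEmpty
    expr: redis_stream_length{stream="dlq"} > 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Jobs are landing in the dead-letter queue"
```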

The full source is on GitHub if you want to dig in.