Before getting into Kafka, I want to start with two real incidents from my work that pushed us away from traditional queues.
But first, a quick acknowledgment. Traditional message brokers like RabbitMQ, SQS, and Celery were designed for a specific model: destructive reads, per-message routing, acknowledgment-based delivery. That model has real strengths. It gives you flexible routing through exchanges, fine-grained control over individual message lifecycles, and mature retry/dead-letter semantics out of the box. For a large class of problems, that's exactly what you want.
The trouble starts when your scale or your requirements outgrow that model.
Incident 1: When a Queue Became a Bottleneck
At one of my previous companies, two microservices were closely coupled. One of them did heavy database work. To decouple them, we put Amazon SQS in between, assuming the queue would absorb any spikes.
It worked until traffic increased slightly.
Database queries slowed down, consumers couldn't keep up, and messages started piling up in the queue. Latency kept increasing, and the system eventually became unstable. The queue didn't solve the problem — it just delayed it.
Incident 2: Background Jobs We Couldn't Trust
In another setup, we were running critical background jobs with Celery. These jobs created important database objects and were triggered asynchronously. We started observing multiple failure patterns:
- Kubernetes pods running Celery workers were being killed
- In-flight tasks were lost permanently
- There was no reliable way to replay or audit what failed
- At the same time, background job volume spiked
- We couldn't afford long delays in object creation
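The core of the problem was destructive reads combined with early acknowledgment: once a worker picks up a task, the broker forgets it, so a killed pod takes the in-flight work with it. Here's a minimal sketch of that failure mode (a toy simulation, not real Celery code; task names are made up):

```python
import queue

# Toy broker: with early acknowledgment, a task is removed from the broker
# as soon as a worker picks it up. If the worker dies mid-task (e.g. a
# Kubernetes pod is killed), there is nothing left to replay.
broker = queue.Queue()
for job in ["create_obj_1", "create_obj_2", "create_obj_3"]:
    broker.put(job)

task = broker.get()    # worker picks up and "acks" the task; broker forgets it
worker_crashed = True  # pod killed before the task completes
if worker_crashed:
    pass               # no record of the failure, nothing to audit or retry

remaining = []
while not broker.empty():
    remaining.append(broker.get())

print(remaining)  # → ['create_obj_2', 'create_obj_3'] — the in-flight task is gone
```

Late acknowledgment (acking only after the task finishes) mitigates this, but then you have to make every task safe to run twice, and you still can't audit or replay what already happened.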
Patterns We Noticed at Scale with Traditional Queues
Looking back at these incidents, a few patterns kept showing up. None of them were surprising on their own, but together they explained why things kept breaking once scale entered the picture.
Ordering constraints limit parallelism. Strict FIFO ordering sounds nice, but it limits how much you can scale consumers. Without partitioning, you can't parallelize at all. With partitioning, you end up dealing with rebalancing and uneven load. When every message must be processed in exact order and is destroyed on read, you're locked into a single-threaded consumption model that doesn't scale cleanly.
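The usual escape hatch is to relax global ordering to per-key ordering: hash each message's key to a partition, and let independent workers consume partitions in parallel. A sketch of the idea (names and rates here are illustrative, not from any specific broker):

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stable hash: the same key always maps to the same partition,
    # so events for one key keep their relative order.
    return crc32(key.encode()) % NUM_PARTITIONS

partitions = defaultdict(list)
events = [("user_a", "login"), ("user_b", "login"),
          ("user_a", "purchase"), ("user_b", "logout")]

for key, event in events:
    partitions[partition_for(key)].append((key, event))

# Each partition can now be consumed by a separate worker in parallel;
# within a partition, per-key order is preserved.
user_a_events = [e for msgs in partitions.values()
                 for k, e in msgs if k == "user_a"]
print(user_a_events)  # → ['login', 'purchase']
```

This buys parallelism, but as noted above it also buys you rebalancing and hot partitions when keys are skewed.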
Backpressure piles up at the broker. Producers keep pushing, consumers pull when they can. As soon as consumers slow down, the queue starts growing. But unlike systems where a slow consumer simply falls behind harmlessly, here the growing queue becomes a shared pressure point — latency increases, retries pile up, memory usage spikes, and in the worst case the broker itself becomes unstable. The queue quietly turns from a buffer into a bottleneck.
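The arithmetic behind this is unforgiving. As soon as the consume rate drops below the produce rate, backlog and latency grow linearly and never recover on their own. A back-of-the-envelope sketch with made-up rates:

```python
# Hypothetical rates: producers are steady, consumers slowed by the database.
produce_rate = 1000  # messages/second arriving
consume_rate = 800   # messages/second actually processed

depth = 0
for second in range(60):
    depth += produce_rate - consume_rate  # backlog grows every second

print(depth)                 # backlog after one minute: 12000 messages
print(depth / consume_rate)  # a newly produced message now waits ~15 s
```

A 20% slowdown turns into a queue that grows by 12,000 messages a minute, and the wait time for fresh messages climbs with it. That's the buffer-to-bottleneck transition we hit in Incident 1.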
Capacity planning becomes guesswork. Traditional queues need you to predefine memory or disk limits. At higher traffic, especially with sudden spikes, predicting how much capacity you'll need becomes unreliable. You either over-provision or get surprised in production.
Operational flexibility is limited. Upgrading brokers or changing configurations often means downtime or risky operations. Doing this while the system is under load is hard, and sometimes not even an option.
No replay, limited fan-out. Once a message is consumed and acknowledged, it's gone. That makes replaying events, debugging past failures, or having multiple independent consumers read the same stream of messages much harder than it should be. You end up building workarounds — logging messages separately, maintaining shadow queues — for something that arguably should be a basic capability.
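The contrast with a log-based model is easy to see in miniature. If the broker keeps an append-only log and each consumer merely tracks its own offset into it, replay and fan-out fall out for free. A toy sketch (consumer names are invented for illustration):

```python
# Append-only log: nothing is deleted on read.
log = []
for e in ["created", "updated", "deleted"]:
    log.append(e)

# Two independent consumers, each with its own read position.
offsets = {"billing": 0, "audit": 0}

def poll(consumer):
    # Return everything the consumer hasn't seen yet, then advance its offset.
    events = log[offsets[consumer]:]
    offsets[consumer] = len(log)
    return events

print(poll("billing"))  # → ['created', 'updated', 'deleted']
print(poll("audit"))    # → ['created', 'updated', 'deleted'] — same stream, no contention

offsets["audit"] = 0    # replay is just rewinding an offset
print(poll("audit"))    # → ['created', 'updated', 'deleted'] again
```

With a destructive-read queue, the second consumer and the replay both require workarounds; here they're a list slice and an integer reset.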
What Still Works
In short, traditional queues tightly couple storage, consumption, and acknowledgments. That works well up to a point, and then it starts to hurt.
That said, they're still useful. We continue to use Celery in places where:
- Event volume is low
- Latency tolerance is higher
- The problem naturally fits a task-and-retry model
The issue wasn't the tools. It was the kind of problem we were trying to solve with them.
What Comes Next
These experiences forced us to rethink a core assumption:
What if message storage and message consumption didn't have to be tightly coupled?
That question leads naturally to log-based systems — and that's where Part 2 picks up.