blog-cover-imageblog-cover-image
[Queues vs Logs: A Series] Part 2 — The Log Mental Model
Distributed Systems
#kafka
#traditional_queues
14 / 5 /2026
19 / 5 /2026

The previous blog ended with a question:

What if message storage and message consumption didn't have to be tightly coupled?

A colleague mentioned Kafka. I'd read about it before but never used it for this kind of problem. That's where it started.


The Core Idea: a Log

Before I get into the two problems, a quick mental model that makes everything else easier to follow.

A log is simple. Producers append to the end. Consumers read sequentially. When a consumer reaches the end, it waits.

The key difference from a queue: reading is non-destructive. The message stays. The consumer just moves its offset forward.

image

This one shift changes everything about how you think about scale, replay, and consumer independence. It took me two production incidents to actually internalize it.


Problem 1: millions of events, losing a few was fine

There was a huge sports tournament happening. Millions of people were joining live streams, watching, and reacting all at the same time. At that scale, losing a few events was not an issue. The system just needed to stay alive.

Previously, for the same feature during non-live tournament time, we were using Amazon SQS. It scaled fine for regular engagement. But when we hosted a smaller tournament and stress-tested it, things started breaking. Events weren't getting processed fast enough. The consumer side was lagging badly, and the producer wasn't scaling either. An hour-long stream was taking hours to finish processing. That's when we knew something had to change.

We switched to Kafka. Different partition keys per video and engagement type kept things clean: same video, same partition, same consumer.

producer.send({
    topic: 'engagement-events',
    messages: [{
        key: videoId,        // same video → same partition → ordered
        value: JSON.stringify(event)
    }]
})

The key insight: not all data needs to be delivered perfectly. We needed the system to stay alive under load, not guarantee every single event.

Kafka's fan-out meant multiple consumer groups, analytics, leaderboard, and counters could each process the same stream independently. If one fell behind, it didn't block the others.

placeholder imageimage

Jay Kreps described this well in his 2013 essay, The Log — a log is the single source of truth, and consumers just track where they are in it. A slow consumer doesn't hurt anyone else. It just has a higher lag number.


Problem 2: When order and correctness were non-negotiable

This time, I was dealing with financial data. Multiple systems were interacting with each other, and at the end of it all, everything needed to land correctly in the bookkeeping store.

The events I was tracking went from when a transaction starts all the way to its settled state: pending, processing, settled, in that order. If a settled arrives before pending, your general ledger syncs in the wrong order. The books are wrong. That's not a bug you can easily roll back.

Leslie Lamport figured this out in 1978: In a distributed system, there is no global clock. If you want ordering, you have to build it in. We just keep rediscovering this.

Because we had many different services communicating and generating these events, I built a unified system. Every other service would send its events to one place, and this unified system would handle the processing and bookkeeping. Kafka was the natural fit; Other systems publish to a partition I control, order is guaranteed within a partition, and because these events are critical, multiple consumer groups could each process the same event for different purposes.

While working on this, I noticed something happening occasionally: events were getting lost from the Kafka buffer before reaching the broker. The default setup is async:

Your producer fires events into a local buffer, and acknowledgment from the broker comes back asynchronously.

With many services now calling my producer, this meant a crash could silently drop whatever was sitting in that buffer.

The fix was switching to sync acknowledgment. The producer sends an event, waits for the broker to confirm, then moves to the next one. Slower, yes, but for financial records, consistency mattered more than speed. I also made sure the consumer was idempotent, so if an event was replayed, it wouldn't cause double processing. And we added logging to Datadog for how many events were remaining in the buffer at any point, plus a retry mechanism for failures.

producer.produce(
    'transactions',
    key=bill_id.encode(),
    value=json.dumps({
        'bill_id': 'bill_123',
        'state': 'pending',
        'timestamp': '2024-01-15T10:30:00Z'
    }).encode()
)

# flush blocks until all buffered events are confirmed by broker
# returns number of events still sitting in buffer
remaining = producer.flush(timeout=10)

if remaining == 0:
    # broker confirmed everything — safe to move offset forward
    consumer.commit(asynchronous=False)
else:
    # events still in buffer — don't commit, log and investigate
    log_to_datadog('buffer.remaining', remaining)
produce, flush, commit - in that order
import signal
import atexit

def flush_and_commit():
    # drain buffer and commit offset on any exit
    remaining = producer.flush(timeout=10)
    if remaining == 0:
        consumer.commit(asynchronous=False)
    else:
        log_to_datadog('buffer.remaining', remaining)

# handles Kubernetes pod kills and Ctrl+C
signal.signal(signal.SIGTERM, lambda s, f: flush_and_commit())
signal.signal(signal.SIGINT, lambda s, f: flush_and_commit())

# fallback for any other process exit
atexit.register(flush_and_commit)
handling graceful shutdown

Martin Kleppmann covers this in Designing Data-Intensive Applications, Chapter 11the broker behaves like a leader database, the consumer like a follower tracking a log sequence number. The same replication logic databases have been used for decades, applied to messaging. It clicked for me only after I'd already built it.

placeholder imageimage

What I took away from both

After solving problem 1, I understood the scale. After solving problem 2, I understood the correctness. They're completely different problems, solved with the same tool, tuned differently.

If you want to maintain order, whether your system is async or sync, Kafka handles it well. You just have to be careful about two things: your producer configuration and making sure your consumer is idempotent. Get those right, and the system is remarkably forgiving.


What this doesn't solve

This model works well when throughput is high, and each message is fast to process. It gets harder when:

  • A single message is slow and holds up its entire partition head-of-line blocking

  • You need per-message routing flexibility that partition keys can't express

  • Message ordering across partitions matters

For those cases, a traditional queue might actually be the right answer.


Closing thoughts

There's no universally correct answer here. We still run Celery in places where event volume is low, and the task-and-retry model fits naturally. Kafka isn't a replacement; it's a different tool for a different class of problem.

The question worth sitting with: do you need the message gone after it's read, or do you need it to still be there?

That answer usually tells you which direction to go.