I tried to speed up a Python for-loop by splitting it across 4 processes, and along the way I accidentally discovered four distributed system concepts that companies like Google and Apache have spent decades solving: the computation-to-communication ratio, the straggler problem, serialization overhead, and data locality.
Here is how processing satellite data at two different scales made each concept visible.
The Datasets
Australia's vegetation raster: 40,100 × 38,370 pixels, 3.09 GB of satellite data. Each pixel holds a vegetation density value (1–687) or nodata (−32768) for ocean. That's 1.54 billion pixels.
I wanted to classify every pixel into a fire risk category — simple threshold logic, but on 1.54 billion pixels, even simple logic becomes an interesting performance problem.
The obvious answer: split the data across multiple workers and compute in parallel. I divided the raster into 4 horizontal strips.
The image below shows an analysis of the .tiff file I downloaded from https://www.agriculture.gov.au/abares/forestsaustralia/forest-data-maps-and-tools/spatial-data/forest-cover
Why Pure Python? Why not numpy? Why not threads?
Why not NumPy?: NumPy is fast for array-level operations (mean, sum, filtering), but the kind of data and operations we are dealing with often involve per-pixel logic that can't be vectorized — branching, stateful classification, neighbor lookups. The for-loop represents that class of problem. Note: NumPy also parallelizes some operations internally via BLAS/LAPACK, so adding multiprocessing on top of it only adds overhead, with no gain.
Why not threads?: Python's GIL. Our per-pixel loop is a CPU-bound task, so we need true parallelism. Threads would have given zero speedup, while multiprocessing spawns separate OS processes, each with its own Python interpreter and its own GIL.
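To be concrete: the simple threshold case in this post could be vectorized, and it is the branching/stateful variants it stands in for that force the loop. A minimal sketch of the vectorized alternative, using a hypothetical toy strip and the same thresholds used later in the post:

```python
import numpy as np

# Hypothetical 2x3 strip: nodata (-32768), low, medium, high, ocean (0), medium
strip = np.array([[-32768, 50, 200],
                  [450, 0, 160]], dtype=np.float32)

# Vectorized equivalent of the threshold-only classification:
# the first matching condition wins; default is low risk (1)
risk = np.select(
    [strip <= 0, strip > 400, strip > 150],  # nodata/ocean, high, medium
    [0, 3, 2],
    default=1,
)
print(risk.tolist())  # [[0, 1, 2], [3, 0, 2]]
```

As soon as the per-pixel rule depends on neighboring pixels or accumulated state, np.select stops being enough, and you are back to the loop.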
The Plan:
In any multiprocessing setup, how you distribute data across processes matters as much as the parallelism itself. You'll see why shortly.
I explored three different data distribution strategies:
Send pixel data via Queue - Main process reads everything, pickles and sends to workers
Workers read from Disk - Each worker reads its own strip directly from disk
Send file paths and coordinates - Minimal communication, workers handle everything
Each approach answers the fundamental question in distributed systems: How does each worker get its data?
For this blog, I'm doing a deep dive into Approach 1 because it reveals the most interesting distributed systems concepts. I tested it at two scales:
Small scale: 1,000 rows (160 million pixels) - to understand baseline behavior
Full scale: 38,370 rows (1.54 billion pixels) - to see what breaks at scale
Each test compares:
Single process baseline (for comparison)
Multiprocessing implementation (the actual approach)
Note: I'm assuming you have basic familiarity with multiprocessing concepts. If not, give this a read first.
Approaches 2 and 3 will come in follow-up posts - this blog was getting too heavy, and Approach 1 alone taught me enough to fill an article.
Approach 1: Send Pixel Data via Queue (Serialization)
The strategy: Main process reads all the data, then sends it to workers via process arguments. Sounds simple, but under the hood Python has to pickle (serialize) the entire NumPy array to transfer it across process boundaries.
Let's see how this plays out at two different scales.
Small scale 1000 rows (160 million pixels) — Single Process:
First, the baseline - how fast is a single process?
import os
import time
import multiprocessing as mp
import numpy as np

# CONFIG
STRIP_DIR = 'downloaded_file_path'
FULL_BIN_PATH = os.path.join(STRIP_DIR, 'aus_full.bin')
WIDTH = 40100
FULL_STRIP_ROWS = [9592, 9592, 9592, 9594]
MAX_ROWS = 1000
NUM_WORKERS = 4

# Per-pixel computation (the bottleneck — pure Python, no numpy)
def classify_fire_risk(value):
    if value <= 0 or value != value:  # nodata/ocean, or NaN (NaN != NaN)
        return 0
    if value > 400:
        return 3  # high risk
    elif value > 150:
        return 2  # medium risk
    else:
        return 1  # low risk

def process_strip(strip):
    """Run classify_fire_risk on every pixel. Returns risk counts."""
    risk_counts = {0: 0, 1: 0, 2: 0, 3: 0}
    rows, cols = strip.shape
    for row in range(rows):
        for col in range(cols):
            risk = classify_fire_risk(strip[row, col])
            risk_counts[risk] += 1
    return risk_counts

def read_strip_1000(bin_path, full_rows):
    """Read a full strip .bin file but return only the first 1000 rows."""
    strip = np.fromfile(bin_path, dtype=np.float32).reshape(full_rows, WIDTH)
    return strip[:MAX_ROWS, :].copy()

# SINGLE PROCESS BASELINE
def run_single_process():
    print("\n" + "=" * 60)
    print("SINGLE PROCESS (baseline)")
    print("=" * 60)
    all_counts = {0: 0, 1: 0, 2: 0, 3: 0}
    t_total = time.time()
    for i in range(NUM_WORKERS):
        bin_path = os.path.join(STRIP_DIR, f'strip_{i}.bin')
        strip = read_strip_1000(bin_path, FULL_STRIP_ROWS[i])
        strip[strip < 0] = float('nan')
        t_strip = time.time()
        risk_counts = process_strip(strip)
        elapsed = time.time() - t_strip
        print(f" Strip {i}: {elapsed:.1f}s {risk_counts}")
        for k in risk_counts:
            all_counts[k] += risk_counts[k]
    total = time.time() - t_total
    print(f" TOTAL: {total:.1f}s")
    print(f" Results: {all_counts}")
    return total
Nothing fancy - just reading strips one by one and processing them sequentially.
Small scale 1000 rows — Multi Process:
Now let's parallelize it. The main process reads all strips, then spawns 4 workers and passes each one its data.
import os
import time
import multiprocessing as mp
import numpy as np

# CONFIG
STRIP_DIR = 'downloaded_file_path'
FULL_BIN_PATH = os.path.join(STRIP_DIR, 'aus_full.bin')
WIDTH = 40100
FULL_STRIP_ROWS = [9592, 9592, 9592, 9594]
MAX_ROWS = 1000
NUM_WORKERS = 4

def classify_fire_risk(value):
    if value <= 0 or value != value:  # nodata/ocean, or NaN (NaN != NaN)
        return 0
    if value > 400:
        return 3  # high risk
    elif value > 150:
        return 2  # medium risk
    else:
        return 1  # low risk

def process_strip(strip):
    risk_counts = {0: 0, 1: 0, 2: 0, 3: 0}
    rows, cols = strip.shape
    for row in range(rows):
        for col in range(cols):
            risk = classify_fire_risk(strip[row, col])
            risk_counts[risk] += 1
    return risk_counts

def worker_approach1(result_queue, strip_data, worker_id):
    t0 = time.time()
    strip = strip_data
    strip[strip < 0] = float('nan')
    risk_counts = process_strip(strip)
    result_queue.put({
        'worker_id': worker_id,
        'risk_counts': risk_counts,
        'time': time.time() - t0,
    })

def read_strip_1000(bin_path, full_rows):
    """Read a full strip .bin file but return only the first 1000 rows."""
    strip = np.fromfile(bin_path, dtype=np.float32).reshape(full_rows, WIDTH)
    return strip[:MAX_ROWS, :].copy()

def run_approach1():
    print("\n" + "=" * 60)
    print("APPROACH 1: Send pixel data via Queue (serialization overhead)")
    print("=" * 60)
    result_queue = mp.Queue()
    # Main reads all strip data FIRST (only 1000 rows each)
    t_read = time.time()
    strips = []
    for i in range(NUM_WORKERS):
        bin_path = os.path.join(STRIP_DIR, f'strip_{i}.bin')
        strip = read_strip_1000(bin_path, FULL_STRIP_ROWS[i])
        strips.append(strip)
    print(f" Main read time: {time.time() - t_read:.1f}s")
    print(f" Data per strip: {strips[0].nbytes / 1e6:.1f} MB")
    # Spawn workers — strip data gets pickled into each child process
    t_total = time.time()
    processes = []
    for i, strip in enumerate(strips):
        p = mp.Process(target=worker_approach1, args=(result_queue, strip, i))
        processes.append(p)
    for p in processes:
        p.start()
    results = []
    for _ in processes:
        results.append(result_queue.get())
    for p in processes:
        p.join()
    total = time.time() - t_total
    for r in sorted(results, key=lambda x: x['worker_id']):
        print(f" Worker {r['worker_id']}: {r['time']:.1f}s {r['risk_counts']}")
    print(f" TOTAL (incl. serialization): {total:.1f}s")
    return total
Results after executing:
if __name__ == "__main__":
    mp.set_start_method('spawn')
    print("Australia Vegetation Raster — 1000-Row Benchmark")
    print(f"Width: {WIDTH}, Rows per strip: {MAX_ROWS}")
    print(f"Total pixels: {MAX_ROWS * WIDTH * NUM_WORKERS:,}")
    t_single = run_single_process()
    t1 = run_approach1()
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print(f" Single process: {t_single:.1f}s")
    print(f" Approach 1 (Queue data): {t1:.1f}s ({t_single/t1:.2f}x speedup)" if t1 else "")

3.4x speedup! Single process took 25.5s, multiprocessing took 7.5s. Not bad for 160 million pixels.
What Happened: Breaking Down the Numbers
Small Scale: 1000 rows (160 million pixels)
The Results:
Single process: 25.5s
Approach 1 (multiprocessing): 7.5s
Speedup: 3.4x
Time breakdown: Let's trace where every second went:
Main read time: 1.6s - Reading 4 strips from disk (160.4 MB each = 641.6 MB total)
Worker computation times:
Worker 0: 5.5s
Worker 1: 6.4s
Worker 2: 6.9s
Worker 3: 5.7s
Total multiprocessing time: 7.5s
The math doesn't add up, does it?
If the slowest worker took 6.9s, and we spent 1.6s reading, shouldn't the total be ~8.5s? But it's only 7.5s. Here's what's happening:
Overlap is your friend: While the main process was reading data (1.6s), it was also spawning processes and pickling data. The workers started computing before all reading was done. The 7.5s total includes:
Reading: 1.6s (overlapped with spawn/pickle)
Pickling 641.6 MB: ~1-1.5s (hidden in the 7.5s total)
Actual parallel computation: ~6.9s (slowest worker)
Why this worked well:
The computation-to-communication ratio was heavily in our favor:
Pure computation time: 25.5s (if done serially)
Communication overhead: ~3s (read + pickle)
Ratio: 8.5:1 - We're doing 8.5 seconds of useful work for every second of overhead
This is the Computation-to-Communication Ratio - a fundamental distributed systems concept. When this ratio is high (>5:1), parallelism pays off. When it's low (<2:1), you might be faster staying single-threaded.
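The rule of thumb can be written down directly. A small sketch using the numbers from the runs above (the helper name is mine, not an established API):

```python
def worth_parallelizing(compute_s, overhead_s):
    """Computation-to-communication rule of thumb:
    >5:1 is favorable, <2:1 means stay single-threaded."""
    ratio = compute_s / overhead_s
    if ratio > 5:
        return round(ratio, 1), "parallelize"
    if ratio < 2:
        return round(ratio, 1), "stay single-threaded"
    return round(ratio, 1), "borderline"

# Small-scale run: 25.5s of useful work vs ~3s of read + pickle overhead
print(worth_parallelizing(25.5, 3.0))  # (8.5, 'parallelize')
```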
Full scale (1.54 billion pixels) — Single Process:
Now let's see what happens when we process the entire dataset - 10x more data.
# Full-scale config: process every row of each strip (no 1000-row cap)
STRIP_ROWS = [9592, 9592, 9592, 9594]

def run_single_process():
    print("\n" + "=" * 60)
    print("SINGLE PROCESS (baseline)")
    print("=" * 60)
    all_counts = {0: 0, 1: 0, 2: 0, 3: 0}
    t_total = time.time()
    for i in range(NUM_WORKERS):
        bin_path = os.path.join(STRIP_DIR, f'strip_{i}.bin')
        strip = np.fromfile(bin_path, dtype=np.float32).reshape(STRIP_ROWS[i], WIDTH)
        strip[strip < 0] = float('nan')
        t_strip = time.time()
        risk_counts = process_strip(strip)
        elapsed = time.time() - t_strip
        print(f" Strip {i}: {elapsed:.1f}s {risk_counts}")
        for k in risk_counts:
            all_counts[k] += risk_counts[k]
    total = time.time() - t_total
    print(f" TOTAL: {total:.1f}s")
    print(f" Results: {all_counts}")
    return total
Same logic, just processing the full STRIP_ROWS instead of capping at 1000.
Full scale — Multiprocess:
def worker_approach1(result_queue, strip_data, worker_id):
    t0 = time.time()
    strip = strip_data
    strip[strip < 0] = float('nan')
    risk_counts = process_strip(strip)
    result_queue.put({
        'worker_id': worker_id,
        'risk_counts': risk_counts,
        'time': time.time() - t0,
    })

def run_approach1():
    print("\n" + "=" * 60)
    print("APPROACH 1: Send pixel data via Queue (serialization overhead)")
    print("=" * 60)
    result_queue = mp.Queue()
    # Main reads all strip data FIRST
    t_read = time.time()
    strips = []
    for i in range(NUM_WORKERS):
        bin_path = os.path.join(STRIP_DIR, f'strip_{i}.bin')
        strip = np.fromfile(bin_path, dtype=np.float32).reshape(STRIP_ROWS[i], WIDTH)
        strips.append(strip)
    print(f" Main read time: {time.time() - t_read:.1f}s")
    # Spawn workers — strip data gets pickled into each child process
    t_total = time.time()
    processes = []
    for i, strip in enumerate(strips):
        p = mp.Process(target=worker_approach1, args=(result_queue, strip, i))
        processes.append(p)
    for p in processes:
        p.start()
    results = []
    for _ in processes:
        results.append(result_queue.get())
    for p in processes:
        p.join()
    total = time.time() - t_total
    for r in sorted(results, key=lambda x: x['worker_id']):
        print(f" Worker {r['worker_id']}: {r['time']:.1f}s {r['risk_counts']}")
    print(f" TOTAL (incl. serialization): {total:.1f}s")
    return total
Results after executing:
import numpy as np
import time
import multiprocessing as mp
import os

if __name__ == "__main__":
    mp.set_start_method('spawn')
    print("Australia Vegetation Raster — Full Raster Benchmark")
    print(f"Width: {WIDTH}, Strips: {STRIP_ROWS}, Total rows: {sum(STRIP_ROWS)}")
    print(f"Total pixels: {sum(STRIP_ROWS) * WIDTH:,}")
    t_single = run_single_process()
    t1 = run_approach1()
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print(f" Single process: {t_single:.1f}s")
    print(f" Approach 1 (Queue data): {t1:.1f}s ({t_single/t1:.2f}x speedup)" if t1 else "")

The Results:
Single process: 227.1s
Approach 1 (multiprocessing): 71.2s
Speedup: 3.19x
Wait. We have 10x more data, but roughly the same speedup? Something's breaking down.
Time breakdown:
Main read time: 1.4s - Still fast! (np.fromfile does a plain sequential binary read, which stays quick even at this size, helped by the OS page cache)
Worker computation times:
Worker 0: 58.3s
Worker 1: 67.9s
Worker 2: 64.9s
Worker 3: 53.0s
Total multiprocessing time: 71.2s
Problem #1: The Straggler
Worker 3 finished in 53.0s, but we had to wait until 67.9s for Worker 1. That's a 14.9-second gap where 3 CPUs sat idle while Worker 1 finished.
But this wasn't random OS scheduling or thermal throttling - look at the data:
From the visualization:
Strip 0: Mostly ocean (72% ocean, sparse low vegetation) - Fewer actual pixels to compute
Strip 1: Arid inland (84% land, low-mid vegetation) - Most land pixels = most computation
Strip 2: Dense vegetation (55% dense, median 234)
Strip 3: Mostly ocean + dense forest (89% ocean, but the 11% land is 99.5% dense)
The smoking gun in the numbers:
Look at the risk counts each worker processed:
Worker 0: {0: 276294417, 1: 107717378, 2: 627405, 3: 0}
Worker 1: {0: 61927634, 1: 308118204, 2: 14593362, 3: 0} ← Processed WAY more non-zero pixels
Worker 2: {0: 88161207, 1: 133008630, 2: 161216629, 3: 22527734}
Worker 3: {0: 343338964, 1: 213748, 2: 13342362, 3: 27824326}

Let's count actual vegetation pixels (non-zero risk):
Worker 0: 108,344,783 vegetation pixels (28% of strip)
Worker 1: 322,711,566 vegetation pixels (84% of strip) ← 3x more work than Worker 0!
Worker 2: 316,752,993 vegetation pixels (78% of strip)
Worker 3: 41,380,436 vegetation pixels (11% of strip)
This is the real Straggler Problem:
We divided the data geographically equally (each strip ~9,592 rows), but we didn't divide the computational work equally. Worker 1 had to process 3x more actual pixels than Worker 0.
This is exactly what happens in real distributed systems:
MapReduce: Some reducers get keys with way more values (data skew)
Spark: Some partitions have more records after filtering
Database sharding: Some shards get hot keys with more traffic
The lesson: Naive horizontal splitting doesn't guarantee balanced workload. You need to understand your data distribution.
How could we fix this?
Analyze first, then split: Count non-ocean pixels per strip, then redistribute to balance work
Dynamic work stealing: Let fast workers (Worker 3) steal chunks from slow workers (Worker 1)
Finer-grained parallelism: Split into 16 or 32 smaller chunks instead of 4 large strips, let a work queue distribute them
This is why frameworks like Spark use partitioning strategies (hash partitioning, range partitioning) and why Google's MapReduce had a combiner phase to reduce skew.
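Fix #3 is the easiest to sketch. With a toy workload standing in for the raster (the chunk values and counts below are made up for illustration), a Pool with many small tasks lets whichever worker finishes early pick up the next chunk, absorbing the skew:

```python
import multiprocessing as mp

def classify_chunk(chunk):
    # Stand-in for process_strip on one small chunk of rows:
    # count pixels above the medium-risk threshold (150)
    return sum(1 for v in chunk if v > 150)

if __name__ == "__main__":
    # 16 small chunks instead of 4 big strips; chunksize=1 means each
    # worker grabs the next chunk as soon as it frees up, so a fast
    # worker on an "ocean" region keeps pulling work instead of idling
    chunks = [[100.0, 200.0, 450.0]] * 16
    with mp.Pool(processes=4) as pool:
        counts = pool.map(classify_chunk, chunks, chunksize=1)
    print(sum(counts))  # 16 chunks x 2 qualifying pixels = 32
```

The trade-off: each chunk still gets pickled to a worker, so this fixes the straggler problem but not the serialization one.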
Problem #2: Where did the extra ~2 seconds go?
Let's do the math:
Reading: 1.4s
Slowest worker: 67.9s
Expected total: ~69.3s
Actual total: 71.2s
Missing time: ~2s
That missing 2 seconds is serialization overhead - the time spent pickling 3.09 GB of NumPy arrays and transferring them across process boundaries.
Here's what Python had to do:
Traverse every float32 in the array (1.54 billion of them)
Convert to pickle protocol format
Copy data from main process memory to child process memory
Child processes unpickle and reconstruct arrays
For 641 MB (small scale), this took ~1-1.5s. For 3.09 GB (full scale), it took ~2s. The overhead is growing, but not linearly - this is a sign we're hitting Serialization Overhead, a classic distributed systems bottleneck.
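You can watch this cost directly. A sketch with a small stand-in array (1000×1000 instead of a real strip) that round-trips the data through pickle, the same serialize/deserialize path mp.Process arguments take:

```python
import pickle
import time
import numpy as np

# Small stand-in for one strip: ~4 MB of float32, same dtype as the raster
strip = np.zeros((1000, 1000), dtype=np.float32)

t0 = time.perf_counter()
blob = pickle.dumps(strip, protocol=pickle.HIGHEST_PROTOCOL)  # serialize
restored = pickle.loads(blob)                                 # deserialize
elapsed = time.perf_counter() - t0

print(f"{strip.nbytes / 1e6:.1f} MB round-tripped in {elapsed * 1000:.1f} ms")
# The pickled blob is a full extra copy of the data, plus a small header
assert len(blob) >= strip.nbytes
```

Scale the array up toward real strip sizes and you can chart exactly where pickling starts to eat into the speedup.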
Problem #3: Memory Explosion
Let's calculate peak memory usage:
Single process approach:
Loads one strip at a time: ~770 MB peak
Multiprocessing approach:
Main process holds all 4 strips: 3.09 GB
During pickling, each child gets its own copy: 4 × 770 MB = 3.08 GB
Peak memory: ~6.17 GB (double the dataset size!)
This is the Data Locality problem. In distributed systems, moving data is expensive:
Network bandwidth in distributed systems
Serialization overhead in multiprocessing
Memory duplication costs
We're duplicating the entire dataset just to parallelize the computation.
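One standard way out (Python 3.8+) is to keep a single copy in shared memory and have workers attach to it by name, instead of receiving a pickled copy. A minimal single-process sketch of the mechanism, using a toy 3×4 array rather than a real strip:

```python
import numpy as np
from multiprocessing import shared_memory

src = np.arange(12, dtype=np.float32).reshape(3, 4)

# One shared block sized for the array; this is the only copy made
shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
shared = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
shared[:] = src

# A worker would receive just shm.name and build a zero-copy view:
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(src.shape, dtype=src.dtype, buffer=attached.buf)
corner = float(view[2, 3])
print(corner)  # 11.0

del view, shared  # drop the numpy views before releasing the buffers
attached.close()
shm.close()
shm.unlink()
```

With this layout, peak memory stays near the dataset size instead of doubling, and nothing crosses a pickle boundary except the block's name.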
The Four Distributed System Concepts
1. Computation-to-Communication Ratio
Small scale: 25.5s compute / 3s overhead = 8.5:1 ratio
Full scale: 227.1s compute / ~4s overhead = 56:1 ratio
Takeaway: Both scales have favorable ratios, which is why multiprocessing worked. If this ratio drops below 2:1, you're spending more time moving data than computing.
2. Serialization Overhead
Small scale: ~1.5s to pickle 641 MB
Full scale: ~2s to pickle 3.09 GB
Takeaway: Serialization doesn't scale linearly. At some point (10GB? 50GB?), the pickle overhead would dominate and kill our speedup.
3. The Straggler Problem
Small scale: Workers finished within 1.4s of each other (5.5s to 6.9s)
Full scale: Workers finished within 14.9s of each other (53.0s to 67.9s)
Takeaway: As computation time grows, small variations in worker speed become large absolute delays. Google's MapReduce and Apache Spark have entire frameworks dedicated to detecting and restarting stragglers.
4. Data Locality
We moved 3.09 GB of data to achieve parallelism
Peak memory: 6.17 GB (2x the dataset size)
Takeaway: "Moving computation to data" is cheaper than "moving data to computation." This is why Hadoop moved computation to the data nodes rather than centralizing data.
Why the Speedup Degraded Slightly (3.4x → 3.19x)!
With 4 workers, we'd expect 4x speedup in a perfect world. Here's why we're not getting it:
Small scale (3.4x):
Lost 0.6x to: straggler variance (1.4s gap) + serialization (~1.5s) + process spawn overhead
Full scale (3.19x):
Lost 0.81x to: worse straggler variance (14.9s gap) + serialization (~2s) + memory contention
The gap widened because:
Longer-running jobs amplify straggler effects
More data = more serialization time
6GB memory footprint = more RAM bandwidth contention between workers
Key Insight: Approach 1 works well at both scales because our computation-to-communication ratio is excellent. But as we scaled up 10x in data, we only lost 6% speedup efficiency (3.4x → 3.19x). The cracks are starting to show. At 100x scale (30GB), this approach would likely break down entirely.
This is why Google, Apache, and other distributed systems companies have spent decades optimizing these exact problems. What we're seeing in miniature is what happens at massive scale in production systems.
What's Next!
Approach 1 gave us 3.19x speedup, but it also exposed real problems:
6GB peak memory (2x the dataset size)
14.9-second straggler delays from data skew
Serialization overhead that grows with data size
I have two more approaches waiting to be explored:
Approach 2: Workers Read from Disk
What if we eliminate serialization entirely? Each worker reads its strip directly from disk. Zero pickling overhead, but now we're hitting disk I/O in parallel. Better or worse?
Approach 3: Send File Paths and Coordinates
The minimalist approach - just tell workers "read rows X-Y from file Z". This mirrors how Hadoop and Spark actually work. Does minimal communication always win?
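As a preview, the core of that idea fits in a few lines. A sketch with a tiny made-up .bin file (the real strips are stored the same way: row-major float32), where a worker receives only a path and row coordinates:

```python
import os
import tempfile
import numpy as np

WIDTH = 4  # tiny stand-in for the real 40,100-pixel width

# Write a toy strip file in the same layout as the real ones
path = os.path.join(tempfile.gettempdir(), 'demo_strip.bin')
np.arange(24, dtype=np.float32).reshape(6, WIDTH).tofile(path)

def worker(path, row_start, row_count, width=WIDTH):
    """All a worker needs: a path plus row coordinates. No pixels are
    pickled; the OS maps only the requested byte range into memory."""
    view = np.memmap(path, dtype=np.float32, mode='r',
                     offset=row_start * width * 4,  # 4 bytes per float32
                     shape=(row_count, width))
    return float(view.sum())

print(worker(path, 2, 2))  # rows 2-3 hold the values 8..15, so 92.0
```

Whether parallel disk reads behave as nicely as this looks is exactly the question the next post digs into.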
And here's the kicker: I've also implemented all three approaches in Node.js and Go to see how different concurrency models (event loops vs goroutines vs multiprocessing) handle the same problem. The results are... surprising.
This post was already 15+ minutes of reading (thanks for sticking with me!), so I'm splitting the rest:
Part 2: Approaches 2 & 3 in Python - full benchmark comparison
Part 3: Same problem, different languages - how Node.js and Go change the game
Follow me on LinkedIn to catch the next parts. :)