# A Field Guide to System Design Trade-offs

Every system design decision is a trade-off. Here are the ones that actually matter, explained through the lens of "what breaks when."

## Consistency vs. Availability (it's not just CAP)

**The real question:** When two users read the same data at the same time, must they see the same thing?

- **Strong consistency** (e.g., single Postgres primary): Both readers always agree. Breaks when the primary is down: writes stop until it recovers or a replica is promoted.
- **Eventual consistency** (e.g., DynamoDB with its default reads, Cassandra at low consistency levels): Readers might briefly disagree. In exchange, the system stays writable even during partitions.
- **The thing nobody tells you:** Most applications can tolerate *seconds* of inconsistency in *most* of their data. Your user's profile picture can be stale for 5 seconds. Their bank balance cannot.

**Design heuristic:** Default to strong consistency. Switch to eventual only for specific fields where you can articulate what "stale" means and why it's acceptable.
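
One way to make this heuristic concrete is to write the staleness budget down per field and treat everything else as strongly consistent. The sketch below is illustrative only: `STALENESS_BUDGET_SECONDS`, `can_serve_from_replica`, and the field names are hypothetical, not part of any real library.

```python
# Hypothetical per-field staleness budgets, in seconds.
# A field with no entry has no articulated budget, so it stays strongly consistent.
STALENESS_BUDGET_SECONDS = {
    "profile_picture_url": 5.0,  # a few seconds of staleness is acceptable
    "display_name": 5.0,
}


def can_serve_from_replica(field: str, replica_lag_seconds: float) -> bool:
    """Return True if a replica lagging by `replica_lag_seconds` may serve `field`."""
    budget = STALENESS_BUDGET_SECONDS.get(field)
    if budget is None:
        return False  # default: read from the primary
    return replica_lag_seconds <= budget


print(can_serve_from_replica("profile_picture_url", 2.0))
print(can_serve_from_replica("account_balance", 0.0))
```

The point of the default branch is that eventual consistency is opt-in: a new field such as a balance never reaches a lagging replica unless someone adds it to the table on purpose.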

## Latency vs. Throughput

These sound related but optimize differently:

- **Low latency** = respond to one request fast. Optimize: caching, connection pooling, edge compute, fewer network hops.
- **High throughput** = handle many requests. Optimize: batching, async processing, horizontal scaling, queue-based architectures.

**The conflict:** Batching increases throughput but increases latency for individual items in the batch. Caching reduces latency but adds invalidation complexity that can reduce throughput under write-heavy loads.
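
A back-of-the-envelope model makes the batching conflict visible. It assumes each batch pays a fixed overhead plus a per-item cost, and that items arrive at a steady rate, so the first item in a batch waits for the batch to fill. The function name and all numbers are made up for illustration.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=1.0, arrival_interval_ms=2.0):
    """Return (capacity in items/sec, worst-case latency in ms) for one batch size.

    Assumed model: a batch costs overhead_ms + batch_size * per_item_ms to process,
    and one item arrives every arrival_interval_ms.
    """
    processing_ms = overhead_ms + batch_size * per_item_ms
    capacity_per_sec = batch_size / processing_ms * 1000.0
    # The first item waits for the other batch_size - 1 items to arrive,
    # then for the whole batch to be processed.
    fill_wait_ms = (batch_size - 1) * arrival_interval_ms
    worst_latency_ms = fill_wait_ms + processing_ms
    return capacity_per_sec, worst_latency_ms


for size in (1, 10, 100):
    capacity, latency = batch_stats(size)
    print(f"batch={size:>3}  capacity={capacity:7.1f}/s  worst latency={latency:6.1f} ms")
```

Under these assumed numbers, going from a batch size of 1 to 100 raises capacity from roughly 91 to roughly 909 items per second, while worst-case latency grows from 11 ms to 308 ms.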

**Design heuristic:** Optimize for latency at the edges (user-facing APIs) and throughput in the core (data pipelines, batch jobs).

## Simplicity vs. Flexibility

| Simple | Flexible | When to choose simple |
|--------|----------|-----------------------|
| SQLite | PostgreSQL cluster | < 100 writes/sec, single server |
| Monolith | Microservices | < 10 engineers, < 1M users |
| REST | GraphQL | < 5 client types |
| Cron job | Event-driven pipeline | < 1M events/day |

**The hidden cost of flexibility:** Every abstraction layer, every config option, every plugin system is code that must be understood, tested, and maintained. A system with 3 boolean config options has 2^3 = 8 possible states. A system with 10 has 2^10 = 1,024.
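
The state counts follow from each on/off option doubling the number of combinations. A quick check, assuming every option is an independent boolean (the option names are hypothetical):

```python
from itertools import product


def config_states(option_names):
    """Enumerate every on/off combination of the given boolean options."""
    return [
        dict(zip(option_names, values))
        for values in product([False, True], repeat=len(option_names))
    ]


three = config_states(["cache_enabled", "async_writes", "strict_mode"])
print(len(three))  # number of combinations for 3 boolean options
print(2 ** 10)     # count for 10 boolean options, without enumerating them
```

Options with more than two values, or options that interact, grow the space faster still, so the boolean case is a lower bound on what has to be tested.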

**Design heuristic:** Choose the simplest option that won't need to be replaced in the next 18 months. Not 5 years — you can't predict that far. Not 6 months — that's just kicking the can.

## Reads vs. Writes

Many systems are read-heavy, with read/write ratios of 100:1 or more. This unlocks powerful optimizations:

- **Materialized views**: Precompute expensive queries at write time
- **Denormalization**: Store data redundantly to avoid joins
- **Read replicas**: Scale reads horizontally, writes stay on primary
- **CDN/Cache layers**: Serve reads from memory, bypass the database entirely
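
As a minimal in-memory sketch of the first two ideas, the class below keeps a precomputed per-author post count that is updated on every write, so the read path is a dictionary lookup instead of a scan. `PostStore` and its methods are hypothetical names for illustration; a real system would do this in the database or a cache layer.

```python
from collections import defaultdict


class PostStore:
    """Toy store that trades extra write work for cheap reads."""

    def __init__(self):
        self.posts = []                           # source of truth
        self.posts_per_author = defaultdict(int)  # precomputed "view"

    def add_post(self, author, text):
        # The write path does two things: record the post and update the view.
        self.posts.append((author, text))
        self.posts_per_author[author] += 1

    def count_by_scan(self, author):
        # Read without the view: cost grows with the number of posts.
        return sum(1 for a, _ in self.posts if a == author)

    def count_from_view(self, author):
        # Read with the view: a single lookup. .get avoids creating entries on read.
        return self.posts_per_author.get(author, 0)


store = PostStore()
store.add_post("ana", "first")
store.add_post("ana", "second")
store.add_post("bo", "hello")
print(store.count_from_view("ana"), store.count_by_scan("ana"))
```

Both read methods return the same answer; the difference is that the view makes every write slightly more expensive and adds a second copy of the data that must stay in sync.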

**The write-heavy exception:** Analytics, logging, IoT sensor data, and event sourcing are write-heavy. These are better served by append-oriented stores (Kafka, ClickHouse, TimescaleDB) that sacrifice read flexibility for write throughput.

**Design heuristic:** Measure your read/write ratio before choosing your data model. A 100:1 read/write ratio means you should optimize the schema for reads even if writes become 10x more complex, because the extra write cost is paid once for every 100 reads that benefit.
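
The arithmetic behind this heuristic can be sketched with made-up unit costs. The example treats "complex" as "costly", which is a simplification, and the specific cost values are assumptions chosen only to show the shape of the trade.

```python
def total_cost(reads, writes, read_cost, write_cost):
    """Total work for a workload, in arbitrary cost units."""
    return reads * read_cost + writes * write_cost


# Workload: 100 reads for every 1 write.
# Assumed costs: the write-friendly schema has cheap writes but reads that need joins;
# the read-optimized schema makes writes 10x more expensive and reads 5x cheaper.
write_friendly = total_cost(reads=100, writes=1, read_cost=5.0, write_cost=1.0)
read_optimized = total_cost(reads=100, writes=1, read_cost=1.0, write_cost=10.0)

print(write_friendly, read_optimized)
```

With these numbers the read-optimized schema does 110 units of work against 501, even though each write costs ten times as much. At a 1:1 ratio the same costs would give 11 against 6, which is why the ratio has to be measured first.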

## The meta-lesson

Trade-offs aren't permanent. Start with the simpler option. Monitor. When the simpler option starts hurting — and you'll know because you're monitoring — migrate to the more complex one. The cost of migrating later is almost always less than the cost of over-engineering now.

The best architecture is the one your team can understand, debug at 3 AM, and explain to a new hire in under an hour.
