C4 Modelling for Complex Distributed Systems

Applying C4 model diagrams to a real distributed platform — from system context down to code-level detail, with practical examples and living documentation strategies.

Architecture diagrams in most organisations fall into two categories: either someone drew boxes on a whiteboard three years ago and took a photo that now lives in a wiki nobody reads, or every team maintains their own contradictory Visio diagram that’s perpetually six months out of date.

The C4 model, created by Simon Brown, solves this by providing four levels of abstraction — Context, Containers, Components, and Code — each targeting a different audience and answering different questions. After applying C4 to Mesh-Sync, a distributed 3D model processing platform with a NestJS backend, TypeScript orchestration engine, Python worker pool, and half a dozen infrastructure services, I’m convinced it’s the most practical architecture documentation approach for complex systems.

This article walks through all four C4 levels using a real production system, shows how to keep diagrams in sync with code, and demonstrates integration with Architecture Decision Records (ADRs).

The Documentation Problem

Before C4, our architecture documentation looked like this:

  • A one-page “system overview” diagram in Confluence that mixed infrastructure (Redis, PostgreSQL) with application components (AuthModule, ModelService) at the same level of abstraction
  • Inline ASCII diagrams in README files that nobody updated
  • Deployment diagrams that showed AWS services but not application boundaries
  • Zero documentation of the internal structure of our most complex service — the pipeline orchestration engine

The problem isn’t that people don’t want to document. It’s that there’s no shared vocabulary for what a “component” means or what level of detail belongs where. C4 provides that vocabulary.

Level 1: System Context — Who Uses What

The System Context diagram is the most zoomed-out view. It shows your system as a single box, surrounded by the users and external systems it interacts with. Non-technical stakeholders should be able to read this.

graph TB
    subgraph ext[External Systems]
        S3[AWS S3<br/><i>File Storage</i>]
        Stripe[Stripe<br/><i>Payments</i>]
        Email[SendGrid<br/><i>Email Delivery</i>]
        OAuth[Google/GitHub OAuth<br/><i>Authentication</i>]
    end

    User([3D Artist / Designer])
    Admin([Platform Admin])
    API([API Consumer<br/><i>Third-party integrations</i>])

    User -->|Uploads models,<br/>browses marketplace| MS
    Admin -->|Manages users,<br/>monitors pipelines| MS
    API -->|REST API calls| MS

    MS[Mesh-Sync Platform<br/><i>3D Model Processing<br/>& Marketplace</i>]

    MS -->|Stores/retrieves files| S3
    MS -->|Processes payments| Stripe
    MS -->|Sends notifications| Email
    MS -->|Authenticates users| OAuth

    style MS fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style User fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style Admin fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style API fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style S3 fill:#1e1e24,stroke:#fbbf24,color:#e4e4e7
    style Stripe fill:#1e1e24,stroke:#fbbf24,color:#e4e4e7
    style Email fill:#1e1e24,stroke:#fbbf24,color:#e4e4e7
    style OAuth fill:#1e1e24,stroke:#fbbf24,color:#e4e4e7

Key decisions visible at this level:

  • The platform has three distinct user personas (artists, admins, API consumers) with different interaction patterns
  • External system boundaries are explicit — if Stripe goes down, payments are affected but model processing continues
  • Authentication is delegated to OAuth providers, not built in-house

This diagram doesn’t show Redis, PostgreSQL, or BullMQ. Those are implementation details. A VP of Engineering or a new team member can look at this and understand what the system does and who it serves in 30 seconds.

Level 2: Container — What Runs Where

The Container diagram zooms into the “Mesh-Sync Platform” box and shows the separately deployable units — applications, databases, message brokers, and file stores. Each container is a process or a data store that communicates over the network.

graph TB
    subgraph meshsync[Mesh-Sync Platform]
        BE[NestJS Backend<br/><i>REST API, Auth,<br/>Business Logic</i>]
        WB[Worker Backend<br/><i>Pipeline Orchestration,<br/>Job Dispatch</i>]
        
        subgraph workers[Python Worker Pool]
            W1[Thumbnail Generator<br/><i>Blender + Python</i>]
            W2[Semantic Analyzer<br/><i>LLM-powered classification</i>]
            W3[Metadata Extractor<br/><i>Format parsing</i>]
            W4[Model Discovery<br/><i>Search indexing</i>]
        end

        PG[(PostgreSQL<br/><i>Primary data store</i>)]
        Redis[(Redis<br/><i>Queues, Cache,<br/>Pipeline State</i>)]
        MinIO[(MinIO<br/><i>Object Storage<br/>Pipeline Cache</i>)]
        ELK[(Elasticsearch<br/><i>Observability,<br/>Search Index</i>)]
    end

    User([Users]) -->|HTTPS| BE
    
    BE -->|REST Webhooks| WB
    BE -->|SQL| PG
    WB -->|BullMQ Jobs| Redis
    WB -->|Cache R/W| MinIO
    WB -->|Events| ELK
    
    Redis -->|Job Dispatch| W1
    Redis -->|Job Dispatch| W2
    Redis -->|Job Dispatch| W3
    Redis -->|Job Dispatch| W4
    
    W1 & W2 & W3 & W4 -->|HMAC Webhooks| WB
    
    style BE fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style WB fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style W1 fill:#1e1e24,stroke:#34d399,color:#e4e4e7
    style W2 fill:#1e1e24,stroke:#34d399,color:#e4e4e7
    style W3 fill:#1e1e24,stroke:#34d399,color:#e4e4e7
    style W4 fill:#1e1e24,stroke:#34d399,color:#e4e4e7
    style PG fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style Redis fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style MinIO fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style ELK fill:#1e1e24,stroke:#818cf8,color:#e4e4e7

Key decisions visible at this level:

  • Two backend services — the NestJS Backend handles user-facing API/business logic while the Worker Backend handles pipeline orchestration. This separation means orchestration complexity doesn’t leak into the API layer.
  • CQRS via webhooks — workers don’t write to PostgreSQL directly. They report results via HMAC-signed webhooks to the Worker Backend, which decides what to persist. This is ADR-010 in our decision log.
  • Redis serves triple duty — message queue (BullMQ), pipeline state cache, and distributed lock store. A pragmatic trade-off: fewer infrastructure components to operate.
  • Python workers are stateless — they pull jobs from Redis, process them, send webhooks, and terminate. No local state, no direct DB access. This makes scaling trivial: add more worker replicas.
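The HMAC-signed webhook channel is the linchpin of this design. The article doesn’t show the actual Mesh-Sync implementation, but the signing and constant-time verification can be sketched with Node’s built-in crypto module — function and header names here are my own assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical header name — the real Mesh-Sync convention is not shown in the article.
const SIGNATURE_HEADER = "x-meshsync-signature";

/** Worker side: sign the serialized webhook body with HMAC-SHA256. */
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

/** Worker Backend side: verify in constant time to avoid timing attacks. */
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so guard first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Because workers hold only the shared secret — never database credentials — a compromised worker can at worst submit forged results, not corrupt the data store directly.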

Level 3: Component — What’s Inside

The Component diagram zooms into a single container to show its major structural building blocks. Let’s look inside the Worker Backend — the orchestration engine — since it’s the most architecturally complex container:

graph TB
    subgraph wb[Worker Backend — Pipeline Orchestration Engine]
        PO[Pipeline Orchestrator<br/><i>Facade: Pipeline lifecycle</i>]
        DR[Dependency Resolver<br/><i>DAG graph builder</i>]
        SE[Stage Executor<br/><i>Routes to handler by type</i>]
        PV[Pipeline Validator<br/><i>Schema + Semantic checks</i>]
        IV[Interpolation Validator<br/><i>Variable resolution safety</i>]
        
        subgraph events[Domain Event System]
            DED[Domain Event Dispatcher<br/><i>Mediator pattern</i>]
            MSH[Model Status Handler]
            TMH[Technical Metadata Handler]
            FCH[Folder Completion Handler]
        end
        
        subgraph actions[Action Registry]
            AR[Action Registry<br/><i>Command pattern</i>]
            MA[Model Actions<br/><i>Status updates</i>]
            CA[Context Actions<br/><i>State mutations</i>]
        end
        
        subgraph infra[Infrastructure Services]
            ELK[ELK Event Publisher<br/><i>Observability</i>]
            MC[MinIO Cache Manager<br/><i>Result caching</i>]
            TM[Timeout Monitor<br/><i>Deadline enforcement</i>]
        end
    end

    API([REST API Endpoints]) --> PO
    PO --> DR
    PO --> PV
    PO --> SE
    SE --> AR
    SE --> DED
    SE --> IV
    DED --> MSH & TMH & FCH
    MSH -->|Webhook| ExtBE([NestJS Backend])
    
    SE -->|Enqueue| Redis[(Redis / BullMQ)]
    ELK -->|Batch publish| ES[(Elasticsearch)]
    MC -->|Cache R/W| MinIOStore[(MinIO)]
    TM -->|Scan running stages| Redis

    style PO fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style DR fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style SE fill:#1e1e24,stroke:#5eead4,color:#e4e4e7
    style DED fill:#1e1e24,stroke:#818cf8,color:#e4e4e7
    style AR fill:#1e1e24,stroke:#34d399,color:#e4e4e7
    style ELK fill:#1e1e24,stroke:#fbbf24,color:#e4e4e7

Component responsibilities:

| Component | Responsibility | Pattern |
|---|---|---|
| PipelineOrchestrator | Entry point facade — receives pipeline start/stop requests, coordinates lifecycle | Facade |
| DependencyResolver | Builds DAG from stage definitions, checks if dependencies are satisfied | Graph analysis |
| StageExecutor | Routes stage execution to the correct handler based on type (worker/internal/parallel/decision) | Strategy |
| PipelineValidator | Three-layer validation: JSON Schema → semantic → interpolation | Chain of Responsibility |
| DomainEventDispatcher | Routes domain events to registered handlers without coupling emitters to consumers | Mediator |
| ActionRegistry | Maps action names to handler implementations for internal stages | Command + Registry |
| ELKEventPublisher | Batched event streaming to Elasticsearch for pipeline observability | Observer + Buffer |
| MinIOCacheManager | Content-addressable caching of stage results to skip redundant computation | Cache-Aside |
| TimeoutMonitor | Background scanner that detects and escalates timed-out stages | Polling Monitor |

This level of detail is useful for developers working on the orchestration engine. It shows which component to modify for a given change, identifies the design patterns in use, and maps data flow through the system.
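The DependencyResolver’s core job — deciding which stages are ready to run — is a small piece of graph logic. The article doesn’t show the real stage schema, so field names like `dependsOn` are assumptions; a minimal sketch:

```typescript
// Hypothetical stage shape — the actual Mesh-Sync pipeline schema is not shown.
interface StageDefinition {
  name: string;
  dependsOn: string[];
}

/** Stages whose dependencies are all completed and that haven't run yet. */
function readyStages(stages: StageDefinition[], completed: Set<string>): string[] {
  return stages
    .filter((s) => !completed.has(s.name))
    .filter((s) => s.dependsOn.every((d) => completed.has(d)))
    .map((s) => s.name);
}

/** Kahn-style drain: a valid DAG empties completely; leftovers mean a cycle. */
function hasCycle(stages: StageDefinition[]): boolean {
  const done = new Set<string>();
  let progressed = true;
  while (progressed) {
    progressed = false;
    for (const name of readyStages(stages, done)) {
      done.add(name);
      progressed = true;
    }
  }
  return done.size !== stages.length;
}
```

Validation (the PipelineValidator) would run `hasCycle` once at pipeline submission; the orchestrator would call `readyStages` after each stage completion to decide what to enqueue next.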

Level 4: Code — The Implementation Detail

The Code level zooms into a single component to show classes, interfaces, and their relationships. This is the most ephemeral level — it changes with every refactor — so we only create Code diagrams for critical abstractions that need to be well-understood.

Here’s the Domain Event System in detail:

classDiagram
    class DomainEvent {
        <<interface>>
        +type: string
        +correlationId: string
        +timestamp: Date
        +payload: any
    }

    class DomainEventHandler {
        <<interface>>
        +handle(event: DomainEvent) Promise~void~
    }

    class DomainEventDispatcher {
        -handlers: Map~string, DomainEventHandler~
        +register(eventType: string, handler: DomainEventHandler) void
        +dispatch(event: DomainEvent) Promise~any~
    }

    class ModelStatusUpdateEvent {
        +type: "model.status.update_requested"
        +modelId: string
        +newStatus: string
    }

    class TechnicalMetadataSaveEvent {
        +type: "model.technical_metadata.save_requested"
        +modelId: string
        +metadata: object
    }

    class ModelStatusUpdateHandler {
        -modelWebhookClient: ModelWebhookClient
        +handle(event: ModelStatusUpdateEvent) Promise~void~
    }

    class TechnicalMetadataSaveHandler {
        -modelWebhookClient: ModelWebhookClient
        +handle(event: TechnicalMetadataSaveEvent) Promise~void~
    }

    class FolderCompletionCheckHandler {
        -folderService: FolderService
        +handle(event: DomainEvent) Promise~void~
    }

    DomainEvent <|-- ModelStatusUpdateEvent
    DomainEvent <|-- TechnicalMetadataSaveEvent
    DomainEventHandler <|.. ModelStatusUpdateHandler
    DomainEventHandler <|.. TechnicalMetadataSaveHandler
    DomainEventHandler <|.. FolderCompletionCheckHandler
    DomainEventDispatcher --> DomainEventHandler : routes to
    DomainEventDispatcher --> DomainEvent : dispatches

Why this component gets a Code diagram:

The Domain Event System is a critical integration point — it’s how the orchestration engine communicates state changes to the NestJS backend without direct coupling. New developers need to understand the registration pattern, the one-handler-per-event-type constraint, and the fact that handlers are the only place where webhooks are sent. This diagram makes that structure explicit.
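The class diagram translates almost mechanically into TypeScript. The method bodies below are my own sketch — the article shows only the structure — but the one-handler-per-event-type constraint is cheap to enforce at registration time:

```typescript
interface DomainEvent {
  type: string;
  correlationId: string;
  timestamp: Date;
  payload: unknown;
}

interface DomainEventHandler {
  handle(event: DomainEvent): Promise<void>;
}

class DomainEventDispatcher {
  private handlers = new Map<string, DomainEventHandler>();

  /** Enforce the one-handler-per-event-type constraint: no silent fan-out. */
  register(eventType: string, handler: DomainEventHandler): void {
    if (this.handlers.has(eventType)) {
      throw new Error(`Handler already registered for event type: ${eventType}`);
    }
    this.handlers.set(eventType, handler);
  }

  /** Route an event to its single registered handler. */
  async dispatch(event: DomainEvent): Promise<void> {
    const handler = this.handlers.get(event.type);
    if (!handler) {
      throw new Error(`No handler for event type: ${event.type}`);
    }
    await handler.handle(event);
  }
}
```

Failing fast on duplicate registration is what makes the Mediator honest: when a reader of the Component diagram sees one arrow from the dispatcher to each handler, the code guarantees the picture is true.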

Integrating C4 with Architecture Decision Records

C4 diagrams answer “what does the system look like?” ADRs answer “why does it look that way?” Linking them creates architecture documentation that’s both visual and rationale-rich.

We embed ADR references directly in our C4 descriptions:

| C4 Element | ADR | Decision |
|---|---|---|
| Workers → Webhooks → Worker Backend | ADR-010 | CQRS via webhooks — workers never write to the database directly |
| Pipeline definitions in YAML | ADR-007 | Event-driven worker architecture with declarative pipeline models |
| Domain Event Dispatcher | ADR-012 | Mediator pattern for domain events — single handler per event type, no fan-out |
| BullMQ over custom queue | ADR-003 | Use BullMQ for job queuing — mature, Redis-backed, supports priorities and rate limiting |

When someone reads the Container diagram and wonders “why do workers send webhooks instead of writing to PostgreSQL directly?”, they follow the ADR-010 link and find the context, options considered, and rationale. The diagram shows what; the ADR explains why.

ADR Format We Use

# ADR-010: CQRS Webhook Architecture

## Status
Accepted — 2025-08-15

## Context
Workers process jobs asynchronously. They need to report results
back to the system. Options considered:
1. Direct database writes from workers (shared database)
2. Message queue events consumed by backend
3. HTTP webhook callbacks to backend API

## Decision
Option 3 — Workers send HMAC-signed HTTP webhooks to the
Worker Backend, which handles persistence and side effects.

## Consequences
- Workers have zero knowledge of the database schema
- Backend controls all write operations (single writer principle)
- Workers can be implemented in any language
- Added latency from HTTP round-trip (acceptable: <50ms)
- Requires webhook signature verification (HMAC-SHA256)

Keeping C4 Diagrams Alive

The biggest risk with any architecture documentation is drift. Here’s our strategy for keeping C4 diagrams in sync with reality:

1. Diagrams Live in Code

All C4 diagrams are Mermaid blocks inside Markdown files in the repository — alongside the code they describe. Not in Confluence, not in a shared drive, not in a Structurizr cloud instance. When you change the code, you see the diagram in the same PR.

2. ADR-Triggered Updates

Every new ADR that affects system structure triggers a C4 update as part of the same PR. The PR template includes a checkbox: “Does this change affect the C4 model? If yes, update the relevant diagram.”

3. Quarterly Architecture Review

Every quarter, lead engineers walk through the C4 diagrams in a 1-hour session. We project the Context and Container diagrams and ask: “Does this still match reality?” The Component and Code diagrams are reviewed by the team that owns the container.

4. Level-Appropriate Detail

We deliberately keep Level 1 (Context) and Level 2 (Container) very stable — these change only when we add/remove external integrations or deploy new services. Level 3 (Component) changes with significant refactors. Level 4 (Code) is generated on demand and never persisted — it’s too volatile to maintain.

Common Mistakes

Mixing Abstraction Levels

The most common C4 mistake is putting databases on the same diagram as code classes, or showing network protocols alongside business concepts. Each C4 level has a vocabulary:

  • Context: Systems, users, external services
  • Container: Applications, databases, file systems, message brokers
  • Component: Modules, services, controllers, repositories (within a container)
  • Code: Classes, interfaces, enums, functions (within a component)

If your diagram has a “PostgreSQL” box next to a “UserService” class, you’re mixing levels.

Over-Detailing Level 1

The System Context diagram should be understandable by someone who’s never written code. If it has more than 10 boxes, you’re showing too much. Merge external systems into categories if needed: “Cloud Infrastructure (AWS)” instead of listing every Lambda, S3 bucket, and SQS queue.

Skipping Level 3

Many teams draw Context and Container diagrams but never produce Component diagrams. This leaves a gap: developers can see the containers but don’t know how their internals are structured. Level 3 is where the real architectural value lives — it shows the design patterns, responsibilities, and data flows that determine how easy the system is to change.

Not Linking to ADRs

A C4 diagram without ADR references is a picture of the current state with no explanation of how you got there. Six months later, a new team member looks at the webhook arrows and asks “why don’t workers just write to the database?” Without ADR-010, nobody remembers.

Conclusion

C4 modelling works because it gives teams a shared abstraction hierarchy that scales from executive summaries (Level 1) to implementation details (Level 4). The format is lightweight — Mermaid diagrams in Markdown cost nothing to produce and live naturally alongside code.

For distributed systems with multiple services, worker pools, and infrastructure dependencies, C4 is the difference between “I think the data flows through…” and “here’s exactly what talks to what, and here’s why we designed it that way.”

Start with Level 1. Draw it on a whiteboard. If your entire team agrees it’s accurate, write it down. Then zoom in.