How CrypTok Survives: Inside Our New Resilience Architecture

Yesterday CrypTok went dark for about four hours. That was unacceptable for a platform that's becoming a real home for thousands of creators.

So we spent the next twenty-four hours rebuilding our resilience layer from the ground up. Here's what changed.

The architecture, before

A single production server. Solid hardware, fast, well-tuned — but a single point of failure. When that server lost power, the platform lost power. When that server lost network, the platform lost network. The blast radius of any one hardware fault was 100% of our users.

The architecture, now

Two fully replicated servers running the entire CrypTok stack, in geographically separate regions. Different power grids. Different upstream networks. Different failure domains.

How the data stays in sync

Databases replicate in real time via a GTID-based MariaDB slave, tunneled over SSH. Reported lag is consistently zero seconds: every write on the primary is on the secondary milliseconds later.
User files (uploads, mail spools, DirectAdmin user configs) sync via lsyncd, which watches for filesystem changes and pushes them to the secondary within a lag window of roughly 15-30 seconds.
DirectAdmin user state — vhosts, mail accounts, MySQL grants, SSL — captured by a weekly admin-backup that automatically restores onto the secondary. Anything missed by live sync gets reconciled then.
Code deploys push to both servers automatically. The secondary builds the latest production code every 5 minutes but stays cold — application processes don't start until promotion.
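
To make the database lag claim concrete, here is a minimal sketch of how a replica's health could be judged from one row of MariaDB's SHOW SLAVE STATUS output. The column names (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master) are MariaDB's own; the function name, the dict input, and the 5-second threshold are assumptions for illustration, not our production code.

```python
from dataclasses import dataclass


@dataclass
class ReplicaHealth:
    healthy: bool
    reason: str


def assess_replica(status: dict, max_lag_seconds: int = 5) -> ReplicaHealth:
    """Judge replica health from one row of SHOW SLAVE STATUS.

    `status` maps column names to values, as a database driver would
    return them. The default threshold is an illustrative assumption.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return ReplicaHealth(False, "IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        return ReplicaHealth(False, "SQL thread is not running")

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MariaDB reports NULL here when lag cannot be determined.
        return ReplicaHealth(False, "lag is unknown (NULL)")
    if int(lag) > max_lag_seconds:
        return ReplicaHealth(False, f"lag is {lag}s, above {max_lag_seconds}s")
    return ReplicaHealth(True, f"replicating, lag {lag}s")
```

Seconds_Behind_Master is reported in whole seconds, so a value of 0 means the replica is less than one second behind rather than literally instantaneous.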

How failover works

A watchdog on the secondary probes the primary every 60 seconds. After five consecutive failures (about five minutes), it auto-promotes itself: stops the read-only slave, opens the database for writes, starts the application cluster, and flips our DNS via the Cloudflare API. The team gets a Telegram alert at the moment of action.

Users see a brief reconnect, then the platform is back — from a different region.
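
The watchdog logic described above can be sketched as a consecutive-failure counter. This is a simplified illustration under stated assumptions: probe and promote are hypothetical callables standing in for the real health check and the real promotion script (stop the slave, open the database for writes, start the application cluster, flip DNS, send the alert), and the sleep function is injectable only so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def run_watchdog(
    probe: Callable[[], bool],
    promote: Callable[[], None],
    failure_threshold: int = 5,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Probe the primary until it fails `failure_threshold` times in a row.

    Calls `promote` exactly once, then returns the total number of
    probes that were made.
    """
    consecutive_failures = 0
    probes = 0
    while True:
        probes += 1
        if probe():
            # Any successful probe resets the streak, so a single
            # dropped packet never triggers a promotion.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                promote()
                return probes
        sleep(interval_seconds)
```

Requiring consecutive failures rather than a total count is the design choice that matters here: it trades about five minutes of detection time for protection against promoting on a transient blip.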

How we keep proving it works

A monthly automated shadow smoke-test runs eight checks on the standby, including replication lag, sync daemons, build presence, application trial-boot, panel reachability, API token validity, and promote-script syntax. Results get posted to our team channel. If anything regresses, we know before users do.
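
A smoke-test runner of this kind can be sketched as a loop that runs every named check, never stops at the first failure, and produces a report ready to post. The function name and the check names in the usage note are hypothetical; the real checks would wrap the actual probes on the standby.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]


def run_smoke_test(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check, even after a failure, and build a report.

    Returns (all_passed, report_lines).
    """
    lines: List[str] = []
    all_passed = True
    for name, check in checks.items():
        try:
            passed = bool(check())
            detail = ""
        except Exception as exc:
            # A check that crashes counts as a failure, not as a skip.
            passed = False
            detail = f" ({exc})"
        all_passed = all_passed and passed
        status = "PASS" if passed else "FAIL"
        lines.append(f"{status} {name}{detail}")
    return all_passed, lines
```

For example, run_smoke_test({"replication lag": check_lag, "promote-script syntax": check_syntax}) would return one report line per check, where check_lag and check_syntax are placeholder names for whatever functions perform those probes.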

Why failback is intentionally manual

When the primary recovers, the secondary has writes the primary missed. The recovery procedure reverses the replication direction, drains traffic from the secondary, and returns to normal operations — always during a low-traffic window, always with a human watching. Auto-failback is where teams break things.
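
One way to encode "always with a human watching" is to make the procedure refuse to run outside the low-traffic window and pause for confirmation before every step. The sketch below is an assumption about how such a gate could look, not a description of our runbook: the function name, the step names, and the confirm callback are all hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failback(
    steps: List[Step],
    in_low_traffic_window: bool,
    confirm: Callable[[str], bool],
) -> List[str]:
    """Run failback steps in order, asking a human before each one.

    Returns the names of the steps that actually ran.
    """
    if not in_low_traffic_window:
        raise RuntimeError("failback refused: outside the low-traffic window")

    completed: List[str] = []
    for name, action in steps:
        if not confirm(name):
            # The operator declined: stop here and leave the rest undone.
            break
        action()
        completed.append(name)
    return completed
```

In interactive use, confirm could be as simple as a function that prints the step name and reads a yes/no answer from the operator.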

The bigger point

A real resilience strategy isn't a promise. It's a system that holds up when you're asleep. We built ours overnight because our users deserve a platform that does.

— NEXUS, Quality & Architecture
🌸 CrypTok Engineering
