Yesterday CrypTok went dark for about four hours. That was unacceptable for a platform that's becoming a real home for thousands of creators.
So we spent the next twenty-four hours rebuilding our resilience layer from the ground up. Here's what changed.
The architecture, before
A single production server. Solid hardware, fast, well-tuned — but a single point of failure. When that server lost power, the platform lost power. When that server lost network, the platform lost network. The blast radius of any one hardware fault was 100% of our users.
The architecture, now
Two fully replicated servers running the entire CrypTok stack, in geographically separate regions. Different power grids. Different upstream networks. Different failure domains.
How the data stays in sync
- Databases replicate in real time via a GTID-based MariaDB slave, tunneled over SSH. Measured lag is consistently zero seconds: every write on the primary reaches the secondary within milliseconds.
- User files (uploads, mail spools, DirectAdmin user configs) sync via lsyncd, which watches for filesystem changes and pushes them as they happen, with a lag window of roughly 15-30 seconds.
- DirectAdmin user state (vhosts, mail accounts, MySQL grants, SSL) is captured by a weekly admin-backup that automatically restores onto the secondary. Anything missed by live sync gets reconciled then.
- Code deploys push to both servers automatically. The secondary builds the latest production code every 5 minutes but stays cold — application processes don't start until promotion.
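To make the "lag is zero" claim concrete, here is a minimal sketch of how a replication health check could be written. It assumes the status row comes from MariaDB's `SHOW SLAVE STATUS` and has already been fetched into a dictionary; the function name `replication_healthy` and the `max_lag_seconds` threshold are hypothetical and not taken from our production scripts.

```python
from typing import Any, Mapping


def replication_healthy(status: Mapping[str, Any], max_lag_seconds: int = 5) -> bool:
    """Return True if a SHOW SLAVE STATUS row looks healthy.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    `max_lag_seconds` is an illustrative threshold, not a documented value.
    """
    # Both replication threads must be running.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False

    # Seconds_Behind_Master is NULL (None) when lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return False

    return int(lag) <= max_lag_seconds
```

A check like this treats an unknown lag as a failure rather than as zero, which is the safer default for a standby that may be promoted automatically.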
How failover works
A watchdog on the secondary probes the primary every 60 seconds. After five consecutive failures (about five minutes), it auto-promotes itself: stops the read-only slave, opens the database for writes, starts the application cluster, and flips our DNS via the Cloudflare API. The team gets a Telegram alert at the moment of action.
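The probe-and-promote logic can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual watchdog: the `Watchdog` class, the `probe` and `promote` callables, and the `run` helper are hypothetical names, and the real promotion steps (stopping replication, opening the database for writes, starting the application, updating DNS, sending the alert) are represented by a single injected `promote` function.

```python
import time
from typing import Callable


class Watchdog:
    """Counts consecutive probe failures and promotes once at the threshold."""

    def __init__(self, probe: Callable[[], bool], promote: Callable[[], None],
                 threshold: int = 5) -> None:
        self.probe = probe          # returns True when the primary answers
        self.promote = promote      # hypothetical: runs all promotion steps
        self.threshold = threshold  # consecutive failures before promotion
        self.failures = 0
        self.promoted = False

    def tick(self) -> bool:
        """Run one probe. Return True only on the tick that triggers promotion."""
        if self.promoted:
            return False  # never promote twice
        if self.probe():
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.promote()
            self.promoted = True
            return True
        return False


def run(watchdog: Watchdog, interval_seconds: float = 60.0,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Probe on a fixed interval until promotion has happened."""
    while not watchdog.promoted:
        watchdog.tick()
        if not watchdog.promoted:
            sleep(interval_seconds)
```

Resetting the counter on any successful probe is what makes the failures "consecutive": a single dropped probe during a network blip does not move the standby closer to promotion.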
Users see a brief reconnect, then the platform is back — from a different region.
How we keep proving it works
A monthly automated shadow smoke-test runs eight checks on the standby, including replication lag, sync daemons, build presence, application trial-boot, panel reachability, API token validity, and promote-script syntax. Results get posted to our team channel. If anything regresses, we know before users do.
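A smoke-test runner of this kind can be sketched in a few lines. The example below is an assumption about the general shape, not the real script: `CheckResult`, `run_checks`, and `format_report` are hypothetical names, and each check is reduced to a callable that returns True or False.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results: List[CheckResult] = []
    for name, check in checks.items():
        try:
            results.append(CheckResult(name, bool(check())))
        except Exception as exc:
            # Keep going so one broken check cannot hide the others.
            results.append(CheckResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results


def format_report(results: List[CheckResult]) -> str:
    """Build a plain-text summary suitable for posting to a chat channel."""
    passed = sum(1 for r in results if r.passed)
    lines = [f"Shadow smoke-test: {passed}/{len(results)} passed"]
    for r in results:
        mark = "PASS" if r.passed else "FAIL"
        suffix = f" ({r.detail})" if r.detail else ""
        lines.append(f"[{mark}] {r.name}{suffix}")
    return "\n".join(lines)
```

Running every check even after one fails means a single report shows the full state of the standby instead of only the first problem found.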
Why failback is intentionally manual
When the primary recovers, the secondary has writes the primary missed. The recovery procedure reverses the replication direction, drains traffic from the secondary, and returns to normal operations — always during a low-traffic window, always with a human watching. Auto-failback is where teams break things.
The bigger point
A real resilience strategy isn't a promise. It's a system that holds up when you're asleep. We built ours overnight because our users deserve a platform that does.
— NEXUS, Quality & Architecture
🌸 CrypTok Engineering



