The hub is designed to scale from today’s ≤ 50 active nodes to six-figure operator counts; a half-million-participant ceiling is the explicit north star. Seven tiers get us there, each triggered by real utilisation, not speculation. Premature scaling is how projects accumulate complexity without customers; this plan refuses to pay for complexity before the customer count demands it.
16.1 · Where we are today (honest audit)
We ship T2-capable from day one. The audit below enumerates every T2 ceiling-raising requirement against the current codebase. All gaps closed as of v0.7.8z14; headroom to T2’s ~350-node ceiling is intact.
| T2 requirement | Status | Evidence |
|---|---|---|
| SQLite WAL mode | ✓ done | db/index.ts:22 — db.pragma('journal_mode = WAL') |
| Concurrent job execution | ✓ done | worker/index.ts MAX_PARALLEL_JOBS=32 with per-node lock + per-kind caps (v0.7.8q) |
| Probe result cache | ✓ done | node_chainweb_tip table populated every 30s by the tip-poller (v0.7.8z9) |
| Bulk-probe scheduler with rate-limit | ✓ done | lib/chainweb-tip-poller.ts — 8-way concurrency cap, 30s cadence, re-entrancy guard |
| Earnings snapshot pagination | ✓ done | pages/admin/earnings.tsx NODE_PAGE_SIZE=20; single-active scope simplifies cross-account view |
| SSH connection pool (multiplexed, persistent) | ✓ done | lib/ssh.ts in-module pool (v0.7.8z14). Idle TTL 5 min, max age 1 h, reaper on 60s interval. |
| Rich List materialised hourly | ✓ done | lib/rich-list-mv.ts + migration 035 (v0.7.8z14). Worker refreshes rich_list_mv every hour. |
Conclusion. The hub is fully T2 as of v0.7.8z14. The per-tier deep-dive below is the roadmap for every step from here to half a million.
16.2 · The ladder at a glance
| Tier | Target scale | Headline change |
|---|---|---|
| T1 | ≤ 50 nodes | Single hub process, SQLite, direct SSH per op. Baseline. |
| T2 · current | ≤ 350 nodes | WAL + parallel workers + probe cache + bulk-probe scheduler + SSH pool + rich-list MV. WE ARE HERE (fully). |
| T3 | ≤ 1 500 nodes | Job queue + job logs on disk + composite indexes on hot paths. Earnings streams move to SSE. |
| T4 | ≤ 7 000 nodes | Postgres replaces SQLite. Redis cache + BullMQ queue. OpenTelemetry. Multiple worker hosts per hub. |
| T5 | ≤ 30 000 nodes | Federated hubs per region; cross-region reconciliation; regulatory data-residency support. |
| T6 | ≤ 200 000 nodes | Agent-pull protocol: nodes pull their own work from the hub instead of hub-initiated SSH. |
| T7 | 500 000+ nodes | Global coordinator + horizontal hub fleet. Daily Stoicism mint sharded across 10 StoaChain chains (~5 min sweep). |
16.3 · T1 — the baseline (≤ 50 nodes)
Trigger: day one.
What breaks past ~50: single SQLite writer serialises every job + probe on the same connection. SSH handshakes dominate probe time. The landing page starts lagging on first paint.
What T1 ships: one Node.js process, SQLite single-file DB, direct SSH per job, no cache layer, no queue. Simplest thing that works.
Status: retired. We jumped directly to T2 during v0.7.8 work because the gap cost was small and the T2 ceiling is 7× larger.
16.4 · T2 — parallelism + caching (≤ 350 nodes) · CURRENT
Trigger: T1 saturation. Hit at ~15 nodes during v0.7.6–v0.7.8 work because concurrent benchmark + probe pressure serialised the writer.
What breaks past ~350: the in-process job queue grows unbounded under sustained multi-node activity; SSH pool warm-hit ratio drops once per-node activity falls below pool idle TTL; earnings-page pagination hits the DB twice per page.
What T2 shipped:
- SQLite WAL. Readers don’t block writers, writers don’t block readers. Single-line pragma.
- Concurrent workers. MAX_PARALLEL_JOBS = 32, per-node lock prevents two jobs touching the same target, per-kind caps stop one kind saturating every slot.
- Chainweb tip cache. node_chainweb_tip table refreshed every 30 s by the tip-poller. Page reads never SSH.
- Bulk-probe scheduler with rate-limit. 8-way concurrency cap, re-entrancy guard so an overrun tick can’t stomp the previous one.
- SSH connection pool. Persistent ssh2 connections keyed by user@host:port; idle TTL 5 min, max age 1 h, 60 s reaper. Handshake cost amortises to zero.
- Rich-list materialised view. rich_list_mv refreshed hourly; aggregate page read becomes a single-row lookup.
- Single-active-scope session model. One email is "active" at a time; cross-account views simplify + queries stay bounded.
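The concurrent-worker rules in the list above (a global cap, a per-node lock, per-kind caps) can be sketched as a small admission check. This is a minimal illustration, not the code in worker/index.ts: only MAX_PARALLEL_JOBS = 32 comes from the text, while the job kinds, the KIND_CAPS values, and the JobAdmission class are hypothetical.

```typescript
// Hypothetical job kinds; the real set of kinds is not listed in this chapter.
type JobKind = "probe" | "benchmark" | "backup";

interface Job {
  id: number;
  nodeId: string;
  kind: JobKind;
}

const MAX_PARALLEL_JOBS = 32; // from the T2 audit
// Assumed per-kind caps, chosen for illustration only.
const KIND_CAPS: Record<JobKind, number> = { probe: 16, benchmark: 8, backup: 8 };

class JobAdmission {
  private running: Job[] = [];

  /** Returns true and records the job if it is allowed to start right now. */
  tryStart(job: Job): boolean {
    // Global cap: never exceed the total number of slots.
    if (this.running.length >= MAX_PARALLEL_JOBS) return false;
    let sameKind = 0;
    for (const r of this.running) {
      // Per-node lock: at most one job may touch a given node.
      if (r.nodeId === job.nodeId) return false;
      if (r.kind === job.kind) sameKind++;
    }
    // Per-kind cap: one kind cannot occupy every slot.
    if (sameKind >= KIND_CAPS[job.kind]) return false;
    this.running.push(job);
    return true;
  }

  /** Releases the slot and the node lock held by the job. */
  finish(jobId: number): void {
    this.running = this.running.filter((r) => r.id !== jobId);
  }
}
```

A rejected job simply stays queued and is retried on the next scheduling pass; the check itself holds no timers and does no I/O.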
Status: done as of v0.7.8z14. Headroom to 350 active nodes.
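The SSH pool's eviction rules (idle TTL 5 min, max age 1 h, checked by a 60 s reaper) reduce to a small predicate over two timestamps. The sketch below assumes a hypothetical PoolEntry shape and leaves out the ssh2 connection itself; it shows the bookkeeping only, not the contents of lib/ssh.ts.

```typescript
const IDLE_TTL_MS = 5 * 60 * 1000; // idle TTL 5 min
const MAX_AGE_MS = 60 * 60 * 1000; // max age 1 h

// Hypothetical entry shape; a real entry would also hold the live connection.
interface PoolEntry {
  createdAt: number; // epoch ms when the connection was opened
  lastUsedAt: number; // epoch ms of the most recent use
}

/** Pool key in the user@host:port form described in the text. */
function poolKey(user: string, host: string, port: number): string {
  return `${user}@${host}:${port}`;
}

/** An entry expires when it has been idle too long or has simply lived too long. */
function isExpired(entry: PoolEntry, now: number): boolean {
  return now - entry.lastUsedAt > IDLE_TTL_MS || now - entry.createdAt > MAX_AGE_MS;
}

/** One reaper pass: removes expired entries and returns their keys. */
function reap(pool: Map<string, PoolEntry>, now: number): string[] {
  const evicted: string[] = [];
  pool.forEach((entry, key) => {
    if (isExpired(entry, now)) evicted.push(key);
  });
  // Delete after the scan; a real pool would also close each connection here.
  evicted.forEach((key) => pool.delete(key));
  return evicted;
}
```

In this sketch the reaper would be driven by something like a 60 s interval timer calling reap(pool, Date.now()).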
16.5 · T3 — queue + observability seams (≤ 1 500 nodes)
Trigger: 350-node ceiling crossed for 7 sustained days. No panic-scaling: crossing the ceiling for a day doesn’t move us.
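The "7 sustained days" rule can be made mechanical. The sketch below assumes one active-node count per day, oldest first, and reads "sustained" as seven consecutive days strictly above the ceiling; the function name and the input shape are illustrative assumptions.

```typescript
/**
 * True only when each of the most recent `days` daily counts exceeds the ceiling.
 * `dailyActiveNodes` is assumed to hold one count per day, oldest first.
 */
function ceilingCrossedSustained(
  dailyActiveNodes: number[],
  ceiling: number,
  days: number = 7
): boolean {
  if (dailyActiveNodes.length < days) return false; // not enough history yet
  const recent = dailyActiveNodes.slice(-days);
  return recent.every((count) => count > ceiling);
}
```

A single spike, or a run interrupted by one day at or below the ceiling, does not fire the trigger, which matches the "no panic-scaling" intent.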
What breaks past ~1 500: job logs fight structured state for WAL write budget — the two share one SQLite file; slow-query log on stoicism_events + jobs shows unindexed paths under load; live earnings polling costs a full page refresh per operator.
What T3 lands:
- Job logs off the DB. Each job gets a flat file under data/jobs/<id>.log; the DB keeps only the structured state row. WAL write budget frees up for hot paths.
- Composite indexes. Driven by slow-query logging; concrete candidates: (owner_email, ran_at) on stoicism_events; (kind, status, scheduled_at) on jobs; (node_id, kind, created_at) on backups.
- JobQueue seam. Today’s in-process queue extracted into a named module with a pluggable back-end. T4’s BullMQ swap becomes a one-module change.
- DbAdapter seam. The getDb() singleton grows behind an interface: methods are the same, implementation is SQLite today and Postgres-pg next tier.
- Live earnings via SSE. Hub pushes Stoicism delta events as they happen instead of operators re-polling.
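For the SSE item, what the hub writes to each open connection is a small text frame. The sketch below follows the standard server-sent events wire format (id, event, and data fields, terminated by a blank line); the StoicismDelta payload shape and the event name used in the usage note are assumptions, not the hub's actual schema.

```typescript
// Hypothetical payload; the real delta record is not specified in this chapter.
interface StoicismDelta {
  account: string;
  delta: number;
  ranAt: string; // ISO timestamp
}

/**
 * Formats one server-sent event frame.
 * JSON.stringify without an indent argument emits no raw newlines,
 * so the payload always fits on a single data: line.
 */
function toSseFrame(eventId: number, eventName: string, payload: StoicismDelta): string {
  return `id: ${eventId}\nevent: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}
```

On the operator side, a browser EventSource would receive each frame as an event named, for example, "stoicism-delta", and can resume from the last id after a reconnect.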
Status: planned. Seams partially in place (job queue already abstracted behind lib/jobs).
16.6 · T4 — Postgres + multi-worker (≤ 7 000 nodes)
Trigger: 1 500-node ceiling crossed for 7 sustained days; at this scale the hub is a real business, not a tool.
What breaks past ~7 000: SQLite’s single-writer model hits a hard wall — no amount of WAL tuning changes that only one process at a time mutates state; the single-leader worker lease becomes the whole-hub throughput ceiling; full-text and multi-dimensional query patterns (aggregate earnings by region, historical timeseries) are already painful on SQLite by this point.
What T4 lands:
- Postgres replaces SQLite. The DbAdapter seam introduced at T3 makes this a connection-string change, not a port.
- Redis. Fronts SSH pool cache + SSE fanout; retires the in-process single-leader pattern.
- BullMQ job queue. Multiple worker processes, multiple hosts; the lease pattern generalises to a Redis primitive.
- Structured observability. OpenTelemetry traces on every SSH + DB call; structured logs to Loki (or equivalent); production incident triage target: under 15 minutes.
- Read replicas. Admin dashboard queries go to a replica; write path stays on the primary.
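Two of the items above are routing decisions: which backend a connection string selects, and whether a query goes to the primary or a replica. The sketch below is one plausible shape for both, under stated assumptions: the function names are hypothetical, any non-Postgres string is treated as a SQLite file path, and reads are detected with a deliberately naive SELECT check.

```typescript
type Backend = "sqlite" | "postgres";
type Target = "primary" | "replica";

/** Picks the backend from the connection string alone (assumed convention). */
function backendFor(connectionString: string): Backend {
  if (/^postgres(ql)?:\/\//i.test(connectionString)) return "postgres";
  return "sqlite"; // anything else is treated as a SQLite file path
}

/**
 * Sends admin-dashboard reads to a replica and everything else to the primary.
 * Naive on purpose: statements such as SELECT ... FOR UPDATE would need
 * an explicit override in a real router.
 */
function routeQuery(sql: string, fromAdminDashboard: boolean): Target {
  const isRead = /^\s*select\b/i.test(sql);
  return isRead && fromAdminDashboard ? "replica" : "primary";
}
```

Keeping both decisions as pure functions means call sites never name a backend or a host, which is the property the DbAdapter seam is meant to preserve.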
Status: designed-for, not built. Every abstraction introduced before T4 is sized so T4 is incremental, not a rewrite.
16.7 · T5 — federated hub fleet (≤ 30 000 nodes)
Trigger: 7 000-node ceiling + regulatory pressure (GDPR data residency, regional tax attestation, regional latency SLAs).
What breaks past ~30 000: a single hub hosted in one country is a geopolitical single point of failure and a regulatory headache; cross-continent SSH latency becomes meaningful for probe budgets; compliance work can’t be batched — different jurisdictions want different behaviour.
What T5 lands:
- One hub per region. EU, NA, APAC, LATAM, AF. Nodes register against the regionally-closest hub.
- Cross-region reconciliation. A thin federation coordinator component replicates authoritative tables (accounts, Stoicism ledger) between hubs; writes stay local, reads are globally consistent within a lag budget.
- Data-residency policy per region. A GDPR-pinned operator’s data never leaves the EU hub’s database.
- Region-aware benchmark baselines. EU nodes benchmarked against EU reference loads; bandwidth-expectations adjusted for regional infra norms.
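Node registration at T5 combines two rules from the list above: prefer the regionally closest hub, but never move a residency-pinned operator. A minimal sketch, assuming "closest" is judged by measured latency and that an operator record carries an optional pinnedRegion field (both are assumptions for illustration):

```typescript
type Region = "EU" | "NA" | "APAC" | "LATAM" | "AF";

const REGIONS: Region[] = ["EU", "NA", "APAC", "LATAM", "AF"];

// Hypothetical operator record; pinnedRegion models a data-residency pin.
interface Operator {
  email: string;
  pinnedRegion?: Region;
}

/** Residency pins win; otherwise the lowest measured latency wins. */
function assignRegion(op: Operator, latencyMs: Partial<Record<Region, number>>): Region {
  if (op.pinnedRegion !== undefined) return op.pinnedRegion;
  let best: Region | undefined;
  let bestMs = Infinity;
  for (const region of REGIONS) {
    const ms = latencyMs[region];
    if (ms === undefined) continue; // no measurement for this region
    if (ms < bestMs) {
      best = region;
      bestMs = ms;
    }
  }
  if (best === undefined) throw new Error("no latency measurements available");
  return best;
}
```

Checking the pin first keeps the residency guarantee independent of network conditions: a GDPR-pinned operator lands on the EU hub even when another region measures faster.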
Status: architectural sketch. Not built; named so the T4 abstractions don’t paint us into a corner.
16.8 · T6 — agent-pull protocol (≤ 200 000 nodes)
Trigger: hub-initiated SSH fanout becomes a scaling ceiling — even at 8-way concurrency + persistent pool, cycling through every node for a routine probe is minutes, not seconds.
What breaks past ~200 000: outbound SSH fanout doesn’t scale linearly — residential routers rate-limit outbound connections, cloud egress costs grow, and nodes behind NAT/CGNAT become unreachable from the hub without ugly workarounds.
What T6 lands:
- Agent on every node. Lightweight Node.js or Rust binary; maintains a long-poll or WebSocket to its regional hub.
- Inverted probe pattern. The agent pushes its heartbeat + metrics + argv; the hub never SSHes for routine ops. SSH remains for privileged administrative actions (bootstrap, key-seat, recovery).
- NAT traversal for free. Outbound connections from the node get through NAT and most firewalls; the hub exposes a single inbound port (the agent control channel).
- Agent self-update. Hub signs new agent binaries; agent verifies, downloads, exec-replaces. OTA fleet updates.
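The inverted pattern means the hub only queues work and waits for agents to ask for it. The sketch below models the hub side with in-memory maps; the AgentWorkQueue class, the WorkItem shape, and the idea that a pull doubles as a heartbeat are illustrative assumptions, since T6 is not yet designed in detail.

```typescript
// Hypothetical unit of work handed to an agent.
interface WorkItem {
  id: number;
  kind: string;
}

class AgentWorkQueue {
  private pending = new Map<string, WorkItem[]>();
  private lastSeen = new Map<string, number>();

  /** Hub side: queue work for a node without opening any connection to it. */
  schedule(nodeId: string, item: WorkItem): void {
    const queue = this.pending.get(nodeId) ?? [];
    queue.push(item);
    this.pending.set(nodeId, queue);
  }

  /** Agent side: fetch up to `max` items; the call also counts as a heartbeat. */
  pull(nodeId: string, now: number, max: number = 10): WorkItem[] {
    this.lastSeen.set(nodeId, now);
    const queue = this.pending.get(nodeId) ?? [];
    return queue.splice(0, max); // removes the returned items from the queue
  }

  /** A node is considered alive if it pulled within the timeout window. */
  isAlive(nodeId: string, now: number, timeoutMs: number): boolean {
    const seen = this.lastSeen.get(nodeId);
    return seen !== undefined && now - seen <= timeoutMs;
  }
}
```

Because every interaction starts at the node, the same code path serves nodes behind NAT or CGNAT, and liveness falls out of the pull cadence instead of a separate probe.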
Status: research. Not designed in detail; T5 abstractions are shaped so an agent is a drop-in for the SSH path where it makes sense.
16.9 · T7 — global coordinator (500 000+ nodes)
Trigger: half-a-million participants. Individual regional hubs saturate; one global view becomes necessary for anti-sybil and aggregate economics, but building it at regional level means cross-region round-trips on every read.
What breaks past ~500 000: conflict resolution between regional hubs becomes the bottleneck — two hubs updating the same global account concurrently needs a tie-breaker beyond wall-clock timestamps; daily Stoicism mint coordination across regions needs a single authority to sequence the batched txs.
What T7 lands:
- Global coordinator process. Arbitrates global state; thin — purely the tie-breaker + mint-sequencer, not on the critical path for operator-facing reads.
- Horizontal hub fleet within each region. Several hub instances per region share a Postgres + BullMQ cluster; behind a load balancer for operator login.
- Daily mint sharded across 10 StoaChain chains. The flagship tier. Hub computes per-account Stoicism deltas, shards accounts by the chain that owns their register, emits one batched update-registers tx per chain in parallel. 500 000 accounts ÷ 10 chains = ~50 000 accounts per chain; at ~2 M gas / tx, every chain completes its sweep in under 5 minutes wall-clock; global sweep ~5 minutes. ~80 transactions/day globally; chainweb absorbs this without effort.
- Validator-network backed reads. By T7 the Wave-2 validator fleet is operational; explorer reads, IPFS pins, UI hosting all come off the validator network, not the hub. The hub shrinks back to its core: operator CRM, scoring engine, mint orchestrator.
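The mint sweep described above is a group-then-batch step. The sketch below groups per-account deltas by owning chain and splits each group into transaction-sized batches. The Delta shape, the planMint name, and the batch size are assumptions. In particular, ~80 transactions/day over 10 chains suggests about 8 transactions per chain, or roughly 6 250 accounts per transaction; that is an inference from the figures in the text, not a stated limit.

```typescript
const CHAIN_COUNT = 10; // StoaChain chains, from the text

// Hypothetical per-account delta record.
interface Delta {
  account: string;
  chainId: number; // chain that owns the account's register, 0..9
  delta: number;
}

/**
 * Returns, for each chain, the list of batches to submit.
 * result[chainId][batchIndex] holds the deltas for one batched transaction.
 */
function planMint(deltas: Delta[], maxPerTx: number): Delta[][][] {
  if (maxPerTx < 1) throw new Error("maxPerTx must be at least 1");
  const perChain: Delta[][] = [];
  for (let c = 0; c < CHAIN_COUNT; c++) perChain.push([]);
  for (const d of deltas) {
    if (!Number.isInteger(d.chainId) || d.chainId < 0 || d.chainId >= CHAIN_COUNT) {
      throw new Error(`invalid chainId ${d.chainId} for ${d.account}`);
    }
    perChain[d.chainId].push(d);
  }
  // Chunk each chain's deltas into batches of at most maxPerTx accounts.
  return perChain.map((items) => {
    const batches: Delta[][] = [];
    for (let i = 0; i < items.length; i += maxPerTx) {
      batches.push(items.slice(i, i + maxPerTx));
    }
    return batches;
  });
}

/** Total number of transactions in a plan. */
function txCount(plan: Delta[][][]): number {
  return plan.reduce((sum, batches) => sum + batches.length, 0);
}
```

Under these assumptions, 50 000 accounts per chain at 6 250 per transaction gives 8 transactions per chain and 80 in total, and each chain's batches can be submitted independently of the others.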
Status: architecturally named; designed for but intentionally unbuilt. Every earlier-tier abstraction is sized so T7 is a set of additions, not a rewrite.
16.10 · Why the ladder tops out at half a million
StoaChain’s 10 chains at ~2 M gas per tx absorb the daily Stoicism mint directly — via register aggregation, not ZK. One mint-and-register-update transaction per chain batches thousands of account deltas into a single on-chain write: the hub computes the daily delta for every account, sorts by the chain that owns each account’s register, and emits one batched update-registers call per chain. At 500 k accounts sharded across the chains, a daily sweep finishes in under 5 minutes wall-clock with ~80 transactions/day globally.
No SNARKs, no Merkle-root client-redeem dance, no off-chain attestation. Register aggregation is the dumb-scalable path: plain Pact math inside a single tx, bounded by gas not cryptography. StoaChain’s gas ceiling is what makes the plan honest beyond T4; without it, a T6-era Merkle-attestation pivot would be necessary at much lower scale. We don’t need it.
Proof of work + chainweb + register aggregation = the half-million ceiling. This is what the former Kadena LLC team built; the hub inherits its scalability for free by leaning on chainweb at the protocol layer instead of reinventing sharding at the application layer.
16.11 · Design discipline from day one
Even at T1 the codebase is careful about module boundaries so tier jumps are cheap. The abstractions introduced before we need them:
- SshPool: ships as a single-connection passthrough at T1; grows into a real pool at T2 (now). Call sites don’t change.
- ProbeCache: already a real thing for chainweb tips (node_chainweb_tip); extends naturally to other fields (flags, capacity) when needed.
- JobQueue: the lib/jobs helpers form the seam; swapping to BullMQ at T4 is a one-module change, not a rewrite.
- DbAdapter: formalises at T3; the getDb() singleton is the current surface and abstracts cleanly.
- AgentChannel: named for T6; drops in where the SSH path is routine rather than privileged.
Full internal planning doc lives in plans/v0.8-hub-scalability.md. This public chapter is authoritative for external consumption; the internal doc carries speculative reasoning + benchmarks that don’t need to be in the public surface.