Critical risks fixed: 5 · Medium risks mitigated: 6 · Build phases: 7 · Architecture version: v2
| Layer | Components |
| --- | --- |
| Presentation | Next.js 14 monorepo · Middleware tenant routing · Server-side contact API · CMS panel · Super admin |
| API / Logic | Next.js API routes · Provisioning saga engine · BullMQ job queue · Reconciliation cron · Subscription lifecycle |
| Data | PostgreSQL 15 · PgBouncer (Phase 1) · Schema-per-tenant · RLS policies · Field-level encryption · Redis |
| Payment | Razorpay Orders API · Daily reconciliation job · Webhook + HMAC verify · Idempotency keys |
| Infrastructure | Coolify + Traefik · Docker containers · Cold standby VPS · WAL → secondary region · MinIO + offsite backup · Let's Encrypt SSL |
| Observability | Sentry · Uptime checks · Staging environment · Job failure alerts · Structured logs |
| Compliance | DPDP consent flow · Encrypted PII fields · Data export API · Breach procedure · DPA template |
| v1 decision | Problem | v2 fix |
| --- | --- | --- |
| Sequential webhook handler | Partial failure = paid customer, no site | Saga state machine with resumable steps |
| No payment reconciliation | Silent revenue leak if webhook never arrives | Daily Razorpay API diff job |
| PgBouncer deferred to Phase 7 | Connection exhaustion at ~50 tenants | PgBouncer transaction mode from Phase 1 |
| No staging environment | Bad deploy hits all tenants simultaneously | Staging branch + Coolify staging env |
| Single VPS, no DR plan | Any hardware/network event = total outage | Cold standby + WAL to secondary region + defined RTO/RPO |
| PAN/Aadhaar in plaintext | DPDP Act violation, UIDAI compliance risk | AES-256 field encryption, Aadhaar verify-and-discard |
| EmailJS on frontend | Credential exposure, unprofessional signal | Server-side /api/contact with SMTP relay |
| Cron worker, no job queue | One crash silently kills all automation | BullMQ on Redis with retry + dead-letter queue |
| "60s" promise for custom domains | SSL issuance takes 15–90s, user feels misled | Subdomain: 60s. Custom domain: async, user notified when ready |
Core principle: Every provisioning step must be idempotent. The saga persists its state to the database after each step. If the process crashes at step 6, it resumes from step 6 — not from step 1.
-- New table in shared schema
CREATE TABLE provisioning_jobs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id    TEXT UNIQUE NOT NULL,
  tenant_id     UUID,
  status        TEXT NOT NULL DEFAULT 'pending',
    -- ENUM: pending | schema_created | content_seeded | domain_assigned | email_sent | complete | failed
  last_step     TEXT,
  error_msg     TEXT,
  attempt_count INT DEFAULT 0,
  created_at    TIMESTAMPTZ DEFAULT now(),
  updated_at    TIMESTAMPTZ DEFAULT now()
);
01
Webhook received → verify HMAC-SHA256 signature
Reject invalid. Return 200 immediately to Razorpay — don't make Razorpay wait for provisioning. Push a job to BullMQ queue instead of processing inline.
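The signature check in this step can be sketched as below. This is a minimal sketch assuming Razorpay's documented scheme (hex HMAC-SHA256 of the raw body with the webhook secret, delivered in the `x-razorpay-signature` header); verify the header name against Razorpay's docs before relying on it.

```typescript
// Constant-time HMAC-SHA256 verification for step 01.
import { createHmac, timingSafeEqual } from 'node:crypto';

export function signBody(rawBody: string, webhookSecret: string): string {
  return createHmac('sha256', webhookSecret).update(rawBody).digest('hex');
}

export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  webhookSecret: string,
): boolean {
  const expected = Buffer.from(signBody(rawBody, webhookSecret), 'utf8');
  const received = Buffer.from(signatureHeader, 'utf8');
  // timingSafeEqual throws on length mismatch, so guard first;
  // the constant-time compare prevents timing-based signature guessing
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The raw body (not the parsed JSON) must be fed to the HMAC, which usually means disabling body parsing on the webhook route.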
02
BullMQ worker picks up provisioning job
Worker creates a provisioning_jobs row with status=pending. If row already exists for this payment_id → idempotency exit. This is now the saga's persistent record.
03
Create tenant record + subscription row
Insert into tenants and subscriptions. Update provisioning_jobs.last_step = 'tenant_created'. All DB operations in this step are inside a single Postgres transaction — atomic and rollbackable.
04
Create tenant schema + run migrations
Execute CREATE SCHEMA tenant_{uuid} and apply table migrations from a pre-tested migration template. Update last_step. Schema creation is idempotent: CREATE SCHEMA IF NOT EXISTS.
05
Seed default content
Insert default rows into the new schema's tables. Check if rows exist before inserting — idempotent. Update last_step = 'content_seeded'.
06
Record payment
Insert into payments table with idempotency_key = razorpay_payment_id. Uses ON CONFLICT DO NOTHING. Always safe to retry.
07
Assign domain (async — does not block)
For subdomains: immediate (wildcard cert). For custom domains: insert domain record with ssl_status='pending', trigger async Coolify API call, notify tenant separately when SSL is ready. The site goes live at the subdomain immediately regardless.
Custom domain SSL is async. Never promise 60s for custom domains. Show tenant a "your domain is being configured" status in CMS until ssl_status = 'active'.
08
Send welcome email (non-blocking)
Email is pushed to a separate BullMQ email queue. Welcome email failure must NOT fail provisioning. Mark last_step = 'complete' before email fires. Email has its own retry with exponential backoff.
Email is a side-effect, not part of the saga's success condition. Tenant gets their site regardless of whether the email delivers.
09
Mark provisioning_jobs.status = 'complete'
Notify Nimit via Slack webhook (also non-blocking). The saga is done. If any step 3–6 failed, BullMQ retries the whole job — all steps are idempotent so re-running them is safe.
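The resume behaviour described above can be expressed as a pure function. The step names here are illustrative, loosely following the provisioning_jobs status values, not a fixed API:

```typescript
// Resume logic: given the saga's ordered steps and the last step recorded in
// provisioning_jobs.last_step, return the steps that still need to run.
const SAGA_STEPS: string[] = [
  'tenant_created',
  'schema_created',
  'content_seeded',
  'payment_recorded',
  'domain_assigned',
  'complete',
];

export function stepsToRun(lastStep: string | null): string[] {
  if (lastStep === null) return [...SAGA_STEPS]; // fresh job: run everything
  const idx = SAGA_STEPS.indexOf(lastStep);
  if (idx === -1) throw new Error(`unknown saga step: ${lastStep}`);
  return SAGA_STEPS.slice(idx + 1); // a crash after step N resumes at step N+1
}
```

Because every step is idempotent, re-running the whole list is also safe; resuming from `last_step` simply avoids redundant work.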
// workers/provisioning.ts
const provisioningQueue = new Queue('provisioning', { connection: redis });

const worker = new Worker('provisioning', processProvisioningJob, {
  connection: redis,
  concurrency: 5,                          // max 5 parallel provisionings
  limiter: { max: 10, duration: 60000 },   // 10 per minute (Coolify API rate limit)
});

// Retry config: 3 attempts, exponential backoff
const jobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnComplete: false, // keep for audit
  removeOnFail: false,     // keep for debugging
};

// On final failure → alert Nimit + set status='failed' in DB
worker.on('failed', (job, err) => {
  alertNimit({ payment_id: job.data.payment_id, error: err.message });
  updateProvisioningJob(job.data.payment_id, 'failed', err.message);
});
DUPLICATE WEBHOOK
Idempotency check on payment_id before creating provisioning_jobs row. BullMQ deduplication via job ID = payment_id.
PARTIAL FAILURE
Each step is idempotent. Saga resumes from last_step. No double-creation of schema, no duplicate payments row.
COOLIFY API DOWN
Domain assignment step retries independently. Site goes live at subdomain immediately. Custom domain provisioned when Coolify recovers.
EMAIL SMTP DOWN
Welcome email is on its own queue. Provisioning completes regardless. Email retries up to 5 times over 2 hours.
SERVER RESTART MID-SAGA
BullMQ persists jobs in Redis. On restart, worker picks up the incomplete job and resumes from DB-recorded last_step.
RAZORPAY RETRIES
Webhook endpoint returns 200 immediately. BullMQ job deduplication prevents double-provisioning even if Razorpay fires the event twice.
Connection pool architecture
Transaction mode pooling — the correct mode for schema-per-tenant
FROM DAY ONE
# pgbouncer.ini — transaction mode; every transaction must then
# SET LOCAL search_path itself (see note below)
[databases]
saas_db = host=postgres port=5432 dbname=saas pool_size=25

[pgbouncer]
pool_mode = transaction   # NOT session: session mode pins a server connection per client and defeats pooling
max_client_conn = 1000    # Next.js instances can hold many idle connections
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
log_connections = 0       # disable in prod — log spam
Important: With PgBouncer in transaction mode, SET search_path TO tenant_{id} must be called at the start of every transaction, not once per connection. Your DB middleware wrapper must handle this. Session-level search_path leaks between tenants in transaction mode.
Sensitive fields that must be encrypted at rest
AES-256-GCM, application-level. Not just disk encryption.
REQUIRED
| Field | Table | Treatment | Reason |
| --- | --- | --- | --- |
| pan | tenants | AES-256-GCM encrypted | DPDP sensitive personal data |
| aadhaar | tenants | Verify-and-discard. Store only last 4 digits. | UIDAI regulation — must not store full Aadhaar |
| dob | tenants | AES-256-GCM encrypted | DPDP personal data |
| hashed_password | tenants | bcrypt 12 rounds (unchanged) | Correct |
| owner_phone | tenants | AES-256-GCM encrypted | DPDP personal data |
| gst | tenants | AES-256-GCM encrypted | Business identity data |
// lib/encrypt.ts — application-level field encryption
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex'); // 32 bytes; fail fast if unset

export function encrypt(plaintext: string): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv.toString('hex'), tag.toString('hex'), encrypted.toString('hex')].join('.');
}

export function decrypt(ciphertext: string): string {
  const [ivHex, tagHex, dataHex] = ciphertext.split('.');
  const decipher = createDecipheriv('aes-256-gcm', KEY, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  return decipher.update(dataHex, 'hex', 'utf8') + decipher.final('utf8');
}
// lib/db.ts — tenant-scoped query wrapper
export async function withTenantSchema<T>(
  tenantId: string,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Must set search_path inside EVERY transaction in transaction-mode pooling
    await client.query(`SET LOCAL search_path TO tenant_${tenantId}, public`);
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
CATALOG BLOAT PREVENTION
Pre-warm schema migration template at startup. Never run DDL live during provisioning. Use a pre-built schema template dump and COPY it per tenant. Catalog queries stay fast.
CONNECTION CEILING
PgBouncer caps Postgres connections at 25-50 regardless of how many tenants. 500 tenants with 1000 Next.js connections → still only 25 Postgres connections.
SCHEMA COUNT LIMIT
Monitor pg_namespace row count. Alert at 5000 schemas. Plan shard migration before hitting 8000. At current growth rate this is a year+ problem, not a launch problem.
SLOW CATALOG QUERIES
pg_stat_statements enabled from day one. Alert if any information_schema or pg_class query exceeds 50ms p95. Add index on pg_namespace.nspname if needed.
Razorpay ↔ DB reconciliation cron
Runs at 06:00 IST daily. Catches any payment that webhook missed.
NEW IN V2
// cron/reconcile.ts — runs daily 06:00 IST
async function reconcilePayments() {
  // Fetch last 48h of Razorpay payments (overlap buffer)
  const rzPayments = await razorpay.payments.all({
    from: Math.floor(Date.now() / 1000) - 172800,
    count: 100,
  });

  for (const payment of rzPayments.items) {
    if (payment.status !== 'captured') continue;

    const existing = await db.query(
      'SELECT id FROM payments WHERE razorpay_payment_id = $1',
      [payment.id]
    );

    if (!existing.rows.length) {
      // Payment exists in Razorpay but NOT in our DB → missed webhook
      logger.error({ payment_id: payment.id }, 'MISSED_PAYMENT — queuing provisioning');
      await provisioningQueue.add('provision', { payment_id: payment.id }, jobOptions);
      await alertNimit(`Missed payment recovered: ${payment.id}`);
    }
  }
}
Why 48 hours lookback? If the reconciliation job itself fails one day, the next day's run still catches the previous day's missed payments. Single-day lookback creates a gap if the cron fails.
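The diff at the heart of the reconciliation job can be isolated as a pure function. This is a sketch: `RzPayment` mirrors only the two Razorpay fields the job actually reads.

```typescript
// Reconciliation core: captured payments that Razorpay knows about but our
// payments table does not — each one is a missed webhook to re-provision.
interface RzPayment {
  id: string;
  status: string; // 'captured' | 'failed' | 'refunded' | ...
}

export function findMissedPayments(
  razorpayPayments: RzPayment[],
  dbPaymentIds: Set<string>,
): string[] {
  return razorpayPayments
    .filter((p) => p.status === 'captured')  // only settled money matters
    .filter((p) => !dbPaymentIds.has(p.id))  // in Razorpay, missing from our DB
    .map((p) => p.id);
}
```

Keeping the diff pure makes it trivially unit-testable; the cron wrapper supplies the two inputs from the Razorpay API and the payments table.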
A
Happy path: webhook fires, provisioning completes
Customer pays → webhook within seconds → BullMQ job → saga completes → site live in ~60s. No change needed here.
B
Webhook fires but server is briefly down
Razorpay retries webhooks for up to 24 hours with exponential backoff. When server recovers, webhook arrives. BullMQ handles it normally. Reconciliation job at 06:00 acts as a final safety net.
C
Webhook never arrives (Razorpay internal issue)
Reconciliation job catches it within 24 hours. Customer is delayed by up to ~24h in worst case. Alert fires to Nimit with the payment details so manual intervention is possible if needed.
Improve: also poll Razorpay API every 15 minutes for payments made in the last hour. Reduces worst-case delay to 15 minutes for missed webhooks.
D
Customer pays and immediately closes browser
Provisioning is fully server-side triggered by webhook. Browser state is irrelevant. Site provisions regardless. Customer gets welcome email when they check inbox.
E
Renewal webhook arrives for a suspended tenant
Webhook handler checks if tenant exists. If status = 'suspended' → update end_date, set status = 'active', re-enable site. No re-provisioning. Existing schema and content preserved.
F
Refund issued in Razorpay dashboard
Razorpay fires a payment.failed or refund.created event. Handle this webhook: set tenant status = 'suspended', log the refund. V1 document has no refund handling at all.
Add refund.created webhook handler. Suspend tenant, send notification, log in audit_logs. Do NOT delete schema immediately — give 7 days for dispute resolution.
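The status transitions in scenarios E and F can be sketched as one pure function. Event and status names follow the scenarios above; this is illustrative, not the full handler, which would also update end_date, write audit_logs, and schedule the delayed schema drop.

```typescript
// Tenant status transition for subscription lifecycle webhooks.
type TenantStatus = 'active' | 'suspended';

export function nextTenantStatus(current: TenantStatus, event: string): TenantStatus {
  switch (event) {
    case 'payment.captured': // renewal — also re-activates a suspended tenant (scenario E)
      return 'active';
    case 'refund.created':   // refund — suspend, keep schema for 7 days (scenario F)
    case 'payment.failed':
      return 'suspended';
    default:
      return current;        // unknown events never change status
  }
}
```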
Recovery objectives
DEFINED IN V2
| Scenario | RTO | RPO | Recovery method |
| --- | --- | --- | --- |
| Bad deployment | <2 min | 0 (no data loss) | Coolify auto-rollback on health check fail |
| VPS full restart | <5 min | 0 | Docker containers auto-restart. PG data on persistent volume. |
| VPS hardware failure | <4 hours | <24 hours | Restore from nightly pg_dump to cold standby VPS. WAL archiving reduces RPO to <1 hour for paid plans. |
| Datacenter outage | <6 hours | <24 hours | Spin up cold standby in different region from snapshot. Point DNS to new IP. |
| Database corruption | <2 hours | <1 hour (WAL) | PITR from WAL archives in MinIO secondary. Restore to any second in last 7 days. |
# Cold standby VPS runbook (kept up to date, tested monthly)
# 1. VPS provisioned with same specs, Coolify installed, same env vars
# 2. Daily: pg_dump uploaded to MinIO + replicated to secondary MinIO in different region
# 3. WAL archives shipped continuously to secondary region MinIO
# 4. To activate standby:
#    a. Download latest pg_dump + WAL to standby VPS
#    b. pg_restore → PostgreSQL instance
#    c. Apply WAL to reach latest consistent state
#    d. Update DNS A records to standby IP (TTL should be 300s, not 3600s)
#    e. Let's Encrypt certs re-issue on new server
#    f. Estimated downtime: 2–4 hours

# DNS TTL — keep at 300s (5 min), NOT the default 3600s
# High TTL means DNS change takes 1 hour to propagate. 5 min TTL = fast failover.
The problem with multiple Next.js replicas + ISR
Each replica has its own in-memory cache. CMS save on replica A doesn't invalidate cache on replica B.
FIXED IN V2
// Solution: use Redis as the shared cache layer, not in-process Next.js cache

// On CMS content save:
async function savePageContent(tenantId: string, page: string, content: object) {
  await db.query(`UPDATE page_content SET content_json = $1 WHERE page = $2`, [content, page]);

  // Invalidate Redis cache — hits ALL replicas because they all read from same Redis
  await redis.del(`page:${tenantId}:${page}`);

  // Also call Next.js on-demand revalidation endpoint on all replicas
  // Use Next.js revalidateTag() if using App Router caching
  await revalidatePath(`/`); // triggers ISR regeneration
}

// On public website render — read from Redis first:
async function getPageContent(tenantId: string, page: string) {
  const cached = await redis.get(`page:${tenantId}:${page}`);
  if (cached) return JSON.parse(cached);

  const fresh = await db.query(`SELECT content_json FROM page_content WHERE page = $1`, [page]);
  await redis.setex(`page:${tenantId}:${page}`, 60, JSON.stringify(fresh.rows[0]));
  return fresh.rows[0].content_json;
}
Key insight: Redis-backed caching means all replicas share one cache. A Redis DEL is seen by every replica instantly. No stale content, no replica drift.
v1 cron worker risks
REMOVED
Single node-cron container. No retry on failure. No concurrency control. No visibility into job health. One unhandled rejection crashes all automation silently.
v2 BullMQ queues
NEW
Separate queues per job type. Each has independent retry config and dead-letter queue. Failed jobs alert Nimit. Dashboard visibility via Bull Board. Redis persistence survives worker restarts.
// BullMQ queue registry — each job type is isolated
const queues = {
  provisioning:   new Queue('provisioning',   { defaultJobOptions: { attempts: 3, backoff: 'exponential' } }),
  emails:         new Queue('emails',         { defaultJobOptions: { attempts: 5, backoff: { delay: 30000 } } }),
  subscription:   new Queue('subscription',   { defaultJobOptions: { attempts: 2 } }),
  reconciliation: new Queue('reconciliation', { defaultJobOptions: { attempts: 3 } }),
  maintenance:    new Queue('maintenance',    { defaultJobOptions: { attempts: 1 } }),
};

// Scheduled jobs — using BullMQ's built-in repeat
await queues.subscription.add('lifecycle', {}, { repeat: { cron: '0 2 * * *' } });     // 2am IST
await queues.reconciliation.add('daily', {}, { repeat: { cron: '30 6 * * *' } });      // 6:30am IST
await queues.maintenance.add('vacuum', {}, { repeat: { cron: '0 3 * * 0' } });         // Sunday 3am
await queues.maintenance.add('backup-verify', {}, { repeat: { cron: '0 5 1 * *' } });  // 1st of month
Aadhaar storage is the most urgent issue. UIDAI regulations prohibit storing Aadhaar numbers in databases without explicit UIDAI license approval. The correct approach: use Aadhaar for identity verification only (via an OTP or UIDAI API), confirm the user's identity, then discard the full number. Store only the last 4 digits for reference.
1
Aadhaar: verify-and-discard pattern
Remove the Aadhaar column from the tenants table. Instead: accept Aadhaar during onboarding for verification only, confirm identity via DigiLocker or the UIDAI OTP API, then store only aadhaar_last4 (CHAR(4)) for audit purposes.
ALTER TABLE tenants DROP COLUMN aadhaar;
ALTER TABLE tenants ADD COLUMN aadhaar_last4 CHAR(4);
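The verify-and-discard step can be sketched in application code. The actual identity check (DigiLocker or UIDAI OTP) is out of scope here; the point is that only the last 4 digits ever reach the database.

```typescript
// Keep the full Aadhaar in memory just long enough to verify, then retain
// only the fragment that the aadhaar_last4 column may store.
export function extractAadhaarLast4(fullAadhaar: string): string {
  const digits = fullAadhaar.replace(/[\s-]/g, ''); // tolerate "1234 5678 9012" input
  if (!/^\d{12}$/.test(digits)) {
    throw new Error('Aadhaar must be exactly 12 digits');
  }
  return digits.slice(-4); // the only part that may be persisted
}
```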
2
Consent mechanism at onboarding
Registration form must have explicit, unbundled consent checkboxes. "I agree to the Terms" is not DPDP-compliant consent for data processing. Each purpose needs a separate checkbox. Store consent timestamp and IP.
Add consent_logs table: tenant_id, purpose (marketing/billing/operations), consented_at, ip_address, consent_version. Checkbox per purpose, all required.
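The "checkbox per purpose, all required" rule can be sketched as a pure check. Purpose names are taken from the consent design above and are illustrative.

```typescript
// DPDP-style unbundled consent: every purpose needs its own explicit tick.
// A single "I agree to the Terms" checkbox can never satisfy this check.
const REQUIRED_PURPOSES: string[] = ['billing', 'operations', 'marketing'];

export function consentIsComplete(checked: Record<string, boolean>): boolean {
  return REQUIRED_PURPOSES.every((purpose) => checked[purpose] === true);
}
```

On success, the handler would write one consent_logs row per purpose with the timestamp, IP, and consent_version.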
3
Data export API — defined and implemented
DPDP gives individuals the right to access their data. The "data export email" mentioned in v1 must be a real API that generates a structured JSON/CSV export of everything stored under that tenant's identity.
POST /api/tenant/data-export → generates export of tenants row + subscription + payments + all tenant schema content → emails secure download link valid for 48 hours.
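The 48-hour download link can be implemented as a self-expiring HMAC token, so no server-side state is needed to invalidate it. A sketch under assumptions: the token format and signing secret here are illustrative, not the real implementation.

```typescript
// Export download token: tenant id + expiry, signed so neither can be forged
// or extended. The link dies on its own after 48 hours.
import { createHmac } from 'node:crypto';

const LINK_TTL_MS = 48 * 60 * 60 * 1000; // 48 hours

export function makeExportToken(tenantId: string, secret: string, now: number = Date.now()): string {
  const expiresAt = now + LINK_TTL_MS;
  const mac = createHmac('sha256', secret).update(`${tenantId}:${expiresAt}`).digest('hex');
  return `${tenantId}:${expiresAt}:${mac}`;
}

export function verifyExportToken(token: string, secret: string, now: number = Date.now()): boolean {
  const [tenantId, expiresAt, mac] = token.split(':');
  const expected = createHmac('sha256', secret).update(`${tenantId}:${expiresAt}`).digest('hex');
  return mac === expected && now < Number(expiresAt); // wrong MAC or past expiry → reject
}
```

The token goes into the emailed download URL as a query parameter; the download endpoint verifies it before streaming the export.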
4
Data deletion on request
Right to erasure under DPDP. When a tenant requests deletion, all PII in the tenants table must be overwritten (not just the schema dropped). Keep payment records (legal requirement for GST compliance) but anonymise the personal fields.
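The anonymise-don't-delete rule for the tenants row can be sketched as below. Field names mirror the encrypted-fields table above and are illustrative, not the full schema; which fields survive erasure (here `id` and `gst`, kept for GST-compliant payment records) is ultimately a legal judgment.

```typescript
// DPDP erasure: overwrite personal fields in place, keep the identifiers
// that payment and invoice records legally require.
interface TenantPII {
  id: string;                 // kept: payments reference it
  gst: string | null;         // kept: business identity needed for GST records
  pan: string | null;
  dob: string | null;
  owner_phone: string | null;
  aadhaar_last4: string | null;
}

export function anonymiseTenant(row: TenantPII): TenantPII {
  return {
    ...row,
    pan: null,
    dob: null,
    owner_phone: null,
    aadhaar_last4: null,
  };
}
```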
5
Breach notification procedure
DPDP requires notification to the Data Protection Board within 72 hours of discovering a breach. Draft the procedure now: who decides it's a breach, who notifies the Board, what template to use. This cannot be improvised during an incident.
6
Privacy policy and Data Processing Agreement
As a SaaS platform, Nimit is a Data Fiduciary. Each tenant is also a Data Fiduciary for their end customers. A DPA template must be part of the Terms of Service that tenants accept at signup.
-- consent_logs: tracks what each tenant agreed to and when
CREATE TABLE consent_logs (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id       UUID REFERENCES tenants(id),
  purpose         TEXT NOT NULL,            -- 'billing' | 'operations' | 'marketing'
  consented       BOOLEAN NOT NULL,
  consented_at    TIMESTAMPTZ DEFAULT now(),
  ip_address      INET,
  consent_version TEXT NOT NULL             -- version of privacy policy accepted
);

-- data_requests: tracks access/deletion/export requests
CREATE TABLE data_requests (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    UUID REFERENCES tenants(id),
  request_type TEXT NOT NULL,               -- 'export' | 'deletion' | 'correction'
  status       TEXT DEFAULT 'pending',
  requested_at TIMESTAMPTZ DEFAULT now(),
  completed_at TIMESTAMPTZ,
  notes        TEXT
);
The rule: Never take money from a customer until Phase 3 is complete AND the compliance checklist is signed off. Selling subscriptions without DPDP consent mechanisms in place is a regulatory risk.
PHASE 1 Foundation — with real infrastructure Build first
Next.js monorepo setup, Coolify on VPS, staging environment (separate branch + Coolify staging app)
PostgreSQL 15 + PgBouncer in transaction mode — not deferred
Redis for sessions, cache, and job queue persistence
BullMQ worker container with Bull Board admin UI at /admin/jobs
Shared schema tables: tenants, subscriptions, payments, provisioning_jobs, consent_logs
Super admin login, test tenant registration with consent checkboxes
MinIO for object storage + secondary backup location configured
Field encryption utility (lib/encrypt.ts) wired into tenant model
Signal: Nimit logs in to admin. Test registration creates tenant record with encrypted PII. BullMQ Bull Board shows healthy queues.
PHASE 2 Core product — website + CMS Week 2-4
Agency website template (4 pages), multi-tenancy middleware, subdomain routing
CMS panel with all editors, content API with Redis-backed cache layer
Server-side contact form /api/contact (replaces EmailJS)
withTenantSchema() DB wrapper — enforces SET LOCAL search_path on every query
Cache invalidation on CMS save (Redis DEL + Next.js revalidateTag)
Signal: Manually created test tenant edits CMS. Public site reflects changes. 4 Next.js replicas all serve fresh content after CMS save.
PHASE 3 Payments + provisioning saga Revenue gate
Razorpay integration: 3 plan checkout, Orders API, webhook endpoint
Provisioning saga engine — BullMQ-backed, idempotent, resumable from last_step
Refund webhook handler (refund.created → suspend tenant)
Daily reconciliation job — 48h lookback against Razorpay API
Invoice generation + S3 storage
DPDP compliance gate: consent checkboxes live, Aadhaar verify-and-discard, privacy policy published
Renewal subscription lifecycle cron (BullMQ scheduled job)
Signal: Customer pays → site live in 60s. The provisioning_jobs row shows 'complete'. Reconciliation job runs without errors. Intentionally kill the server mid-provisioning → saga resumes on restart.
PHASE 4 Super admin dashboard Week 5-6
Full tenant list: all fields, subscription status, days remaining, payment history
Live/Shutdown toggle, expiry alerts, Razorpay transaction view
Provisioning job monitor — see in-flight and failed provisioning attempts
BullMQ Bull Board embedded in admin panel
Data request management (export / deletion requests from tenants)
PHASE 5 Monitoring + observability Week 7
Sentry: errors, slow transactions, payment webhook failures
Internal uptime checker + external BetterUptime on main domain
Health endpoint: DB, Redis, MinIO, queue depths, worker status
Job failure alerting: any BullMQ job hitting dead-letter queue → Slack alert to Nimit
Structured logging: provisioning, webhooks, cron, admin actions
pg_stat_statements monitoring — alert on catalog query degradation
PHASE 6 Custom domains Week 8
Custom domain input in CMS, async Coolify API provisioning
ssl_status polling: pending → verifying → active, shown in CMS with instructions
Honest UX: "Your domain will be live with HTTPS within 10–15 minutes" — not 60 seconds
DNS instruction generator (A record pointing to server IP, shown inline)
Domain verification check: ping tenant domain, confirm it resolves to our server before requesting cert
PHASE 7 Scale & harden When needed
Load test: 100 concurrent site visitors, target <200ms p95
Security audit: JWT rotation, rate limiting review, dependency scan
Read replica for PostgreSQL (at 500+ tenants)
Cloudflare CDN in front of platform for static asset edge delivery
Cold standby VPS DR test — full restore drill, measure actual RTO
Backup verification monthly cron — restore and verify row counts
Schema catalog monitoring — alert if pg_namespace count exceeds 5000
Note: PgBouncer and BullMQ are no longer in this phase — they were moved to Phase 1 where they belong.