Optimised architecture stack
Presentation: Next.js 14 monorepo · Middleware tenant routing · Server-side contact API · CMS panel · Super admin
API / Logic: Next.js API routes · Provisioning saga engine · BullMQ job queue · Reconciliation cron · Subscription lifecycle
Data: PostgreSQL 15 · PgBouncer (Phase 1) · Schema-per-tenant · RLS policies · Field-level encryption · Redis
Payment: Razorpay Orders API · Daily reconciliation job · Webhook + HMAC verify · Idempotency keys
Infrastructure: Coolify + Traefik · Docker containers · Cold standby VPS · WAL → secondary region · MinIO + offsite backup · Let's Encrypt SSL
Observability: Sentry · Uptime checks · Staging environment · Job failure alerts · Structured logs
Compliance: DPDP consent flow · Encrypted PII fields · Data export API · Breach procedure · DPA template
What changed from v1 and why
| v1 decision | Problem | v2 fix |
|---|---|---|
| Sequential webhook handler | Partial failure = paid customer, no site | Saga state machine with resumable steps |
| No payment reconciliation | Silent revenue leak if webhook never arrives | Daily Razorpay API diff job |
| PgBouncer deferred to Phase 7 | Connection exhaustion at ~50 tenants | PgBouncer transaction mode from Phase 1 |
| No staging environment | Bad deploy hits all tenants simultaneously | Staging branch + Coolify staging env |
| Single VPS, no DR plan | Any hardware/network event = total outage | Cold standby + WAL to secondary region + defined RTO/RPO |
| PAN/Aadhaar in plaintext | DPDP Act violation, UIDAI compliance risk | AES-256 field encryption, Aadhaar verify-and-discard |
| EmailJS on frontend | Credential exposure, unprofessional signal | Server-side /api/contact with SMTP relay |
| Cron worker, no job queue | One crash silently kills all automation | BullMQ on Redis with retry + dead-letter queue |
| "60s" promise for custom domains | SSL issuance takes 15-90s, user feels misled | Subdomain: 60s. Custom domain: async, user notified when ready |
Core principle: Every provisioning step must be idempotent. The saga persists its state to the database after each step. If the process crashes at step 6, it resumes from step 6 — not from step 1.
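The resume behaviour can be sketched as a loop over ordered steps that skips everything already recorded. This is a minimal sketch: the step list and the saveState callback stand in for the real provisioning_jobs persistence.

```typescript
// Minimal saga-resume sketch. Step names and the persistence callback are
// illustrative; the real implementation persists to provisioning_jobs.
export type Step = { name: string; run: () => Promise<void> };

export async function runSaga(
  steps: Step[],
  lastStep: string | null,                    // from provisioning_jobs.last_step
  saveState: (step: string) => Promise<void>  // persists last_step after each step
): Promise<void> {
  // Index of the first step NOT yet completed; an unknown lastStep restarts
  // from the beginning, which is safe because every step is idempotent.
  const startIdx = lastStep === null ? 0 : steps.findIndex(s => s.name === lastStep) + 1;
  for (const step of steps.slice(startIdx)) {
    await step.run();           // each step must itself be idempotent
    await saveState(step.name); // crash after this line → resume at the next step
  }
}
```

If the worker crashes between `run()` and `saveState`, that one step re-runs on resume, which is exactly why every step must be idempotent.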
New: provisioning_jobs table
-- New table in shared schema
CREATE TABLE provisioning_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id TEXT UNIQUE NOT NULL,
  tenant_id UUID,
  -- pending | schema_created | content_seeded | domain_assigned | email_sent | complete | failed
  status TEXT NOT NULL DEFAULT 'pending',
  last_step TEXT,
  error_msg TEXT,
  attempt_count INT DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now()
);
Optimised 9-step saga with failure handling
1. **Webhook received → verify HMAC-SHA256 signature.** Reject invalid signatures. Return 200 immediately to Razorpay — don't make Razorpay wait for provisioning. Push a job to the BullMQ queue instead of processing inline.
2. **BullMQ worker picks up the provisioning job.** The worker creates a provisioning_jobs row with status = 'pending'. If a row already exists for this payment_id → idempotency exit. This row is now the saga's persistent record.
3. **Create tenant record + subscription row.** Insert into tenants and subscriptions, then update provisioning_jobs.last_step = 'tenant_created'. All DB operations in this step run inside a single Postgres transaction — atomic and rollbackable.
4. **Create tenant schema + run migrations.** Execute CREATE SCHEMA tenant_{uuid} and apply table migrations from a pre-tested migration template. Update last_step. Schema creation is idempotent: CREATE SCHEMA IF NOT EXISTS.
5. **Seed default content.** Insert default rows into the new schema's tables, checking whether rows exist before inserting — idempotent. Update last_step = 'content_seeded'.
6. **Record payment.** Insert into the payments table with idempotency_key = razorpay_payment_id, using ON CONFLICT DO NOTHING. Always safe to retry.
7. **Assign domain (async — does not block).** For subdomains: immediate (wildcard cert). For custom domains: insert a domain record with ssl_status = 'pending', trigger an async Coolify API call, and notify the tenant separately when SSL is ready. The site goes live at the subdomain immediately regardless. Custom domain SSL is async: never promise 60s for custom domains. Show the tenant a "your domain is being configured" status in the CMS until ssl_status = 'active'.
8. **Send welcome email (non-blocking).** The email is pushed to a separate BullMQ email queue. A welcome-email failure must NOT fail provisioning: mark last_step = 'complete' before the email fires. The email has its own retry with exponential backoff. Email is a side effect, not part of the saga's success condition — the tenant gets their site regardless of whether the email delivers.
9. **Mark provisioning_jobs.status = 'complete'.** Notify Nimit via Slack webhook (also non-blocking). The saga is done. If any of steps 3–6 failed, BullMQ retries the whole job — all steps are idempotent, so re-running them is safe.
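The signature check in step 1 can be sketched with Node's crypto. The hex HMAC-SHA256-of-raw-body scheme matches Razorpay's documented X-Razorpay-Signature header; treat the surrounding HTTP wiring as an assumption.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a Razorpay webhook: the X-Razorpay-Signature header carries the hex
// HMAC-SHA256 of the raw request body, keyed with the webhook secret.
export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so guard first
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The HTTP handler would call this on the raw (unparsed) body, return 200 immediately on success, and enqueue the BullMQ provisioning job rather than provisioning inline.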
BullMQ worker configuration
// workers/provisioning.ts
const provisioningQueue = new Queue('provisioning', { connection: redis });

const worker = new Worker('provisioning', processProvisioningJob, {
  connection: redis,
  concurrency: 5,                        // max 5 parallel provisionings
  limiter: { max: 10, duration: 60000 }, // 10 per minute (Coolify API rate limit)
});

// Retry config: 3 attempts, exponential backoff
const jobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnComplete: false, // keep for audit
  removeOnFail: false,     // keep for debugging
};

// On final failure → alert Nimit + set status = 'failed' in DB
worker.on('failed', async (job, err) => {
  if (!job) return; // BullMQ passes undefined if the job was removed
  await alertNimit({ payment_id: job.data.payment_id, error: err.message });
  await updateProvisioningJob(job.data.payment_id, 'failed', err.message);
});
Edge cases handled
- **Duplicate webhook:** idempotency check on payment_id before creating the provisioning_jobs row. BullMQ deduplication via job ID = payment_id.
- **Partial failure:** each step is idempotent. The saga resumes from last_step. No double-creation of the schema, no duplicate payments row.
- **Coolify API down:** the domain assignment step retries independently. The site goes live at the subdomain immediately; the custom domain is provisioned when Coolify recovers.
- **Email SMTP down:** the welcome email is on its own queue. Provisioning completes regardless. The email retries up to 5 times over 2 hours.
- **Server restart mid-saga:** BullMQ persists jobs in Redis. On restart, the worker picks up the incomplete job and resumes from the DB-recorded last_step.
- **Razorpay retries:** the webhook endpoint returns 200 immediately. BullMQ job deduplication prevents double-provisioning even if Razorpay fires the event twice.
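The dedupe in the duplicate-webhook case relies on BullMQ's contract that `Queue.add(name, data, { jobId })` is a no-op when a job with that id already exists. A sketch with the queue interface stubbed so the contract is explicit and testable without Redis (in production this is bullmq's Queue, whose add() returns a Job rather than a boolean):

```typescript
// BullMQ dedupes on jobId. The QueueLike stub mirrors that behaviour so the
// contract can be exercised without Redis; `add` returns true only when the
// job was newly enqueued.
interface QueueLike {
  add(name: string, data: object, opts: { jobId: string }): Promise<boolean>;
}

export async function enqueueProvisioning(queue: QueueLike, paymentId: string): Promise<boolean> {
  // jobId = payment_id → a retried Razorpay webhook cannot double-provision
  return queue.add("provision", { payment_id: paymentId }, { jobId: paymentId });
}

// In-memory stand-in mirroring BullMQ's dedupe behaviour
export function makeFakeQueue(): QueueLike {
  const seen = new Set<string>();
  return {
    async add(_name, _data, opts) {
      if (seen.has(opts.jobId)) return false; // duplicate — ignored
      seen.add(opts.jobId);
      return true;
    },
  };
}
```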
PgBouncer — required from Phase 1, not Phase 7
# pgbouncer.ini — transaction mode is required so search_path can be set per transaction
[databases]
saas_db = host=postgres port=5432 dbname=saas pool_size=25

[pgbouncer]
pool_mode = transaction    # NOT session — session pooling pins one server connection per client and defeats pooling
max_client_conn = 1000     # Next.js instances can hold many idle connections
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
log_connections = 0        # disable in prod — log spam
Important: With PgBouncer in transaction mode, SET search_path TO tenant_{id} must be called at the start of every transaction, not once per connection. Your DB middleware wrapper must handle this. Session-level search_path leaks between tenants in transaction mode.
Field-level encryption for PII — DPDP Act requirement
| Field | Table | Treatment | Reason |
|---|---|---|---|
| pan | tenants | AES-256-GCM encrypted | DPDP sensitive personal data |
| aadhaar | tenants | Verify-and-discard. Store only last 4 digits. | UIDAI regulation — must not store full Aadhaar |
| dob | tenants | AES-256-GCM encrypted | DPDP personal data |
| hashed_password | tenants | bcrypt 12 rounds (unchanged) | Correct |
| owner_phone | tenants | AES-256-GCM encrypted | DPDP personal data |
| gst | tenants | AES-256-GCM encrypted | Business identity data |
// lib/encrypt.ts — application-level field encryption (AES-256-GCM)
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const keyHex = process.env.FIELD_ENCRYPTION_KEY;
if (!keyHex) throw new Error('FIELD_ENCRYPTION_KEY is not set');
const KEY = Buffer.from(keyHex, 'hex'); // 32 bytes

export function encrypt(plaintext: string): string {
  const iv = randomBytes(12); // 96-bit IV, recommended for GCM
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store iv.tag.ciphertext so decryption is self-describing
  return [iv.toString('hex'), tag.toString('hex'), encrypted.toString('hex')].join('.');
}

export function decrypt(ciphertext: string): string {
  const [ivHex, tagHex, dataHex] = ciphertext.split('.');
  const decipher = createDecipheriv('aes-256-gcm', KEY, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  return decipher.update(dataHex, 'hex', 'utf8') + decipher.final('utf8');
}
Search path safety with PgBouncer transaction mode
// lib/db.ts — tenant-scoped query wrapper
export async function withTenantSchema<T>(
  tenantId: string,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  // Schema name is interpolated, so reject anything that isn't a UUID
  if (!/^[0-9a-f-]{36}$/i.test(tenantId)) throw new Error('Invalid tenant id');
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Must set search_path inside EVERY transaction in transaction-mode pooling.
    // SET LOCAL scopes it to this transaction; the quoted identifier is needed
    // because UUIDs contain hyphens.
    await client.query(`SET LOCAL search_path TO "tenant_${tenantId}", public`);
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
PostgreSQL catalog performance — schema growth mitigation
- **Catalog bloat prevention:** pre-warm the schema migration template at startup. Never run DDL live during provisioning. Use a pre-built schema template dump and COPY it per tenant; catalog queries stay fast.
- **Connection ceiling:** PgBouncer caps Postgres connections at 25–50 regardless of how many tenants exist. 500 tenants with 1000 Next.js connections → still only 25 Postgres connections.
- **Schema count limit:** monitor the pg_namespace row count. Alert at 5000 schemas and plan shard migration before hitting 8000. At the current growth rate this is a year+ problem, not a launch problem.
- **Slow catalog queries:** pg_stat_statements enabled from day one. Alert if any information_schema or pg_class query exceeds 50 ms p95. (pg_namespace.nspname is already covered by a built-in unique index, so tuning focuses on query patterns, not new indexes.)
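The schema-count alert can be a small scheduled check. The `query` shape follows node-postgres' client API; the alert callback and threshold default are stand-ins.

```typescript
// Alert when the count of tenant schemas in pg_namespace crosses a threshold.
// `Queryable.query` matches node-postgres' client.query shape; the alert
// function is a stand-in for the real Slack notifier.
interface Queryable {
  query(sql: string): Promise<{ rows: { count: string }[] }>;
}

export async function checkSchemaCount(
  db: Queryable,
  alert: (msg: string) => void,
  threshold = 5000
): Promise<number> {
  // Note: `_` is a LIKE wildcard; fine for monitoring, escape it if exactness matters
  const res = await db.query(
    "SELECT count(*)::text AS count FROM pg_namespace WHERE nspname LIKE 'tenant_%'"
  );
  const count = parseInt(res.rows[0].count, 10);
  if (count >= threshold) {
    alert(`Schema count ${count} >= ${threshold} — plan shard migration`);
  }
  return count;
}
```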
Daily reconciliation job — the missing piece in v1
// cron/reconcile.ts — runs daily 06:00 IST
async function reconcilePayments() {
  // Fetch last 48h of Razorpay payments (overlap buffer)
  const rzPayments = await razorpay.payments.all({
    from: Math.floor(Date.now() / 1000) - 172800, // 48h in seconds
    count: 100, // Razorpay page-size cap — paginate with `skip` if volume grows
  });
  for (const payment of rzPayments.items) {
    if (payment.status !== 'captured') continue;
    const existing = await db.query(
      'SELECT id FROM payments WHERE razorpay_payment_id = $1',
      [payment.id]
    );
    if (!existing.rows.length) {
      // Payment exists in Razorpay but NOT in our DB → missed webhook
      logger.error({ payment_id: payment.id }, 'MISSED_PAYMENT — queuing provisioning');
      await provisioningQueue.add('provision', { payment_id: payment.id }, jobOptions);
      await alertNimit(`Missed payment recovered: ${payment.id}`);
    }
  }
}
Why 48 hours lookback? If the reconciliation job itself fails one day, the next day's run still catches the previous day's missed payments. Single-day lookback creates a gap if the cron fails.
Payment flow — complete edge case map
- **A. Happy path: webhook fires, provisioning completes.** Customer pays → webhook within seconds → BullMQ job → saga completes → site live in ~60s. No change needed here.
- **B. Webhook fires but the server is briefly down.** Razorpay retries webhooks for up to 24 hours with exponential backoff. When the server recovers, the webhook arrives and BullMQ handles it normally. The reconciliation job at 06:00 acts as a final safety net.
- **C. Webhook never arrives (Razorpay internal issue).** The reconciliation job catches it within 24 hours, so the customer is delayed by up to ~24h in the worst case. An alert fires to Nimit with the payment details so manual intervention is possible if needed. Improvement: also poll the Razorpay API every 15 minutes for payments made in the last hour, reducing the worst-case delay for missed webhooks to 15 minutes.
- **D. Customer pays and immediately closes the browser.** Provisioning is fully server-side, triggered by the webhook. Browser state is irrelevant; the site provisions regardless, and the customer gets the welcome email when they check their inbox.
- **E. Renewal webhook arrives for a suspended tenant.** The webhook handler checks whether the tenant exists. If status = 'suspended' → update end_date, set status = 'active', re-enable the site. No re-provisioning; the existing schema and content are preserved.
- **F. Refund issued in the Razorpay dashboard.** Razorpay fires a refund.created event. Handle this webhook: set tenant status = 'suspended', send a notification, and log the refund in audit_logs. Do NOT delete the schema immediately — give 7 days for dispute resolution. (The v1 document had no refund handling at all.)
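Case F reduces to a short handler. This is a sketch: the Db and notify interfaces are stand-ins for the real data layer, and the 7-day grace period is enforced precisely by *not* touching the schema here.

```typescript
// Handle Razorpay refund.created: suspend the tenant, write an audit_logs
// entry, notify — but never drop the schema (7-day dispute window).
// Db and notify are illustrative stand-ins, not the real data layer.
interface Db {
  suspendTenant(tenantId: string): Promise<void>;
  auditLog(entry: { tenantId: string; action: string; detail: string }): Promise<void>;
}

export async function handleRefundCreated(
  db: Db,
  notify: (msg: string) => Promise<void>,
  event: { tenant_id: string; refund_id: string; amount: number }
): Promise<void> {
  await db.suspendTenant(event.tenant_id); // site off, data intact
  await db.auditLog({
    tenantId: event.tenant_id,
    action: "refund.created",
    detail: `refund ${event.refund_id} for ${event.amount}`,
  });
  await notify(`Refund ${event.refund_id}: tenant ${event.tenant_id} suspended`);
  // Schema deletion is a separate, manual step after the 7-day window.
}
```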
Disaster recovery — defined RTO / RPO
| Scenario | RTO | RPO | Recovery method |
|---|---|---|---|
| Bad deployment | <2 min | 0 (no data loss) | Coolify auto-rollback on health check fail |
| VPS full restart | <5 min | 0 | Docker containers auto-restart. PG data on persistent volume. |
| VPS hardware failure | <4 hours | <24 hours | Restore from nightly pg_dump to cold standby VPS. WAL archiving reduces RPO to <1 hour for paid plans. |
| Datacenter outage | <6 hours | <24 hours | Spin up cold standby in different region from snapshot. Point DNS to new IP. |
| Database corruption | <2 hours | <1 hour (WAL) | PITR from WAL archives in MinIO secondary. Restore to any second in last 7 days. |
Cold standby setup
# Cold standby VPS runbook (kept up to date, tested monthly)
# 1. VPS provisioned with same specs, Coolify installed, same env vars
# 2. Daily: pg_dump uploaded to MinIO + replicated to secondary MinIO in different region
# 3. WAL archives shipped continuously to secondary region MinIO
# 4. To activate standby:
# a. Download latest pg_dump + WAL to standby VPS
# b. pg_restore → PostgreSQL instance
# c. Apply WAL to reach latest consistent state
# d. Update DNS A records to standby IP (TTL should be 300s, not 3600s)
# e. Let's Encrypt certs re-issue on new server
# f. Estimated downtime: 2–4 hours
# DNS TTL — keep at 300s (5 min), NOT the default 3600s
# High TTL means DNS change takes 1 hour to propagate. 5 min TTL = fast failover.
Distributed cache coherence fix
// Solution: use Redis as the shared cache layer, not in-process Next.js cache.
// Both functions are expected to run inside withTenantSchema(), so the
// page_content queries resolve via the tenant's search_path.

// On CMS content save:
async function savePageContent(tenantId: string, page: string, content: object) {
  await db.query('UPDATE page_content SET content_json = $1 WHERE page = $2', [content, page]);
  // Invalidate Redis cache — hits ALL replicas because they all read from the same Redis
  await redis.del(`page:${tenantId}:${page}`);
  // Also trigger Next.js on-demand revalidation (use revalidateTag() with App Router caching)
  await revalidatePath(`/`); // triggers ISR regeneration
}

// On public website render — read from Redis first:
async function getPageContent(tenantId: string, page: string) {
  const cached = await redis.get(`page:${tenantId}:${page}`);
  if (cached) return JSON.parse(cached);
  const fresh = await db.query('SELECT content_json FROM page_content WHERE page = $1', [page]);
  const content = fresh.rows[0].content_json;
  // Cache the content itself (not the whole row) so hit and miss return the same shape
  await redis.setex(`page:${tenantId}:${page}`, 60, JSON.stringify(content));
  return content;
}
Key insight: Redis-backed caching means all replicas share one cache. A Redis DEL is seen by every replica instantly. No stale content, no replica drift.
BullMQ replaces bare cron worker
v1: a single node-cron container. No retry on failure, no concurrency control, no visibility into job health. One unhandled rejection crashes all automation silently.

v2: separate queues per job type, each with an independent retry config and dead-letter queue. Failed jobs alert Nimit. Dashboard visibility via Bull Board. Redis persistence survives worker restarts.
// BullMQ queue registry — each job type is isolated
const queues = {
  provisioning:   new Queue('provisioning',   { connection: redis, defaultJobOptions: { attempts: 3, backoff: { type: 'exponential', delay: 5000 } } }),
  emails:         new Queue('emails',         { connection: redis, defaultJobOptions: { attempts: 5, backoff: { type: 'exponential', delay: 30000 } } }),
  subscription:   new Queue('subscription',   { connection: redis, defaultJobOptions: { attempts: 2 } }),
  reconciliation: new Queue('reconciliation', { connection: redis, defaultJobOptions: { attempts: 3 } }),
  maintenance:    new Queue('maintenance',    { connection: redis, defaultJobOptions: { attempts: 1 } }),
};

// Scheduled jobs — using BullMQ's built-in repeat
await queues.subscription.add('lifecycle', {}, { repeat: { cron: '0 2 * * *' } });     // 2am IST
await queues.reconciliation.add('daily', {}, { repeat: { cron: '30 6 * * *' } });      // 6:30am IST
await queues.maintenance.add('vacuum', {}, { repeat: { cron: '0 3 * * 0' } });         // Sunday 3am
await queues.maintenance.add('backup-verify', {}, { repeat: { cron: '0 5 1 * *' } });  // 1st of month, 5am
Aadhaar storage is the most urgent issue. UIDAI regulations prohibit storing Aadhaar numbers in databases without explicit UIDAI license approval. The correct approach: use Aadhaar for identity verification only (via an OTP or UIDAI API), confirm the user's identity, then discard the full number. Store only the last 4 digits for reference.
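The discard step reduces to: validate the format, keep only the last four digits, and never let the full number reach a persisted field. This sketch does the format check only; actual identity verification happens upstream via DigiLocker or the UIDAI OTP API.

```typescript
// Verify-and-discard: return only the last 4 digits for storage.
// Format check only: 12 digits, first digit 2–9 per UIDAI's numbering scheme.
// Real identity verification (OTP / DigiLocker) happens before this point.
export function discardAadhaar(aadhaar: string): string {
  const digits = aadhaar.replace(/\s/g, "");
  if (!/^[2-9]\d{11}$/.test(digits)) {
    throw new Error("Invalid Aadhaar format");
  }
  return digits.slice(-4); // the ONLY part that may be persisted
}
```

The caller stores the return value in aadhaar_last4 and lets the full input go out of scope, so the complete number never touches the database or logs.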
DPDP compliance checklist
1. **Aadhaar: verify-and-discard pattern.** Remove the Aadhaar column from the tenants table. Instead: accept Aadhaar during onboarding for verification only, confirm identity via DigiLocker or the UIDAI OTP API, then store only aadhaar_last4 (CHAR(4)) for audit purposes.
   ALTER TABLE tenants DROP COLUMN aadhaar; ALTER TABLE tenants ADD COLUMN aadhaar_last4 CHAR(4);
2. **Consent mechanism at onboarding.** The registration form must have explicit, unbundled consent checkboxes. "I agree to the Terms" is not DPDP-compliant consent for data processing: each purpose needs a separate checkbox, with the consent timestamp and IP stored. Add a consent_logs table: tenant_id, purpose (marketing/billing/operations), consented_at, ip_address, consent_version. One checkbox per purpose, all required.
3. **Data export API — defined and implemented.** DPDP gives individuals the right to access their data. The "data export email" mentioned in v1 must be a real API that generates a structured JSON/CSV export of everything stored under that tenant's identity: POST /api/tenant/data-export generates an export of the tenants row + subscription + payments + all tenant schema content, then emails a secure download link valid for 48 hours.
4. **Data deletion on request.** Right to erasure under DPDP. When a tenant requests deletion, all PII in the tenants table must be overwritten, not just the schema dropped. Keep payment records (a legal requirement for GST compliance) but anonymise the personal fields.
5. **Breach notification procedure.** DPDP requires notifying the Data Protection Board within 72 hours of discovering a breach. Draft the procedure now: who decides it's a breach, who notifies the Board, what template to use. This cannot be improvised during an incident.
6. **Privacy policy and Data Processing Agreement.** As a SaaS platform, Nimit is a Data Fiduciary. Each tenant is also a Data Fiduciary for their end customers. A DPA template must be part of the Terms of Service that tenants accept at signup.
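The export endpoint from item 3 boils down to assembling one JSON document from the shared-schema rows plus the tenant schema content. A sketch with data access stubbed; the real route would also record a data_requests row and email the 48-hour download link.

```typescript
// Assemble a DPDP data export for one tenant. The loader interface is a
// stand-in for the real queries; the output is the JSON the secure
// download link would serve.
interface ExportSources {
  tenant(): Promise<object>;
  subscription(): Promise<object>;
  payments(): Promise<object[]>;
  tenantContent(): Promise<Record<string, object[]>>; // rows per tenant-schema table
}

export async function buildDataExport(src: ExportSources): Promise<string> {
  const exportDoc = {
    generated_at: new Date().toISOString(),
    tenant: await src.tenant(),
    subscription: await src.subscription(),
    payments: await src.payments(),
    content: await src.tenantContent(),
  };
  return JSON.stringify(exportDoc, null, 2);
}
```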
New tables required for compliance
-- consent_logs: tracks what each tenant agreed to and when
CREATE TABLE consent_logs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID REFERENCES tenants(id),
  purpose TEXT NOT NULL,         -- 'billing' | 'operations' | 'marketing'
  consented BOOLEAN NOT NULL,
  consented_at TIMESTAMPTZ DEFAULT now(),
  ip_address INET,
  consent_version TEXT NOT NULL  -- version of privacy policy accepted
);

-- data_requests: tracks access/deletion/export requests
CREATE TABLE data_requests (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID REFERENCES tenants(id),
  request_type TEXT NOT NULL,    -- 'export' | 'deletion' | 'correction'
  status TEXT DEFAULT 'pending',
  requested_at TIMESTAMPTZ DEFAULT now(),
  completed_at TIMESTAMPTZ,
  notes TEXT
);
The rule: Never take money from a customer until Phase 3 is complete AND the compliance checklist is signed off. Selling subscriptions without DPDP consent mechanisms in place is a regulatory risk.
Phase 1

- Next.js monorepo setup, Coolify on VPS, staging environment (separate branch + Coolify staging app)
- PostgreSQL 15 + PgBouncer in transaction mode — not deferred
- Redis for sessions, cache, and job queue persistence
- BullMQ worker container with Bull Board admin UI at /admin/jobs
- Shared schema tables: tenants, subscriptions, payments, provisioning_jobs, consent_logs
- Super admin login, test tenant registration with consent checkboxes
- MinIO for object storage + secondary backup location configured
- Field encryption utility (lib/encrypt.ts) wired into the tenant model

Signal: Nimit logs in to admin. Test registration creates a tenant record with encrypted PII. Bull Board shows healthy queues.
Phase 2

- Agency website template (4 pages), multi-tenancy middleware, subdomain routing
- CMS panel with all editors, content API with Redis-backed cache layer
- Server-side contact form /api/contact (replaces EmailJS)
- withTenantSchema() DB wrapper — enforces SET LOCAL search_path on every query
- Cache invalidation on CMS save (Redis DEL + Next.js revalidateTag)

Signal: A manually created test tenant edits the CMS. The public site reflects the changes. All 4 Next.js replicas serve fresh content after a CMS save.
Phase 3

- Razorpay integration: 3-plan checkout, Orders API, webhook endpoint
- Provisioning saga engine — BullMQ-backed, idempotent, resumable from last_step
- Refund webhook handler (refund.created → suspend tenant)
- Daily reconciliation job — 48h lookback against the Razorpay API
- Invoice generation + S3-compatible (MinIO) storage
- DPDP compliance gate: consent checkboxes live, Aadhaar verify-and-discard, privacy policy published
- Subscription renewal lifecycle cron (BullMQ scheduled job)

Signal: Customer pays → site live in 60s. provisioning_jobs shows 'complete'. The reconciliation job runs without errors. Intentionally kill the server mid-provisioning → the saga resumes on restart.
Phase 4

- Full tenant list: all fields, subscription status, days remaining, payment history
- Live/shutdown toggle, expiry alerts, Razorpay transaction view
- Provisioning job monitor — see in-flight and failed provisioning attempts
- BullMQ Bull Board embedded in the admin panel
- Data request management (export / deletion requests from tenants)
Phase 5

- Sentry: errors, slow transactions, payment webhook failures
- Internal uptime checker + external BetterUptime on the main domain
- Health endpoint: DB, Redis, MinIO, queue depths, worker status
- Job failure alerting: any BullMQ job hitting the dead-letter queue → Slack alert to Nimit
- Structured logging: provisioning, webhooks, cron, admin actions
- pg_stat_statements monitoring — alert on catalog query degradation
Phase 6

- Custom domain input in CMS, async Coolify API provisioning
- ssl_status polling: pending → verifying → active, shown in the CMS with instructions
- Honest UX: "Your domain will be live with HTTPS within 10–15 minutes" — not 60 seconds
- DNS instruction generator (A record pointing to the server IP, shown inline)
- Domain verification check: resolve the tenant domain and confirm it points to our server before requesting a cert
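The pre-cert verification check can be split so the comparison is pure and testable; resolution itself uses node:dns, and the server IP shown is an illustrative placeholder.

```typescript
import { resolve4 } from "node:dns/promises";

// A custom domain is ready for cert issuance only when its A record points
// at our server. The comparison is pure so it can be tested without DNS.
export function pointsAtServer(resolvedIps: string[], serverIp: string): boolean {
  return resolvedIps.includes(serverIp);
}

// Live check run before triggering the Coolify cert request.
export async function verifyDomain(domain: string, serverIp: string): Promise<boolean> {
  try {
    return pointsAtServer(await resolve4(domain), serverIp);
  } catch {
    return false; // NXDOMAIN or DNS not yet propagated
  }
}
```

A failed check should surface the DNS instructions again rather than retrying the cert request, since Let's Encrypt rate-limits failed issuance attempts.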
Phase 7

- Load test: 100 concurrent site visitors, target <200ms p95
- Security audit: JWT rotation, rate limiting review, dependency scan
- Read replica for PostgreSQL (at 500+ tenants)
- Cloudflare CDN in front of the platform for static asset edge delivery
- Cold standby VPS DR test — full restore drill, measure actual RTO
- Monthly backup verification cron — restore and verify row counts
- Schema catalog monitoring — alert if the pg_namespace count exceeds 5000

Note: PgBouncer and BullMQ are no longer in this phase — they were moved to Phase 1, where they belong.