Critical risks fixed: 5 · Medium risks mitigated: 6 · Build phases: 7 · Architecture version: v2
| Layer | Components |
| --- | --- |
| Presentation | Next.js 14 monorepo · Middleware tenant routing · Server-side contact API · CMS panel · Super admin |
| API / Logic | Next.js API routes · Provisioning saga engine · BullMQ job queue · Reconciliation cron · Subscription lifecycle |
| Data | PostgreSQL 15 · PgBouncer (Phase 1) · Schema-per-tenant · RLS policies · Field-level encryption · Redis |
| Payment | Razorpay Orders API · Daily reconciliation job · Webhook + HMAC verify · Idempotency keys |
| Infrastructure | Coolify + Traefik · Docker containers · Cold standby VPS · WAL → secondary region · MinIO + offsite backup · Let's Encrypt SSL |
| Observability | Sentry · Uptime checks · Staging environment · Job failure alerts · Structured logs |
| Compliance | DPDP consent flow · Encrypted PII fields · Data export API · Breach procedure · DPA template |
| v1 decision | Problem | v2 fix |
| --- | --- | --- |
| Sequential webhook handler | Partial failure = paid customer, no site | Saga state machine with resumable steps |
| No payment reconciliation | Silent revenue leak if webhook never arrives | Daily Razorpay API diff job |
| PgBouncer deferred to Phase 7 | Connection exhaustion at ~50 tenants | PgBouncer transaction mode from Phase 1 |
| No staging environment | Bad deploy hits all tenants simultaneously | Staging branch + Coolify staging env |
| Single VPS, no DR plan | Any hardware/network event = total outage | Cold standby + WAL to secondary region + defined RTO/RPO |
| PAN/Aadhaar in plaintext | DPDP Act violation, UIDAI compliance risk | AES-256 field encryption, Aadhaar verify-and-discard |
| EmailJS on frontend | Credential exposure, unprofessional signal | Server-side /api/contact with SMTP relay |
| Cron worker, no job queue | One crash silently kills all automation | BullMQ on Redis with retry + dead-letter queue |
| "60s" promise for custom domains | SSL issuance takes 15–90s, user feels misled | Subdomain: 60s. Custom domain: async, user notified when ready |
Core principle: Every provisioning step must be idempotent. The saga persists its state to the database after each step. If the process crashes at step 6, it resumes from step 6 — not from step 1.
-- New table in shared schema
CREATE TABLE provisioning_jobs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id    TEXT UNIQUE NOT NULL,
  tenant_id     UUID,
  status        TEXT NOT NULL DEFAULT 'pending',
    -- ENUM: pending | schema_created | content_seeded | domain_assigned | email_sent | complete | failed
  last_step     TEXT,
  error_msg     TEXT,
  attempt_count INT DEFAULT 0,
  created_at    TIMESTAMPTZ DEFAULT now(),
  updated_at    TIMESTAMPTZ DEFAULT now()
);
01
Webhook received → verify HMAC-SHA256 signature
Reject invalid. Return 200 immediately to Razorpay — don't make Razorpay wait for provisioning. Push a job to BullMQ queue instead of processing inline.
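The signature check in this step can be sketched as below. This is a minimal sketch assuming Razorpay's documented scheme (hex HMAC-SHA256 of the raw body with the webhook secret, delivered in the `x-razorpay-signature` header); verify the header name against Razorpay's docs before relying on it.

```typescript
// Constant-time HMAC-SHA256 verification for step 01.
import { createHmac, timingSafeEqual } from 'node:crypto';

export function signBody(rawBody: string, webhookSecret: string): string {
  return createHmac('sha256', webhookSecret).update(rawBody).digest('hex');
}

export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  webhookSecret: string,
): boolean {
  const expected = Buffer.from(signBody(rawBody, webhookSecret), 'utf8');
  const received = Buffer.from(signatureHeader, 'utf8');
  // timingSafeEqual throws on length mismatch, so guard first;
  // the constant-time compare prevents timing-based signature guessing
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The raw body (not the parsed JSON) must be fed to the HMAC, which usually means disabling body parsing on the webhook route.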
02
BullMQ worker picks up provisioning job
Worker creates a provisioning_jobs row with status=pending. If row already exists for this payment_id → idempotency exit. This is now the saga's persistent record.
03
Create tenant record + subscription row
Insert into tenants and subscriptions. Update provisioning_jobs.last_step = 'tenant_created'. All DB operations in this step are inside a single Postgres transaction — atomic and rollbackable.
04
Create tenant schema + run migrations
Execute CREATE SCHEMA tenant_{uuid} and apply table migrations from a pre-tested migration template. Update last_step. Schema creation is idempotent: CREATE SCHEMA IF NOT EXISTS.
05
Seed default content
Insert default rows into the new schema's tables. Check if rows exist before inserting — idempotent. Update last_step = 'content_seeded'.
06
Record payment
Insert into payments table with idempotency_key = razorpay_payment_id. Uses ON CONFLICT DO NOTHING. Always safe to retry.
07
Assign domain (async — does not block)
For subdomains: immediate (wildcard cert). For custom domains: insert domain record with ssl_status='pending', trigger async Coolify API call, notify tenant separately when SSL is ready. The site goes live at the subdomain immediately regardless.
Custom domain SSL is async. Never promise 60s for custom domains. Show tenant a "your domain is being configured" status in CMS until ssl_status = 'active'.
08
Send welcome email (non-blocking)
Email is pushed to a separate BullMQ email queue. Welcome email failure must NOT fail provisioning. Mark last_step = 'complete' before email fires. Email has its own retry with exponential backoff.
Email is a side-effect, not part of the saga's success condition. Tenant gets their site regardless of whether the email delivers.
09
Mark provisioning_jobs.status = 'complete'
Notify Nimit via Slack webhook (also non-blocking). The saga is done. If any step 3–6 failed, BullMQ retries the whole job — all steps are idempotent so re-running them is safe.
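The resume behaviour described above can be expressed as a pure function. The step names here are illustrative, loosely following the provisioning_jobs status values, not a fixed API:

```typescript
// Resume logic: given the saga's ordered steps and the last step recorded in
// provisioning_jobs.last_step, return the steps that still need to run.
const SAGA_STEPS: string[] = [
  'tenant_created',
  'schema_created',
  'content_seeded',
  'payment_recorded',
  'domain_assigned',
  'complete',
];

export function stepsToRun(lastStep: string | null): string[] {
  if (lastStep === null) return [...SAGA_STEPS]; // fresh job: run everything
  const idx = SAGA_STEPS.indexOf(lastStep);
  if (idx === -1) throw new Error(`unknown saga step: ${lastStep}`);
  return SAGA_STEPS.slice(idx + 1); // a crash after step N resumes at step N+1
}
```

Because every step is idempotent, re-running the whole list is also safe; resuming from `last_step` simply avoids redundant work.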
// workers/provisioning.ts
const provisioningQueue = new Queue('provisioning', { connection: redis });

const worker = new Worker('provisioning', processProvisioningJob, {
  connection: redis,
  concurrency: 5,                          // max 5 parallel provisionings
  limiter: { max: 10, duration: 60000 },   // 10 per minute (Coolify API rate limit)
});

// Retry config: 3 attempts, exponential backoff
const jobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnComplete: false, // keep for audit
  removeOnFail: false,     // keep for debugging
};

// On final failure → alert Nimit + set status='failed' in DB
worker.on('failed', (job, err) => {
  alertNimit({ payment_id: job.data.payment_id, error: err.message });
  updateProvisioningJob(job.data.payment_id, 'failed', err.message);
});
DUPLICATE WEBHOOK
Idempotency check on payment_id before creating provisioning_jobs row. BullMQ deduplication via job ID = payment_id.
PARTIAL FAILURE
Each step is idempotent. Saga resumes from last_step. No double-creation of schema, no duplicate payments row.
COOLIFY API DOWN
Domain assignment step retries independently. Site goes live at subdomain immediately. Custom domain provisioned when Coolify recovers.
EMAIL SMTP DOWN
Welcome email is on its own queue. Provisioning completes regardless. Email retries up to 5 times over 2 hours.
SERVER RESTART MID-SAGA
BullMQ persists jobs in Redis. On restart, worker picks up the incomplete job and resumes from DB-recorded last_step.
RAZORPAY RETRIES
Webhook endpoint returns 200 immediately. BullMQ job deduplication prevents double-provisioning even if Razorpay fires the event twice.
Connection pool architecture
Transaction mode pooling — the correct mode for schema-per-tenant
FROM DAY ONE
# pgbouncer.ini — transaction mode; every transaction must then
# SET LOCAL search_path itself (see note below)
[databases]
saas_db = host=postgres port=5432 dbname=saas pool_size=25

[pgbouncer]
pool_mode = transaction   # NOT session: session mode pins a server connection per client and defeats pooling
max_client_conn = 1000    # Next.js instances can hold many idle connections
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
log_connections = 0       # disable in prod — log spam
Important: With PgBouncer in transaction mode, SET search_path TO tenant_{id} must be called at the start of every transaction, not once per connection. Your DB middleware wrapper must handle this. Session-level search_path leaks between tenants in transaction mode.
Sensitive fields that must be encrypted at rest
AES-256-GCM, application-level. Not just disk encryption.
REQUIRED
| Field | Table | Treatment | Reason |
| --- | --- | --- | --- |
| pan | tenants | AES-256-GCM encrypted | DPDP sensitive personal data |
| aadhaar | tenants | Verify-and-discard. Store only last 4 digits. | UIDAI regulation — must not store full Aadhaar |
| dob | tenants | AES-256-GCM encrypted | DPDP personal data |
| hashed_password | tenants | bcrypt 12 rounds (unchanged) | Correct |
| owner_phone | tenants | AES-256-GCM encrypted | DPDP personal data |
| gst | tenants | AES-256-GCM encrypted | Business identity data |
// lib/encrypt.ts — application-level field encryption
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex'); // 32 bytes; fail fast if unset

export function encrypt(plaintext: string): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv.toString('hex'), tag.toString('hex'), encrypted.toString('hex')].join('.');
}

export function decrypt(ciphertext: string): string {
  const [ivHex, tagHex, dataHex] = ciphertext.split('.');
  const decipher = createDecipheriv('aes-256-gcm', KEY, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  return decipher.update(dataHex, 'hex', 'utf8') + decipher.final('utf8');
}
// lib/db.ts — tenant-scoped query wrapper
export async function withTenantSchema<T>(
  tenantId: string,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Must set search_path inside EVERY transaction in transaction-mode pooling
    await client.query(`SET LOCAL search_path TO tenant_${tenantId}, public`);
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
CATALOG BLOAT PREVENTION
Pre-warm schema migration template at startup. Never run DDL live during provisioning. Use a pre-built schema template dump and COPY it per tenant. Catalog queries stay fast.
CONNECTION CEILING
PgBouncer caps Postgres connections at 25-50 regardless of how many tenants. 500 tenants with 1000 Next.js connections → still only 25 Postgres connections.
SCHEMA COUNT LIMIT
Monitor pg_namespace row count. Alert at 5000 schemas. Plan shard migration before hitting 8000. At current growth rate this is a year+ problem, not a launch problem.
SLOW CATALOG QUERIES
pg_stat_statements enabled from day one. Alert if any information_schema or pg_class query exceeds 50ms p95. Add index on pg_namespace.nspname if needed.
Razorpay ↔ DB reconciliation cron
Runs at 06:00 IST daily. Catches any payment that webhook missed.
NEW IN V2
// cron/reconcile.ts — runs daily 06:00 IST
async function reconcilePayments() {
  // Fetch last 48h of Razorpay payments (overlap buffer)
  const rzPayments = await razorpay.payments.all({
    from: Math.floor(Date.now() / 1000) - 172800,
    count: 100,
  });

  for (const payment of rzPayments.items) {
    if (payment.status !== 'captured') continue;

    const existing = await db.query(
      'SELECT id FROM payments WHERE razorpay_payment_id = $1',
      [payment.id]
    );

    if (!existing.rows.length) {
      // Payment exists in Razorpay but NOT in our DB → missed webhook
      logger.error({ payment_id: payment.id }, 'MISSED_PAYMENT — queuing provisioning');
      await provisioningQueue.add('provision', { payment_id: payment.id }, jobOptions);
      await alertNimit(`Missed payment recovered: ${payment.id}`);
    }
  }
}
Why 48 hours lookback? If the reconciliation job itself fails one day, the next day's run still catches the previous day's missed payments. Single-day lookback creates a gap if the cron fails.
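The diff at the heart of the reconciliation job can be isolated as a pure function. This is a sketch: `RzPayment` mirrors only the two Razorpay fields the job actually reads.

```typescript
// Reconciliation core: captured payments that Razorpay knows about but our
// payments table does not — each one is a missed webhook to re-provision.
interface RzPayment {
  id: string;
  status: string; // 'captured' | 'failed' | 'refunded' | ...
}

export function findMissedPayments(
  razorpayPayments: RzPayment[],
  dbPaymentIds: Set<string>,
): string[] {
  return razorpayPayments
    .filter((p) => p.status === 'captured')  // only settled money matters
    .filter((p) => !dbPaymentIds.has(p.id))  // in Razorpay, missing from our DB
    .map((p) => p.id);
}
```

Keeping the diff pure makes it trivially unit-testable; the cron wrapper supplies the two inputs from the Razorpay API and the payments table.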
A
Happy path: webhook fires, provisioning completes
Customer pays → webhook within seconds → BullMQ job → saga completes → site live in ~60s. No change needed here.
B
Webhook fires but server is briefly down
Razorpay retries webhooks for up to 24 hours with exponential backoff. When server recovers, webhook arrives. BullMQ handles it normally. Reconciliation job at 06:00 acts as a final safety net.
C
Webhook never arrives (Razorpay internal issue)
Reconciliation job catches it within 24 hours. Customer is delayed by up to ~24h in worst case. Alert fires to Nimit with the payment details so manual intervention is possible if needed.
Improve: also poll Razorpay API every 15 minutes for payments made in the last hour. Reduces worst-case delay to 15 minutes for missed webhooks.
D
Customer pays and immediately closes browser
Provisioning is fully server-side triggered by webhook. Browser state is irrelevant. Site provisions regardless. Customer gets welcome email when they check inbox.
E
Renewal webhook arrives for a suspended tenant
Webhook handler checks if tenant exists. If status = 'suspended' → update end_date, set status = 'active', re-enable site. No re-provisioning. Existing schema and content preserved.
F
Refund issued in Razorpay dashboard
Razorpay fires a payment.failed or refund.created event. Handle this webhook: set tenant status = 'suspended', log the refund. V1 document has no refund handling at all.
Add refund.created webhook handler. Suspend tenant, send notification, log in audit_logs. Do NOT delete schema immediately — give 7 days for dispute resolution.
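The status transitions in scenarios E and F can be sketched as one pure function. Event and status names follow the scenarios above; this is illustrative, not the full handler, which would also update end_date, write audit_logs, and schedule the delayed schema drop.

```typescript
// Tenant status transition for subscription lifecycle webhooks.
type TenantStatus = 'active' | 'suspended';

export function nextTenantStatus(current: TenantStatus, event: string): TenantStatus {
  switch (event) {
    case 'payment.captured': // renewal — also re-activates a suspended tenant (scenario E)
      return 'active';
    case 'refund.created':   // refund — suspend, keep schema for 7 days (scenario F)
    case 'payment.failed':
      return 'suspended';
    default:
      return current;        // unknown events never change status
  }
}
```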
Recovery objectives
DEFINED IN V2
| Scenario | RTO | RPO | Recovery method |
| --- | --- | --- | --- |
| Bad deployment | <2 min | 0 (no data loss) | Coolify auto-rollback on health check fail |
| VPS full restart | <5 min | 0 | Docker containers auto-restart. PG data on persistent volume. |
| VPS hardware failure | <4 hours | <24 hours | Restore from nightly pg_dump to cold standby VPS. WAL archiving reduces RPO to <1 hour for paid plans. |
| Datacenter outage | <6 hours | <24 hours | Spin up cold standby in different region from snapshot. Point DNS to new IP. |
| Database corruption | <2 hours | <1 hour (WAL) | PITR from WAL archives in MinIO secondary. Restore to any second in last 7 days. |
# Cold standby VPS runbook (kept up to date, tested monthly)
# 1. VPS provisioned with same specs, Coolify installed, same env vars
# 2. Daily: pg_dump uploaded to MinIO + replicated to secondary MinIO in different region
# 3. WAL archives shipped continuously to secondary region MinIO
# 4. To activate standby:
#    a. Download latest pg_dump + WAL to standby VPS
#    b. pg_restore → PostgreSQL instance
#    c. Apply WAL to reach latest consistent state
#    d. Update DNS A records to standby IP (TTL should be 300s, not 3600s)
#    e. Let's Encrypt certs re-issue on new server
#    f. Estimated downtime: 2–4 hours

# DNS TTL — keep at 300s (5 min), NOT the default 3600s
# High TTL means DNS change takes 1 hour to propagate. 5 min TTL = fast failover.
The problem with multiple Next.js replicas + ISR
Each replica has its own in-memory cache. CMS save on replica A doesn't invalidate cache on replica B.
FIXED IN V2
// Solution: use Redis as the shared cache layer, not in-process Next.js cache

// On CMS content save:
async function savePageContent(tenantId: string, page: string, content: object) {
  await db.query(`UPDATE page_content SET content_json = $1 WHERE page = $2`, [content, page]);

  // Invalidate Redis cache — hits ALL replicas because they all read from same Redis
  await redis.del(`page:${tenantId}:${page}`);

  // Also call Next.js on-demand revalidation endpoint on all replicas
  // Use Next.js revalidateTag() if using App Router caching
  await revalidatePath(`/`); // triggers ISR regeneration
}

// On public website render — read from Redis first:
async function getPageContent(tenantId: string, page: string) {
  const cached = await redis.get(`page:${tenantId}:${page}`);
  if (cached) return JSON.parse(cached);

  const fresh = await db.query(`SELECT content_json FROM page_content WHERE page = $1`, [page]);
  await redis.setex(`page:${tenantId}:${page}`, 60, JSON.stringify(fresh.rows[0]));
  return fresh.rows[0].content_json;
}
Key insight: Redis-backed caching means all replicas share one cache. A Redis DEL is seen by every replica instantly. No stale content, no replica drift.
v1 cron worker risks
REMOVED
Single node-cron container. No retry on failure. No concurrency control. No visibility into job health. One unhandled rejection crashes all automation silently.
v2 BullMQ queues
NEW
Separate queues per job type. Each has independent retry config and dead-letter queue. Failed jobs alert Nimit. Dashboard visibility via Bull Board. Redis persistence survives worker restarts.
// BullMQ queue registry — each job type is isolated
const queues = {
  provisioning:   new Queue('provisioning',   { defaultJobOptions: { attempts: 3, backoff: 'exponential' } }),
  emails:         new Queue('emails',         { defaultJobOptions: { attempts: 5, backoff: { delay: 30000 } } }),
  subscription:   new Queue('subscription',   { defaultJobOptions: { attempts: 2 } }),
  reconciliation: new Queue('reconciliation', { defaultJobOptions: { attempts: 3 } }),
  maintenance:    new Queue('maintenance',    { defaultJobOptions: { attempts: 1 } }),
};

// Scheduled jobs — using BullMQ's built-in repeat
await queues.subscription.add('lifecycle', {}, { repeat: { cron: '0 2 * * *' } });     // 2am IST
await queues.reconciliation.add('daily', {}, { repeat: { cron: '30 6 * * *' } });      // 6:30am IST
await queues.maintenance.add('vacuum', {}, { repeat: { cron: '0 3 * * 0' } });         // Sunday 3am
await queues.maintenance.add('backup-verify', {}, { repeat: { cron: '0 5 1 * *' } });  // 1st of month
Aadhaar storage is the most urgent issue. UIDAI regulations prohibit storing Aadhaar numbers in databases without explicit UIDAI license approval. The correct approach: use Aadhaar for identity verification only (via an OTP or UIDAI API), confirm the user's identity, then discard the full number. Store only the last 4 digits for reference.
1
Aadhaar: verify-and-discard pattern
Remove the Aadhaar column from the tenants table. Instead: accept Aadhaar during onboarding for verification only, confirm identity via DigiLocker or the UIDAI OTP API, then store only aadhaar_last4 (CHAR(4)) for audit purposes.
ALTER TABLE tenants DROP COLUMN aadhaar;
ALTER TABLE tenants ADD COLUMN aadhaar_last4 CHAR(4);
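The verify-and-discard step can be sketched in application code. The actual identity check (DigiLocker or UIDAI OTP) is out of scope here; the point is that only the last 4 digits ever reach the database.

```typescript
// Keep the full Aadhaar in memory just long enough to verify, then retain
// only the fragment that the aadhaar_last4 column may store.
export function extractAadhaarLast4(fullAadhaar: string): string {
  const digits = fullAadhaar.replace(/[\s-]/g, ''); // tolerate "1234 5678 9012" input
  if (!/^\d{12}$/.test(digits)) {
    throw new Error('Aadhaar must be exactly 12 digits');
  }
  return digits.slice(-4); // the only part that may be persisted
}
```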
2
Consent mechanism at onboarding
Registration form must have explicit, unbundled consent checkboxes. "I agree to the Terms" is not DPDP-compliant consent for data processing. Each purpose needs a separate checkbox. Store consent timestamp and IP.
Add consent_logs table: tenant_id, purpose (marketing/billing/operations), consented_at, ip_address, consent_version. Checkbox per purpose, all required.
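The "checkbox per purpose, all required" rule can be sketched as a pure check. Purpose names are taken from the consent design above and are illustrative.

```typescript
// DPDP-style unbundled consent: every purpose needs its own explicit tick.
// A single "I agree to the Terms" checkbox can never satisfy this check.
const REQUIRED_PURPOSES: string[] = ['billing', 'operations', 'marketing'];

export function consentIsComplete(checked: Record<string, boolean>): boolean {
  return REQUIRED_PURPOSES.every((purpose) => checked[purpose] === true);
}
```

On success, the handler would write one consent_logs row per purpose with the timestamp, IP, and consent_version.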
3
Data export API — defined and implemented
DPDP gives individuals the right to access their data. The "data export email" mentioned in v1 must be a real API that generates a structured JSON/CSV export of everything stored under that tenant's identity.
POST /api/tenant/data-export → generates export of tenants row + subscription + payments + all tenant schema content → emails secure download link valid for 48 hours.
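The 48-hour download link can be implemented as a self-expiring HMAC token, so no server-side state is needed to invalidate it. A sketch under assumptions: the token format and signing secret here are illustrative, not the real implementation.

```typescript
// Export download token: tenant id + expiry, signed so neither can be forged
// or extended. The link dies on its own after 48 hours.
import { createHmac } from 'node:crypto';

const LINK_TTL_MS = 48 * 60 * 60 * 1000; // 48 hours

export function makeExportToken(tenantId: string, secret: string, now: number = Date.now()): string {
  const expiresAt = now + LINK_TTL_MS;
  const mac = createHmac('sha256', secret).update(`${tenantId}:${expiresAt}`).digest('hex');
  return `${tenantId}:${expiresAt}:${mac}`;
}

export function verifyExportToken(token: string, secret: string, now: number = Date.now()): boolean {
  const [tenantId, expiresAt, mac] = token.split(':');
  const expected = createHmac('sha256', secret).update(`${tenantId}:${expiresAt}`).digest('hex');
  return mac === expected && now < Number(expiresAt); // wrong MAC or past expiry → reject
}
```

The token goes into the emailed download URL as a query parameter; the download endpoint verifies it before streaming the export.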
4
Data deletion on request
Right to erasure under DPDP. When a tenant requests deletion, all PII in the tenants table must be overwritten (not just the schema dropped). Keep payment records (legal requirement for GST compliance) but anonymise the personal fields.
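The anonymise-don't-delete rule for the tenants row can be sketched as below. Field names mirror the encrypted-fields table above and are illustrative, not the full schema; which fields survive erasure (here `id` and `gst`, kept for GST-compliant payment records) is ultimately a legal judgment.

```typescript
// DPDP erasure: overwrite personal fields in place, keep the identifiers
// that payment and invoice records legally require.
interface TenantPII {
  id: string;                 // kept: payments reference it
  gst: string | null;         // kept: business identity needed for GST records
  pan: string | null;
  dob: string | null;
  owner_phone: string | null;
  aadhaar_last4: string | null;
}

export function anonymiseTenant(row: TenantPII): TenantPII {
  return {
    ...row,
    pan: null,
    dob: null,
    owner_phone: null,
    aadhaar_last4: null,
  };
}
```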
5
Breach notification procedure
DPDP requires notification to the Data Protection Board within 72 hours of discovering a breach. Draft the procedure now: who decides it's a breach, who notifies the Board, what template to use. This cannot be improvised during an incident.
6
Privacy policy and Data Processing Agreement
As a SaaS platform, Nimit is a Data Fiduciary. Each tenant is also a Data Fiduciary for their end customers. A DPA template must be part of the Terms of Service that tenants accept at signup.
-- consent_logs: tracks what each tenant agreed to and when
CREATE TABLE consent_logs (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id       UUID REFERENCES tenants(id),
  purpose         TEXT NOT NULL,            -- 'billing' | 'operations' | 'marketing'
  consented       BOOLEAN NOT NULL,
  consented_at    TIMESTAMPTZ DEFAULT now(),
  ip_address      INET,
  consent_version TEXT NOT NULL             -- version of privacy policy accepted
);

-- data_requests: tracks access/deletion/export requests
CREATE TABLE data_requests (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    UUID REFERENCES tenants(id),
  request_type TEXT NOT NULL,               -- 'export' | 'deletion' | 'correction'
  status       TEXT DEFAULT 'pending',
  requested_at TIMESTAMPTZ DEFAULT now(),
  completed_at TIMESTAMPTZ,
  notes        TEXT
);
The rule: Never take money from a customer until Phase 3 is complete AND the compliance checklist is signed off. Selling subscriptions without DPDP consent mechanisms in place is a regulatory risk.
PHASE 1 Foundation — with real infrastructure Build first
Next.js monorepo setup, Coolify on VPS, staging environment (separate branch + Coolify staging app)
PostgreSQL 15 + PgBouncer in transaction mode — not deferred
Redis for sessions, cache, and job queue persistence
BullMQ worker container with Bull Board admin UI at /admin/jobs
Shared schema tables: tenants, subscriptions, payments, provisioning_jobs, consent_logs
Super admin login, test tenant registration with consent checkboxes
MinIO for object storage + secondary backup location configured
Field encryption utility (lib/encrypt.ts) wired into tenant model
Signal: Nimit logs in to admin. Test registration creates tenant record with encrypted PII. BullMQ Bull Board shows healthy queues.
PHASE 2 Core product — website + CMS Week 2-4
Agency website template (4 pages), multi-tenancy middleware, subdomain routing
CMS panel with all editors, content API with Redis-backed cache layer
Server-side contact form /api/contact (replaces EmailJS)
withTenantSchema() DB wrapper — enforces SET LOCAL search_path on every query
Cache invalidation on CMS save (Redis DEL + Next.js revalidateTag)
Signal: Manually created test tenant edits CMS. Public site reflects changes. 4 Next.js replicas all serve fresh content after CMS save.
PHASE 3 Payments + provisioning saga Revenue gate
Razorpay integration: 3 plan checkout, Orders API, webhook endpoint
Provisioning saga engine — BullMQ-backed, idempotent, resumable from last_step
Refund webhook handler (refund.created → suspend tenant)
Daily reconciliation job — 48h lookback against Razorpay API
Invoice generation + S3 storage
DPDP compliance gate: consent checkboxes live, Aadhaar verify-and-discard, privacy policy published
Renewal subscription lifecycle cron (BullMQ scheduled job)
Signal: Customer pays → site live in 60s. The provisioning_jobs row shows 'complete'. Reconciliation job runs without errors. Intentionally kill the server mid-provisioning → saga resumes on restart.
PHASE 4 Super admin dashboard Week 5-6
Full tenant list: all fields, subscription status, days remaining, payment history
Live/Shutdown toggle, expiry alerts, Razorpay transaction view
Provisioning job monitor — see in-flight and failed provisioning attempts
BullMQ Bull Board embedded in admin panel
Data request management (export / deletion requests from tenants)
PHASE 5 Monitoring + observability Week 7
Sentry: errors, slow transactions, payment webhook failures
Internal uptime checker + external BetterUptime on main domain
Health endpoint: DB, Redis, MinIO, queue depths, worker status
Job failure alerting: any BullMQ job hitting dead-letter queue → Slack alert to Nimit
Structured logging: provisioning, webhooks, cron, admin actions
pg_stat_statements monitoring — alert on catalog query degradation
PHASE 6 Custom domains Week 8
Custom domain input in CMS, async Coolify API provisioning
ssl_status polling: pending → verifying → active, shown in CMS with instructions
Honest UX: "Your domain will be live with HTTPS within 10–15 minutes" — not 60 seconds
DNS instruction generator (A record pointing to server IP, shown inline)
Domain verification check: ping tenant domain, confirm it resolves to our server before requesting cert
PHASE 7 Scale & harden When needed
Load test: 100 concurrent site visitors, target <200ms p95
Security audit: JWT rotation, rate limiting review, dependency scan
Read replica for PostgreSQL (at 500+ tenants)
Cloudflare CDN in front of platform for static asset edge delivery
Cold standby VPS DR test — full restore drill, measure actual RTO
Backup verification monthly cron — restore and verify row counts
Schema catalog monitoring — alert if pg_namespace count exceeds 5000
Note: PgBouncer and BullMQ are no longer in this phase — they were moved to Phase 1 where they belong.