Migrating off Supabase the boring way: native Postgres logical replication to RDS
Apr 19 • From Duc Duong
Migrating off Supabase the boring way: native Postgres logical replication to RDS
How we mirrored a managed Supabase project into our own RDS Postgres 17 without DMS, without downtime, and for about $4.3k/yr less than the path AWS recommends.
The setup
A client's production workload had accumulated about 1.5 GB of operational state across 23 tables on a Supabase Postgres instance. We needed to migrate it into our own VPC, piece by piece, without ever taking the source offline. Writes from third-party integrations happen around the clock; nobody was going to schedule a freeze window.
So the shape of the problem is: build a continuously-updated mirror in our Postgres, run the existing app against Supabase for now, and once a given slice of functionality is moved into our stack, stop replicating that slice.
Classic strangler pattern. The trick is getting the mirror right, because every future step depends on it being faithful.
Three paths, one obvious pick
flowchart TD Start[Continuous mirror<br/>Supabase → our VPC] --> Q1{What do we want<br/>to operate?} Q1 -->|Cheap, loose RPO| Dump[Periodic pg_dump] Q1 -->|Managed middleware| DMS[AWS DMS] Q1 -->|Core Postgres only| Native[Native logical<br/>replication] Dump --> DumpOut[Rejected:<br/>hourly dumps mean<br/>hourly inconsistency<br/>and live read bursts] DMS --> DMSCost[~$215/mo dev<br/>~$600/mo prod] DMS --> DMSQuirk[Supabase pooler breaks it<br/>IPv4 add-on still needed] DMS --> DMSOut[Rejected:<br/>paying for middleware<br/>we don't need] Native --> NatGood[No extra compute] Native --> NatGood2[Public mechanism,<br/>public docs] Native --> NatCost[~$80/mo dev<br/>~$350/mo prod] Native --> NatBuild[Monitoring is on us] Native --> NatPick[Picked] classDef out fill:#fce4ec,stroke:#c2185b,color:#000 classDef pick fill:#e8f5e9,stroke:#2e7d32,color:#000 class DumpOut,DMSOut out class NatPick pickWe went in thinking DMS was the default answer. A week of evaluating it made us reconsider. The honest summary: DMS is a well-built piece of software, and if your relationship with the source database is adversarial (vendor opacity, bad DBA coverage, compliance audits) that $600/mo is buying you a number to call. We had none of that. What we had was 23 tables, one client, one team, and a Postgres source that speaks the same native replication protocol as our Postgres target. The middleware wasn't solving a problem we actually had.
Annualised savings vs the DMS path, after actually adding up VPC endpoints and multi-AZ Lambda costs: about $4.3k/yr. Not eye-popping money, but it buys you not having another service to own.
How it actually works
flowchart LR subgraph SB["Supabase (source)"] SRC_WAL[(WAL)] SRC_PUB[Publication<br/>mirror<br/>23 tables<br/>publish=i/u/d/t] SRC_SLOT[Replication slot<br/>mirror] SRC_HB[heartbeat table] SRC_WAL --> SRC_PUB SRC_PUB --> SRC_SLOT end subgraph RDS["RDS Postgres 17 (target)"] SUB[Subscription<br/>mirror<br/>disable_on_error=true] APPLY[Apply worker<br/>long-lived<br/>1 per subscription] SYNC[Sync workers<br/>short-lived<br/>N per COPY wave] TGT_TABLES[(23 target tables)] SUB --> APPLY SUB --> SYNC APPLY --> TGT_TABLES SYNC --> TGT_TABLES end subgraph MON["AWS monitoring"] HB_LAMBDA[Heartbeat Lambda<br/>5-min schedule] METRICS_LAMBDA[Metrics Lambda<br/>1-min schedule] CW[CloudWatch<br/>9 metrics + 10 alarms] end SRC_SLOT -- WAL stream<br/>TCP/5432 --> APPLY APPLY -. LSN ACK .-> SRC_SLOT HB_LAMBDA -- UPDATE heartbeat --> SRC_HB SRC_HB -. replicated via slot .-> TGT_TABLES METRICS_LAMBDA -- scrape catalogs --> SRC_SLOT METRICS_LAMBDA -- scrape catalogs --> SUB METRICS_LAMBDA --> CW classDef src fill:#e3f2fd,stroke:#1565c0,color:#000 classDef tgt fill:#f3e5f5,stroke:#6a1b9a,color:#000 classDef mon fill:#fff3e0,stroke:#e65100,color:#000 class SRC_WAL,SRC_PUB,SRC_SLOT,SRC_HB src class SUB,APPLY,SYNC,TGT_TABLES tgt class HB_LAMBDA,METRICS_LAMBDA,CW monThe replication worker lives inside the target RDS instance. The walsender lives inside the source Supabase instance. There's no box in the middle and nothing to pay for in the middle. One TCP connection, one slot, commit order preserved by construction. It's almost anticlimactic once you have it running.
The three ways this will lie to you
Most tutorials on Postgres logical replication cover the DDL fine and wave vaguely at the operational bits. We spent roughly a day on each of the three failure modes below. All three share a property that makes them particularly annoying: the catalogs will tell you the subscription is healthy while the data is quietly wrong.
Trap #1 — the source is IPv6-only
Supabase's direct database hostname resolves to IPv6. Your RDS subscriber lives in a VPC that, almost certainly, has no IPv6 CIDR. First attempt at CREATE SUBSCRIPTION returns:
ERROR: could not connect to the publisher: Network is unreachableIs the server running on host "db.xxx.supabase.co"(2600:1f1c:f9f:b100:...) and accepting TCP/IP connections on port 5432?The natural next thought is: "fine, I'll use Supabase's connection pooler, that has an IPv4." This is a trap.
sequenceDiagram participant RDS as RDS subscriber participant DNS as Supabase DNS participant SRC as Supabase direct DB participant POOL as Supavisor pooler Note over RDS,SRC: First try — direct host RDS->>DNS: resolve db.xxx.supabase.co DNS-->>RDS: AAAA only RDS->>SRC: CONNECT (IPv6) SRC-->>RDS: Network is unreachable Note over RDS,POOL: "Let me just use the pooler" RDS->>POOL: CREATE SUBSCRIPTION via pooler POOL->>POOL: silently strips<br/>replication=database POOL-->>RDS: "subscription created" Note over RDS: catalogs show subscription exists<br/>srsubstate stuck at 'i' foreverSupavisor is Supabase's session-mode pooler. It accepts the DDL, tells you the subscription was created, and drops the one connection parameter (replication=database) that the subscription actually needs. The subscription object exists in your catalogs. It will never successfully connect to anything.
Fix: enable Supabase's IPv4 Address add-on (dashboard setting, $4/mo). The direct hostname then returns both A and AAAA records. Don't use the pooler.
Trap #2 — RLS eats your COPY silently
Supabase turns on Row-Level Security on every public table by default. This is normally the right call — it's what lets customers expose their database to their front-end without leaking other tenants' data. It is also what happens next:
sequenceDiagram participant SUB as Subscriber participant WS as Source walsender participant COPY as Source COPY participant RLS as RLS policy check participant TBL as source_table (500k rows) SUB->>WS: CREATE SUBSCRIPTION (copy_data=true) WS->>COPY: COPY source_table TO STDOUT COPY->>RLS: SELECT check for replication role RLS-->>COPY: no policy grants access COPY-->>WS: 0 rows WS-->>SUB: stream ends SUB->>SUB: mark srsubstate='r' (ready) Note over SUB,TBL: Target table: 0 rows.<br/>Source table: 500k rows.<br/>Catalogs: "everything is fine."COPY on the source runs through RLS. A plain replication role has no policy granting SELECT, so COPY honestly sends zero rows. The subscriber honestly receives zero rows, marks the table r (ready), and continues. Every health check is green. Twenty of the 23 tables on our target were completely empty. The other three lived in schemas where RLS wasn't enabled, so they worked. That split was what finally tipped us off — the "some tables work, some don't, and they happen to be in different schemas" pattern is a dead giveaway in retrospect.
Fix: grant the replication role BYPASSRLS at creation.
CREATE ROLE replicator WITH LOGIN REPLICATION BYPASSRLS PASSWORD '<secret>'; -- Verify (this is the query that would have saved us a day):SELECT rolname, rolbypassrls FROM pg_roles WHERE rolname='replicator';Trap #3 — TRUNCATE is not in the default publication
By default, CREATE PUBLICATION ... FOR TABLE ... replicates insert, update, delete. It does not replicate truncate. Read that twice; it costs real money.
We have a few tables on the source that refresh their contents on a schedule via a TRUNCATE + INSERT pattern. What happens:
sequenceDiagram participant JOB as Source refresh job (daily) participant SRC as Source table participant PUB as Publication participant TGT as Target table Note over SRC,TGT: Day 0 — baseline 40k rows each side JOB->>SRC: TRUNCATE JOB->>SRC: INSERT 40k new rows SRC->>PUB: TRUNCATE event PUB--xTGT: dropped (default doesn't include truncate) SRC->>PUB: INSERT events × 40k PUB->>TGT: apply 40k INSERTs Note over TGT: target: 80k rows. Hmm. JOB->>SRC: next day's refresh JOB->>SRC: TRUNCATE + INSERT 40k SRC->>PUB: TRUNCATE (dropped again) SRC->>PUB: 40k INSERTs PUB->>TGT: apply 40k INSERTs Note over TGT: target: 120k rows. An integer multiple. Uh oh.You don't notice until a JOIN in your new code returns more rows than you expected, and you start poking around, and eventually you run a SELECT count(*) on both sides and the number on the right is some round multiple of the number on the left. It's maddening because every operational alarm stays green. Slot is active, no apply errors, heartbeat fresh, row states are r, WAL lag is zero. The mirror is just... wrong.
Fix:
CREATE PUBLICATION mirror FOR TABLE schema.table_a, schema.table_b, ... WITH (publish = 'insert, update, delete, truncate'); -- Verify:SELECT pubname, pubinsert, pubupdate, pubdelete, pubtruncate FROM pg_publication WHERE pubname='mirror';The per-table state machine (and why it lies)
Initial sync is per-table; steady-state streaming is per-subscription.
stateDiagram-v2 [*] --> i: table added to<br/>subscription i --> d: sync worker<br/>spawns d --> s: COPY finishes s --> r: catches up to<br/>apply worker LSN r --> [*]: table dropped<br/>from publication note right of d sync worker holds a brief AccessShareLock during COPY end note note right of r apply worker now handles this table alongside all others in source commit order end notepg_subscription_rel.srsubstate reports exactly this progression. The docs imply that reaching r means everything is fine. It doesn't. Reaching r proves that COPY didn't throw an error. It does not prove COPY returned the right rows — that's precisely what burned us in trap #2. The state machine is a necessary condition for correctness, not a sufficient one.
One more thing that surprised us: once every table reaches r, there is exactly one apply worker for the entire subscription. Not one per table. A single long-lived process receives the decoded WAL stream, figures out which table each change belongs to, and applies them in the order they committed on the source. That's how commit ordering is preserved. It also means parallelism during steady state is bounded by that single worker — which turned out to be plenty for our volume but might matter if you're pushing tens of thousands of writes per second.
Knowing the mirror is actually correct
We built monitoring around one idea: trust round-tripped data, not state transitions.
The heartbeat trick
A tiny table, one row, updated on the source every five minutes. The whole point is that reading this value on the target tells you something no catalog query ever will: that the complete end-to-end path is alive.
sequenceDiagram participant EB as EventBridge<br/>5-min rule participant HL as Heartbeat Lambda participant SRC as Supabase<br/>heartbeat table participant SLOT as Slot + WAL participant APPLY as RDS apply worker participant TGT as RDS<br/>heartbeat table participant ML as Metrics Lambda participant CW as CloudWatch alarm EB->>HL: tick HL->>SRC: UPDATE written_at = now() SRC->>SLOT: WAL record SLOT->>APPLY: stream (~seconds) APPLY->>TGT: apply UPDATE ML->>TGT: SELECT now() - written_at TGT-->>ML: HeartbeatLagSeconds ML->>CW: PutMetricData CW->>CW: alarm if > 60s for 10 minIf that value is fresh on the target, then in a single observation we've verified:
- The walsender on Supabase is connected and decoding.
- The apply worker on RDS is running.
- The TCP path between them is healthy.
- The publication still includes the heartbeat table (drop it and lag climbs — which is how we'd find out).
One metric. End-to-end. Bounded steady-state value. It's doing the work of roughly fifteen dashboard tiles.
There's a second job the heartbeat is doing quietly: it keeps the replication slot advancing during idle periods. Supabase pins max_slot_wal_keep_size at around 4 GB, and they do not expose it for negotiation. (We asked.) If the subscriber is connected but there's no traffic on the source, the slot's confirmed_flush_lsn doesn't move and WAL starts accumulating. Five minutes of poking the heartbeat table is enough to keep that pointer moving through dead weekends.
Nine metrics from the catalogs
flowchart LR subgraph SOURCE[Source catalogs] SC1[pg_replication_slots] SC2[pg_stat_replication] end subgraph TARGET[Target catalogs] TC1[pg_subscription] TC2[pg_stat_subscription] TC3[pg_stat_subscription_stats] TC4[heartbeat row] TC5[pg_publication] end subgraph M[Metrics Lambda — 1 min] direction TB M1[SlotActive] M2[SlotBacklogBytes] M3[SlotInvalidationReason] M4[SubscriptionEnabled] M5[SubscriptionWorkerCount] M6[ApplyErrorCount] M7[SyncErrorCount] M8[PublicationHasHeartbeat] M9[ReceivedLagBytes] end SC1 --> M1 SC1 --> M2 SC1 --> M3 SC2 --> M9 TC1 --> M4 TC2 --> M5 TC3 --> M6 TC3 --> M7 TC5 --> M8 M --> A[10 CloudWatch alarms]Alarm thresholds are tuned against Supabase's 4 GB cap. We page at SlotBacklogBytes > 2 GB — halfway to invalidation, so there's time to investigate before Supabase defensively drops the slot out from under us.
Content integrity is a separate problem
The metrics above tell you the mechanism is healthy. They don't tell you the content is correct. For that, we just issue SELECT count(*) on both sides. Unglamorous, works:
TABLE SOURCE TARGET DIFF----- ------ ------ ----schema_a.table_1 432 432 okschema_a.table_2 14821 14821 okschema_a.table_3 500929 500929 okschema_a.table_4 2277593 2277593 okschema_b.table_5 40210 40210 ok...RESULT: EXACT MATCH — 23/23 tablesRun after every config change. Run before every cutover. Do not trust pg_stat_user_tables.n_live_tup — it's a statistics estimate, not a row count, and on a busy table it can be wrong by 10-20% in either direction.
What it costs the source
We measured during dry-run week.
flowchart LR subgraph SS[Steady state — negligible] direction TB S1[App query latency: no change we could measure] S2[CPU: 2–4% sustained] S3[Network egress: under 1 Mbps] S4[WAL volume: ~1.15× baseline<br/>REPLICA IDENTITY FULL on 6 tables] S5[Disk: under 100 MB slot pin] S6[Connection budget: 1 walsender] end subgraph IC[Initial COPY — one-time, weekend job] direction TB I1[Duration: 1h 40m for 1.5 GB] I2[p99 read latency: +5ms<br/>during biggest-table COPY] I3[Zero write blocking] end subgraph FM[Worst case — bounded by Supabase's own cap] direction TB F1[Stuck slot accumulates WAL] F2[Supabase caps at ~4 GB] F3[At cap: slot invalidated,<br/>source protected] F4[We pay: 2–4h re-sync] F1 --> F2 --> F3 --> F4 end classDef ok fill:#e8f5e9,stroke:#2e7d32,color:#000 classDef ic fill:#fff3e0,stroke:#e65100,color:#000 classDef bad fill:#ffebee,stroke:#c62828,color:#000 class S1,S2,S3,S4,S5,S6 ok class I1,I2,I3 ic class F1,F2,F3,F4 badThe initial COPY is the only phase users could theoretically notice, and only the p99 reads on the biggest table during its own COPY window moved at all. It returned to baseline a minute after the COPY finished. Writes are never blocked — commit on the source does not wait for the subscriber to acknowledge anything. That's the whole point of async logical replication.
The failure mode on the right side of the diagram deserves one more sentence. We can't cause a disk-full outage on the source, because Supabase's 4 GB cap is a floor we can't dig below. At worst, Supabase invalidates the slot and we re-sync. The cap feels like a constraint when you first hit it; it's actually doing you a favour.
The cutover loop
The mirror is the stable ground for strangling out each slice of functionality. Per slice:
flowchart TD Start[Pick next domain] --> Val[Row-count validate<br/>source vs target] Val -->|mismatch| Fix[Fix first] Val -->|match| Deploy[Deploy new code path:<br/>read + write our DB] Deploy --> Cut[Cut integration writes<br/>from source to our DB] Cut --> Drop[ALTER PUBLICATION DROP TABLE] Drop --> ReVal[Re-validate the rest] ReVal -->|all match| Done[Slice decommissioned] ReVal -->|mismatch| Rollback Fix --> Val Rollback[ADD TABLE back +<br/>REFRESH PUBLICATION<br/>WITH copy_data=true] --> Val classDef safe fill:#e8f5e9,stroke:#2e7d32,color:#000 classDef danger fill:#fff3e0,stroke:#e65100,color:#000 class Done safe class Fix,Rollback dangerEvery step on the happy path is a single DDL statement, and every one has a symmetric rollback. That's the property we were really paying for when we picked native replication over DMS — we own every piece and can surgically undo any step.
Once the last table is strangled out, we drop the subscription, drop the slot, decommission the RDS. The Supabase project can be deleted by the client at their leisure.
Things we'd do differently
- Stand up the dry-run first. All three traps above were discovered on a scratch Supabase project restored from a production snapshot. That exercise cost two days. It would have cost two weeks if we'd hit them for the first time in production.
- Write the content-integrity check before anything else. We wrote the catalog metrics first because they felt more engineering-y. They're also what lied to us.
- Stop trying to negotiate
max_slot_wal_keep_size. We spent an afternoon on this. It's not a conversation Supabase is interested in having. Once we accepted the cap as a property of the environment, the alarm thresholds and heartbeat cadence basically designed themselves.
Numbers
| Metric | Value |
|---|---|
| Source → target tables | 23 |
| Source database size | 1.5 GB |
| Initial COPY duration | 1h 40m |
| End-to-end replication lag (steady state) | ~10 seconds |
| Monthly cost, dev | ~$80 |
| Monthly cost, prod | ~$350 |
| Annualised savings vs AWS DMS (prod) | ~$4.3k |
| Additional compute | 2× 128 MB Lambdas |
| Custom CloudWatch metrics | 9 |
| CloudWatch alarms | 10 |
| Lines of Python | ~200 |
| Lines of Terraform | ~900 |
| Hours of client-side DBA time, steady state | 0 |
Closing
Logical replication in Postgres isn't magic. It's a set of primitives — publications, slots, walsenders, apply workers — that have shipped in the core database since version 10. The AWS-recommended path wraps those same primitives in managed compute and charges about $7k/yr for the wrapper. For a source we understood well and a target we owned, the wrapper wasn't earning its keep.
What we bought ourselves instead is roughly 200 lines of Python across two Lambdas, 900 lines of Terraform, three bash scripts, and a runbook we'll actually read when the pager goes off. One PR. The "something is wrong" signal on the whole system is a single alarm: HeartbeatLagSeconds > 60s for 10 min. If that's green, the mirror is faithful. If it's red, the runbook tells you exactly what to check and in what order.
The boring answer was the right one, which — as these things go — is pretty often the way it works out.