Migrating off Supabase the boring way: native Postgres logical replication to RDS

Apr 19 • From Duc Duong

Migrating off Supabase the boring way: native Postgres logical replication to RDS

How we mirrored a managed Supabase project into our own RDS Postgres 17 without DMS, without downtime, and for about $4.3k/yr less than the path AWS recommends.

The setup

A client's production workload had accumulated about 1.5 GB of operational state across 23 tables on a Supabase Postgres instance. We needed to migrate it into our own VPC, piece by piece, without ever taking the source offline. Writes from third-party integrations happen around the clock; nobody was going to schedule a freeze window.

So the shape of the problem is: build a continuously-updated mirror in our Postgres, run the existing app against Supabase for now, and once a given slice of functionality is moved into our stack, stop replicating that slice.

Classic strangler pattern. The trick is getting the mirror right, because every future step depends on it being faithful.

Three paths, one obvious pick

flowchart TD    Start[Continuous mirror<br/>Supabase → our VPC] --> Q1{What do we want<br/>to operate?}    Q1 -->|Cheap, loose RPO| Dump[Periodic pg_dump]    Q1 -->|Managed middleware| DMS[AWS DMS]    Q1 -->|Core Postgres only| Native[Native logical<br/>replication]     Dump --> DumpOut[Rejected:<br/>hourly dumps mean<br/>hourly inconsistency<br/>and live read bursts]     DMS --> DMSCost[~$215/mo dev<br/>~$600/mo prod]    DMS --> DMSQuirk[Supabase pooler breaks it<br/>IPv4 add-on still needed]    DMS --> DMSOut[Rejected:<br/>paying for middleware<br/>we don't need]     Native --> NatGood[No extra compute]    Native --> NatGood2[Public mechanism,<br/>public docs]    Native --> NatCost[~$80/mo dev<br/>~$350/mo prod]    Native --> NatBuild[Monitoring is on us]    Native --> NatPick[Picked]     classDef out fill:#fce4ec,stroke:#c2185b,color:#000    classDef pick fill:#e8f5e9,stroke:#2e7d32,color:#000    class DumpOut,DMSOut out    class NatPick pick

We went in thinking DMS was the default answer. A week of evaluating it made us reconsider. The honest summary: DMS is a well-built piece of software, and if your relationship with the source database is adversarial (vendor opacity, bad DBA coverage, compliance audits) that $600/mo is buying you a number to call. We had none of that. What we had was 23 tables, one client, one team, and a Postgres source that speaks the same native replication protocol as our Postgres target. The middleware wasn't solving a problem we actually had.

Annualised savings vs the DMS path, after actually adding up VPC endpoints and multi-AZ Lambda costs: about $4.3k/yr. Not eye-popping money, but it buys you not having another service to own.

How it actually works

flowchart LR    subgraph SB["Supabase (source)"]        SRC_WAL[(WAL)]        SRC_PUB[Publication<br/>mirror<br/>23 tables<br/>publish=i/u/d/t]        SRC_SLOT[Replication slot<br/>mirror]        SRC_HB[heartbeat table]        SRC_WAL --> SRC_PUB        SRC_PUB --> SRC_SLOT    end     subgraph RDS["RDS Postgres 17 (target)"]        SUB[Subscription<br/>mirror<br/>disable_on_error=true]        APPLY[Apply worker<br/>long-lived<br/>1 per subscription]        SYNC[Sync workers<br/>short-lived<br/>N per COPY wave]        TGT_TABLES[(23 target tables)]        SUB --> APPLY        SUB --> SYNC        APPLY --> TGT_TABLES        SYNC --> TGT_TABLES    end     subgraph MON["AWS monitoring"]        HB_LAMBDA[Heartbeat Lambda<br/>5-min schedule]        METRICS_LAMBDA[Metrics Lambda<br/>1-min schedule]        CW[CloudWatch<br/>9 metrics + 10 alarms]    end     SRC_SLOT -- WAL stream<br/>TCP/5432 --> APPLY    APPLY -. LSN ACK .-> SRC_SLOT     HB_LAMBDA -- UPDATE heartbeat --> SRC_HB    SRC_HB -. replicated via slot .-> TGT_TABLES    METRICS_LAMBDA -- scrape catalogs --> SRC_SLOT    METRICS_LAMBDA -- scrape catalogs --> SUB    METRICS_LAMBDA --> CW     classDef src fill:#e3f2fd,stroke:#1565c0,color:#000    classDef tgt fill:#f3e5f5,stroke:#6a1b9a,color:#000    classDef mon fill:#fff3e0,stroke:#e65100,color:#000    class SRC_WAL,SRC_PUB,SRC_SLOT,SRC_HB src    class SUB,APPLY,SYNC,TGT_TABLES tgt    class HB_LAMBDA,METRICS_LAMBDA,CW mon

The replication worker lives inside the target RDS instance. The walsender lives inside the source Supabase instance. There's no box in the middle and nothing to pay for in the middle. One TCP connection, one slot, commit order preserved by construction. It's almost anticlimactic once you have it running.

The three ways this will lie to you

Most tutorials on Postgres logical replication cover the DDL fine and wave vaguely at the operational bits. We spent roughly a day on each of the three failure modes below. All three share a property that makes them particularly annoying: the catalogs will tell you the subscription is healthy while the data is quietly wrong.

Trap #1 — the source is IPv6-only

Supabase's direct database hostname resolves to IPv6. Your RDS subscriber lives in a VPC that, almost certainly, has no IPv6 CIDR. First attempt at CREATE SUBSCRIPTION returns:

ERROR: could not connect to the publisher: Network is unreachableIs the server running on host "db.xxx.supabase.co"(2600:1f1c:f9f:b100:...) and accepting TCP/IP connections on port 5432?

The natural next thought is: "fine, I'll use Supabase's connection pooler, that has an IPv4." This is a trap.

sequenceDiagram    participant RDS as RDS subscriber    participant DNS as Supabase DNS    participant SRC as Supabase direct DB    participant POOL as Supavisor pooler     Note over RDS,SRC: First try — direct host    RDS->>DNS: resolve db.xxx.supabase.co    DNS-->>RDS: AAAA only    RDS->>SRC: CONNECT (IPv6)    SRC-->>RDS: Network is unreachable     Note over RDS,POOL: "Let me just use the pooler"    RDS->>POOL: CREATE SUBSCRIPTION via pooler    POOL->>POOL: silently strips<br/>replication=database    POOL-->>RDS: "subscription created"    Note over RDS: catalogs show subscription exists<br/>srsubstate stuck at 'i' forever

Supavisor is Supabase's session-mode pooler. It accepts the DDL, tells you the subscription was created, and drops the one connection parameter (replication=database) that the subscription actually needs. The subscription object exists in your catalogs. It will never successfully connect to anything.

Fix: enable Supabase's IPv4 Address add-on (dashboard setting, $4/mo). The direct hostname then returns both A and AAAA records. Don't use the pooler.

Trap #2 — RLS eats your COPY silently

Supabase turns on Row-Level Security on every public table by default. This is normally the right call — it's what lets customers expose their database to their front-end without leaking other tenants' data. It is also what happens next:

sequenceDiagram    participant SUB as Subscriber    participant WS as Source walsender    participant COPY as Source COPY    participant RLS as RLS policy check    participant TBL as source_table (500k rows)     SUB->>WS: CREATE SUBSCRIPTION (copy_data=true)    WS->>COPY: COPY source_table TO STDOUT    COPY->>RLS: SELECT check for replication role    RLS-->>COPY: no policy grants access    COPY-->>WS: 0 rows    WS-->>SUB: stream ends    SUB->>SUB: mark srsubstate='r' (ready)    Note over SUB,TBL: Target table: 0 rows.<br/>Source table: 500k rows.<br/>Catalogs: "everything is fine."

COPY on the source runs through RLS. A plain replication role has no policy granting SELECT, so COPY honestly sends zero rows. The subscriber honestly receives zero rows, marks the table r (ready), and continues. Every health check is green. Twenty of the 23 tables on our target were completely empty. The other three lived in schemas where RLS wasn't enabled, so they worked. That split was what finally tipped us off — the "some tables work, some don't, and they happen to be in different schemas" pattern is a dead giveaway in retrospect.

Fix: grant the replication role BYPASSRLS at creation.

CREATE ROLE replicator WITH  LOGIN REPLICATION BYPASSRLS  PASSWORD '<secret>'; -- Verify (this is the query that would have saved us a day):SELECT rolname, rolbypassrls FROM pg_roles WHERE rolname='replicator';

Trap #3 — `TRUNCATE` is not in the default publication

By default, CREATE PUBLICATION ... FOR TABLE ... replicates insert, update, delete. It does not replicate truncate. Read that twice; it costs real money.

We have a few tables on the source that refresh their contents on a schedule via a TRUNCATE + INSERT pattern. What happens:

sequenceDiagram    participant JOB as Source refresh job (daily)    participant SRC as Source table    participant PUB as Publication    participant TGT as Target table     Note over SRC,TGT: Day 0 — baseline 40k rows each side    JOB->>SRC: TRUNCATE    JOB->>SRC: INSERT 40k new rows    SRC->>PUB: TRUNCATE event    PUB--xTGT: dropped (default doesn't include truncate)    SRC->>PUB: INSERT events × 40k    PUB->>TGT: apply 40k INSERTs    Note over TGT: target: 80k rows. Hmm.     JOB->>SRC: next day's refresh    JOB->>SRC: TRUNCATE + INSERT 40k    SRC->>PUB: TRUNCATE (dropped again)    SRC->>PUB: 40k INSERTs    PUB->>TGT: apply 40k INSERTs    Note over TGT: target: 120k rows. An integer multiple. Uh oh.

You don't notice until a JOIN in your new code returns more rows than you expected, and you start poking around, and eventually you run a SELECT count(*) on both sides and the number on the right is some round multiple of the number on the left. It's maddening because every operational alarm stays green. Slot is active, no apply errors, heartbeat fresh, row states are r, WAL lag is zero. The mirror is just... wrong.

Fix:

CREATE PUBLICATION mirror  FOR TABLE schema.table_a, schema.table_b, ...  WITH (publish = 'insert, update, delete, truncate'); -- Verify:SELECT pubname, pubinsert, pubupdate, pubdelete, pubtruncate  FROM pg_publication WHERE pubname='mirror';

The per-table state machine (and why it lies)

Initial sync is per-table; steady-state streaming is per-subscription.

stateDiagram-v2    [*] --> i: table added to<br/>subscription    i --> d: sync worker<br/>spawns    d --> s: COPY finishes    s --> r: catches up to<br/>apply worker LSN    r --> [*]: table dropped<br/>from publication     note right of d        sync worker holds a brief        AccessShareLock during COPY    end note     note right of r        apply worker now handles        this table alongside all others        in source commit order    end note

pg_subscription_rel.srsubstate reports exactly this progression. The docs imply that reaching r means everything is fine. It doesn't. Reaching r proves that COPY didn't throw an error. It does not prove COPY returned the right rows — that's precisely what burned us in trap #2. The state machine is a necessary condition for correctness, not a sufficient one.

One more thing that surprised us: once every table reaches r, there is exactly one apply worker for the entire subscription. Not one per table. A single long-lived process receives the decoded WAL stream, figures out which table each change belongs to, and applies them in the order they committed on the source. That's how commit ordering is preserved. It also means parallelism during steady state is bounded by that single worker — which turned out to be plenty for our volume but might matter if you're pushing tens of thousands of writes per second.

Knowing the mirror is actually correct

We built monitoring around one idea: trust round-tripped data, not state transitions.

The heartbeat trick

A tiny table, one row, updated on the source every five minutes. The whole point is that reading this value on the target tells you something no catalog query ever will: that the complete end-to-end path is alive.

sequenceDiagram    participant EB as EventBridge<br/>5-min rule    participant HL as Heartbeat Lambda    participant SRC as Supabase<br/>heartbeat table    participant SLOT as Slot + WAL    participant APPLY as RDS apply worker    participant TGT as RDS<br/>heartbeat table    participant ML as Metrics Lambda    participant CW as CloudWatch alarm     EB->>HL: tick    HL->>SRC: UPDATE written_at = now()    SRC->>SLOT: WAL record    SLOT->>APPLY: stream (~seconds)    APPLY->>TGT: apply UPDATE    ML->>TGT: SELECT now() - written_at    TGT-->>ML: HeartbeatLagSeconds    ML->>CW: PutMetricData    CW->>CW: alarm if > 60s for 10 min

If that value is fresh on the target, then in a single observation we've verified:

The walsender on Supabase is connected and decoding.
The apply worker on RDS is running.
The TCP path between them is healthy.
The publication still includes the heartbeat table (drop it and lag climbs — which is how we'd find out).

One metric. End-to-end. Bounded steady-state value. It's doing the work of roughly fifteen dashboard tiles.

There's a second job the heartbeat is doing quietly: it keeps the replication slot advancing during idle periods. Supabase pins max_slot_wal_keep_size at around 4 GB, and they do not expose it for negotiation. (We asked.) If the subscriber is connected but there's no traffic on the source, the slot's confirmed_flush_lsn doesn't move and WAL starts accumulating. Five minutes of poking the heartbeat table is enough to keep that pointer moving through dead weekends.

Nine metrics from the catalogs

flowchart LR    subgraph SOURCE[Source catalogs]        SC1[pg_replication_slots]        SC2[pg_stat_replication]    end     subgraph TARGET[Target catalogs]        TC1[pg_subscription]        TC2[pg_stat_subscription]        TC3[pg_stat_subscription_stats]        TC4[heartbeat row]        TC5[pg_publication]    end     subgraph M[Metrics Lambda — 1 min]        direction TB        M1[SlotActive]        M2[SlotBacklogBytes]        M3[SlotInvalidationReason]        M4[SubscriptionEnabled]        M5[SubscriptionWorkerCount]        M6[ApplyErrorCount]        M7[SyncErrorCount]        M8[PublicationHasHeartbeat]        M9[ReceivedLagBytes]    end     SC1 --> M1    SC1 --> M2    SC1 --> M3    SC2 --> M9    TC1 --> M4    TC2 --> M5    TC3 --> M6    TC3 --> M7    TC5 --> M8     M --> A[10 CloudWatch alarms]

Alarm thresholds are tuned against Supabase's 4 GB cap. We page at SlotBacklogBytes > 2 GB — halfway to invalidation, so there's time to investigate before Supabase defensively drops the slot out from under us.

Content integrity is a separate problem

The metrics above tell you the mechanism is healthy. They don't tell you the content is correct. For that, we just issue SELECT count(*) on both sides. Unglamorous, works:

TABLE                                          SOURCE          TARGET        DIFF-----                                          ------          ------        ----schema_a.table_1                                  432             432          okschema_a.table_2                                14821           14821          okschema_a.table_3                               500929          500929          okschema_a.table_4                              2277593         2277593          okschema_b.table_5                                40210           40210          ok...RESULT: EXACT MATCH — 23/23 tables

Run after every config change. Run before every cutover. Do not trust pg_stat_user_tables.n_live_tup — it's a statistics estimate, not a row count, and on a busy table it can be wrong by 10-20% in either direction.

What it costs the source

We measured during dry-run week.

flowchart LR    subgraph SS[Steady state — negligible]        direction TB        S1[App query latency: no change we could measure]        S2[CPU: 2–4% sustained]        S3[Network egress: under 1 Mbps]        S4[WAL volume: ~1.15× baseline<br/>REPLICA IDENTITY FULL on 6 tables]        S5[Disk: under 100 MB slot pin]        S6[Connection budget: 1 walsender]    end     subgraph IC[Initial COPY — one-time, weekend job]        direction TB        I1[Duration: 1h 40m for 1.5 GB]        I2[p99 read latency: +5ms<br/>during biggest-table COPY]        I3[Zero write blocking]    end     subgraph FM[Worst case — bounded by Supabase's own cap]        direction TB        F1[Stuck slot accumulates WAL]        F2[Supabase caps at ~4 GB]        F3[At cap: slot invalidated,<br/>source protected]        F4[We pay: 2–4h re-sync]        F1 --> F2 --> F3 --> F4    end     classDef ok fill:#e8f5e9,stroke:#2e7d32,color:#000    classDef ic fill:#fff3e0,stroke:#e65100,color:#000    classDef bad fill:#ffebee,stroke:#c62828,color:#000    class S1,S2,S3,S4,S5,S6 ok    class I1,I2,I3 ic    class F1,F2,F3,F4 bad

The initial COPY is the only phase users could theoretically notice, and only the p99 reads on the biggest table during its own COPY window moved at all. It returned to baseline a minute after the COPY finished. Writes are never blocked — commit on the source does not wait for the subscriber to acknowledge anything. That's the whole point of async logical replication.

The failure mode on the right side of the diagram deserves one more sentence. We can't cause a disk-full outage on the source, because Supabase's 4 GB cap is a floor we can't dig below. At worst, Supabase invalidates the slot and we re-sync. The cap feels like a constraint when you first hit it; it's actually doing you a favour.

The cutover loop

The mirror is the stable ground for strangling out each slice of functionality. Per slice:

flowchart TD    Start[Pick next domain] --> Val[Row-count validate<br/>source vs target]    Val -->|mismatch| Fix[Fix first]    Val -->|match| Deploy[Deploy new code path:<br/>read + write our DB]    Deploy --> Cut[Cut integration writes<br/>from source to our DB]    Cut --> Drop[ALTER PUBLICATION DROP TABLE]    Drop --> ReVal[Re-validate the rest]    ReVal -->|all match| Done[Slice decommissioned]    ReVal -->|mismatch| Rollback    Fix --> Val    Rollback[ADD TABLE back +<br/>REFRESH PUBLICATION<br/>WITH copy_data=true] --> Val     classDef safe fill:#e8f5e9,stroke:#2e7d32,color:#000    classDef danger fill:#fff3e0,stroke:#e65100,color:#000    class Done safe    class Fix,Rollback danger

Every step on the happy path is a single DDL statement, and every one has a symmetric rollback. That's the property we were really paying for when we picked native replication over DMS — we own every piece and can surgically undo any step.

Once the last table is strangled out, we drop the subscription, drop the slot, decommission the RDS. The Supabase project can be deleted by the client at their leisure.

Things we'd do differently

Stand up the dry-run first. All three traps above were discovered on a scratch Supabase project restored from a production snapshot. That exercise cost two days. It would have cost two weeks if we'd hit them for the first time in production.
Write the content-integrity check before anything else. We wrote the catalog metrics first because they felt more engineering-y. They're also what lied to us.
Stop trying to negotiate max_slot_wal_keep_size. We spent an afternoon on this. It's not a conversation Supabase is interested in having. Once we accepted the cap as a property of the environment, the alarm thresholds and heartbeat cadence basically designed themselves.

Numbers

Metric	Value
Source → target tables	23
Source database size	1.5 GB
Initial COPY duration	1h 40m
End-to-end replication lag (steady state)	~10 seconds
Monthly cost, dev	~$80
Monthly cost, prod	~$350
Annualised savings vs AWS DMS (prod)	~$4.3k
Additional compute	2× 128 MB Lambdas
Custom CloudWatch metrics	9
CloudWatch alarms	10
Lines of Python	~200
Lines of Terraform	~900
Hours of client-side DBA time, steady state	0

Closing

Logical replication in Postgres isn't magic. It's a set of primitives — publications, slots, walsenders, apply workers — that have shipped in the core database since version 10. The AWS-recommended path wraps those same primitives in managed compute and charges about $7k/yr for the wrapper. For a source we understood well and a target we owned, the wrapper wasn't earning its keep.

What we bought ourselves instead is roughly 200 lines of Python across two Lambdas, 900 lines of Terraform, three bash scripts, and a runbook we'll actually read when the pager goes off. One PR. The "something is wrong" signal on the whole system is a single alarm: HeartbeatLagSeconds > 60s for 10 min. If that's green, the mirror is faithful. If it's red, the runbook tells you exactly what to check and in what order.

The boring answer was the right one, which — as these things go — is pretty often the way it works out.

Back to Blog

Migrating off Supabase the boring way: native Postgres logical replication to RDS

Migrating off Supabase the boring way: native Postgres logical replication to RDS

The setup

Three paths, one obvious pick

How it actually works

The three ways this will lie to you

Trap #1 — the source is IPv6-only

Trap #2 — RLS eats your COPY silently

Trap #3 — `TRUNCATE` is not in the default publication

The per-table state machine (and why it lies)

Knowing the mirror is actually correct

The heartbeat trick

Nine metrics from the catalogs

Content integrity is a separate problem

What it costs the source

The cutover loop

Things we'd do differently

Numbers

Closing

The Latest

Harness Engineering - Lessons learned from Coding Agents

The Open-Source Toolkit That Stops AI Coding Agents From Shipping Sloppy Code

Migrating off Supabase the boring way: native Postgres logical replication to RDS

Migrating off Supabase the boring way: native Postgres logical replication to RDS

The setup

Three paths, one obvious pick

How it actually works

The three ways this will lie to you

Trap #1 — the source is IPv6-only

Trap #2 — RLS eats your COPY silently

Trap #3 — TRUNCATE is not in the default publication

The per-table state machine (and why it lies)

Knowing the mirror is actually correct

The heartbeat trick

Nine metrics from the catalogs

Content integrity is a separate problem

What it costs the source

The cutover loop

Things we'd do differently

Numbers

Closing

The Latest

Harness Engineering - Lessons learned from Coding Agents

The Open-Source Toolkit That Stops AI Coding Agents From Shipping Sloppy Code

Trap #3 — `TRUNCATE` is not in the default publication