Postgres — Logical Replication and long running transactions

Virender Singla
Dec 16, 2021
reference — highgo

I was exploring logical replication in Postgres with respect to long running transactions, to see how it replicates and behaves in different scenarios. For example:

  • How replication happens for a long running transaction.
  • How logical replication recovers after the Subscriber restarts.
  • Manually advancing the LSN value for a logical replication slot.

I ran a few tests covering the above scenarios and summarize the results in this article.

Setup:

Spin up two small RDS Instances.

Postgres RDS 11.13
Instance1: test1 — m5.large (2vCPU), 100GB GP2 volumes (300 IOPS)
Instance2: test2 — m5.large (2vCPU), 100GB GP2 volumes (300 IOPS)

I set up logical replication between the test1 database (publisher/source) and the test2 database (subscriber/target), then created two tables, test1 and test2, with autovacuum disabled.

Publisher/Subscriber:

create table test1(lsn pg_lsn, id int);
alter table test1 set (
  autovacuum_enabled = false, toast.autovacuum_enabled = false
);

create table test2(lsn pg_lsn, id int);
alter table test2 set (
  autovacuum_enabled = false, toast.autovacuum_enabled = false
);
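
For reference, here is a minimal sketch of how the publication and subscription could be created for this setup (the publication and subscription names match the ones that show up in the logs later in this article; the connection string is only a placeholder):

-- Publisher (test1)
create publication replication_ec2 for table test1, test2;

-- Subscriber (test2)
create subscription subscription_ec2
  connection 'host=<test1-endpoint> port=5432 dbname=postgres user=postgres password=***'
  publication replication_ec2;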

Case1: How replication happens for a long running transaction

Insert some data on the Publisher and see how it replicates to the Subscriber.

Publisher:

postgres=> insert into test1 select pg_current_wal_insert_lsn(), r from generate_series(1,100000000) as r;
INSERT 0 100000000
Time: 203169.104 ms (03:23.169)

 Schema | Name  | Type  |  Owner   | Persistence |  Size
--------+-------+-------+----------+-------------+---------
 public | test1 | table | postgres | permanent   | 4224 MB

I observed how the above transaction was replicated; here is the summary.

  • The Publisher streams data only after a transaction has been committed. If the transaction does not complete for some reason, the replication logs (mined WAL logs) are discarded and not sent to the Subscriber.
  • Replication logs are spilled to disk in case of huge data changes on the Publisher.
  • There is a CPU spike on the Publisher (even after the INSERT has completed), as logical decoding from the WAL logs and the subsequent data transfer take CPU cycles.
  • Storage space requirement: for the above case, the 4GB table generated around 4GB of WAL logs and then 12GB of replication logs, so in total around 20GB of storage was consumed on the Publisher to replicate this 4GB table.
  • The transaction took around 15 more minutes to appear on the Subscriber after it was committed on the Publisher. This time includes generating the replication logs, transferring those logs over the network and then applying them on the Subscriber.
CPU — Publisher
Storage — Publisher
Network Receive Throughput — Subscriber
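
One way to watch how far the slot lags behind during such a large transaction is to compare the current WAL position with the slot's confirmed flush position on the Publisher (a simple monitoring sketch, not part of the original test):

select slot_name,
       confirmed_flush_lsn,
       pg_current_wal_lsn() as current_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
from pg_replication_slots;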

So the next question came up: with multiple transactions, in what order does logical replication replay the data on the Subscriber?

Trx1: Starts at 10.00 on test1 (long running)
Trx2: Starts at 10.01 on test2
Trx2: Ends at 10.02
Trx1: Ends at 10.30
Trx2 appears on Subscriber quickly whereas Trx1 takes time.
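
This scenario can be reproduced with two psql sessions on the Publisher (a sketch; the row counts are only illustrative):

-- Session 1 (Trx1, long running)
begin;
insert into test1 select pg_current_wal_insert_lsn(), r from generate_series(1,100000000) as r;
-- keep the transaction open for a while, then
commit;

-- Session 2 (Trx2, quick, started after Trx1 and committed before it)
begin;
insert into test2 select pg_current_wal_insert_lsn(), r from generate_series(1,1000) as r;
commit;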

Summary:

Logical replication replays transactions on the target side in commit order. That makes sense because there could be foreign-key-like dependencies between transactions, so they should appear on the Subscriber database in the same order as on the Publisher side. In other words, a long running transaction does not block the replication of other quick transactions (although a long running transaction generates a lot of WAL, which can sometimes overwhelm the walsender process).

Case2: How logical replication recovers after Subscriber restarts

Publisher:

### truncate and reload the test table again
postgres=> truncate table test1;
TRUNCATE TABLE
postgres=> insert into test1 select pg_current_wal_insert_lsn(), r from generate_series(1,100000000) as r;
INSERT 0 100000000
Time: 213496.802 ms (03:33.497)

 Schema | Name  | Type  |  Owner   | Persistence |  Size
--------+-------+-------+----------+-------------+---------
 public | test1 | table | postgres | permanent   | 4224 MB

Also check for restart_lsn on Publisher:

Publisher:

select slot_name, plugin, slot_type, restart_lsn, confirmed_flush_lsn from pg_replication_slots;

    slot_name     |  plugin  | slot_type | restart_lsn | confirmed_flush_lsn
------------------+----------+-----------+-------------+---------------------
 subscription_ec2 | pgoutput | logical   | 6F/A8004C68 | 70/416E1998

What is restart_lsn? It is the start LSN of the oldest transaction that has not yet been fully replicated to the target side. Postgres will not delete WAL from this LSN onwards because it may still be required by the slot.
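
Based on that, the amount of WAL the Publisher has to retain for the slot can be estimated by comparing restart_lsn with the current WAL position (a sketch):

select slot_name,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots;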

So once the apply of the above transaction had started on the Subscriber (which can be checked via the table size) and the table size had reached about 2000MB (meaning roughly 50% of the apply was done), I rebooted the test2 database to see how recovery happens.

Subscriber:

postgres=> \dt+ test1
                        List of relations
 Schema | Name  | Type  |  Owner   | Persistence |  Size   | Description
--------+-------+-------+----------+-------------+---------+-------------
 public | test1 | table | postgres | permanent   | 2033 MB |

After the Subscriber comes back up, let's take a look at the logs.

Subscriber Logs:

recovered replication state of node 1 to 6F/A8004C68

Publisher Logs:

postgres@postgres:[9024]:LOG:  logical decoding found consistent point at 6F/A8004C68
postgres@postgres:[9024]:STATEMENT:  START_REPLICATION SLOT "subscription_ec2" LOGICAL 6F/A8004C68 (proto_version '1', publication_names '"replication_ec2"')

So looking at the logs above, replication of the test1 table restarts from restart_lsn (the start LSN of the transaction) and replays the partially applied (aborted) transaction again. This also resulted in a bigger table size (until vacuum runs): in total 2033MB (dead tuples) + 4224MB (live tuples) ~ 6GB.

Subscriber:

 Schema | Name  | Type  |  Owner   | Persistence |  Size
--------+-------+-------+----------+-------------+---------
 public | test1 | table | postgres | permanent   | 6252 MB
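
Since autovacuum is disabled on the test tables, the dead tuples left behind by the interrupted apply stay there until a manual vacuum. A quick way to confirm and clean up the bloat on the Subscriber (a sketch):

-- compare live vs dead tuples
select n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'test1';

-- mark the dead tuples reusable (does not shrink the data files)
vacuum test1;

-- to actually return the ~2GB to the OS the table has to be rewritten
-- (takes an ACCESS EXCLUSIVE lock)
vacuum full test1;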

One more thing I noticed here is that once the Subscriber goes down, the Publisher discards all the already generated replication logs, and once it reconnects to the Subscriber, it starts mining all the WAL logs again. Looking at this behavior, we can say that there is no checkpoint mechanism for log mining within a transaction, and that is why the Publisher has to retain all the WAL logs (from restart_lsn) until the transaction is fully replicated to the other side.

See the dip in replication slot disk usage graph:

Replication Slot Disk Usage (MB) — Publisher

So far so good. Now let’s do the above exercise for two overlapping transactions.

Trx1: Starts at 10.00 on test1 (long running)
Trx2: Starts at 10.01 on test2

Trx2: Ends at 10.02
Trx1: Ends at 10.30
Let's wait for Trx2 to appear on the target side and then reboot the test2 database (Subscriber) while Trx1 replication is still going on after commit.

Here I was looking at which consistent LSN point Postgres would choose as the recovery point. If it picked restart_lsn (71/74005068, the start of Trx1), then Trx2's data would appear twice in the target database (there is no PK defined). After the reboot, Postgres actually chooses a different LSN instead of restart_lsn and replicates Trx1's data fine, while Trx2's data, which was already there, is not sent again. So Postgres knows which transactions had already completed and which were still in progress when the Subscriber crashed.

Publisher Logs:

postgres@postgres:[24505]:STATEMENT:   START_REPLICATION SLOT "subscription_ec2" LOGICAL 71/A054E198  (proto_version '1', publication_names '"replication_ec2"')
postgres@postgres:[24505]:LOG: logical decoding found consistent point at 71/74005068
postgres@postgres:[24505]:DETAIL: There are no running transactions.
postgres@postgres:[24505]:STATEMENT: START_REPLICATION SLOT "subscription_ec2" LOGICAL 71/A054E198 (proto_version '1', publication_names '"replication_ec2"')
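
The reason the Subscriber can skip Trx2 here is that it tracks the Publisher commit LSN it has already applied through a replication origin. On the Subscriber, that position can be inspected with (a sketch):

select local_id, external_id, remote_lsn, local_lsn
from pg_replication_origin_status;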

Case3: Manually advancing LSN value for a logical replication slot

We can disable/enable a subscription and also advance its LSN. I did a few tests advancing the LSN value while a long running transaction was going on.

alter subscription subscription_ec2 disable;
SELECT pg_replication_origin_advance('pg_16545','61/380000D8');
alter subscription subscription_ec2 enable;

### Single Transaction
Disable subscription.
Trx1: Starts at 10.00 on test1
Take LSN value at 10.15:
  select pg_current_wal_insert_lsn();
Trx1: Ends at 10.30
Now advance the lsn value to the above noted lsn.
Enable subscription.

### Multiple Transactions
Disable subscription.
Trx1: Starts at 10.00 on test1
Trx2: Starts at 10.01 on test2
Trx2: Ends at 10.03
Take LSN value.
Trx1: Ends at 10.30
Now advance the lsn value to the above noted lsn.
Enable subscription.
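
A note on the origin name used with pg_replication_origin_advance above: pg_16545 is the replication origin that the subscription created on the Subscriber (the name is 'pg_' followed by the subscription's OID), and the advance is run on the Subscriber while the subscription is disabled. The origin can be looked up, for example, with:

-- replication origins on the Subscriber
select roident, roname from pg_replication_origin;

-- map the origin name back to the subscription
select oid, subname from pg_subscription;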

In both cases Trx1 replicated fine. That means that if the origin is advanced to any LSN value between the start and end of a transaction, that transaction will still replicate fine. In other words, Postgres checks which transactions were still in progress as of that LSN value and replicates them.

Closing Notes: Now the million dollar question comes up: can we (safely) set up a Subscriber based on the snapshot procedure mentioned in the Instacart post? The above tests confirm that an ongoing transaction replicates fine when we advance the LSN, but I still believe there are too many internals one needs to understand before relying on it.

Do share your thoughts and expertise on this.
