PostgreSQL: Cloning your instance can mess up replication

Virender Singla
3 min readApr 11, 2024

Many times we need to clone a production instance for some testing purpose. For example, validate a major version upgrade on a clone instance before attempting on the prod instance. And this clone is created using a snapshot backup of the prod. Though just be careful while doing this, as it can break the replication setup of the Prod instance.

Let’s get into details:

Assume we have set up logical replication between a Publisher node and a Subscriber node and then we create a clone of the Subscriber node called as Subscriber Clone.

After some time we realized that replication is broken between Publisher and Subscriber and instead Subscriber Clone started fetching new changes from Publisher node.

So what happened here

When we createSubscriber Clone instance from snapshot backup of the Subscriber, obviously it inherits all the subscription properties of it, like replication slot name and Publisher connection string and continues to ping Publisher node to make a replication connection.

select subname, subenabled, subconninfo, subslotname from pg_subscription;
subname | subenabled | subconninfo | subslotname
-------+---------+-----------------+----------+------------+-------------------------------------------------------------------
my_subscription | t | host=10.127.208.xxx port=5432 password=a*****e user=postgres dbname=postgres | my_subscription
LOG:  logical replication apply worker for subscription "my_subscription" has started
ERROR: could not start WAL streaming: ERROR: replication slot "my_subscription" is active for PID 172
LOG: background worker "logical replication worker" (PID 183) exited with exit code 1
LOG: logical replication apply worker for subscription "my_subscription" has started
ERROR: could not start WAL streaming: ERROR: replication slot "my_subscription" is active for PID 172
LOG: background worker "logical replication worker" (PID 184) exited with exit code 1

Though initially Subscriber Clone is not able to set up the replication because Publisher has an active process on this slot which is sending data to Subscriber. But as soon as replication is paused/break between Publisher and Subscriber for a moment (say Subscriber got restarted/failed over) then Subscriber Clone is able to communicate to Publisher and we will see a new process id is spawned to the IP address of this clone instance.

select pid,usename,application_name,client_addr,client_port,state from pg_stat_replication;
pid | usename | application_name | client_addr | client_port | state
-----+----------+------------------+---------------+-------------+-----------
172 | postgres | my_subscription | 10.127.208.57 | 39088 | streaming

Escape Route

Either create your clone instance in a separate network so it can not communicate to Publisher node or alternate, as soon as this clone instance comes up, remove the replication setup from the clone instance.

/* Instance which we need to clone */

ALTER SUBSCRIPTION my_subscription DISABLE;

/* Cloned instance */

ALTER SUBSCRIPTION my_subscription SET (slot_name=NONE);
DROP SUBSCRIPTION my_subscription;

/* Instance which we need to clone */

ALTER SUBSCRIPTION my_subscription ENABLE;

Some Thoughts

So in PostgreSQL, logical replication communication is initiated by the Subscriber node and the Publisher does not know the identity of the Subscriber node. All here thePublisher node does is to respond to any instance trying to connect with the same replication slot name irrespective of its IP address and that’s what happened above. This is indeed helpful for failover scenarios. Think of a real Prod setup where both of your Publisher and Subscriber nodes are HA enabled through synchronous means, so even Primary Subscriber node is failed over to other side and it’s DNS/IP is changed, the other side still continues to restart the replication and we don’t need any custom DNS failover mechanism.

--

--