PostgreSQL: Logical Replication Cloning Gotcha
Many times we need to clone a production instance for some testing purpose. For example, validate a major version upgrade on a clone instance before attempting on the prod instance. And this clone is created using a snapshot backup of the prod. Though just be careful while doing this, as it can break the replication setup of the Prod instance.
Let’s get into details:
Assume we have set up logical replication between a Publisher
node and a Subscriber
node and then we create a clone of the Subscriber
node called as Subscriber Clone
.
After some time we realized that replication is broken between Publisher
and Subscriber
and instead Subscriber Clone
started fetching new changes from Publisher
node.
So what happened here
When we createSubscriber Clone
instance from snapshot backup of the Subscriber
, obviously it inherits all the subscription properties of it, like replication slot name and Publisher
connection string and continues to ping Publisher
node to make a replication connection.
select subname, subenabled, subconninfo, subslotname from pg_subscription;
subname | subenabled | subconninfo | subslotname
-------+---------+-----------------+----------+------------+-------------------------------------------------------------------
my_subscription | t | host=10.127.208.xxx port=5432 password=a*****e user=postgres dbname=postgres | my_subscription
LOG: logical replication apply worker for subscription "my_subscription" has started
ERROR: could not start WAL streaming: ERROR: replication slot "my_subscription" is active for PID 172
LOG: background worker "logical replication worker" (PID 183) exited with exit code 1
LOG: logical replication apply worker for subscription "my_subscription" has started
ERROR: could not start WAL streaming: ERROR: replication slot "my_subscription" is active for PID 172
LOG: background worker "logical replication worker" (PID 184) exited with exit code 1
Though initially Subscriber Clone
is not able to set up the replication because Publisher
has an active process on this slot which is sending data to Subscriber
. But as soon as replication is paused/break between Publisher
and Subscriber
for a moment (say Subscriber
got restarted/failed over) then Subscriber Clone
is able to communicate to Publisher
and we will see a new process id is spawned to the IP address of this clone instance.
select pid,usename,application_name,client_addr,client_port,state from pg_stat_replication;
pid | usename | application_name | client_addr | client_port | state
-----+----------+------------------+---------------+-------------+-----------
172 | postgres | my_subscription | 10.127.208.57 | 39088 | streaming
Escape Route
Either create your clone instance in a separate network so it can not communicate to Publisher node or alternate, as soon as this clone instance comes up, remove the replication setup from the clone instance.
/* Instance which we need to clone */
ALTER SUBSCRIPTION my_subscription DISABLE;
/* Cloned instance */
ALTER SUBSCRIPTION my_subscription SET (slot_name=NONE);
DROP SUBSCRIPTION my_subscription;
/* Instance which we need to clone */
ALTER SUBSCRIPTION my_subscription ENABLE;
Some Thoughts
So in PostgreSQL, logical replication communication is initiated by the Subscriber node and the Publisher does not know the identity of the Subscriber node. All here thePublisher node does is to respond to any instance trying to connect with the same replication slot name irrespective of its IP address and that’s what happened above. This is indeed helpful for failover scenarios. Think of a real Prod setup where both of your Publisher and Subscriber nodes are HA enabled through synchronous means, so even Primary Subscriber node is failed over to other side and it’s DNS/IP is changed, the other side still continues to restart the replication and we don’t need any custom DNS failover mechanism.