There is no technical term called snapshot standby in PostgreSQL. If you have an Oracle database background, you will know the snapshot standby feature in Data Guard: you can open the standby database in read-write mode temporarily, do all your testing (data changes, schema changes, etc.), and once the testing is done, put the standby back in sync with the primary without rebuilding it from scratch.
We can achieve the same thing in PostgreSQL using pg_rewind.
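One thing to keep in mind: pg_rewind can only rewind a cluster that was initialized with data checksums or that had wal_log_hints enabled before the divergence (this setup uses wal_log_hints, as you will see later in postgresql.auto.conf). A quick way to verify on both nodes (a minimal sketch; enabling wal_log_hints needs a restart to take effect):
postgres=# show data_checksums;
postgres=# show wal_log_hints;
-- If both are off, turn on wal_log_hints and restart the cluster:
postgres=# alter system set wal_log_hints = on;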
1. Validate replication
Make sure replication is running fine without any lag:
-- MASTER NODE
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 6965
usesysid | 10
usename | postgres
application_name | walreceiver
client_addr | 10.20.30.7
client_hostname |
client_port | 51244
backend_start | 2022-07-13 10:29:03.01259+03
backend_xmin |
state | streaming
sent_lsn | 0/9600C20
write_lsn | 0/9600C20
flush_lsn | 0/9600C20
replay_lsn | 0/9600C20
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2022-07-13 10:31:10.636569+03
-- SLAVE NODE:
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 30164
status | streaming
receive_start_lsn | 0/5000000
receive_start_tli | 3
written_lsn | 0/8000060
flushed_lsn | 0/8000060
received_tli | 3
last_msg_send_time | 2022-07-13 10:30:32.937076+03
last_msg_receipt_time | 2022-07-13 10:29:33.399633+03
latest_end_lsn | 0/8000060
latest_end_time | 2022-07-13 10:30:02.798967+03
slot_name |
sender_host | 10.20.30.6
sender_port | 5444
conninfo | user=postgres password=postgres channel_binding=prefer host=10.20.30.6 port=5444 sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
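To put a number on the lag, you can also compare the master's current WAL position with what the standby has replayed. A small query on the master (a sketch; pg_wal_lsn_diff returns the difference in bytes):
-- MASTER NODE
postgres=# select application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes from pg_stat_replication;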
2. Shut down the slave server:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl stop -D /oracle/pg_data
waiting for server to shut down.... done
server stopped
-bash-4.2$ ls -ltr
3. Remove the standby.signal file from the data directory:
-bash-4.2$ rm -rf standby.signal
4. Start the slave server:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl start -D /oracle/pg_data
waiting for server to start....2022-07-13 11:26:58.937 +03 [1941] LOG: redirecting log output to logging collector process
2022-07-13 11:26:58.937 +03 [1941] HINT: Future log output will appear in directory "log".
done
server started
CHECK THE LOG:
2022-07-13 11:26:58.944 +03 [1941] LOG: listening on Unix socket "/tmp/.s.PGSQL.5444"
2022-07-13 11:26:58.951 +03 [1943] LOG: database system was shut down in recovery at 2022-07-13 11:26:19 +03
2022-07-13 11:26:58.952 +03 [1943] LOG: database system was not properly shut down; automatic recovery in progress
2022-07-13 11:26:58.956 +03 [1943] LOG: redo starts at 0/9600C58
2022-07-13 11:26:58.956 +03 [1943] LOG: invalid record length at 0/9600D40: wanted 24, got 0
2022-07-13 11:26:58.956 +03 [1943] LOG: redo done at 0/9600D08 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-07-13 11:26:58.979 +03 [1941] LOG: database system is ready to accept connections
We can see that the database is open now and ready for transactions.
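Before doing any changes, a quick way to confirm the instance has really left recovery mode (it should return f now that standby.signal is gone):
postgres=# select pg_is_in_recovery();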
5. Do some data changes on SLAVE:
postgres=# create table test102 as select * from test100;
SELECT 88064
postgres=# create table test103 as select * from test100;
SELECT 88064
postgres=#
postgres=#
postgres=# create table test104 as select * from test100;
SELECT 88064
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------+-------+----------
public | test100 | table | postgres
public | test101 | table | postgres
public | test102 | table | postgres
public | test103 | table | postgres
public | test104 | table | postgres
(5 rows)
So we created a few new tables on the slave (though disconnected), which do not exist on the MASTER.
Now we want to revert this and put the slave back in sync with the master for the replication setup.
6. Create standby.signal file:
-bash-4.2$ touch standby.signal
7. Restart the server:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl stop -D /oracle/pg_data
waiting for server to shut down.... done
server stopped
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl start -D /oracle/pg_data
waiting for server to start....2022-07-13 11:34:00.571 +03 [2503] LOG: redirecting log output to logging collector process
2022-07-13 11:34:00.571 +03 [2503] HINT: Future log output will appear in directory "log".
done
server started
8. Now promote this server:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl promote -D /oracle/pg_data
waiting for server to promote.... done
server promoted
-bash-4.2$
2022-07-13 11:34:00.615 +03 [2511] LOG: started streaming WAL from primary at 0/F000000 on timeline 3
2022-07-13 11:34:00.615 +03 [2511] FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/F000000 is ahead of the WAL flush position of this server 0/9600D40
2022-07-13 11:34:00.637 +03 [2515] LOG: started streaming WAL from primary at 0/F000000 on timeline 3
2022-07-13 11:34:00.638 +03 [2515] FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/F000000 is ahead of the WAL flush position of this server 0/9600D40
2022-07-13 11:34:04.729 +03 [2506] LOG: received promote request
2022-07-13 11:34:04.730 +03 [2506] LOG: redo is not required
2022-07-13 11:34:04.737 +03 [2506] LOG: selected new timeline ID: 4
2022-07-13 11:34:04.813 +03 [2506] LOG: archive recovery complete
2022-07-13 11:34:04.839 +03 [2503] LOG: database system is ready to accept connections
As you can see, a new timeline ID (4) is selected, which is different from the master's.
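You can confirm the timeline ID of each node yourself, either from psql or with pg_controldata (a quick check; compare the values on master and slave):
postgres=# select timeline_id from pg_control_checkpoint();
-bash-4.2$ /usr/pgsql-14/bin/pg_controldata /oracle/pg_data | grep TimeLineID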
9. Now run pg_rewind to sync with the master:
-- shutdown cluster
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl stop -D /oracle/pg_data
waiting for server to shut down.... done
server stopped
-- Run pg_rewind:
-bash-4.2$ /usr/pgsql-14/bin/pg_rewind -D /oracle/pg_data --source-server='host=10.20.30.6 dbname=postgres user=postgres password=postgres port=5444'
pg_rewind: servers diverged at WAL location 0/F0000A0 on timeline 3
pg_rewind: rewinding from last common checkpoint at 0/F000028 on timeline 3
pg_rewind: Done!
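If you want to preview what pg_rewind would do before it touches the data directory, it also supports a dry run with progress output (same connection string, cluster still shut down):
-bash-4.2$ /usr/pgsql-14/bin/pg_rewind -D /oracle/pg_data --source-server='host=10.20.30.6 dbname=postgres user=postgres password=postgres port=5444' --dry-run --progress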
10. Create the standby.signal file and update the postgresql.auto.conf file:
When pg_rewind runs, it copies files from the master, including the PostgreSQL configuration files. So you need to make sure the primary_conninfo entry in postgresql.auto.conf points to the master host details.
-bash-4.2$ touch standby.signal
-bash-4.2$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=postgres password=postgres channel_binding=prefer host=10.20.30.6 port=5444 sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
wal_log_hints = 'on'
11. Start the cluster:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl start -D /oracle/pg_data
waiting for server to start....2022-07-13 11:36:45.698 +03 [2734] LOG: redirecting log output to logging collector process
2022-07-13 11:36:45.698 +03 [2734] HINT: Future log output will appear in directory "log".
done
server started
-bash-4.2$ psql -d postgres -p 5444
psql (14.2)
Type "help" for help.
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------+-------+----------
public | test100 | table | postgres
public | test101 | table | postgres
(2 rows)
We can see that the new tables we created are gone and the slave is back in sync with the master.
12. Validate the replication again:
-- On master:
postgres=# create table test106 as select * from test100;
SELECT 88065
postgres=#
-- On slave:
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------+-------+----------
public | test100 | table | postgres
public | test101 | table | postgres
public | test106 | table | postgres
(3 rows)
I created a new table on the master and can observe that it got replicated to the slave as well.
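Optionally, reconfirm on the slave that the WAL receiver is streaming from the master again, just as in step 1 (status should be streaming and sender_host should be the master, 10.20.30.6):
-- SLAVE NODE
postgres=# select pid, status, sender_host, sender_port from pg_stat_wal_receiver;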
COMMON ISSUES:
1. pg_rewind: source and target cluster are on the same timeline
Problem:
-bash-4.2$ /usr/pgsql-14/bin/pg_rewind -D /oracle/pg_data --source-server='host=10.20.30.6 dbname=postgres user=postgres password=postgres port=5444'
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
Solution:
pg_rewind does not work when the master and slave are on the same timeline ID. You can change the timeline ID of the slave using pg_ctl promote, so run pg_rewind only after running the promote command, as shown in the sketch below.
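In other words, the rough sequence on the slave looks like this (a sketch using the same paths and connection string as above):
-bash-4.2$ touch standby.signal
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl restart -D /oracle/pg_data
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl promote -D /oracle/pg_data
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl stop -D /oracle/pg_data
-bash-4.2$ /usr/pgsql-14/bin/pg_rewind -D /oracle/pg_data --source-server='host=10.20.30.6 dbname=postgres user=postgres password=postgres port=5444'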
2. pg_ctl: cannot promote server; server is not in standby mode
Problem:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl promote -D /oracle/pg_data
pg_ctl: cannot promote server; server is not in standby mode
Solution:
Create a standby.signal file and then restart the cluster:
touch standby.signal
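Then restart it so the signal file is picked up, for example:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl restart -D /oracle/pg_data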
3. pg_rewind: error: postgres single-user mode in target cluster failed
Problem:
-bash-4.2$ /usr/pgsql-14/bin/pg_rewind -D /oracle/pg_data --source-server='host=10.20.30.6 dbname=postgres user=postgres password=postgres port=5444'
pg_rewind: executing "/usr/pgsql-14/bin/postgres" for target server to complete crash recovery
2022-07-13 11:31:57.696 +03 [2287] FATAL: lock file "postmaster.pid" already exists
2022-07-13 11:31:57.696 +03 [2287] HINT: Is another postmaster (PID 1941) running in data directory "/oracle/pg_data"?
pg_rewind: error: postgres single-user mode in target cluster failed
pg_rewind: fatal: Command was: "/usr/pgsql-14/bin/postgres" --single -F -D "/oracle/pg_data" template1 < "/dev/null"
Solution:
Before running pg_rewind, make sure that the PostgreSQL server in the target data directory is shut down:
/usr/pgsql-14/bin/pg_ctl stop -D /oracle/pg_data
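You can double-check with pg_ctl status that nothing is running against the data directory; it should report that no server is running before you retry pg_rewind:
-bash-4.2$ /usr/pgsql-14/bin/pg_ctl status -D /oracle/pg_data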