PostgreSQL on ECS: multi-cloud-disk deployment, snapshot backup and recovery
Tags
PostgreSQL, ECS, cloud disk, snapshot, consistent backup, point-in-time recovery, zfs, lvm, raid, parallel query
Background
Alibaba Cloud disks keep getting better: IOPS has passed 100,000 and read/write throughput has passed 1 GB/s. As the network evolves, SSD cloud disks reaching 400,000 IOPS and 4 GB/s of throughput is probably not far off.
Note, however, that these IOPS and throughput figures describe concurrent IO capability; the latency of a single IO still cannot match a local disk (RDMA networking may eventually close that gap).
How PostgreSQL works around the higher IO latency of SSD cloud disks
Some workloads (small write transactions that must respond quickly) are sensitive to single-IO latency, but PostgreSQL has several ways to work around this small drawback.
1. Background IO (write syscalls)
Most of PostgreSQL's data-file IO is background IO (bgwriter and backend writes), so flushing shared buffers is not sensitive to IO latency.
2. Asynchronous commit
At commit time the WAL is written asynchronously. This never causes inconsistency, but if the database crashes, recently committed transactions whose WAL is still in the WAL buffer (at most roughly 10 ms worth) can be lost.
This is the most effective of the three approaches.
3. Group commit
Group commit attacks the WAL write bottleneck: the WAL fsync calls of transactions committing at the same time are merged into a single fsync, reducing the number of fsyncs and raising the TPS of small write transactions under high concurrency. A minimal configuration sketch for asynchronous and group commit follows this list.
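The sketch below shows the handful of postgresql.conf settings involved; the values are illustrative rather than prescriptive (some of them also appear in the full configuration later in this article).
# asynchronous commit: COMMIT returns before its WAL is flushed to disk;
# a crash loses at most the unflushed tail of the WAL buffer
synchronous_commit = off
wal_writer_delay = 10ms    # bounds how long WAL may sit unflushed
# group commit: wait up to 10 microseconds so concurrent commits can share
# one fsync; only takes effect while at least commit_siblings transactions are open
commit_delay = 10
commit_siblings = 5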
How PostgreSQL exploits concurrent IO capacity
Since the cloud-disk IOPS and throughput figures we see are concurrency figures, how does the database actually make use of them? Highly concurrent small transactions will certainly use them, but how can low-concurrency, long-running (analytical) workloads use the disks' IOPS and bandwidth?
PostgreSQL's parallel query feature can fully exploit the concurrent IOPS and read/write bandwidth of cloud disks; a short sketch follows, and section 5.3 below demonstrates it on real data.
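A minimal sketch of pushing a single analytical statement onto many disks at once (the table name t is a placeholder; section 5.3 runs the same idea against a 24 GB table):
-- allow up to 32 parallel workers when scanning this table
alter table t set (parallel_workers = 32);
-- make the planner willing to pick a parallel plan
set max_parallel_workers_per_gather = 32;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
-- one statement now issues IO from 32 workers, drawing on the striped
-- volume's concurrent IOPS and bandwidth
select count(*) from t;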
Multi-cloud-disk volume groups
A single cloud disk has caps on IOPS, capacity, and read/write bandwidth. Fortunately ECS supports multiple cloud disks; currently one ECS instance can attach up to 16 of them.
Striping a volume across several cloud disks raises the read/write bandwidth. Using Linux RHEL/CentOS 7.x as an example, there are three ways to build such a volume (a hedged sketch of the mdadm and ZFS variants follows this list; a full LVM walk-through comes right after):
1. Logical volumes (LVM)
yum install lvm2
2. Software RAID (mdadm)
yum install -y mdadm
3. ZFS
Using CentOS 7.3 as an example:
wget https://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm
rpm -ivh zfs-release.el7_3.noarch.rpm
yum install -y zfs
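For completeness, minimal sketches of the software-RAID and ZFS routes, assuming the same 16 data disks /dev/vd[b-q] (the chunk size, pool name and mount point are illustrative; the rest of this article follows the LVM route below):
# RAID 0 across 16 disks, 8 KB chunk to match an 8 KB database block;
# then create and mount the filesystem exactly as in the LVM example below
mdadm --create /dev/md0 --level=0 --raid-devices=16 --chunk=8 /dev/vd[b-q]
# ZFS striped pool (plain vdevs, no raidz, so pure striping); an 8 KB recordsize
# suits PostgreSQL, and ZFS mounts the dataset itself, so no separate mkfs/mount
zpool create zpdata01 /dev/vd[b-q]
zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off zpdata01/data01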
LVM example
Assume the environment has 16 SSD cloud disks.
1. Create the PVs
pvcreate /dev/vd[b-q]
2. Create the VG
vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]
-s, --physicalextentsize PhysicalExtentSize[bBsSkKmMgGtTpPeE]
    Sets the physical extent size on physical volumes of this volume group. A size suffix (k for kilobytes up to t for terabytes) is optional; megabytes is the default if no suffix is present. For LVM2 format, the value must be a power of 2 of at least 1 sector (where the sector size is the largest sector size of the PVs currently used in the VG) or, if not a power of 2, at least 128KiB. For the older LVM1 format, it must be a power of 2 of at least 8KiB. The default is 4 MiB. Once this value has been set, it is difficult to change it without recreating the volume group, which would involve backing up and restoring data on any logical volumes. However, if no extents need moving for the new value to apply, it can be altered using vgchange -s.
    If the volume group metadata uses the lvm1 format, extents can vary in size from 8KiB to 16GiB and there is a limit of 65534 extents in each logical volume. The default of 4 MiB leads to a maximum logical volume size of around 256GiB.
    If the volume group metadata uses the lvm2 format those restrictions do not apply, but having a large number of extents will slow down the tools, with no impact on I/O performance to the logical volume. The smallest PE is 1KiB.
    The 2.4 kernel has a limitation of 2TiB per block device.
3. Create the LV with striping
16 disks, with an 8 KB stripe unit.
lvcreate -A y -i 16 -I 8 -l 100%VG -n lv01 vgdata01
-i|--stripes Stripes
    Gives the number of stripes. This is equal to the number of physical volumes to scatter the logical volume data. When creating a RAID 4/5/6 logical volume, the extra devices which are necessary for parity are internally accounted for. Specifying -i 3 would cause 3 devices for striped and RAID 0 logical volumes, 4 devices for RAID 4/5, 5 devices for RAID 6 and 6 devices for RAID 10. Alternatively, RAID 0 will stripe across 2 devices, RAID 4/5 across 3 PVs, RAID 6 across 5 PVs and RAID 10 across 4 PVs in the volume group if the -i argument is omitted. In order to stripe across all PVs of the VG if the -i argument is omitted, set raid_stripe_all_devices=1 in the allocation section of lvm.conf(5) or add --config allocation/raid_stripe_all_devices=1 to the command.
    Note the current maxima for stripes depend on the created RAID type. For raid10, the maximum of stripes is 32, for raid0 it is 64, for raid4/5 it is 63 and for raid6 it is 62.
    See the --nosync option to optionally avoid initial synchronization of RaidLVs.
    Two implementations of basic striping are available in the kernel. The original device-mapper implementation is the default and should normally be used. The alternative implementation using MD, available since version 1.7 of the RAID device-mapper kernel target (kernel version 4.2), is provided to facilitate the development of new RAID features. It may be accessed with --type raid0[_meta], but is best avoided at present because of assorted restrictions on resizing and converting such devices.
-I|--stripesize StripeSize
    Gives the number of kilobytes for the granularity of the stripes.
    StripeSize must be 2^n (n = 2 to 9) for metadata in LVM1 format. For metadata in LVM2 format, the stripe size may be a larger power of 2 but must not exceed the physical extent size.
4. Create the filesystem with matching stripe settings
For a 32 KB database block size, the chunk size is 32 KB (each disk reads/writes 32 KB before moving on to the next disk) and the full stripe is 512 KB (32 KB * 16); with 4 KB filesystem blocks that means stride=8 and stripe_width=128.
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=8,stripe_width=128 -b 4096 -T largefile -L lv01
or
For an 8 KB database block size, the chunk size is 8 KB and the full stripe is 128 KB (8 KB * 16), i.e. stride=2 and stripe_width=32.
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01
-b block-size
    Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block. If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesystem (see the -T option). If block-size is preceded by a negative sign ('-'), then mke2fs will use heuristics to determine the appropriate block size, with the constraint that the block size will be at least block-size bytes. This is useful for certain hardware devices which require that the blocksize be a multiple of 2k.
stride=stride-size (in filesystem blocks)
    Configure the filesystem for a RAID array with stride-size filesystem blocks. This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator.
stripe_width=stripe-width (in filesystem blocks)
    Configure the filesystem for a RAID array with stripe-width filesystem blocks per stripe. This is typically stride-size * N, where N is the number of data-bearing disks in the RAID (e.g. for RAID 5 there is one parity disk, so N will be the number of disks in the array minus 1). This allows the block allocator to prevent read-modify-write of the parity in a RAID stripe if possible when the data is written.
lazy_itable_init[= <0 to disable, 1 to enable>]
    If enabled and the uninit_bg feature is enabled, the inode table will not be fully initialized by mke2fs. This speeds up filesystem initialization noticeably, but it requires the kernel to finish initializing the filesystem in the background when the filesystem is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.
5. Mount the filesystem
If cloud-disk snapshot backups are not used, mount with these options:
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback LABEL=lv01 /data01
or
If cloud-disk snapshot backups are used, mount with these options (an /etc/fstab sketch follows):
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01
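To make the mount persistent across reboots, an /etc/fstab entry along these lines can be added (a sketch; use whichever option set matches the choice above):
LABEL=lv01  /data01  ext4  defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered  0 0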
6. FIO benchmarks
yum install -y fio
fio -filename=/data01/testdir -direct=1 -thread -rw=write -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_write.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=read -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_read.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=randwrite -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randwrite.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=randread -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randread.log 2>&1
Sequential
READ: io=72621MB, aggrb=1210.3MB/s, minb=1210.3MB/s, maxb=1210.3MB/s, mint=60003msec, maxt=60003msec
WRITE: io=36845MB, aggrb=628743KB/s, minb=628743KB/s, maxb=628743KB/s, mint=60007msec, maxt=60007msec
Random
READ: io=53390MB, aggrb=911160KB/s, minb=911160KB/s, maxb=911160KB/s, mint=60002msec, maxt=60002msec
WRITE: io=26078MB, aggrb=445032KB/s, minb=445032KB/s, maxb=445032KB/s, mint=60004msec, maxt=60004msec
Database deployment
1. Install the PostgreSQL software (see the deployment guide in the references).
2. Initialize the database cluster
initdb -E SQL_ASCII -U postgres --locale=C -D $PGDATA
3. Configure postgresql.conf
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '.'
shared_buffers = 64GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 500
bgwriter_lru_multiplier = 5.0
bgwriter_flush_after = 0
effective_io_concurrency = 16
max_worker_processes = 128
max_parallel_workers_per_gather = 64
max_parallel_workers = 128
backend_flush_after = 0
wal_level = minimal
synchronous_commit = off
wal_sync_method = fsync
full_page_writes = on
wal_buffers = 1GB
wal_writer_delay = 10ms
wal_writer_flush_after = 0
checkpoint_timeout = 35min
max_wal_size = 128GB
min_wal_size = 8GB
checkpoint_completion_target = 0.5
checkpoint_flush_after = 0
checkpoint_warning = 30s
max_wal_senders = 0
random_page_cost = 1.2
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_table_scan_size = 0
min_parallel_index_scan_size = 0
effective_cache_size = 200GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 0
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
max_locks_per_transaction = 6400
pg_ctl start
4. Configure WAL archiving. The archive directory must not sit on the data cloud disks; put it on a separate cloud disk or in OSS (a hedged OSS sketch follows the config below).
For convenience this demo puts it in a local directory (just make sure it is not on the same cloud disk as the data).
mkdir /archivedir
chown digoal:digoal /archivedir
vi postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /archivedir/%f && cp %p /archivedir/%f'
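To archive straight to OSS instead, something along these lines could be used with the ossutil client (a sketch: the bucket and prefix are placeholders, and a production command should also refuse to overwrite an already-archived segment, as the cp example above does):
archive_command = 'ossutil cp %p oss://mybucket/wal_archive/%f'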
pg_ctl restart -m fast
5. Load tests
5.1 High-throughput write test
Setup
-- create the tables
do language plpgsql $$
declare
begin
for i in 0..65535 loop
execute 'create table IF NOT EXISTS test_'||i||'(id int, info text, crt_time timestamp) with (autovacuum_enabled=off, toast.autovacuum_enabled=off)';
end loop;
end;
$$;
-- drop the tables
do language plpgsql $$
declare
begin
for i in 0..65535 loop
execute 'drop table IF EXISTS test_'||i;
end loop;
end;
$$;
-- insert a batch of rows into one table
create or replace function batch_ins(int) returns void as $$
declare
begin
execute 'insert into test_'||$1||' select generate_series(1,10000), md5(random()::text), now()';
end;
$$ language plpgsql strict;
High-throughput write test: the insert rate approaches 2 million rows/s.
vi test.sql
\set id random(0,65535)
select batch_ins(:id);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 20801
latency average = 184.678 ms
latency stddev = 57.522 ms
tps = 173.218345 (including connections establishing)
tps = 173.233151 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set id random(0,65535)
184.684 select batch_ins(:id);
The real bottleneck is writing WAL; with UNLOGGED tables the insert rate approaches 10 million rows/s (results below, followed by a sketch of the unlogged setup).
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 117367
latency average = 32.731 ms
latency stddev = 23.047 ms
tps = 977.228590 (including connections establishing)
tps = 977.331826 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(0,65535)
32.727 select batch_ins(:id);
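For reference, a minimal sketch of the unlogged variant used for the result above: the setup loop is the same as before, only the tables are created UNLOGGED, which skips WAL entirely (such tables are truncated after a crash and are not protected by WAL archiving or by the snapshot backup described later).
do language plpgsql $$
declare
begin
  for i in 0..65535 loop
    execute 'create unlogged table IF NOT EXISTS test_'||i||'(id int, info text, crt_time timestamp) with (autovacuum_enabled=off, toast.autovacuum_enabled=off)';
  end loop;
end;
$$;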
5.2 OLTP test on 1 billion rows
pgbench -i -s 10000
pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 120
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10000
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 580535
latency average = 13.229 ms
latency stddev = 30.073 ms
tps = 4829.300799 (including connections establishing)
tps = 4829.989470 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.097 BEGIN;
5.650 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.153 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
3.122 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.631 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.418 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
1.148 END;
5.3 Parallel read test
postgres=# \dt+ e
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | e | table | postgres | 24 GB |
(1 row)
-- 32 parallel workers: processing the 24 GB table takes about 0.3 seconds.
postgres=# alter table e set (parallel_workers =32);
postgres=# set max_parallel_workers_per_gather =32;
postgres=# set min_parallel_table_scan_size =0;
postgres=# set min_parallel_index_scan_size =0;
postgres=# set parallel_setup_cost =0;
postgres=# set parallel_tuple_cost =0;
postgres=# set force_parallel_mode =on;
postgres=# select count(*) from e;
count
----------
28766181
(1 row)
Time: 993.876 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from e;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=178479.59..178479.60 rows=1 width=8) (actual time=305.796..305.796 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=27422
-> Gather (cost=178479.50..178479.51 rows=32 width=8) (actual time=304.868..305.784 rows=33 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 32
Workers Launched: 32
Buffers: shared hit=27422
-> Partial Aggregate (cost=178479.50..178479.51 rows=1 width=8) (actual time=291.483..291.483 rows=1 loops=33)
Output: PARTIAL count(*)
Buffers: shared hit=23934
Worker 0: actual time=281.287..281.287 rows=1 loops=1
Buffers: shared hit=718
..........
Worker 30: actual time=299.304..299.304 rows=1 loops=1
Buffers: shared hit=438
Worker 31: actual time=300.227..300.227 rows=1 loops=1
Buffers: shared hit=460
-> Parallel Index Only Scan using uk_e on public.e (cost=0.44..176232.78 rows=898689 width=0) (actual time=0.085..178.852 rows=871702 loops=33)
Heap Fetches: 0
Buffers: shared hit=23934
Worker 0: actual time=0.078..173.121 rows=1014806 loops=1
Buffers: shared hit=718
..........
Worker 30: actual time=0.089..184.251 rows=635007 loops=1
Buffers: shared hit=438
Worker 31: actual time=0.070..184.000 rows=649729 loops=1
Buffers: shared hit=460
Planning time: 0.092 ms
Execution time: 314.527 ms
(144 rows)
Backup based on cloud-disk snapshots
Backup procedure
1. Start the backup
select pg_start_backup('xx');
2. Take snapshots of the cloud disks
Do this through the API or in the Alibaba Cloud ECS console; a hedged CLI sketch follows.
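A sketch using the Alibaba Cloud CLI (the disk IDs are placeholders and the parameters should be checked against the current CreateSnapshot API documentation before relying on this):
# snapshot every cloud disk that belongs to the volume group (list all 16 disk IDs)
for disk in d-xxxx01 d-xxxx02 d-xxxx16; do
    aliyun ecs CreateSnapshot --DiskId $disk --SnapshotName pg_backup_$(date +%Y%m%d)
done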
3. Keep load running during the snapshots
While the snapshots are being taken, generate database changes.
Reuse the earlier examples: the high-throughput write test (with logged tables) and the OLTP test, mixed with creating and dropping tables.
4. Wait for the cloud-disk snapshots to finish
5. Stop the load
6. End the backup
select pg_stop_backup();
7. Switch the WAL
checkpoint;
select pg_switch_wal();
checkpoint;
select pg_switch_wal();
8. Verify archiving
Make sure every WAL segment produced while the snapshots were running has been archived successfully; one way to check is sketched below.
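A check from psql (a sketch; make sure the last archived segment has caught up with the current write position and that failed_count is not growing):
select last_archived_wal, last_archived_time, failed_count from pg_stat_archiver;
-- the segment containing the current write position; every earlier segment
-- should already be present in /archivedir
select pg_walfile_name(pg_current_wal_lsn());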
9. Compute a checksum for every test table
do language plpgsql $$
declare
n name;
res int8;
begin
for n in select tablename from pg_tables where schemaname='public'
loop
execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res;
raise notice 'checksum %: %', n, res;
end loop;
end;
$$;
NOTICE: checksum pgbench_history: -422046586146
NOTICE: checksum pgbench_tellers: 215006661683
NOTICE: checksum pgbench_branches: 158568443210
......
Recovery based on cloud-disk snapshots
PostgreSQL itself already supports incremental and block-level incremental backup, but since cloud-disk snapshots are available, it is enough to keep only the WAL archive plus the disk snapshots, and the data crosses the network one less time.
Backups based on PostgreSQL's own tooling are briefly covered in the references at the end of this article.
Recovery procedure
1. Create a new ECS instance (it can be created from an image).
2. Deploy the PostgreSQL software, keeping the software version and extensions identical to the primary. (If the instance was created from the database's ECS image, this step is not needed.)
3. Create cloud disks from the snapshots and attach them.
4. Reassemble the logical volume
pvscan
vgscan
lvscan
5. Check the filesystem
fsck.ext4 -y /dev/mapper/vgdata01-lv01
6. Mount the filesystem
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01
7. Remove the WAL files. (Because each disk is snapshotted separately, the snapshots cannot land on exactly the same point in time, so the WAL files may contain partial writes. The WAL must be intact because it is what repairs the data files, so we take the WAL from the archive instead.)
rm -f $PGDATA/pg_wal/*
Note: with a single-disk snapshot, deleting only the last WAL segment is enough.
8. Configure recovery
cp $PGHOME/share/recovery.conf.sample $PGDATA/recovery.conf
vi $PGDATA/recovery.conf # (auto copy archive to $PGDATA/pg_wal)
restore_command = 'cp /archivedir/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p'
standby_mode = on
9. Start the database
pg_ctl start
10. Wait for recovery to complete; the database recovers up to the last archived WAL segment. Progress can be watched as sketched below.
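Replay progress can be monitored from another session with standard functions (a sketch):
select pg_is_in_recovery(),             -- true while WAL is still being replayed
       pg_last_wal_replay_lsn(),        -- last WAL location applied
       pg_last_xact_replay_timestamp(); -- commit time of the last replayed transaction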
11. Promote the database
pg_ctl promote
12. Verify the checksums
do language plpgsql $$
declare
n name;
res int8;
begin
for n in select tablename from pg_tables where schemaname='public'
loop
execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res;
raise notice 'checksum %: %', n, res;
end loop;
end;
$$;
13. Verify that every data block is readable.
set vacuum_freeze_min_age =0;
vacuum freeze;
Summary
1. With WAL and full-page writes, PostgreSQL can recover from inconsistent data files: a page that was only partially written at backup time is repaired by the full-page image written to the WAL after the checkpoint, so the restored cluster ends up consistent.
2. Snapshotting multiple cloud disks in parallel speeds up backups.
3. Multiple cloud disks raise the overall read/write IOPS and bandwidth.
4. Through group commit, asynchronous commit, and background writes of data files, PostgreSQL works around the slightly higher single-IO latency of cloud disks compared with local disks; in other words, it is largely insensitive to IO latency.
5. With multi-core parallel query, a single SQL statement can fully exploit the high IOPS and bandwidth that multiple cloud disks provide.
6. Logical volumes, software RAID, or ZFS make it easy to harness the capacity of multiple cloud disks.
References
1. 《PostgreSQL on Linux 最佳部署手冊》
2. ZFS
《PostgreSQL 最佳實踐 - 塊級增量備份(ZFS篇)雙機HA與塊級備份部署》
《PostgreSQL 最佳實踐 - 塊級增量備份(ZFS篇)單個數據庫采用多個zfs卷(如表空間)時如何一致性備份》
《PostgreSQL 最佳實踐 - 塊級增量備份(ZFS篇)備份集自動校驗》
《PostgreSQL 最佳實踐 - 塊級增量備份(ZFS篇)方案與實戰》
3. man lvm
4. man mdadm
5. Backup internals
《PostgreSQL 最佳實踐 - 塊級別增量備份(pg_rman baseon LSN)源碼淺析與使用》
6. Point-in-time recovery internals and best practices
《PostgreSQL 最佳實踐 - 任意時間點恢復源碼分析》
《PostgreSQL 最佳實踐 - 在線增量備份與任意時間點恢復》
7. https://github.com/pgbackrest/pgbackrest
Last updated: 2017-08-20 17:06:15