392
技術社區[雲棲]
PostgreSQL 從文件係統恢複遺失的UNLOGGED table's datafile.
PostgreSQL的unlogged table是不記錄xlog的,所以在備庫上沒有unlogged table的數據記錄。
另一方麵,當數據庫異常crash時,數據庫重啟後需要從WAL恢複,由於unlogged table不記錄WAL,為了保證數據的一致性,postgresql會自動清除unlogged table的數據文件。
那麼問題來了,萬一你不小心誤創建了unlogged table,你可能一開始會沒有感知,但是一旦發生以下情況,你會發現數據不見了。
1. 數據庫crash掉之後,重啟。
2. 主備切換,備庫變成主庫,主庫變成備庫。
以上兩種情況,都是數據庫在啟動並進入recovery狀態後,原來主庫下的unlogged table的datafile都會被清空。
代碼見:
src/backend/access/transam/xlog.c
/*
* This must be called ONCE during postmaster or standalone-backend startup
*/
void
StartupXLOG(void)
{
......
/* REDO */
if (InRecovery)
{
......
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
* connections, so that read-only backends don't try to read whatever
* garbage is left over from before.
*/
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
......
/*
* Reset unlogged relations to the contents of their INIT fork. This is
* done AFTER recovery is complete so as to include any unlogged relations
* created during recovery, but BEFORE recovery is marked as having
* completed successfully. Otherwise we'd not retry if any of the post
* end-of-recovery steps fail.
*/
if (InRecovery)
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
......
backend/storage/file/reinit.c
/*
* Reset unlogged relations from before the last restart.
*
* If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
* relation with an "init" fork, except for the "init" fork itself.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
*/
void
ResetUnloggedRelations(int op)
{
char temp_path[MAXPGPATH];
DIR *spc_dir;
struct dirent *spc_de;
MemoryContext tmpctx,
oldctx;
/* Log it. */
elog(DEBUG1, "resetting unlogged relations: cleanup %d init %d",
(op & UNLOGGED_RELATION_CLEANUP) != 0,
(op & UNLOGGED_RELATION_INIT) != 0);
/*
* Just to be sure we don't leak any memory, let's create a temporary
* memory context for this operation.
*/
tmpctx = AllocSetContextCreate(CurrentMemoryContext,
"ResetUnloggedRelations",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
ALLOCSET_DEFAULT_MAXSIZE);
oldctx = MemoryContextSwitchTo(tmpctx);
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
ResetUnloggedRelationsInTablespaceDir("base", op);
/*
* Cycle through directories for all non-default tablespaces.
*/
spc_dir = AllocateDir("pg_tblspc");
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
ResetUnloggedRelationsInTablespaceDir(temp_path, op);
}
FreeDir(spc_dir);
/*
* Restore memory context.
*/
MemoryContextSwitchTo(oldctx);
MemoryContextDelete(tmpctx);
}
src/include/common/relpath.h
/*
* Stuff for fork names.
*
* The physical storage of a relation consists of one or more forks.
* The main fork is always created, but in addition to that there can be
* additional forks for storing various metadata. ForkNumber is used when
* we need to refer to a specific fork in a relation.
*/
typedef enum ForkNumber
{
InvalidForkNumber = -1,
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
INIT_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
* FORKNAMECHARS below, and update the forkNames array in
* src/common/relpath.c
*/
} ForkNumber;
那麼問題來了,如果真的這樣了,有辦法恢複嗎?
1. 首先,如果你在原來的主庫上有基礎備份,你可以從基礎備份恢複。為什麼需要主庫的備份呢,因為備庫上沒有unlogged table的數據文件內容,所以在備庫備份是備不到unlogged table的datafile的。
但是這種恢複方法也務必要小心,你需要在啟動數據庫前,先把_init的文件都刪掉,這樣啟動數據庫時數據文件就不會被清除。
2. 從主庫的邏輯備份中恢複。
3. 從審計日誌中回放SQL恢複。
4. 如果你沒有主庫的基礎備份,那麼可有從文件係統中去恢複刪掉的數據文件。例如ext4文件係統的恢複方式如下:
https://blog.163.com/digoal@126/blog/static/16387704020142124032866/
例子:
創建一個unlogged table,並記錄它的filenode
postgres=# create unlogged table utbl1(id int);
CREATE TABLE
postgres=# insert into utbl1 select generate_series(1,1000);
INSERT 0 1000
postgres=# select pg_relation_filepath('utbl1'::regclass);
pg_relation_filepath
----------------------
base/151898/33822
(1 row)
查看到PG對unlogged table做了_init的後綴標記
https://www.postgresql.org/docs/9.4/static/storage-init.html
postgres@digoal-> cd $PGDATA
postgres@digoal-> ll base/151898/33822*
-rw------- 1 postgres postgres 40K Sep 26 11:39 base/151898/33822
-rw------- 1 postgres postgres 24K Sep 26 11:39 base/151898/33822_fsm
-rw------- 1 postgres postgres 0 Sep 26 11:38 base/151898/33822_init
現在,我們進入恢複模式
postgres@digoal-> mv recovery.done recovery.conf
postgres@digoal-> pg_ctl start
數據庫啟動後,其實已經將unlogged table的數據文件清理掉了,見前麵的源碼。
postgres=# select count(*) from utbl1 ;
ERROR: cannot access temporary or unlogged relations during recovery
現在再起來,unlogged table就沒有數據了。
postgres@digoal-> mv recovery.conf recovery.done
postgres@digoal-> pg_ctl start
postgres=# select count(*) from utbl1 ;
count
-------
0
(1 row)
恢複過程,先停庫,然後umount數據文件所在的文件係統。
pg_ctl stop -m fast
root@digoal-> umount /data01
假設我已經安裝了extundelete
https://sourceforge.net/projects/extundelete/files/extundelete/
https://blog.163.com/digoal@126/blog/static/16387704020142124032866/
root@digoal-> cd /opt/extundelete/
root@digoal-> ll
total 4.0K
drwxr-xr-x 2 root root 4.0K Sep 26 13:43 bin
root@digoal-> cd bin/
root@digoal-> ll
total 1.2M
-rwxr-xr-x 1 root root 1.2M Sep 26 13:43 extundelete
查看數據文件所在的文件係統的塊設備的inode信息。
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01
./extundelete: unrecognized option '--ls'
NOTICE: Extended attributes are not restored.
Loading filesystem metadata ... 80 groups loaded.
Group: 0
Contents of inode 2:
0000 | ed 41 00 00 00 10 00 00 6d 08 fa 55 bc 08 fa 55 | .A......m..U...U
0010 | bc 08 fa 55 00 00 00 00 00 00 05 00 08 00 00 00 | ...U............
0020 | 00 00 00 00 02 00 00 00 a1 22 00 00 00 00 00 00 | ........."......
0030 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0040 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0050 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0060 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0070 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0080 | 1c 00 00 00 88 90 cc 84 88 90 cc 84 00 00 00 00 | ................
0090 | 6d 08 fa 55 00 00 00 00 00 00 00 00 00 00 00 00 | m..U............
00a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
Inode is Allocated
File mode: 16877
Low 16 bits of Owner Uid: 0
Size in bytes: 4096
Access time: 1442449517
Creation time: 1442449596
Modification time: 1442449596
Deletion Time: 0
Low 16 bits of Group Id: 0
Links count: 5
Blocks count: 8
File flags: 0
File version (for NFS): 0
File ACL: 0
Directory ACL: 0
Fragment address: 0
Direct blocks: 8865, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Indirect block: 0
Double indirect block: 0
Triple indirect block: 0
File name | Inode number | Deleted status
. 2
.. 2
lost+found 11
pg_root_1921 131073
pg_root_1922 393217
我們的數據文件在pg_root_1921目錄下,對應的inode=131073,根據這個inode繼續查詢下一級目錄的inode
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131073
找到了base的inode
base 131077
繼續找,找到了unlogged table所在的database的inode
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131077
151898
131078
繼續找,找對應的_init文件,找到了它的前綴,根據前綴過濾
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep _init
33822_init 131152
找到了被刪除的文件
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep 33822
./extundelete: unrecognized option '--ls'
33822 131116 Deleted
33822_fsm 131147 Deleted
33822_init 131152
使用inode恢複,恢複刪除的datafile
root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --restore-inode 131116
恢複後放在這裏
root@digoal-> ll RECOVERED_FILES/
total 40K
-rw-r--r-- 1 root root 40K Sep 26 13:50 file.131116
將文件拷貝到原來的位置
root@digoal-> mount /data01
root@digoal-> cp file.131116 /data01/pg_root_1921/base/151898/33822
root@digoal-> chown -R postgres:postgres /data01/
root@digoal-> su - postgres
啟動數據庫前,切記,刪除_init後綴的文件,否則數據文件可能又會被清理掉。
cd $PGDATA
postgres@digoal-> rm -f base/151898/33822_init
啟動數據庫,數據已經回來了。
postgres@digoal-> pg_ctl start
postgres=# select count(*) from utbl1 ;
count
-------
1000
(1 row)
大家千萬要切記,不要濫用unlogged table,除非你可以為你的行為負責。
hash index也是這樣,因為不寫XLOG,所以如果你使用流複製來實施HA的話,在切換到備庫後,如果走hash index scan,你會發現數據憑空"消失",實際上是因為hash index的變更沒有複製過去的原因。所以就不要使用hash index了吧。
[參考]
1. https://sourceforge.net/projects/extundelete/files/extundelete/
2. https://blog.163.com/digoal@126/blog/static/16387704020142124032866/
最後更新:2017-04-01 13:37:07