閱讀127 返回首頁    go 阿裏雲 go 技術社區[雲棲]


DBA生存警示:保護現場不要讓事情更糟

640?tp=webp&wxfrom=5&wx_lazy=1

張樂奕

雲和恩墨副總經理 Oracle ACE 總監

ITPUB Oracle數據庫管理版版主、Oracle高可用版版主、ACOUG聯合創始人


今天收到一個發過來請求幫助的 case,Oracle 數據庫無法啟動,請求幫助恢複。仔細閱讀了發過來的告警日誌,這是一個典型的“事情越弄越糟”的案例。

作為一個專業的DBA,在遇到問題時,一定要思考:如何保護現場,不讓事情變得更糟。這是基本要求,保護現場以使得其他人接手工作時,可以從原有狀態開始。


以下就來根據告警日誌,一條一條地回顧這位 DBA 是如何將數據庫弄到完全啟動不了的。


故障最開始是從1月11日的淩晨 3:30 開始出現,數據庫在歸檔的時候,意外發現某個控製文件的頭塊全部被清零了,這可能是存儲本身的問題,並非人為。

Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Master background archival failure: 227
Fri Jan 11 03:31:24 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'


接下來,數據庫痛苦地掙紮了半小時,期間不停地報相同的 ORA-00227 錯誤。一直到淩晨 4:01,終於 CKPT 進程也發現無法更新控製文件頭部,於是強勢地將數據庫直接關閉了。

Fri Jan 11 04:01:25 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
CKPT: terminating instance due to error 227
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Fri Jan 11 04:01:26 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Instance terminated by CKPT, pid = 3007


接下來的5個小時,數據庫靜靜地躺在機房裏,沒有人知道這個數據庫已經掛掉了,一直到上午 DBA 來上班。他發現數據庫無法訪問,於是嚐試重新啟動數據庫。

Fri Jan 11 09:15:51 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.


自然數據庫無法正常啟動,連 mount 狀態都無法進入,因為某個控製文件頭部已經損壞了。告警日誌的信息明確地說明了無法讀取 control03.ctl 文件的頭塊,因此在嚐試 mount 數據庫的時候報了 ORA-00205 錯誤。

Fri Jan 11 09:15:56 2013
ALTER DATABASE   MOUNT
Fri Jan 11 09:15:56 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:15:59 2013
ORA-205 signalled during: ALTER DATABASE   MOUNT...
Fri Jan 11 09:19:31 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:19:43 2013
alter database mount
Fri Jan 11 09:19:43 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:19:43 2013
ORA-205 signalled during: alter database mount


接下來,這位 DBA 開始反複地關閉實例,又啟動實例。這樣的操作一直持續了1個小時,一直到上午的 10:28,可以想象這是多麼糾結的一個小時。

忠告: 除非十分特殊的恢複案例,否則反複起停數據庫實例是於事無補的。

Shutting down instance: further logons disabled
Fri Jan 11 09:23:47 2013
Stopping background process CJQ0
……
Fri Jan 11 09:38:02 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:43:00 2013
Shutting down instance: further logons disabled
……
Fri Jan 11 09:43:58 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:55:34 2013
ALTER DATABASE CLOSE NORMAL
……
Fri Jan 11 09:56:55 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 10:28:10 2013
ALTER DATABASE CLOSE NORMAL


接下來 10:29 的再次重新啟動數據庫實例之前,DBA 終於意識到可能是控製文件出現了問題,因此修改了初始化參數,將報錯的 control03.ctl 文件從初始化參數 control_files 中去掉了。這次數據庫得以正常啟動。

Fri Jan 11 10:29:20 2013
Starting ORACLE instance (normal)
……
control_files            = /data1/oradata/dpdata/control01.ctl, /data1/oradata/dpdata/control02.ctl
……
Fri Jan 11 10:29:37 2013
Completed: ALTER DATABASE OPEN


而DBA也迅速地作了一次備份控製文件的操作,但是正是這個操作引導到了後麵噩夢一般的結果。

Fri Jan 11 10:36:14 2013
alter database backup controlfile to trace
Fri Jan 11 10:36:14 2013
Completed: alter database backup controlfile to trace


數據庫又平穩地運行了一個上午,這種寧靜到下午14:16的時候被打破,數據庫開始報 ORA-600 錯誤,並且在 CKPT 進程作檢查點的時候,報數據文件10和31的文件頭部無法被正確讀取。如果不是更深層次的原因,那麼這可能仍然是跟淩晨時候控製文件意外損壞時候的故障一樣,也許是存儲子係統本身的問題。

Fri Jan 11 14:16:07 2013
Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc:
ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], []
Fri Jan 11 14:19:44 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 10 going offline due to error advancing checkpoint
ORA-01122: database file 10 failed verification check
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-01251: Unknown File Header Version read for file number 10
Fri Jan 11 14:19:59 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 31 going offline due to error advancing checkpoint
ORA-01122: database file 31 failed verification check
ORA-01110: data file 31: '/data3/ts2_dpcis.dbf'
ORA-01251: Unknown File Header Version read for file number 31


緊接著,應用係統的某個 JOB 也由於數據文件無法訪問,而開始報錯。

Fri Jan 11 14:30:19 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc:
ORA-12012: error on auto execute of job 88
ORA-00376: file 10 cannot be read at this time
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-06512: at "DECTR.P_MOVE_CONTS_SHIP", line 77
ORA-06512: at line 1


相同的報錯一直持續了四十多分鍾,之後 DBA 又再次介入了。但是 DBA 很奇怪地斷然執行了 offline 這兩個數據文件的操作,並在2分多鍾之後,又嚐試將兩個數據文件再次 online 。而由於文件損壞,自然在 online 的時候遇到了 ORA-1122 錯誤,而無法成功 online 。


Fri Jan 11 15:16:23 2013
alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:16:23 2013
Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:17:05 2013
alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:17:05 2013
Completed: alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:19:41 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:19:41 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
Fri Jan 11 15:21:10 2013
alter database datafile  '/data2/DECTR_HIS2.dbf' online
Fri Jan 11 15:21:10 2013
ORA-1122 signalled during: alter database datafile  '/data2/DECTR_HIS2.dbf' online...


這才僅僅是噩夢的開始,接下來的一切屬於危險動作,請勿輕易模仿。

遇到 ORA-1122 錯誤以後,DBA 考慮了 9 秒鍾,再次斷然地關閉了數據庫,並隨之又重新啟動。由於僅僅是用戶表空間數據文件損壞,並且之前也已經被 offline 了,因此數據庫實例毫無障礙地 OPEN 成功。

Fri Jan 11 15:21:19 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:21:19 2013
Stopping background process QMNC
Fri Jan 11 15:21:19 2013
Stopping background process CJQ0
Fri Jan 11 15:21:21 2013
Stopping background process MMNL
Fri Jan 11 15:21:22 2013
Stopping background process MMON
Fri Jan 11 15:21:23 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:22:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:23:13 2013
Completed: ALTER DATABASE OPEN


DBA再次嚐試online數據文件的操作,同樣的ORA-1122錯誤。

Fri Jan 11 15:23:31 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:23:31 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...


考慮了2分半鍾之後,DBA 也許是想起上午的時候做過控製文件的備份,因此決定進行數據庫恢複。重啟數據庫到 nomount 狀態,並開始進行RECOVER DATABASE USING BACKUP CONTROLFILE,ORA-1507 錯誤的意思是告知如果要使用備份的控製文件進行數據庫恢複,那麼應該要先使用備份的控製文件將數據庫啟動到 mount 狀態。

Fri Jan 11 15:25:05 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:25:05 2013
Stopping background process QMNC
Fri Jan 11 15:25:05 2013
Stopping background process CJQ0
Fri Jan 11 15:25:07 2013
Stopping background process MMNL
Fri Jan 11 15:25:08 2013
Stopping background process MMON
Fri Jan 11 15:25:09 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:26:32 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:26:46 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 15:26:46 2013
ORA-1507 signalled during: ALTER DATABASE RECOVER  database using backup controlfile    until cancel  ...


DBA 於是將數據庫啟動到 mount 狀態,繼續進行數據庫恢複。這其中的幾個ORA 錯誤都是正常的,ORA-279 提示需要一個歸檔文件來完成恢複,ORA-308提示打不開 1_87749_604491553.dbf 歸檔文件,根據前麵的告警日誌,可以知道實際上 87749 這個重做日誌是當前日誌,還沒有歸檔,自然找不到。ORA-1547錯誤表示恢複已經完成,但是 OPEN RESETLOGS 的時候仍然要遇到錯誤。


Fri Jan 11 15:26:56 2013
alter database mount
Fri Jan 11 15:27:00 2013
Setting recovery target incarnation to 2
Fri Jan 11 15:27:00 2013
Successful mount of redo thread 1, with mount id 560899584
Fri Jan 11 15:27:00 2013
Database mounted in Exclusive Mode
Completed: alter database mount
Fri Jan 11 15:27:10 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Media Recovery Start
parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CANCEL
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...


DBA忽略了這個錯誤,嚐試將數據庫打開,很顯然會遇到O RA-1589 錯誤,之後又嚐試用 NORESTLOGS 方式 OPEN 數據庫,這也很顯然會遇到 ORA-1588 錯誤。不完全恢複的數據庫必須要以 RESETLOGS 方式打開。

Fri Jan 11 15:29:52 2013
alter database open
Fri Jan 11 15:29:52 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:30:11 2013
alter database open NORESETLOGS
Fri Jan 11 15:30:11 2013
ORA-1588 signalled during: alter database open NORESETLOGS


之後, DBA 作了一個艱難的決定,再次關閉並重啟了數據庫。又再次嚐試相同的 OPEN 步驟。當然,Oracle 也給與了相同的報錯。數據庫仍然無法打開。至此,數據庫無法提供服務已經1個多小時。

Fri Jan 11 15:30:42 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:30:42 2013
Stopping background process CJQ0
Fri Jan 11 15:30:42 2013
Stopping background process MMNL
Fri Jan 11 15:30:43 2013
Stopping background process MMON
Fri Jan 11 15:30:44 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:30:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:31:08 2013
ALTER DATABASE OPEN
ORA-1589 signalled during: ALTER DATABASE OPEN...
Fri Jan 11 15:31:28 2013
alter database open NORESETLOGS
Fri Jan 11 15:31:28 2013
ORA-1588 signalled during: alter database open NORESETLOGS...
Fri Jan 11 15:31:41 2013
alter database open RESETLOGS
Fri Jan 11 15:31:41 2013
ORA-1122 signalled during: alter database open RESETLOGS..
.


再接下來,是一團混亂, DBA 多次重啟數據庫,嚐試了多種恢複手段。offline數據文件,recover 數據文件,recover 數據庫,online 數據文件,再recover,再 offline,再 open,但是一切嚐試都是徒勞的。一直到晚上18:35,在數據庫宕機4個多小時以後,開始求助我們幫助其恢複數據庫。

Fri Jan 11 15:41:28 2013
alter database datafile '/data2/DECTR_HIS2.dbf' offline
Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline
Fri Jan 11 15:41:35 2013
alter database open
Fri Jan 11 15:41:35 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:42:20 2013
alter database  open resetlogs
Fri Jan 11 15:42:20 2013
ORA-1245 signalled during: alter database  open resetlogs...
Fri Jan 11 15:43:40 2013
ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  
Fri Jan 11 15:43:40 2013
Media Recovery Start
Fri Jan 11 15:43:40 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  ...
Fri Jan 11 15:46:09 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 15:46:09 2013
Media Recovery Start
Fri Jan 11 15:46:09 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
……
Fri Jan 11 16:37:51 2013
ALTER DATABASE RECOVER  database  
Fri Jan 11 16:37:51 2013
Media Recovery Start
Fri Jan 11 16:37:51 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database  ...
Fri Jan 11 16:39:29 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:39:29 2013
Media Recovery Start
parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:39:43 2013
ALTER DATABASE RECOVER    CANCEL  
Fri Jan 11 16:39:44 2013
ORA-1547 signalled during: ALTER DATABASE RECOVER    CANCEL  ...
Fri Jan 11 16:39:44 2013
ALTER DATABASE RECOVER CANCEL
ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...
Fri Jan 11 16:40:15 2013
alter database datafile 10 online
Fri Jan 11 16:40:15 2013
Completed: alter database datafile 10 online
Fri Jan 11 16:40:25 2013
alter database datafile 31 online
Completed: alter database datafile 31 online
Fri Jan 11 16:40:47 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:40:47 2013
Media Recovery Start
Fri Jan 11 16:40:47 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:47:12 2013
WARNING: inbound connection timed out (ORA-3136)
Fri Jan 11 17:44:47 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 17:44:47 2013
Media Recovery Start
Fri Jan 11 17:44:47 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
Fri Jan 11 17:45:19 2013
ALTER DATABASE RECOVER  database until cancel using backup controlfile  
Fri Jan 11 17:45:19 2013
Media Recovery Start
Fri Jan 11 17:45:19 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel using backup controlfile  ...
Fri Jan 11 17:46:39 2013
alter database datafile 10 offline
Fri Jan 11 17:46:40 2013
Completed: alter database datafile 10 offline
Fri Jan 11 17:47:18 2013
ALTER DATABASE RECOVER  database until cancel  
Fri Jan 11 17:47:18 2013
Media Recovery Start
Fri Jan 11 17:47:18 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel  ...
Fri Jan 11 18:11:31 2013
alter database open
Fri Jan 11 18:11:31 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 18:35:29 2013
Starting ORACLE instance (normal)
Fri Jan 11 18:35:43 2013
alter database open
Fri Jan 11 18:35:43 2013
ORA-1589 signalled during: alter database open..
.

這是一個沒有備份的數據庫,實際上如果是存儲字係統的問題導致了數據文件損壞,那麼可能與DBA的關係並不大,但是在經過一下午的折騰,將一個其實僅僅是壞了2個數據文件而可以輕鬆 OPEN 的數據庫恢複到無論如何也無法輕易打開的狀態,這就與 DBA 有很大的關係了。


文章轉自數據和雲公眾號,原文鏈接

最後更新:2017-07-18 09:32:31

  上一篇:go  DBA入門之路:察微知漸細致入微
  下一篇:go  分支/標記