一波三折:DBA需要頭腦冷清思路清晰解決故障以幸存
楊廷琨(yangtingkun)
雲和恩墨 CTO
高級谘詢顧問,Oracle ACE總監,ITPUB Oracle數據庫管理版版主
這是一則生產環境的真實維護過程,由於RAC的測試環境空間不足,因此規劃給ASM擴展空間,然而在給ASM添加新的磁盤空間時又出現了故障,這類問題在很多用戶的生產環境中可能也會遇到。
空間擴展的操作步驟如下:
在RAC環境的節點1上啟動了DBCA工具來管理ASM設備;
由於新增的裸設備在ASM圖形界麵下看不到;
通過root用戶將裸設備的訪問權限授予了操作係統上的Oracle用戶;
這時,從圖形界麵的候選磁盤中已經可以看到這些裸設備了;
通過圖形界麵將裸設備加到了磁盤組中。
但是這個操作引發了兩個錯誤:分別是ORA-15032和ORA-15075。
首先看看在Oracle官方文檔上是如何描述這兩個錯誤的:
ORA-15032: not all alterations performed
Cause: At least one ALTER DISKGROUP action failed.
Action: Check the other messages issued along with this summary error.
ORA-15075: disk(s) are not visible cluster-wide
Cause: An ALTER DISKGROUP ADD DISK command specified a disk that could not be discoveredby one or more nodes in a RAC cluster configuration.
Action: Determine which disks are causing the problem from the GV$OSM_DISK fixedview. Check operating system permissions for the device and the storage sub-systemconfiguration on each node in a RAC cluster that cannot identify the disk.
其實ORA-15075錯誤中的信息已經足夠明顯了。根據這個錯誤進行分析應該就能很快找到問題的原因。但是由於發生了其他的意外,導致解決問題的方向發生了偏差。
這個錯誤導致了奇怪的現象:根據錯誤信息判斷,操作已經失敗了,但是檢查發現這些裸設備在DBCA的ASM配置中已經可見了。
當正在檢查這兩個錯誤信息時,同事告訴我節點2上的實例連不上了。
通過操作係統命令檢查發現數據庫實例2已經關閉了,不過這個節點上的ASM實例仍然存在。
這個現象的怪異之處在於:對ASM的操作引起的錯誤,當前ASM實例沒有出錯,卻有另外一個數據庫實例被關閉。
檢查alert文件如下,顯著的信息顯示是控製文件訪問出錯:
$ tail -500 alert*
List of nodes:
……
Thu Mar 29 17:25:36 2007
SUCCESS: disk DISK_0017(17.4042303525) added to diskgroup DISK
SUCCESS: disk DISK_0018(18.4042303520) added to diskgroup DISK
SUCCESS: disk DISK_0019(19.4042303521) added to diskgroup DISK
SUCCESS: disk DISK_0020(20.4042303522) added to diskgroup DISK
SUCCESS: disk DISK_0021(21.4042303523) added to diskgroup DISK
SUCCESS: disk DISK_0022(22.4042303524) added to diskgroup DISK
Thu Mar 29 17:29:45 2007
SUCCESS: diskgroup DISKwas dismounted
SUCCESS: diskgroup DISKwas dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file:'+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroupwas forcibly dismounted
Thu Mar 29 17:29:46 2007
LMON: terminating instancedue to error 204
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_pmon_2754.trc:
ORA-00204: error in reading(block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
System state dump ismade for local instance
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms1_2797.trc:
ORA-00204: error in reading(block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms0_2793.trc:
ORA-00204: error in reading(block , # blocks ) of control file
System State dumped totrace file /data/oracle/admin/testrac/bdump/testrac2_diag_2756.trc
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading(block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
嚐試重啟係統,看看會產生何種錯誤信息,這一次看到的顯示信息居然又是參數文件的讀取錯誤:
$ sqlplus "/ as sysdba"
已連接到空閑例程。
SQL> startup
ORA-01078: failure inprocessing system parameters
ORA-01565: error in identifyingfile '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could notlocate ASM instance serving a required diskgroup
SQL> shutdown
ORA-01034: ORACLE notavailable
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
那麼到底是怎麼回事呢?如果仔細一行一行來分析告警日誌,其實這時alert文件中已經明顯包含了導致錯誤的原因:
SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file:'+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroupwas forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading(block 35, # blocks 1) of control file
ORA-00202: control file:'+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
ASM的磁盤組首先已經DISMOUNT了,所以後麵對於ASM中文件的訪問當然會出現問題,隻不過我們在閱讀日誌時,注意力很容易被明顯、易懂、熟悉的吸引,往往就忽略了真實的問題所在,這就是人的選擇性注意力關注吧:
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading(block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
看到這個ORA-204錯誤信息,想當然地認為這是導致問題的原因。
其實如果查看隨後的啟動報錯信息就可以看出問題:
ORA-15077: could not locate ASM instance serving a requireddiskgroup。
Oracle官方文檔對這個錯誤的描述為:
ORA-15077: could not locateASM instance serving a required diskgroup
Cause: The instance failed to perform the specified operation because it could notlocate a required ASM instance.
Action: Start an ASM instance and mount the required diskgroup.
在遇到這個案例時,還同時遭遇了另外一個思維陷阱:此前剛剛碰到一個Bug,這個Bug的關鍵報錯信息恰好也是 ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora。於是忽略了以上的關鍵信息,而把關注點轉移到Bug上,並認為這次碰到的問題可能和上次有關。
參數文件不能讀取,怎麼辦呢?當時就去嚐試使用本地pfile文件啟動數據庫:
SQL> startup pfile=/export/home/oracle/inittestrac2.ora
ORACLE 例程已經啟動。
Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 503317800 bytes
Database Buffers 1627389952 bytes
Redo Buffers 14745600 bytes
ORA-00205: ?????????, ??????, ???????
出現了報錯後,又一次被誤導,去檢查ORA-00205報錯信息。
ORA-00205: error in identifying control file, check alert log for more info
Cause: The system could not find a control file of the specified name and size.
Action: Check that ALL control files are online and that they are the same filesthat the system created at cold start time.
直到發現控製文件本身並沒有問題—實例1一直正常運行。這時才意識到自己“誤入歧途”。
仔細檢查所有的報錯信息以及導致錯誤產生的原因——添加磁盤組的操作,終於發現了問題的真正原因:當時在給裸設備授權的時候,隻在節點1進行了授權,而沒有在節點2進行授權,因此節點1上的DBCA配置的ASM實例可以成功地將裸設備加到磁盤組中,而在節點2上同樣的操作由於缺少權限,導致了磁盤組DISMOUNT,最終導致了數據庫實例的關閉。
於是在節點2上對裸設備進行授權,重啟ASM實例,問題解決。
$ su -
Password:
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s1
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s3
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s4
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s5
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s6
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s7
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s1
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s3
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s4
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s5
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s6
# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s7
$ sqlplus "/ assysdba"
SQL> shutdown
ORA-01507: 未裝載數據庫
ORACLE 例程已經關閉。
$ srvctl stop asm -nracnode2
$ srvctl start asm -nracnode2
$ sqlplus "/ assysdba"
已連接到空閑例程。
SQL> startup
ORACLE 例程已經啟動。
Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 469763368 bytes
Database Buffers 1660944384 bytes
Redo Buffers 14745600 bytes
數據庫裝載完畢。
數據庫已經打開。
本來很簡單的一個問題卻大費周折。這個教訓說明解決問題的時候須冷靜地分析和判斷,否則很容易被一些其他的信息幹擾而誤入歧途,從而導致在解決問題時走上彎路。
尤其是,如果在誤入歧途的過程中,執行了破壞性或不可逆轉的操作,則可能產生運維事故。由此可見,DBA在診斷、分析和處理問題時,保持頭腦的清醒和冷靜是多麼的重要。
與DBA朋友們共勉。
本文出自數據和雲公眾號,原文鏈接
最後更新:2017-07-17 18:03:51