一波三折：DBA需要頭腦冷清思路清晰解決故障以幸存

640?wxfrom=5&wx_lazy=1

楊廷琨（yangtingkun）

雲和恩墨 CTO

高級谘詢顧問，Oracle ACE總監，ITPUB Oracle數據庫管理版版主

這是一則生產環境的真實維護過程，由於RAC的測試環境空間不足，因此規劃給ASM擴展空間，然而在給ASM添加新的磁盤空間時又出現了故障，這類問題在很多用戶的生產環境中可能也會遇到。

空間擴展的操作步驟如下：

在RAC環境的節點1上啟動了DBCA工具來管理ASM設備；

由於新增的裸設備在ASM圖形界麵下看不到；

通過root用戶將裸設備的訪問權限授予了操作係統上的Oracle用戶；

這時，從圖形界麵的候選磁盤中已經可以看到這些裸設備了；

通過圖形界麵將裸設備加到了磁盤組中。

但是這個操作引發了兩個錯誤：分別是ORA-15032和ORA-15075。

首先看看在Oracle官方文檔上是如何描述這兩個錯誤的：

ORA-15032: not all alterations performed
Cause: At least one ALTER DISKGROUP action failed.
Action: Check the other messages issued along with this summary error.
ORA-15075: disk(s) are not visible cluster-wide
Cause: An ALTER DISKGROUP ADD DISK command specified a disk that could not be discoveredby one or more nodes in a RAC cluster configuration.
Action: Determine which disks are causing the problem from the GV$OSM_DISK fixedview. Check operating system permissions for the device and the storage sub-systemconfiguration on each node in a RAC cluster that cannot identify the disk.

其實ORA-15075錯誤中的信息已經足夠明顯了。根據這個錯誤進行分析應該就能很快找到問題的原因。但是由於發生了其他的意外，導致解決問題的方向發生了偏差。

這個錯誤導致了奇怪的現象：根據錯誤信息判斷，操作已經失敗了，但是檢查發現這些裸設備在DBCA的ASM配置中已經可見了。

當正在檢查這兩個錯誤信息時，同事告訴我節點2上的實例連不上了。

通過操作係統命令檢查發現數據庫實例2已經關閉了，不過這個節點上的ASM實例仍然存在。

這個現象的怪異之處在於：對ASM的操作引起的錯誤，當前ASM實例沒有出錯，卻有另外一個數據庫實例被關閉。

檢查alert文件如下，顯著的信息顯示是控製文件訪問出錯：

$ tail -500 alert*

List of nodes:

……

Thu Mar 29 17:25:36 2007

SUCCESS: disk DISK_0017(17.4042303525) added to diskgroup DISK

SUCCESS: disk DISK_0018(18.4042303520) added to diskgroup DISK

SUCCESS: disk DISK_0019(19.4042303521) added to diskgroup DISK

SUCCESS: disk DISK_0020(20.4042303522) added to diskgroup DISK

SUCCESS: disk DISK_0021(21.4042303523) added to diskgroup DISK

SUCCESS: disk DISK_0022(22.4042303524) added to diskgroup DISK

Thu Mar 29 17:29:45 2007

SUCCESS: diskgroup DISKwas dismounted

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:

ORA-00202: control file:'+DISK/testrac/control01.ctl'

ORA-15078: ASM diskgroupwas forcibly dismounted

Thu Mar 29 17:29:46 2007

LMON: terminating instancedue to error 204

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_pmon_2754.trc:

ORA-00204: error in reading(block , # blocks ) of control file

Thu Mar 29 17:29:46 2007

System state dump ismade for local instance

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms1_2797.trc:

ORA-00204: error in reading(block , # blocks ) of control file

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms0_2793.trc:

ORA-00204: error in reading(block , # blocks ) of control file

System State dumped totrace file /data/oracle/admin/testrac/bdump/testrac2_diag_2756.trc

Thu Mar 29 17:29:47 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:

ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )

Thu Mar 29 17:29:47 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:

ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )

Thu Mar 29 17:29:47 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:

ORA-00204: error in reading(block , # blocks ) of control file

Thu Mar 29 17:29:52 2007

Instance terminated by LMON, pid = 2789

嚐試重啟係統，看看會產生何種錯誤信息，這一次看到的顯示信息居然又是參數文件的讀取錯誤：

$ sqlplus "/ as sysdba"

已連接到空閑例程。

SQL> startup

ORA-01078: failure inprocessing system parameters

ORA-01565: error in identifyingfile '+DISK/testrac/spfiletestrac.ora'

ORA-17503: ksfdopn:2Failed to open file +DISK/testrac/spfiletestrac.ora

ORA-15077: could notlocate ASM instance serving a required diskgroup

SQL> shutdown

ORA-01034: ORACLE notavailable

ORA-27101: shared memory realm does not exist

SVR4 Error: 2: No such file or directory

那麼到底是怎麼回事呢？如果仔細一行一行來分析告警日誌，其實這時alert文件中已經明顯包含了導致錯誤的原因：

SUCCESS: diskgroup DISK was dismounted

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:

ORA-00202: control file:'+DISK/testrac/control01.ctl'

ORA-15078: ASM diskgroupwas forcibly dismounted

Thu Mar 29 17:29:46 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:

ORA-00204: error in reading(block 35, # blocks 1) of control file

ORA-00202: control file:'+DISK/testrac/control01.ctl'

ORA-15078: ASM diskgroup was forcibly dismounted

ASM的磁盤組首先已經DISMOUNT了，所以後麵對於ASM中文件的訪問當然會出現問題，隻不過我們在閱讀日誌時，注意力很容易被明顯、易懂、熟悉的吸引，往往就忽略了真實的問題所在，這就是人的選擇性注意力關注吧：

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:

ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )

Thu Mar 29 17:29:47 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:

ORA-00204: 讀取控製文件時出錯(塊 , # 塊 )

Thu Mar 29 17:29:47 2007

Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:

ORA-00204: error in reading(block , # blocks ) of control file

Thu Mar 29 17:29:52 2007

Instance terminated by LMON, pid = 2789

看到這個ORA-204錯誤信息，想當然地認為這是導致問題的原因。

其實如果查看隨後的啟動報錯信息就可以看出問題：

ORA-15077: could not locate ASM instance serving a requireddiskgroup。

Oracle官方文檔對這個錯誤的描述為：

ORA-15077: could not locateASM instance serving a required diskgroup
Cause: The instance failed to perform the specified operation because it could notlocate a required ASM instance.
Action: Start an ASM instance and mount the required diskgroup.

在遇到這個案例時，還同時遭遇了另外一個思維陷阱：此前剛剛碰到一個Bug，這個Bug的關鍵報錯信息恰好也是 ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora。於是忽略了以上的關鍵信息，而把關注點轉移到Bug上，並認為這次碰到的問題可能和上次有關。

參數文件不能讀取，怎麼辦呢？當時就去嚐試使用本地pfile文件啟動數據庫：

SQL> startup pfile=/export/home/oracle/inittestrac2.ora

ORACLE 例程已經啟動。

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 503317800 bytes
Database Buffers 1627389952 bytes
Redo Buffers 14745600 bytes

ORA-00205: ?????????, ??????, ???????

出現了報錯後，又一次被誤導，去檢查ORA-00205報錯信息。

ORA-00205: error in identifying control file, check alert log for more info
Cause: The system could not find a control file of the specified name and size.
Action: Check that ALL control files are online and that they are the same filesthat the system created at cold start time.

直到發現控製文件本身並沒有問題—實例1一直正常運行。這時才意識到自己“誤入歧途”。

仔細檢查所有的報錯信息以及導致錯誤產生的原因——添加磁盤組的操作，終於發現了問題的真正原因：當時在給裸設備授權的時候，隻在節點1進行了授權，而沒有在節點2進行授權，因此節點1上的DBCA配置的ASM實例可以成功地將裸設備加到磁盤組中，而在節點2上同樣的操作由於缺少權限，導致了磁盤組DISMOUNT，最終導致了數據庫實例的關閉。

於是在節點2上對裸設備進行授權，重啟ASM實例，問題解決。

$ su -

Password:

Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s1

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s3

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s4

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s5

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s6

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad6s7

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s1

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s3

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s4

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s5

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s6

# chown oracle:oinstall/dev/rdsk/c2t500601603022E66Ad7s7

$ sqlplus "/ assysdba"

SQL> shutdown

ORA-01507: 未裝載數據庫

ORACLE 例程已經關閉。

$ srvctl stop asm -nracnode2

$ srvctl start asm -nracnode2

$ sqlplus "/ assysdba"

已連接到空閑例程。

SQL> startup

ORACLE 例程已經啟動。

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 469763368 bytes
Database Buffers 1660944384 bytes
Redo Buffers 14745600 bytes

數據庫裝載完畢。

數據庫已經打開。

本來很簡單的一個問題卻大費周折。這個教訓說明解決問題的時候須冷靜地分析和判斷，否則很容易被一些其他的信息幹擾而誤入歧途，從而導致在解決問題時走上彎路。

尤其是，如果在誤入歧途的過程中，執行了破壞性或不可逆轉的操作，則可能產生運維事故。由此可見，DBA在診斷、分析和處理問題時，保持頭腦的清醒和冷靜是多麼的重要。

與DBA朋友們共勉。

本文出自數據和雲公眾號，原文鏈接

最後更新：2017-07-17 18:03:51

一波三折：DBA需要頭腦冷清思路清晰解決故障以幸存

上一篇：墨菲定律：一個參數Drop_caches導致集群數據庫實例崩潰

下一篇：診斷案例：從實例掛起到歸檔失敗和內存管理的蝴蝶效應

相關內容

熱門內容

最新內容

一波三折：DBA需要頭腦冷清思路清晰解決故障以幸存

上一篇： 墨菲定律：一個參數Drop_caches導致集群數據庫實例崩潰

下一篇： 診斷案例：從實例掛起到歸檔失敗和內存管理的蝴蝶效應

相關內容

熱門內容

最新內容

上一篇：墨菲定律：一個參數Drop_caches導致集群數據庫實例崩潰

下一篇：診斷案例：從實例掛起到歸檔失敗和內存管理的蝴蝶效應