閱讀524 返回首頁    go 技術社區[雲棲]


診斷案例:從實例掛起到歸檔失敗和內存管理的蝴蝶效應

640?wxfrom=5&wx_lazy=1
楊廷琨(yangtingkun)

雲和恩墨 CTO

高級谘詢顧問,Oracle ACE 總監,ITPUB Oracle 數據庫管理版版主

編輯手記:在很多數據庫的故障案例中,一個簡單的疏忽可能導致問題被層層放大,最終導致故障,這就是蝴蝶效應的傳播原理。這裏分享的小案例自頂向下的追溯可以顯見:實例掛起->歸檔失敗->實例錯誤->參數配置。根本的原因往往很簡單,DBA的嚴謹尤其重要。


客戶的11.2.0.3 RAC數據庫出現了歸檔失敗的情況,導致單個實例出現HANG死的狀況。

檢查錯誤信息發現:

Tue Jul 02 16:49:13 2013

ARC1: Error 19504 Creating archive log file TO '+DATA02'

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance orcl1 - Archival Error

ORA-16038: log 14 SEQUENCE# 68244 cannot be archived

ORA-19504: failed TO CREATE file ""

ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.264.792274883'

ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.265.792274889'

Archiver process freed FROM errors. No longer stopped

Tue Jul 02 16:50:37 2013

ARC0: LGWR IS actively archiving destination LOG_ARCHIVE_DEST_3

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance orcl1 - Archival Error

ORA-16014: log 14 SEQUENCE# 68244 NOT archived, no available destinations

ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.264.792274883'

ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.265.792274889'

ARC0: Archive log rejected (thread 1 SEQUENCE 68240) at host 'orclsh'

FAL[server, ARC0]: FAL archive failed, see trace file.

ARCH: FAL archive failed. Archiver continuing

ORACLE Instance orcl1 - Archival Error. Archiver continuing.

由於歸檔失敗發生在ASM磁盤上,首先檢查ASM磁盤空間以及DB_RECOVERY_FILE_DEST_SIZE,ASM磁盤空間是足夠的,而且由於隻有一個節點出現出現了無法歸檔的問題,也可以排除是空間不足造成的。


確認兩個節點的DB_RECOVERY_FILE_DEST_SIZE參數設置都是0,基本上可以判斷問題和當前節點的ASM實例狀態不正常有關。


接下來檢查ASM實例的錯誤信息:

Tue Jul 02 16:41:43 2013

Dumping diagnostic DATA IN directory=[cdmp_20130702164115], requested BY (instance=2, osid=2032294 (LMD0)), summary=[incident=165521].

Tue Jul 02 16:49:13 2013

Dumping diagnostic DATA IN directory=[cdmp_20130702164845], requested BY (instance=2, osid=2032294 (LMD0)), summary=[incident=165522].


當前節點ASM實例出現了的這個信息,說明報錯發生在實例2上,檢查實例2的ASM告警日誌文件:

Tue Jul 02 18:29:55 2013

Errors IN file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_2032294.trc  (incident=186256):

ORA-04031: unable TO allocate 3768 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues")

USE ADRCI OR Support Workbench TO package the incident.

See Note 411.1 at My Oracle Support FOR error AND packaging details.

 Insufficient shared pool TO allocate a GES object (ospid 2032294)

Tue Jul 02 18:29:55 2013

Sweep [inc][186256]: completed

Tue Jul 02 18:36:49 2013

Errors IN file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_2032294.trc  (incident=186257):

ORA-04031: unable TO allocate 3768 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues")

USE ADRCI OR Support Workbench TO package the incident.

See Note 411.1 at My Oracle Support FOR error AND packaging details.

 Insufficient shared pool TO allocate a GES object (ospid 2032294)

果然實例2上的ASM出現了大量ORA-4031錯誤,這說明ASM實例的Shared Pool內存可能配置不足。檢查ASM啟動的參數配置:


Sat Aug 25 20:06:55 2012

NOTE: No asm libraries found IN the system

ERROR: -5(Duplicate disk DATA_DG01:ASM_DISK1)

ERROR: -5(Duplicate disk DATA_DG01:ASM_DISK2)

MEMORY_TARGET defaulting TO 411041792.

* instance_number obtained FROM CSS = 2, checking FOR the existence OF node 0... 

* node 0 does NOT exist. instance_number = 2 

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Private Interface 'en1' configured FROM GPnP FOR USE AS a private interconnect.

  [name='en1', TYPE=1, ip=169.254.78.6, mac=00-1a-64-bb-50-7d, net=169.254.0.0/16, mask=255.255.0.0, USE=haip:cluster_interconnect/62]

Public Interface 'en0' configured FROM GPnP FOR USE AS a public interface.

  [name='en0', TYPE=1, ip=10.1.16.35, mac=00-1a-64-bb-50-7c, net=10.1.16.32/27, mask=255.255.255.224, USE=public/1]

Picked latch-free SCN scheme 3

USING LOG_ARCHIVE_DEST_1 parameter DEFAULT VALUE AS /u01/app/11.2.0.3/grid/dbs/arch

Autotune OF undo retention IS turned ON. 

LICENSE_MAX_USERS = 0

SYS auditing IS disabled

NOTE: Volume support  enabled

Starting up:

Oracle DATABASE 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

WITH the REAL Application Clusters AND Automatic Storage Management options.

ORACLE_HOME = /u01/app/11.2.0.3/grid

System name: AIX

Node name: orcldb2

Release: 1

Version: 6

Machine: 00C94E064C00

USING parameter settings IN server-side pfile /u01/app/11.2.0.3/grid/dbs/init+ASM2.ora

System parameters WITH non-DEFAULT VALUES:

  large_pool_size          = 12M

  instance_type            = "asm"

  remote_login_passwordfile= "EXCLUSIVE"

  asm_diskstring           = "/dev/ocr_*"

  asm_diskstring           = "/dev/voting_*"

  asm_diskstring           = "/dev/asm_*"

  asm_diskgroups           = "DATA"

  asm_diskgroups           = "DATA_DG01"

  asm_diskgroups           = "SPFILE_DG"

  asm_power_limit          = 1

  diagnostic_dest          = "/u01/app/grid"

Cluster communication IS configured TO USE the following interface(s) FOR this instance

  169.254.78.6

cluster interconnect IPC version:Oracle UDP/IP (generic)

IPC Vendor 1 proto 2


當前ASM實例使用默認的MEMORY_TARGET配置,分配大小大約是400M,根據Oracle的MOS文章:ASM & Shared Pool (ORA-4031) [ID 437924.1],在11.2.0.3中,Oracle增加了ASM實例所允許的默認進程數PROCESSES,但是默認的MEMORY_TARGET參數沒有增加。

根據Oracle的建議,11.2.0.3的MEMORY_TARGET至少應該設置到1536M,而MEMORY_MAX_TARGET設置為4096M

SQL> ALTER system SET memory_max_target=4096m scope=spfile;

SQL> ALTER system SET memory_target=1536m scope=spfile;

對於當前的情況,如果短時間內無法重啟DB和ASM實例,可以在問題節點配置一個第二本地歸檔路徑,設置目標路徑為本地磁盤,從而避免歸檔無法完成而導致的實例HANG死。


本文出自數據和雲公眾號,原文鏈接


最後更新:2017-07-17 18:03:50

  上一篇:go  一波三折:DBA需要頭腦冷清思路清晰解決故障以幸存
  下一篇:go  實踐真知:使用ASM和文件係統的數據庫在AIO上有何不同?