747
技術社區[雲棲]
Oracle壞塊問題處理 Oracle壞塊修複 Oracle壞塊怎麼辦
Oracle數據庫出現壞塊現象是指:在Oracle數據庫的一個或多個數據塊(一個數據塊的容量在創建數據庫時由db_block_size參數指定,缺省為8K)內出現內容混亂的現象。由於正常的數據塊都有固定的合法內容格式,壞塊的出現,導致數據庫進程無法正常解析數據塊的內容,進而使數據庫進程報錯乃至掛起,並級聯導致整個數據庫實例出現異常。
物理壞塊:也可以稱為介質壞塊,指的是塊格式本身是壞的,塊內的數據沒有任何意義。
邏輯壞塊:指的是塊內的數據在邏輯是存在問題。比如說索引塊的索引值沒有按從小到大排列。物理壞塊一般是由於內存問題、OS問題、IO子係統問題和硬件引起,邏輯壞塊一般是是由於Oracle Bug等原因引起。
另據metalink文檔Identify the corruption extension using RMAN/DBV/ANALYZE etc [ID 836658.1],有如下分類:
Datafile Block Corruption - Physical/Logical
Table/Index Mismatch
Extents Inconsistencies
Data Dictionary Inconsistencies
可見除了物理和邏輯壞塊,還有一些邏輯不一致的情況出現(TABLE / INDEX Mismatch),在本文檔不做過多討論。
壞塊產生的原因大致有以下幾種:
2.1 硬件問題
Oracle進程在處理一個數據塊時,首先將其讀入物理內存空間,在處理完成後,再由特定進程將其寫回磁盤;如果在這個過程中,出現內存故障,CPU計算失誤,都會導致內存數據塊的內容混亂,最後反映到寫回磁盤的數據塊內容有誤。同樣,如果存儲子係統出現異常,數據塊損壞也就隨之出現了。
2.2 操作係統BUG
由於Oracle進程對數據塊的讀寫,都是以操作係統內核調用(system call)的方式完成的,如果操作係統在內核調用存在問題,必然導致Oracle進程寫入非法的內容。
2.3 操作係統的I/O錯誤或緩衝問題
2.4 內存或paging問題
Oracle軟件BUG。
Oracle軟件特定版本上,可能出現導致數據塊的內容出現異常BUG。
2.5 非Oracle進程擾亂Oracle共享內存區域
如上文所述,在當數據塊的內容被讀入主機的物理內存時,如果其他非Oracle進程,對Oracle使用的共享內存區域形成了擾亂,最終導致寫回磁盤的數據塊內容混亂。
2.6 異常關機,掉電,終止服務
異常關機,掉電,終止服務使進程異常終止,而破壞數據塊的完整性,導致壞塊產生。(注:這也是為什麼突然斷電會導致數據庫無法啟動)
由上可見,壞塊的形成原因複雜。當出現壞塊時,為了找到確切的原因,需要大量的分析時間和排查操作,甚至需要多次重現才能找出根本原因。但當故障發生在生產係統上,我們為了減少停機時間,會盡快實施應急權變措施以保證係統的可用性,這樣就破壞了故障現場,對根本原因的分析因而也更加困難了。
3.1 Alert日誌後台日誌
3.2 rman命令
RMAN> BACKUP CHECK LOGICAL VALIDATE DATAFILE 數據文件號;
SQL>select * from v$database_block_corruption where file#=4;
3.3 dbv工具
$dbv file=/u01/app/oracle/oradata/PROD/user01.dbf
3.4 確定壞塊對象信息
SQL>Select tablespace_name,segment_type,owner,segment_name From dba_extents Where file_id=數據文件號 and 塊號 between block_id and block_id+blocks-1;
3.5 ora錯誤
SQL> select count(*) from t1;
select count(*) from t1
*
第 1 行出現錯誤:
ORA-01578: ORACLE 數據塊損壞 (文件號 7, 塊號 131)
ORA-01110: 數據文件 7: 'E:\WH.DBF'
3.6 關於以上排查方法的區別和優缺點
1)RMAN的BACKUP CHECK LOGICAL VALIDATE DATAFILE
該命令可以設置並行度PARALLELISM,即使是對一個數據文件檢查也可以多個通道並行,所以效率是比較高的;和dbv相比,該方法也可以通過v$session_longops更加方便的完成進度監控;且rman會把壞塊信息保存到控製文件 (v$database_block_corruption, v$backup_corruption),通過sql可以很好的完成對應表空間、數據文件、對象的統計。
使用該方法的前提是需要數據庫運行於歸檔模式,如果是非歸檔的話隻能在mount下使用。
2) Dbv
Dbv執行塊級的檢測,所以這個工具無法檢查如INDEX versus TABLE mismatches的情況;也無法做到自動並行;結果在os以平麵文件的形式存儲,不易統計。
3) ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE
該命令是會在表上加tm鎖,但是不影響dml。
關於該命令,SQL Reference解釋如下:
Specify ONLINE to enable Oracle Database to run the validation while DML operations are ongoing within the object. The database reduces the amount of validation performed to allow for concurrency.
Note:
When you validate the structure of an object ONLINE, Oracle Database does not collect any statistics, as it does when you validate the structure of the object OFFLINE.
Specify OFFLINE, to maximize the amount of validation performed. This setting prevents INSERT, UPDATE, and DELETE statements from concurrently accessing the object during validation but allows queries. This is the default.
也就是該命令不會收集任何統計信息,不會引起sql執行計劃的突變;性能方麵,對於大並發的係統可能會有輕微影響。
由於dbv和rman都比較貼近底層,在mount時候就可以做,可見這兩個工具都是比較貼近塊級檢測的,所以在做logical inconsistents(TABLE / INDEX Mismatch)方麵的檢測,還是需要ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE這個方法。
在metalink上麵有文檔,可以利用該工具,在表空間級別完成檢測,大大提高了檢測效率,文檔為SCRIPT: VALIDATE.SQL to ANALYZE .. VALIDATE STRUCTURE objects in a Tablespace [ID 100419.1]。
實際上dbv、rman、validate structure均可以檢測出一定程度的邏輯訛誤,但是最可靠的還是db_block_checksum=true情況下的validate structure [online]驗證命令。從另一個角度來說,普通的dbv隻能做單一的檢測,而無法做到交叉地檢驗,從而了解表和索引上的不一致問題,但是validate structure online卻可以做到。
所以在日常檢測,考慮到效率和可靠性等,最好用rman+validate structure來完成。
四,如何修複壞塊
4.1 無備份情況
4.1.1 EXP/IMP方式
4.1.2 通過內部事件跳過壞塊
SQL>Alter system set events='10231 trace name context forever,level 10';
$exp userid=wh/wh file=t1.dump tables=t1
SQL>drop table t1;
$imp userid=wh/wh file=t1.dump tables=t1
4.1.3 DBMS_REPAIR包
a)創建表
SQL>BEGIN
DBMS_REPAIR.ADMIN_TABLES (
TABLE_NAME => 'REPAIR_TABLE01',
TABLE_TYPE => dbms_repair.repair_table,
ACTION => dbms_repair.create_action,
TABLESPACE => 'WH');
END;
/
b)收集壞塊信息
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'WH',
OBJECT_NAME => 'T1',
REPAIR_TABLE_NAME => 'REPAIR_TABLE01',
corrupt_count => num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/
c)查看結果
SQL> select object_name, block_id, corrupt_type, marked_corrupt,corrupt_description,repair_description from repair_table01;
d)跳過壞塊
BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'WH',
OBJECT_NAME => 'T1',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.SKIP_FLAG);
END;
/
注:這兩種方法均是跳過壞塊,而不去修複,數據有丟失。
4.1.4 rman製造壞塊,bbed修複壞塊
a)建立測試表
SQL> create tablespace test datafile '/home/oracle/oradata/orcl/test.dbf' size 100m;
Tablespace created.
SQL> create table test (a number,b varchar2(20)) tablespace test;
Table created.
SQL> Select rowid,dbms_rowid.rowid_relative_fno(rowid) rel_fno,dbms_rowid.rowid_block_number(rowid) block from test;
ROWID REL_FNO BLOCK
------------------ ---------- ----------
AAASwmAAGAAAACHAAA 6 135
AAASwmAAGAAAACHAAB 6 135
AAASwmAAGAAAACHAAC 6 135
AAASwmAAGAAAACHAAD 6 135
AAASwmAAGAAAACHAAE 6 135
AAASwmAAGAAAACHAAF 6 135
6 rows selected.
SQL> alter system checkpoint;
System altered.
b)DBV檢測數據文件
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:39:00 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 5
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911665 (0.911665)
這裏創建了test表,數據存儲在file 6 block 135 中,現在該block一切正常,本試驗就是要通過rman來使得該block corrupt,然後通過bbed來修複
c)bbed查看kcbh
BBED> set dba 6,135
BBED> print kcbh
struct kcbh, 20 bytes @0
ub1 type_kcbh @0 0x06
ub1 frmt_kcbh @1 0xa2
ub1 spare1_kcbh @2 0x00
ub1 spare2_kcbh @3 0x00
ub4 rdba_kcbh @4 0x01800087
ub4 bas_kcbh @8 0x000de931
ub2 wrp_kcbh @12 0x0000
ub1 seq_kcbh @14 0x01
ub1 flg_kcbh @15 0x06 (KCBHFDLC, KCBHFCKV)
ub2 chkval_kcbh @16 0x59c2
ub2 spare3_kcbh @18 0x0000
d)RMAN標記壞塊
RMAN> BLOCKRECOVER DATAFILE 6 block 135 clear;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
Finished recover at 17-JUL-13
e)dbv再次檢查壞塊
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:48:44 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
Page 135 is marked corrupt
Corrupt block relative dba: 0x01800087 (file 6, block 135)
Bad check value found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x01800087
last change scn: 0x0000.000de931 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe9310601
check value in block header: 0x59c2
computed block checksum: 0x923b
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 4
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 1
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911663 (0.911663)
SQL> select * from test;
select * from test
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 135)
ORA-01110: data file 6: '/home/oracle/oradata/orcl/test.dbf'
f)bbed驗證壞塊
BBED> set dba 6,135
BBED> print kcbh --查看塊的結構體
struct kcbh, 20 bytes @0
ub1 type_kcbh @0 0x06 --Block type
ub1 frmt_kcbh @1 0xa2 --Block format 1=Oracle 7, 2=Oracle 8+
ub1 spare1_kcbh @2 0x00 --Not used
ub1 spare2_kcbh @3 0x00 --Not used
ub4 rdba_kcbh @4 0x01800087 --RDBA -Relative Data Block Address
ub4 bas_kcbh @8 0x000de931 --SCN Base
ub2 wrp_kcbh @12 0x0000 --SCN Wrap
ub1 seq_kcbh @14 0x01 --Sequence number, incremented for every change made to the block at the same SCN
ub1 flg_kcbh @15 0x06 (KCBHFDLC, KCBHFCKV)
ub2 chkval_kcbh @16 0x59c2 --這裏沒有變化 --[Optional block checksum (if DB_BLOCK_CHECKSUM=TRUE)]
ub2 spare3_kcbh @18 0x0000 --Not used
BBED> verify --驗證壞塊
DBVERIFY - Verification starting
FILE = /home/oracle/oradata/orcl/test.dbf
BLOCK = 135
Block 135 is corrupt
Corrupt block relative dba: 0x01800087 (file 0, block 135)
Bad check value found during verification
Data in bad block:
type: 6 format: 2 rdba: 0x01800087
last change scn: 0x0000.000de931 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe9310601
check value in block header: 0x59c2
computed block checksum: 0x923b
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 1
Total Blocks Influx : 0
Message 531 not found; product=RDBMS; facility=BBED
這裏通過dbv,bbed,select table都證明rman能夠標記block為Corrupt.
這裏需要分析:block已經被標記,那證明該塊肯定有修改,也就是說chkval_kcbh一定要變化,但是這裏沒有變化,證明該處異常
g)bbed修複rman生成壞塊
BBED> set dba 6,135
BBED> sum apply
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Check value for File 6, Block 135:
current = 0xcbf9, required = 0xcbf9
BBED> verify
DBVERIFY - Verification starting
FILE = /home/oracle/oradata/orcl/test.dbf
BLOCK = 135
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
Message 531 not found; product=RDBMS; facility=BBED
BBED測試已經恢複
h)DBV驗證
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:55:39 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 5
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911665 (0.911665)
DBV測試已經恢複
i)RMAN 驗證
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=30 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 OK 0 0 1 911665
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1
Index 0 0
Other 0 0
Finished validate at 17-JUL-13
SQL> select count(*) from test;
COUNT(*)
----------
6
總結說明
1.我們可以通過rman的clear命令來標記壞塊(BLOCKRECOVER DATAFILE file# BLOCK block1#, block2#, block3#… CLEAR ;)
2.我們可以通過bbed的sum apply命令來修複該類型壞塊
4.1.5 利用構造ROWID實現無備份情況下繞過ORA-1578、ORA-8103、ORA-1410等邏輯/物理壞塊問題(速度較慢)
對於無備份情況下的ORA-1578、ORA-8103、ORA-1410等邏輯/物理壞塊問題,可以通過如下PL/SQL 構造ROWID的方式挽救絕大多數非壞塊的數據,
一般分成2種情況:有索引可以用來獲取ROWID; 或者沒有索引、索引不可用,必須通過dbms_rowid.ROWID_CREATE來構造ROWID的方式。
a)創建示例數據
create table liuwen (t1 int,t2 date default sysdate) tablespace liuwen
partition by range(t1) (partition p1 values less than (10000),
partition p2 values less than (20000),
partition p3 values less than (30000),
partition p4 values less than (40000),
partition p5 values less than (50000),
partition p6 values less than (60000),
partition p7 values less than (70000),
partition p8 values less than (80000),
partition p9 values less than (90000),
partition p10 values less than (100000),
partition p11 values less than (110000),
partition p12 values less than (120000),
partition p13 values less than (130000),
partition p14 values less than (140000),
partition p15 values less than (150000),
partition p16 values less than (160000));
insert into liuwen(t1) select rownum from dual connect by level<1000;
SQL> select count(*) from liuwen;
COUNT(*)
----------
5999
exec dbms_stats.gather_schema_stats(‘SYS’);
alter system flush buffer_cache;
隨機采樣一些塊來做 壞塊 約涉及到5個塊的數據
SQL> set linesize 200 pagesize 1400
SQL> select dbms_rowid.rowid_block_number(rowid) blkid,
2 dbms_rowid.rowid_relative_fno(rowid) rfile
3 from liuwen
4 where rownum <= 1100
5 group by dbms_rowid.rowid_block_number(rowid),
6 dbms_rowid.rowid_relative_fno(rowid)
7 order by 1;
BLKID RFILE
---------- ----------
165 7
168 7
171 7
b)製造壞塊
RMAN> blockrecover datafile 7 block 165,168,171 clear;
Starting recover at 23-JUL-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=30 device type=DISK
Finished recover at 23-JUL-13
SQL> select count(*) from liuwen;
select count(*) from liuwen
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 7, block # 165)
ORA-01110: data file 7: '/home/oracle/oradata/orcl/liuwen.dbf'
通過blockrecover datafile block clear 構造了一係列壞塊且沒有備份 ,我們通過下列腳本挽回大部分可用數據.
c)創建恢複表 liuwen_tab_backup,錯誤表 bad_rows
liuwen_tab_backup用於存放導出的可用的數據,bad_rows用於存放不可用的記錄的rowid。
SQL> create table liuwen_tab_backup
2 tablespace liuwen
3 as select * from liuwen where 1=0;
create table bad_rows (row_id rowid,oracle_error_code varchar2(50))
tablespace liuwen;
set serveroutput on;
set timing on;
d)執行程序塊
該匿名塊根據對象所在塊循環,依次對塊裏麵的所有記錄進行遍曆,如果該記錄可用,那麼導出到新表裏麵,如果不可用,那麼記錄不可用的記錄的rowid。循環套循環,隻能窮舉,這也是此方法較慢的原因之一。另外,如果這個對象所在表空間不是標準的8k塊大小,如果是16k或者32k,在理論上, Oracle用12bit來記錄記錄數的. 也就是最多4k條記錄,這樣的話,最內層的循環會更多速度更慢。
DECLARE
TYPE RowIDTab IS TABLE OF ROWID INDEX BY BINARY_INTEGER;
CURSOR Crowid_info IS
select Do.DATA_OBJECT_ID dataid,
DE.FILE_ID fid,
DE.BLOCK_ID blkid,
DE.BLOCKS blkcnt
from dba_objects DO, dba_extents DE
where DO.OBJECT_NAME = 'LIUWEN'
--and DE.PARTITION_NAME='&PARTITION_NAME'
--若指定分區則取消注釋
and nvl(DO.SUBOBJECT_NAME,'-1') = nvl(DE.PARTITION_NAME,'-1')
and DO.OBJECT_NAME = DE.SEGMENT_NAME
and DO.owner = 'SYS'
order by 1, 2, 3 asc;
bad_rows number := 0;
errors varchar2(500);
error_code varchar2(500);
myrowid rowid;
BEGIN
execute immediate 'alter session set commit_write=''batch,nowait'' ';
for i in Crowid_info loop --對象所對應所有的物理塊位置
for j in 0 .. i.blkcnt - 1 loop --逐個塊循環
for z in 0 .. 2000 loop --塊裏麵一行一行看,rownum(一般一個8k的塊,記錄數小於736行,如果碰到大塊,應該還加大此參數,循環套循環,隻能窮舉,這也是此方法較慢的原因之一)
begin
myrowid := dbms_rowid.ROWID_CREATE(1,
i.dataid,
i.fid,
i.blkid + j,
z);
insert into liuwen_tab_backup
select /*+ ROWID(A) */
*
from liuwen A
where rowid = myrowid;
--能讀取的放到backup裏麵備出來
EXCEPTION
when OTHERS then
BEGIN
errors := SQLERRM;
error_code := SQLCODE;
if (error_code like '%1410%' or error_code like '%8103%' or error_code like '%1578%') then
bad_rows := bad_rows + 1;
insert into bad_rows values (myrowid, error_code);
--不能讀取的,rowid也放到相應表裏麵備案
commit;
else
raise;
end if;
END;
commit;
end;
end loop;
end loop;
end loop;
dbms_output.put_line('Total Bad Rows: ' || bad_rows);
commit;
END;
4.2 有備份情況
4.2.1采用rman恢複
RMAN>blockrecover datafile 7 block 131 from backupset;
Oracle會修複壞塊,數據無丟失。推薦方式。
4.2.2 11g新特性recover corruption list
11g新特性RMAN語法recover corruption list是為了簡化數據壞塊的修複,在11g中recover corruption塊時不需要一一指定數據文件名字了,隻要是在v$database_block_corruption視圖中記錄的壞塊,隻要使用了 corruption list語法,都會試圖修複。
下麵我們使用recover .. clear命令手動造成個別數據塊壞塊,之後使用 recover corruption list;
a)修複:
RMAN> BLOCKRECOVER DATAFILE 6 block 135 clear;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
Finished recover at 17-JUL-13
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 FAILED 0 0 1 0
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 1 1
Index 0 0
Other 0 0
validate found one or more corrupt blocks
See trace file /home/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_12203.trc for details
Finished validate at 17-JUL-13
SQL> select * from v$database_block_corruption;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ ------------------------
6 135 1 911665 FRACTURED
select count(*) from test
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 135)
ORA-01110: data file 6: '/home/oracle/oradata/orcl/test.dbf'
b)恢複
RMAN> recover corruption list;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00006
channel ORA_DISK_1: reading from backup piece /home/oracle/fast_recovery_area/ORCL/backupset/2013_07_17/o1_mf_nnndf_TAG20130717T062935_8yclbhvn_.bkp
channel ORA_DISK_1: piece handle=/home/oracle/fast_recovery_area/ORCL/backupset/2013_07_17/o1_mf_nnndf_TAG20130717T062935_8yclbhvn_.bkp tag=TAG20130717T062935
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01
starting media recovery
media recovery complete, elapsed time: 00:00:04
Finished recover at 17-JUL-13
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 OK 0 0 1 911665
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1
Index 0 0
Other 0 0
Finished validate at 17-JUL-13
SQL> select count(*) from test;
COUNT(*)
----------
6
五,如何預防
在有備份的情況下可以通過blockrecovery在線修複部分邏輯壞塊,但是如果該邏輯訛誤確實是由Oracle Bug引起的話,那麼很有可能blockrecover也無能為力,需要具體問題具體分析之。壞塊問題破壞性大,但並非不可預防。
5.1 底層
1)重要的文件係統、或者存儲,raid也好,二次條帶也好,不管是在哪一層,一定要冗餘。
2)修改vg參數要注意,一定要先做好備份。
3)在為提高性能為操作係統打開異步I/O時,一定要把操作係統與異步I/O相關的補丁要打全。
4)尊重硬件生命周期,要及時的檢查硬件的狀態,及時更換驅動器部件。
5)關注由第三方軟硬件引起的壞塊相關問題,參考metalink文檔,Known Corruption issues caused by 3rd party Software/Hardware Provider [ID 1323649.1]。
5.2 數據庫設置
1)數據文件避免自動增長,使用裸設備避免破壞數據文件的文件頭。
2)控製文件、日誌文件一定要多工。
3)要開啟歸檔,因為是rman validate check logical database;的前提條件,也是利用rman做recover的前提條件。
5.3 備份
1)製定一個良好的備份恢複策略,不能光放到本地文件係統,最好傳到nas或者備份到帶庫,最好有表的EXP備份。
2)要做備份恢複測試,對備份文件的有效性進行檢查,如:
restore validate database;
restore validate controlfile;
restore validate archivelog between lowseq and highseq;
5.4 數據加載
1)首先數據加載,盡量不使用nologging模式。
2)如果要使用nologging,在加載數據後,盡快全備數據庫。
3)使用nologging模式創建的對象,創建後要馬上改為logging,且在dba_tables和dba_indexes裏麵檢查該對象所對應logging字段是否為yes
例如語句 alter index table1_PK rebuild online parallel nologging compute statistics,該語句在rebuild時候指定了nologging,所以其缺省存儲屬性是nologging的,這樣dg打開後會遇到壞塊問題,所以如果有dg,不建議使用nologging選項。
4)永久表和索引,不管是不是分區,都應該確認logging,分區表和分區索引新添加的分區默認的default attributes也應該是logging。
5.5 DG
1)物理dg要定期open read only打開,可以跑一些查詢,或者啟動到mount後backup validate check logical database。
2)dg主庫,隻有forcelogging不夠,要經常檢查dba_tables和dba_indexes裏麵logging字段是否全部為yes,另外建議開啟supplemental log,否則萬一用到日誌挖掘恢複某些數據,防止sql_redo出現UNSUPPORTED而無法抓取到。
5.6 定期做相關檢查
1)alert日誌。
2)BACKUP CHECK LOGICAL VALIDATE DATAFILE,並檢查v$database_block_corruption。
3)ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE,檢查內部邏輯不一致。
5.7 官方發布的預警等
在Metalink.oracle.com網站,Oracle定期發布基於特定軟件版本的“已知問題(known issues)說明”。對於可能導致壞塊的Oracle軟件BUG,在Oracle公司內部,是作為高嚴重級別的問題進行處理,在“已知問題(known issues)說明”中,這些BUG以嚴重(Noticable)問題標出(標記為*或+),部分問題,Oracle還會發布警告(Alert)通告。在文檔中,Oracle會提供相應的補丁或應對措施。
比如bug 5386204,該問題是在使用asm存儲時,通過直接路徑加載的話,可能會遇到壞塊的bug,對應oracle軟件版本為10204及10203,而這兩個版本的軟件使用時非常廣泛的,該bug到10205才被完全修複。可見關注官方相關信息的重要性,可以提前預警、防患於未然。
所以,要密切關注官方的安全預警、bug修正,例如:https://www.oracle.com/technetwork/topics/security/alerts-086861.html是Critical Patch Updates, Security Alerts and Third Party Bulletin,另外還有Oracle Recommended Patches -- Oracle Database [ID 756671.1]。
附參考metalink文檔:
Physical and Logical Block Corruptions. All you wanted to know about it. [ID 840978.1]
SCRIPT: VALIDATE.SQL to ANALYZE .. VALIDATE STRUCTURE objects in a Tablespace [ID 100419.1]
ORA-1499. Table/Index row count mismatch [ID 563070.1]
Known Corruption issues caused by 3rd party Software/Hardware Provider [ID 1323649.1]
Identify the corruption extension using RMAN/DBV/ANALYZE etc [ID 836658.1]
OERR: ORA-1578 "ORACLE data block corrupted (file # %s, block # %s)" Master Note [ID 1578.1]
最後更新:2017-09-15 10:02:31