昨天,一个客户的数据库系统出现故障,RAC无法启动,大量的错误信息,经过分析检查,最后我们通过强制手段打开数据库,帮助用户挽回了数据损失。
故障的起因应该是主机和存储之间出现连接故障,message信息里可以看到大量如下提示:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931135
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun 1 19:57:18 Oracle-02 kernel:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931647
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun 1 19:57:18 Oracle-02 kernel:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12932671
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
一旦出现IO写失败或者写丢失,则数据库将遭受数据损失,一旦UNDO和REDO损坏,数据库就将会无法启动。
在数据库的告警日志层面,首先出现的错误是如下类型:
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1 allocation unit 102 reason error; if possible,will try another mirror side
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1 allocation unit 102 reason error; if possible,will try another mirror side
这些错误首先提示REDO写操作失败,读写AU单位失败,这类错误信息教Oracle 11gR2之前详细了很多,精确到了AU单位及Offset。
在启动过程中出现了ORA-00600错误:
ARC0: STARTING ARCH PROCESSES COMPLETE
Redo thread 2 internally disabled at seq 1 (CKPT)
ARC2: Archiving disabled thread 2 sequence 1
Archived Log entry 785 added for thread 2 sequence 1 ID 0x0 dest 1:
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP2/trace/SPSP2_smon_14335.trc (incident=291384):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/spsp/SPSP2/incident/incdir_291384/SPSP2_smon_14335_i291384.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
注意Oracle 11gR2的提示,非常详尽,居然提示你通过My Oracle Support去参考 Note 411.1 学习ADRCI知识。
ORA-600的 4193 和 4194 错误,可以通过重建回滚表空间解决。
此处需要提醒的是,Oracle 11g的缺省UNDO段命名,增加了一个Unix Time的时间戳在回滚段名称里,如下所示:
SQL> select * from v$rollname;
USN NAME
---------- ------------------------------
0 SYSTEM
31 _SYSSMU31_1012201531$
32 _SYSSMU32_3505455792$
33 _SYSSMU33_760762808$
34 _SYSSMU34_2022544229$
35 _SYSSMU35_774751923$
36 _SYSSMU36_3132287946$
37 _SYSSMU37_1634453262$
38 _SYSSMU38_1341527290$
39 _SYSSMU39_2970077311$
40 _SYSSMU40_1403026629$
在设置初始化参数时大致设置如下:
_allow_resetlogs_corruption= TRUE
_offline_rollback_segments= "_SYSSMU11_1202330240$"
_offline_rollback_segments= "_SYSSMU12_1617713323$"
_offline_rollback_segments= "_SYSSMU13_1359816937$"
_offline_rollback_segments= "_SYSSMU14_2078039711$"
_offline_rollback_segments= "_SYSSMU15_1538259293$"
_offline_rollback_segments= "_SYSSMU16_3126366281$"
_offline_rollback_segments= "_SYSSMU17_2553547900$"
_offline_rollback_segments= "_SYSSMU18_1481844821$"
_offline_rollback_segments= "_SYSSMU19_135756661$"
_offline_rollback_segments= "_SYSSMU20_2322289537$"
_corrupted_rollback_segments= "_SYSSMU11_1202330240$"
_corrupted_rollback_segments= "_SYSSMU12_1617713323$"
_corrupted_rollback_segments= "_SYSSMU13_1359816937$"
_corrupted_rollback_segments= "_SYSSMU14_2078039711$"
_corrupted_rollback_segments= "_SYSSMU15_1538259293$"
_corrupted_rollback_segments= "_SYSSMU16_3126366281$"
_corrupted_rollback_segments= "_SYSSMU17_2553547900$"
_corrupted_rollback_segments= "_SYSSMU18_1481844821$"
_corrupted_rollback_segments= "_SYSSMU19_135756661$"
_corrupted_rollback_segments= "_SYSSMU20_2322289537$"
屏蔽了事务的回滚与重做之后,数据库被成功打开。
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE 11.2.0.2.0 Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
数据库风险随时可能发生,备份必不可少啊!