一、问题简述
某系统发生宕机事故,初步查看,主要由于数据库HANG住导致。
在landingbj1实例,发生故障时段,alert_landingbj1.log有如下警告
Thu Apr 25 03:32:45 2013
System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc
Thu Apr 25 03:34:59 2013
System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc
Thu Apr 25 03:51:44 2013
System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc
Thu Apr 25 03:53:59 2013
在landingbj1实例,发生故障时段,alert_landingbj2.log有如下告警
hu Apr 25 03:22:08 2013
IPC Send timeout detected. Receiver ospid 479472
Receiver is waiting for a latch dumping latch state for receiver -17196
Thu Apr 25 03:22:08 2013
Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_458982.trc:
Thu Apr 25 03:22:13 2013
Errors in file /oracle/admin/landingbj/bdump/landingbj2_lck0_479472.trc:
省略部分信息
System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj2_diag_500206.trc
Thu Apr 25 04:05:24 2013
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=129
System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_524666.trc
Thu Apr 25 04:05:27 2013
省略部分信息
WARNING: inbound connection timed out (ORA-3136)
Fri Apr 26 03:10:19 2013
Global Enqueue Services Deadlock detected. More info in file
/oracle/admin/landingbj/udump/landingbj2_ora_643428.trc.
Fri Apr 26 03:10:19 2013
省略部分信息
Fri Apr 26 03:59:06 2013
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=116
System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_717126.trc
Fri Apr 26 03:59:39 2013
二、问题分析
经过进一步分析,主要由于ORACLE BUG导致数据库HANG住引起系统宕机。
1、触发Bug 7331323
参考文档:
ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]
Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]
该BUG很致命,从分析信息看,数据库确实触发该BUG。
1)alert_landingbj1.log日志里,landingbj1实例的启动信息有如下报错:
Tue Apr 23 19:05:48 2013
Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_323796.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Tue Apr 23 19:05:49 2013
从ORA-27300看,一般来说是某种OS资源不足导致的问题。
检查landingbj2_lmon_323796.trc文件,有如下报错
SKGXPVFYMMTU: select failed 3.
*** 2013-04-23 19:05:48.566
SKGXPVFYMMTU: Could not create context which satisfies the minimum
MTU requirement of 32768 bytes
struct ksxpp * ksxppg_ [11022c700, 1104058a8) = 1104058a0
警告日志出现ORA-00603、ORA-27504、ORA-27300、ORA-27301、ORA-27302、ORA-27303等报错和 [ID 7331323.8] 描述吻合。
2)该BUG影响的版本吻合,此数据库版本为10.2.0.4,使用是ASM。
Applies to:
Oracle Database - Enterprise Edition - Version 10.2.0.4 to 11.1.0.6 [Release 10.2 to 11.1]
Information in this document applies to any platform.
Occurrence
This only affects Oracle Real Application Clusters and can be reported in ASM as well as database instances. The issue was introduced in Oracle 10.2.0.4 so earlier versions are not affected
3)发生问题症状相同,这是最重要一点
该BUG通常表现如下3个症状
The symptoms can include:
- process failure
- startup failure
- processes spinning in function sskgxp_select with high CPU usage
landingbj1实例启动中曾发生lmon进程启动失败,见landingbj2_lmon_323796.trc。
在发生故障时段的dump文件landingbj1_diag_507906.trc和landingbj2_diag_500206.trc,均有大量下面信息,大量进程对sskgxp_select函数的调用,导致资源争用厉害,引起宕机,和和BUG 7331323描述吻合(processes spinning in function sskgxp_select with high CPU usage)
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044BC<-sskgxp_select+0100<-skgxpiwait+0614<-skgxpwait+02c8<-ksxpwait+0538<-ksliwat+0654<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-ksarcv+00cc<-kjclwmg+001c<-kjfcln+0558<-ksbrdp+04b4<-opirip+03fc<-opidrv+0458<-sou2o+0090<-opimai_real+0150<-main+0098<-__start+0098
4)根据处理故障人员描述,数据库出现故障时,主要是HANG住,shutdown immediate是无法停机数据库的,需要重启服务器。这个情况也和和 [ID 7331323.8] 描述吻合。
Workaround
There is no workaround available. However, if the instance fails to start, a reboot of the server supporting the instance will usually allow startup to succeed.
5)目前处理该问题没有更好办法,需打Patch 7331323。
Patches
At the time of writing, patches were under development on top of 10.2.0.4 on some platforms. Please check Patch 7331323 for availability on your platform.
The problem is resolved in 10.2.0.5 and 11.1.0.7 patchsets.
2、SGA_MAX_SIZE参数设置问题
从两个实例启动日志看,有大量下面警告
Specified value of sga_max_size is too small, bumping to 19897778176
很明显是由于SGA内所有内存组件值大于SGA_MAX_SIZE导致的问题。
三、问题建议
数据库宕机事故主要是触发Bug 7331323导致。
建议如下:
1、打Patch 7331323
2、调整sga_max_size值
create pfile='文件名' from spfile;
alter system set sga_max_size=21474836480 scope=spfile;
参考文档
1、ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]
2、Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]
3、Sskgpgetexecname Failed To Get Name Message Appears in Alert.log File [ID 604804.1]