[原创]Bug 7331323导致数据库hang_MySQL, Oracle及数据库讨论区_Weblogic技术|Tuxedo技术|中间件技术|Oracle论坛|JAVA论坛|Linux/Unix技术|hadoop论坛_联动北方技术论坛  
网站首页 | 关于我们 | 服务中心 | 经验交流 | 公司荣誉 | 成功案例 | 合作伙伴 | 联系我们 |
联动北方-国内领先的云技术服务提供商
»  游客             当前位置:  论坛首页 »  自由讨论区 »  MySQL, Oracle及数据库讨论区 »
总帖数
2
每页帖数
101/1页1
返回列表
0
发起投票  发起投票 发新帖子
查看: 423 | 回复: 1   主题: [原创]Bug 7331323导致数据库hang        下一篇 
xiaogang.xu
注册用户
等级:上士
经验:251
发帖:13
精华:0
注册:1970-1-1
状态:离线
发送短消息息给xiaogang.xu 加好友    发送短消息息给xiaogang.xu 发消息
发表于: IP:您无权察看 2016-8-11 15:36:58 | [全部帖] [楼主帖] 楼主

一、问题简述  

某系统发生宕机事故,初步查看,主要由于数据库HANG住导致。

landingbj1实例,发生故障时段,alert_landingbj1.log有如下警告

Thu Apr 25 03:32:45 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:34:59 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:51:44 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:53:59 2013

landingbj1实例,发生故障时段,alert_landingbj2.log有如下告警

hu Apr 25 03:22:08 2013

IPC Send timeout detected. Receiver ospid 479472

Receiver is waiting for a latch dumping latch state for receiver -17196

Thu Apr 25 03:22:08 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_458982.trc:

Thu Apr 25 03:22:13 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lck0_479472.trc:

省略部分信息

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj2_diag_500206.trc

Thu Apr 25 04:05:24 2013

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=129

System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_524666.trc

Thu Apr 25 04:05:27 2013

省略部分信息

WARNING: inbound connection timed out (ORA-3136)

Fri Apr 26 03:10:19 2013

Global Enqueue Services Deadlock detected. More info in file

 /oracle/admin/landingbj/udump/landingbj2_ora_643428.trc.

Fri Apr 26 03:10:19 2013

省略部分信息

Fri Apr 26 03:59:06 2013

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=116

System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_717126.trc

Fri Apr 26 03:59:39 2013

 

二、问题分析

经过进一步分析,主要由于ORACLE BUG导致数据库HANG住引起系统宕机。

1、触发Bug 7331323

参考文档:

ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]

Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]

 

BUG很致命,从分析信息看,数据库确实触发该BUG

1alert_landingbj1.log日志里,landingbj1实例的启动信息有如下报错:

Tue Apr 23 19:05:48 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_323796.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3

ORA-27301: OS failure message: No such process

ORA-27302: failure occurred at: skgxpvfymmtu

ORA-27303: additional information: MTU could not be verified.  Did not receive valid message.

Tue Apr 23 19:05:49 2013

 

ORA-27300看,一般来说是某种OS资源不足导致的问题。

检查landingbj2_lmon_323796.trc文件,有如下报错

SKGXPVFYMMTU: select failed 3.

*** 2013-04-23 19:05:48.566

SKGXPVFYMMTU: Could not create context which satisfies the minimum

              MTU requirement of 32768 bytes

struct ksxpp * ksxppg_ [11022c700, 1104058a8) = 1104058a0

 

警告日志出现ORA-00603ORA-27504ORA-27300ORA-27301ORA-27302ORA-27303等报错和 [ID 7331323.8] 描述吻合。  

 

2)该BUG影响的版本吻合,数据库版本为10.2.0.4,使用是ASM

Applies to:

Oracle Database - Enterprise Edition - Version 10.2.0.4 to 11.1.0.6 [Release 10.2 to 11.1]
Information in this document applies to any platform.

Occurrence

This only affects Oracle Real Application Clusters and can be reported in ASM as well as database instances. The issue was introduced in Oracle 10.2.0.4 so earlier versions are not affected

 

3)发生问题症状相同,这是最重要一点

BUG通常表现如下3个症状

The symptoms can include:

- process failure

- startup failure

- processes spinning in function sskgxp_select with high CPU usage

 

landingbj1实例启动中曾发生lmon进程启动失败,见landingbj2_lmon_323796.trc

在发生故障时段的dump文件landingbj1_diag_507906.trclandingbj2_diag_500206.trc,均有大量下面信息,大量进程对sskgxp_select函数的调用,导致资源争用厉害,引起宕机,和和BUG 7331323描述吻合(processes spinning in function sskgxp_select with high CPU usage

 

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044BC<-sskgxp_select+0100<-skgxpiwait+0614<-skgxpwait+02c8<-ksxpwait+0538<-ksliwat+0654<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-ksarcv+00cc<-kjclwmg+001c<-kjfcln+0558<-ksbrdp+04b4<-opirip+03fc<-opidrv+0458<-sou2o+0090<-opimai_real+0150<-main+0098<-__start+0098

图片1.png 

 

4)根据处理故障人员描述,数据库出现故障时,主要是HANG住,shutdown immediate是无法停机数据库的,需要重启服务器。这个情况也和和 [ID 7331323.8] 描述吻合。

Workaround

There is no workaround available. However, if the instance fails to start, a reboot of the server supporting the instance will usually allow startup to succeed.

 

5)目前处理该问题没有更好办法,需打Patch 7331323

Patches

At the time of writing, patches were under development on top of 10.2.0.4 on some platforms. Please check Patch 7331323 for availability on your platform.

The problem is resolved in 10.2.0.5 and 11.1.0.7 patchsets.

 

2SGA_MAX_SIZE参数设置问题

从两个实例启动日志看,有大量下面警告

Specified value of sga_max_size is too small, bumping to 19897778176

很明显是由于SGA内所有内存组件值大于SGA_MAX_SIZE导致的问题。

 

三、问题建议

数据库宕机事故主要是触发Bug 7331323导致。

建议如下:

1Patch 7331323

2、调整sga_max_size

create pfile='文件名' from spfile;

alter system set sga_max_size=21474836480 scope=spfile;

 

参考文档

1ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]

2Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]

3Sskgpgetexecname Failed To Get Name Message Appears in Alert.log File [ID 604804.1]





赞(0)    操作        顶端 
联动小白
注册用户
等级:新兵
经验:61
发帖:0
精华:0
注册:2015-5-27
状态:离线
发送短消息息给联动小白 加好友    发送短消息息给联动小白 发消息
发表于: IP:您无权察看 2018-6-9 1:52:22 | [全部帖] [楼主帖] 2  楼

为了方便大家阅读,我对文章中命令来解释一下吧!

File-->as:File

CC-->at:CC

希望能对大家有帮助!^_^ By:坚持不懈的小白

-- 来自: 北京联动北方科技有限公司



赞(0)    操作        顶端 
总帖数
2
每页帖数
101/1页1
返回列表
发新帖子
请输入验证码: 点击刷新验证码
您需要登录后才可以回帖 登录 | 注册
技术讨论