[原创]Bug 7331323导致数据库hang_MySQL, Oracle及数据库讨论区_Weblogic技术|Tuxedo技术|中间件技术|Oracle论坛|JAVA论坛|Linux/Unix技术|hadoop论坛

总帖数

每页帖数

1/1页

返回列表

发起投票

查看: 3067 | 回复: 0

主题： [原创]Bug 7331323导致数据库hang

xiaogang.xu

注册用户

等级：上士
经验：251
发帖：13
精华：0
注册：1970-1-1
状态：离线
发送短消息息给xiaogang.xu

加好友发送短消息息给xiaogang.xu

发消息

发表于：

2016-8-11 15:36:58 | [全部帖] [楼主帖]

楼主

一、问题简述

某系统发生宕机事故，初步查看，主要由于数据库HANG住导致。

在landingbj1实例，发生故障时段，alert_landingbj1.log有如下警告

Thu Apr 25 03:32:45 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:34:59 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:51:44 2013

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj1_diag_507906.trc

Thu Apr 25 03:53:59 2013

在landingbj1实例，发生故障时段，alert_landingbj2.log有如下告警

hu Apr 25 03:22:08 2013

IPC Send timeout detected. Receiver ospid 479472

Receiver is waiting for a latch dumping latch state for receiver -17196

Thu Apr 25 03:22:08 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_458982.trc:

Thu Apr 25 03:22:13 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lck0_479472.trc:

省略部分信息

System State dumped to trace file /oracle/admin/landingbj/bdump/landingbj2_diag_500206.trc

Thu Apr 25 04:05:24 2013

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=129

System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_524666.trc

Thu Apr 25 04:05:27 2013

省略部分信息

WARNING: inbound connection timed out (ORA-3136)

Fri Apr 26 03:10:19 2013

Global Enqueue Services Deadlock detected. More info in file

/oracle/admin/landingbj/udump/landingbj2_ora_643428.trc.

Fri Apr 26 03:10:19 2013

省略部分信息

Fri Apr 26 03:59:06 2013

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=116

System State dumped to trace file /oracle/admin/landingbj/udump/landingbj2_ora_717126.trc

Fri Apr 26 03:59:39 2013

二、问题分析

经过进一步分析，主要由于ORACLE BUG导致数据库HANG住引起系统宕机。

1、触发Bug 7331323

参考文档：

ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]

Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]

该BUG很致命，从分析信息看，数据库确实触发该BUG。

1）alert_landingbj1.log日志里，landingbj1实例的启动信息有如下报错：

Tue Apr 23 19:05:48 2013

Errors in file /oracle/admin/landingbj/bdump/landingbj2_lmon_323796.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3

ORA-27301: OS failure message: No such process

ORA-27302: failure occurred at: skgxpvfymmtu

ORA-27303: additional information: MTU could not be verified. Did not receive valid message.

Tue Apr 23 19:05:49 2013

从ORA-27300看，一般来说是某种OS资源不足导致的问题。

检查landingbj2_lmon_323796.trc文件，有如下报错

SKGXPVFYMMTU: select failed 3.

*** 2013-04-23 19:05:48.566

SKGXPVFYMMTU: Could not create context which satisfies the minimum

MTU requirement of 32768 bytes

struct ksxpp * ksxppg_ [11022c700, 1104058a8) = 1104058a0

警告日志出现ORA-00603、ORA-27504、ORA-27300、ORA-27301、ORA-27302、ORA-27303等报错和 [ID 7331323.8] 描述吻合。

2）该BUG影响的版本吻合，此数据库版本为10.2.0.4，使用是ASM。

Applies to:

Oracle Database - Enterprise Edition - Version 10.2.0.4 to 11.1.0.6 [Release 10.2 to 11.1]
Information in this document applies to any platform.

Occurrence

This only affects Oracle Real Application Clusters and can be reported in ASM as well as database instances. The issue was introduced in Oracle 10.2.0.4 so earlier versions are not affected

3）发生问题症状相同，这是最重要一点

该BUG通常表现如下3个症状

The symptoms can include:

- process failure

- startup failure

- processes spinning in function sskgxp_select with high CPU usage

landingbj1实例启动中曾发生lmon进程启动失败，见landingbj2_lmon_323796.trc。

在发生故障时段的dump文件landingbj1_diag_507906.trc和landingbj2_diag_500206.trc，均有大量下面信息，大量进程对sskgxp_select函数的调用，导致资源争用厉害，引起宕机，和和BUG 7331323描述吻合（processes spinning in function sskgxp_select with high CPU usage）

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044BC<-sskgxp_select+0100<-skgxpiwait+0614<-skgxpwait+02c8<-ksxpwait+0538<-ksliwat+0654<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-ksarcv+00cc<-kjclwmg+001c<-kjfcln+0558<-ksbrdp+04b4<-opirip+03fc<-opidrv+0458<-sou2o+0090<-opimai_real+0150<-main+0098<-__start+0098

图片1.png

4）根据处理故障人员描述，数据库出现故障时，主要是HANG住，shutdown immediate是无法停机数据库的，需要重启服务器。这个情况也和和 [ID 7331323.8] 描述吻合。

Workaround

There is no workaround available. However, if the instance fails to start, a reboot of the server supporting the instance will usually allow startup to succeed.

5）目前处理该问题没有更好办法，需打Patch 7331323。

Patches

At the time of writing, patches were under development on top of 10.2.0.4 on some platforms. Please check Patch 7331323 for availability on your platform.

The problem is resolved in 10.2.0.5 and 11.1.0.7 patchsets.

2、SGA_MAX_SIZE参数设置问题

从两个实例启动日志看，有大量下面警告

Specified value of sga_max_size is too small, bumping to 19897778176

很明显是由于SGA内所有内存组件值大于SGA_MAX_SIZE导致的问题。

三、问题建议

数据库宕机事故主要是触发Bug 7331323导致。

建议如下：

1、打Patch 7331323

2、调整sga_max_size值

create pfile='文件名' from spfile;

alter system set sga_max_size=21474836480 scope=spfile;

参考文档

1、ORA-27302 Failure Occurred at skgxpvfymmtu Signalled in the Alert.log [ID 746888.1]

2、Bug 7331323 - ORA-27302 failure occurred at skgxpvfymmtu in RAC [ID 7331323.8]

3、Sskgpgetexecname Failed To Get Name Message Appears in Alert.log File [ID 604804.1]

本版精华
热门帖子

操作引用/回复

总帖数

每页帖数

1/1页

返回列表

用户登录

Weblogic中间件技术论坛

Tuxedo中间件技术论坛

数据库论坛

Java论坛

Linux/unix论坛

网站地图