Subject: [Original] Diagnosing an Oracle RAC node that goes down on its own

yang.liu


Posted: 2014-3-27 17:35:58

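Problem description

While an Oracle RAC database was running normally, node 1 went down on its own, with no obvious error in the logs.

Problem analysis

1. Check node 1's alert.log. The entries just before and after the database went down are shown below; there is no obvious error message.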

Mon Dec 12 04:05:18 2011
Thread 1 advanced to log sequence 192016 (LGWR switch)
Current log# 3 seq# 192016 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:05:49 2011
Thread 1 advanced to log sequence 192017 (LGWR switch)
Current log# 1 seq# 192017 mem# 0: +ORADATA/testdb/onlinelog/group_1.259.651772415
Mon Dec 12 04:06:05 2011
Thread 1 advanced to log sequence 192018 (LGWR switch)
Current log# 2 seq# 192018 mem# 0: +ORADATA/testdb/onlinelog/group_2.260.651772417
Mon Dec 12 04:06:21 2011
Thread 1 advanced to log sequence 192019 (LGWR switch)
Current log# 3 seq# 192019 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:06:41 2011
Thread 1 advanced to log sequence 192020 (LGWR switch)
Current log# 1 seq# 192020 mem# 0: +ORADATA/testdb/onlinelog/group_1.259.651772415
Mon Dec 12 04:06:52 2011
Thread 1 advanced to log sequence 192021 (LGWR switch)
Current log# 2 seq# 192021 mem# 0: +ORADATA/testdb/onlinelog/group_2.260.651772417
Mon Dec 12 04:07:07 2011
Thread 1 advanced to log sequence 192022 (LGWR switch)
Current log# 3 seq# 192022 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:56:26 2011
Starting ORACLE instance (normal)
Mon Dec 12 04:56:26 2011
Specified value of sga_max_size is too small, bumping to 8388608000
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.5.0 configured from OCR for use as a public interface

Picked latch-free SCN scheme 3

Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:

2. Check node 2's alert.log. The entries for the period when node 1 was down are as follows:

Mon Dec 12 04:07:12 2011
Thread 2 advanced to log sequence 168007 (LGWR switch)
Current log# 6 seq# 168007 mem# 0: +ORADATA/testdb/onlinelog/group_6.270.651773375
Mon Dec 12 04:07:24 2011
Starting control autobackup
Mon Dec 12 04:09:42 2011
Control autobackup written to SBT_TAPE device
comment 'API Version 2.0,MMS Version 4.1.0.0',
media 'rman0000008'
handle 'c-2435100534-20111212-01'
Mon Dec 12 04:36:55 2011
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Dec 12 04:36:55 2011
LMS 0: 3 GCS shadows cancelled, 1 closed
Mon Dec 12 04:36:55 2011
LMS 3: 6 GCS shadows cancelled, 1 closed
Mon Dec 12 04:36:55 2011
LMS 1: 1 GCS shadows cancelled, 0 closed
Mon Dec 12 04:36:55 2011
LMS 2: 16 GCS shadows cancelled, 2 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Dec 12 04:36:56 2011
Instance recovery: looking for dead threads
Mon Dec 12 04:36:56 2011
Beginning instance recovery of 1 threads
Mon Dec 12 04:36:56 2011
LMS 1: 53727 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 0: 51921 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 3: 53092 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 2: 52837 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Dec 12 04:36:57 2011
parallel recovery started with 15 processes
Mon Dec 12 04:36:57 2011
Started redo scan
Mon Dec 12 04:36:58 2011
Completed redo scan
11094 redo blocks read, 1289 data blocks need recovery
Mon Dec 12 04:36:58 2011
Errors in file /opt/app/oracle/admin/testdb/bdump/testdb2_p000_24747.trc:
ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536
Mon Dec 12 04:36:58 2011
Errors in file /opt/app/oracle/admin/testdb/bdump/testdb2_p007_24770.trc:
ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536

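Node 2 logged multiple ORA-27090 errors, and the contents of the trace file testdb2_p000_24747.trc match the errors in node 2's alert log: all of them report Linux-x86_64 Error: 4: Interrupted system call. According to the Metalink note "ORA-27090: MESSAGE 27090 NOT FOUND; Linux-x86_64 Error: 4: Interrupted system call" [ID 579108.1], this error occurs when the fs.aio-max-nr parameter is set too low. The current values on the two systems are: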

[root@dbserver1 ~]# cat /proc/sys/fs/aio-max-nr
65536
[root@dbserver2 ~]# cat /proc/sys/fs/aio-max-nr
65536

The value recommended by Oracle is:

fs.aio-max-nr = 3145728
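The comparison above can be scripted so it is easy to repeat on each node. This is only a minimal sketch: `check_aio_max_nr` is a hypothetical helper name, not an Oracle or Linux tool, and the target value is the one quoted in this post.

```shell
#!/bin/sh
# Value quoted in this post as Oracle's recommendation.
RECOMMENDED_AIO_MAX_NR=3145728

# check_aio_max_nr CURRENT [RECOMMENDED]
# Hypothetical helper: prints "OK" or "TOO LOW" for the given value.
check_aio_max_nr() {
    current=$1
    recommended=${2:-$RECOMMENDED_AIO_MAX_NR}
    if [ "$current" -lt "$recommended" ]; then
        echo "TOO LOW: fs.aio-max-nr=$current, recommended $recommended"
    else
        echo "OK: fs.aio-max-nr=$current"
    fi
}

# Check the local node when the kernel exposes the setting.
if [ -r /proc/sys/fs/aio-max-nr ]; then
    check_aio_max_nr "$(cat /proc/sys/fs/aio-max-nr)"
fi

# To apply the new value (as root), something like:
#   sysctl -w fs.aio-max-nr=3145728                      # running kernel
#   echo "fs.aio-max-nr = 3145728" >> /etc/sysctl.conf   # persist across reboots
```

With the 65536 seen on dbserver1 and dbserver2, the helper reports the value as too low. /proc/sys/fs/aio-nr shows how many async I/O requests are currently allocated, which helps judge how close a node is to the limit.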

3. Check node 1's system messages log. It shows that the system time was adjusted that day, as shown below:

Dec 12 12:38:47 dbserver1 sysctl: net.core.wmem_default = 262144
Dec 12 12:38:47 dbserver1 sysctl: net.core.wmem_max = 262144
Dec 12 12:38:47 dbserver1 sysctl: fs.file-max = 65536
Dec 12 12:38:47 dbserver1 rc.sysinit: Configuring kernel parameters:  succeeded
Dec 12 04:38:49 dbserver1 date: Mon Dec 12 04:38:49 CST 2011
Dec 12 04:38:49 dbserver1 rc.sysinit: Setting clock  (localtime): Mon Dec 12 04:38:49 CST 2011 succeeded
Dec 12 04:38:49 dbserver1 rc.sysinit: Loading default keymap succeeded
Dec 12 04:38:49 dbserver1 rc.sysinit: Setting hostname dbserver1:  succeeded

In an Oracle 10.2.0.4 RAC, changing the time on one node can cause a split-brain, after which the node is evicted and goes down.
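Spotting this kind of jump by eye in a long messages file is tedious, so here is a rough sketch of how it could be automated. `find_backward_jumps` is a hypothetical helper; it assumes the classic syslog prefix ("Dec 12 12:38:47 host ...") and only compares consecutive lines with the same month and day, since these timestamps carry no year.

```shell
#!/bin/sh
# find_backward_jumps FILE
# Hypothetical helper: prints every line whose time-of-day is earlier than
# that of the previous line on the same month/day.
find_backward_jumps() {
    awk '{
        split($3, t, ":")                         # $3 is HH:MM:SS
        secs = t[1] * 3600 + t[2] * 60 + t[3]     # seconds since midnight
        day = $1 " " $2                           # e.g. "Dec 12"
        if (day == prev_day && secs < prev_secs) { print "backward jump at line " NR ": " $0 }
        prev_day = day
        prev_secs = secs
    }' "$1"
}

# Usage (the path is the usual location on this kind of system, adjust as needed):
#   find_backward_jumps /var/log/messages
```

On the excerpt above it would flag the first 04:38:49 line, which follows the 12:38:47 entries. A hit only shows that the logged clock moved backwards; the reason still has to be confirmed separately.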

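4. The swap partition is too small

Both nodes currently have 4 GB of swap. During peak load, too little swap space affects system stability; in severe cases the system hangs outright and the database stops responding.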


Oracle's official recommendation is RAM*0.75 (both systems have 16 GB of memory), i.e. 12 GB.
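As a quick arithmetic check of the figure above, the following sketch reads RAM and swap from /proc/meminfo and applies the 0.75 ratio quoted in this post. Both function names are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# recommended_swap_kb RAM_KB
# Applies the RAM * 0.75 rule quoted above, using integer arithmetic.
recommended_swap_kb() {
    echo $(( $1 * 3 / 4 ))
}

# meminfo_kb FIELD [FILE]
# Prints the kB value of one field (e.g. MemTotal, SwapTotal) from /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Report the local node when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    ram_kb=$(meminfo_kb MemTotal)
    swap_kb=$(meminfo_kb SwapTotal)
    echo "RAM: ${ram_kb} kB, swap: ${swap_kb} kB, suggested swap: $(recommended_swap_kb "$ram_kb") kB"
fi
```

For 16 GB of RAM (16777216 kB) this gives 12582912 kB, which is the 12 GB mentioned above. MemTotal is usually slightly below the installed RAM, so the suggested figure on a real node will come out a little under 12 GB.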


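Solution

Recommendations:

a. Adjust the settings mentioned in items 2 and 4 (fs.aio-max-nr and the swap size). Before making the changes, monitor swap usage on both nodes between 4 a.m. and 5 a.m. After the changes, monitor again over the same time window.

b. Check the system time and the database time on both nodes at regular intervals, to see whether the two nodes drift apart over time. Also ask the administrators whether anyone has changed the system time.

In addition, re-check that clock synchronization is working properly.

c. Put the RAC network on its own dedicated route, so that work done by network engineers does not affect RAC.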
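Recommendations a and b could be covered by one small sampling script run from cron. The sketch below is only an illustration: the helper names, the script path, and the log path are hypothetical, and it assumes passwordless ssh between dbserver1 and dbserver2, which this post does not state.

```shell
#!/bin/sh
# swap_used_kb [FILE]
# Swap in use (kB), computed as SwapTotal - SwapFree from /proc/meminfo.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 } /^SwapFree:/ { free = $2 } END { print total - free }' "${1:-/proc/meminfo}"
}

# clock_skew_seconds EPOCH_A EPOCH_B
# Absolute difference between two Unix timestamps, in seconds.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# One sample line (assumes passwordless ssh between the nodes):
#   t1=$(ssh dbserver1 date +%s)
#   t2=$(ssh dbserver2 date +%s)
#   echo "$(date '+%F %T') skew=$(clock_skew_seconds "$t1" "$t2")s swap_used=$(swap_used_kb)kB"
#
# Hypothetical crontab entry, sampling every minute from 04:00 to 04:59:
#   * 4 * * * /path/to/rac_monitor.sh >> /tmp/rac_monitor.log 2>&1
```

Because the two `date` calls run one after the other over ssh, the measured skew includes network and login latency, so differences of a second or so are not meaningful. Running the same schedule before and after the changes keeps the two sets of samples comparable.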

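Summary: Many things can cause split-brain in RAC; pay particular attention to the network and to system time. Now that the changes have been made, have the DBA keep an eye on the system for a while to see whether the problem recurs.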
