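Problem description
An Oracle RAC database was running normally when node 1 went down on its own, without logging any obvious error.
Problem analysis
1. Check node 1's alert.log. Around the time the instance went down the log reads as follows, with no obvious error message: the entries stop after the log switch at 04:07:07 and resume with an instance startup at 04:56:26.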
Mon Dec 12 04:05:18 2011
Thread 1 advanced to log sequence 192016 (LGWR switch)
Current log# 3 seq# 192016 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:05:49 2011
Thread 1 advanced to log sequence 192017 (LGWR switch)
Current log# 1 seq# 192017 mem# 0: +ORADATA/testdb/onlinelog/group_1.259.651772415
Mon Dec 12 04:06:05 2011
Thread 1 advanced to log sequence 192018 (LGWR switch)
Current log# 2 seq# 192018 mem# 0: +ORADATA/testdb/onlinelog/group_2.260.651772417
Mon Dec 12 04:06:21 2011
Thread 1 advanced to log sequence 192019 (LGWR switch)
Current log# 3 seq# 192019 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:06:41 2011
Thread 1 advanced to log sequence 192020 (LGWR switch)
Current log# 1 seq# 192020 mem# 0: +ORADATA/testdb/onlinelog/group_1.259.651772415
Mon Dec 12 04:06:52 2011
Thread 1 advanced to log sequence 192021 (LGWR switch)
Current log# 2 seq# 192021 mem# 0: +ORADATA/testdb/onlinelog/group_2.260.651772417
Mon Dec 12 04:07:07 2011
Thread 1 advanced to log sequence 192022 (LGWR switch)
Current log# 3 seq# 192022 mem# 0: +ORADATA/testdb/onlinelog/group_3.261.651772417
Mon Dec 12 04:56:26 2011
Starting ORACLE instance (normal)
Mon Dec 12 04:56:26 2011
Specified value of sga_max_size is too small, bumping to 8388608000
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.5.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
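The silent gap in the excerpt above (last log switch at 04:07:07, next entry at 04:56:26) can also be located mechanically. The following is a minimal sketch, not part of the original analysis: the function name find_log_gaps and the 600-second threshold are illustrative assumptions, and for simplicity it assumes the examined window stays within one calendar month, since it derives seconds from the day-of-month and time fields only.

```shell
#!/bin/sh
# Hypothetical helper: print consecutive alert.log timestamps that are
# more than LIMIT seconds apart.
# Usage: find_log_gaps LOGFILE LIMIT
# Simplification (assumption): all timestamps fall in the same month.
find_log_gaps() {
    awk -v limit="$2" '
        # Timestamp lines look like: Mon Dec 12 04:05:18 2011
        NF == 5 && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $5 ~ /^[0-9][0-9][0-9][0-9]$/ {
            split($4, t, ":")
            now = $3 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev > limit)
                printf "%d s gap: %s -> %s\n", now - prev, prev_line, $0
            prev = now
            prev_line = $0
            seen = 1
        }
    ' "$1"
}
```

Run against node 1's alert log, for example find_log_gaps alert_testdb1.log 600 (the file name here is an assumption), this would report the window in which the instance was down.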
2. Check node 2's alert.log. While node 1 was down, it reads as follows:
Mon Dec 12 04:07:12 2011
Thread 2 advanced to log sequence 168007 (LGWR switch)
Current log# 6 seq# 168007 mem# 0: +ORADATA/testdb/onlinelog/group_6.270.651773375
Mon Dec 12 04:07:24 2011
Starting control autobackup
Mon Dec 12 04:09:42 2011
Control autobackup written to SBT_TAPE device
comment 'API Version 2.0,MMS Version 4.1.0.0',
media 'rman0000008'
handle 'c-2435100534-20111212-01'
Mon Dec 12 04:36:55 2011
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Dec 12 04:36:55 2011
LMS 0: 3 GCS shadows cancelled, 1 closed
Mon Dec 12 04:36:55 2011
LMS 3: 6 GCS shadows cancelled, 1 closed
Mon Dec 12 04:36:55 2011
LMS 1: 1 GCS shadows cancelled, 0 closed
Mon Dec 12 04:36:55 2011
LMS 2: 16 GCS shadows cancelled, 2 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Dec 12 04:36:56 2011
Instance recovery: looking for dead threads
Mon Dec 12 04:36:56 2011
Beginning instance recovery of 1 threads
Mon Dec 12 04:36:56 2011
LMS 1: 53727 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 0: 51921 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 3: 53092 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
LMS 2: 52837 GCS shadows traversed, 0 replayed
Mon Dec 12 04:36:56 2011
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Dec 12 04:36:57 2011
parallel recovery started with 15 processes
Mon Dec 12 04:36:57 2011
Started redo scan
Mon Dec 12 04:36:58 2011
Completed redo scan
11094 redo blocks read, 1289 data blocks need recovery
Mon Dec 12 04:36:58 2011
Errors in file /opt/app/oracle/admin/testdb/bdump/testdb2_p000_24747.trc:
ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536
Mon Dec 12 04:36:58 2011
Errors in file /opt/app/oracle/admin/testdb/bdump/testdb2_p007_24770.trc:
ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA
Linux-x86_64 Error: 4: Interrupted system call
Additional information: 3
Additional information: 128
Additional information: 65536
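Node 2 logged several ORA-27090 errors, and the trace file testdb2_p000_24747.trc contains the same error as node 2's alert log: Linux-x86_64 Error: 4: Interrupted system call. According to the Metalink note "ORA-27090: MESSAGE 27090 NOT FOUND; Linux-x86_64 Error: 4: Interrupted system call" [ID 579108.1],
this error is raised when the kernel parameter fs.aio-max-nr is set too low. The current values on the two systems are: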
[root@dbserver1 ~]# cat /proc/sys/fs/aio-max-nr
65536
[root@dbserver2 ~]# cat /proc/sys/fs/aio-max-nr
65536
Oracle's recommended value is:
fs.aio-max-nr = 3145728
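The comparison between the current and recommended values can be scripted so it is easy to repeat on both nodes. This is a minimal sketch assuming a Linux host with /proc mounted; the function name aio_max_nr_ok is illustrative and does not come from the Oracle note.

```shell
#!/bin/sh
# Value recommended in the note cited above.
RECOMMENDED_AIO_MAX_NR=3145728

# Hypothetical helper: succeed (exit 0) if the given aio-max-nr value
# is at least the recommended value.
aio_max_nr_ok() {
    [ "$1" -ge "$RECOMMENDED_AIO_MAX_NR" ]
}

# Report on the live system when the proc entry is readable.
if [ -r /proc/sys/fs/aio-max-nr ]; then
    live=$(cat /proc/sys/fs/aio-max-nr)
    if aio_max_nr_ok "$live"; then
        echo "fs.aio-max-nr=$live: OK"
    else
        echo "fs.aio-max-nr=$live: below $RECOMMENDED_AIO_MAX_NR"
    fi
fi
```

To raise the value, run sysctl -w fs.aio-max-nr=3145728 as root, and add the same setting to /etc/sysctl.conf so that it survives a reboot.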
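3. Check node 1's system messages log. It shows that the system time was adjusted that day (the timestamps jump from 12:38:47 back to 04:38:49), as shown below: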
Dec 12 12:38:47 dbserver1 sysctl: net.core.wmem_default = 262144
Dec 12 12:38:47 dbserver1 sysctl: net.core.wmem_max = 262144
Dec 12 12:38:47 dbserver1 sysctl: fs.file-max = 65536
Dec 12 12:38:47 dbserver1 rc.sysinit: Configuring kernel parameters: succeeded
Dec 12 04:38:49 dbserver1 date: Mon Dec 12 04:38:49 CST 2011
Dec 12 04:38:49 dbserver1 rc.sysinit: Setting clock (localtime): Mon Dec 12 04:38:49 CST 2011 succeeded
Dec 12 04:38:49 dbserver1 rc.sysinit: Loading default keymap succeeded
Dec 12 04:38:49 dbserver1 rc.sysinit: Setting hostname dbserver1: succeeded
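In Oracle 10.2.0.4 RAC, changing the system time on one node can trigger a split-brain condition, in which that node is evicted from the cluster and goes down.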
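4. Swap partition too small
Both nodes currently have 4G of swap. During peak business hours, too little swap space affects system stability, and in severe cases it can hang the system outright, leaving the database completely unresponsive.
Oracle's recommended value is RAM*0.75; both systems have 16G of memory, so the target is 12G.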
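The sizing rule above is simple enough to check with a few lines of shell. This is a minimal sketch assuming a Linux host that exposes /proc/meminfo; the helper names recommended_swap_mb and meminfo_kb are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: recommended swap in MB = RAM in MB * 0.75,
# the ratio quoted in the text (integer arithmetic only).
recommended_swap_mb() {
    echo $(( $1 * 3 / 4 ))
}

# Hypothetical helper: read a field (in kB) from /proc/meminfo,
# for example MemTotal or SwapTotal.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' /proc/meminfo
}

# Report on the live system when /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    ram_kb=$(meminfo_kb MemTotal)
    swap_kb=$(meminfo_kb SwapTotal)
    ram_mb=$(( ${ram_kb:-0} / 1024 ))
    swap_mb=$(( ${swap_kb:-0} / 1024 ))
    echo "RAM ${ram_mb} MB, swap ${swap_mb} MB, target $(recommended_swap_mb "$ram_mb") MB"
fi
```

For the 16G nodes described here, recommended_swap_mb 16384 gives 12288 MB, which is the 12G figure above.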
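Solution
Recommendations:
a. Adjust the parameters mentioned in items 2 and 4. Before making the changes, monitor swap usage on both nodes between 04:00 and 05:00. After the changes, monitor swap usage again over the same window.
b. Check the system time and the database time on both nodes at regular intervals to determine whether the two nodes drift apart over time. Also ask the administrators whether anyone has changed the system time.
In addition, recheck that clock synchronization is working properly.
c. Put the RAC network on its own dedicated route, so that work done by network engineers cannot affect RAC.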
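The periodic time comparison in item b can be reduced to a small check that runs from cron. This is a minimal sketch under stated assumptions: the helper names and the 2-second tolerance are illustrative, and collecting the timestamps over passwordless ssh between the nodes is an assumption about the environment.

```shell
#!/bin/sh
# Hypothetical helper: absolute difference between two epoch timestamps.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - $diff ))
    fi
    echo "$diff"
}

# Hypothetical helper: succeed if the skew between timestamps $1 and $2
# is at most $3 seconds.
skew_within() {
    [ "$(clock_skew_seconds "$1" "$2")" -le "$3" ]
}

# Example use (commented out; assumes passwordless ssh to both nodes):
# t1=$(ssh dbserver1 date +%s)
# t2=$(ssh dbserver2 date +%s)
# skew_within "$t1" "$t2" 2 || echo "clock skew exceeds 2 seconds"
```

Logging the result of each run with a timestamp makes it possible to see whether the two nodes drift apart gradually or jump at a particular moment.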
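Summary: Several things can cause split-brain in RAC, the network and system time among them. Once the changes are in place, have the DBA watch the system for a while to see whether the problem recurs.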