客户的10.2.0.4 RAC环境前几天出现了单节点DOWN机,帮客户诊断问题所在。
首先检查数据库的告警日志:
Fri Jun 17 18:32:17 2011
ospid 14177: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmon_14175.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms0_14179.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms3_14191.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms4_14195.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms6_14203.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms5_14199.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms2_14187.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms7_14207.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms1_14183.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmd0_14177.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_asmb_14255.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_diag_14171.trc
可以看到,告警日志中出现了很多ORA-29702错误,导致这个错误的原因显然是CLUSTER层出现了问题,再看最后出现的ORA-15064和ORA-3113错误,说明ASM实例也出现了问题,导致数据库实例崩溃。
此外,错误一开始的时候出现了网络地址不可用的问题,以前偶尔在其他RAC环境的告警日志中也看到过类似的错误,一般是出现了这个错误后,1分钟后再次检测的时候,网络地址就恢复了正常,而类似这种单独出现的很少见。问题是否会与这个地址告警有关还不好说,不过既然现有错误提示和ASM以及CLUSTER有关系,那么首先检查ASM上的错误日志:
Fri Jun 17 18:32:08 2011
ospid 13967: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/+ASM/bdump/+asm2_lmon_13964.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/+ASM/bdump/+asm2_diag_13955.trc
Fri Jun 17 18:32:24 2011
Trace dumping is performing id=[cdmp_20110617183224]
ASM的错误日志并无新奇之处,错误和数据库实例相仿,包含了网络地址不可用的信息,以及ORA-29702错误,ASM的问题是CLUSTER上的错误导致的。
继续检查CLUSTER日志:
2011-06-17 18:31:52.706
[cssd(12435)]CRS-1612:node tgpsndb1 (1) at 50% heartbeat fatal, eviction in 29.094 seconds
2011-06-17 18:32:07.706
[cssd(12435)]CRS-1611:node tgpsndb1 (1) at 75% heartbeat fatal, eviction in 14.094 seconds
2011-06-17 18:32:16.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 5.084 seconds
2011-06-17 18:32:17.707
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 4.094 seconds
2011-06-17 18:32:18.709
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 3.094 seconds
2011-06-17 18:32:19.711
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 2.094 seconds
2011-06-17 18:32:20.713
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 1.094 seconds
2011-06-17 18:32:21.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 0.094 seconds
[cssd(12435)]Invalid msg id [1609]
2011-06-17 18:36:21.314
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
很显然,这时由于网络心跳超时导致的当前节点被踢出CLUSTER。
那么造成网络超时的原因是否和前面实例以及ASM错误日志中的IP地址不可用有关呢,继续检查操作系统上的日志:
Jun 17 18:31:23 tgpsndb2 kernel: igb: eth1 NIC Link is Down
Jun 17 18:31:23 tgpsndb2 kernel: igb: eth0 NIC Link is Down
Jun 17 18:32:22 tgpsndb2 logger: Oracle CSSD failure 134.
Jun 17 18:32:22 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:22 tgpsndb2 kernel: SysRq : Resetting
Jun 17 18:32:22 tgpsndb2 logger: Oracle clsomon failed with fatal status 12.
Jun 17 18:32:23 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:23 tgpsndb2 logger: Oracle init script ceding reboot to sibling 11929.
Jun 17 18:32:24 tgpsndb2 logger: Cluster Ready Services completed waiting on dependencies.
在错误发生时刻,操作系统上两个网卡都DOWN掉了,而此后就是CLUSTER的错误以及准备重启的信息,显然这次实例自动重启的真正原因就在于此。