Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
Applies to:
Oracle Server - Standard Edition - Version: 10.1.0.5 to 11.2.0.1 - Release: 10.1 to 11.2
Oracle Server - Enterprise Edition - Version: 10.1.0.5 to 11.2.0.1.0 [Release: 10.1 to 11.2]
Linux x86
HP-UX PA-RISC (64-bit)
IBM AIX on POWER Systems (64-bit)
Oracle Solaris on SPARC (64-bit)
HP-UX Itanium
Red Hat Enterprise Linux Advanced Server x86-64 (AMD Opteron Architecture)
Red Hat Enterprise Linux Advanced Server Itanium
Oracle Solaris on x86-64 (64-bit)
Linux x86-64
UnitedLinux Itanium
Oracle Server Enterprise Edition - Version: 10.1.0.5 to 11.1.0.7
Oracle Clusterware
Symptoms
Oracle Clusterware evicts a node from the cluster when any of the following occurs:
Node is not pinging via the network heartbeat
Node is not pinging the Voting disk
Node is hung/busy and is unable to perform either of the earlier tasks
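In most cases when a node is evicted, information is written to the log files that can be used to analyze the cause of the eviction. In some cases, however, this information is missing. The steps documented in this note are intended for Clusterware versions earlier than 11gR2 (11.2.0.1) where there is too little information, or none at all, to diagnose the cause of the eviction.
Starting with 11.2.0.1, customers do not need to set diagwait, because the architecture has changed.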
Changes
None
Cause
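When a node is evicted while it is very busy in terms of CPU, the OS may not get time to flush the logs and traces to the file system. Setting the diagwait attribute delays the node reboot, which gives the operating system additional time to write the trace files. This setting safely provides more time to collect diagnostic data and does not increase the probability of errors. After diagwait is set, the Clusterware waits an additional 10 seconds (Diagwait - reboottime). Once customers have fixed their OS scheduling issues, they can unset diagwait by following the steps in this document.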
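The extra wait mentioned above is plain subtraction, and a small sketch may make it easier to reason about other values. The function name `extra_wait` is hypothetical, and the reboottime of 3 seconds is an assumption inferred from the figures in this note (13 - 10), not a value the note states.

```shell
# Hypothetical helper: extra seconds the Clusterware waits before the reboot
# once diagwait is set, computed as diagwait - reboottime.
extra_wait() {
    diagwait=$1
    reboottime=${2:-3}   # assumption: 3 seconds, inferred from 13 - 10
    echo $((diagwait - reboottime))
}

extra_wait 13   # prints 10
```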
@For internal Support Staff
The diagwait attribute was introduced in 10.2.0.3 and is included in 10.2.0.4, 11.1.0.6 and higher releases. It has also subsequently been backported to 10.1.0.5 on most platforms. This means diagwait can be set on 10.1.0.5 (or higher), 10.2.0.3 (or higher) and 11.1.0.6 (or higher). If the command crsctl set/get css diagwait reports "unrecognized parameter diagwait specified", it is safe to assume that the Clusterware version does not have the fixes needed to implement diagwait. In that case, the customer is advised to apply the latest available patchset before attempting to set diagwait.
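The release list above can be turned into a quick pre-check. This is a minimal sketch, not an Oracle-supplied tool: the function name `diagwait_support` and its four result strings are hypothetical, and it takes the version string as an argument rather than querying the cluster. Because the 10.1.0.5 backport exists only on most platforms, the error message from crsctl remains the authoritative test.

```shell
# Hypothetical helper: classify a Clusterware version against the release
# list in this note. Prints one of:
#   supported    diagwait can be set
#   patch-first  apply the latest patchset before trying to set diagwait
#   not-needed   11.2 releases, where diagwait does not need to be set
#   unknown      version outside the range this note covers
diagwait_support() {
    # Split a dotted version such as 10.2.0.4 into its first four fields.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    major=${1:-0} minor=${2:-0} update=${3:-0} patch=${4:-0}

    case "$major.$minor" in
        10.1) min_patch=5 ;;   # backported to 10.1.0.5 on most platforms
        10.2) min_patch=3 ;;
        11.1) min_patch=6 ;;
        11.2) echo "not-needed"; return 0 ;;
        *)    echo "unknown"; return 0 ;;
    esac

    # Compare the third and fourth fields against the minimum, e.g. 0.4 >= 0.3.
    if [ $((update * 1000 + patch)) -ge "$min_patch" ]; then
        echo "supported"
    else
        echo "patch-first"
    fi
}

diagwait_support 10.2.0.4
```

A result of patch-first corresponds to the case where crsctl would report "unrecognized parameter diagwait specified".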
Solution
It is important that the clusterware stack is down on all nodes when diagwait is changed. The following steps describe how to set diagwait.
1. Execute as root on each node
#crsctl stop crs
#<CRS_HOME>/bin/oprocd stop
2. Ensure that Clusterware stack is down on all nodes by executing
#ps -ef | egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no Clusterware processes (the egrep command itself may appear in the output and can be ignored). If Clusterware processes are still running and you proceed to the next step, you will corrupt your OCR. Do not continue until the Clusterware processes are down on all nodes of the cluster.
3. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force
4. Check that diagwait was set successfully by executing the following command. The command should return 13. If diagwait is not set, the following message is returned: "Configuration parameter diagwait is not defined"
#crsctl get css diagwait
5. Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs
6. Validate that the node is running by executing:
#crsctl check crs
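The two checks in the steps above (step 2 and step 4) can be wrapped in small helpers so their results are harder to misread. This is a sketch under stated assumptions: both function names are hypothetical, `clusterware_procs` reads `ps -ef` output on standard input, and `diagwait_state` assumes that `crsctl get css diagwait` prints either a bare number or the "not defined" message quoted in step 4.

```shell
# Hypothetical helper for step 2: read `ps -ef` output on stdin and print only
# the lines for the Clusterware daemons, dropping the grep/egrep line itself.
# Exit status is 0 when such processes remain and non-zero when none are left.
clusterware_procs() {
    grep -E 'crsd\.bin|ocssd\.bin|evmd\.bin|oprocd' | grep -v grep
}

# Hypothetical helper for step 4: classify the text printed by
# `crsctl get css diagwait`. Prints "set:<value>" or "unset";
# returns 1 for any other text (assumption: the value is a bare number).
diagwait_state() {
    case $1 in
        *"Configuration parameter diagwait is not defined"*) echo "unset" ;;
        *[!0-9]*|'') echo "unexpected output: $1" >&2; return 1 ;;
        *) echo "set:$1" ;;
    esac
}

# Usage on a live node, as root:
#   if ps -ef | clusterware_procs; then
#       echo "Clusterware is still running; do not continue." >&2
#   fi
#   diagwait_state "$(crsctl get css diagwait 2>&1)"
```

Reading the process list from standard input keeps the filter easy to try against saved output before it is relied on during a maintenance window.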
Unsetting/Removing diagwait
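Customers should not unset diagwait without first fixing their OS scheduling issues, as doing so can lead to node evictions via reboot. Diagwait delays node eviction (and reconfiguration) by diagwait (13) seconds, so setting diagwait does not affect most customers. If diagwait does need to be removed, step 4 should afterwards return the "Configuration parameter diagwait is not defined" message. To remove it, follow the steps above, replacing the command in step 3 with the following: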
#crsctl unset css diagwait -f
(Note: the -f option must be used when unsetting diagwait since CRS will be down when doing so)
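Since the set and unset procedures differ only in step 3, a dry-run helper can print the full sequence for review before anything is executed. This is a sketch only: the name `diagwait_plan` is hypothetical, the function just prints text and never calls crsctl, and the "all nodes" and "one node" labels reflect one reading of the steps in this note rather than wording taken from it.

```shell
# Hypothetical dry-run helper: print the command sequence from this note for
# either "set" (default 13 seconds) or "unset". Nothing is executed.
diagwait_plan() {
    case $1 in
        set)   step3="crsctl set css diagwait ${2:-13} -force" ;;
        unset) step3="crsctl unset css diagwait -f" ;;
        *)     echo "usage: diagwait_plan set [seconds] | unset" >&2; return 2 ;;
    esac
    # <CRS_HOME> is left as the placeholder used in the note.
    cat <<EOF
all nodes: crsctl stop crs
all nodes: <CRS_HOME>/bin/oprocd stop
all nodes: ps -ef | egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
one node:  $step3
one node:  crsctl get css diagwait
all nodes: crsctl start crs
all nodes: crsctl check crs
EOF
}

diagwait_plan unset
```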