Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Applies to:
Oracle Server - Enterprise Edition - Version: 10.1.0.2 to 11.1.0.7 - Release: 10.1 to 11.1
Oracle Server - Enterprise Edition - Version: 9.2.0.8 to 11.1.0.7 [Release: 9.2 to 11.1]
Linux x86
Linux x86-64
Purpose
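The hangcheck-timer module must be running and configured on Linux systems in an Oracle Real Application Clusters (RAC) environment for releases 9i, 10g, and 11g RAC. This note identifies and summarizes the requirements for configuring hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.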
Note: The hangcheck-timer is not required starting with 11gR2.
Scope
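This document is intended for product managers, system architects, and system administrators involved in deploying and configuring Oracle RAC 9i, 10g, or 11g on Linux. It is also intended to help support engineers and consulting organizations install and configure the Oracle requirements in Linux RAC environments.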
Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
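Starting with release 9.2.0.2, Oracle RAC environments require a new I/O fencing model, the hangcheck-timer module. This module was introduced to replace the watchdog module, which provided similar fencing functionality. Hangcheck-timer has since shipped as part of the standard kernel distribution for Linux kernel 2.4 and later.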
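Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for extended operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. To do this, it sets a timer and then checks when the timer actually fires, because the firing may have been delayed beyond the allowed margin of error. If the elapsed time exceeds the allowed time (hangcheck_tick + hangcheck_margin seconds), the machine is restarted. Hangcheck-timer does not cause reboots because of CPU starvation.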
Hangcheck-timer requires three parameters to be configured:
hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version. In the 2.4 kernel, the default is 1. In 2.6 kernels, the default is 0.
All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows:
9i: Assuming the default setting of "oracm misscount" is set to 220 seconds:
hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
10g/11g: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
You must always ensure that the cluster misscount setting is greater than the sum of hangcheck_tick + hangcheck_margin.
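The rule above is simple arithmetic and can be checked before the module is loaded. The following is a minimal sketch, not part of the original note: the function names hangcheck_window and check_misscount are hypothetical helpers, and the misscount value is assumed to be supplied by the administrator from the cluster configuration.

```shell
#!/bin/sh
# Hypothetical helpers (not Oracle-supplied) for checking the timing rule.

# Print the hang detection window in seconds: hangcheck_tick + hangcheck_margin.
hangcheck_window() {
    echo $(( $1 + $2 ))
}

# Succeed only if misscount is strictly greater than tick + margin.
# Usage: check_misscount TICK MARGIN MISSCOUNT
check_misscount() {
    window=$(hangcheck_window "$1" "$2")
    [ "$3" -gt "$window" ]
}

# 10g/11g values from this note, with a CSS misscount of 30 seconds.
if check_misscount 1 10 30; then
    echo "OK: misscount 30 > window $(hangcheck_window 1 10)"
else
    echo "WARNING: misscount must exceed hangcheck_tick + hangcheck_margin"
fi
```

With the 9i values from this note (tick 30, margin 180) the window is 210 seconds, which is below the oracm misscount of 220, so the same check passes there as well.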
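When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on every RAC cluster node, because the module provides the I/O fencing needed to ensure that there are no lost writes from a node that has been evicted from the RAC cluster. To verify that the hangcheck-timer module is running on a node, execute the following as root or as the oracle user: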
# /sbin/lsmod | grep hangcheck
hangcheck-timer 2672 0
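If the hangcheck-timer module is loaded (running), you will see output similar to the above. When hangcheck-timer is not loaded, no output is produced and the command prompt is returned to the user.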
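For scripted health checks, the same test can be wrapped in a small function. This is only a sketch: hangcheck_loaded is a hypothetical helper name, and the pattern also accepts the underscore spelling hangcheck_timer on the assumption that some kernels list the module that way.

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod-style output on stdin and reports
# whether the hangcheck-timer module is listed.
hangcheck_loaded() {
    # Accept either spelling of the module name at the start of a line.
    grep -Eq '^hangcheck[-_]timer[[:space:]]'
}

# Demonstration using the sample output shown above.
# On a live node, pipe the real listing instead: /sbin/lsmod | hangcheck_loaded
if printf 'hangcheck-timer 2672 0\n' | hangcheck_loaded; then
    echo "hangcheck-timer is loaded"
else
    echo "hangcheck-timer is NOT loaded"
fi
```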
In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:
# modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
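To ensure the module is loaded at boot time, place the same command in the appropriate local startup script (e.g. /etc/rc.d/rc.local, or /etc/init.d/boot.local). In earlier releases, hangcheck-timer was loaded using insmod instead of modprobe. Consult your distribution's documentation to determine which initialization method is required.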
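One way to add the command to a startup script without duplicating it on repeated runs is sketched below. The helper name add_hangcheck_to_rc and the use of the full path /sbin/modprobe are assumptions made for illustration, and the demonstration writes to a scratch file rather than to a real startup script.

```shell
#!/bin/sh
# Hypothetical helper: append the modprobe command to a startup script
# only if that script does not already mention hangcheck-timer.
# The /sbin/modprobe path is an assumption; adjust it for your distribution.
HANGCHECK_CMD='/sbin/modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1'

add_hangcheck_to_rc() {
    rc_file=$1
    if grep -q 'hangcheck-timer' "$rc_file" 2>/dev/null; then
        echo "already present in $rc_file"
    else
        echo "$HANGCHECK_CMD" >> "$rc_file"
        echo "added to $rc_file"
    fi
}

# Demonstration against a scratch file. On a real node the argument would be
# the local startup script (e.g. /etc/rc.d/rc.local) and root is required.
demo_rc=$(mktemp)
add_hangcheck_to_rc "$demo_rc"   # first call appends the command
add_hangcheck_to_rc "$demo_rc"   # second call leaves the file unchanged
cat "$demo_rc"
rm -f "$demo_rc"
```

The parameter values shown are the 10g/11g settings from this note; for 9i, substitute the 9i values.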
Hangcheck-timer writes messages to the system log when it detects a hang and the module initiates a node reboot:
When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
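If you see the following message in /var/log/messages: "Hangcheck: hangcheck value past margin!", it means that a reboot was warranted but could not proceed because hangcheck_reboot was not set to 1. If you see this message, you must reload the hangcheck module as described earlier in this note, with hangcheck_reboot set to 1.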
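The two messages can be told apart with a short log scan. This is a sketch under stated assumptions: hangcheck_log_status is a hypothetical helper, and the sample log line (timestamp and host name) is invented for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and summarizes
# hangcheck-timer activity using the two messages quoted in this note.
hangcheck_log_status() {
    awk '
        /Hangcheck: hangcheck is restarting the machine/ { reboot++ }
        /Hangcheck: hangcheck value past margin!/        { past++ }
        END {
            if (past)        print "past-margin"       # hang seen, reboot was disabled
            else if (reboot) print "reboot-initiated"  # module restarted the node
            else             print "clean"             # neither message found
        }'
}

# On a live node (reading the log usually requires root):
#   hangcheck_log_status < /var/log/messages
# Illustrative sample line, not taken from a real system:
printf 'Jan  1 00:00:00 node1 kernel: Hangcheck: hangcheck value past margin!\n' |
    hangcheck_log_status
```

A result of past-margin means the module should be reloaded with hangcheck_reboot=1.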
Known Issues
Bug:6125546 which can prevent hangcheck-timer from rebooting in RHEL4 (fixed in 2.6.9.56 or RHEL4.6)
Bug:6782377 Incompatibility with hangcheck and HPET clock timer