Troubleshooting 11.2 Clusterware Node Evictions (Reboots)


jinquan


Posted: 2012-3-5 9:55:24


Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.

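Purpose

This document provides guidance on troubleshooting 11.2 Clusterware node evictions. For Clusterware node evictions prior to 11.2, see Note 165769.1.

Scope and Application

This document is intended for DBAs and support analysts who experience clusterware node evictions (reboots).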

NODE EVICTION OVERVIEW

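Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if a critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of a node eviction is to maintain the overall health of the cluster by removing bad members.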

1.0 - PROCESS ROLES FOR REBOOTS

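OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments. OCSSD's primary job is inter-node health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files). OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.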

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin
CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent
CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
Important files to review:
* Clusterware alert log in /log/
* The cssdagent log(s) in /log//agent/ohasd/oracssdagent_root
* The cssdmonitor log(s) in /log//agent/ohasd/oracssdmonitor_root
* The ocssd log(s) in /log//cssd
* The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
* IPD/OS or OS Watcher data
* 'opatch lsinventory -detail' output for the GRID home
* Messages files, by platform:
  * Linux: /var/log/messages
  * Sun: /var/adm/messages
  * HP-UX: /var/adm/syslog/syslog.log
  * IBM: /bin/errpt -a > messages.out
Note that the diagcollection.pl script in /bin can be used to obtain the /log files.
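When collecting data from several platforms, the messages file locations listed above can be kept in a small lookup table. The following is only a sketch: the function name `messages_source` is hypothetical, and the keys assume the OS names that Python's `platform.system()` reports on each platform (for example `SunOS` for Sun and `AIX` for IBM). Note that the IBM entry is a command to run, not a file to copy.

```python
import platform

# Locations copied from the list above. The keys are the OS names that
# platform.system() is assumed to report on each platform.
MESSAGES_SOURCES = {
    "Linux": ("file", "/var/log/messages"),
    "SunOS": ("file", "/var/adm/messages"),
    "HP-UX": ("file", "/var/adm/syslog/syslog.log"),
    "AIX": ("command", "/bin/errpt -a > messages.out"),
}


def messages_source(system=None):
    """Return (kind, location) for the given OS name, or None if it is not listed.

    When no name is passed, the current OS name from platform.system() is used.
    """
    if system is None:
        system = platform.system()
    return MESSAGES_SOURCES.get(system)


if __name__ == "__main__":
    print(messages_source())
```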
11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log. This can be used to determine which process is responsible for the reboot. Example message from a clusterware alert log:
[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
This particular eviction happened when the cssagent timed out heartbeating to the CSSD process (oclsomon functionality). Once you know the process (in the first message), the corresponding logs can be reviewed.
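To pick out the responsible component and the reported timeouts from an alert log, the two advisory lines can be matched with regular expressions. This is a minimal sketch based only on the message layout shown in the example above; the function name `parse_reboot_advisory` and the dictionary keys are hypothetical, other releases may word the messages differently, and the numeric values are returned as the raw integers from the message without assuming a unit.

```python
import re

# Patterns follow the two example messages quoted above (CRS-8011 and CRS-8013).
ADVISORY_HOST = re.compile(
    r"CRS-8011:reboot advisory message from host: (?P<host>[^,]+), "
    r"component: (?P<component>[^,]+), with timestamp: (?P<timestamp>\S+)"
)
ADVISORY_TEXT = re.compile(
    r"CRS-8013:reboot advisory message text: "
    r"Rebooting after limit (?P<limit>\d+) exceeded; "
    r"disk timeout (?P<disk_timeout>\d+), "
    r"network timeout (?P<network_timeout>\d+)"
)


def parse_reboot_advisory(lines):
    """Collect host, component, timestamp and timeout values from alert log lines."""
    info = {}
    for line in lines:
        match = ADVISORY_HOST.search(line)
        if match:
            info.update(match.groupdict())
        match = ADVISORY_TEXT.search(line)
        if match:
            # Timeout fields are numeric in the example, so store them as ints.
            info.update({key: int(value) for key, value in match.groupdict().items()})
    return info


if __name__ == "__main__":
    sample = [
        "[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, "
        "component: cssagent, with timestamp: L-2009-05-05-10:03:25.340",
        "[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after "
        "limit 28500 exceeded; disk timeout 27630, network timeout 28500, "
        "last heartbeat from CSSD at epoch seconds 1241543005.340",
    ]
    print(parse_reboot_advisory(sample))
```

The `component` field tells you which set of logs from the list in this section to open first.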
More examples to come later...
If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.
3.0 - TROUBLESHOOTING OCSSD EVICTIONS
If you have encountered an OCSSD eviction, review the common causes in section 3.1 below. If the problem cannot be determined by reviewing the common causes, review and collect the data from section 3.2.
3.1 - COMMON CAUSES OF OCSSD EVICTIONS
* Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction.
* Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
* A member kill escalation. For example, database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out it could escalate to a node kill.
* An unexpected failure of the OCSSD process. This can be caused by any of the above issues or by something else.
* An Oracle bug.
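The first two causes above are threshold rules, and they can be illustrated with a small model. This is a sketch of the rules as described in the list, not of how OCSSD is implemented: the function names are hypothetical, the default of 30 is the default misscount mentioned above, and "majority" is taken to mean strictly more than half of the voting files.

```python
def first_eviction_checkin(checkins, misscount=30):
    """Return the 0-based index of the check-in at which `misscount`
    consecutive misses is reached, or None if it never is.

    `checkins` is an iterable of booleans: True for a received network
    check-in, False for a missed one.
    """
    consecutive_misses = 0
    for index, received in enumerate(checkins):
        # A single successful check-in resets the counter.
        consecutive_misses = 0 if received else consecutive_misses + 1
        if consecutive_misses >= misscount:
            return index
    return None


def has_voting_majority(reachable, total):
    """True if the node can heartbeat to strictly more than half of its voting files."""
    return 2 * reachable > total


if __name__ == "__main__":
    # 29 misses, one success, then 30 misses: only the second run reaches the limit.
    history = [True] * 5 + [False] * 29 + [True] + [False] * 30
    print(first_eviction_checkin(history))
    print(has_voting_majority(1, 3), has_voting_majority(2, 3))
```

The model shows why intermittent packet loss that never reaches the misscount does not by itself trigger an eviction, and why a node with three voting files survives the loss of one of them but not two.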
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in section 4.1 below. If the problem cannot be determined by reviewing the common causes, review and collect the data from section 4.2.
4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
* An OS scheduler problem. For example, the OS is locked up in a driver or in hardware, or there is excessive load on the machine, which prevents the scheduler from behaving reasonably.
* One or more threads within the CSS daemon are hung.
* An Oracle bug.
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
