Introducing Cluster Health Monitor (IPD/OS)
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1
Oracle Server - Standard Edition - Version: 10.1.0.2
Generic Linux
Microsoft Windows x64 (64-bit)
Microsoft Windows (32-bit)
What is being announced?
What is Cluster Health Monitor (IPD/OS)?
Cluster Health Monitor is a tool that periodically and automatically collects operating system performance data. The data is stored so that it can be analyzed both online and offline.
Where can I get the latest copy of Cluster Health Monitor?
Cluster Health Monitor (IPD/OS) can be downloaded from http://otn.oracle.com/rac
How the collected data is used
The tool collects operating system performance data that can be used to triage both single-instance and RAC performance issues and to find root cause, particularly for problems caused by scheduling issues or high CPU load. Generic collection tools sometimes fail to gather data precisely when the system is very busy. This is why Cluster Health Monitor is needed.
Why Cluster Health Monitor?
Oracle Clusterware and Oracle Database performance problems, and node reboots caused by a lack of CPU or memory resources, regularly lead customers to ask how to monitor their systems. Some customers have rudimentary scripts using vmstat, mpstat, and similar tools, but these collect data on the nodes only at fixed intervals. In some cases we have found customers collecting once per hour while the node rebooted partway between collections. A system monitor only does its job well if data is collected at a strict, reliable interval. Cluster Health Monitor extends OSWatcher by guaranteeing that its data collection is always scheduled, and by providing customers with a GUI to view the current load.
What should you do?
Installation
Installation is explained in the product's README file.
Usage
The tool can be used to monitor nodes either online or offline. Typically, when working with Oracle Support, the collected data is reviewed offline.
GUI Mode
Online mode is used to find problems in the customer environment; the data can be viewed with the Cluster Health Monitor utility /usr/lib/oracrf/bin/crfgui. The GUI is not installed on the server nodes; it is installed on a separate client machine using crfinst.pl -g.
1. For example, to look at the load on a node you can run the command:
/usr/lib/oracrf/bin/crfgui.sh -m
The above will pop up a screen showing the current load on the node (screenshot omitted).
The default refresh rate is one second. To change the refresh rate to 5 seconds, run:
/usr/lib/oracrf/bin/crfgui.sh -n -r 5
2. Another option that can be passed to the tool is -d, which is used to look at data from a point in the past. If a node rebooted four hours ago and you want to look at data starting ten minutes before the reboot, you would pass -d "04:10:00":
/usr/lib/oracrf/bin/crfgui.sh -d "04:10:05"
All of the above commands require GUI connectivity to the nodes.
Non-GUI Mode
If there is no GUI connectivity to the nodes, load information can be obtained with /usr/lib/oracrf/bin/oclumon.
Execute oclumon with the -h option to see the help:
For help from command line : oclumon -h
For help in interactive mode : -h
Currently supported verbs are :
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help
Various options can be used to track down performance problems. Some useful verbs that can be passed to oclumon are:
1. Showobjects
/usr/lib/oracrf/bin/oclumon showobjects -n stadn59 -time "2008-06-03 16:10:00"
2. Dumpnodeview
/usr/lib/oracrf/bin/oclumon dumpnodeview -n halinux4
3. Showgaps
/usr/lib/oracrf/bin/oclumon showgaps -n celx32oe40d \
-s "2009-07-09 02:40:00" -e "2009-07-09 03:59:00"
Number of gaps found = 0
4. Showtrail
$/usr/lib/oracrf/bin/oclumon showtrail -n celx32oe40d -diskid \
sde qlen totalwaittime -s "2009-07-09 03:40:00" \
-e "2009-07-09 03:50:00" -c "red" "yellow" "green"
Parameter=QUEUE LENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
Parameter=TOTAL WAIT TIME
$/usr/lib/oracrf/bin/oclumon showtrail -n celx32oe40d -sys cpuqlen \
-s "2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \
-c "red" "yellow" "green"
Parameter=CPU QUEUELENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
Data Collection
For Oracle 11.2 RAC installations use the diagcollection script that comes with Cluster Health Monitor:
/usr/lib/oracrf/bin/diagcollection.pl --collect --ipd
For other versions run
/usr/lib/oracrf/bin/oclumon dumpnodeview -allnodes -v -last "23:59:59" > /
Make sure the target location has more than 2 GB of free space for the output file.
Zip or compress the file before uploading it to the Service Request.
Also update the SR with the date and time at which you observed the specific issue.
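The collection steps above can be sketched as a small shell wrapper. The free-space guard is an illustrative addition based on the 2 GB requirement stated above; the oclumon invocation is the one shown earlier and is left commented out, since it only runs where Cluster Health Monitor is installed.

```shell
# Sketch of the collection workflow above. The free-space check is an
# illustrative addition; df -k reports 1 KB blocks, so 2 GB = 2097152 KB.
OUTDIR=${OUTDIR:-/tmp}
MIN_KB=$((2 * 1024 * 1024))                         # 2 GB in KB
FREE_KB=$(df -k "$OUTDIR" | awk 'NR==2 {print $4}')
if [ "$FREE_KB" -ge "$MIN_KB" ]; then
    SPACE_OK=yes
else
    SPACE_OK=no
fi
echo "free space check for $OUTDIR: $SPACE_OK"

# With enough space, dump all node views and compress before uploading:
# /usr/lib/oracrf/bin/oclumon dumpnodeview -allnodes -v -last "23:59:59" \
#     > "$OUTDIR/chm_dump.txt" && gzip "$OUTDIR/chm_dump.txt"
```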
Usage Scenarios
Detailed command-line usage scenarios are described below.
1. Node reboot
A customer complains that their node halinux1 rebooted. They upload the logs collected with diagcollection.pl. They also have Cluster Health Monitor running.
* The typical first step for the customer is to find out the exact reboot time. They can either check the OS logs (Linux: /var/log/messages, Solaris: /var/adm/messages) or simply run "last reboot".
* Once the time and node name of the last reboot are established, we can invoke the GUI viewer of Cluster Health Monitor using /usr/lib/oracrf/bin/crfgui.sh -n -d "02:00", assuming it has been 1 hour and 45 minutes since the reboot and we want to see the load starting about 15 minutes before the actual reboot.
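The -d offset can be derived from the elapsed time since the reboot. A minimal sketch, assuming the "HH:MM:SS ago" format shown earlier and the 1 h 45 min / 15 min figures from this scenario (the node name placeholder is hypothetical):

```shell
# Compute the crfgui -d offset: time since the reboot plus a 15-minute
# margin before it. Values are hardcoded from the scenario above.
ELAPSED=$((1*3600 + 45*60))        # 1 h 45 min since the reboot, in seconds
AGO=$((ELAPSED + 15*60))           # start 15 minutes before the reboot
DOPT=$(printf '%02d:%02d:%02d' $((AGO/3600)) $((AGO%3600/60)) $((AGO%60)))
echo "$DOPT"                       # 02:00:00
# /usr/lib/oracrf/bin/crfgui.sh -n <node> -d "$DOPT"
```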
* The tool can also be used in command-line mode. A quick way to find out whether there were periods in which IPD/OS could not collect data is to run the following command:
/usr/lib/oracrf/bin/oclumon showgaps -n -s "2008-11-23 15:10:00" -e "2008-11-23 16:15:00"
The output of that command shows whether the Cluster Health Monitor collector was not scheduled. Gaps generally indicate a CPU scheduling problem or very high load on the node. Cluster Health Monitor should normally always be scheduled, since it runs as a real-time (RT) process.
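To illustrate what a "gap" means here, a toy sketch (not the tool itself): given a series of collection timestamps, count consecutive pairs farther apart than a threshold. The 5-second threshold and the sample timestamps are assumptions for illustration only; the tool's actual collection interval governs what showgaps reports.

```shell
# Toy illustration of gap detection over epoch-second collection timestamps.
# A gap is any interval between consecutive samples above the threshold.
THRESHOLD=5
GAPS=$(printf '%s\n' 100 101 102 110 111 |
    awk -v t="$THRESHOLD" 'NR>1 && $1-prev>t {n++} {prev=$1} END {print n+0}')
echo "Number of gaps found = $GAPS"
```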
* Since we do not know which resource (network, cpu, memory, disk) was low causing the node to be evicted via reboot, we could run the following command
oclumon showtrail -n halinux4 -nicid eth1 effectivebw errors -c "red" "yellow" "orange" "green"
The above command tells us the state of NIC eth1 over time; the output is rendered in colors. For example, green means good, yellow means not great but not really bad, and red means there is a problem.
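The traffic-light convention described above can be summarized as a small helper function (purely illustrative; not part of oclumon):

```shell
# Map a showtrail color band to the severity it conveys, per the
# description above. Purely illustrative; not part of the tool.
severity() {
    case "$1" in
        GREEN)  echo "good" ;;
        YELLOW) echo "not great, but not really bad" ;;
        RED)    echo "problem" ;;
        *)      echo "unknown" ;;
    esac
}
severity GREEN     # prints "good"
severity RED       # prints "problem"
```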
Similarly, we can use the showtrail option to show the CPU load:
./oclumon showtrail -n halinux4 -sys usagepc cpuqlen cpunumprocess openfds numrt numofiosps lowmem memfree -c "red" "yellow"
* From the above screenshot we can see that lowmem is in red the whole time. Now we can get details of the lowmem usage using:
./oclumon dumpnodeview -n halinux4 -s "2008-11-24 20:26:55" -e "2008-11-24 20:30:21"
* The cause of the node reboot here was a program written to simulate load. It allocated and de-allocated huge chunks of memory, causing the system to swap and page.
2. Performance problem with lost blocks
3. Node reboot caused by private interconnect problems
4. Node reboot caused by loss of the path to the voting disk
5. Instance eviction caused by an IPC send timeout