[Repost] Introducing Cluster Health Monitor - MySQL, Oracle and Database Discussion Board - 联动北方 (Landingbj) Technical Forum
Posted by 众里寻他 on 2013-3-4 11:42:57

Introducing Cluster Health Monitor (IPD/OS)

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1
Oracle Server - Standard Edition - Version: 10.1.0.2
Generic Linux
Microsoft Windows x64 (64-bit)
Microsoft Windows (32-bit)

What is being announced?

What is Cluster Health Monitor (IPD/OS)?

Cluster Health Monitor is a tool that periodically and automatically collects operating system performance data and stores it for both online and offline analysis.

Where can I get the latest copy of Cluster Health Monitor?

Cluster Health Monitor (IPD/OS) can be downloaded from http://otn.oracle.com/rac

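How the monitor helps analyze the collected data

The tool collects operating system performance data. This data is used to tune single-instance and RAC performance, and it can also be used to find root causes, especially for problems caused by scheduling issues or high CPU load. Generic performance collection tools sometimes fail to collect data when the system is very busy. That is why Cluster Health Monitor is needed.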

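Why Cluster Health Monitor?

Oracle Clusterware and Oracle database performance problems, and node reboots caused by a lack of CPU or memory resources, lead customers to ask how they should monitor their systems. Some customers have rudimentary scripts built on vmstat and mpstat, but these sample the node only at periodic intervals. In some cases we found customers collecting data once an hour while the node rebooted in the middle of an interval, so nothing was captured around the failure. OSWatcher did a good job of collecting data at tight intervals. Cluster Health Monitor extends OSWatcher by ensuring that it is always scheduled to collect data points, and by giving customers a GUI to view the current load.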

What should you do?

Installation

Installation is explained in the product's README file.

Usage

The tool can be used to monitor nodes either online or offline. Typically, when working with Oracle Support, the data is viewed offline.

GUI mode

Online mode is used to detect problems in the customer's live environment. The data can be viewed with the Health Monitor utility /usr/lib/oracrf/bin/crfgui. The GUI is not installed on the server nodes; it is installed on a separate client machine with crfinst.pl -g.

1. For example, to look at the load on a node, run:
/usr/lib/oracrf/bin/crfgui.sh -m
This pops up a screen showing the load on the node.

The default refresh interval is one second. To change it to five seconds, run:

/usr/lib/oracrf/bin/crfgui.sh -n -r 5

2. Another option that can be passed to the tool is -d, which shows historical data starting from a given offset before the current time. If a node rebooted four hours ago and you need to see the data from ten minutes before the reboot, pass -d "04:10:00":

/usr/lib/oracrf/bin/crfgui.sh -d "04:10:00"

All of the commands above require GUI access to the node.
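
The -d offset described above is the time elapsed since the reboot plus the lead time you want to see before it. A minimal sketch of that arithmetic follows; the helper name crfgui_offset is hypothetical, and the HH:MM:SS layout is assumed from the "04:10:00" example in the post.

```python
from datetime import timedelta


def crfgui_offset(since_reboot: timedelta, lead: timedelta) -> str:
    """Format the look-back offset for -d as HH:MM:SS.

    The offset is how long ago the reboot happened plus how far
    before the reboot the view should start.
    """
    total = int((since_reboot + lead).total_seconds())
    if total < 0:
        raise ValueError("offset must not be negative")
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Reboot four hours ago, view starting ten minutes before it.
print(crfgui_offset(timedelta(hours=4), timedelta(minutes=10)))  # 04:10:00
```

The resulting string is what would be passed to crfgui.sh after -d.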

Non-GUI mode

If there is no GUI access to the node, the load information can be obtained with /usr/lib/oracrf/bin/oclumon.

Run oclumon with the -h option to see the help:
For help from command line : oclumon -h
For help in interactive mode : -h
Currently supported verbs are :
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help

Various attributes can be used to track down performance problems. Some useful attributes that can be passed to oclumon are:
1. Showobjects
/usr/lib/oracrf/bin/oclumon showobjects -n stadn59 -time "2008-06-03 16:10:00"
2. Dumpnodeview
/usr/lib/oracrf/bin/oclumon dumpnodeview -n halinux4
3. Showgaps
/usr/lib/oracrf/bin/oclumon showgaps -n celx32oe40d \
-s "2009-07-09 02:40:00" -e "2009-07-09 03:59:00"
Number of gaps found = 0
4. Showtrail
$ /usr/lib/oracrf/bin/oclumon showtrail -n celx32oe40d -diskid \
sde qlen totalwaittime -s "2009-07-09 03:40:00" \
-e "2009-07-09 03:50:00" -c "red" "yellow" "green"
Parameter=QUEUE LENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
Parameter=TOTAL WAIT TIME
$ /usr/lib/oracrf/bin/oclumon showtrail -n celx32oe40d -sys cpuqlen \
-s "2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \
-c "red" "yellow" "green"
Parameter=CPU QUEUELENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
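
When the showtrail output covers a long period, it can help to pull out only the intervals that are not GREEN. The sketch below parses text shaped like the sample output above. The line layout ("Parameter=..." headers followed by "start TO end COLOR" lines) is assumed from that sample only, and parse_showtrail and not_green are hypothetical helper names, not part of oclumon.

```python
import re
from typing import Dict, List, Tuple

# One interval line, e.g. "2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN"
_INTERVAL = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) TO "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)$"
)

Interval = Tuple[str, ...]  # (start, end, color)


def parse_showtrail(text: str) -> Dict[str, List[Interval]]:
    """Group interval lines under the Parameter= header that precedes them."""
    trails: Dict[str, List[Interval]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Parameter="):
            current = line[len("Parameter="):]
            trails[current] = []
            continue
        match = _INTERVAL.match(line)
        if match and current is not None:
            trails[current].append(match.groups())
    return trails


def not_green(trails: Dict[str, List[Interval]]) -> Dict[str, List[Interval]]:
    """Keep only parameters that have at least one non-GREEN interval."""
    flagged = {}
    for name, intervals in trails.items():
        bad = [iv for iv in intervals if iv[2] != "GREEN"]
        if bad:
            flagged[name] = bad
    return flagged


sample = """Parameter=QUEUE LENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
Parameter=TOTAL WAIT TIME
"""
print(not_green(parse_showtrail(sample)))  # {}
```

An empty result means every reported interval was GREEN, as in the sample runs above.
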
Data Collection
For Oracle 11.2 RAC installations, use the diagcollection script that comes with Cluster Health Monitor:
/usr/lib/oracrf/bin/diagcollection.pl --collect --ipd
For other versions, run:
/usr/lib/oracrf/bin/oclumon dumpnodeview -allnodes -v -last "23:59:59" > /
Make sure the target location has more than 2 GB of free space for the output file.
Zip or compress the file before uploading it to the Service Request.
Also update the SR with the date and time at which you observed the specific issue.
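
The free-space check and the compression step above can be scripted. The following is a rough sketch using only the Python standard library; the 2 GB threshold comes from the text, while the helper names and the file name chm_dump.txt are hypothetical, and the demo runs on a throwaway file rather than a real oclumon dump.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

REQUIRED_FREE_BYTES = 2 * 1024**3  # "more than 2 GB", as stated in the text


def has_room(directory: str, required: int = REQUIRED_FREE_BYTES) -> bool:
    """Return True if the filesystem holding `directory` has more free bytes than `required`."""
    return shutil.disk_usage(directory).free > required


def gzip_file(source: str) -> Path:
    """Write a gzip copy of `source` next to it and return the new path."""
    src = Path(source)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Demo on a temporary file; a real run would point at the dump written by oclumon.
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "chm_dump.txt"  # hypothetical file name
    dump.write_text("sample\n")
    if has_room(tmp, required=0):
        archive = gzip_file(str(dump))
        print(archive.name)  # chm_dump.txt.gz
```

In practice has_room would be called with its default threshold on the target directory before the dump is started.
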
Usage Scenarios
Detailed command line usage scenarios are depicted below.
1. Node reboot

A customer complains that their node halinux1 was evicted via reboot. They uploaded the logs collected with diagcollection.pl. They also had Cluster Health Monitor running.

* The first thing the customer can do is find the exact reboot time. They can either check the OS logs (Linux /var/log/messages, Solaris /var/adm/messages) or simply run last reboot.
* Once the time and node name of the last reboot are established, we can invoke the GUI viewer of Cluster Health Monitor using /usr/lib/oracrf/bin/crfgui.sh -n -d "02:00"
This assumes it has been 1 hour and 45 minutes since the reboot and we want to see the load from about 15 minutes before the actual reboot.
* The tool can also be used in command line mode. A quick way to find out whether there were periods when IPD/OS could not collect data is to run the following command:
/usr/lib/oracrf/bin/oclumon showgaps -n -s "2008-11-23 15:10:00" -e "2008-11-23 16:15:00"
The output of that command can be used to see whether OSWatcher was not scheduled. This generally means some problem with CPU scheduling or very high load on the node. Cluster Health Monitor should generally always be scheduled, since it runs as a real-time (RT) process.
* Since we do not know which resource (network, CPU, memory, disk) ran low and caused the node to be evicted via reboot, we can run the following command:
oclumon showtrail -n halinux4 -nicid eth1 effectivebw errors -c "red" "yellow" "orange" "green"

The command above tells us how often each problem occurred on NIC eth1. The output uses colors to describe the state: green means good, yellow means not very good but not really bad, and red means there is a problem.

Similarly, we can use the showtrail option to show CPU load:
./oclumon showtrail -n halinux4 -sys usagepc cpuqlen cpunumprocess, openfds, numrt, numofiosps, lowmem, memfree, -c "red" "yellow"
* The output shows lowmem in red the whole time. We can now get the details of that lowmem usage using:
./oclumon dumpnodeview -n halinux4 -s "2008-11-24 20:26:55" -e "2008-11-24 20:30:21"
* The cause of the node reboot here was a program written to simulate load. It allocated and de-allocated huge chunks of memory, causing the system to swap and page.
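
The showgaps and dumpnodeview steps in this scenario both take a start and an end timestamp around the reboot. A small sketch for deriving that pair from a known reboot time follows. The helper name window_around and the reboot time used here are hypothetical; the timestamp layout is assumed from the -s and -e values shown in the post.

```python
from datetime import datetime, timedelta
from typing import Tuple

STAMP = "%Y-%m-%d %H:%M:%S"  # layout of the -s / -e values in the post


def window_around(reboot: datetime, before: timedelta, after: timedelta) -> Tuple[str, str]:
    """Return (start, end) strings covering `before` the reboot up to `after` it."""
    start = reboot - before
    end = reboot + after
    return start.strftime(STAMP), end.strftime(STAMP)


# Hypothetical reboot time; in practice it comes from the OS logs or `last reboot`.
reboot = datetime(2008, 11, 23, 16, 0, 0)
start, end = window_around(reboot, timedelta(minutes=50), timedelta(minutes=15))
print(f'-s "{start}" -e "{end}"')
# -s "2008-11-23 15:10:00" -e "2008-11-23 16:15:00"
```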

2. Performance problems caused by lost blocks

3. Node reboot caused by private network problems

4. Node reboot caused by a lost path to the voting disk

5. Instance eviction caused by IPC send timeouts