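Applies to
Enterprise Manager Grid Control - Version: 10.2.0.4 to 10.2.0.5
Information in this document applies to any platform.
Symptoms
The OMS restarts, and the log file OMS ORACLE_HOME/sysman/log/emoms.trc reports the following error: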
HealthMonitor Feb 23, 2010 11:56:27 AM XMLLoader error: XMLLoader processing
file /oracle/oms10g/oms10g/sysman/recv/50000000051.xml timed out.
Critical error err=3 detected in module XMLLoader
OMS will be restarted. A full thread dump will be generated
in the opmn log file
/opmn/logs/OC4J~OC4J_EM~default_island~1
to help Oracle Support analyse the problem.
Please consult My Oracle Support Note 964469.1 for detailed instructions.
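Note: The reference to Note 964469.1 in emoms.log/emoms.trc is introduced once the 10.2.0.5.2 Grid Control Patch Set Update (PSU) has been applied. See Note 822485.1 for more information on the recommended EM patches.
The log file /opmn/logs/OC4J~OC4J_EM~default_island~1 contains the following: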
10/02/23 12:28:09 HealthMonitor : Executing diagnostic command for module
omsThread. Feb 23, 2010 12:28:09 PM
...........
"XMLLoader2 50000000051.xml" daemon prio=1 tid=0xaaf56e40 nid=0x181a runnable
[0xa9e7d000..0xa9e7e248]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.DataPacket.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:978)
at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:950)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:447)
at oracle.jdbc.driver.T4CCallableStatement.doOall8(T4CCallableStatement.java:183)
at oracle.jdbc.driver.T4CCallableStatement.execute_for_rows(T4CCallableStatement.java:872)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1160)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3000)
at oracle.jdbc.driver.OraclePreparedStatement.execute(OraclePreparedStatement.java:3092)
- locked (a oracle.jdbc.driver.T4CCallableStatement)
- locked (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OracleCallableStatement.execute(OracleCallableStatement.java:4285)
- locked (a oracle.jdbc.driver.T4CCallableStatement)
- locked (a oracle.jdbc.driver.T4CConnection)
at oracle.sysman.util.jdbc.PreparedStatementWrapper.execute(PreparedStatementWrapper.java:175)
at oracle.sysman.util.jdbc.CallableStatementWrapper.execute(CallableStatementWrapper.java:135)
at oracle.sysman.emdrep.dbjava.loader.XMLLoaderContext.flushCurrentMetrics(XMLLoaderContext.java:2211)
at oracle.sysman.emdrep.dbjava.loader.XMLLoaderContext.loadFromStream(XMLLoaderContext.java:1963)
at oracle.sysman.emdrep.dbjava.loader.XMLLoader.LoadFile(XMLLoader.java:662)
at oracle.sysman.emdrep.dbjava.loader.XMLLoader.LoadFiles(XMLLoader.java:754)
at oracle.sysman.emdrep.dbjava.loader.XMLLoader.run(XMLLoader.java:1417)
According to the above, the HealthMonitor thread restarts the OMS because XMLLoader timed out while trying to load an XML file named 50000000051.xml into the repository database.
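To confirm the symptom on a long trace file, the timeout message can be searched for mechanically. The following is a minimal Python sketch, assuming the message has the same wording as the emoms.trc excerpt above and may wrap across lines; the function name and the pattern are illustrative and are not part of any Oracle tool.

```python
import re

# Pattern modeled on the emoms.trc excerpt above (an assumption about the
# message wording); \s+ lets the message wrap across a line break.
TIMEOUT_RE = re.compile(
    r"XMLLoader error: XMLLoader processing\s+file\s+(\S+\.xml)\s+timed out"
)


def find_loader_timeouts(trace_text):
    """Return the paths of XML files whose load timed out, in order of appearance."""
    return TIMEOUT_RE.findall(trace_text)


# Sample text copied from the excerpt in this note.
sample = (
    "HealthMonitor Feb 23, 2010 11:56:27 AM XMLLoader error: XMLLoader processing\n"
    "file /oracle/oms10g/oms10g/sysman/recv/50000000051.xml timed out.\n"
    "Critical error err=3 detected in module XMLLoader\n"
)

for path in find_loader_timeouts(sample):
    print(path)
```

In practice the text would be read from emoms.trc rather than from an inline sample.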
Cause
This issue was investigated in EM Bug 8645222: OMS FAILED TO LOAD THE FILES INTO REPOSITORY LOT OF ERR FILES UNDER RECV/ERRORS.
The bug is hit under the following conditions:
1. The HealthMonitor restarts the OMS because of an XMLLoader timeout. /opmn/logs/OC4J~OC4J_EM~default_island~1 shows an XMLLoader call stack that contains flushCurrentMetrics or flushStringMetricHistory.
2. The 50000000051.xml file being loaded contains connect descriptors, for example (ADDRESS=(PROTOCOL=TCP)(HOST=hostname)(PORT=1521)).
3. The OMS connects to the repository database through a firewall, using a listener on port 1521. The firewall has a feature such as SQLNet fixup protocol / Deep Packet Inspection (DPI) / SQLNet packet inspection / SQL Fixup / SQL ALG (Juniper firewall) enabled for port 1521.
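The first two conditions can be checked against the thread dump and the XML file with simple text matching. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, the frame test looks only for the two method names listed above, and the address pattern covers only the single-address TCP form shown in this note. The third condition depends on the firewall configuration and cannot be verified this way.

```python
import re

# Method names taken from condition 1 above.
LOADER_FRAMES = ("flushCurrentMetrics", "flushStringMetricHistory")

# Matches only the simple TCP address form shown in this note (an assumption).
ADDRESS_RE = re.compile(
    r"\(ADDRESS=\(PROTOCOL=TCP\)\(HOST=([^)]+)\)\(PORT=(\d+)\)\)", re.IGNORECASE
)


def has_flush_frame(thread_dump):
    """True if any line of the dump calls one of the flush methods."""
    return any(
        f".{frame}(" in line
        for line in thread_dump.splitlines()
        for frame in LOADER_FRAMES
    )


def find_tcp_addresses(xml_text):
    """Return (host, port) pairs for every TCP address found in the text."""
    return [(host, int(port)) for host, port in ADDRESS_RE.findall(xml_text)]


dump_line = (
    "at oracle.sysman.emdrep.dbjava.loader.XMLLoaderContext."
    "flushCurrentMetrics(XMLLoaderContext.java:2211)"
)
payload = "(ADDRESS=(PROTOCOL=TCP)(HOST=host1.example.com)(PORT=1521))"

print(has_flush_frame(dump_line))
print(find_tcp_addresses(payload))
```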
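The bug investigation showed that the problem occurs when a Cisco firewall is in use and the SQLNet Packet Inspection feature is enabled on port 1521. This feature alters the data content of the packets it receives, which causes data loss; XMLLoader then times out, and the OMS restarts.
This issue is also identified in the Cisco firewall bug list as "Bug CSCsm92275 - SQL inspection rewrites IP addresses embeded in SQL data".
Note: In some cases the firewall may have one of the features mentioned above enabled for the default port 1521. For example, Cisco 5400/5500 Series Adaptive Security Appliances have the SQLnet fixup protocol/Sql Inspection feature enabled for the default port 1521.
For example:
One of the metrics that the agent collects for a listener target is "TNS Address". The agent uploads this metric to the OMS, and the loader component of the OMS loads the data into the Management Repository. An example follows: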
C531B97913156CE7CE3FACAF7D697C4E
2F817855DE48E3CD2083D24FE76CD179
2009-09-29 15:23:07
(ADDRESS=(PROTOCOL=TCP)(HOST=host1.example.com)(PORT=1521))
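When a firewall sits between the OMS and the repository and its SQL*Net inspection module is enabled, the Cisco firewall sometimes modifies data passing over the connection. For example, it changes (ADDRESS=(PROTOCOL=TCP)(HOST=host1.example.com)(PORT=1521))
to: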
(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.101)(PORT=6402))
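Note that the replaced data is not part of a SQL*Net protocol header; it is the actual data that the OMS is trying to insert into the repository table. In this example, because the effective length of the packet has been reduced, the database server waits for more data that never arrives. The client and the server then hang indefinitely, each waiting for the other to respond.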
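The size of the shortfall in this example can be worked out directly from the two strings. The Python sketch below only compares their byte lengths; it does not model SQL*Net framing, which this note does not describe, and the function name is illustrative.

```python
# Both strings are taken verbatim from the example above.
original = "(ADDRESS=(PROTOCOL=TCP)(HOST=host1.example.com)(PORT=1521))"
rewritten = "(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.101)(PORT=6402))"


def payload_shrink(before, after):
    """Number of bytes lost when `before` is replaced by `after` (ASCII data)."""
    return len(before.encode("ascii")) - len(after.encode("ascii"))


missing = payload_shrink(original, rewritten)
print(f"rewritten value is {missing} bytes shorter than the original")
```

For these two strings the rewritten value is 4 bytes shorter (a 17-character host name replaced by a 13-character IP address), which is consistent with the server waiting for data that never arrives.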
Solution
There are two options:
1. Disable the SQLNet packet inspection feature for port 1521 on the Cisco firewall.
References:
- Note 119706.1: Troubleshooting Guide TNS-12535 or ORA-12535 or ORA-12170 Errors
Section: Note A - Firewall Restrictions
- Note 742535.1: Insert Into Remote Table Using DBLINK Over VPN Tunnel Hangs on Large Number of Rows
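2. Create another listener on the repository database host on a port other than 1521, for example 1522.
Configure the OMS to connect to the repository database through port 1522.
Reference:
Note 369997.1: How to Re-configure the OMS After Port Change for the Listener Servicing the Grid Control Repository Database.
For the connection to succeed, make sure that all of the firewall features mentioned above, such as SQLNet fixup protocol / Deep Packet Inspection (DPI) / SQLNet packet inspection / SQL Fixup, are disabled for port 1522.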
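As an illustration of option 2, the Python sketch below builds a connect descriptor for the new listener port and rejects port 1521. The host name repo.example.com and the SID emrep are hypothetical placeholders, and the helper is not part of any Oracle tool; the actual OMS reconfiguration steps are those in Note 369997.1.

```python
# Port that the firewall features in this note are enabled for.
INSPECTED_PORT = 1521


def build_descriptor(host, port, sid):
    """Build a single-address TCP connect descriptor, refusing port 1521."""
    if port == INSPECTED_PORT:
        raise ValueError("port 1521 is subject to inspection; use another port")
    return (
        f"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))"
        f"(CONNECT_DATA=(SID={sid})))"
    )


# Hypothetical host and SID, for illustration only.
print(build_descriptor("repo.example.com", 1522, "emrep"))
```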
References
BUG:7009048 - HEALTHMONITOR CAUSES OC4J_EM TO CRASH AND RESTART WHEN XML FILE CANT BE LOADED
BUG:7214751 - SCP: OMS RESTARTS EVERY 15 MINUTES. XMLLOADER DIDN'T LOAD ANY DATA
BUG:7620814 - FREQUENT OMS RESTARTS DUE TO LOADER TIMEOUTS
BUG:8337827 - JDBC DRIVER HANGS
BUG:8527201 - OMS RESTARTS EVERY 16 MINUTES - FIREWALL IS CAUSE
BUG:9087501 - SELECT FROM V$SESSION AND SOME OTHER DICTIONARY VIEWS HANG FROM A REMOTE CLIENT
BUG:9814813 - JDBC HANGS
NOTE:2084440.6 - Oracle and Firewalls: Answers to Frequently Asked Questions
NOTE:361284.1 - Port 1521 Open on Firewall But Unable to Connect Due to Errors: ORA-12535,TNS-12203
NOTE:397393.1 - External Clients behind NAT Translation / Port Forwarding / Tunneling Fail to Connect to Database (ORA-12535 or ORA-12541)
NOTE:45226.1 - SQL*Net and Firewalls
NOTE:66382.1 - Firewalls, Windows NT and Redirections
NOTE:822485.1 - Oracle Recommended Patches -- Oracle Enterprise Manager
NOTE:859480.1 - Indefinite Hang Using DESCRIBE Via Oracle Net for Large Number of Columns
NOTE:964469.1 - Grid Control Performance: How to Troubleshoot OMS Crash / Restart Issues?
Related Products
Enterprise Manager > Enterprise Manager Grid Control
Keywords
HANGING; OMS; FIREWALL; CISCO; REPOSITORY; CISCO FIREWALL; GRID CONTROL; TIME OUT
Errors
TNS-12535; ORA-12535; ORA-12170; 12170 ERROR