[Share] GoldenGate Extract Abends with "Unable to lock file" Error for Trail File


shunzi


Posted: 2011-12-23 17:44:05

GoldenGate Extract Abends with "Unable to lock file" Error For Trail File
Applies to:
Oracle GoldenGate - Version: 9.5.0.16 and later [Release: 9.5.0 and later]
Information in this document applies to any platform.
Symptoms
Extract abends, reporting "Unable to lock file" response from Server/Collector.
Cause
Network outages that last longer than the time the TCP/IP stack is configured to retransmit unacknowledged packets may result in "orphan" TCP/IP connections on the RMTHOST system. Since the local system has closed the connections and the "RST" packets were lost due to the network outage, no packets (data or "control") will ever be sent for these connections.
Since the RST packets were not delivered to the RMTHOST, the TCP/IP stack will not present an error to the Server/Collector process. The Server/Collector process will continue to wait passively, forever, for new data that will never arrive, because the Extract process on the other system is no longer running.
As of v10.4, Server/Collector locks the trail file to prevent multiple processes from writing to the same trail file. Because the orphan Server/Collector process still holds that lock, new Server/Collector processes are unable to lock the trail files.
A second cause for this symptom is that the remote server was rebooted and the Network-Attached Storage (NAS) device where the target trails reside did not detect and was not notified of the reboot, so the locks acquired prior to the reboot are still considered to be in force.
Solution
The best solution is to ensure that Server/Collector processes will detect when they have become "orphans" and terminate themselves; this will release the lock, release other resources used by the process, and enable a new Server/Collector process to do recovery processing as directed by EXTRACT.
Beginning with version 10.4.0.29 (on the target system - this is the version of Server/Collector), the "-w <seconds>" option instructs the Server/Collector to terminate if it doesn't receive any checkpoint information from EXTRACT within <seconds> seconds. For example:
RMTHOST 192.168.10.1, MGRPORT 7809, PARAMS "-w 40"
This tells the Server/Collector to terminate if it doesn't receive any checkpoint information for more than 40 seconds.
Under normal circumstances, EXTRACT will send checkpoint information at least every 10 seconds (or whatever interval is specified by the CHECKPOINTSECS parameter in the EXTRACT parameter file). If the network connection is broken, the Server/Collector will detect that it has not received any checkpoint information for longer than the specified interval and terminate.
Also available since v10.4.0.29, but less desirable, the "RMTHOST UNLOCKEDTRAILS" option (or "RMTHOST PARAMS -UL" option, or "UNLOCKEDTRAILFILES" in the GLOBALS parameter file) disables this protective mechanism.
Note: When two or more processes write to the same trail file, data corruption and loss of data integrity may result.
RFE 9425192 requests the implementation of a heartbeat mechanism between the Extract and Server/Collector processes, to enable the Server/Collector to detect that the TCP/IP connection has been lost and terminate within a reasonably short time. Transient, recoverable network problems, such as when a router is rebooted, may last longer than the value specified for the "-w <seconds>" option described above, causing the Server/Collector to terminate "too soon"; in that case EXTRACT will recover by starting a new Server/Collector process.
In the absence of RFE 9425192, the "-w <seconds>" option, or the "UNLOCKEDTRAILS" option, it is necessary to kill the orphan Server/Collector processes to recover from this situation.
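As a rough sketch of how the candidate Server/Collector processes might be listed before deciding which ones to kill: the function below filters `ps -ef` style output for commands whose binary is named `server` and that were started with `-p <port>`, as in the sample listing later in this post. The function name `list_collectors` and the column layout (command starting in column 8, as on Linux) are assumptions, not something the note specifies. It cannot tell an orphan from a healthy collector; that still has to be judged by whether the corresponding Extract is running.

```shell
#!/bin/sh
# Print "PID PPID PORT" for every process that looks like a GoldenGate
# Server/Collector, i.e. a binary named "server" started with "-p <port>".
# Reads `ps -ef` style lines on stdin so it can be tried on saved output.
# Assumption: the command starts in column 8 (UID PID PPID C STIME TTY TIME CMD).
list_collectors() {
    awk '
        $8 ~ /(^|\/)server$/ {
            # Look for the "-p <port>" argument pair after the binary name.
            for (i = 9; i < NF; i++)
                if ($i == "-p") { print $2, $3, $(i + 1); break }
        }'
}

# Typical use:
#   ps -ef | list_collectors
```

Because the match is on the binary name in the command column, the `grep server` process itself is not reported, unlike with a plain `ps -ef | grep server`.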
In the case where the NAS was unaware that the system had been rebooted, the best long-term solution is to contact the NAS vendor, who might be able to provide a utility program that can be run early in the system startup process to notify the NAS that it should release all locks owned by this system. The following procedure might offer a short-term work-around:
1. Stop all REPLICAT processes that read the trail file.
2. Stop the target MGR process.
3. Rename the trail file that cannot be locked.
4. Copy the trail file to a new file with a different name than the original trail file.
5. Rename the copy of the trail file to the original trail file name.
6. Repeat steps 2-5 for each trail file that can't be locked.
7. From the shell, kill the server (collector) process that was writing to the trail. That is, check at the OS level for orphan processes. For example, on Unix-style systems: ps -ef | grep server
If any such orphan servers exist, e.g.:
oracle 27165 1 0 11:20 ? 00:00:00 ./server -p 7840 -k -l /opt/oracle/gg/ggserr.log
then, for this particular case: kill 27165 (or kill -9 27165)
8. Start MGR.
9. Start the REPLICAT processes.
10. Re-start the extract that abended and gave this error message.
Note that this may not work, depending on the NAS and the way it keeps track of advisory file locks acquired using fcntl( F_SETLK ).
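Steps 3 to 5 of the procedure above can be sketched as a small shell function, run once per trail file that cannot be locked. This is only an illustration under stated assumptions: the function name `recreate_trail` and the `.locked` and `.copy` suffixes are hypothetical, and the idea that the fresh copy escapes the stale lock because it is a new file is an inference from the procedure, not something the note states. It should be run only while MGR and the REPLICAT processes are stopped.

```shell
#!/bin/sh
# Steps 3-5 for one trail file: rename it, copy it to a new file, then give
# the copy the original name. The copy is a newly created file, which is
# presumably why it is not covered by the stale lock on the original.
recreate_trail() {
    trail=$1
    stale="$trail.locked"   # hypothetical name for the renamed original
    copy="$trail.copy"      # hypothetical name for the temporary copy

    [ -f "$trail" ] || { echo "no such file: $trail" >&2; return 1; }
    if [ -e "$stale" ] || [ -e "$copy" ]; then
        echo "refusing to overwrite $stale or $copy" >&2
        return 1
    fi

    mv "$trail" "$stale" &&      # step 3: rename the file that cannot be locked
    cp -p "$stale" "$copy" &&    # step 4: copy it to a new file
    mv "$copy" "$trail"          # step 5: rename the copy to the original name
}

# Typical use (the trail path is a placeholder):
#   recreate_trail /path/to/dirdat/rt000001
```

The renamed original is left in place under the `.locked` suffix, so it can be compared against the copy and removed by hand once the processes have been restarted successfully.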


Edited by shunzi on 2011-12-23 17:46:41
