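A friend's historical (archive) database failed. It is a single-instance 10.2.0.1 database on ASM, running on a Linux 4 platform, with the ASM disks managed through ASMLib.
ASM could not mount its disk groups even though the ASM disks were visible, and the alert log reported ORA-00600 [kfklLibFetchNext00].
Operating system kernel: 2.6.9-78
oracleasmlib: 2.0.2-1
ASM disk group mount failure
--Earlier failure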
SQL> ALTER DISKGROUP ALL MOUNT
Thu Sep 6 14:23:16 2012
NOTE: cache registered group DGARC number=1 incarn=0x2bf96274
NOTE: cache registered group DGDATA number=2 incarn=0x2c196275
NOTE: cache registered group DGSYS number=3 incarn=0x2c196276
Thu Sep 6 14:23:16 2012
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc:
ORA-15183: ASMLIB initialization error [driver/agent not installed]
Thu Sep 6 14:23:16 2012
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc:
ORA-15183: ASMLIB initialization error [/opt/oracle/extapi/64/asm/orcl/1/libasm.so]
ORA-15183: ASMLIB initialization error [driver/agent not installed]
Thu Sep 6 14:23:16 2012
ERROR: no PST quorum in group 1: required 2, found 0
Thu Sep 6 14:23:16 2012
NOTE: cache dismounting group 1/0x2BF96274 (DGARC)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGARC was not mounted
Thu Sep 6 14:23:16 2012
ERROR: no PST quorum in group 2: required 2, found 0
Thu Sep 6 14:23:16 2012
NOTE: cache dismounting group 2/0x2C196275 (DGDATA)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGDATA was not mounted
Thu Sep 6 14:23:16 2012
ERROR: no PST quorum in group 3: required 2, found 0
Thu Sep 6 14:23:16 2012
NOTE: cache dismounting group 3/0x2C196276 (DGSYS)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGSYS was not mounted
--Current failure
Thu Jan 24 13:49:45 2013
SQL> ALTER DISKGROUP ALL MOUNT
Thu Jan 24 13:49:45 2013
NOTE: cache registered group DGARC number=1 incarn=0xf388cee9
NOTE: cache registered group DGDATA number=2 incarn=0xf3a8ceea
NOTE: cache registered group DGSYS number=3 incarn=0xf3a8ceeb
Thu Jan 24 13:49:45 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc:
ORA-00600: internal error code, arguments: [kfklLibFetchNext00],
[18446744073709551614], [0], [], [], [], [], []
Thu Jan 24 13:49:46 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc:
ORA-00600: internal error code, arguments: [kfklLibFetchNext00],
[18446744073709551614], [0], [], [], [], [], []
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 1: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 1/0xF388CEE9 (DGARC)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGARC was not mounted
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 2: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 2/0xF3A8CEEA (DGDATA)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGDATA was not mounted
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 3: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 3/0xF3A8CEEB (DGSYS)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGSYS was not mounted
Shutting down instance: further logons disabled
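To compare the two failures quickly, the ORA- codes in each alert log excerpt can be tallied. The following is a minimal sketch, not part of the original diagnosis; the function name count_ora_errors is made up for illustration, and the alert log file name in the usage comment is an assumption based on the bdump directory shown above.

```shell
#!/bin/sh
# Count each ORA- error code found in an alert log excerpt read from stdin.
# Prints one "<count> <code>" line per distinct code, sorted by code.
count_ora_errors() {
    grep -o 'ORA-[0-9][0-9]*' | sort | uniq -c | awk '{print $1, $2}'
}

# Hypothetical usage (the file name alert_+ASM.log is an assumption):
#   count_ora_errors < /opt/app/oracle/admin/+ASM/bdump/alert_+ASM.log
```

On the 2012 excerpt this would list only ORA-15183, and on the 2013 excerpt only ORA-00600, which makes the change in symptom between the two startups easy to see.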
Trace file information
----- Call Stack Trace -----
calling                  call     entry                argument values in hex
location                 type     point                (? means dubious value)
------------------------ -------- -------------------- ----------------------------
ksedst()+31              call     ksedst1()            000000000 ? 000000001 ?
                                                       000000000 ? 000000000 ?
                                                       000000000 ? 000000001 ?
ksedmp()+610             call     ksedst()             000000000 ? 000000001 ?
                                                       000000000 ? 000000000 ?
                                                       000000000 ? 000000001 ?
ksfdmp()+21              call     ksedmp()             000000003 ? 000000001 ?
                                                       000000000 ? 000000000 ?
                                                       000000000 ? 000000001 ?
kgerinv()+161            call     ksfdmp()             000000003 ? 000000001 ?
                                                       000000000 ? 000000000 ?
                                                       000000000 ? 000000001 ?
kgesinv()+33             call     kgerinv()            006469D40 ? 0064E1C58 ?
                                                       000000000 ? 000000000 ?
                                                       000000001 ? 000000001 ?
kgesinw()+166            call     kgesinv()            006469D40 ? 0064E1C58 ?
                                                       000000000 ? 000000000 ?
                                                       000000001 ? 000000001 ?
kfklLibScanNext()+239    call     kgesinw()            006469D40 ? 000000000 ?
                                                       000000001 ? 000000000 ?
                                                       FFFFFFFFFFFFFFFE ?
                                                       000000000 ?
kfkLibFetchNext()+343    call     kfklLibScanNext()    0064DDD70 ? 7FBFFFDCD0 ?
                                                       000000001 ? 000000000 ?
                                                       FFFFFFFFFFFFFFFE ?
                                                       000000000 ?
kfuitrnInit()+524        call     kfkLibFetchNext()    006469D40 ? 2A971DFF90 ?
                                                       000000001 ? 000000000 ?
                                                       FFFFFFFFFFFFFFFE ?
                                                       000000000 ?
kfkLibIterInit()+180     call     kfuitrnInit()        006469D40 ? 2A971DFCB0 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
kfkLoadAllLibs()+363     call     kfkLibIterInit()     000000000 ? 00646C7E0 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
kfkDiscoverString()+107  call     kfkLoadAllLibs()     000000000 ? 00646C7E0 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
Cannot find symbol
Cannot find symbol
Cannot find symbol
kfdDiscoverString()+28   call     kfkDiscoverString()  067A53768 ? 00646C7E0 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
kfdDiscoverShallow()+315 call     kfdDiscoverString()  067A53768 ? 000000000 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
kfgbDriver()+1174        call     kfdDiscoverShallow() 000000180 ? 000000000 ?
                                                       2A971DFF90 ? 000000009 ?
                                                       000000009 ? 000000000 ?
ksbabs()+564             call     kfgbDriver()         7FBFFFE5C0 ? 000000048 ?
                                                       000000000 ? 000000009 ?
                                                       000000009 ? 000000000 ?
ksbrdp()+727             call     ksbabs()             7FBFFFE5C0 ? 000000048 ?
                                                       000000000 ? 000000009 ?
                                                       000000009 ? 000000000 ?
opirip()+616             call     ksbrdp()             7FBFFFE5C0 ? 000000048 ?
                                                       000000001 ? 06002C770 ?
                                                       000000009 ? 000000000 ?
opidrv()+582             call     opirip()             000000032 ? 000000004 ?
                                                       7FBFFFF6C8 ? 06002C770 ?
                                                       000000009 ? 000000000 ?
sou2o()+114              call     opidrv()             000000032 ? 000000004 ?
                                                       7FBFFFF6C8 ? 06002C770 ?
                                                       000000009 ? 000000000 ?
opimai_real()+317        call     sou2o()              7FBFFFF6A0 ? 000000032 ?
                                                       000000004 ? 7FBFFFF6C8 ?
                                                       000000009 ? 000000000 ?
main()+116               call     opimai_real()        000000003 ? 7FBFFFF730 ?
                                                       000000004 ? 7FBFFFF6C8 ?
                                                       000000009 ? 000000000 ?
<0x3c9fb1c40b>           call     main()               000000003 ? 7FBFFFF730 ?
                                                       000000004 ? 7FBFFFF6C8 ?
                                                       000000009 ? 000000000 ?
--------------------- Binary Stack Dump ---------------------
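Because the customer's database is a historical one that is rarely used, ASM first hit ORA-15183 when it was started in 2012. After the machine was rebooted in 2013, starting ASM produced ORA-00600 [kfklLibFetchNext00] instead. The 2012 error message already suggests that the problem is related to ASMLib. A search on MOS turned up note 429945.1, whose Call Stack Trace matches ours, so we can conclude that this is the same issue (for a deeper analysis, strace can be used to dig further).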
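One small cross-check between the alert log and the trace file: the second ORA-00600 argument, 18446744073709551614, is the same value as the FFFFFFFFFFFFFFFE that appears in the kfklLibScanNext, kfkLibFetchNext and kfuitrnInit frames. A minimal sketch of the conversion, with to_hex64 as a made-up helper name:

```shell
#!/bin/sh
# Print an unsigned decimal number as uppercase hexadecimal.
to_hex64() {
    printf '%X\n' "$1"
}

# Second argument of the ORA-00600 from the alert log.
to_hex64 18446744073709551614   # prints FFFFFFFFFFFFFFFE
```

Read as a signed 64-bit integer, that bit pattern is -2. What the value means inside ASMLib is not stated in the trace or in the MOS note, so this only confirms that the alert log and the call stack describe the same failure.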
ORA-600: [kfklLibFetchNext00], [18446744073709551614], [0] when mounting diskgroup in ASM
Applies to:
Linux OS - Version: 2.0.1-1 and later [Release: RHEL4 and later ]
Information in this document applies to any platform.
Linux Kernel - Version: 2.0.1
Symptoms
3 RAC db.
2 nodes are up and functioning except for 1 node - ASM did not come back up after
the reboot even though all disks show available from asmlib's perspective:
Changes
All that was done with resources were stopped on Node1 and an extra LUN added.
A reboot was then performed.
Cause
The cause of the issue is libasm.o corruption
Ran the following to confirm that disks are ok:
/dev/oracleasm listdisks
/usr/sbin/asmtool -I -l /dev/oracleasm -n /dev/sdg1 -a label
/usr/sbin/oracleasm-discover 'ORCL:*'
dd if=/dev/sdg1 bs=8192 count=1 | od -c
==> output checked out fine
.
kfod asm_diskstring='ORCL:*'
==> this failed on Node1
KFOD-00600: file not found; argument [610][kfklLibFetchNext00] even though libasm.o exists
You might see the following call stack as well
----- Call Stack Trace -----
kfklLibScanNext
kfkLibFetchNext
kfuitrnInit
kfkLibIterInit
kfkLoadAllLibs
kfkDiscoverString
kfdDiscoverString
kfdDiscoverShallow
kfgbDriver
strace showed
Node1-failing
-------
stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0777, st_size=19344, ...}) = 0
getdents64(4, /* 0 entries */, 4096) = 0 <<<<
close(4) = 0
open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1 ENOENT (No such file or directory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a9750d000
write(1, "KFOD-00600: file not found; argu"..., 69) = 69
Node2-working
-----
stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
open("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", O_RDONLY) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\23\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
mmap(NULL, 1066104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x2a9750d000
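The visible difference between the two strace excerpts is that the working node goes on to open() and mmap() libasm.so after the stat(), while the failing node never opens it. A minimal sketch for checking a captured strace log for that pattern follows. The function name libasm_open_status and the capture path /tmp/kfod.strace are assumptions for illustration and do not come from the MOS note.

```shell
#!/bin/sh
# Read strace output from stdin and report whether libasm.so was
# successfully opened (an open/openat call that returned a file descriptor).
libasm_open_status() {
    if grep -q 'open[a-z]*(.*libasm\.so".*) *= *[0-9]'; then
        echo "opened"
    else
        echo "not opened"
    fi
}

# Hypothetical capture and check on the affected host (not run here):
#   strace -f -o /tmp/kfod.strace kfod asm_diskstring='ORCL:*'
#   libasm_open_status < /tmp/kfod.strace
```

A result of "not opened" on a host where the stat() of libasm.so succeeds would match the Node1 pattern above. It does not by itself prove the library is damaged.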
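Based on the MOS description, the problem can be pinned down: it is caused by a damaged libasm.o, that is, the ASMLib library installed as /opt/oracle/extapi/64/asm/orcl/1/libasm.so.
Solution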
To implement the solution, reinstall the ASMlib RPM
>rpm -Uvh oracleasmlib-2.0.0-1
This replaces the /opt/oracle/extapi/64/asm/orcl/1/libasm.so
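After the reinstall, a quick sanity check of the replaced library can be useful before retrying the mount. The sketch below is not part of the MOS note. The helper is_elf is a made-up name, and it only confirms that the file starts with the ELF magic bytes (the same \177ELF that the working node reads in the strace output), which cannot rule out every kind of damage.

```shell
#!/bin/sh
# Succeed if the given file starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# Hypothetical follow-up checks on the affected host (not run here):
#   rpm -V oracleasmlib    # verify installed files against the RPM database
#   rpm -qf /opt/oracle/extapi/64/asm/orcl/1/libasm.so    # owning package
#   is_elf /opt/oracle/extapi/64/asm/orcl/1/libasm.so && echo "ELF header present"
```

If rpm -V prints nothing for the package and the header check passes, ALTER DISKGROUP ALL MOUNT can be retried in the ASM instance.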
--Reposted from