通过DBCA在服务器上创建一个额外的实例,结束后以外发现,原有的数据库实例出现了异常。
简单描述一下系统环境,Oracle 11.2.0.2 for Linux x86-64,随后又安装了GRID,创建了ASM磁盘组。原有的数据库有部分表空间和REDO存放在ASM磁盘组中。
为了测试,通过DBCA创建了额外的实例,事实上,通过DBCA又先后创建了两个数据库实例,第二个数据库实例创建后一切正常,将第二个实例关闭,通过DBCA创建第三个数据库实例时,导致服务器上的第一个实例出现了异常。
检查实例1上的告警日志,可以发现大量下面的错误信息:
2011-08-30 10:20:54.502000 +08:00
Process W000 died, see its trace file
2011-08-30 10:20:58.518000 +08:00
Process W000 died, see its trace file
2011-08-30 10:21:02.524000 +08:00
Process W000 died, see its trace file
2011-08-30 10:21:06.527000 +08:00
Process W000 died, see its trace file
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19030.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1000 (oinstall), current egid = 1200 (dba)
2011-08-30 10:21:07.536000 +08:00
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_cjq0_12308.trc:
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19036.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1000 (oinstall), current egid = 1200 (dba)
数据库并非简单的报错而已,实际上即使是SQLPLUS / AS SYSDBA登录数据库,进行任何操作也会显示没有连接到数据库,因此对于实例1来说,只能执行SHUTDOWN ABORT关闭实例并重新启动。
查询METALINK发现不少类似的情况,导致这个错误产生的原因就是ORA_CRS_HOME/bin下的oracle可执行文件或者ORACLE_HOME/bin下的oracle可执行文件发生修改所致。虽然从随后的ORA-2730X信息中无法判断是GRID用户下还是ORACLE用户下的oracle文件被修改,但是可以确定的是,oracle可执行文件的组信息被修改,原本应该是oinstall,而现在变成了dba。
分别检查ORA_CRS_HOME/bin和ORACLE_HOME/bin下的oracle文件,发现是ORACLE目录下的ORACLE_HOME/bin/oracle发生了改变:
[oracle@dbserver1 ~]$ cd $ORACLE_HOME/bin
[oracle@dbserver1 bin]$ ls -l oracle
-rwsr-s--x 1 oracle dba 228881982 Jul 26 16:55 oracle
[oracle@dbserver1 bin]$ ls -l |grep dba
-rwsr-s--x 1 oracle dba 228881982 Jul 26 16:55 oracle
-rwxr-xr-x 1 oracle oinstall 32372 Jul 26 16:55 osdbagrp
-rwxr-xr-x 1 oracle oinstall 35677 Aug 20 2010 osdbagrp0
可以看到,只有oracle文件变成了dba组,bin目录下所有其他文件仍然是oinstall组。为了对比,在另外一个服务器上,有一台同时搭建的相同环境的oracle,检查oracle文件的组信息:
[oracle@dbserver2 ~]$ cd $ORACLE_HOME/bin
[oracle@dbserver2 bin]$ ls -l oracle
-rwsr-s--x. 1 oracle oinstall 217733745 Apr 27 14:02 oracle
[oracle@dbserver2 bin]$ ls -l |grep dba
-rwxr-xr-x. 1 oracle oinstall 32372 Apr 27 14:02 osdbagrp
-rwxr-xr-x. 1 oracle oinstall 35677 Aug 20 2010 osdbagrp0
在这个服务器上,oracle文件的组属性仍然是oinstall,这说明节点1上的oracle组属性确实被修改了。
前面提到了,利用DBCA添加第二个实例的时候,数据库并未出现问题,而添加第三个实例的时候,才导致oracle组信息变化,那么同样是dbca的操作,为什么有所区别呢。对比第二个和第三个实例的区别,问题应该和是否使用ASM有关。第二个实例的所有文件都存储在本地硬盘,而第三个实例的所有文件都保存在ASM磁盘组中,甚至会和实例1共用一些磁盘组,应该就是这个差别导致了问题的产生。
检查ASM告警日志:
2011-08-30 10:21:08.853000 +08:00
NOTE: client al32utf8:al32utf8 registered, osid 19095, mbr 0x1
2011-08-30 10:53:27.195000 +08:00
NOTE: client al32utf8:al32utf8 deregistered
2011-08-30 10:53:31.248000 +08:00
Starting background process ASMB
ASMB started with pid=42, OS id=23251
NOTE: client +ASM:+ASM registered, osid 23253, mbr 0x1
2011-08-30 10:53:36.804000 +08:00
NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
2011-08-30 10:53:51.372000 +08:00
NOTE: client al32utf8:al32utf8 registered, osid 23441, mbr 0x1
可以看到,就是第三个实例注册到ASM的时刻,在实例1上出现了错误。应该就是这个时刻,oracle文件的组信息发生了变化。
检查一下实例3上的告警日志:
2011-08-30 10:20:53.722000 +08:00
Adjusting the default value of parameter parallel_max_servers
from 960 to 135 due to the value of parameter processes (150)
Starting ORACLE instance (normal)
2011-08-30 10:20:55.149000 +08:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Shared memory segment for instance monitoring created
Picked latch-free SCN scheme 3
2011-08-30 10:20:56.823000 +08:00
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/11.2.0/dbhome_1/dbs/arch
Autotune of undo retention is turned on.
IMODE=BR
ILAT =29
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
Using parameter settings in client-side pfile /u01/app/oracle/admin/al32utf8/pfile/init.ora on machine dbserver1
System parameters with non-default values:
processes = 150
memory_target = 6432M
db_block_size = 8192
compatible = "11.2.0.0.0"
db_create_file_dest = "+DATA"
undo_tablespace = "UNDOTBS1"
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=al32utf8XDB)"
local_listener = "LISTENER_AL32UTF8"
audit_file_dest = "/u01/app/oracle/admin/al32utf8/adump"
audit_trail = "DB"
db_name = "al32utf8"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
2011-08-30 10:21:06.647000 +08:00
PMON started with pid=2, OS id=19032
PSP0 started with pid=3, OS id=19034
2011-08-30 10:21:07.721000 +08:00
VKTM started with pid=4, OS id=19038 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
GEN0 started with pid=5, OS id=19042
DIAG started with pid=6, OS id=19044
DBRM started with pid=7, OS id=19046
DIA0 started with pid=8, OS id=19048
MMAN started with pid=9, OS id=19050
DBW0 started with pid=10, OS id=19052
DBW1 started with pid=11, OS id=19054
DBW2 started with pid=12, OS id=19056
LGWR started with pid=13, OS id=19058
CKPT started with pid=14, OS id=19060
SMON started with pid=15, OS id=19062
RECO started with pid=16, OS id=19064
MMON started with pid=17, OS id=19066
MMNL started with pid=18, OS id=19068
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
ORACLE_BASE from environment = /u01/app/oracle
CREATE DATABASE "al32utf8"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
DATAFILE SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE SIZE 2048M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE SIZE 8192M AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE SIZE 8192M AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED
CHARACTER SET AL32UTF8
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 SIZE 1024M,
GROUP 2 SIZE 1024M,
GROUP 3 SIZE 1024M
USER SYS IDENTIFIED BY *USER SYSTEM IDENTIFIED BY
Starting background process ASMB
ASMB started with pid=22, OS id=19092
2011-08-30 10:21:08.901000 +08:00
Starting background process RBAL
RBAL started with pid=23, OS id=19105
NOTE: initiating MARK startup
Starting background process MARK
MARK started with pid=24, OS id=19108
NOTE: MARK has subscribed
2011-08-30 10:21:12.325000 +08:00
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
Database mounted in Exclusive Mode
Lost write protection disabled
ERROR: failed to establish dependency between database al32utf8 and diskgroup resource ora.DATA.dg
2011-08-30 10:21:36.862000 +08:00
Successful mount of redo thread 1, with mount id 3103441300
Assigning activation ID 3103441300 (0xb8fac194)
Thread 1 opened at log sequence 1
Current log# 1 seq# 1 mem# 0: +DATA/al32utf8/onlinelog/group_1.265.760530073
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
processing ?/rdbms/admin/dcore.bsq
create tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL online
2011-08-30 10:21:44.431000 +08:00
Completed: create tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
可以确认实例1出现问题的时刻,正是实例3启动并在ASM磁盘组中创建表空间的时刻。
了解了问题的原委后,这个问题并不难解决,只需要重启实例1,就可以解决了。但是这个问题对于包含ASM的产品环境而言,存在很大的隐患,使用DBCA建立新库,会影响现有正在运行的数据库,这是一个埋藏比较深,且有可能造成严重后果的bug。
这个错误应该只会影响11.2以上版本,11.1以前的ASM在ORACLE_HOME目录下,应该不会导致这个问题的产生。