Subject: Hainan Unicom 200806-200809

kent_duo


Posted: 2014-9-15 14:09:49

200806 --- Database failure handling
select index_name,table_name,status from dba_indexes where status != 'VALID';

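Three failures occurred, on 4-20, 5-8 and 5-27. The 5-8 failure switched the db over to the standby host 41 (4900); the 5-27 failure switched it back to the primary host 30 (4800). The current logs show 07445 errors.

The log files show that the database's 07445 and 00600 internal errors (possibly a bug awakened by a change in the external environment) occurred frequently, leading early this morning to "Shutting down instance: further logons disabled", that is, the instance shut down and no longer accepted logins. From the material consulted, the general view is that this internal error needs a patch to fix, but other causes cannot be ruled out.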

oracle@hnora$oerr ora 07445
07445, 00000, "exception encountered: core dump [%s] [%s] [%s] [%s] [%s] [%s]"
// *Cause: An OS exception occurred which should result in the creation of a
// core file. This is an internal error.
// *Action: Contact your customer support representative.

First:

Mon May 26 17:03:18 2008
Errors in file /opt/oracle/admin/cookdb/udump/cookdb_ora_1630.trc:

ORA-07445: exception encountered: core dump [00000001005BD3D8] [SIGSEGV] [Address not mapped to object] [0x000000000] [] []

ORA-12570: TNS:packet reader failure

ORA-01013: user requested cancel of current operation

Mon May 26 17:37:44 2008
Errors in file /opt/oracle/admin/cookdb/udump/cookdb_ora_3217.trc:

ORA-07445: exception encountered: core dump [00000001005BD3D8] [SIGSEGV] [Address not mapped to object] [0x000000000] [] []

ORA-03113: end-of-file on communication channel

After that, a large number of 07445 and 00600 errors followed.
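
To put a number on how often the two internal errors appear, a small filter over the alert log can help. This is only a sketch: the function name count_ora_errors is made up for illustration, and the alert log path in the usage comment is an assumption based on the udump path shown above.

```shell
#!/bin/sh
# Count ORA-07445 and ORA-00600 lines in an alert log read from stdin.
# Prints one "<code> <count>" line per error code.
count_ora_errors() {
    awk '
        /ORA-07445/ { n7445++ }
        /ORA-00600/ { n600++ }
        END {
            printf "ORA-07445 %d\n", n7445
            printf "ORA-00600 %d\n", n600
        }
    '
}

# Usage (the path is an assumption, adjust it to the real alert log):
# count_ora_errors < /opt/oracle/admin/cookdb/bdump/alert_cookdb.log
```

Running it once per day's log would show whether the errors are getting more frequent before the instance shuts down.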
-----------------------------------------------------

200807: Slow database backup

inftmn@hngora$cat infbak.sh
#! /bin/sh
##############################################
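# Reads the backup directories from bakdir.conf in the same directory and backs up the applications.
# Everything is backed up into one directory first, then copied together to the local disk for backup.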
#---------------------------------------------
# Written by songjian on 2006-12-11
##############################################
WORK_DIR=/netwatcher/perf-data/informix_bak
WORK_DIR1=/netwatcher/gapp/inftmn/zmq/dbbak
LOG_DIR=/netwatcher/gapp/inftmn/zmq/dbbak/log
BAK_CONF=$WORK_DIR1/dbbak.conf
timestamp=`date +'%y%m%d'`
#BAKDIR=$WORK_DIR/$timestamp
#
# if [ -d ${BAKDIR} ];then
# echo OK
# else
# mkdir $timestamp
# fi
cd $WORK_DIR
$WORK_DIR1/dbbak.sh authdb>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh tmndb1>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh hnltg_rms>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh nechk>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh pmad>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh wlzy>>${LOG_DIR}/${timestamp}.log
$WORK_DIR1/dbbak.sh pmdb>>${LOG_DIR}/${timestamp}.log
#cat $BAK_CONF|grep -v '#'|while read resdir
#do
# echo $resdir
# echo '-----------'
# cd $WORK_DIR1
# echo "dbbak.sh $resdir"
# echo '============'
# dbbak.sh $resdir
# echo ${resdir} backup ok
#done
inftmn@hngora$
pmdbinftmn@hngora$cat dbbak.sh
#############################################################################
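#There are 3 files:
#gen_<database>.sql load_<database>.sql <database>data.tar.Z
#gen_<database>.sql is the SQL that creates the database
#load_<database>.sql is the SQL that loads the data
#<database>data.tar.Z holds the unl files
#First run gen_<database>.sql to create the database
#Then uncompress <database>data.tar.Z
#Then run load_<database>.sql to load the data
#It is recommended to turn off database logging during the restore,
#or to run ontape -s at the same time as the load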
#############################################################################
#!/bin/sh
. /netwatcher/gapp/inftmn/etc/crontab.env
cd /netwatcher/perf-data/informix_bak
if [ $# -ne 1 ]
then
echo "Usage:$0 <database name>"
return
fi
DBNAME=$1
dbaccess ${DBNAME} - << EOF
unload to .unload_${DBNAME}.sql
select 'unload to '||tabname||'.unl select * from '||
tabname||';' from systables where tabid>=100;
unload to .load_${DBNAME}.sql
select 'load from '||tabname||'.unl insert into '||
tabname||';' from systables where tabid>=100;
EOF
cat .unload_${DBNAME}.sql|awk -F'|' '{print $1}' > unload_${DBNAME}.sql
cat .load_${DBNAME}.sql|awk -F'|' '{print $1}' > load_${DBNAME}.sql
rm .unload_${DBNAME}.sql .load_${DBNAME}.sql
if [ -d ${DBNAME} ]
then
echo "DIR:${DBNAME} already exist!"
exit
fi
mkdir ${DBNAME};cd ${DBNAME};
dbaccess ${DBNAME} < ../unload_${DBNAME}.sql
tar cvf ${DBNAME}data.tar *.unl
rm *.unl
compress ${DBNAME}data.tar
cd ..
dbschema -d ${DBNAME} gen_${DBNAME}.sql
mv load_${DBNAME}.sql ${DBNAME}
mv gen_${DBNAME}.sql ${DBNAME}
rm unload_${DBNAME}.sql
inftmn@hngora$date
Wed Jun 25 10:49:54 CST 2008
inftmn@hngora$vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr m3 m3 m3 m3 in sy cs us sy id
0 0 0 14109984 3181848 461 713 11304 81 78 0 2 0 0 0 0 841 699 538 35 16 49
0 5 0 580584 1271648 3867 9577 45729 34 34 0 0 0 0 0 0 4533 315234 11248 35 42 23
0 7 0 581168 1268664 3910 10011 47797 2 2 0 0 0 0 0 0 4922 311067 10962 39 43 18
0 8 0 585336 1263224 3740 9939 41062 6 6 0 0 0 0 0 0 4712 310609 11682 35 42 23
0 7 0 585536 1255416 3671 9454 33709 0 0 0 0 0 0 0 0 5298 314744 11938 39 41 19
0 8 0 585792 1245912 3670 9504 43554 10 8 0 0 0 0 0 0 5128 309429 11813 35 42 23
0 7 0 586968 1245336 3893 9770 49091 24 24 0 0 0 0 0 0 4367 315100 10708 34 42 24
^Cinftmn@hngora$iostat 5
tty md310 md311 md312 md320 cpu
tin tout kps tps serv kps tps serv kps tps serv kps tps serv us sy wt id
2 230 175 22 18 164 21 15 164 21 15 2 0 49 35 16 6 43
0 47 1156 163 7 1156 163 6 1156 163 6 2 0 10 45 41 11 3
0 17 1368 175 31 1368 175 26 1366 175 24 0 0 0 37 43 15 5
0 17 1087 156 8 1087 156 6 1087 156 7 0 0 0 36 41 18 5
0 17 1168 166 8 1166 166 8 1168 166 7 0 0 0 49 44 5 2
0 17 1155 154 6 1155 154 5 1153 154 5 0 0 0 51 43 4 2
0 17 1110 158 8 1109 158 6 1110 158 6 0 0 0 52 42 4 2
0 17 1042 152 10 1041 151 8 1042 152 8 32 4 9 56 38 5 1
^Cinftmn@hngora$
inftmn@hngora$ls -l */
authdb/:
total 166
-rw-r--r-- 1 inftmn tmn 37957 Jun 23 13:18 authdbdata.tar.Z
-rw-r--r-- 1 inftmn tmn 43810 Jun 23 13:18 gen_authdb.sql
-rw-r--r-- 1 inftmn tmn 1856 Jun 23 13:18 load_authdb.sql
hnltg_rms/:
total 380286
-rw-r--r-- 1 inftmn tmn 149874 Jun 23 13:48 gen_hnltg_rms.sql
-rw-r--r-- 1 inftmn tmn 194428539 Jun 23 13:45 hnltg_rmsdata.tar.Z
-rw-r--r-- 1 inftmn tmn 7070 Jun 23 13:22 load_hnltg_rms.sql
nechk/:
total 1410
-rw-r--r-- 1 inftmn tmn 70090 Jun 23 13:48 gen_nechk.sql
-rw-r--r-- 1 inftmn tmn 3642 Jun 23 13:48 load_nechk.sql
-rw-r--r-- 1 inftmn tmn 638134 Jun 23 13:48 nechkdata.tar.Z
pmad/:
total 202
-rw-r--r-- 1 inftmn tmn 33963 Jun 23 13:48 gen_pmad.sql
-rw-r--r-- 1 inftmn tmn 1841 Jun 23 13:48 load_pmad.sql
-rw-r--r-- 1 inftmn tmn 65709 Jun 23 13:48 pmaddata.tar.Z
pmdb/:
total 17816040
-rw-r--r-- 1 inftmn tmn 730796 Jun 24 12:32 gen_pmdb.sql
-rw-r--r-- 1 inftmn tmn 28134 Jun 23 13:49 load_pmdb.sql
-rw-r--r-- 1 inftmn tmn 9116568717 Jun 24 10:00 pmdbdata.tar.Z
tmndb1/:
total 27650
-rw-r--r-- 1 inftmn tmn 318066 Jun 23 13:22 gen_tmndb1.sql
-rw-r--r-- 1 inftmn tmn 16940 Jun 23 13:18 load_tmndb1.sql
-rw-r--r-- 1 inftmn tmn 13800481 Jun 23 13:21 tmndb1data.tar.Z
wlzy/:
total 3036
-rw-r--r-- 1 inftmn tmn 89558 Jun 23 13:49 gen_wlzy.sql
-rw-r--r-- 1 inftmn tmn 5604 Jun 23 13:48 load_wlzy.sql
-rw-r--r-- 1 inftmn tmn 1447947 Jun 23 13:48 wlzydata.tar.Z
inftmn@hngora$pwd

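This shows that the performance database (pmdb) holds too much historical data: it is still about 9 GB after compression, and the real size is estimated at over 20 GB. Under the network-management guidelines, keeping 3 months of historical data is enough.

Checked crontab: it has a great many entries, but no obvious index-update shell script.

Updated the indexes to see what effect that has. Only the small databases (everything except pmdb) were updated. There was some improvement, though not a marked one: the vm and io figures improved by about 20%. The improvement does not necessarily come from the index update; the system may simply have been idle. The pmdb indexes should be updated at midday and the test repeated.

Are there too few locks? Too few LRU queues? Too little cache?

Check the cell-related tables in pmdb: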

inftmn@hngora$cat loa*|grep _cell
load from h_cell.unl insert into h_cell;
load from c_cell.unl insert into c_cell;
load from p_cell_day.unl insert into p_cell_day;
load from p_cell_week.unl insert into p_cell_week;
load from p_cell_weekavg.unl insert into p_cell_weekavg;
load from p_cell_month.unl insert into p_cell_month;
load from p_cell_monthavg.unl insert into p_cell_monthavg;
load from t_apg40bsc_cellcchdr.unl insert into t_apg40bsc_cellcchdr;
load from t_apg40bsc_cellcchho.unl insert into t_apg40bsc_cellcchho;
load from t_apg40bsc_cellhcs.unl insert into t_apg40bsc_cellhcs;
load from t_apg40bsc_cellsqi.unl insert into t_apg40bsc_cellsqi;
load from t_rpt_10_cell.unl insert into t_rpt_10_cell;
load from t_rpt_11_cell.unl insert into t_rpt_11_cell;
load from t_p_cell.unl insert into t_p_cell;
load from t_p_cell_norm.unl insert into t_p_cell_norm;
load from t_eric_cellsqi.unl insert into t_eric_cellsqi;
load from t_eric_cellcchho.unl insert into t_eric_cellcchho;
load from t_p_pcu_cell.unl insert into t_p_pcu_cell;
load from v_avg_t_p_pcu_cell.unl insert into v_avg_t_p_pcu_cell;
load from t_p_cell_his.unl insert into t_p_cell_his;
load from t_p_cell_norm_his.unl insert into t_p_cell_norm_his;
load from t_p_cell_m2000.unl insert into t_p_cell_m2000;
load from t_p_cell_norm2000.unl insert into t_p_cell_norm2000;
inftmn@hngora$

select count(*) from p_cell_day; returns 1959440; the data runs from September 2006 to the present.

select count(*) from t_p_cell; returns 5298768; 5309481
p_cell_week: 272874
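
The counts above were taken table by table. As a rough sketch, the same load_*.sql lines shown earlier could be turned into count statements for every cell table in one pass. The function name gen_count_sql is hypothetical, and the input is assumed to have the "load from X.unl insert into X;" shape seen above.

```shell
#!/bin/sh
# Turn "load from X.unl insert into X;" lines (stdin) into count queries.
gen_count_sql() {
    awk '$1 == "load" && $2 == "from" {
        t = $NF                 # last field is "<table>;"
        sub(/;$/, "", t)        # strip the trailing semicolon
        printf "select count(*) from %s;\n", t
    }'
}

# Usage, feeding the result to dbaccess the same way dbbak.sh does:
# cat loa* | grep _cell | gen_count_sql | dbaccess pmdb -
```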

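And so on. At this point the cause is essentially established: with so much historical data, the unload backup needs more time. A year ago it took 5-8 hours; the data has more than doubled since, so ten-plus hours is normal.

Action: the project team needs to purge the historical data.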

inftmn@hngora$crontab -l

During the early-morning idle period, gzip was run on a 66 GB unloaded database file:

Thu Jun 26 01:35:23 CST 2008
Thu Jun 26 05:08:24 CST 2008

That took about 3.5 hours! In practice gzip compresses better than compress: gzip produced 6.9 GB, while compress produced 9.2 GB.
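
The arithmetic behind these figures can be checked with a few lines of shell. This is a sketch: the helper names elapsed_seconds and ratio_percent are made up here, both timestamps are assumed to fall on the same day, and both compressed sizes are assumed to refer to the same 66 GB input.

```shell
#!/bin/sh
# Seconds between two HH:MM:SS times on the same day.
elapsed_seconds() {
    echo "$1 $2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

# Compressed size as a percentage of the original, one decimal place.
ratio_percent() {
    echo "$1 $2" | awk '{ printf "%.1f\n", $1 / $2 * 100 }'
}

elapsed_seconds 01:35:23 05:08:24   # 12781 s, about 3 h 33 min
ratio_percent 6.9 66                # gzip result: 10.5
ratio_percent 9.2 66                # compress result: 13.9
```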

In other words, the compression stage of the backup takes close to 4 hours.
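
One possible way to shorten this stage, offered only as a sketch and not something tried in this log, is to pipe tar straight into gzip so that no intermediate .tar file is written and then read back. The function name pack_unl and the paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Archive all *.unl files in directory $1 into $2 (a .tar.gz path),
# streaming tar output into gzip so no intermediate .tar file is created.
pack_unl() {
    ( cd "$1" && tar cf - *.unl ) | gzip > "$2"
}

# Usage (hypothetical paths):
# pack_unl ./pmdb ./pmdbdata.tar.gz
```

A restore would then use gzip -dc piped into tar xf - in place of the uncompress step described in the dbbak.sh header comments.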

After the index update, I/O idle during off-peak periods has reached about 20%.

------------------------------------------------
20080810: compile environment set up.
On 10.21.0.41, add one line to the oratmn user's .profile, like this:
. ./setenv_proc
oratmn@hngora$cat setenv_proc
ORACLE_BASE=/opt/oracle
ORACLE_HOME=$ORACLE_BASE/product/9.2.0
NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS'
LD_LIBRARY_PATH=$ORACLE_HOME/lib:$ORACLE_HOME/jdbc/lib:$LD_LIBRARY_PATH
PATH=$ORACLE_HOME/bin:$PATH
export ORACLE_BASE ORACLE_HOME NLS_LANG NLS_DATE_FORMAT LD_LIBRARY_PATH PATH
After that, Pro*C (proc) applications run normally.
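
A quick sanity check after sourcing setenv_proc might look like the sketch below. The function name check_proc_env is hypothetical; the variable list is taken from the export line above.

```shell
#!/bin/sh
# Report which of the variables exported by setenv_proc are unset or empty.
# Returns non-zero if any of them is missing.
check_proc_env() {
    missing=0
    for v in ORACLE_BASE ORACLE_HOME NLS_LANG NLS_DATE_FORMAT LD_LIBRARY_PATH PATH
    do
        eval "val=\${$v:-}"
        if [ -z "$val" ]; then
            echo "missing: $v"
            missing=1
        fi
    done
    return $missing
}

# Usage:
# . ./setenv_proc && check_proc_env
```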
----------------------------------------------------
20080910 --- Work log: Oracle database behaving abnormally

Problem 1:

10.21.0.41(informix/informix)
/netwatcher/informix/control_file

Load the data:

./unload_alarm_from_inf_to_ora.sh pbcattbl

Error reported:

SQL*Loader: Release 9.2.0.8.0 - Production on Wed Sep 3 16:42:17 2008
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
SQL*Loader-704: Internal error: ulconnect: OCIServerAttach [0]
ORA-12500: TNS:listener failed to start a dedicated server process

Problem 2:

10.21.0.41(oratmn/passwd)

Path: /opt/tmn/oratmn/

Load the environment variables: . setenvora

cd bin
tmboot -s alarm_rec

Error reported:

INFO: BEA Tuxedo, Version 8.1, 32-bit, Patch Level 231
INFO: Serial #: 650522264138-1365162828289, Expiration NONE, Maxusers 60
INFO: Licensed to: China CUC
Booting server processes ...
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
exec alarm_rec -A :
CMDTUX_CAT:1685: ERROR: Application initialization failure
0 processes started.

Actual troubleshooting steps:

dbaccess tmndb1
unload to pbcattbl.txt select * from pbcattbl;
$ cat pbcattbl.txt
metatasks|0|tmn|-12|400|N|N|134|33|Tahoma|-12|400|N|N|134|33|Tahoma|-12|400|N|N|134|33|Tahoma|\ |
ORACLE_HOME=/opt/dyora/product/9.2.0
ORACLE_SID=ora9i
export ORACLE_HOME ORACLE_SID
$ORACLE_HOME/bin/sqlldr tmndb1/tmndb1 control=pbcattbl.ctl
SQL*Loader: Release 9.2.0.8.0 - Production on Wed Sep 10 14:39:09 2008
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Commit point reached - logical record count 1
$ $ORACLE_HOME/bin/sqlplus tmndb1/tmndb1
ok
$ $ORACLE_HOME/bin/sqlplus tmndb1/tmndb1@ora9i
SQL*Plus: Release 9.2.0.8.0 - Production on Wed Sep 10 15:15:37 2008
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
ERROR:
ORA-12500: TNS:listener failed to start a dedicated server process
$ $ORACLE_HOME/bin/sqlldr tmndb1/tmndb1@ora9i control=pbcattbl.ctl
SQL*Loader: Release 9.2.0.8.0 - Production on Wed Sep 10 16:01:26 2008
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
SQL*Loader-704: Internal error: ulconnect: OCIServerAttach [0]
ORA-12500: TNS:listener failed to start a dedicated server process

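As shown above, local connections to the database work, but network connections through the listener port fail, and no new connection processes can be created.

Restarting the listener was tried, with no effect. On shutting down the database, the shutdown could not complete normally, and a large number of useless processes were left running. The database processes were killed by force; on startup the database then reported insufficient memory. A look at the system processes showed that the system is poorly configured and maintained: several hundred hung sendmail processes had never been released. There are other processes as well that the project team should review and deal with as appropriate.

After the unnecessary processes were cleaned up, the database started normally and the system returned to normal. The two problems above were resolved as a result.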

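Summary: this failure happened because a large number of dead processes were holding too much memory, so new database server processes could not start and no new connections could be established.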
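
A pile-up like the hundreds of sendmail processes is easy to spot by counting processes per command name. The sketch below assumes a ps that supports the POSIX -o comm= option; the function name top_commands is made up for illustration.

```shell
#!/bin/sh
# Read command names (one per line) on stdin and print "<count> <name>"
# for the most frequent ones, highest count first. $1 limits the number
# of lines printed (default 10).
top_commands() {
    sort | uniq -c | sort -rn | head -n "${1:-10}" | awk '{ print $1, $2 }'
}

# Usage:
# ps -e -o comm= | top_commands 5
```

Run from cron with a threshold check, something like this could flag runaway process counts before they exhaust memory.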
