1 问题描述
XX银行8月20日与8月24日出现系统故障,提示weblogic拒绝连接,所有柜面业务中断。日志中伴随有jdbc连接池报错。之后三分钟左右,连接池为不可用状态。之后系统恢复正常。
2 问题分析
接下来的分析主要针对日志中反应的当时的服务器(其上JDBC连接池)运行情况加以描述。
局方现场工程师指出问题出现当时Teller的5个节点,有四个都出现了相同的问题,导致连接池不可用。
下面以Teller1为例分析当时的日志变化情况。
在8月20上午 10点05分第一次出现下列信息
####<2012-8-20 上午10时05分30秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345428330022> <BEA-000627> <Reached maximum capacity of pool "teller", making "0" new resource instances instead of "5".>
Reached maximum capacity of pool "teller" 很明显此时jdbc连接池使用已经达到了最大值,不能再开辟连接资源了,并且接下来的一段时间里,都处于达到最大值的状态。
2012-8-20 上午10时06分31秒应用报出了weblogic.jdbc.extensions.PoolLimitSQLException 异常,这个异常就是连接池已满造成的,如下列信息:
weblogic.jdbc.extensions.PoolLimitSQLException: weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool teller to allocate to applications, please increase the size of the pool and retry..
at weblogic.jdbc.common.internal.JDBCUtil.wrapAndThrowResourceException(JDBCUtil.java:243)
at weblogic.jdbc.pool.Driver.connect(Driver.java:161)
at weblogic.jdbc.jts.Driver.getNonTxConnection(Driver.java:647)
at weblogic.jdbc.jts.Driver.connect(Driver.java:137)
at weblogic.jdbc.common.internal.RmiDataSource.getConnection(RmiDataSource.java:359)
at com.dc.datastore.JDBCDataStoreImpl.getConnection(JDBCDataStoreImpl.java:695)
at com.dc.datastore.JDBCDataStoreImpl.getConnection(JDBCDataStoreImpl.java:523)
at com.dc.datastore.DataStore.getConnection(Unknown Source)
at com.dcfs.platform.sql.DBDynamicSQL2.<init>(DBDynamicSQL2.java:70)
at com.dcfs.platform.sql.DBDynamicSQL2.getInstance(DBDynamicSQL2.java:80)
at com.dcfs.platform.servlet.QryICDevServlet.doPost(QryICDevServlet.java:44)
接下来伴随Reached maximum capacity of pool "teller"出现了creating connection IO异常,且都发生在10点6分这一段
####<2012-8-20 上午10时06分40秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '35' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345428400818> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
从这里看出此时jdbc与数据库的IO通信有异常连接被reset,网络以及数据库关闭都可能造成这一现象。
在2012-8-20 上午10时06分54秒 开始出现 weblogic.jdbc.extensions.PoolDisabledSQLException 这一异常。
weblogic.jdbc.extensions.PoolDisabledSQLException: weblogic.common.resourcepool.ResourceDisabledException: Pool teller is disabled, cannot allocate resources to applications..
此时jdbc连接池被disable掉,导致整个池连接失效,正因为这个过程相对短暂,在之后没有通过人工重启,数据源自动恢复。
在2012-8-20 上午10时09分03秒 jdbc开始恢复正常,开始创建新的连接
####<2012-8-20 上午10时09分03秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <Thread-933666> <<anonymous>> <> <> <1345428543633> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
至此故障自动恢复,业务系统也恢复正常运作。
在8月24日上午10时01分45秒, 第一次出现下列信息:
####<2012-8-24 上午10时01分45秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '36' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773705303> <BEA-000627> <Reached maximum capacity of pool "teller", making "0" new resource instances instead of "5".>
从上午10时01分45秒持续到2012-8-24 上午10时01分52秒,在这期间累计八次这样的信息。之后第一次出现weblogic.jdbc.extensions.PoolLimitSQLException: weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool teller to allocate to applications, please increase the size of the pool and retry..
####<2012-8-24 上午10时01分55秒 CST> <Error> <HTTP> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '36' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1345773715315> <BEA-101020> <[weblogic.servlet.internal.WebAppServletContext@ce136402 - appName: 'SmartTeller', name: 'SmartTeller', context-path: '/SmartTeller'] Servlet failed with Exception
com.dc.datastore.PMException:
ErrorCode:[system.pm.CreateConnectionFail]
ErrorMesg:[???zh_CN.system.pm.CreateConnectionFail???]
SceneInfo:datasource:[teller]
weblogic.jdbc.extensions.PoolLimitSQLException: weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool teller to allocate to applications, please increase the size of the pool and retry..>
接着就报出<Received exception while creating connection for pool "teller": Io 异常: Connection reset>
从2012-8-24 上午10时02分46秒 到2012-8-24 上午10时04分38秒 期间出现<Received exception while creating connection for pool "teller": Io 异常: Connection reset> 六次。
具体详细信息如下:
####<2012-8-24 上午10时02分46秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '14' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773766690> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
####<2012-8-24 上午10时03分00秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '44' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773780226> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
####<2012-8-24 上午10时03分16秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '38' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773796307> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
####<2012-8-24 上午10时04分19秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '41' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773859614> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
####<2012-8-24 上午10时04分38秒 CST> <Warning> <JDBC> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '14' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773878862> <BEA-001129> <Received exception while creating connection for pool "teller": Io 异常: Connection reset>
在报出<Received exception while creating connection for pool "teller": Io 异常: Connection reset> 之后,都会有下列error的报出:
weblogic.jdbc.extensions.PoolLimitSQLException: weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool teller to allocate to applications, please increase the size of the pool and retry..
之后就会出现:
<Reached maximum capacity of pool "teller", making "0" new resource instances instead of "5".>
从2012-8-24 上午10时01分45秒 到2012-8-24 上午10时04分39秒这段时间,统计信息
<Reached maximum capacity of pool "teller", making "0" new resource instances instead of "5".>
得到578次。以下列举部分尝试新建连接的进程的状态:
ExecuteThread: '36'实例尝试新建连接14次;ExecuteThread: '35'实例尝试新建连接12次;ExecuteThread: '34'实例尝试新建连接13次;ExecuteThread: '33'实例尝试新建连接11次等。
以上说明在尝试创建新的连接的进程,数据库的连接已经处于不健康的状态,导致不断请求,请求之后都会报出<Received exception while creating connection for pool "teller": Io 异常: Connection reset> ,说明当时,数据库连接有异常。
在2012-8-24 上午10时04分46秒时,第一次出现weblogic.jdbc.extensions.PoolDisabledSQLException: weblogic.common.resourcepool.ResourceDisabledException: Pool teller is disabled, cannot allocate resources to applications.. 连接池被disable。以下是连接池第一次被disable的详细信息:
####<2012-8-24 上午10时04分46秒 CST> <Error> <HTTP> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '53' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1345773886839> <BEA-101020> <[weblogic.servlet.internal.WebAppServletContext@ce136402 - appName: 'SmartTeller', name: 'SmartTeller', context-path: '/SmartTeller'] Servlet failed with Exception
com.dc.datastore.PMException:
ErrorCode:[system.pm.CreateConnectionFail]
ErrorMesg:[???zh_CN.system.pm.CreateConnectionFail???]
SceneInfo:
datasource:[teller]
......
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2019)
at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:1925)
at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1394)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
weblogic.jdbc.extensions.PoolDisabledSQLException: weblogic.common.resourcepool.ResourceDisabledException: Pool teller is disabled, cannot allocate resources to applications..
at weblogic.jdbc.common.internal.JDBCUtil.wrapAndThrowResourceException(JDBCUtil.java:242)
at weblogic.jdbc.pool.Driver.connect(Driver.java:161)
at weblogic.jdbc.jts.Driver.getNonTxConnection(Driver.java:647)
以下是连接池disable的详细时间信息:
<2012-8-24 上午10时04分58秒 CST>
<2012-8-24 上午10时05分03秒 CST>
<2012-8-24 上午10时05分11秒 CST>
<2012-8-24 上午10时05分12秒 CST>
<2012-8-24 上午10时05分18秒 CST>
<2012-8-24 上午10时05分25秒 CST>
<2012-8-24 上午10时05分33秒 CST>
<2012-8-24 上午10时05分36秒 CST>
<2012-8-24 上午10时05分41秒 CST>
至此连接池被disable的时间持续55秒。
期间从2012-8-24 上午10时05分39秒<Connection for pool "teller" closed.> 直到2012-8-24 上午10时05分41秒 第一批close完毕。共49个<Connection for pool "teller" closed.>也就是说在连接池被disable的时候,同时在关闭不可用的连接。
在2012-8-24 上午10时05分41秒 第一次开始创建新的可用连接:
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '39' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773941989> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <Thread-74630> <<anonymous>> <> <> <1345773941989> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '42' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773941990> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '40' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773941990> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773941991> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分41秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '37' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773941992> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分43秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '45' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773943098> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
####<2012-8-24 上午10时05分44秒 CST> <Info> <Common> <teller1.bocd.com.cn> <AdminServer> <[ACTIVE] ExecuteThread: '46' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1345773944424> <BEA-000628> <Created "5" resources for pool "teller", out of which "5" are available and "0" are unavailable.>
至此故障自动恢复,业务系统也恢复正常运作。
综合日志分析,从当时不断的尝试创建新的连接来看,这些尝试新建连接的并不是大量的不同的线程,而是相同的进程多次请求创建新的连接。说明当时请求创建连接的线程,数据库的连接已经处于不健康的状态,导致不断请求,但又没有响应,请求之后都会报出<Received exception while creating connection for pool "teller": Io 异常: Connection reset> ,说明当时,数据库连接有异常。
手工强制suspend连接池、数据库关闭、网络不稳定等因素都可能成为connection pool被disable的诱因。而根据以上pool为什么会被disable的原因中,可以得出是数据库关闭,或数据库连接中断(网络异常)导致。
一个被disable的connection pool我们不需要手工resume,weblogic内部实现了连接池的自我健康检查功能,对于disable的connection pool,weblogic会每隔5秒钟(DEFAULT_SCAN_UNIT)去做一次连接尝试(尝试创建一个物理连接,如果连接成功,那么这个连接会被直接放入连接池中,同时resume连接池。这也就是后来,为什么故障又自动恢复的原因。
综上分析得出,导致连接池超载的原因是当时的连接中断,不断的请求创建新的连接可供使用,导致连接在短时间内达到最大。而导致pool被disable的原因也是由于当时的数据库关闭,或数据库连接中断(网络异常)。
3 问题总结
此问题,不是由中间件引起。
4 解决方案
还请排查数据库方面,故障期间导致连接中断和关闭的原因。