on 07-28-2008 10:38 AM
We are using sapdb 7.6.0.34 as database in a webapp (spring, hibernate as orm mapper, commons dbcp) with the jdbc-driver version 7.6.0.30.5567. Database and application are on the same machine and the cache hit rate is at 100% as our cache is the same size than the database.
After some days the application stucks and is inresponsible after that. After killing and restarting the app it is responsible again
Doing a thread-dump at that point reveals that all threads are waiting for the database.
At that point i find tons of that in the xserver-log:
2008-07-28 10:51:12 13306 12915 XSERVER Connection lost
2008-07-28 10:51:12 13306 ERR -11987 COMMUNIC semop (receive 44793978) interrupted and check reports error
2008-07-28 10:51:12 13288 12915 XSERVER Connection lost
2008-07-28 10:51:12 13288 ERR -11987 COMMUNIC semop (receive 44269674) interrupted and check reports error
About half an hour before we can see a dramatic growth in connections. It seems the application starts opening new connections as the available ones are bound and waiting for a call to the db to return:
2008-07-28 06:38:53 4089 11560 COMMUNIC Releasing T244
2008-07-28 06:38:53 4089 12827 COMMUNIC wait for connection T244
2008-07-28 06:39:00 4073 11561 COMMUNIC Connecting T266 localhost 0
2008-07-28 06:39:00 4089 12929 TASKING Task T266 started
2008-07-28 06:39:00 4089 11007 COMMUNIC wait for connection T266
2008-07-28 06:39:00 4089 11561 COMMUNIC Connected T266 localhost 0
2008-07-28 08:06:04 4073 11561 COMMUNIC Connecting T242 localhost 0
2008-07-28 08:06:04 4089 12929 TASKING Task T242 started
2008-07-28 08:06:04 4089 11007 COMMUNIC wait for connection T242
2008-07-28 08:06:04 4089 11561 COMMUNIC Connected T242 localhost 0
2008-07-28 08:28:26 4073 11561 COMMUNIC Connecting T230 localhost 0
2008-07-28 08:28:26 4089 12929 TASKING Task T230 started
2008-07-28 08:28:26 4089 11007 COMMUNIC wait for connection T230
2008-07-28 08:28:26 4089 11561 COMMUNIC Connected T230 localhost 0
2008-07-28 09:24:44 4073 11561 COMMUNIC Connecting T218 localhost 0
2008-07-28 09:24:44 4089 12929 TASKING Task T218 started
2008-07-28 09:24:44 4089 11007 COMMUNIC wait for connection T218
2008-07-28 09:24:44 4089 11561 COMMUNIC Connected T218 localhost 0
2008-07-28 10:19:50 4073 11561 COMMUNIC Connecting T300 localhost 0
2008-07-28 10:19:50 4089 12929 TASKING Task T300 started
2008-07-28 10:19:50 4089 11007 COMMUNIC wait for connection T300
2008-07-28 10:19:50 4089 11561 COMMUNIC Connected T300 localhost 0
2008-07-28 10:19:52 4073 11561 COMMUNIC Connecting T243 localhost 0
2008-07-28 10:19:52 4089 12929 TASKING Task T243 started
2008-07-28 10:19:52 4089 11007 COMMUNIC wait for connection T243
2008-07-28 10:19:52 4089 11561 COMMUNIC Connected T243 localhost 0
2008-07-28 10:19:55 4073 11561 COMMUNIC Connecting T234 localhost 0
2008-07-28 10:19:55 4089 12929 TASKING Task T234 started
2008-07-28 10:19:55 4089 11007 COMMUNIC wait for connection T234
2008-07-28 10:19:55 4089 11561 COMMUNIC Connected T234 localhost 0
2008-07-28 10:20:15 4089 12929 TASKING Task T267 started
2008-07-28 10:20:15 4073 11561 COMMUNIC Connecting T267 localhost 0
2008-07-28 10:20:15 4089 11007 COMMUNIC wait for connection T267
2008-07-28 10:20:15 4089 11561 COMMUNIC Connected T267 localhost 0
2008-07-28 10:20:16 4073 11561 COMMUNIC Connecting T233 localhost 0
2008-07-28 10:20:16 4089 12929 TASKING Task T233 started
2008-07-28 10:20:16 4089 11007 COMMUNIC wait for connection T233
2008-07-28 10:20:16 4089 11561 COMMUNIC Connected T233 localhost 0
2008-07-28 10:21:08 4089 12929 TASKING Task T290 started
2008-07-28 10:21:08 4073 11561 COMMUNIC Connecting T290 localhost 0
2008-07-28 10:21:08 4089 11007 COMMUNIC wait for connection T290
2008-07-28 10:21:08 4089 11561 COMMUNIC Connected T290 localhost 0
2008-07-28 10:21:17 4089 12929 TASKING Task T235 started
2008-07-28 10:21:17 4073 11561 COMMUNIC Connecting T235 localhost 0
2008-07-28 10:21:17 4089 11007 COMMUNIC wait for connection T235
2008-07-28 10:21:17 4089 11561 COMMUNIC Connected T235 localhost 0
2008-07-28 10:21:21 4073 11561 COMMUNIC Connecting T215 localhost 0
2008-07-28 10:21:21 4089 12929 TASKING Task T215 started
2008-07-28 10:21:21 4089 11007 COMMUNIC wait for connection T215
2008-07-28 10:21:21 4089 11561 COMMUNIC Connected T215 localhost 0
2008-07-28 10:21:33 4073 11561 COMMUNIC Connecting T263 localhost 0
2008-07-28 10:21:33 4089 12929 TASKING Task T263 started
2008-07-28 10:21:33 4089 11007 COMMUNIC wait for connection T263
2008-07-28 10:21:33 4089 11561 COMMUNIC Connected T263 localhost 0
2008-07-28 10:21:41 4073 11561 COMMUNIC Connecting T225 localhost 0
I've googled for that sort of problem but the best answer i found is "this can happen if there is a problem with the socket communication" so that the socket is brocken but the jdbc-driver never realizes this and sits there, waiting for an answer.
Now i'm totally hosed, i've no clue where to look next.
Any hint would be appreciated.
Regards
Dirk Lachowski
Hi Dirk,
> About half an hour before we can see a dramatic growth in connections. It seems the application starts opening new connections as the available ones are bound and waiting for a call to the db to return:
> Now i'm totally hosed, i've no clue where to look next.
>
> Any hint would be appreciated.
I propose to check why the client starts to create so many connections.
If the db calls hang on the client with semaphore operators in the JDBC driver, than perhaps checking the JAVA runtime environment might be paying off.
It don't sound like a db (kernel) problem at all.
regards,
Lars
BTW: the MaxDB version you use is pretty outdated - get the current patch here in SDN. As this also comes with a new version of the JDBC driver this perhaps improves things for you.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Lars,
i finaly solved the problem, but i'm still unsure if it isn't a kernel problem.
As it turns out we was missing some indexes. Using diagnose monitor we found some selects with a very bad selectivity. The problem was, that under high load (approx. 1000 sql-commands/s) some of these selects seem to get stucked. The command sits there, doing nothing but holding a shared lock on a table because it started to do a full table scan but never returns. In turn, other commands waiting for a write lock also get stuck and there we are, everything stoped.
After adding some indexes the problem is gone, but i still don't understand the reason.
HI Dirk,
good to hear you solved the problem.
Concerning the root cause: it would have been interesting to see, whether the tasks were really doing nothing.
X_CONS and the database trace (vtrace) would have been the 'weapons of choice' here.
Can you reproduce the issue on a test system?
regards,
Lars
User | Count |
---|---|
95 | |
11 | |
10 | |
9 | |
9 | |
7 | |
6 | |
5 | |
5 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.