communication hangs on Vwait

Former Member

Hi all,

I'm getting the blocked threads again, but this time I may have found a clue. Running

x_cons DBNAME sh ac 1

I get:

SERVERDB: DBNAME

T115   7     -1 User         0* Vwait                 0 0               187962245(s)
T121   7     -1 User         0* Vwait                 0 0               187962245(s)
T125   7     -1 User         0* Vwait                 0 0               187962245(s)
T142   7     -1 User         0* Vwait                 0 0               187962245(s)
T193   7     -1 User         0* Vwait                 0 0               187962245(s)
T214   7     -1 User         0* Vwait                 0 0               187962245(s)
T222   7     -1 User         0* Vwait                 0 0               187962245(s)
T239   8     -1 User         0* Vwait                 0 0               191206909(s)
T252   8     -1 User         0* Vwait                 0 0               191206909(s)
T257   8     -1 User         0* Vwait                 0 0               191206909(s)
T276   8     -1 User         0* Vwait                 0 0               191206909(s)
T287   8     -1 User         0* Vwait                 0 0               191206909(s)
T304   8     -1 User         0* Vwait                 0 0               191206909(s)
T312   8     -1 User         0* Vwait                 0 0               191206909(s)
T324   8     -1 User         0* Vwait                 0 0               191206909(s)
T335   8     -1 User         0* Vwait                 0 0               191206909(s)
T360   9     -1 User         0* Vwait                 0 0               192723529(s)
T388   9     -1 User         0* Vwait                 0 0               192723529(s)
T400   9     -1 User         0* Vwait                 0 0               192723529(s)
T418   9     -1 User         0* Vwait                 0 0               192723529(s)
T459   9     -1 User         0* Vwait                 0 0               192723529(s)
T466   9     -1 User         0* Vwait                 0 0               192723529(s)
T475  10     -1 User         0* Vwait                 0 0               185005529(s)
T486  10     -1 User         0* Vwait                 0 0               185005529(s)
T493  10     -1 User         0* Vwait                 0 0               185005529(s)
T497  10     -1 User         0* Vwait                 0 0               185005529(s)
T503  10     -1 User         0* Vwait                 0 0               185005529(s)
T505  10     -1 User         0* Vwait                 0 0               185005529(s)
T507  10     -1 User         0* Vwait                 0 0               185005529(s)
T513  10     -1 User         0* Vwait                 0 0               185005529(s)
T531  10     -1 User         0* Vwait                 0 0               185005529(s)
T532  10     -1 User         0* Vwait                 0 0               185005529(s)
T534  10     -1 User         0* Vwait                 0 0               185005529(s)
T549  10     -1 User         0* Vwait                 0 0               185005529(s)
T561  10     -1 User         0* Vwait                 0 0               185005529(s)
T575  10     -1 User         0* Vwait                 0 0               185005529(s)
T582  10     -1 User         0* Vwait                 0 0               185005529(s)
T584  10     -1 User         0* Vwait                 0 0               185005529(s)

Checking the server side, I get the stack that is waiting on socketRead and blocking the Finalizer:

"http-80-10" daemon prio=10 tid=0x0000002af5750800 nid=0x7b9b runnable [0x000000004b853000..0x000000004b854c30]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at com.sap.dbtech.rte.comm.BasicSocketComm.receiveData(BasicSocketComm.java:577)
        at com.sap.dbtech.rte.comm.BasicSocketComm.receive(BasicSocketComm.java:666)
        at com.sap.dbtech.rte.comm.JdbcCommunication.execute(JdbcCommunication.java:41)
        at com.sap.dbtech.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:536) 
        - locked <0x0000002aa7c55c60> (a com.sap.dbtech.jdbc.ConnectionSapDB)
        at com.sap.dbtech.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:461)
        at com.sap.dbtech.jdbc.CallableStatementSapDB.execute(CallableStatementSapDB.java:441)
        at com.sap.dbtech.jdbc.CallableStatementSapDB.execute(CallableStatementSapDB.java:313)
        at com.sap.dbtech.jdbc.CallableStatementSapDB.executeUpdate(CallableStatementSapDB.java:778)
        at com.sap.dbtech.jdbc.trace.PreparedStatement.executeUpdate(PreparedStatement.java:81)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102)
        at org.exolab.castor.jdo.engine.SQLEngine.create(SQLEngine.java:616)
        at org.exolab.castor.jdo.engine.SQLEngine.create(SQLEngine.java:536)
        at org.exolab.castor.jdo.engine.SQLEngine.create(SQLEngine.java:536)
        at org.exolab.castor.persist.ClassMolder.create(ClassMolder.java:891)
        at org.exolab.castor.persist.LockEngine.create(LockEngine.java:458)
        at org.exolab.castor.persist.TransactionContext.create(TransactionContext.java:803)
        - locked <0x0000002aa9dfbd38> (a org.exolab.castor.jdo.engine.TransactionContextImpl)
        at org.exolab.castor.jdo.engine.DatabaseImpl.create(DatabaseImpl.java:338)
        at com.supridatta.bean.DataPersist.gravar(DataPersist.java:215)

And here is the Finalizer, BLOCKED:

"Finalizer" daemon prio=10 tid=0x0000002af1e04800 nid=0x4a5a waiting for monitor entry [0x0000000041531000..0x0000000041531c30]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.sap.dbtech.jdbc.StatementSapDB.closeResultSet(StatementSapDB.java:189)
        - waiting to lock <0x0000002aa7c55c60> (a com.sap.dbtech.jdbc.ConnectionSapDB)
        at com.sap.dbtech.jdbc.StatementSapDB.close(StatementSapDB.java:174)
        at com.sap.dbtech.jdbc.CallableStatementSapDB.close(CallableStatementSapDB.java:2325)
        at com.sap.dbtech.jdbc.trace.PreparedStatement.close(PreparedStatement.java:637)
        at org.apache.tomcat.dbcp.dbcp.DelegatingStatement.close(DelegatingStatement.java:168)
        at org.exolab.castor.jdo.engine.SQLEngine$SQLQuery.close(SQLEngine.java:1675)
        at org.exolab.castor.persist.QueryResults.close(QueryResults.java:241)
        at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.close(OQLQueryImpl.java:676)
        at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.finalize(OQLQueryImpl.java:685)
        at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
        at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
        at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)
        
   Locked ownable synchronizers:
        - None

Is there any way to check what causes that Vwait, ideally right when it occurs? I can't manage to reproduce or simulate the problem; it just happens at random times, maybe once per week, once per month, etc...

Any ideas are welcome.

best regards

Clóvis

Accepted Solutions (1)

Former Member

Hello,

Are you familiar with the SELECT option to specify the isolation level?

If, at least for the select on SYSPARSEID, you used ISOLATION LEVEL 0, every lock and every check for locks on this table would be avoided. And I think that, especially for this table, no locks during the select are needed.

But even with ISOLATION LEVEL 1 (usually the default), one would not lock the table, just check whether the row currently being handled is exclusively locked by another task. Please adapt the statement (add the ISOLATION LEVEL ... option) and check whether this change helps.
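
For illustration only (the exact syntax may differ between MaxDB versions, so please verify it against the SQL reference), the lock option appended to a SELECT would look roughly like this:

SELECT * FROM SYSPARSEID WITH LOCK ISOLATION LEVEL 0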

Elke

Former Member

Hi, Elke

I used the ISOLATION LEVEL 0 lock option but am still getting:

118:  * W1  User task 338 is waiting in state Command wait for page 59552, locked from task 520

At the time this analysis was taken I had stopped using my tool, so perhaps the command monitor was locking or waiting to insert commands into the SYSPARSEID table, like a queue?

Thanks for your tip.

Clóvis

Answers (1)

lbreddemann
Active Contributor

Hi Clovis,

please provide a


x_cons <SID> show all

output the next time this issue occurs.

regards,

Lars

Former Member

Hi, Lars,

The system blocked again, but with different messages:

T388   9     -1 User         0* Running               0 1860913            195182771(r)

Console command finished (2009-04-30 15:44:39).

The following is the thread that blocks many others:

"Thread-310" daemon prio=10 tid=0x0000002af67b9000 nid=0x1579 runnable [0x000000004e05b000..0x000000004e05cc30]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at com.sap.dbtech.rte.comm.BasicSocketComm.receiveConnect(BasicSocketComm.java:707)
        at com.sap.dbtech.rte.comm.BasicSocketComm.dbConnectExchange(BasicSocketComm.java:789)
        at com.sap.dbtech.rte.comm.BasicSocketComm.connectDB(BasicSocketComm.java:233)
        at com.sap.dbtech.rte.comm.SocketComm$1.open(SocketComm.java:38)
        at com.sap.dbtech.jdbc.DriverSapDB.openConnection(DriverSapDB.java:966)
        at com.sap.dbtech.jdbc.DriverSapDB.openByURL(DriverSapDB.java:891)
        at com.sap.dbtech.jdbc.DriverSapDB.connect(DriverSapDB.java:208)
        - locked <0x0000002aef27ecb0> (a com.sap.dbtech.jdbc.DriverSapDB)
        at org.apache.tomcat.dbcp.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
        at org.apache.tomcat.dbcp.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:294)
        - locked <0x0000002aef7dd398> (a org.apache.tomcat.dbcp.dbcp.PoolableConnectionFactory)
        at org.apache.tomcat.dbcp.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:974)
        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:96)
        at org.apache.tomcat.dbcp.dbcp.BasicDataSource.getConnection(BasicDataSource.java:880)
        at org.exolab.castor.jdo.engine.DatabaseRegistry.createConnection(DatabaseRegistry.java:399)

That socketRead never returns, and I must stop/start Tomcat to get the web application running again.

I executed x_cons <SID> show all, but the generated file is 9 MB in size. Can I send it to you by email, or do you need a download link?

I have also generated a dump from the running JVM, and I have the option to query all connection instances in it. Could you check with the JDBC driver developers whether there is any OQL query I could run to search for information about this type of problem?

Best regards.

Clóvis

lbreddemann
Active Contributor

> Hi, Lars,
>
> The system blocked again, but with different messages:
>
> T388   9     -1 User         0* Running               0 1860913            195182771(r)
>
> Console command finished (2009-04-30 15:44:39).
>
> The following is the thread that blocks many others:
>
> "Thread-310" daemon prio=10 tid=0x0000002af67b9000 nid=0x1579 runnable [0x000000004e05b000..0x000000004e05cc30]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> [...]
>
> That socketRead never returns, and I must stop/start Tomcat to get the web application running again.

Hmm.. still the old topic on your system, the hanging network accesses...

> I executed x_cons <SID> show all, but the generated file is 9 MB in size. Can I send it to you by email, or do you need a download link?

9MB is too large for our email policy - a download would help, yes.

> I have also generated a dump from the running JVM, and I have the option to query all connection instances in it. Could you check with the JDBC driver developers whether there is any OQL query I could run to search for information about this type of problem?

As already discussed in the previous threads - it actually does not seem to be the problem of the JDBC driver. Somehow the network calls don't return.

Have you ever seen these effects on a different server in a different network setup?

regards,

Lars

Former Member

Hi, Lars,

As you can see, I am still having this type of problem. It occurs on two different production setups with different machines but the same configuration: a web application server connected to a MaxDB server over a crossover cable with gigabit ethernet cards. One serves 83 concurrent users and the other approximately 350. The problem in this post occurred in the 350-user scenario, but last week it also occurred on the other one, so I have this problem in two different environments. I have already changed machines, added processors and memory, upgraded Java, Linux and MaxDB versions, etc., and the problem still occurs without my being able to discover why. I only know the trigger: when a synchronized Java method that all threads depend on needs a connection (or gets one from the pool) and the database never replies, the system gets stuck. Sometimes this lasts 3 minutes or less and no restart is necessary, but sometimes it never ends. Think of many users all waiting 3 minutes for the application at the same time - this is driving me crazy.

I put the x_cons output in xconsShowAll20090430.txt.

About the JDBC question, what I meant to say was: "If the JDBC devs want, I can query a snapshot that I took here from the JVM; with that we can inspect all JDBC objects and try to use that information to find something."

I hope you can help me with this problem.

best regards

Clóvis

lbreddemann
Active Contributor

Hi Clóvis,

I had a look at the x_cons output.

It pretty clearly shows that there was no active task on the database when this output was taken.

No active tasks, no waiting for region accesses, no nothing...

If it were my system, the next step I'd take would be to try to catch the hanging network call with a network monitor. Maybe there is some kind of packet optimization (e.g. TCP_NODELAY=OFF) active that causes this kind of behaviour.
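
A minimal sketch of such a capture with tcpdump, assuming the default x_server port 7210 and a placeholder interface name, could be:

tcpdump -i eth0 -s 0 -w maxdb_7210.pcap port 7210

The resulting capture file can then be inspected, e.g. with Wireshark, to see whether packets for the hanging connection are still flowing at all.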

To me this really looks like the issue is not with the database software here.

Since MaxDB is heavily used with Java for SAP products and we don't know this kind of error yet, I doubt that this is a bug in the software.

sorry for that.

Lars

Former Member

Hi, Lars

I will capture the communications on the database server and try to analyze how TCP packets are arriving. To help with this, do you have any document you can point me to that explains the native protocol used by MaxDB?

Thanks again for your help.

best regards

Clóvis

lbreddemann
Active Contributor

Hi Clóvis,

sorry, but I don't have that kind of information.

Maybe it helps to have the "vtrace default" active and to flush it the next time you face the hanging situation.

regards,

Lars

Former Member

Hi Lars,

I think I found a possible problem with automatic log backup: the log segment in one system is too small, and a log backup is made many times per day. Here is an excerpt from the backup history file:

3866:  4A034BB7014A|LOG_000001313|SAVE WARM|2009-05-07 16:51:52|2009-05-07 17:59:31|2009-05-07 17:59:35|2009-05-07 17:59:35|  27584207|  27586339|   |autosave                                                        |      2152|         1|         0|                                        |
3867:  4A0362F8014B|LOG_000001314|SAVE WARM|2009-05-07 17:59:31|2009-05-07 19:37:52|2009-05-07 19:38:48|2009-05-07 19:38:48|  27586340|  27588472|   |autosave                                                        |      2152|         1|         0|                                        |
3868:  4A03A177014C|LOG_000001315|SAVE WARM|2009-05-07 19:38:47|2009-05-08 00:05:26|2009-05-08 00:05:27|2009-05-08 00:05:29|  27588473|  27590605|   |autosave                                                        |      2152|         1|         0|                                        |
3869:  4A03A195014D|LOG_000001316|SAVE WARM|2009-05-08 00:05:26|2009-05-08 00:05:53|2009-05-08 00:05:57|2009-05-08 00:05:59|  27590606|  27592738|   |autosave                                                        |      2152|         1|         0|                                        |
3870:  4A03A1A4014E|LOG_000001317|SAVE WARM|2009-05-08 00:05:53|2009-05-08 00:06:12|2009-05-08 00:06:12|2009-05-08 00:06:14|  27592739|  27594871|   |autosave                                                        |      2152|         1|         0|                                        |
3871:  4A03A1B3014F|LOG_000001318|SAVE WARM|2009-05-08 00:06:12|2009-05-08 00:06:27|2009-05-08 00:06:27|2009-05-08 00:06:28|  27594872|  27597004|   |autosave                                                        |      2152|         1|         0|                                        |
3872:  4A03A1C70150|LOG_000001319|SAVE WARM|2009-05-08 00:06:27|2009-05-08 00:06:46|2009-05-08 00:06:47|2009-05-08 00:06:48|  27597005|  27599137|   |autosave                                                        |      2152|         1|         0|                                        |
3873:  4A03A1DB0151|LOG_000001320|SAVE WARM|2009-05-08 00:06:46|2009-05-08 00:07:03|2009-05-08 00:07:07|2009-05-08 00:07:08|  27599138|  27601270|   |autosave                                                        |      2152|         1|         0|                                        |
3874:  4A03A1EA0152|LOG_000001321|SAVE WARM|2009-05-08 00:07:03|2009-05-08 00:07:19|2009-05-08 00:07:22|2009-05-08 00:07:23|  27601271|  27603403|   |autosave                                                        |      2152|         1|         0|                                        |
3875:  4A03A1FE0153|LOG_000001322|SAVE WARM|2009-05-08 00:07:19|2009-05-08 00:07:38|2009-05-08 00:07:42|2009-05-08 00:07:44|  27603404|  27605536|   |autosave                                                        |      2152|         1|         0|                                        |
3876:  4A03AF6D0154|DAT_000032277|SAVE WARM|2009-05-08 01:05:01|2009-05-08 01:05:01|2009-05-08 01:05:02|2009-05-08 01:06:55|  27605879|          |NO |completeF                                                       |    338720|         1|         0|                                        |
3877:  4A0415A40155|LOG_000001323|SAVE WARM|2009-05-08 00:07:38|2009-05-08 08:21:04|2009-05-08 08:21:08|2009-05-08 08:21:09|  27605537|  27607669|   |autosave                                                        |      2152|         1|         0|                                        |
3878:  4A0415AE0156|LOG_000001324|SAVE WARM|2009-05-08 08:21:04|2009-05-08 08:21:17|2009-05-08 08:21:18|2009-05-08 08:21:19|  27607670|  27609802|   |autosave                                                        |      2152|         1|         0|                                        |
3879:  4A0415BD0157|LOG_000001325|SAVE WARM|2009-05-08 08:21:17|2009-05-08 08:21:32|2009-05-08 08:21:33|2009-05-08 08:21:34|  27609803|  27611935|   |autosave                                                        |      2152|         1|         0|                                        |
3880:  4A0415CC0158|LOG_000001326|SAVE WARM|2009-05-08 08:21:32|2009-05-08 08:21:46|2009-05-08 08:21:48|2009-05-08 08:21:50|  27611936|  27614068|   |autosave                                                        |      2152|         1|         0|                                        |
3881:  4A0415DB0159|LOG_000001327|SAVE WARM|2009-05-08 08:21:46|2009-05-08 08:22:02|2009-05-08 08:22:03|2009-05-08 08:22:04|  27614069|  27616201|   |autosave                                                        |      2152|         1|         0|                                        |
3882:  4A0415EA015A|LOG_000001328|SAVE WARM|2009-05-08 08:22:02|2009-05-08 08:22:15|2009-05-08 08:22:18|2009-05-08 08:22:19|  27616202|  27618334|   |autosave                                                        |      2152|         1|         0|                                        |
3883:  4A0415F4015B|LOG_000001329|SAVE WARM|2009-05-08 08:22:15|2009-05-08 08:22:26|2009-05-08 08:22:28|2009-05-08 08:22:29|  27618335|  27620467|   |autosave                                                        |      2152|         1|         0|                                        |
3884:  4A0415FF015C|LOG_000001330|SAVE WARM|2009-05-08 08:22:26|2009-05-08 08:22:37|2009-05-08 08:22:39|2009-05-08 08:22:40|  27620468|  27622600|   |autosave                                                        |      2152|         1|         0|                                        |
3885:  4A041609015D|LOG_000001331|SAVE WARM|2009-05-08 08:22:37|2009-05-08 08:22:48|2009-05-08 08:22:49|2009-05-08 08:22:49|  27622601|  27624733|   |autosave                                                        |      2152|         1|         0|                                        |
3886:  4A041C2E015E|LOG_000001332|SAVE WARM|2009-05-08 08:22:48|2009-05-08 08:48:59|2009-05-08 08:49:02|2009-05-08 08:49:02|  27624734|  27626866|   |autosave                                                        |      2152|         1|         0|                                        |
3887:  4A0421E5015F|LOG_000001333|SAVE WARM|2009-05-08 08:48:59|2009-05-08 09:13:23|2009-05-08 09:13:25|2009-05-08 09:13:25|  27626867|  27628999|   |autosave                                                        |      2152|         1|         0|                                        |
3888:  4A0421F90160|LOG_000001334|SAVE WARM|2009-05-08 09:13:23|2009-05-08 09:13:43|2009-05-08 09:13:45|2009-05-08 09:13:46|  27629000|  27631132|   |autosave                                                        |      2152|         1|         0|                                        |
3889:  4A04220D0161|LOG_000001335|SAVE WARM|2009-05-08 09:13:43|2009-05-08 09:14:02|2009-05-08 09:14:05|2009-05-08 09:14:06|  27631133|  27633265|   |autosave                                                        |      2152|         1|         0|                                        |
3890:  4A0422170162|LOG_000001336|SAVE WARM|2009-05-08 09:14:02|2009-05-08 09:14:13|2009-05-08 09:14:15|2009-05-08 09:14:16|  27633266|  27635398|   |autosave                                                        |      2152|         1|         0|                                        |
3891:  4A0422260163|LOG_000001337|SAVE WARM|2009-05-08 09:14:13|2009-05-08 09:14:25|2009-05-08 09:14:30|2009-05-08 09:14:30|  27635399|  27637531|   |autosave                                                        |      2152|         1|         0|                                        |
3892:  4A0422300164|LOG_000001338|SAVE WARM|2009-05-08 09:14:25|2009-05-08 09:14:38|2009-05-08 09:14:40|2009-05-08 09:14:40|  27637532|  27639664|   |autosave                                                        |      2152|         1|         0|                                        |
3893:  4A04223F0165|LOG_000001339|SAVE WARM|2009-05-08 09:14:38|2009-05-08 09:14:53|2009-05-08 09:14:55|2009-05-08 09:14:55|  27639665|  27641797|   |autosave                                                        |      2152|         1|         0|                                        |
3894:  4A0422A80166|LOG_000001340|SAVE WARM|2009-05-08 09:14:53|2009-05-08 09:16:36|2009-05-08 09:16:40|2009-05-08 09:16:40|  27641798|  27643930|   |autosave                                                        |      2152|         1|         0|                                        |
3895:  4A0439760167|LOG_000001341|SAVE WARM|2009-05-08 09:16:37|2009-05-08 10:53:57|2009-05-08 10:53:58|2009-05-08 10:53:58|  27643931|  27646063|   |autosave                                                        |      2152|         1|         0|                                        |
3896:  4A0440EC0168|LOG_000001342|SAVE WARM|2009-05-08 10:53:57|2009-05-08 11:25:45|2009-05-08 11:25:48|2009-05-08 11:25:49|  27646064|  27648196|   |autosave                                                        |      2152|         1|         0|                                        |
3897:  4A0440F60169|LOG_000001343|SAVE WARM|2009-05-08 11:25:45|2009-05-08 11:25:58|2009-05-08 11:25:58|2009-05-08 11:26:04|  27648197|  27650329|   |autosave                                                        |      2152|         1|         0|                                        |
3898:  4A044187016A|LOG_000001344|SAVE WARM|2009-05-08 11:25:58|2009-05-08 11:28:18|2009-05-08 11:28:23|2009-05-08 11:28:24|  27650330|  27652462|   |autosave                                                        |      2152|         1|         0|                                        |
3899:  4A044196016B|LOG_000001345|SAVE WARM|2009-05-08 11:28:18|2009-05-08 11:28:33|2009-05-08 11:28:38|2009-05-08 11:28:42|  27652463|  27654595|   |autosave                                                        |      2152|         1|         0|                                        |
3900:  4A0447E9016C|LOG_000001346|SAVE WARM|2009-05-08 11:28:33|2009-05-08 11:55:33|2009-05-08 11:55:37|2009-05-08 11:55:37|  27654596|  27656728|   |autosave                                                        |      2152|         1|         0|                                        |
3901:  4A044EBD016D|LOG_000001347|SAVE WARM|2009-05-08 11:55:33|2009-05-08 12:24:40|2009-05-08 12:24:45|2009-05-08 12:24:46|  27656729|  27658861|   |autosave                                                        |      2152|         1|         0|                                        |
3902:  4A044F54016E|LOG_000001348|SAVE WARM|2009-05-08 12:24:40|2009-05-08 12:27:15|2009-05-08 12:27:16|2009-05-08 12:27:17|  27658862|  27660994|   |autosave                                                        |      2152|         1|         0|                                        |
3903:  4A044F5E016F|LOG_000001349|SAVE WARM|2009-05-08 12:27:15|2009-05-08 12:27:26|2009-05-08 12:27:26|2009-05-08 12:27:27|  27660995|  27663127|   |autosave                                                        |      2152|         1|         0|                                        |
3904:  4A044F950170|LOG_000001350|SAVE WARM|2009-05-08 12:27:26|2009-05-08 12:28:18|2009-05-08 12:28:21|2009-05-08 12:28:23|  27663128|  27665260|   |autosave                                                        |      2152|         1|         0|                                        |
3905:  4A0476730171|LOG_000001351|SAVE WARM|2009-05-08 12:28:18|2009-05-08 15:13:56|2009-05-08 15:14:11|2009-05-08 15:14:12|  27665261|  27667393|   |autosave                                                        |      2152|         1|         0|                                        |
3906:  4A04808E0172|LOG_000001352|SAVE WARM|2009-05-08 15:14:06|2009-05-08 15:57:14|2009-05-08 15:57:18|2009-05-08 15:57:19|  27667394|  27669526|   |autosave                                                        |      2152|         1|         0|                                        |
3907:  4A0480A20173|LOG_000001353|SAVE WARM|2009-05-08 15:57:14|2009-05-08 15:57:36|2009-05-08 15:57:38|2009-05-08 15:57:40|  27669527|  27671659|   |autosave                                                        |      2152|         1|         0|                                        |
3908:  4A0481E30174|LOG_000001354|SAVE WARM|2009-05-08 15:57:36|2009-05-08 16:02:57|2009-05-08 16:02:59|2009-05-08 16:03:00|  27671660|  27673792|   |autosave                                                        |      2152|         1|         0|                                        |
3909:  4A0489CF0175|LOG_000001355|SAVE WARM|2009-05-08 16:02:57|2009-05-08 16:36:46|2009-05-08 16:36:47|2009-05-08 16:36:49|  27673793|  27675925|   |autosave                                                        |      2152|         1|         0|                                        |
3910:  4A048CB00176|LOG_000001356|SAVE WARM|2009-05-08 16:36:46|2009-05-08 16:49:00|2009-05-08 16:49:04|2009-05-08 16:49:05|  27675926|  27678058|   |autosave                                                        |      2152|         1|         0|                                        |
3911:  4A049E590177|LOG_000001357|SAVE WARM|2009-05-08 16:49:00|2009-05-08 18:04:23|2009-05-08 18:04:25|2009-05-08 18:04:25|  27678059|  27680191|   |autosave                                                        |      2152|         1|         0|                                        |
3912:  4A04E2680179|PAG_000032278|SAVE WARM|2009-05-08 22:54:49|2009-05-08 22:54:49|2009-05-08 22:54:50|2009-05-08 22:55:07|  27681241|          |NO |incrF                                                           |     16552|         1|         0|                                        |
3913:  4A0500ED0003|DAT_000032279|SAVE WARM|2009-05-09 01:05:01|2009-05-09 01:05:01|2009-05-09 01:05:03|2009-05-09 01:06:43|  27698573|          |NO |completeF                                                       |    339144|         1|         0|                                        |

I then noticed that new JDBC connections cannot be established while the automatic log backup is running. I cannot see that situation in x_cons, because by the time I run x_cons the automatic log backup has already ended, but the JDBC connections hang and never return.

I added a new log volume to increase the LOG_SEGMENT_SIZE to a value that results in just one log backup per day.

I will stay alert this week to see if things go better with this new configuration, and maybe rethink my understanding of automatic log backup.

best regards

Clóvis

lbreddemann
Active Contributor

> I think I found a possible problem with automatic log backup: the log segment in one system is too small, and a log backup is made many times per day. Here is an excerpt from the backup history file:

> I then noticed that new JDBC connections cannot be established while the automatic log backup is running. I cannot see that situation in x_cons, because by the time I run x_cons the automatic log backup has already ended, but the JDBC connections hang and never return.

Hmm... I don't see any connection between the automatic log backup and the possibility to connect to the database.

BUT - as we see in the backup history, it happens from time to time that the log area runs full because the log backups cannot be done quickly enough.

And as soon as the database goes into LOG FULL or DATABASE FULL state no new connections can be established.

The reason for that is likely a too small log area - if you didn't modify LOG_SEGMENT_SIZE before, the log area will be 3 times the size of the log segment size.
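
As a side note, a quick way to check whether the database is currently in a LOG FULL state would be something like the following (a sketch; the DBM user and password are placeholders, and the exact output fields depend on the MaxDB version):

dbmcli -d DBNAME -u DBM,DBM info state

Among other things, this should report the database state and the data/log fill levels.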

> i added a new log volume to increase the LOG_SEGMENT_SIZE to a value that do just one log backup per day.

Ok, although LOG_SEGMENT_SIZE can also be modified directly, it sounds like a good idea to have a log area bigger than 2152 * 3 = 6456 pages => ca. 50 MB when you change a lot of data in a short amount of time.

By the way: one log backup per day may lead to problems concerning your backup availability.

Therefore you may want to consider using autolog with a time-based backup trigger, e.g. every hour.
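
A rough sketch of how such a time-based trigger could be switched on with dbmcli - the backup medium name and the DBM credentials are placeholders, and the INTERVAL option (given in seconds here) should be verified against the dbmcli reference for your MaxDB version:

dbmcli -d DBNAME -u DBM,DBM autolog_off
dbmcli -d DBNAME -u DBM,DBM autolog_on LogBackupMedium INTERVAL 3600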

> I will stay alert this week to see if things go better with this new configuration, and maybe rethink my understanding of automatic log backup.

Let us know if this really improved your situation.

regards,

Lars

Former Member

Hi, Lars

Bad news, the connections hang again. The following is the thread:

"http-80-40" daemon prio=10 tid=0x0000002af97fe800 nid=0x5168 runnable [0x0000000057178000..0x0000000057179bb0]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at com.sap.dbtech.rte.comm.BasicSocketComm.receiveConnect(BasicSocketComm.java:707)
        at com.sap.dbtech.rte.comm.BasicSocketComm.dbConnectExchange(BasicSocketComm.java:789)
        at com.sap.dbtech.rte.comm.BasicSocketComm.connectDB(BasicSocketComm.java:233)
        at com.sap.dbtech.rte.comm.SocketComm$1.open(SocketComm.java:38)
        at com.sap.dbtech.jdbc.DriverSapDB.openConnection(DriverSapDB.java:966)
        at com.sap.dbtech.jdbc.DriverSapDB.openByURL(DriverSapDB.java:891)
        at com.sap.dbtech.jdbc.DriverSapDB.connect(DriverSapDB.java:208)
        - locked <0x0000002aa2bb0118> (a com.sap.dbtech.jdbc.DriverSapDB)
        at org.apache.tomcat.dbcp.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
        at org.apache.tomcat.dbcp.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:294)
        - locked <0x0000002aa2e14b40> (a org.apache.tomcat.dbcp.dbcp.PoolableConnectionFactory)
        at org.apache.tomcat.dbcp.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:974)
        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:96)
        at org.apache.tomcat.dbcp.dbcp.BasicDataSource.getConnection(BasicDataSource.java:880)
        at org.exolab.castor.jdo.engine.DatabaseRegistry.createConnection(DatabaseRegistry.java:399)
        at org.exolab.castor.jdo.engine.TransactionContextImpl.getConnection(TransactionContextImpl.java:203)
        at org.exolab.castor.persist.TransactionContext.query(TransactionContext.java:644)
        - locked <0x0000002aae9f4c18> (a org.exolab.castor.jdo.engine.TransactionContextImpl)
        at org.exolab.castor.jdo.engine.OQLQueryImpl.execute(OQLQueryImpl.java:458)
        at org.exolab.castor.jdo.engine.OQLQueryImpl.execute(OQLQueryImpl.java:414)
        at com.supridatta.bean.DataPersist.consulta(DataPersist.java:501)

Maybe I must change the way I'm checking these hangs; maybe the problem is with the x_server. Is there a way to inspect the x_server when this hang occurs again?

I generated a kernel trace too, but I don't understand that output.

I inspected the network traffic with tcpdump too, but without knowledge of the MaxDB protocol it is difficult to discover what is wrong.

best regards

Clóvis

lbreddemann
Active Contributor

> Bad news, the connections hang again. The following is the thread:

> Maybe I must change the way I'm checking these hangs; maybe the problem is with the x_server. Is there a way to inspect the x_server when this hang occurs again?

Sure - pstack, strace and gdb would be the weapons of choice here.
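
For illustration, on Linux the calls could look roughly like this (the pid is a placeholder, and the assumption that the x_server process is called vserver should be verified with ps first):

ps -ef | grep vserver
pstack <pid>
strace -f -p <pid> -o xserver_strace.txt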

Anyhow, as a first step I'd check whether the X_SERVER log files contain any information.

> I generated a kernel trace too, but I don't understand that output.

Provide a download link for this please - maybe we can find something interesting in it.

Do you have a 'x_cons show all' from this hanging situation as well?

> I inspected the network traffic with tcpdump too, but without knowledge of the MaxDB protocol it is difficult to discover what is wrong.

I believe that - actually I wouldn't really know what to look for. In fact I think it must be the absence of communication for the thread that hangs which would be interesting.

BTW: did you ever run an strace or pstack against the hanging JAVA machine?

Maybe it's some issue with the network system calls ...

regards,

Lars

Former Member

Hi Lars,

In the xserver.prt file there are only problems when I do a Tomcat stop. Note that in the x_cons show all output file the time is 2009-05-11 15:17:54, before the errors in the xserver; at 15:23 I did a Tomcat stop/start to get the system working again. I hope that something in the kernel trace file helps us find what is wrong.

I created a single file with 'show all 1', with the output redirected to a file, from 2009-05-11 15:18:41 to 2009-05-11 15:23:48; that generated about 145 'show all' views. I did this thinking that the behaviour of 'show all' might change over time, since in the past a single 'show all' didn't show us anything.

With this output I ran a cat xconsShowAll2009051115many.txt | grep "T514 10 " command to extract the state of the only running task, T514; the output is in this file. In this output we can see a number between the two last columns that has a value of 70 on line 8, then goes to 66 on line 16, then to 58 on line 139. This appears to be a kind of decreasing counter - maybe a possible problem?

About strace or pstack: I have never used them, not even with the JVM; for that I just used JConsole or other Java bin tools to inspect what is wrong with the JVM. Do you have any ready-made command line to run these tools on the x_server machine and on the Java machine? If not, I will try to understand and use these commands.

best regards.

Clóvis

lbreddemann
Active Contributor

Hi Clovis,

I'm pretty busy right now - so checking the logs will take some time.

regards,

Lars

Former Member

Hi, Lars,

OK, I will wait; I don't know anyone else who can help me with this, except you.

But today I got one more hang in the system, now with Vwait again; this time, though, I was prepared, with the Database Analyzer activated, the Command Monitor activated, commands running, etc...

I cut the section of dban.prt covering the time when the system had many tasks in Vwait state and put it in dban20090514.txt for you to take a look at too; I think this can help us find some of the problems.

best regards.

Clóvis

lbreddemann
Active Contributor

> OK, I will wait; I don't know anyone else who can help me with this, except you.

Don't be that desperate...

> But today I got one more hang in the system, now with Vwait again; this time, though, I was prepared, with the Database Analyzer activated, the Command Monitor activated, commands running, etc...
>
> I cut the section of dban.prt covering the time when the system had many tasks in Vwait state and put it in ... for you to take a look at too; I think this can help us find some of the problems.

Ok, I had a look into the log files you sent.

First of all: I'm not really convinced that the connection aborts are connected to the VWAITs.

The DBAN.PRT shows that there are pretty often waits for row locks on the same few tables:

TITULOPAGAMENTO, PAGAMENTO and for the root page 59552.

Please check what table belongs to this root page.

Also there are many table scans which may lead to locking situations.

So up to here I'd say: you need to tune some statements to avoid the low-selectivity statements.

The other issue - the hanging network calls in your JVM...

The xserver log shows:


2009-05-11 11:56:15 19871 ERR 11926  XSERVER  Receive packet, Ref:7 - socket recv error (110:Connection timed out
2009-05-11 15:23:25 17279     12915  XSERVER  Connection lost
2009-05-11 15:23:25 17279 ERR -11987 COMMUNIC semop (receive 170852413) interrupted and check reports error
2009-05-11 15:23:25 27426     12915  XSERVER  Connection lost
2009-05-11 15:23:25 27426 ERR -11987 COMMUNIC semop (receive 185532489) interrupted and check reports error
2009-05-11 15:23:25 16705     12915  XSERVER  Connection lost
2009-05-11 15:23:25 16705 ERR -11987 COMMUNIC semop (receive 170033188) interrupted and check reports error
2009-05-11 15:23:27 29581     12915  XSERVER  Connection lost

First the socket receive call fails and after that the connections (the IPC connections are handled via semaphores) get cancelled.

Unfortunately we're now not much further here. Somewhere in the lower network stack packets get lost.

Sorry, but that's all I can tell about the issue here.

The vtrace protocol did not bring further insights, since the abort is caused outside the database.

regards,

Lars

lbreddemann
Active Contributor

> With this output I ran a cat xconsShowAll2009051115many.txt | grep "T514 10 " command to extract the state of the only running task, T514; the output is in this [file|https://sites.google.com/a/supridatta.com.br/maxdb/Home/stuckthreadsproblem/resultOfCatT514_20090511.txt?attredirects=0]. In this output we can see a number between the two last columns that has a value of 70 on line 8, then goes to 66 on line 16, then to 58 on line 139. This appears to be a kind of decreasing counter - maybe a possible problem?

Concerning your assumption about wait times counting down in your x_cons output, let's check the output again.

 
ID   UKT UNIX   TASK       APPL Current         Timeout Region     Wait           tid   
       type        pid state          priority cnt try    item 
T514  10     -1 User         0 Running               0 13891408  70        1773396 r 
T514  10     -1 User         0 Running               0 15117700  66        1773396 r
T514  10     -1 User         0 Running               0 34099834  58        1773396 r

I put the title lines on it, so that you can easily see that the "countdown" is in fact the index of the critical region the task is currently working in.

For a much more detailed description of this information, you may want to check the MaxDB internals training material on http://maxdb.sap.com or actually attend the SAP training.

> About strace or pstack: I have never used them, not even with the JVM; for that I just used JConsole or other Java bin tools to inspect what is wrong with the JVM. Do you have any ready-made command line to run these tools on the x_server machine and on the Java machine? If not, I will try to understand and use these commands.

Since the problem seems to be on the lower levels of the network stack, I'd say that looking into the JVM is looking into the "wrong box".

STRACE, TRUSS and PSTACK are all OS-dependent debugging tools - make sure to check the man pages for them to get the correct call syntax.

cheers,

Lars

Former Member

Hi Lars,

I became desperate when I discovered that I'm the only one in Brazil using the MaxDB Community edition in a production environment.

OK, why JDBC sometimes cannot get connections is a very strange thing that maybe I will never discover, but this search showed me many things I must fix, and that reduced the stuck threads; for weeks I haven't had stuck threads. What did I do?

I just doubled the MAX_LOCKS and LOG_IO_QUEUE values to avoid some lock escalations, and that appears to solve the stuck connections.

I know this isn't a proper solution, but it appears to solve something here. Now I must fix the low-selectivity statements to avoid table locks, which can stall threads when reading data concurrently with other transactions.

About the tables TITULOPAGAMENTO and PAGAMENTO: I discovered a SELECT statement that does a table scan and causes a table lock; because of that, other transactions must wait in INSERT or DELETE statements until that bad SELECT command ends. (Here I think MaxDB could have a feature that logs the culprit statement when a table lock occurs - just a dream.)

Root page 59552 points to SYSPARSEID; maybe this table is locked when I query those tables to generate XML files for post-analysis of the command monitor?

About the errors in xserver.log: as I said, they occur only at the times when I must restart Tomcat; on stop, those errors appear in the xserver logs. Maybe the Tomcat DBCP pool doesn't close connections, or they just get killed by Tomcat - who knows?

best regards

Clóvis

Former Member

> For a much more detailed description of this information, you may want to check the MaxDB internals training material on http://maxdb.sap.com or actually attend the SAP training.

Here in Brazil the SAP folks don't offer MaxDB training; is there any e-learning option?

For Linux I must use strace, but I don't know if I have the knowledge to interpret strace applied to the x_server; maybe it will generate too much information that shows me nothing. I did run a tcpdump sniffer on port 7210 and didn't find any problem with the network communications; maybe I must keep this sniffer active until the next stuck connection occurs. But for now I think I will first solve the low-selectivity statements, which in some way will help avoid stuck threads.

One thing I see with the sniffer is that on connection JDBC reads something from the catalog; maybe a catalog lock is causing this stuck thread? Like a table lock, or something that has to lock the catalog?

Information about the MaxDB native protocol would be handy here for a deeper understanding of how MaxDB deals with JDBC connections, but that is hidden information for me, here on the dark side - oops, I mean the community side.

best regards.

Clóvis

Former Member

Hi, Lars,

Today I saw a strange problem; using x_cons sh ac I see:

T264 8 -1 User 0* Vwait 0 0 167 1901051(s)

Task 264 is waiting for a lock held by task 167 to be released.

But searching for this task 167, I can't find it using x_cons sh ac.

In Database Studio I see the same thing: task 264 is waiting for task 167.

Then I issued an x_cons show all to see what info I could get about task 167:

The info is:

Task T167    
    task type / state                     USER             / Inactive
    stack start - end                    @0x2b1e00f000 - @0x2b1e08f000
    stack size                            524288 (0x00080000)
    guard area start - end               @0x2b1e08f000 - @0x2b1e090000
    last / max dispatch stack pointer    @0x2b1e08ee2c / @0x2b1e04b6ec
    current / all command max stack usage 165080 / 165080

and

--------------------  T167  USER              ( pid =      0 ) ---------------
 dispatcher_cnt: 8                                     command_cnt   : 6
 exclusive_cnt : 6000730                               self_susp_cnt : 0
 Resume count 0  total 1          History [ T264 ]
 state_vwait   : 0          state_vsleep  : 0          state_vsusp   : 1
 rcv_rpl_count : 6          rcv_rpl_long  : 0          avg_rcv_rpl_t : 0.0000
 prio_total    : 1                                     prio_from_oth : 1
-------------------------------------------------------------------------------

With this we can see that task 167 is inactive and task 264 is waiting for that inactive task; I think this never ends, right? As I can't kill an inactive task, I killed all the active tasks, and the system came back to life.

Maybe I found a pattern in these stuck connections; the question now is why a task is waiting for an inactive task?

Another thing I learned is that new connections can't be made, maybe when the catalog is locked: a JDBC connection issues commands that read the catalog the first time, and if the catalog is locked, the connection must wait for the catalog to be unlocked. I don't know whether the catalog receives locks, but in the diagnose monitor I see many commands, maybe issued by the JDBC driver, that have low selectivity. It's just a thought; I really don't know if this is right.

best regards.

Clóvis

lbreddemann
Active Contributor

> Today I saw a strange problem; using x_cons sh ac I see:
>
> T264 8 -1 User 0* Vwait 0 0 167 1901051(s)
>
> Task 264 is waiting for a lock held by task 167 to be released.
>
> But searching for this task 167, I can't find it using x_cons sh ac.

Ok, so the task is currently inactive.

> Then I issued an x_cons show all to see what info I could get about task 167:

Just for completeness, you could/should have also used


x_cons show t_cnt 167

to get the infos.

> With this we can see that task 167 is inactive and task 264 is waiting for that inactive task; I think this never ends, right? As I can't kill an inactive task, I killed all the active tasks, and the system came back to life.

> Maybe I found a pattern in these stuck connections; the question now is why a task is waiting for an inactive task?

Well, the task is inactive in the database kernel.

That does not mean that it's kind of "lost" or something.

The client has opened a transaction, got some information back from the kernel, and is now doing something else - leaving the transaction open and all locks in place.

Would be interesting to know what client process hangs on this session.

Although x_cons does not show the pid for JDBC processes, you can figure it out by querying the SESSIONS system table:


select * from sessions where taskid=167

Maybe this client is currently hanging on the JDBC network issue you face...

And you're right - as long as the session doesn't do anything it won't get cancelled/cleaned up..

> Another thing I learned is that new connections can't be made, maybe when the catalog is locked: a JDBC connection issues commands that read the catalog the first time, and if the catalog is locked, the connection must wait for the catalog to be unlocked. I don't know whether the catalog receives locks, but in the diagnose monitor I see many commands, maybe issued by the JDBC driver, that have low selectivity. It's just a thought; I really don't know if this is right.

Hmm... catalog accesses cannot be tuned since most of the tables aren't normal tables but "special" structures. So even if the access to them might be non-optimal there's nothing we can do about it.

regards,

Lars

Former Member

Hi Lars,

The next time this problem occurs I will use x_cons as you explained. But to be better prepared: regarding the statement to get the pid for JDBC, is that the APPLICATIONPROCESS or the PEERPROCESSID column? I ask because here these two columns contain 0 and null values. Or is it another column, or must I change something in the JDBC connection to activate these values?

By the way, is there any way to discover the last statement executed by an inactive task?

best regards

Clóvis

lbreddemann
Active Contributor

> I ask because here these two columns contain 0 and null values. Or is it another column, or must I change something in the JDBC connection to activate these values?

Oh, sure - you're using JDBC... sorry, again, but with JDBC these values are never filled.

juuuuust great

> By the way, is there any way to discover the last statement executed by an inactive task?

You mean apart from tracing each and every statement on application level?

You guessed it... no, there's no such option.

Since tracing on the application level could also help you with the question "which task id is my task id", I can only propose to either use [jdbc tracing|http://maxdb.sap.com/doc/7_7/46/7bc389e97150d1e10000000a155369/content.htm] or to implement your own tracing.

regards,

Lars

Former Member

Hi Lars,

I'm ending this thread; a lot of work was done. I must review how the application works with MaxDB and create my own database controller on the application side. For now, the help from Elke helped me too: in SELECT statements where I can't avoid table locks, using ISOLATION LEVEL 0 solved the problem where an INSERT on table TITULOPAGAMENTO waits until a long analysis SELECT ends.

By the way, when you find out whether there is a way for the open-source side to contract some support for the MaxDB database, I would appreciate knowing how, because I think that in some cases only a specialist can point out what is wrong with the database or with the application.

Thanks for all the help and the time spent.

best regards

Clóvis