Skip to Content
0

SAP stuck, all the DIA processes in WP_HOLD state

Dec 05, 2017 at 09:35 AM

546

avatar image
Former Member

Dear Experts,

this is perhaps a bit long and complicated, but this is exactly why I need your help and advice.
We are running 2 systems in one Linux box. This night, the system was overloaded around 2:30. The monitoring reported issues as high I/O load, high CPU load, etc. However, one system managed to recover from that state. I was called at 6 AM for the other system, which I found in a very miserable state - all the DIA WP were hanging and no SAP GUI session could be started. Here is what I could see in dpmon:

Workprocess Table (long) Tue Dec 5 05:26:17 2017
------------------------------------------------------------

Current snapshot id: 17
Number of preemptions : 12

|No |Pid |Type|State |Cause|Err|Prio|Sess-Key |Sess-Type|Locked|Sem|Time |Program |Cli|User |Action |Action-Info |
|---+--------+----+-------+-----+---+----+----------------+---------+------+---+-----+----------------------------------------+---+------------+--------------------+--------------------|
| 0|25524 |DIA |WP_HOLD|ADM | |high|T11_U29665_M0 |INTERNAL | | |12262|SAPLTHFB |000|SAPSYS | | |
| 1|25525 |DIA |WP_HOLD|MSGOP| |low |T23_U29689_M0 |ASYNC_RFC| | |12259|SAPLMSGSERVER |100|BWREMOTE | | |
| 2|25526 |DIA |WP_HOLD|MSGOP| |low |T29_U29561_M0 |ASYNC_RFC| | |12529|SAPLMSGSERVER |100|BWREMOTE | | |
| 3|25527 |DIA |WP_HOLD|MSGOP| |low |T2_U29498_M0 |ASYNC_RFC| | |13390|SAPLMSGSERVER |100|BWREMOTE | | |
| 4|25528 |DIA |WP_HOLD|MSGOP| |low |T18_U29662_M0 |ASYNC_RFC| | |12265|SAPLMSGSERVER |100|BWREMOTE | | |
| 5|25529 |DIA |WP_HOLD|MSGOP| |high|T10_U29507_M0 |INTERNAL | | |13342|SAPLTHFB |000|SAPSYS | | |
| 6|25530 |DIA |WP_HOLD|MSGOP| |high|T15_U29495_M0 |INTERNAL | | |13403|SAPLTHFB |000|SAPSYS | | |
| 7|25531 |DIA |WP_HOLD|MSGOP| |low |T8_U30317_M0 |ASYNC_RFC| | |10408|SAPLMSGSERVER |100|BWREMOTE | | |
| 8|25532 |UPD | | | | | | | | | | | | | | |
| 9|2219 |BTC |WP_HOLD|MSGOP| |low |T26_U29693_M0 |BATCH | | |12257|SAPLMSGSERVER |100|BWREMOTE | | |
| 10|1252 |BTC | | | | | | | | | | | | | | |
| 11|1717 |BTC |WP_HOLD|MSGOP| |low |T12_U29694_M0 |BATCH | | |12257|SAPLMSGSERVER |100|BWREMOTE | | |
| 12|1185 |BTC | | | | | | | | | | | | | | |
| 13|31519 |BTC | | | | | | | | | | | | | | |
| 14|18457 |BTC | | | | | | | | | | | | | | |
| 15|1676 |BTC | | | | | | | | | | | | | | |
| 16|1821 |BTC |WP_HOLD|MSGOP| |low |T30_U29692_M0 |BATCH | | |12257|SAPLMSGSERVER |100|BWREMOTE | | |
| 17|25541 |SPO | | | | | | | | | | | | | | |
| 18|25543 |UP2 | | | | | | | | | | | | | | |
| 19|9482 |DIA |WP_HOLD|MSGOP| |low |T20_U30311_M0 |ASYNC_RFC| | |10414|SAPLMSGSERVER |100|BWREMOTE | | |
| 20|7726 |DIA |WP_HOLD|MSGOP| |low |T3_U30238_M0 |ASYNC_RFC| | |10628|SAPLMSGSERVER |100|BWREMOTE | | |
| 21|9352 |DIA |WP_HOLD|MSGOP| |low |T31_U30303_M0 |ASYNC_RFC| | |10442|SAPLMSGSERVER |100|BWREMOTE | | |
| 22|9445 |DIA |WP_HOLD|MSGOP| |low |T22_U30352_M0 |ASYNC_RFC| | |10314|<HANDLE RFC> |100|BWREMOTE | | |
| 23|3553 |DIA |WP_HOLD|MSGOP| |low |T16_U29946_M0 |ASYNC_RFC| | |11532|SAPLMSGSERVER |100|BWREMOTE | | |

Found 24 active workprocesses

I tried to google what the hell is MSGOP, but nothing !!!

Investigation of another log files:

- dev_ms: even though, the problem was most probably that the connection to the MS didn't work, this file doesn't contain almost no useful info:

[Thr 140242576963392] Tue Dec 5 02:42:53 2017
[Thr 140242576963392] LOGOUT: cist-bi1_BID_00 [172.19.8.45] (DIA UPD BTC SPO UP2 ICM )

[Thr 140242576963392] Tue Dec 5 02:44:49 2017
[Thr 140242576963392] LOGIN: cist-bi1_BID_00 [127.0.0.1] (DIA UPD BTC SPO UP2 ICM )

[Thr 140242576963392] Tue Dec 5 07:06:14 2017
[Thr 140242576963392] *** ERROR => NiIRead: invalid data (0xfff4fffd/0x7d6e;mode=0;hdl 66;peer=127.0.0.1:24467;local=3601) [nixxi.cpp 5192]

[Thr 140242576963392] Tue Dec 5 07:06:50 2017
[Thr 140242576963392] *** ERROR => NiIRead: invalid data (0xfff4fffd/0x7d6e;mode=0;hdl 67;peer=127.0.0.1:27892;local=3601) [nixxi.cpp 5192]

- dev_disp:

Tue Dec 5 02:28:48 2017
*** ERROR => DpCalcTimeSlice: bad time interval [(1512437328/161886), (1512437328161878)] [dpxxwp.c 2153]

Tue Dec 5 02:33:56 2017
DpUpdateStatusFileWith: state=YELLOW, reason=Request handling without progress

Tue Dec 5 02:33:57 2017
*** WARNING => DpMsCheck: no answer from msg server since 135 secs, but dp_ms_keepalive_timeout(300 secs) not reached [dpMessageSer 3453]

Tue Dec 5 02:33:59 2017
DpSendLoadInfo: quota for load / queue fill level = 7.200000 / 5.000000
DpSendLoadInfo: queue DIA now with high load, load / queue fill level = 7.380309 / 0.007143

Tue Dec 5 02:34:21 2017
DpSendLoadInfo: queue DIA no longer with high load

Tue Dec 5 02:36:26 2017
DpUpdateStatusFileWith: state=YELLOW, reason=Request handling without progress

Tue Dec 5 02:36:28 2017
*** WARNING => DpRequestProcessingCheck: potential request processing problem detected (1. check) [dpxxwp.c 4417]
*** WARNING => DpRequestProcessingCheck: potential request processing problem detected (2. check) [dpxxwp.c 4417]

Tue Dec 5 02:36:29 2017
*** WARNING => DpRequestProcessingCheck: potential request processing problem detected (3. check) [dpxxwp.c 4417]

Then some problems with network interfaces are logged:

Tue Dec 5 02:40:45 2017
***LOG Q0I=> NiIWrite: 127.0.0.1:0: writev
Tue Dec 5 02:40:49 2017
(32: Broken pipe) [/bas/742_REL/src/base/ni/nixxi.cpp 4781]

Tue Dec 5 02:42:32 2017
*** ERROR => NiIWrite: SiSendV failed for hdl 69/sock 14
(SI_ECONN_BROKEN/32; UD; ST; 127.0.0.1:0) [nixxi.cpp 4781]
*** ERROR => NiICheck: NiBufIWriteUserMsg failed (rc=-6) [nixxi.cpp 6400]
*** ERROR => NiBufIProcMsg: NiICheck failed (rc=-6) [nibuf.cpp 1865]
*** ERROR => NiBufIIn: NiBufIProcMsg failed (rc=-6) [nibuf.cpp 2789]

Tue Dec 5 02:42:48 2017
*** ERROR => DpMsCheck: Waiting for answer from msg server since 300 secs [dpMessageSer 3432]

Tue Dec 5 02:42:52 2017
*** ERROR => DpMsCheck: Connection to msg server will be closed [dpMessageSer 3434]
***LOG Q0M=> DpMsDetach, ms_detach () [dpMessageSer 2079]

Tue Dec 5 02:42:53 2017
MBUF state OFF
DpMBufTypeMod: MBUF component DOWN (event=MBUF_DPEVT_DOWN)

Tue Dec 5 02:43:28 2017
MBUF state OFF

Tue Dec 5 02:43:29 2017
DpMBufTypeMod: MBUF component DOWN (event=MBUF_DPEVT_DOWN)

Tue Dec 5 02:43:37 2017
DpMsCheck: DpMsAttach failed (-1), retry msg server attach in 1 sec

Tue Dec 5 02:46:14 2017
***LOG Q0I=> NiIWrite: 127.0.0.1:0: writev (32: Broken pipe) [/bas/742_REL/src/base/ni/nixxi.cpp 4781]

Tue Dec 5 02:46:33 2017
*** ERROR => NiIWrite: SiSendV failed for hdl 70/sock 14
(SI_ECONN_BROKEN/32; UD; ST; 127.0.0.1:0) [nixxi.cpp 4781]
*** ERROR => NiICheck: NiBufIWriteUserMsg failed (rc=-6) [nixxi.cpp 6400]
*** ERROR => NiBufIProcMsg: NiICheck failed (rc=-6) [nibuf.cpp 1865]
*** ERROR => NiBufIIn: NiBufIProcMsg failed (rc=-6) [nibuf.cpp 2789]

And then, I guess, around 3:32h the processes got stuck:

Tue Dec 5 03:32:51 2017
*** WARNING => DpDeadlockCheck: potential deadlock detected (1. check) [dpxxwp.c 4301]

Tue Dec 5 03:32:58 2017
DpSendLoadInfo: quota for load / queue fill level = 11.700000 / 5.000000
DpSendLoadInfo: queue DIA now with high load, load / queue fill level = 11.773900 / 0.021429

Tue Dec 5 03:33:06 2017
*** WARNING => DpDeadlockCheck: potential deadlock detected (1. check) [dpxxwp.c 4301]

Tue Dec 5 03:40:55 2017
DpUpdateStatusFileWith: state=YELLOW, reason=Request handling without progress
*** WARNING => DpDeadlockCheck: potential deadlock detected (31. check) [dpxxwp.c 4301]
*** WARNING => DpRequestProcessingCheck: potential request processing problem detected (299. check) [dpxxwp.c 4417]

Tue Dec 5 03:40:56 2017
*** WARNING => DpDeadlockCheck: potential deadlock detected (32. check) [dpxxwp.c 4301]

********** SERVER SNAPSHOT 1 (Reason: Request handling without progress) - begin **********

Server cist-bi1_BID_00

Force W0 in state WP_HOLD to dump stack
Force W1 in state WP_HOLD to dump stack
Force W2 in state WP_HOLD to dump stack
Force W3 in state WP_HOLD to dump stack
Force W4 in state WP_HOLD to dump stack
Force W5 in state WP_HOLD to dump stack
Force W6 in state WP_HOLD to dump stack
Force W7 in state WP_HOLD to dump stack
Force W9 in state WP_HOLD to dump stack
Force W11 in state WP_HOLD to dump stack
Force W16 in state WP_HOLD to dump stack
Force W19 in state WP_HOLD to dump stack
Force W20 in state WP_HOLD to dump stack
Force W21 in state WP_HOLD to dump stack
Force W22 in state WP_HOLD to dump stack
Force W23 in state WP_HOLD to dump stack

Scheduler info
--------------
WP info
DpNumberOfDiaWps: dia_wps 13, standby_wps 0
#dia = 13
#btc = 8
#standby = 0
#max = 24
General Scheduler info
preemptionInfo.isActive = true
preemptionInfo.timeslice = 500
preemptionInfo.checkLoad = true
Prio Class High
maxRuntime[RQ_Q_PRIO_HIGH] = infinite
maxRuntimeHalf[RQ_Q_PRIO_HIGH] = infinite

Prio Class Normal
maxRuntime[RQ_Q_PRIO_NORMAL] = infinite
maxRuntimeHalf[RQ_Q_PRIO_NORMAL] = infinite
maxTicketsForPrio[RQ_Q_PRIO_NORMAL] = 11
withPrioTickets[RQ_Q_PRIO_NORMAL] = true
Tue Dec 5 03:40:56 2017
Prio Class Low
maxRuntime[RQ_Q_PRIO_LOW] = infinite
maxRuntimeHalf[RQ_Q_PRIO_LOW] = infinite
maxTicketsForPrio[RQ_Q_PRIO_LOW] = 11
withPrioTickets[RQ_Q_PRIO_LOW] = true
Actual tickets in use
actTicketsInUseForPrio[RQ_Q_PRIO_NORMAL] = 10
actTicketsInUseForPrio[RQ_Q_PRIO_LOW] = 10
Running requests[RQ_Q_PRIO_NORMAL] = 0
Running requests[RQ_Q_PRIO_LOW] = 10

Queue Statistics Tue Dec 5 02:40:56 2017
------------------------------------------------------------

Number of lost wakeup datagrams: 2

Max. number of queue elements : 14000

DIA : 51 (peak 88, writeCount 588575, readCount 588524)
UPD : 0 (peak 7, writeCount 1582, readCount 1582)
ENQ : 0 (peak 0, writeCount 0, readCount 0)
BTC : 0 (peak 10, writeCount 142588, readCount 142588)
SPO : 0 (peak 2, writeCount 12506, readCount 12506)
UP2 : 0 (peak 1, writeCount 1291, readCount 1291)
DISP: 0 (peak 13, writeCount 106539, readCount 106539)
GW : 0 (peak 22, writeCount 290065, readCount 290065)
ICM : 0 (peak 1, writeCount 22266, readCount 22266)
LWP : 5 (peak 20, writeCount 26570, readCount 26565)

Session queue dump (high priority, 1 elements, peak 3):
-1 <- 29 < EmbryoQueue_DIA> -> -1
Session queue dump (normal priority, 18 elements, peak 18):

-1 <- 77 < T43_U30411_M0> -> 78
77 <- 78 < T44_U30412_M0> -> 74
78 <- 74 < T40_U30394_M0> -> 73
74 <- 73 < T39_U30393_M0> -> 76
73 <- 76 < T42_U30403_M0> -> 75
76 <- 75 < T41_U30402_M0> -> 72
75 <- 72 < T38_U30385_M0> -> 71
72 <- 71 < T37_U30384_M0> -> 67
71 <- 67 < T6_U30374_M0> -> 57
67 <- 57 < T19_U30373_M0> -> 58
57 <- 58 < T28_U30365_M0> -> 60
58 <- 60 < T35_U30364_M0> -> 68
60 <- 68 < T32_U30356_M0> -> 66
68 <- 66 < T27_U30354_M0> -> 48
66 <- 48 < T21_U30355_M0> -> 41
48 <- 41 < T25_U29568_M0> -> 69
41 <- 69 < T17_U30376_M0> -> 70
69 <- 70 < T36_U30377_M0> -> -1
Session queue dump (low priority, 0 elements, peak 6):

Requests in queue <W2> (1 requests, queue in use):
- 1 requests for handler REQ_HANDLER_PROCESS
Requests in queue <W3> (1 requests, queue in use):
- 1 requests for handler REQ_HANDLER_PROCESS
Requests in queue <W4> (1 requests, queue in use):
- 1 requests for handler REQ_HANDLER_PROCESS
Requests in queue <W5> (1 requests, queue in use):
- 1 requests for handler REQ_HANDLER_PROCESS
Requests in queue <W6> (1 requests, queue in use):
- 1 requests for handler REQ_HANDLER_PROCESS
Requests in queue <EmbryoQueue_DIA> (9 requests):
- 1 requests for handler REQ_HANDLER_BUFREF
- 1 requests for handler REQ_HANDLER_DDLOG
- 1 requests for handler REQ_HANDLER_AUTOTH
- 1 requests for handler REQ_HANDLER_AUTOCCMS
- 1 requests for handler REQ_HANDLER_DELAY
- 1 requests for handler REQ_HANDLER_ADM
- 1 requests for handler REQ_HANDLER_BTCSCHED
- 1 requests for handler REQ_HANDLER_BGRFC_WATCHDOG
- 1 requests for handler REQ_HANDLER_AUTOSECURITY
Requests in queue <T25_U29568_M0> (1 requests):
- 1 requests for handler REQ_HANDLER_RFC
Requests in queue <T21_U30355_M0> (1 requests):
- 1 requests for handler REQ_HANDLER_RFC

Requests in queue <T19_U30373_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T28_U30365_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T35_U30364_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T27_U30354_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T6_U30374_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T32_U30356_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T17_U30376_M0> (1 requests):
- 1 requests for handler REQ_HANDLER_RFC
Requests in queue <T36_U30377_M0> (1 requests):
- 1 requests for handler REQ_HANDLER_RFC
Requests in queue <T37_U30384_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T38_U30385_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T39_U30393_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T40_U30394_M0> (3 requests):
- 2 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T41_U30402_M0> (2 requests):
- 1 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T42_U30403_M0> (2 requests):
- 1 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T43_U30411_M0> (2 requests):
- 1 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION
Requests in queue <T44_U30412_M0> (2 requests):

- 1 requests for handler REQ_HANDLER_PLUGIN
- 1 requests for handler REQ_HANDLER_SESSION


Infos about some special queues:

Queue <StartServiceQueue> in slot 3 (port=0) has no requests

Workprocess Table (long) Tue Dec 5 02:40:56 2017
------------------------------------------------------------

Current snapshot id: 1
Number of preemptions : 12

|No |Pid |Type|State |Cause|Err|Prio|Sess-Key |Sess-Type|Locked|Sem|Time |Program |Cli|User |Action |Action-Info |
|---+--------+----+-------+-----+---+----+----------------+---------+------+---+-----+----------------------------------------+---+------------+--------------------+--------------------|
| 0|25524 |DIA |WP_HOLD|ADM | |high|T11_U29665_M0 |INTERNAL | | | 2341|SAPLTHFB |000|SAPSYS | | |
| 1|25525 |DIA |WP_HOLD|MSGOP| |low |T23_U29689_M0 |ASYNC_RFC| | | 2338|SAPLMSGSERVER |100|BWREMOTE | | |
| 2|25526 |DIA |WP_HOLD|MSGOP| |low |T29_U29561_M0 |ASYNC_RFC| | | 2608|SAPLMSGSERVER |100|BWREMOTE | | |
| 3|25527 |DIA |WP_HOLD|MSGOP| |low |T2_U29498_M0 |ASYNC_RFC| | | 3469|SAPLMSGSERVER |100|BWREMOTE | | |
| 4|25528 |DIA |WP_HOLD|MSGOP| |low |T18_U29662_M0 |ASYNC_RFC| | | 2344|SAPLMSGSERVER |100|BWREMOTE | | |
| 5|25529 |DIA |WP_HOLD|MSGOP| |high|T10_U29507_M0 |INTERNAL | | | 3421|SAPLTHFB |000|SAPSYS | | |
| 6|25530 |DIA |WP_HOLD|MSGOP| |high|T15_U29495_M0 |INTERNAL | | | 3482|SAPLTHFB |000|SAPSYS | | |
| 7|25531 |DIA |WP_HOLD|MSGOP| |low |T8_U30317_M0 |ASYNC_RFC| | | 487|SAPLMSGSERVER |100|BWREMOTE | | |
| 9|2219 |BTC |WP_HOLD|MSGOP| |low |T26_U29693_M0 |BATCH | | | 2336|SAPLMSGSERVER |100|BWREMOTE | | |
| 11|1717 |BTC |WP_HOLD|MSGOP| |low |T12_U29694_M0 |BATCH | | | 2336|SAPLMSGSERVER |100|BWREMOTE | | |
| 16|1821 |BTC |WP_HOLD|MSGOP| |low |T30_U29692_M0 |BATCH | | | 2336|SAPLMSGSERVER |100|BWREMOTE | | |
| 19|9482 |DIA |WP_HOLD|MSGOP| |low |T20_U30311_M0 |ASYNC_RFC| | | 493|SAPLMSGSERVER |100|BWREMOTE | | |
| 20|7726 |DIA |WP_HOLD|MSGOP| |low |T3_U30238_M0 |ASYNC_RFC| | | 707|SAPLMSGSERVER |100|BWREMOTE | | |
| 21|9352 |DIA |WP_HOLD|MSGOP| |low |T31_U30303_M0 |ASYNC_RFC| | | 521|SAPLMSGSERVER |100|BWREMOTE | |

|
| 22|9445 |DIA |WP_HOLD|MSGOP| |low |T22_U30352_M0 |ASYNC_RFC| | | 393|<HANDLE RFC> |100|BWREMOTE | | |
| 23|3553 |DIA |WP_HOLD|MSGOP| |low |T16_U29946_M0 |ASYNC_RFC| | | 1611|SAPLMSGSERVER |100|BWREMOTE | | |


10 |10000 characters needed characters left characters exceeded
* Please Login or Register to Answer, Follow or Comment.

7 Answers

Best Answer
Mario De Felipe Dec 05, 2017 at 12:20 PM
0

Hi Symon, assuming this is not production enviroment, I would do (apart from decreasing your number of work processess for each SAP server);

rdisp/max_wprun_time = 0 (No, please eliminate the parameter from RZ10)

rdisp/scheduler/prio_normal/max_quota = 50%

rdisp/scheduler/prio_low/max_quota = 40%

rdisp/wp_max_no, set a value, lower than your current limit

rdisp/configurable_wp_no set a value, lower than your current limit

rdisp/wp_no_restricted = 3 (leave 3 wp as restricted)

lets see how it goes.

Share
10 |10000 characters needed characters left characters exceeded
avatar image
Former Member Dec 05, 2017 at 09:36 AM
0

dev_w0:

B Tue Dec 5 02:34:01 2017
B *** ERROR => dbsync[check_sync_interval]: 212 seconds passed since last synchronisation
last sync: 20171205023029, now: 20171205023401
[dbsync.c 4549]
B {root-id=005056BF236B1EE7B6AB8BA7155C303C}_{conn-id=00000000000000000000000000000000}_0
B
B Tue Dec 5 02:34:13 2017
B ***LOG BS5=> Buffer synchronisation has not been called for 212 seconds [dbsync 4550]
B
B Tue Dec 5 02:39:58 2017
B RECONNECT: rsdb/reco_trials: 3
B RECONNECT: rsdb/reco_sleep_time: 5
B RECONNECT: rsdb/reco_sync_all_server: OFF
B ***LOG BV4=> reconnect state is set for the work process [dblink 2136]
B ***LOG BYY=> work process left reconnect status [dblink 2137]
M ThHdlReconnect: reconnect o.k.
B dbsync[db_syexe]: wait=0, call_no=2774, current_ts=20171205023958, last_counter=50805904
B *** ERROR => dbsync[check_sync_interval]: 357 seconds passed since last synchronisation
last sync: 20171205023401, now: 20171205023958
[dbsync.c 4549]
B {root-id=005056BF236B1EE7B6ABA6799DB9F054}_{conn-id=00000000000000000000000000000000}_0
B ***LOG BS5=> Buffer synchronisation has not been called for 357 seconds [dbsync 4550]
M
M Tue Dec 5 02:46:33 2017
M MBUF info for hooks: MS component DOWN
M
M Tue Dec 5 02:46:49 2017
M ThVBHdlMsgDown: msg down
M ThIVBChangeState: update deactivated
M ***LOG R0R=> ThIVBChangeState, update deactivated () [thxxvb.c 11344]
M
M Tue Dec 5 02:57:16 2017
M ***LOG R0Z=> ThResetVBDispatching, reset update dispatching info () [thxxvb.c 8328]
M MBUF info for hooks: MS component DOWN
M
M Tue Dec 5 02:57:17 2017
M DpMBufReload: already loading
M

M Tue Dec 5 02:57:18 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=005056BF236B1EE7B6ABF3E02460F073}_{conn-id=00000000000000000000000000000000}_0
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:19 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=005056BF236B1EE7B6ABF3E02460F073}_{conn-id=00000000000000000000000000000000}_0
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:20 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=005056BF236B1EE7B6ABF3E02460F073}_{conn-id=00000000000000000000000000000000}_0
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:21 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=005056BF236B1EE7B6ABF3E02460F073}_{conn-id=00000000000000000000000000000000}_0
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:22 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=005056BF236B1EE7B6ABF3E02460F073}_{conn-id=00000000000000000000000000000000}_0
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:23 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=5A2060DBE78D63B8E1000000AC13082D}_1
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:24 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=5A2060DBE78D63B8E1000000AC13082D}_1
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:25 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=5A2060DBE78D63B8E1000000AC13082D}_1
M
M Tue Dec 5 02:57:26 2017
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:27 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]

M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=5A2060DBE78D63B8E1000000AC13082D}_1
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:28 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=5A2060DBE78D63B8E1000000AC13082D}_1
A TH VERBOSE LEVEL FULL
A ** RABAX: level LEV_RX_PXA_RELEASE_MTX entered.
A ** RABAX: level LEV_RX_PXA_RELEASE_MTX completed.
A ** RABAX: level LEV_RX_COVERAGE_ANALYSER entered.
A ** RABAX: level LEV_RX_COVERAGE_ANALYSER completed.
A ** RABAX: level LEV_RX_HOTSPOT_TRACE entered.
A ** RABAX: level LEV_RX_HOTSPOT_TRACE completed.
A ** RABAX: level LEV_RX_SAVE_SHMLOCKS entered.
A ** RABAX: level LEV_RX_SAVE_SHMLOCKS completed.
A ** RABAX: level LEV_RX_RESET_SHMLOCKS entered.
A ** RABAX: level LEV_RX_RESET_SHMLOCKS completed.
A ** RABAX: level LEV_RX_ROLLBACK entered.
A ** RABAX: level LEV_RX_ROLLBACK completed.
A ** RABAX: level LEV_RX_DB_ALIVE entered.
A ** RABAX: level LEV_RX_DB_ALIVE completed.
A ** RABAX: level LEV_RX_HOOKS entered.
A ** RABAX: level LEV_RX_HOOKS completed.
A ** RABAX: level LEV_RX_STANDARD entered.
A ** RABAX: level LEV_RX_STANDARD completed.
A ** RABAX: level LEV_RX_STOR_VALUES entered.
A ** RABAX: level LEV_RX_STOR_VALUES completed.
A ** RABAX: level LEV_RX_C_STACK entered.
A ** RABAX: level LEV_RX_C_STACK completed.
A ** RABAX: level LEV_RX_MEMO_CHECK entered.
A ** RABAX: level LEV_RX_MEMO_CHECK completed.
A ** RABAX: level LEV_RX_AFTER_MEMO_CHECK entered.
A ** RABAX: level LEV_RX_AFTER_MEMO_CHECK completed.
A ** RABAX: level LEV_RX_INTERFACES entered.
A ** RABAX: level LEV_RX_INTERFACES completed.
A ** RABAX: level LEV_RX_GET_MESS entered.
A ** RABAX: level LEV_RX_GET_MESS completed.
A ** RABAX: level LEV_RX_INIT_SNAP entered.
A ** RABAX: level LEV_RX_INIT_SNAP completed.
A ** RABAX: level LEV_RX_WRITE_SYSLOG entered.
A ** RABAX: level LEV_RX_WRITE_SYSLOG completed.

A ** RABAX: level LEV_RX_WRITE_SNAP_BEG entered.
A ** RABAX: level LEV_RX_WRITE_SNAP_BEG completed.
A ** RABAX: level LEV_RX_WRITE_SNAP entered.
A
A Tue Dec 5 02:57:30 2017
A ** RABAX: level LEV_SN_END completed.
A ** RABAX: level LEV_RX_WRITE_SNAP_END entered.
A ** RABAX: level LEV_RX_WRITE_SNAP_END completed.
A ** RABAX: level LEV_RX_RAL_FOR_RABAX entered.
A ** RABAX: level LEV_RX_RAL_FOR_RABAX completed.
A ** RABAX: level LEV_RX_SET_ALERT entered.
A ** RABAX: level LEV_RX_SET_ALERT completed.
A ** RABAX: level LEV_RX_COMMIT entered.
A ** RABAX: level LEV_RX_COMMIT completed.
A ** RABAX: level LEV_RX_SNAP_SYSLOG entered.
A ** RABAX: level LEV_RX_SNAP_SYSLOG completed.
A ** RABAX: level LEV_RX_RESET_PROGS entered.
A ** RABAX: level LEV_RX_RESET_PROGS completed.
A ** RABAX: level LEV_RX_STDERR entered.
A Tue Dec 5 02:57:30 2017
A
A ABAP Program SAPLTHFB .
A Source LTHFBU01 Line 115.
A Error Code RAISE_EXCEPTION.
A Module $Id: //bas/742_STACK/src/krn/abap/runt/abfunc.c#2 $ SAP.
A Function ab_jfune Line 5928.
A ** RABAX: level LEV_RX_STDERR completed.
A ** RABAX: level LEV_RX_RFC_ERROR entered.
A ** RABAX: level LEV_RX_RFC_ERROR completed.
A ** RABAX: level LEV_RX_RFC_CLOSE entered.
A ** RABAX: level LEV_RX_RFC_CLOSE completed.
A ** RABAX: level LEV_RX_IMC_ERROR entered.
A ** RABAX: level LEV_RX_IMC_ERROR completed.
A ** RABAX: level LEV_RX_APC_ERROR entered.
A ** RABAX: level LEV_RX_APC_ERROR completed.
A ** RABAX: level LEV_RX_DATASET_CLOSE entered.
A ** RABAX: level LEV_RX_DATASET_CLOSE completed.
A ** RABAX: level LEV_RX_ERROR_SAVE entered.
A ** RABAX: level LEV_RX_ERROR_SAVE completed.
A ** RABAX: level LEV_RX_ERROR_TPDA entered.
A ** RABAX: level LEV_RX_ERROR_TPDA completed.
A ** RABAX: level LEV_RX_PXA_RELEASE_RUDI entered.
A ** RABAX: level LEV_RX_PXA_RELEASE_RUDI completed.
A ** RABAX: level LEV_RX_LIVE_CACHE_CLEANUP entered.

A ** RABAX: level LEV_RX_LIVE_CACHE_CLEANUP completed.
A ** RABAX: level LEV_RX_END entered.
A ** RABAX: level LEV_RX_END completed.
A ** RABAX: end RX_RFC
M
M Tue Dec 5 02:57:55 2017
M DpMBufReload: already loading
M
M Tue Dec 5 02:57:56 2017
M *** ERROR => ThMBufGet: ThMBufFullRead [thxxmbuf.c 221]
M {root-id=00155DAA641E1EE59FDC06201CB6217A}_{conn-id=045DD9E7E17EF1E289F300155DAA641E}_1


and this goes on and on and on... And then snapshots are starting

M Tue Dec 5 03:40:56 2017
M
M ********** SERVER SNAPSHOT 1 - begin **********
M
M ------------------ C-STACK ----------------------
dw.sapBID_DVEBMGS00[S](LinStackBacktrace+0x8c)[0x63a3c9]
dw.sapBID_DVEBMGS00[S](LinStack+0x35)[0x63e410]
dw.sapBID_DVEBMGS00[S](CTrcStack2+0x4e)[0x63a18c]
dw.sapBID_DVEBMGS00[S](ThStackHandler()+0xb1)[0x4ea152]
dw.sapBID_DVEBMGS00[S](DpTrcOnOff(int)+0x212)[0x453069]
<signal handler called> [0x7f82a708a130]
libc.so.6[S](__GI_semtimedop+0xa)[0x7f82a6db1a3a]
dw.sapBID_DVEBMGS00[S](SemTimedOp+0x40)[0x19ef6c0]
dw.sapBID_DVEBMGS00[S](EvtWtRst+0x72)[0x19eee42]
dw.sapBID_DVEBMGS00[S](RqQWorkerWaitForRequests(unsigned int, unsigned char, RQ_Q_PRIO, unsigned int*, int)+0x2da)[0x19afc8a]
dw.sapBID_DVEBMGS00[S](ThRqCheckQueues(int, REQUEST_BUF**)+0xb05)[0x1a02075]
dw.sapBID_DVEBMGS00[S](ThRqGetNextRequest(int, REQUEST_BUF**)+0x30)[0x4af49d]
dw.sapBID_DVEBMGS00[S](ThRqWaitFor(int, REQUEST_BUF**)+0x36)[0x4af687]
dw.sapBID_DVEBMGS00[S](ThRqAcceptImpl(unsigned char, int, REQUEST_BUF**)+0x118)[0x1a02858]
dw.sapBID_DVEBMGS00[S](ThIRqSendWithReplyInline(REQUEST_BUF*, int, unsigned char, REQUEST_BUF**)+0x180)[0x4afeee]
dw.sapBID_DVEBMGS00[S](ThSysAdmMsg()+0x44d)[0x1a45efd]
dw.sapBID_DVEBMGS00[S](ThSysInfo+0x89)[0x1a50d09]
dw.sapBID_DVEBMGS00[S](ab_jcaly()+0x11b)[0x183bbfb]
dw.sapBID_DVEBMGS00[S](ab_extri()+0x1d8)[0x17bc7e8]
dw.sapBID_DVEBMGS00[S](ab_xevent(char16_t const*)+0x35)[0x17f4255]
dw.sapBID_DVEBMGS00[S](ab_dstep+0x139)[0x1b48e99]
dw.sapBID_DVEBMGS00[S](dynpmcal(DINFDUMY*, STPDUMMY*)+0x2cb)[0x178be5b]
dw.sapBID_DVEBMGS00[S](dynppbo0(DINFDUMY*)+0x201)[0x178dff1]
dw.sapBID_DVEBMGS00[S](dynprctl(DINFDUMY*)+0x173)[0x1a8b403]
dw.sapBID_DVEBMGS00[S](dynpen00+0x365)[0x1a7d975]
dw.sapBID_DVEBMGS00[S](ThrtCallAbapVm+0xcb)[0x1a4cdfb]
dw.sapBID_DVEBMGS00[S](ThHandleRequest(REQUEST_BUF*, unsigned char)+0x194)[0x1a14fd4]
dw.sapBID_DVEBMGS00[S](TskhLoop()+0x9f)[0x4e98c7]
dw.sapBID_DVEBMGS00[S](ThStart()+0x289)[0x4e9cfd]
dw.sapBID_DVEBMGS00[S](DpMain+0x37d)[0x19a8bcd]
libc.so.6[S](__libc_start_main+0xf5)[0x7f82a6cdbaf5]
M -------------------------------------------------
M
M ABAP-Stack at ThStackHandler -- ABAP stack of session T11_U29665_M0
M -------------------------------------------------------------------

M 1: TH_SEND_ADM_MESS of program SAPLTHFB (Function)
M 2: RZL_EXECUTE_STRG_REQ of program SAPLSAL2 (Function)
M 3: RZL_STRG_READALL_I of program SAPLSAL2 (Function)
M 4: CHECK_IF_LOADBALANCE_ONLY of program SAPMSSY6 (Form)
M 5: AUTO_ABAP of program SAPMSSY6 (Form)
M 6: PBO_1100 of program SAPMSSY6 (Module)
M
M
M ********** SERVER SNAPSHOT 1 - end **********

The thing also is, the system which survived, has kernel 7.45 patch 400 and the one which crashed, has kenerl 7.42 patch 300, which is rather old.

I went on the marketplace, under Support packages and patches, in order to check which is the newest patch, but I didn't find kernel 7.42 there at all !! Do you know if this kernel version has been withdrawn ?

My main question - why did the system get stuck ? A kernel bug, or is/was there some networking problem ? I have a hard time analyzing those Ni errors cited above. Many thanks in advance!

Share
10 |10000 characters needed characters left characters exceeded
Mario De Felipe Dec 05, 2017 at 10:18 AM
0

Hello, I am concerned about this;

Prio Class High
maxRuntime[RQ_Q_PRIO_HIGH] = infinite
maxRuntimeHalf[RQ_Q_PRIO_HIGH] = infinite

Prio Class Normal
maxRuntime[RQ_Q_PRIO_NORMAL] = infinite
maxRuntimeHalf[RQ_Q_PRIO_NORMAL] = infinite

does it mean you have infinite value in parameters *max_runtime* ?

Be aware that since kernel 7.40 SAP changed some of these parameters, parameter rdisp/max_wprun_time should not be set, and you should schedule parameters rdisp/scheduler/prio* as per note 2129291

can you tell us the values of those parameters in both systems?

Thanks

Share
10 |10000 characters needed characters left characters exceeded
avatar image
Former Member Dec 05, 2017 at 10:41 AM
0

Hola Mario,

many thanks for the quick response!

On both systems:

rdisp/max_wprun_time = 0
rdisp/scheduler/prio* = 10m
rdisp/scheduler/prio_low/max_runtime -> no value
rdisp/scheduler/prio_normal/max_quota = 80%
rdisp/scheduler/prio_normal/max_runtime = 60m

hung system:

rdisp/scheduler/prio_low/max_quota = 80%

surviving system:

rdisp/scheduler/prio_low/max_quota = 70%

And what do you think about kernel 7.42 ?

P. S.: Many thanks for the note, it was very informative!

Share
10 |10000 characters needed characters left characters exceeded
Mario De Felipe Dec 05, 2017 at 11:09 AM
0

Hello, when we have server overload issues, WPs may loose the connection to the message server, and they need to be restarted. I also see MS restart operations in your log

M ThVBHdlMsgDown: msg down
M ThIVBChangeState: update deactivated
M ***LOG R0R=> ThIVBChangeState, update deactivated () [thxxvb.c 11344]
M
M Tue Dec 5 02:57:16 2017
M ***LOG R0Z=> ThResetVBDispatching, reset update dispatching info () [thxxvb.c 8328]
M MBUF info for hooks: MS component DOWN
M
M Tue Dec 5 02:57:17 2017
M DpMBufReload: already loading

the timeout parameters are created to limit damage in those situations too

I would play limiting the number of dialog work processes, and quota, to prevent system to overload CPU operations again to avoid such problems.

As far as I know, SAP Kernel 742 is out of maintenance, last patch was released in march.

in 745 there are some notes that recommends systems to have patches later than 201

2334345 - Request queue with high load and temporary connection break down to message server

If you have Netweaver 7.40 or 7.50 latest SAP Kernel is 745 PL 600.

If you have Netweaver 7.51 this is SAP Kernel 749 PL 400

Kernel 7.45 is downwards compatible with 742 so its a good chance to replace it

Share
10 |10000 characters needed characters left characters exceeded
avatar image
Former Member Dec 05, 2017 at 11:40 AM
0

Many thanks, Mario.

What parameters would you tune ? The ones which you mentioned in your previous answer ?
Would you decrease

rdisp/scheduler/prio_normal/max_quota = 80%

probably to 60% ?

Yes, I will request a kernel update. Many thanks for confirming, that it is out of maintenance.

Share
10 |10000 characters needed characters left characters exceeded
avatar image
Former Member Dec 05, 2017 at 01:29 PM
0

Gracias, Mario.

Right, this is not a production system.
We will first upgrade the kernel and see what it'll do.
If it hangs again, we'll modify the parameters as per your requirements.

Share
10 |10000 characters needed characters left characters exceeded