on 02-02-2015 5:43 PM
Hi all,
I administer an ECC system running on ERP 6.0 EHP6, with separate instances for the database, the central instance and one dialog instance. Today all three instances hung, and users could not log on through the GUI. I tried killing PIDs at OS level but still could not access the systems. I had to stop all instances, reboot the servers and start everything again. Now I want to determine the cause of the issue. Where should I start looking: the work directory? I suspect some user activity caused this.
regards,
Aman
Dear Aman, what kind of DB are you using?
Hangs within SAP can also be caused by problems in the DB, most often full tablespaces or full transaction logs.
You should also look at the DB logs for that point in time.
During a hang it is a good idea to save all logs (DB/SAP/OS).
best regards
Peter
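Peter's advice to save all logs during a hang can be scripted in advance. The following is a minimal sketch, not an SAP tool: the function name `snapshot_logs`, the `label` parameter and the destination directory are hypothetical, and the source paths must be adapted to your own work directory, DB log location and OS log location.

```python
import tarfile
import time
from pathlib import Path


def snapshot_logs(sources, dest_dir, label="hang"):
    """Pack the given files/directories into one timestamped .tar.gz.

    Returns the path of the archive that was written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{label}_{time.strftime('%Y%m%d_%H%M%S')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for src in map(Path, sources):
            if src.exists():  # skip paths that are missing on this host
                tar.add(src, arcname=src.name)
    return archive
```

Keep the destination outside the directories being archived and on a file system with enough free space, so that the snapshot itself does not make a full mount point worse.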
Hello Aman,
you should start your analysis with the syslog in SM21: check which instance and which component reported the first error before the system hung. The syslog also shows which trace contains more information on the error message. In such a hanging situation it is helpful to create a dpmon output showing which actions the work processes are executing, and also to create a call stack in the work process traces as described in the note "112 - Trace and error information in the "dev_" files". As of a certain kernel release, the sapstack program is available to create the call stack (note "1964673 - C-Call stack analysis").
Regards, Alwina
Hi Reagan/Nikhil/Alwina,
The system is on Unix and kernel is 720 EXT Patch 300.
I'm attaching the dev_ms.old and dev_disp.old of the central instance. Pls tell me if anything can be derived from them.
In SM21, I don't see anything really suspicious except for the 2 warnings:
13:56:46 DIA 000 000 SAPSYS | BS 5 The buffer synchronization has not been called for 189 seconds |
13:56:47 DIA 015 000 SAPSYS | BS 5 The buffer synchronization has not been called for 190 seconds |
Alwina, so I could have taken a stack trace with dpmon, and it would have helped me analyse the issue better?
regards.
Your error looks the same as the one described in note 1974139 - Message server disconnection error while system works fine.
Hello Aman,
what is the exact time stamp when the system started to hang? You should check the syslog for about one hour before this time. Did you check the syslog of all instances?
While the system is hanging, you should collect dpmon output (an output like SM50 with the work process table, note "42074 - Using the R/3 dispatcher monitor 'dpmon'") and create call stacks in the traces of the hanging processes (work processes, dispatcher), so that you can see at which action the system is hanging.
Regards, Alwina
Hi again everybody,
The exact time is around 13:20, but the syslog does not show anything odd.
But I did observe something in STAD tcode. Early in the morning there was the EWA job /SDF/TCODE_SIMILARITY_IS which ran for a little more than 2 hours consuming lots of resources and at the same time there was another user job (mass process) running. I'm thinking maybe these 2 processes filled up the buffers causing slowness throughout the day....
Alwina, what are the exact steps to create the call stacks?
Also, for the future, I want to know what program/t-code users have run and what parameters/values they entered - STAD does not give details on the values entered but just the t-codes. Even auditing SM19 does not give such details. What can I enable to also know the values entered by the user?
regards.
Hello Aman,
do you get the message "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:" in dev_disp all the time, also today?
What do you see in dev_w0.old at around 13:20 and a little earlier? You have already posted some information from dev_w0.old, but we should also check the trace shortly before the system started to hang.
To create the call stack, execute "kill -USR2 <pid>" (<pid> of a hanging process) at OS level (Unix). You mentioned that you were not able to kill processes at OS level. So the processes did not react to any signals? In that case sapstack should be used, either "sapstack <pid>" for one process or "sapstack -tree <pid>" with the <pid> of the sapstart process for all processes of one instance.
Regards, Alwina
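The "kill -USR2 <pid>" step can be wrapped in a small script if several hanging processes need a call stack at once. This is a minimal sketch under assumptions: `request_call_stacks` and its `send` parameter are hypothetical names, the PIDs must be taken from your own process list, and the script must run as a user allowed to signal the processes. It only sends the signal; it does not replace sapstack for processes that no longer react to signals.

```python
import os
import signal


def request_call_stacks(pids, send=os.kill):
    """Send SIGUSR2 to each PID; return the PIDs that could not be signalled.

    `send` defaults to os.kill and can be swapped for a stub in a dry run.
    """
    failed = []
    for pid in pids:
        try:
            send(pid, signal.SIGUSR2)  # same signal as "kill -USR2 <pid>"
        except (ProcessLookupError, PermissionError):
            failed.append(pid)  # process is gone or owned by another user
    return failed
```

A PID in the returned list means the signal was never delivered, which is different from a process that received the signal but did not write a stack to its trace.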
Hi Alwina,
Thanks for the reply.
I'm still getting the error "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:". Is this normal?
Regarding dev_w0.old before 13:20, please see an extract below:
Mon Feb 2 07:47:12 2015
B dbmyclu : info : my major identification is 3232240855, minor one 1.
B dbmyclu : info : Time Reference is 1.12.2001 00:00:00h GMT.
B dbmyclu : info : my initial uuid is 54CE3138F2DF2287E1000000C0A814D7.
B dbmyclu : info : current optimistic cluster level: 0
B dbmyclu : info : pessimistic reads set to 2.
M
M Mon Feb 2 11:40:48 2015
M Deactivate ASTAT hyper index locking
B
B Mon Feb 2 13:56:46 2015
B *** ERROR => dbsync[check_sync_interval]: 189 seconds passed since last synchronisation
last sync: 20150202135337, now: 20150202135646
[dbsync.c 4336]
B {root-id=8010E02C8BBE1ED4AAD839911A03D660}_{conn-id=00000000000000000000000000000000}_0
I will test the sapstack -tree <pid> in a test system to get an idea.
Do we have some other means to detect such problems in the future - like auditing etc?
regards.
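The 189 seconds in the dbsync error can be checked directly from the two timestamps in the trace, which are in YYYYMMDDhhmmss form. A small sketch of the arithmetic follows; the function name `sync_gap_seconds` is hypothetical.

```python
from datetime import datetime


def sync_gap_seconds(last_sync: str, now: str) -> int:
    """Seconds between two YYYYMMDDhhmmss timestamps as printed by dbsync."""
    fmt = "%Y%m%d%H%M%S"
    delta = datetime.strptime(now, fmt) - datetime.strptime(last_sync, fmt)
    return int(delta.total_seconds())


# Values copied from the dev_w0.old extract above.
print(sync_gap_seconds("20150202135337", "20150202135646"))  # 189
```

So the last buffer synchronisation ran at 13:53:37 and the next one was only noticed as overdue at 13:56:46, which is well after the 13:20 at which the hang was first observed.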
Hello Aman,
could you please also check all other traces at around 13:20?
In most cases call stacks and dpmon output already give a hint about the area in which to search for the cause of the hang. If it happens too often, a higher trace level needs to be activated for the work processes and the dispatcher, but this will affect performance.
If you still get the error "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:" but do not see any problems in the network, it only means that the dispatcher is trying to communicate with another instance through the message server while that instance is currently not logged on to the message server. The message server replies that it does not know the instance.
Regards, Alwina
Hi Aman,
A system hang can occur if any of the mount points or the DB space are full, but a restart would not fix that unless you extend the space or clear some of it.
I believe in your situation the issue was fixed by the system restart, so you can rule out the above scenario.
The next possible cause is that all dialog processes were occupied. In that case, even if you kill some OS processes, the next requests in the dispatcher queue are assigned to the freed processes and users still cannot log on.
You can check for Oracle deadlocks, or for long-running dialog processes that kept the work processes occupied at that point in time, using the dpmon command at OS level.
I'm not sure whether such information is captured in the old dispatcher or work process logs.
Regards,
Nikhil
Investigate what happened at the DB side. If the DB was OK then check the CI trace files or supply the full trace files and then the AS trace files. What was the status of the work processes in DPMON during this time?
Aman,
You need to analyse all the trace files in the work directory of the CI and AS systems (in your case all with the .old extension, as you have restarted the system), especially dev_disp.old, dev_ms.old, dev_w0.old and also the dev_rfc traces, to identify the reason behind this. I believe the systems are running on Unix/Linux platforms; in that case also review the ulimits of the <SID>adm user. Could you supply the kernel release and patch level? It looks like you may be running on a buggy kernel.
Cheers
RB
Which DB do you use? Did you check the oraarch space?
When you tried to kill processes at OS level, did you see any mount point full?
Generally, a full work directory leaves all work processes occupied.
You can check dev_w0.old for the logs from that time.
regards
Hi James,
DB is oracle 11.2.0.3.
No mountpoint was full.
See below an extract of dev_w0.old from the time the issue occurred:
Mon Feb 2 13:56:46 2015
*** ERROR => dbsync[check_sync_interval]: 189 seconds passed since last synchronisation
last sync: 20150202135337, now: 20150202135646
[dbsync.c 4336]
{root-id=8010E02C8BBE1ED4AAD839911A03D660}_{conn-id=00000000000000000000000000000000}_0
Mon Feb 2 13:56:47 2015
***LOG BS5=> Buffer synchronisation has not been called for 189 seconds [dbsync 4341]
dbsync[db_clean_ddlog]: buffer synchronisation is called now
Mon Feb 2 13:56:48 2015
dbsync[db_syexe]: interfering sync call detected
my_call_no = 609, current_ts = 20150202135647, most_recent_sync = -2133741819, oldest_gap = (2147483647,00000000000000)
syncalls = 610, time_of_last_sync = 20150202135647, last_counter = -2133741836, oldest_gap = (2147483647,00000000000000)
Mon Feb 2 13:57:25 2015
ThEMsgArrived: sysmsg_for_rfc = 0
------------------ C-STACK ----------------------
[0] ThAlarmHandler ( 0x2, 0x71c, 0x2, 0x2, 0x1039e6000, 0x1039e9000 ), at 0x1003ea5f0
[1] DpSigAlrm ( 0x1039e6, 0x103a0cccc, 0x103a0c000, 0x103800, 0x100000, 0x1 ), at 0x100358adc
[2] __sighndlr ( 0xe, 0x0, 0xffffffff7fffecb0, 0x100358980, 0x0, 0xd ), at 0xffffffff789d7c58
[3] call_user_handler ( 0xffffffff7b100200, 0xffffffff7b100200, 0xffffffff7fffecb0, 0x0, 0x0, 0x0 ), at 0xffffffff789cb7ec
[4] sigacthandler ( 0x0, 0x0, 0xffffffff7fffecb0, 0xffffffff7b100200, 0x0, 0xffffffff78b3e000 ), at 0xffffffff789cb9f8
[5] setIcuCollation ( 0x0, 0xffffffff7ffff064, 0x38a7400, 0x1a9d084, 0x0, 0x1042dfa38 ), at 0x101ea769c
[6] __1cMDoSetTextEnv6FkpknKSAP_CpInfo_kpkH_v_ ( 0xfffffffe2b4b6018, 0x103c56b2a, 0x10779a000, 0x103c56782, 0x10779a, 0x107400 ), at 0x100abff50
[7] ab_RollInEnv ( 0x2000, 0x103a0c000, 0x103a0d668, 0x107bea650, 0x1077a0000, 0x1077a0 ), at 0x1008c4d30
[8] ThRollIn ( 0x1039e9000, 0x0, 0x0, 0x0, 0x1, 0x1039e9 ), at 0x100431ec4
[9] ThSessionRestore ( 0x0, 0x0, 0xffffffff665275a0, 0x0, 0x1039e9000, 0x1039e9 ), at 0x1003eeba8
[10] TskhLoop ( 0x1039e9, 0x0, 0x103800, 0x0, 0x1039e9000, 0x0 ), at 0x1003a6b30
[11] DpMain ( 0x1039e6, 0x0, 0x1039e6, 0x103800, 0x1, 0x0 ), at 0x1002b7b60
Is the above indicating anything?
regards.
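Since the advice in this thread is to look at every trace for the hour before the hang, it can help to filter the "*** ERROR" lines of a trace by time window instead of reading each file by eye. The sketch below assumes the layout seen in the extracts here: a timestamp line such as "Mon Feb 2 13:56:46 2015", optionally prefixed by a one-letter component tag, followed by the entries logged at that time. The function name `errors_in_window` is hypothetical, and the weekday and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Timestamp header, optionally prefixed by a component letter such as "B ".
STAMP = re.compile(
    r"^(?:[A-Z]\s+)?"
    r"((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$"
)


def errors_in_window(lines, start, end):
    """Return (timestamp, text) pairs for '*** ERROR' lines between start and end."""
    current = None  # timestamp of the most recent header line
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = STAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            continue
        if "*** ERROR" in line and current is not None and start <= current <= end:
            hits.append((current, line.strip()))
    return hits
```

Passing an open file such as dev_w0.old as `lines`, with a window from 12:20 to 14:00 on 2 February 2015, would return the dbsync error at 13:56:46 from the extract above. Running the same filter over dev_disp.old and dev_ms.old makes it easier to see which component reported a problem first.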
Hi Aman
1. Is the Oracle DB log directory or the usr folder full? And is there any dump in ST22?
2. Could you refer to the SAP KBA for ERROR => dbsync:
2020479 - Buffer Sync Failure because of sequence DDLOG_SEQ wrongly created
BR
SS