
Instances not accessible

Former Member
0 Kudos

Hi all,

I administer an ECC system running on ERP 6.0 EHP6, with separate instances for the database, the central instance and one dialog instance. Today all three instances hung, so users could not log in through the GUI. I tried killing PIDs at OS level but still could not access the systems. I had to stop all instances, reboot the servers and start them again. Now I want to determine the cause of the issue. Please tell me where to start looking: the work directory? I suspect some user activity caused this.

regards,

Aman

Accepted Solutions (0)

Answers (5)


Former Member
0 Kudos

Dear Aman, what kind of DB are you using?

Hangs within SAP can also be caused by problems within the DB, most often full tablespaces or full transaction logs.

You should also have a look at the DB logs for that point in time.

A good idea during a hang is to save all logs (DB/SAP/OS).
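Saving the logs during a hang can be scripted ahead of time. Below is a minimal Python sketch, assuming the logs are plain files in a few known directories. The function name `snapshot_logs`, the `hang_` folder prefix and any paths passed in are hypothetical placeholders, not SAP tooling.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(source_dirs, target_root):
    """Copy every regular file from each source directory into a new,
    timestamped folder below target_root and return that folder."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = Path(target_root) / f"hang_{stamp}"
    for index, src in enumerate(map(Path, source_dirs)):
        # Prefix with an index so two directories named "work" do not collide.
        dest = target / f"{index}_{src.name}"
        dest.mkdir(parents=True, exist_ok=True)
        for item in src.iterdir():
            if item.is_file():
                # copy2 also preserves the modification time of each file.
                shutil.copy2(item, dest / item.name)
    return target
```

It would be called with the work directories of the instances plus the DB and OS log locations, for example `snapshot_logs([ci_work_dir, db_log_dir], "/tmp")`, where both variables are placeholders for your own paths.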

best regards

Peter

alwina_enns
Employee
0 Kudos

Hello Aman,

you should start your analysis with the syslog in SM21: which instance and which component reported the first error before the system started hanging? The syslog also shows in which trace you can find more information on the error message. In such a hanging situation it is helpful to create a dpmon output showing which actions the work processes are executing, and also to create a call stack in the work process traces as described in the note "112 - Trace and error information in the "dev_" files". As of a certain kernel release the sapstack program is available to create the call stack (note "1964673 - C-Call stack analysis").

Regards, Alwina

Former Member
0 Kudos

Hi Reagan/Nikhil/Alwina,

The system is on Unix and kernel is 720 EXT Patch 300.

I'm attaching the dev_ms.old and dev_disp.old of the central instance. Pls tell me if anything can be derived from them.

In SM21, I don't see anything really suspicious except for the 2 warnings:

13:56:46 DIA  000 000 SAPSYS           BS  5 The buffer synchronization has not been called for 189 seconds
13:56:47 DIA  015 000 SAPSYS           BS  5 The buffer synchronization has not been called for 190 seconds

Alwina, so I could have taken a stack trace from dpmon, and that would have helped me analyse the issue better?

regards.

Former Member
0 Kudos
alwina_enns
Employee
0 Kudos

Hello Aman,

what is the exact time stamp at which the system started to hang? You should check the syslog for about one hour before this time. Did you check the syslog of all instances?

While the system is hanging you should collect dpmon output (an output like SM50 with the work process table, note "42074 - Using the R/3 dispatcher monitor 'dpmon'") and create call stacks in the traces of the hanging processes (work processes, dispatcher), so that you can see at which action the system is hanging.

Regards, Alwina

alwina_enns
Employee
0 Kudos

Hello Sunil,

we should know the exact time stamp first before we can start to check the available information.

Regards, Alwina

Former Member
0 Kudos

Hi again everybody,

The exact time is around 13:20, but the syslog does not show anything odd.

But I did observe something in STAD tcode. Early in the morning there was the EWA job /SDF/TCODE_SIMILARITY_IS which ran for a little more than 2 hours consuming lots of resources and at the same time there was another user job (mass process) running. I'm thinking maybe these 2 processes filled up the buffers causing slowness throughout the day....

Alwina, what are the exact steps to create the call stacks?

Also, for the future, I want to know what program/t-code users have run and what parameters/values they entered - STAD does not give details on the values entered but just the t-codes. Even auditing SM19 does not give such details. What can I enable to also know the values entered by the user?

regards.

alwina_enns
Employee
0 Kudos

Hello Aman,

do you get the message "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:" in dev_disp all the time, also today?

What do you see in dev_w0.old at around 13:20 and a little earlier? You have already posted some information from dev_w0.old, but we should also check the trace shortly before the system started to hang.

To create the call stack you should execute "kill -USR2 <pid>" (<pid> of a hanging process) on OS level (unix). You have mentioned you were not able to kill processes on OS level? So the processes did not react to any signals? In this case sapstack should be used, either "sapstack <pid>" for one process or "sapstack -tree <pid>" with <pid> of sapstart process for all processes of one instance.
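If several processes are hanging, the "kill -USR2 <pid>" step can be repeated for a list of PIDs. Below is a minimal Python sketch of that loop for Unix. Only the signal comes from the command above. The helper name `request_call_stacks` and its `send` parameter are hypothetical and exist only so the loop can be tried without touching real processes.

```python
import os
import signal


def request_call_stacks(pids, send=os.kill):
    """Send SIGUSR2 to each PID, the same signal as "kill -USR2 <pid>".

    Returns (signalled, failed); failed holds (pid, error text) pairs.
    """
    signalled, failed = [], []
    for pid in pids:
        try:
            send(pid, signal.SIGUSR2)
            signalled.append(pid)
        except OSError as exc:  # for example: no such process, or not permitted
            failed.append((pid, str(exc)))
    return signalled, failed
```

A PID in `signalled` only means the signal was delivered. If the process does not react to signals, nothing will appear in its trace, and sapstack is the fallback.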

Regards, Alwina

Former Member
0 Kudos

Hi Alwina,

Thnx for the reply.

I'm still getting the error "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:". Is this normal?

Regarding dev_w0.old before 13:20, pls see below an extract:

Mon Feb  2 07:47:12 2015

B  dbmyclu : info : my major identification is 3232240855, minor one 1.

B  dbmyclu : info : Time Reference is 1.12.2001 00:00:00h GMT.

B  dbmyclu : info : my initial uuid is 54CE3138F2DF2287E1000000C0A814D7.

B  dbmyclu : info : current optimistic cluster level: 0

B  dbmyclu : info : pessimistic reads set to 2.

M

M Mon Feb  2 11:40:48 2015

M  Deactivate ASTAT hyper index locking

B

B Mon Feb  2 13:56:46 2015

B  *** ERROR => dbsync[check_sync_interval]: 189 seconds passed since last synchronisation

last sync: 20150202135337, now: 20150202135646

[dbsync.c     4336]

B  {root-id=8010E02C8BBE1ED4AAD839911A03D660}_{conn-id=00000000000000000000000000000000}_0
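As a cross-check, the 189 seconds in the ERROR line match the difference between the "last sync" and "now" values. Below is a small Python sketch of that arithmetic, assuming both values use the YYYYMMDDhhmmss layout seen in the extract. The function name `sync_gap_seconds` is made up for illustration.

```python
from datetime import datetime

STAMP = "%Y%m%d%H%M%S"  # layout of the "last sync" / "now" values in the trace


def sync_gap_seconds(last_sync, now):
    """Seconds between two YYYYMMDDhhmmss stamps from a dbsync trace line."""
    delta = datetime.strptime(now, STAMP) - datetime.strptime(last_sync, STAMP)
    return int(delta.total_seconds())


print(sync_gap_seconds("20150202135337", "20150202135646"))  # 189
```

So the work process last synchronised at 13:53:37 and noticed the gap at 13:56:46, which is well after the 13:20 at which the hang was first observed.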

I will test the sapstack -tree <pid> in a test system to get an idea.

Do we have some other means to detect such problems in the future - like auditing etc?

regards.

alwina_enns
Employee
0 Kudos

Hello Aman,

could you please also check all other traces at around 13:20?

In most cases call stacks and dpmon output already give a hint as to the area in which to search for the reason for the hang. If it happens too often, then a higher trace level needs to be activated for the work processes and dispatcher, but this will affect performance.

If you still have the error "ERROR => DpMsgProcess: MsReceive () -> MSEPARTNERUNKNOWN, partner:", but you do not see any problems in the network, then it only means that the dispatcher is trying to communicate with another instance through the message server, but this instance is currently not logged on to the message server. The message server reports back that it does not know the instance.

Regards, Alwina

Former Member
0 Kudos

Hi Aman,

Possibly the system hang can occur if any of the mount points or the DB space is full, but this would not be fixed by a system restart unless you extend the space or clear some of it.

I believe in your situation the issue was fixed after the system restart, hence you can rule out the above scenario.

The next cause can be that all the dialog processes are occupied. In this case, even if you kill some OS processes, the next requests in the dispatcher queue will be allocated and the user login issue would still be there.

You can check for any Oracle deadlocks, or long-running dialog processes that kept holding the process at that point in time, using the dpmon command from OS level.

I'm not sure if such information would be captured in the old dispatcher or work process logs.

Regards,

Nikhil

Reagan
Advisor
0 Kudos

Investigate what happened at the DB side. If the DB was OK then check the CI trace files or supply the full trace files and then the AS trace files. What was the status of the work processes in DPMON during this time?

Former Member
0 Kudos

Hi Reagan,

The DB seemed ok (alert log - no issue). The work processes were showing 'ROLL IN' and 'NO ACTION' in DPMON. I tried to kill from there, but nothing happened. Which trace files specifically do I need to check?

regards.

Reagan
Advisor
0 Kudos

Aman,

You need to analyse all the trace files in the work directory of the CI and AS systems (in your case all with the .old extension, as you have restarted the system), especially dev_disp.old, dev_ms.old, dev_w0.old and also the dev_rfc traces, to identify the reason behind this. I believe the systems are running on Unix/Linux platforms; in that case also review the ulimits of the <SID>adm user. Could you supply the kernel release and patch level? It looks like you may be running on a buggy kernel.
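Going through every .old trace by hand is tedious, so a small script can list the ERROR lines that fall inside a time window. Below is a minimal Python sketch. It assumes the traces carry header lines such as "Mon Feb  2 13:56:46 2015" (optionally prefixed with a letter like "M" or "B"), that errors are marked with "ERROR =>", and that an English/C locale is in use for the day and month names. The function name `errors_between` is hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# Header lines look like "Mon Feb  2 13:56:46 2015", optionally prefixed ("M ", "B ").
HEADER = re.compile(
    r"([A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s*$"
)


def errors_between(work_dir, start, end, pattern="dev_*.old"):
    """Return sorted (timestamp, file name, line) tuples for "ERROR =>" lines
    whose preceding timestamp header falls within [start, end]."""
    hits = []
    for path in sorted(Path(work_dir).glob(pattern)):
        current = None  # timestamp of the most recent header in this file
        for line in path.read_text(errors="replace").splitlines():
            match = HEADER.search(line)
            if match:
                current = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            elif "ERROR =>" in line and current and start <= current <= end:
                hits.append((current, path.name, line.strip()))
    return sorted(hits)
```

Called with the work directory and a window of roughly one hour before the hang, the first tuple returned is the earliest error across all matching traces, which is usually the best place to start reading.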

Cheers

RB

Former Member
0 Kudos

Which DB do you use? Did you check the oraarch space?

You tried to kill from OS level; did you see any mount point full?

Generally, a full work directory causes all work processes to be occupied.

You can check dev_w0.old for the logs from that time.

regards

Former Member
0 Kudos

Hi James,

DB is oracle 11.2.0.3.

No mountpoint was full.

See below an extract of dev_w0.old from the time the issue occurred:

Mon Feb  2 13:56:46 2015

*** ERROR => dbsync[check_sync_interval]: 189 seconds passed since last synchronisation

last sync: 20150202135337, now: 20150202135646

[dbsync.c     4336]

  {root-id=8010E02C8BBE1ED4AAD839911A03D660}_{conn-id=00000000000000000000000000000000}_0

Mon Feb  2 13:56:47 2015

  ***LOG BS5=> Buffer synchronisation has not been called for 189        seconds [dbsync       4341]

  dbsync[db_clean_ddlog]: buffer synchronisation is called now

Mon Feb  2 13:56:48 2015

  dbsync[db_syexe]: interfering sync call detected

    my_call_no = 609, current_ts = 20150202135647, most_recent_sync = -2133741819, oldest_gap = (2147483647,00000000000000)

    syncalls = 610, time_of_last_sync = 20150202135647, last_counter = -2133741836, oldest_gap = (2147483647,00000000000000)

Mon Feb  2 13:57:25 2015

  ThEMsgArrived: sysmsg_for_rfc = 0

  ------------------ C-STACK ----------------------

[0] ThAlarmHandler ( 0x2, 0x71c, 0x2, 0x2, 0x1039e6000, 0x1039e9000 ), at 0x1003ea5f0

[1] DpSigAlrm ( 0x1039e6, 0x103a0cccc, 0x103a0c000, 0x103800, 0x100000, 0x1 ), at 0x100358adc

[2] __sighndlr ( 0xe, 0x0, 0xffffffff7fffecb0, 0x100358980, 0x0, 0xd ), at 0xffffffff789d7c58

[3] call_user_handler ( 0xffffffff7b100200, 0xffffffff7b100200, 0xffffffff7fffecb0, 0x0, 0x0, 0x0 ), at 0xffffffff789cb7ec

[4] sigacthandler ( 0x0, 0x0, 0xffffffff7fffecb0, 0xffffffff7b100200, 0x0, 0xffffffff78b3e000 ), at 0xffffffff789cb9f8

[5] setIcuCollation ( 0x0, 0xffffffff7ffff064, 0x38a7400, 0x1a9d084, 0x0, 0x1042dfa38 ), at 0x101ea769c

[6] __1cMDoSetTextEnv6FkpknKSAP_CpInfo_kpkH_v_ ( 0xfffffffe2b4b6018, 0x103c56b2a, 0x10779a000, 0x103c56782, 0x10779a, 0x107400 ), at 0x100abff50

[7] ab_RollInEnv ( 0x2000, 0x103a0c000, 0x103a0d668, 0x107bea650, 0x1077a0000, 0x1077a0 ), at 0x1008c4d30

[8] ThRollIn ( 0x1039e9000, 0x0, 0x0, 0x0, 0x1, 0x1039e9 ), at 0x100431ec4

[9] ThSessionRestore ( 0x0, 0x0, 0xffffffff665275a0, 0x0, 0x1039e9000, 0x1039e9 ), at 0x1003eeba8

[10] TskhLoop ( 0x1039e9, 0x0, 0x103800, 0x0, 0x1039e9000, 0x0 ), at 0x1003a6b30

[11] DpMain ( 0x1039e6, 0x0, 0x1039e6, 0x103800, 0x1, 0x0 ), at 0x1002b7b60

Is the above indicating anything?

regards.

Sriram2009
Active Contributor
0 Kudos

Hi Aman

1. Is the Oracle log, DB or usr folder full? And is there any dump in ST22?

2. Could you refer to the SAP KBA for ERROR => dbsync

2020479 - Buffer Sync Failure because of sequence DDLOG_SEQ wrongly created

BR

SS