
Troubleshooting latency

patrickbachmann
Active Contributor
0 Kudos

Hi folks,

I've been troubleshooting some major latency issues and have read a number of suggestions in this forum, but most of them are quite old, so I'm wondering if there are any newer approaches or ideas for troubleshooting latency problems.  For my particular case I've been looking at the performance/load charts during the early-morning slowdown windows and am not seeing a high number of delta merges or anything else that indicates a problem on the HANA side.  On SLT I've increased the number of jobs processing and found some improvement, however I'm still seeing delays where replication simply stops for an hour or two at a time for certain tables.  Talking to some folks on my team, they said the DBAs suggested that re-indexing tables on the SAP side could improve things.  I'm interested in understanding why that could help, plus any other new ideas/suggestions to look for, since some of the old posts were placed on SCN.
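For reference, here's roughly how I've been cross-checking merge activity on the HANA side beyond the load charts; just a sketch against the standard M_DELTA_MERGE_STATISTICS monitoring view, with a placeholder time window:

-- delta merges executed during the early-morning slowdown window (placeholder times)
SELECT start_time,
       schema_name,
       table_name,
       execution_time,          -- milliseconds
       merged_delta_records,
       success
FROM   m_delta_merge_statistics
WHERE  start_time BETWEEN '2015-03-01 02:00:00' AND '2015-03-01 06:00:00'
ORDER  BY execution_time DESC;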

Thanks,

-Patrick

Accepted Solutions (0)

Answers (4)

patrickbachmann
Active Contributor
0 Kudos

OK, I exhausted everything I could read about and created an SAP message, and they recommended upgrading the SLT client drivers from 82 to 85.  Apparently we are experiencing frequent connection issues which should automatically resolve themselves but somehow are not.  For anybody interested, see note 2089430 - SQLDBC Connectivity Issues after failover of master node if you are curious or experience similar problems.

Thanks everyone.

-Patrick

patrickbachmann
Active Contributor
0 Kudos

Sivakumar, do you have an SLT client driver version prior to 85 by any chance?  I'd be curious to know whether upgrading also resolves your problem.

patrickbachmann
Active Contributor
0 Kudos

Hi guys,

Just updating you on progress.  Updating the SLT client drivers did not help, however we upgraded SLT to DMIS 2011 SP8 and applied the recommended applicable notes, and things have now greatly improved.  Instead of hours of latency we are just seeing a few tables at around 50 minutes or so and a handful at 10 minutes, which I'm still having SAP dig into.

-Patrick

Former Member
0 Kudos

Hi Patrick,

Sorry for the late reply. I am not able to find out which SLT client drivers we currently have, but the DMIS version is DMIS 2011_1 SP6.

There are still some unexpected things happening in SLT. Transactions LTR and LTRC are showing different replication statuses. The status in LTR shows red (issues detected regarding statistical information for a specific table), but the same table shows as replication in progress in LTRC. I'm not sure whether upgrading the DMIS SP version to the latest will fix it.

Thanks

Siva

patrickbachmann
Active Contributor
0 Kudos

Hi Siva,

In just the past 2 weeks there are even more SAP notes recommended for our current DMIS 2011 SP8, and some of them may have been spawned from our exact messages to SAP for help, so I suspect upgrading your landscape, if you can, will eventually fix your problem as well.  We are almost out of the woods here and will be going to SPS9 soon, although reluctantly and cautiously.

-Patrick

Former Member
0 Kudos

Thanks Patrick,

Sure. Let me check on the upgrade side. I will keep you posted.

-Siva

patrickbachmann
Active Contributor
0 Kudos

Guys,

I haven't forgotten about this thread; I'm still tinkering.  Last night I stayed up late and monitored, and I was able to catch a latency window where I could clearly see records stacking up in the logging tables, and could see they had been written to the logging table an hour earlier.  I then looked in SM37 at my 22 transfer jobs and they were all active.  I have, however, noticed many of them with CANCELED status almost hourly, and when I look at those, the errors look something like this:

Log not found (in main memory)

Job cancelled after system exception ERROR_MESSAGE

I then tried to look at SM50 to see the available work processes, but I don't have access.  I tried to engage my NetWeaver team, but by the time they looked everything had caught up.  So my next step is to try to catch this happening live again, have the NetWeaver team look at work processes and load during that time, and finally create an SAP message, as I think at that point I've done everything they recommend for troubleshooting.
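In the meantime, here's a rough sketch of how I've been pulling the canceled transfer jobs straight from the job status table TBTCO while I wait for SM50 access (field names and the job-name pattern are from memory / just examples, so please double-check them):

-- background jobs that ended in canceled/aborted status ('A')
SELECT jobname,
       jobcount,
       strtdate,          -- actual start date
       strttime,          -- actual start time
       status
FROM   tbtco
WHERE  jobname LIKE '%LOAD%'   -- example pattern; adjust to your SLT transfer job names
  AND  status = 'A'
ORDER  BY strtdate DESC, strttime DESC;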

-Patrick

patrickbachmann
Active Contributor
0 Kudos

Lars,

Here's an excerpt from my DBA to answer your question on the full table scans (below).  I have been watching delta merges, locks, etc. on the target system; essentially everything available in the performance/load charts, looking for outliers during the suspect time frame.

Martin,

Your statement doesn't make sense to me.  I started digging into this problem because data was indeed not getting into the target system.  So I looked at the latency and could see a gap of many hours at exactly the time the users complained.  If I cannot use replication statistics to measure latency, then what tools can I use to monitor latency on a daily basis?  Can I not trust the latency alert emails either?  Aren't those also based on these same statistics?  Thanks for any more insight you can provide.

-Patrick

1. One of the top statements ordered by CPU time.  Note that it does a FULL TABLE scan (no index access):

       CPU                   CPU per           Elapsed
   Time (s)  Executions    Exec (s) %Total   Time (s)   %CPU    %IO    SQL Id
  ---------- ------------ ---------- ------ ---------- ------ ------ -------------
    10,119.8       10,324       0.98    4.6   10,682.6   94.7     .0 amcd4pbg4068a
  Module: /1CADMC/SAPLDMC010000000002665
  DELETE FROM "/1CADMC/00010824" WHERE "IUUC_PROCESSED">=:A0

   The second statement uses an index (INDEX RANGE scan), but the index itself has grown large:

     5,689.7       10,386       0.55    2.6    5,999.8   94.8     .0 1pbm6w9w0snym

2. The same statement as above, but this time it tops the buffer gets list.  Pretty unusual for a table that has 16 rows!  Note that the 3 statements below make up 10% of all "Buffer gets" (which translate to I/O operations) of the overall system load:

      Buffer                 Gets              Elapsed
       Gets   Executions   per Exec   %Total   Time (s)  %CPU   %IO    SQL Id
  ----------- ----------- ------------ ------ ---------- ----- ----- -------------
  8.66250E+08      10,386     83,405.6    4.2    5,999.8  94.8     0 1pbm6w9w0snym
  7.61395E+08      10,324     73,750.0    3.7   10,682.6  94.7     0 amcd4pbg4068a
  Module: /1CADMC/SAPLDMC010000000002665
  DELETE FROM "/1CADMC/00010824" WHERE "IUUC_PROCESSED">=:A0
  3.02521E+08       2,666    113,473.8    1.5    1,506.8  92.7     0 bu7azdh0r8c1f

I agree with Lars's observation that table fragmentation, when the table is accessed by index, does not impact performance too much, but as shown above, these tables are accessed either by full table scans or by indexes that have also become inefficient.
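To verify the full-scan claim, the execution plan can also be pulled straight from the cursor cache for the SQL_ID in question; a minimal sketch using standard Oracle DBMS_XPLAN (nothing SLT-specific), with the SQL_ID taken from the report above:

-- show the cached execution plan for the DELETE on the logging table
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('amcd4pbg4068a', NULL, 'BASIC'));
-- the SQL_ID must still be in the shared pool for this to return a plan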

patrickbachmann
Active Contributor
0 Kudos

PS: Lars, I'm still looking into some of your other comments like the mini-checks and Martin's note... more on that soon.

lbreddemann
Active Contributor
0 Kudos

Hey Patrick,

I am pretty optimistic that Martin has a good grasp on all things performance related.

Anyhow, the table fragmentation does seem to be present, but I fail to see how it adds up to the large latency times, or why it would only _sometimes_ be that slow.

Yes, the access to the logging tables may not be the most efficient right now, but I doubt that it is the actual culprit here.

- Lars

patrickbachmann
Active Contributor
0 Kudos

OK, thanks Lars.  And I certainly don't doubt Martin's expertise; rather, I'm hoping he is gifted enough to make my slow brain understand it.

patrickbachmann
Active Contributor
0 Kudos

i.e.: if the latency report shows

4:10 AM 5000 records latency 7680 seconds


My interpretation is that these records were created in the source system 7680 seconds ago, yet took this long to get written into the target system.


Is that correct?

patrickbachmann
Active Contributor
0 Kudos

How about a real example from today, gentlemen.  Does it look OK to you?

patrickbachmann
Active Contributor
0 Kudos

And here are ALL tables for today, sorted descending, leaving out the table names... it seems every table was delayed today; this is all tables for a single day.

Former Member
0 Kudos

I don't have practical SLT experience, but some SLT guys got in touch with me some time ago and asked which timestamp is written into the logging tables and whether it is a good idea to measure the latency based on that timestamp. I said it's the time of the DML operation, and it's not a good idea to measure the latency based on this timestamp, because in that case a late COMMIT can massively inflate the measured latency. Of course there can also be other reasons for the delays, particularly if tables of different applications are involved.

The mini checks are available via SQL: "HANA_Configuration_MiniChecks..." (SAP Note 1969700) and described in detail in SAP Note 1999993.


patrickbachmann
Active Contributor
0 Kudos

Thanks Martin.  Just to clarify a bit, though.  These are my assumptions; can you tell me if they are correct?

So hypothetically, if I create sales order #100 and an entry gets posted in the SAP source table VBAK at 9 AM...

Assumption 1: Sales order #100 gets posted immediately in the tracking table. Is it safe to assume there is not normally any latency in writing to the tracking table?

Assumption 2: Latency is not based on the timestamp of the record being written into the tracking table.

>>> ...and then continuing my example: for whatever reason the record isn't committed in HANA until 10 AM. Who knows why just yet, but that's beside the point for my example...

Assumption 3: Since it's not committed until 10 AM, it's not available in the delta table. I'm assuming that when you say DML commit, you are referring to when it's INSERTED/COMMITTED into the delta storage table (prior to the delta merge).

Assumption 4: The timestamp for the DML operation would be 10 AM.

Assumption 5: This means sales order #100 will NOT be visible to end users running reports off HANA that utilize VBAK until 10 AM.

Thanks!

-Patrick

Former Member
0 Kudos

As I said, I am not familiar with SLT details. I just provided feedback here so that you can check whether a delayed COMMIT correlates with the latency issues. Using transaction STAD it should be possible to check whether a long-running transaction finished at the time the latency issues resolved. If not, my hypothesis might be wrong.

Former Member
0 Kudos

Hello Patrick,

Did you find any answers to your questions? We also have latency in our SLT replication production system, and it is hard to identify where the issue is really happening. Most of the best-practice options have already been followed, like increasing the number of jobs according to the available work processes, secondary indexes for tables in ECC, and parallel execution, but on some days this replication latency still happens.

thanks

Siva

patrickbachmann
Active Contributor
0 Kudos

Hi Siva,

I too am going through the best practices and everything I've found/read on SAP Support about troubleshooting, and I'm about to create a message with SAP soon to see if they can help.  I will let you know what I learn, hopefully in the next few days.

Thanks for your feedback!

-Patrick

Former Member
0 Kudos

Hi Patrick,

Do you know how to convert this UTC timestamp value into a meaningful time format?

I was dividing the max latency (assuming the values are in seconds) by 86400 to get an h:mm:ss format in Excel, but apparently this gives incorrect values when the latency values are high.

Thanks

Siva

patrickbachmann
Active Contributor
0 Kudos

Hi Siva,

Can you tell me which screen you are looking at in LTRC exactly?  In the replication statistics I simply see seconds and just divide by 60 for minutes.  But when I first choose my selection for the start & end date and time, I do enter UTC time, which in our case is just 4 hours ahead, so I subtract 4 hours from the time I expect to see.
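If you want it as hours/minutes/seconds rather than raw seconds, plain arithmetic works too; a small sketch using the 7680-second example from earlier in this thread (standard HANA SQL against DUMMY):

-- break a latency value in seconds down into hours / minutes / seconds
SELECT 7680                        AS latency_seconds,
       FLOOR(7680 / 3600)          AS hours,     -- 2
       FLOOR(MOD(7680, 3600) / 60) AS minutes,   -- 8
       MOD(7680, 60)               AS seconds    -- 0
FROM   dummy;

Also, in Excel the division by 86400 is fine, but the cell format needs to be [h]:mm:ss rather than h:mm:ss, otherwise anything above 24 hours wraps around, which is probably why the big values looked wrong.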

-Patrick

Former Member
0 Kudos

Hi Patrick,

This is about the replication statistics display. Please find the screenshot below. I am assuming the job has taken around 68 hours to completely insert 774,152 records, but I wanted to double-check this.

 

thanks

Siva

Former Member
0 Kudos

Hi Patrick,

Can you please check whether a backup is running at a particular time, and also check the availability of the background jobs and any delays in them.

Check both the source system and the replication server.

Thanks,

Shakthi Raj Natarajan.

patrickbachmann
Active Contributor
0 Kudos

Hi Shakthi,

Thanks for your suggestion, I am looking into these two questions now.

-Patrick

patrickbachmann
Active Contributor
0 Kudos

OK, I'm told backups are not happening during this time, but I have to log on early in the morning to view background job availability.  I did talk to a DBA whose theory around the index is that the change tracking table on the SAP side (which also corresponds to the synonym table in HANA) is initially very large when the table is first replicated (let's say hundreds of millions of records), and then after that initial load only a small amount of data sits in these tables during ongoing replication, yet the block size is still showing very large.  His theory is that doing a re-org on the tracking table would rebuild the index and improve performance.  I'm wondering what others' opinions are on this?  I would think that if this were the case, SAP would recommend re-orging tables after every initialization of a large replication job.  I realize this may or may not be the root of our problem; I'm just investigating each and every possibility.

-Patrick

lbreddemann
Active Contributor
0 Kudos

Hi Patrick,

Index and table fragmentation definitely are effects that can occur on a platform like Oracle.

However, these effects typically don't affect index-based data access as much.

If the logging tables are accessed via a full table scan every single time, that might be a reason for slow access after the tables have grown large.

This means reading the logging entries from those tables might be slower than necessary.

That would be a constant slowdown factor, though, not an effect that is sometimes there and sometimes not.

Now, the question is: is the latency you are trying to analyze here affected by this at all? And are the tables actually read by full table scans?

Common SAP-on-Oracle DBA activities include checking for unnecessarily large tables and reorganizing them, and there are semi-automatic tools available to do just that (BRTOOLS). So this possible problem can be addressed easily.
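Just as an illustration of that kind of check (plain Oracle dictionary views, not BRTOOLS; the table name is the logging table from the AWR excerpt above, so adjust owner/name as needed):

-- compare allocated segment size with the analyzed row count for one logging table
SELECT s.owner,
       s.segment_name,
       s.segment_type,
       ROUND(s.bytes / 1024 / 1024) AS size_mb,
       t.num_rows,
       t.last_analyzed
FROM   dba_segments s
       LEFT JOIN dba_tables t
              ON  t.owner      = s.owner
              AND t.table_name = s.segment_name
WHERE  s.segment_name = '/1CADMC/00010824';
-- a multi-GB segment holding only a handful of rows would support the fragmentation theory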

However, I highly recommend not doing anything before it is clear what is causing the problem. Starting to reorganize and rebuild data structures just because someone heard something about that is likely to do more harm than good.



You didn't yet state where your observed or perceived latency occurs.

Why don't you tell us more about what and where you saw the bad performance?

- Lars

patrickbachmann
Active Contributor
0 Kudos

Hi Lars.  I've just recently been handed SLT duties on top of my modelling ones, so I've only begun to dig into this in the past week or so, after users complained of missing data in their reports.  I troubleshot by looking at the replication statistics in transaction LTRC: first I look at all tables for the day and sort by maximum latency.  Then I take a look at each of the worst tables by running the statistics by MINUTE for one table at a time, and then I can clearly see entries like this for MKPF:

02:00 AM 100 records latency 2.5 seconds

02:01 AM 105 records latency 3 seconds

02:02 AM 103 records latency 2.7 seconds

I can clearly see records being updated every minute, then suddenly there will be a gap of 1 or 2 hours with zero entries, and then it will jump like this:

4:10 AM 5000 records latency 7680 seconds

4:11 AM 100 records latency 2.5 seconds

4:13 etc...

These are not real numbers but just an example.

-Patrick

lbreddemann
Active Contributor
0 Kudos

Alright,

I think it's quite clear that we can now rule out any fragmentation effects on the underlying data structures.

It seems that there is rather some kind of waiting situation, e.g. caused by lock contention, that slows down processing.

As I am not an SLT expert at all, I can only recommend a general approach.

So my next step would be to check both the source and the target system for long-running transactions and look for locks that might be involved with those.

It is important to know where the bottleneck occurs here. Since reading on Oracle (which I assume is the source system here) typically is not blocked by record/table locks, I would probably focus on the write-out end of the replication here - the SAP HANA server.

One thing that comes to mind is that under some rare conditions the savepoint writing can run into issues that then lead to looping "critical phase" executions. Those block transactions and should generally be very short.

The script collection (SAP Note 1969700) contains a script to check the savepoint durations.
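If you don't have the collection handy, a minimal sketch of the same idea directly against the monitoring view (column names from the standard M_SAVEPOINTS view, so verify them on your revision) would be:

-- recent savepoints with the longest blocking (critical) phase
SELECT TOP 20
       start_time,
       duration,                  -- total savepoint duration
       critical_phase_duration    -- the blocking part; should be very short
FROM   m_savepoints
WHERE  start_time > ADD_DAYS(CURRENT_TIMESTAMP, -1)
ORDER  BY critical_phase_duration DESC;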

While you're at it, you should also run the minicheck script to get a general "health check".

In fact, start with the mini check and work through its output first.

- Lars

Former Member
0 Kudos

As per my understanding, the SLT "latency" measures the time between the DML operation and the SLT propagation. SLT can only kick in after a COMMIT. So if a DML operation is committed after 2 hours, it is normal to see a latency of 2 hours. This doesn't indicate an issue (unless the 2-hour runtime is longer than expected).

patrickbachmann
Active Contributor
0 Kudos

Lars, where can I find more information on the minicheck scripts you are referring to?  Is that the same as the SAP Note 1969700 SQL statement collection?

lbreddemann
Active Contributor
0 Kudos

Yes, that's the SAP note.