Replication Server 15.7.1 outbound queue segments accumulating

Mar 13, 2017 at 12:48 PM


Hi

Architecture: two Sybase ASE 15.7 SP62 servers with a Replication Server 15.7.1 SP304, all on Solaris 10.

Four databases are replicated using MSA.

During business hours, the outbound queue of our biggest database accumulates up to a thousand segments. It catches up by the end of the day.

During peak hours, I launched an rs_ticket to follow 76 small inserts into a dummy table. In the result below, the problem seems to be between the DIST and the DSI:

HEADER  BD   PDB           EXEC          exec_b       DIST          DSI           RDB           NB SECONDS
------  ---  ------------  ------------  -----------  ------------  ------------  ------------  ----------
START   db1  16:23:24:970  16:23:24:946  11720904356  16:23:24:980  18:27:14:596  18:27:14:846  7429
END     db1  16:23:25:000  16:23:24:976  11720954638  16:23:25:163  18:27:14:830  18:27:14:886  7429
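
For anyone wanting to reproduce the measurement, this is roughly what was run; a minimal sketch assuming the v2 rs_ticket procedure in the PDB and the rs_ticket_history table in the RDB (column names can differ slightly between RS versions):

-- in the primary database (PDB): send tickets through the replication path
exec rs_ticket 'START'
go
-- ... run the test workload (the 76 inserts) ...
exec rs_ticket 'END'
go

-- in the replicate database (RDB): read the ticket timestamps back
select h1, pdb_t, exec_t, dist_t, dsi_t, rdb_t,
       datediff(ss, pdb_t, rdb_t) as total_seconds
  from rs_ticket_history
 order by cnt
go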

We tried modifying many different SQT parameters on the DSI connections, but with no luck.

My next step would be to analyze the stats counters, but I am a newbie in this area.

Any ideas on how to proceed with this issue?

Thanks,

Vince


7 Answers

Best Answer
Mark A Parsons Mar 15, 2017 at 12:20 PM

sp_sysmon (and statement cache hit ratios) won't be of much help. What you need to do is look into the cpu/io performance of the individual queries, and doing that is no different than monitoring any other queries in ASE (ie, MDA tables).

If you find your simple INSERT/DELETE/UPDATEs are using excessive cpu/ios (per monSysStatement), then you can dig into monSysSQLText/monSysPlanText for actual queries and plans to see why said queries are performing poorly.
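
As a minimal sketch of that approach (assuming the statement pipe is enabled on the RDB via sp_configure, and with arbitrary thresholds and a hypothetical database name):

-- heaviest recent statements in the replicate database, by cpu and logical i/o
select SPID, DBName, CpuTime, WaitTime, LogicalReads, PhysicalReads,
       RowsAffected, StartTime, EndTime
  from master..monSysStatement
 where DBName = 'db1'                                -- hypothetical replicate DB name
   and (CpuTime > 1000 or LogicalReads > 100000)     -- arbitrary thresholds
 order by LogicalReads desc
go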

ZFS shouldn't be a problem as long as you've got encryption/compression/snapshots disabled on the FS in question. Otherwise, have the system/disk admins review performance metrics for the ZFS FS to see if there is any sign of degraded performance (eg, long write times).

Mark A Parsons Mar 13, 2017 at 03:11 PM

If those rs_tickets were submitted while PDB transactions were flowing through replication, it's hard to tell at this point if the rs_ticket delays are due to a replication server issue or perhaps an issue with applying transactions (in the queue ahead of the rs_ticket) in the RDB.

-----------

Before jumping into counters I'd want to run a few more checks to help zero in on the problem, eg:

1 - Do you see these delays for all 4 databases or just the one? [Want to figure out if this is a server wide issue, or is the issue related to a single database.]

2 - When replication is quiesced (eg, little/nothing is coming out of the PDB), how long does a rs_ticket take? [If rs_ticket is delayed by more than a couple seconds when replication is quiesced then this would be odd.]

3 - What are the actual transactions being applied in the RDB? 'simple' INSERT/DELETE/UPDATEs? stored proc invocations? are triggers and/or computed column functions being fired?

4 - When experiencing these delays, have you attempted to monitor what's going on in the RDB (eg, cpu/io performance of replicated transactions, query plans of replicated transactions)? [Wondering if there could be performance issues in the RDS and/or RDB?]

What would be convenient/nice is to find that replicated transactions are taking a long time to be applied in the RDB. Typical issues could include excessive blocking of the maint user's spid, a poorly configured RDS (too-small cache => cache thrashing; descriptor re-use), or an RDB configuration issue (eg, a missing index causing excessive table scans for replicated transactions, or queries invoked from inside triggers/procs/functions and/or RI constraint checking).
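
For the blocking point, a quick check is to look for blocked spids in the RDB while the delay is happening; a minimal sketch (whichever login blocks the maint user's spid will show up as the blocker):

-- any spid currently blocked, and who is blocking it
select spid, blocked, suser_name(suid) as login_name,
       db_name(dbid) as db, cmd, status, time_blocked
  from master..sysprocesses
 where blocked > 0
go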

Vincent Rives Mar 14, 2017 at 04:59 PM

Hi Mark,

Thanks for your answer!

1- The delay is just for one database.

2- I am going to run the rs_ticket tonight when the replication is quiesced. We will then be able to compare.

3- On the RDB, the transactions are simple INSERT/DELETE/UPDATEs. No stored procedure is called on the RDB (only on the PDB). No triggers and no computed column functions are fired.

4- On the RDB,

+ I ran sp_object_stats for 5 minutes against this problematic database to see if there was any lock contention on some tables, but none was reported.

+ I looked at our sp_sysmon reports but could not find anything abnormal, although I don't consider myself an expert at reading sysmon reports either.

+ I looked at the statement cache hit ratio ==> 100%. I did not see any re-use with sp_monitorconfig.

Looking at the cpu/io performance of replicated transactions is a good idea... I could use the MDA tables to identify the long/IO/CPU-consuming transactions, or even sp_sysmon with the right option. Any advice on how to capture the replicated transactions that take a long time in the RDB?
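
One way to narrow it down to the replicated work is to find the DSI maintenance user's spid and filter monSysStatement on it; a rough sketch, assuming the maintenance login follows the usual <dbname>_maint naming convention (adjust to whatever the DSI connection actually uses):

-- 1) find the maintenance user's spid in the RDB
select spid, suser_name(suid) as login_name, cmd, cpu, physical_io
  from master..sysprocesses
 where suser_name(suid) like '%_maint'
go

-- 2) pull that spid's recent statements from the statement pipe
select SPID, CpuTime, WaitTime, LogicalReads, PhysicalReads, StartTime, EndTime
  from master..monSysStatement
 where SPID = 123          -- replace with the spid found above
 order by LogicalReads desc
go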

Another point I wanted to add: the repserver's stable devices are stored on a ZFS filesystem... I wonder if this could affect performance.

Note: the repserver is using eRSSD (SQLAnywhere).

Thanks, Vincent

Vincent Rives Mar 16, 2017 at 06:14 PM

To keep you posted:

- no ZFS compression/encryption/snapshot ==> ok

- rs_ticket in a quiet period, compared to a heavy period, shows a slower DSI during peak time

- interesting fact: on a quiet day (lower number of inbound segments) there is no delay (sorry for the new facts, I just joined the company)

- I am going to install a script to collect the "big" queries from the MDA tables on the RDB (a rough sketch follows the summary below)

- I noticed some slowdowns on the network between the repserver and the RDB (pings going from half a second to 3 seconds from time to time, plus ssh takes a long time to connect to the RDB UNIX server). Could this be an issue, or is it only slow when initiating the connection between the repserver and the RDB?

To summarize, my two troubleshooting paths are: determine the slow queries on the RDB, and see whether the network could be an issue.
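
A rough sketch of such a collection script, assuming the statement and SQL text pipes are enabled on the RDB (the dba_heavy_stmts table and the threshold are hypothetical; a real script would run the insert from cron at a short interval):

-- one-time setup: a repository table for the heavy statements
create table dba_heavy_stmts (
    collect_time  datetime,
    SPID          int,
    CpuTime       int,
    LogicalReads  int,
    PhysicalReads int,
    SQLText       varchar(255)
)
go

-- periodic collection: heavy statements joined to their SQL text
insert dba_heavy_stmts
select getdate(), s.SPID, s.CpuTime, s.LogicalReads, s.PhysicalReads, t.SQLText
  from master..monSysStatement s, master..monSysSQLText t
 where s.SPID = t.SPID
   and s.KPID = t.KPID
   and s.BatchID = t.BatchID
   and s.LogicalReads > 100000        -- arbitrary threshold
go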

thanks,

Mark A Parsons Mar 16, 2017 at 07:18 PM

In addition to collecting/analyzing MDA data and looking into the potential network issues ... during the slow ping/ssh times (into the RDB UNIX server) as well as the delays in replication into the RDB ... also take a look at the load on the RDB UNIX server (eg, excessive cpu usage at the OS level could delay pings, ssh *and* dataserver processing).

Vincent Rives Mar 22, 2017 at 01:14 PM

Hi Mark,

To answer your previous question, there was no excessive cpu usage.

I did two things that seem to have helped a lot (no delay this morning):

1- Using the MDA tables, I identified a huge amount of logical I/O against a single table (13 million rows). I reported this problem to the application manager. He disabled a flag in the application that was hitting this table and cleaned the 13 million rows down to 3 rows (a big cleanup indeed!). That flag was generating so many DELETE statements against the standby ASE that replication could never catch up... so we dropped the subscription, flushed the queue, and recreated the subscription with the dump marker (see the first sketch below).

2- I also switched the kernel mode from "process" to "threaded" and configured the syb_default_pool thread pool with 14 engines (see the second sketch below).
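
For step 1, the per-table logical I/O can be seen without the statement pipes via monOpenObjectActivity; a minimal sketch ('db1' is a placeholder for the replicate database name):

-- cumulative per-object activity since the counters were last cleared
select DBName, ObjectName, IndexID, LogicalReads, PhysicalReads,
       RowsInserted, RowsUpdated, RowsDeleted
  from master..monOpenObjectActivity
 where DBName = 'db1'
 order by LogicalReads desc
go

For step 2, the kernel mode switch and the pool sizing look roughly like this (the kernel mode change is static and needs a server restart; 14 is just the value used here):

-- switch the ASE kernel from process mode to threaded mode (restart required)
sp_configure 'kernel mode', 0, 'threaded'
go

-- after the restart, size the default engine pool to 14 threads
alter thread pool syb_default_pool with thread count = 14
go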

Result: there was no delay in replication this morning. Let's see if that is still the case in the coming days.

Thanks a lot for your guidance.

Vincent

Vincent Rives Mar 23, 2017 at 05:11 PM

The problem is fixed. There is no delay.

To summarize, what helped is:

- Fewer transactions against the same allpages-locked table

- The threaded kernel seems more efficient at running queries and benefiting from the statement cache.
