
SAP on RAC keepalive settings

Former Member
0 Kudos

Hello Oracle gurus!

I have a question regarding SAP reconnect behavior in combination with a RAC installation.
We have created a demo system for testing purposes on top of RAC 12.1 (SAP NW 7.4 with the latest kernel 7.42 and DBSL patches).

TAF has been configured, and if the main RAC node (the one SAP is connected to) shuts down cleanly (ACPI shutdown), the SAP WPs reconnect without problems (a few seconds after the VIP moves), which is the expected behavior.

But if the main node dies without any notification, as in a sudden power-off, the SAP WPs can run endlessly (I think 7200 sec by default), doing some internal work until max_wp_runtime passes and the WP is restarted completely.

We have found some recommendations to update kernel parameters at the OS level, like this:

To improve failover performance in a RAC cluster, consider changing the following IP kernel parameters as well:

net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_retries2, net.ipv4.tcp_syn_retries

For testing, we have changed these parameters on the server where SAP is installed:

net.ipv4.tcp_keepalive_time   = 30

net.ipv4.tcp_keepalive_intvl  = 10

net.ipv4.tcp_keepalive_probes = 2

In addition, the "enable=broken" setting has been added in tnsnames.

So the question is: is this how SAP with Oracle should be configured (in terms of failover), or have we missed something?

Also, these values are really low, so if you have some best-practice values, they are always welcome.
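
As a rough orientation for the values above: once keepalive is enabled on a socket, an idle connection to a dead peer is dropped after approximately tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds. The following is a minimal sketch of that arithmetic only; the function name is made up for illustration, and the second call uses the usual Linux defaults (7200 / 75 / 9).

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate seconds until an idle connection to a dead peer is dropped.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle time before the first probe)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (pause between probes)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return time_s + intvl_s * probes


# Values from the test system in this thread.
tested = keepalive_detection_seconds(30, 10, 2)

# Usual Linux defaults, for comparison.
linux_default = keepalive_detection_seconds(7200, 75, 9)

print(f"tested settings: ~{tested} s, Linux defaults: ~{linux_default} s")
```

Note that this only applies to sockets that have keepalive switched on, and only while the connection is idle; a connection with unacknowledged data in flight is governed by the retransmission settings instead.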

Accepted Solutions (1)

stefan_koehler
Active Contributor
0 Kudos

Hi Sergo,

> So the question is , this is how SAP with Oracle should be configured (in terms of fail-over) or we have missed something ?


The behavior that you observe is one of the issues with SAP that I tried to describe to you previously. I really have no idea why SAP does not implement FAN, but that is a different topic. With TAF alone you have to wait for time-outs, as this is a client-side feature. I also described some further details here:

However, I am a little bit confused, as the TCP time-out should occur pretty quickly after the listener VIP has failed over (and it should also do so very quickly in case of a power-off). What is your configured DELAY between the connection attempts, and how often do you retry (RETRIES)? Have you also set TCP-related parameters in your tnsnames.ora or service configuration?


You can also enable a SQL*Net trace on the client side to check the polling and failover behavior (maybe the client does not catch the ORA-12541, etc.):


Regarding the parameters you mentioned, please check MOS ID #249213.1 (especially point 2 for "10g/11g Timeout parameters") and MOS ID #364171.1.

SAP HANA also has such issues, and you can see some SAP recommendations in SAPnote #2053504 ("You should align the actual system settings on the duration of the takeover, because the clients cannot reconnect before the takeover has been completed.")


Regards

Stefan

Former Member
0 Kudos

Hi Stefan, thank you for the valuable suggestions!

DELAY=5 and RETRIES=3 are both set on the client side (tnsnames.ora) and on the server side (service definition).
We ran some tests, and setting tcp_retries2 to 5 (default 15) significantly improves the reconnect of idle WPs (and of running DIA WPs as well). But background jobs are still waiting for something and do not reconnect at any level; their sockets stay open and still point to the server that was shut down.
Tuning keepalive_time works well, and with it everything works as expected. I tried enabling SQL traces, but a lot of them (around 3 GB) were created during the test procedure, so it is really hard to dig in.
Could you please check, on any RAC installation you have, which values you have set for these parameters (tcp_retries2 and keepalive_time)?
And have you tried a "power off", and was the client able to reconnect without any issue after that?
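
To put the tcp_retries2 change into numbers: Linux retransmits unacknowledged data with exponential backoff, and tcp_retries2 caps the number of retransmissions before the connection is aborted. The sketch below estimates the resulting time-out under the assumption that the retransmission timer starts at its 200 ms minimum and is capped at 120 s; the real figure depends on the measured round-trip time, so treat the result as a hypothetical lower bound. The function name is illustrative only.

```python
def retransmit_timeout_seconds(retries2: int,
                               rto_min: float = 0.2,
                               rto_max: float = 120.0) -> float:
    """Hypothetical time until TCP gives up on unacknowledged data.

    Assumes the retransmission timeout starts at rto_min, doubles after
    every retransmission and never exceeds rto_max.
    """
    total = 0.0
    for attempt in range(retries2 + 1):
        total += min(rto_min * 2 ** attempt, rto_max)
    return total


default_timeout = retransmit_timeout_seconds(15)  # Linux default
tuned_timeout = retransmit_timeout_seconds(5)     # value tested in this thread

print(f"tcp_retries2=15: ~{default_timeout:.1f} s")
print(f"tcp_retries2=5:  ~{tuned_timeout:.1f} s")
```

Under these assumptions the default of 15 comes out at roughly a quarter of an hour, while 5 comes out at a bit under 13 seconds.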

I have found that one of the Oracle RAC experts faced the same behavior (but the blog post is pretty old, from 2007 (Oracle 11 RAC), and in Russian).
I can translate some text related to the issue from this blog:

"Stopped the public interface (ifconfig eth0 down).

On this node the listener is stopped, and within ~30 sec the VIP address is moved to another node.

And what about TAF? Nothing. The session "sticks" tightly and hangs. We waited 15 minutes, then got bored.

We guessed that the session does not fail over because it gets no error. In case of a shutdown abort the error comes immediately, but in this case it does not come at all.

The answer was not very complicated, but it is something not too well known.

It is called tcp_keepalive.

I.e. the Oracle session does not get an error because by default the underlying TCP/IP stack keeps trying to reestablish the connection.

The solution came in the form of adding ENABLE=BROKEN in tnsnames (which means trusting the OS settings) and changing the TCP parameters in Linux."
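
What ENABLE=BROKEN asks for can be illustrated at the socket level: keepalive is switched on for the connection, and the timing is left to the operating system. The following is a minimal sketch using plain Python sockets, not Oracle Net code; the helper name is made up, and reading back TCP_KEEPIDLE only works on platforms that expose it (Linux does).

```python
import socket


def enable_os_keepalive(sock: socket.socket) -> None:
    """Turn on TCP keepalive and leave the timing to the OS settings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_os_keepalive(sock)

# Non-zero means keepalive probes will be sent on this socket once connected.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)

# On Linux the per-socket idle time defaults to net.ipv4.tcp_keepalive_time.
if hasattr(socket, "TCP_KEEPIDLE"):
    idle = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
    print("idle seconds before first probe:", idle)

sock.close()
```

Without the socket option the kernel keepalive parameters have no effect on a connection, which is why the tnsnames setting and the OS tuning go together.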

Thanks

stefan_koehler
Active Contributor
0 Kudos

Hi Sergo,

> And have you tried "power off" and client was able to reconnect without any issue after that ?

Yes and yes. I even rebuilt your mentioned test case with "ifconfig <ETH> down"; see below.

> Could you please check (if you have one) on any RAC systems installation which values you have set for these parameters (tcp_retries2 and keepalive_time ) ?

Sure. I have an Oracle 12.1.0.2 RAC system running in my lab. However, I only have policy-managed databases (which SAP will hopefully allow in the near future as well: SAP & Oracle New Features in Software Provisioning Manager - slide 32), but here is the test case.

-- RAC Node 1

[root@OELRAC1 ~]# sysctl -a | grep -i net.ipv4.tcp_keepalive

net.ipv4.tcp_keepalive_intvl = 75

net.ipv4.tcp_keepalive_probes = 9

net.ipv4.tcp_keepalive_time = 7200

-- RAC Node 2

[root@OELRAC2 ~]# sysctl -a | grep -i net.ipv4.tcp_keepalive

net.ipv4.tcp_keepalive_intvl = 75

net.ipv4.tcp_keepalive_probes = 9

net.ipv4.tcp_keepalive_time = 7200

-- Client (OCI)

localhost:~ iStefan$ sysctl -A | grep net.inet.tcp | grep keep

net.inet.tcp.keepidle: 7200000

net.inet.tcp.keepintvl: 75000

net.inet.tcp.keepinit: 75000

net.inet.tcp.always_keepalive: 0

localhost:~ iStefan$ cat tnsnames.ora

     TESTTAF =

     (DESCRIPTION =

      (ADDRESS = (PROTOCOL = TCP)(HOST = oelracscan)(PORT = 1521))

        (CONNECT_DATA =

        (SERVICE_NAME = TESTTAF)

      )

     ) 

-- Test Case

[oracle@OELRAC1 ~]# srvctl add service -db TST -service TESTTAF -serverpool OELRAC -failovertype SELECT -clbgoal LONG -cardinality UNIFORM

[oracle@OELRAC1 ~]# srvctl start service -d TST -service TESTTAF

TEST@TESTTAF:194> select instance_name, host_name from v$instance;

INSTANCE_NAME     HOST_NAME

---------------- ----------------------------------------------------------------

TST_2         OELRAC1

[root@OELRAC1 ~]# ifconfig enp0s3 down

(1961494880) [12-OCT-2015 16:07:41:884] ntt2err: soc 6 error - operation=5, ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0

(1961494880) [12-OCT-2015 16:07:41:884] ntt2err: exit

(1961494880) [12-OCT-2015 16:07:41:884] nttfprd: exit

(1961494880) [12-OCT-2015 16:07:41:890] nserror: entry

(1961494880) [12-OCT-2015 16:07:41:890] nserror: nsres: id=0, op=68, ns=12547, ns2=12560; nt[0]=517, nt[1]=54, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0

(1961494880) [12-OCT-2015 16:07:41:890] nsbasic_brc: exit: oln=0, dln=0, tot=0, rc=-1

(1961494880) [12-OCT-2015 16:07:41:896] nioqrc:  wanted 1 got 0, type 0

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:  error from nioqrc

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:    ns main err code: 12547

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:    ns (2)  err code: 12560

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:    nt main err code: 517

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:    nt (2)  err code: 54

(1961494880) [12-OCT-2015 16:07:41:897] nioqper:    nt OS   err code: 0

(1961494880) [12-OCT-2015 16:07:41:897] nioqer: entry

(1961494880) [12-OCT-2015 16:07:41:898] nioqer:  incoming err = 12151

(1961494880) [12-OCT-2015 16:07:41:898] nioqce: entry

(1961494880) [12-OCT-2015 16:07:41:898] nioqce: exit

(1961494880) [12-OCT-2015 16:07:41:898] nioqer:  returning err = 3135

(1961494880) [12-OCT-2015 16:07:41:898] nioqer: exit

(1961494880) [12-OCT-2015 16:07:41:898] nioqrc: exit

(1961494880) [12-OCT-2015 16:07:41:898] nioqds: entry

(1961494880) [12-OCT-2015 16:07:41:898] nioqds:  disconnecting...

(1961494880) [12-OCT-2015 16:07:41:925] nsopen: opening transport...

(1961494880) [12-OCT-2015 16:07:41:925] nttcon: entry

(1961494880) [12-OCT-2015 16:07:41:925] nttcon: toc = 1

(1961494880) [12-OCT-2015 16:07:41:925] nttcnp: entry

(1961494880) [12-OCT-2015 16:07:41:925] nttcnp: creating a socket.

(1961494880) [12-OCT-2015 16:07:41:925] nttcnp: exit

(1961494880) [12-OCT-2015 16:07:41:925] nttcni: entry

(1961494880) [12-OCT-2015 16:07:41:925] nttcni: Tcp conn timeout = 60000 (ms)

(1961494880) [12-OCT-2015 16:07:41:925] nttcni: TCP Connect TO enabled. Switching to NB

(1961494880) [12-OCT-2015 16:07:41:925] nttctl: entry

TEST@TESTTAF:194> select instance_name, host_name from v$instance;

INSTANCE_NAME     HOST_NAME

---------------- ----------------------------------------------------------------

TST_1         OELRAC2

The ORA error occurred immediately after the VIP became available on the second node (in my case it took about 8 seconds).

> I have tried enabling SQL traces, but a lot of them (around 3GB) were created during testing procedure, so it's really hard to dig in.

SQL trace? You need a SQL*Net trace, and then only the relevant last part of it ("tail -f" after the failover, plus a few lines before it to compare the time stamps).
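
Comparing the time stamps in a large trace is easier with a small helper. The sketch below parses lines in the format of the trace excerpt shown above and measures the time between the first socket error and the reconnect attempt. The regular expression is derived only from those sample lines and is an assumption about the general trace layout; the function name is illustrative.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Layout assumed from the excerpt: (thread) [DD-MON-YYYY HH:MM:SS:mmm] func: message
LINE_RE = re.compile(
    r"^\((?P<tid>\d+)\) \[(?P<day>\d{2})-(?P<mon>[A-Z]{3})-(?P<year>\d{4}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}):(?P<ms>\d{3})\] "
    r"(?P<func>\w+):\s*(?P<msg>.*)$")


def parse_trace_line(line):
    """Return (timestamp, function, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None or match["mon"] not in MONTHS:
        return None
    stamp = datetime(int(match["year"]), MONTHS[match["mon"]], int(match["day"]),
                     int(match["h"]), int(match["m"]), int(match["s"]),
                     int(match["ms"]) * 1000)
    return stamp, match["func"], match["msg"]


sample = [
    "(1961494880) [12-OCT-2015 16:07:41:884] ntt2err: soc 6 error - operation=5, "
    "ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0",
    "(1961494880) [12-OCT-2015 16:07:41:898] nioqer:  returning err = 3135",
    "(1961494880) [12-OCT-2015 16:07:41:925] nsopen: opening transport...",
]
events = [parse_trace_line(line) for line in sample]

# Time between the socket error and the start of the reconnect, in milliseconds.
elapsed_ms = (events[-1][0] - events[0][0]) // timedelta(milliseconds=1)
print(f"{events[0][1]} -> {events[-1][1]}: {elapsed_ms} ms")
```

Fed with the tail of a real trace file instead of the sample list, the same approach shows how long the client sat on the dead socket before the error surfaced.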

Regards

Stefan

P.S.: You might now get an idea of the complexity that RAC introduces (even with just this tiny network part), which I tried to explain to you previously. In my opinion it is absolutely not worth the effort in 99% of cases, especially given the way SAP uses and limits RAC.

Former Member
0 Kudos

Hi Stefan,

I have finally found where the misconfiguration was.
I changed the local_listener parameters for both DB instances (set them to point to the relocatable VIP) and restarted the databases. After that, SAP established its initial connections to this VIP (before, they went to the physical IP). Now most selects and jobs run well and reconnect without any additional TCP timeouts. (I still have a problem with the DB02 --> "statistic refresh job": it still hangs without any logs or reconnects, but I think this is job-specific, maybe a DBSL bug or something like that; other jobs work well.)

Thank you as always for your advice.

stefan_koehler
Active Contributor
0 Kudos

Hi Sergo,

> still have some problem with DB02 --> "statistic refresh job" this one still hangs without any logs and reconnects, but I think this is job specific, maybe DBSL bug or something like this, other jobs working well

I am not quite sure whether "DB02 -> Statistic refresh job" executes brconnect or not, but in both cases DBMS_STATS is called in the background. However, PL/SQL (session state) is not covered or reconstructed by TAF. This is different from session or select failover.

You can also cross-check this in the official Oracle documentation: Enabling Advanced Features of Oracle Net Services


Server-side program variables: Server-side program variables, such as PL/SQL package states, are lost during failures, and TAF cannot recover them. They can be initialized by making a call from the failover callback.

... and failover callbacks are not implemented by SAP (SAPnote #1431241). Whether the application fails to intercept the corresponding ORA error can again be answered with a SQL*Net trace.
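
For illustration, this is roughly what "intercepting the ORA error" means on the application side when no failover callback is available: the call that was interrupted is caught and re-executed from the start on the new session. The sketch is purely hypothetical; DatabaseError, run_with_replay and gather_stats are made-up stand-ins, not SAP DBSL or Oracle client APIs. The error numbers are the usual lost-connection and failover codes (ORA-03113, ORA-03135, ORA-25408).

```python
import time

# ORA codes that typically indicate a lost connection or a non-replayable call.
FAILOVER_CODES = {3113, 3135, 25408}


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver error that carries an ORA code."""

    def __init__(self, code: int):
        super().__init__(f"ORA-{code:05d}")
        self.code = code


def run_with_replay(call, attempts: int = 3, delay_s: float = 0.0):
    """Re-run `call` from the start if it fails with a failover-related error."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except DatabaseError as exc:
            if exc.code not in FAILOVER_CODES or attempt == attempts:
                raise
            time.sleep(delay_s)  # give the VIP and service time to relocate


calls = 0


def gather_stats():
    """Hypothetical PL/SQL call that is interrupted once by a node failure."""
    global calls
    calls += 1
    if calls == 1:
        raise DatabaseError(25408)
    return "stats gathered"


result = run_with_replay(gather_stats)
print(result, "after", calls, "calls")
```

Replaying from the start is only safe if the interrupted call can be repeated as a whole, which is exactly the decision TAF cannot make for PL/SQL session state.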

Regards

Stefan
