
HANA Database does not start after failed takeover

Farid

Hello,

We are running SAP HANA 122.08 on SUSE 11.4 (IBM Power, ppc64).
We are using system replication in asynchronous mode to replicate sapqa2hana1 (primary) to sapqa2hana2 (secondary).

Last Friday, we performed a failover from the secondary, sapqa2hana2, using the hdbnsutil command. During the failover the hdbindexserver crashed and generated dump files, and the secondary database could not be restarted.

Therefore we disabled the replication and unregistered the database.

The primary database is now up and running, but the secondary database cannot start: the indexserver crashes whenever we try to start it, even though replication has been disabled.

I have killed all hdb processes at the OS level, cleaned up the semaphores, and cleaned /tmp.

That did not help.

There are several sap notes dealing with Asynchronous replication issues, but none of those notes apply to the recent SAP HANA 122.08 release.

I then double-checked the logs. What is preventing the database from restarting is the following error, which is constantly repeated. According to SAP Note 2455763 (System replication takeover failed with error Invalid logical page number), the corrections are already included in our database release.

23: 0x00000fff84a9af3c in TrexThreads::PoolThread::run()+0x8fc at PoolThread.cpp:389 (libhdbbasement.so)
24: 0x00000fff84a9c900 in TrexThreads::PoolThread::run(void*&)+0x20 at PoolThread.cpp:164 (libhdbbasement.so)
Possible root cause:
exception 1: no.3020031 (DataAccess/PageAccess/impl/LogicalPageAccessImpl.cpp:673)
Page 0xa7ab65L not found.; $owner$=[undodir]; $patype$=Default
exception throw location:
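
For anyone who wants to pull the same fields out of a long indexserver dump, here is a minimal Python sketch. It is not an SAP tool: the function name `summarize_dump` and the regular expressions are my own assumptions, modelled only on the two lines quoted above ("exception N: no.X (file:line)" and "Page 0x... not found"), so other dump layouts may need different patterns.

```python
import re

# Patterns modelled only on the excerpt quoted in this post (assumption).
EXCEPTION_RE = re.compile(r"exception\s+(\d+):\s+no\.(\d+)\s+\((.+):(\d+)\)")
PAGE_RE = re.compile(r"Page\s+(0x[0-9a-fA-F]+)L?\s+not found")
ATTR_RE = re.compile(r"\$(\w+)\$=\[?([^;\]\s]+)\]?")


def summarize_dump(text):
    """Return one dict per 'exception N:' line found in the dump text."""
    results = []
    current = None
    for line in text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            current = {
                "index": int(match.group(1)),
                "code": int(match.group(2)),
                "source": match.group(3),
                "line": int(match.group(4)),
                "page": None,
                "attrs": {},
            }
            results.append(current)
            continue
        if current is None:
            continue  # ignore stack frames that precede the first exception
        page = PAGE_RE.search(line)
        if page:
            current["page"] = int(page.group(1), 16)
        # Collect $key$=value pairs such as $owner$=[undodir]
        for key, value in ATTR_RE.findall(line):
            current["attrs"][key] = value
    return results


if __name__ == "__main__":
    excerpt = (
        "exception 1: no.3020031 "
        "(DataAccess/PageAccess/impl/LogicalPageAccessImpl.cpp:673)\n"
        "Page 0xa7ab65L not found.; $owner$=[undodir]; $patype$=Default\n"
    )
    for entry in summarize_dump(excerpt):
        print(entry)
```

Running it over the whole dump file makes it easier to see whether every crash reports the same page and the same owner (here `undodir`).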


I have attached the indexserver dump file generated whenever I try to restart the database

dumpfile.txt

Any help would be appreciated.

Thanks

Accepted Solutions (0)

Answers (3)

Farid

Hello,

Is there anyone working for the HANA development teams, or in contact with the HANA management teams, whom we could reach?

We are running the following infrastructure:

Version of Linux : SLES for SAP Applications 11.4 (ppc64)
Version kernel Linux : 3.0.101-71-ppc64
Version of HANA : 1.00.12208
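
When a kernel release like the one above has to be compared against a minimum level named in an SAP or SUSE note, a plain string comparison is misleading ("3.0.101" sorts after "3.0.76" numerically but not as text). Below is a small Python sketch of a numeric comparison. The helper names and the minimum release used in the example are hypothetical and are not taken from any note, and the tuple ordering is a simplification of how kernel packages are really versioned.

```python
import re

# Matches releases shaped like '3.0.101-71-ppc64' (assumed layout:
# <version>-<patch level>-<flavor>); other layouts raise ValueError.
_RELEASE_RE = re.compile(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)-(\w+)")


def parse_kernel_release(release):
    """Split a release string into (version tuple, patch tuple, flavor)."""
    match = _RELEASE_RE.fullmatch(release.strip())
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    patch = tuple(int(part) for part in match.group(2).split("."))
    return version, patch, match.group(3)


def is_at_least(release, minimum):
    """True if release is numerically >= minimum (flavor is ignored)."""
    version, patch, _ = parse_kernel_release(release)
    min_version, min_patch, _ = parse_kernel_release(minimum)
    return (version, patch) >= (min_version, min_patch)


if __name__ == "__main__":
    running = "3.0.101-71-ppc64"   # kernel reported in this thread
    minimum = "3.0.101-68-ppc64"   # hypothetical threshold, for illustration
    print(parse_kernel_release(running))
    print(is_at_least(running, minimum))
```

Passing such a check only shows that a kernel is at or above a stated level; it says nothing about whether a given filesystem bug is actually fixed in it.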

We are facing random HANA crashes, especially during HANA replication failover.

We keep receiving outdated, contradictory, incorrect and irresponsible responses. Is there anyone who could help direct us to the relevant SAP HANA Support teams?

I have added below the latest responses that we received:

The fact that we are running on SLES 11.4 with a recent kernel version does not seem to bother the SAP Support consultant. They found some older notes dealing with similar issues, so the customer should contact SUSE to solve the newer issues, and we should not expect any update of those SAP Notes. Quite logical, isn't it?

The fact that other customers have reported same issues does not seem to bother SAP Support.

The fact that SAP has reproduced the same issue on SLES 12.1 and 12.2 does not seem to be an issue whatsoever; we should just contact our hardware vendor.

And last but not least, we have been advised to run backups instead of relying on our system replication infrastructure! It is almost funny.

SOS, we need help (I mean real help).

Thank you

-----------------------------------------------------------------------------------------------------------------

Dear Customer,

Here are the SAP Notes on the earlier xfs issue.

1.
1726839 - SAP HANA DB: potential crash when using xfs filesystem

2.
1867783 - XFS Data Inconsistency Bug with SLES 11 SP2

So as advised in one of the Notes, you have to get the support of your
hardware partners to check where it went wrong with the hardware.
Usually, HANA checks page consistency when it reads from and writes
to the disk, and all we know in this case is that a page is no longer
consistent. This is probably due to an xfs issue, or your hardware
partner might be able to give a more accurate explanation.

Therefore, we are sorry that we cannot help further as that is not a
HANA issue but mostly related to underlying OS or hardware itself.
--------------------------------------------------------------------------------------------------------

Dear Customer,

The xfs corruption issue is an old issue. When it first appeared a
while ago in HANA systems, our experts discussed with SLES and then
they delivered a fix in some Kernel version. Then that issue was a
well-known and reproducible issue and fixed in a particular Kernel.
Then after a few Kernel versions recently, the issue started to appear
again (such as in your case) in newer Kernel versions.


So this second appearance of this xfs bug is relatively new and needs
to be fixed again by SLES in next Kernel versions.

Also we had some SAP notes regarding this xfs issue. I will ask our
expert to update the note with the newer developments of the bug and
forward to you.

Hope then it will clarify your doubts and sorry for the inconvenience.

Thank you!

---------------------------------------------------------------------------------------------

Dear Customer,

Please refer to the following comments from the developer who was
working on the bug for this incident;

QUOTE:
In restart, before the log recovery takes place, the undo container
directory is iterated to build up the transactions which were open in
the savepoint from the undo information. The crash happens because the
iteration of the undo container directory leads to a failure (Page not
found) which means that the page linkage in the underlying
PageChainContainer is not correct. This is for sure not an issue in
PageChainContainer coding, as this coding has been in use for many years,
is used by all persistent containers in all customer systems, has not
changed for a long time, and we have not seen such an inconsistency
anywhere up to now.

First crash happened at 2017-04-28 16:43:19.213288 and filesystem used
for DATA and LOG area is xfs. Kernel version used is 3.0.101-71-ppc64.
Although according to note 2240716 the xfs lost write issue should be
solved, this inconsistency could be caused by an xfs bug. We have seen
recently other customer issues with data corruptions with xfs and
kernel versions higher than stated in note 2240716 (see e.g. Bug
143386, Bug 130641 or Bug 143658).

I could also reproduce xfs filesystem corruptions in an in-house system
with a current SLES12 SP1 and SLES12 SP2 kernel version.


UNQUOTE:
Therefore, I'm afraid that we can't do much here other than
maintaining regular backups.


Thank you!
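
To picture what the developer describes in the quote above (walking a chain of linked pages at restart and failing when one link points to a page that is not there), here is a toy Python model. It is only an illustration under my own assumptions: the dictionary layout, `iterate_chain` and `PageNotFound` are made-up names and have nothing to do with HANA's real PageChainContainer code. The page number is reused from the dump purely as an example value.

```python
class PageNotFound(Exception):
    """Raised when a link points to a page that is not in the store."""


def iterate_chain(pages, first_page):
    """Yield (page_no, payload) by following next-page links.

    pages maps page_no -> (payload, next_page_no or None).
    """
    page_no = first_page
    seen = set()
    while page_no is not None:
        if page_no in seen:
            raise ValueError(f"cycle detected at page {page_no:#x}")
        if page_no not in pages:
            # The link is broken: the referenced page was never persisted
            # or was lost, which is the situation the error describes.
            raise PageNotFound(f"Page {page_no:#x} not found")
        seen.add(page_no)
        payload, next_no = pages[page_no]
        yield page_no, payload
        page_no = next_no


if __name__ == "__main__":
    healthy = {0x10: ("undo A", 0x11), 0x11: ("undo B", None)}
    print(list(iterate_chain(healthy, 0x10)))

    # Same chain, but the last link points at a page missing from the store.
    broken = {0x10: ("undo A", 0x11), 0x11: ("undo B", 0xA7AB65)}
    try:
        list(iterate_chain(broken, 0x10))
    except PageNotFound as err:
        print(err)
```

The point of the model is that the traversal code itself can be correct and still fail: if a write that should have persisted a page is lost underneath it, the chain ends up referencing a page that does not exist.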


Farid

Hi,

For those who are running System Replication on SUSE with an xfs filesystem:

SAP Dev Support helped us to restart the secondary. But when we asked for the root cause, it turned out to be a bug, a SLES bug, on ALL SLES kernel versions. SAP reproduced the issue on SLES12 SP1 and SLES12 SP2 kernel versions.

We received a weird and worrying response asking us to perform backups instead.

Obviously, I responded asking for a fix.

Will keep you informed.


Hello Raoul,

Have you been able to resolve this issue?

If not, kindly check the SAP Note below.

https://launchpad.support.sap.com/#/notes/0002367935

Regards

Satish