
HANA Database does not start after failed takeover

Hello,

We are running SAP HANA 122.08 on SUSE 11.4 (IBM Power ppc64).
We are using system replication in ASYNCHRONOUS mode to replicate sapqa2hana1 (primary) to sapqa2hana2 (secondary).

Last Friday, we performed a failover, run from the secondary sapqa2hana2 using the hdbnsutil command. During the failover, hdbindexserver crashed and generated dump files, and the secondary database could not be restarted.

Therefore we disabled the replication and unregistered the database.

The primary database is now up and running, but the secondary database will not start: the indexserver crashes whenever we try to start it, even though the replication has been disabled.

I have killed all hdb processes at the OS level, cleared the semaphores, and cleaned out /tmp.

That did not help.

There are several SAP Notes dealing with asynchronous replication issues, but none of them apply to the recent SAP HANA 122.08 release.

I then double-checked the logs. What is preventing the database from restarting is the following error, which is constantly being repeated. According to SAP Note 2455763 (System replication takeover failed with error Invalid logical page number), the corrections are already included in our database release.

23: 0x00000fff84a9af3c in TrexThreads::PoolThread::run()+0x8fc at PoolThread.cpp:389 (libhdbbasement.so)
24: 0x00000fff84a9c900 in TrexThreads::PoolThread::run(void*&)+0x20 at PoolThread.cpp:164 (libhdbbasement.so)
Possible root cause:
exception 1: no.3020031 (DataAccess/PageAccess/impl/LogicalPageAccessImpl.cpp:673)
Page 0xa7ab65L not found.; $owner$=[undodir]; $patype$=Default
exception throw location:
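
For readers comparing their own dumps against this one, the fields in the "Possible root cause" lines can be pulled out mechanically. The following is a minimal Python sketch that assumes the dump lines have exactly the layout shown in the excerpt above; the function name `parse_root_cause` and the regular expressions are illustrative and not part of any SAP tool.

```python
import re

# Excerpt copied from the "Possible root cause" section of the dump above.
DUMP_EXCERPT = """\
Possible root cause:
exception 1: no.3020031 (DataAccess/PageAccess/impl/LogicalPageAccessImpl.cpp:673)
Page 0xa7ab65L not found.; $owner$=[undodir]; $patype$=Default
"""

# Assumed layouts, inferred only from the excerpt above.
EXCEPTION_RE = re.compile(r"exception\s+(\d+):\s+no\.(\d+)\s+\(([^:()]+):(\d+)\)")
PAGE_RE = re.compile(r"Page\s+(0x[0-9a-fA-F]+)L?\s+not found")
ATTR_RE = re.compile(r"\$(\w+)\$=\[?([^;\]\s]+)\]?")


def parse_root_cause(text):
    """Return the exception fields found in a root-cause excerpt as a dict."""
    info = {}
    match = EXCEPTION_RE.search(text)
    if match:
        info["index"] = int(match.group(1))
        info["error_no"] = int(match.group(2))
        info["source_file"] = match.group(3)
        info["source_line"] = int(match.group(4))
    match = PAGE_RE.search(text)
    if match:
        # The trailing "L" in the dump is dropped; the page number is hexadecimal.
        info["missing_page"] = int(match.group(1), 16)
    # Collect $key$=value attributes such as $owner$ and $patype$.
    for key, value in ATTR_RE.findall(text):
        info[key] = value
    return info


if __name__ == "__main__":
    for key, value in parse_root_cause(DUMP_EXCERPT).items():
        print(key, "=", value)
```

The `owner` field is the useful one here: `undodir` indicates that the missing page belongs to the undo directory rather than to table data.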


I have attached the indexserver dump file generated whenever I try to restart the database

dumpfile.txt

Any help would be appreciated.

Thanks

dumpfile.txt (65.4 kB)


3 Answers

  • May 02, 2017 at 12:04 PM

    Hello Raoul,

    Have you been able to resolve this issue?

    If not, please check the SAP Note below.

    https://launchpad.support.sap.com/#/notes/0002367935

    Regards

    Satish


  • May 10, 2017 at 09:07 PM

    Hi,

    For those who are running System Replication on SUSE with an xfs file system:

    SAP Dev Support helped us restart the secondary. But when we asked for the root cause, it turned out to be a bug, a SLES bug, on ALL SLES kernel versions. SAP reproduced the issue on SLES12 SP1 and SLES12 SP2 kernel versions.

    We received a weird and worrying response asking us to perform backups instead ...

    Obviously, I responded asking for a fix ...

    Will keep you informed.


  • May 15, 2017 at 07:52 PM

    Hello,

    Is there anyone working for the HANA development teams, or in contact with the HANA management teams, whom we could reach?

    We are running the following infrastructure :

    Version of Linux : SLES for SAP Applications 11.4 (ppc64)
    Version kernel Linux : 3.0.101-71-ppc64
    Version of HANA : 1.00.12208

    We are facing random HANA crashes, especially during HANA replication failover.

    We keep receiving outdated, contradictory, incorrect and irresponsible responses. Is there anyone who could help redirect us to the relevant SAP HANA Support teams?

    I have added below the latest responses that we received:

    The fact that we are running on SLES 11.4 with a recent kernel version does not seem to bother the SAP Support consultant. They found some older notes dealing with similar issues ... Therefore the customer should contact SLES to solve the newer issues, and we should not expect any update of those SAP Notes. Quite logical, isn't it?

    The fact that other customers have reported the same issues does not seem to bother SAP Support.

    The fact that SAP has reproduced the same issue on SLES 12.1 and 12.2 does not seem to be an issue whatsoever; we should just contact our hardware vendor.

    And last but not least, we have been advised to run backups ... instead of relying on our system replication infrastructure! It is almost funny.

    SOS, we need help (I mean real help).

    Thank you

    -----------------------------------------------------------------------------------------------------------------

    Dear Customer,

    Here are the SAP Notes on earlier xfs issue.

    1.
    1726839 - SAP HANA DB: potential crash when using xfs filesystem

    2.
    1867783 - XFS Data Inconsistency Bug with SLES 11 SP2

    So as advised in one of the Notes, you have to get the support of your
    hardware partners to check where it went wrong with the hardware.
    Usually, HANA checks for the page consistency when it reads and writes
    to the disk and all what we know in this case is that a page is not
    consistent anymore. This is probably due to xfs issue or your hardware
    partner might be able to give more accurate reasoning.

    Therefore, we are sorry that we cannot help further as that is not a
    HANA issue but mostly related to underlying OS or hardware itself.
    --------------------------------------------------------------------------------------------------------

    Dear Customer,

    The xfs corruption issue is an old issue. When it first appeared a
    while ago in HANA systems, our experts discussed with SLES and then
    they delivered a fix in some Kernel version. Then that issue was a
    well known and reproducible issue and fixed in a particular Kernel.
    Then after a few Kernel versions recently, the issue started to appear
    again (such as in your case) in newer Kernel versions.


    So this second appearance of this xfs bug is relatively new and needs
    to be fixed again by SLES in next Kernel versions.

    Also we had some SAP notes regarding this xfs issue. I will ask our
    expert to update the note with the newer developments of the bug and
    forward to you.

    Hope then it will clarify your doubts and sorry for the inconvenience.

    Thank you!

    ---------------------------------------------------------------------------------------------

    Dear Customer,

    Please refer to the following comments from the developer who was
    working on the bug for this incident;

    QUOTE:
    In restart, before the log recovery takes place, the undo container
    directory is iterated to build up the transactions which were open in
    the savepoint from the undo information. The crash happens because the
    iteration of the undo container directory leads to a failure (Page not
    found) which means that the page linkage in the underlying
    PageChainContainer is not correct. This is for sure not an issue in
    PageChainContainer coding as this coding has been in use for many years,
    is used by all persistent containers in all customer systems, has not
    changed in a long time, and we have not seen such an inconsistency
    anywhere up to now.

    First crash happened at 2017-04-28 16:43:19.213288 and filesystem used
    for DATA and LOG area is xfs. Kernel version used is 3.0.101-71-ppc64.
    Although according to note 2240716 the xfs lost write issue should be
    solved, this inconsistency could be caused by an xfs bug. We have seen
    recently other customer issues with data corruptions with xfs and
    kernel versions higher than stated in note 2240716 (see e.g. Bug
    143386, Bug 130641 or Bug 143658).

    I could also reproduce xfs filesystem corruptions in an in-house
    system with a current SLES12 SP1 and SLES12 SP2 kernel version.


    UNQUOTE:
    Therefore, I'm afraid that we can't do much here other than
    maintaining regular backups.


    Thank you!
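
    To make the developer's explanation quoted above easier to picture, here is a toy Python model of walking a chain of linked pages. It is only a sketch under my own assumptions: the names `iterate_chain` and `PageNotFound`, and the dict used as a page store, are hypothetical and have nothing to do with the real PageChainContainer code. It shows why a single stale link is enough to abort the whole iteration at restart.

```python
class PageNotFound(Exception):
    """Raised when a link points to a page that is not in the store."""


def iterate_chain(pages, first_page):
    """Yield page numbers by following each page's 'next' link.

    pages maps a page number to the number of the next page in the
    chain, or to None for the last page (toy stand-in for a page store).
    """
    current = first_page
    while current is not None:
        if current not in pages:
            # The previous page links to a page that was never persisted.
            raise PageNotFound(f"Page {current:#x} not found.")
        yield current
        current = pages[current]


if __name__ == "__main__":
    healthy = {0x10: 0x11, 0x11: 0x12, 0x12: None}
    print([hex(p) for p in iterate_chain(healthy, 0x10)])

    # Same chain, but the write of the last page was lost: the link
    # from 0x11 now points to a page that does not exist.
    broken = {0x10: 0x11, 0x11: 0xA7AB65}
    try:
        list(iterate_chain(broken, 0x10))
    except PageNotFound as exc:
        print("iteration aborted:", exc)
```

    In this model the data in the first two pages is intact, yet the walk still fails, which matches the behaviour described: the crash is caused by the broken linkage, not by the code doing the iteration.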

    .
