AWS EC2 Instance SAP Hana XFS File System Corruption

Our SAP HANA instance on AWS EC2 is currently inaccessible. The system logs show multiple critical errors during boot, mainly file system corruption and service startup failures.

#### Identified Issues

1. XFS File System Corruption

- Description: The logs show a Metadata CRC error and an I/O error in the XFS file system.

- Impact: Such errors typically prevent the operating system from accessing vital file system data, leading to boot failures or system instability.

2. Failed Service Startups

- Failed Services: Critical failures in starting services like 'Setup Virtual Console'.

- Dependency Failures: Issues in 'dracut ask for additional cmdline parameters' and 'Reload Configuration from the Real Root'.

- Impact: Failure in starting these services impedes the initialization process, leading to an inability to access the instance.

3. Mounting Failure of /sysroot

- Description: The system failed to mount `/sysroot`, a crucial step in the boot process.

- Impact: This failure is a critical blocker for the boot process, rendering the system unusable.

4. Correctable Errors Collector Initialization

- Description: RAS Correctable Errors collector was initialized.

- Potential Indication: This could indicate underlying stability issues, potentially at the hardware level.

5. Keylock Active Warning

- Description: A warning regarding keylock being active.

- Relevance: While not directly related to the access issue, it indicates potential configuration or input device issues.

OS:
NAME="SLES"

VERSION="15-SP2"

VERSION_ID="15.2"

PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"

ID="sles"

ID_LIKE="suse"

ANSI_COLOR="0;32"

CPE_NAME="cpe:/o:suse:sles:15:sp2"

Linux 5.3.18-22-default #1 SMP Wed Jun 3 12:16:43 UTC 2020 (720aeba)

Any help would be appreciated.

i-0223e819f2c1c5474.txt

Accepted Solutions (0)

Answers (2)

mamartins
Active Contributor

That's a very insightful answer from AWS.

It seems that your SLES version has a bug (see link [7]), so you should update your OS or apply the workaround.

mamartins
Active Contributor

First and most important, open an incident on AWS support and report the problem. Let them check if there is anything wrong with the EC2 instance.

Then, check your backups and have them ready in case a restore of the DB is needed.

0 Kudos

We have formally submitted our case to AWS regarding "AWS EC2 Instance SAP Hana XFS File System Corruption". Below is their response for your review:

I understand that you have a SUSE Linux Enterprise Server 15 SP2 instance running SAP HANA v2 and that you are encountering a Metadata corruption failure. You have mentioned this issue has occurred multiple times in the past and you have resolved it each time with the 2 methods (running xfs_repair and restoring the instance to an older backup / snapshot). You are seeking a permanent solution to the issue. Please correct me if I am mistaken.

Thank you for providing the instance ID (i-0223e819f2c1c5474). I have taken a look at the instance and can see it is currently passing both status checks and has been operational since 2023-11-08 02:43 UTC. I have reviewed the underlying hardware for the instance and its volumes for the past 90 days. There have been no faults or issues with the hardware. The issue is occurring in your environment.

Thank you for providing the '11-08-23 HANA Server log.txt' file. I have reviewed the logs and identified the following error messages related to the issue:

[ 5.697536] XFS (nvme0n1p3): Metadata corruption detected at xfs_agi_verify+0x3a/0x160 [xfs], xfs_agi block 0x1deebd2
[ 5.700389] XFS (nvme0n1p3): Unmount and run xfs_repair
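For anyone triaging a similar console log, the affected device, the failing verifier and the block number can be pulled out of messages like the ones above with a small filter. This is only a sketch: the function name `xfs_corruption_summary` is made up for illustration, and the pattern is fitted to the message format quoted here, not to every possible XFS error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<device> <verifier> <block>" for every
# "Metadata corruption detected" line found in the given files (or stdin).
xfs_corruption_summary() {
  sed -n -E \
    's/.*XFS \(([^)]+)\): Metadata corruption detected at ([A-Za-z0-9_]+)\+.* block (0x[0-9a-fA-F]+).*/\1 \2 \3/p' \
    "$@"
}
```

It could be pointed at a saved console log, for example `xfs_corruption_summary '11-08-23 HANA Server log.txt'`, to see whether the same partition and structure are hit on every occurrence.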

As you mentioned, you have already followed the steps recommended by the above error message to repair the volume and restore the instance to operation.
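For readers who have not done this before, the "unmount and run xfs_repair" step might look roughly like the sketch below. It assumes the damaged volume has been attached to a separate rescue instance (the failing mount is `/sysroot`, so the partition is most likely the root file system and cannot be unmounted on the instance itself), that a fresh snapshot exists, and that the partition shows up there under a name such as `/dev/nvme1n1p3`, which is a placeholder. The function name `repair_xfs` and the `DRY_RUN` switch are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check an unmounted XFS partition and repair it only
# if the read-only check reports corruption.
# DRY_RUN defaults to 1, which just prints the commands; set DRY_RUN=0 to run them.
repair_xfs() {
  local dev="$1"
  if [ "${DRY_RUN:-1}" != "0" ]; then
    printf '%s\n' "umount $dev" "xfs_repair -n $dev" "xfs_repair $dev"
    return 0
  fi
  umount "$dev" 2>/dev/null   # ignore the error if it was not mounted
  # -n is xfs_repair's no-modify mode; it exits non-zero when corruption is found.
  if xfs_repair -n "$dev"; then
    echo "no corruption reported on $dev"
  else
    xfs_repair "$dev"
  fi
}
```

Running `repair_xfs /dev/nvme1n1p3` prints the planned commands; `DRY_RUN=0 repair_xfs /dev/nvme1n1p3` would execute them as root. If xfs_repair itself asks for the log to be replayed first, follow its message instead of forcing the repair.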

I am unable to identify the root cause of why the Metadata corruption is occurring due to limited knowledge of your environment. AWS Premium Support has no visibility into the resources provisioned on your account, or the data stored on those resources. This is due to AWS’s strict data privacy and security policies that ensure that only customers have access to their data. Further information on this and the AWS Shared Model of Responsibility can be found here [1] [2].

I will suggest the following solutions you can attempt in order to have a permanent solution to your issue. It is highly recommended to make fresh backups / snapshots of your systems before performing any of the suggested solutions. It is also recommended to test any of these solutions in a test environment before moving to production.

1. Update the SUSE Linux 15 OS to the latest Service Pack. Your instance is currently running SLES 15 SP2. According to the lifecycle page, general support for this Service Pack ended on December 31st, 2021 [3]. It is a general rule to keep your systems as up to date as possible. You can attempt to update your OS to the latest Service Pack and run the server. If the issue persists, that will help us narrow down the cause.

Please note: SLES 15 SP2 cannot be directly updated to SP5. Please review the following guide [4] to determine the correct upgrade path required. SP2 will need to be updated first to SP3, then SP4 and then finally SP5 in consecutive order. Skipping service packs is not recommended. At the minimum you must upgrade to SP4 before attempting to upgrade to SP5.
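To make the consecutive order explicit, the steps can be listed with a tiny helper. This is just an illustration: `upgrade_path` is a made-up name, it takes the service pack numbers as plain integers, and it only prints the hops; it does not perform any migration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each single-step service pack hop between two
# SLES 15 service pack numbers, without skipping any.
upgrade_path() {
  local from="$1" to="$2" sp
  for (( sp = from; sp < to; sp++ )); do
    echo "SLES 15 SP${sp} -> SP$((sp + 1))"
  done
}
```

`upgrade_path 2 5` lists the three hops described above (SP2 to SP3, SP3 to SP4, SP4 to SP5).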

2. Ensure your current SUSE Linux OS is up to date.

If you do not want to update from SP2 to SP5, please ensure the SP2 OS is up to date and running the latest available kernel. The latest kernel for SLES 15 SP2 is 5.3.18-150200.24.169.1 (released 06-Nov-2023).

You can run the following CLI command to view your current kernel: `uname -r`

To update your SUSE Linux kernel, you can run the following CLI command: `sudo zypper patch`

This will update to the latest available version. For more information on the Zypper package manager, you can consult the following guide [5].
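A quick way to see whether the running kernel is behind the version quoted above is to compare the two strings. The sketch below assumes GNU `sort -V` is available and that `uname -r` reports the kernel with a `-default` suffix and without the package's trailing rebuild counter, which is how SLES normally shows it; `kernel_is_older` and `LATEST_SP2_KERNEL` are illustrative names, and the version is the one from the AWS reply, which may since have been superseded.

```shell
#!/usr/bin/env bash
# Latest SLES 15 SP2 kernel as quoted in the AWS reply (may be outdated by now).
LATEST_SP2_KERNEL="5.3.18-150200.24.169.1"

# Hypothetical helper: exit 0 when the running kernel ($1) is older than
# the reference package version ($2), exit 1 otherwise.
kernel_is_older() {
  local running="${1%-default}"   # drop the flavor suffix from `uname -r`
  local latest="$2"
  # Same release, with or without the trailing rebuild counter: not older.
  if [ "$running" = "$latest" ] || [ "$running" = "${latest%.*}" ]; then
    return 1
  fi
  # Otherwise the running kernel is older if it sorts first in version order.
  [ "$(printf '%s\n%s\n' "$running" "$latest" | sort -V | head -n 1)" = "$running" ]
}
```

On the affected system this could be used as `kernel_is_older "$(uname -r)" "$LATEST_SP2_KERNEL" && echo "kernel update pending"`. With the kernel from the original post (5.3.18-22-default) the check reports that an update is pending.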

3. Update the SAP System Kernel. Ensure your SAP HANA Service Pack 5 is fully updated. See the official SAP documentation on updating the System Kernel [6].

I have found relevant documentation regarding XFS. While this page deals with SLES 12, the core principles still apply. Please review the following documentation [7] regarding XFS Metadata corruption errors for possible solutions.

You can also contact the SUSE team directly via the following webpage [8] for support regarding your SUSE Linux OS and services.

If you have any further questions, please let me know via a reply to this case in the support center and I will be happy to help.

==========References==========

[1] Data Privacy FAQ - https://aws.amazon.com/compliance/data-privacy-faq/

[2] AWS Shared model of responsibility - https://aws.amazon.com/compliance/shared-responsibility-model/

[3] SUSE Linux Enterprise Server 15 Lifecycle - https://www.suse.com/lifecycle/#suse-linux-enterprise-server-15

[4] SLES Upgrade Path - https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-upgrade-paths.html

[5] SUSE Zypper package manager - https://documentation.suse.com/smart/systems-management/html/concept-zypper/index.html

[6] Update the SAP System kernel - https://help.sap.com/docs/SAP_LANDSCAPE_MANAGEMENT_ENTERPRISE/e7dead4286c545808b3bd24feee7448c/1741e...

[7] XFS metadata corruption and invalid checksum on SAP Hana servers - https://www.suse.com/support/kb/doc/?id=000019192

[8] SUSE Direct Support - https://www.suse.com/contact/