Database Disconnected error during Veeam backup

Hi Everyone,

We are experiencing a problem where some clients receive the "Database disconnected" error at the end of a Veeam backup.

Not all clients receive the error, and it can be different clients each time. In some cases, a client may have 2 databases open, and only 1 receives the error while the other is completely fine.

The problem does not occur with every backup, but it does occur with a significant portion of them.

We are running the following:
SAP B1 9.1 PL08
SQL Server 2014
VMware ESXi 5.5 Update 2
Veeam Backup & Replication 9.5 (running on a physical server)

Has anyone else experienced anything similar? We use Veeam to take hourly incrementals, so the issue is affecting users several times a day and becoming quite frustrating.

Thanks,
Michael


2 Answers

  • Dec 14, 2018 at 06:27 PM

    Hi Michael,

    I see this is a pretty old thread, but I wonder if you ever resolved the issue? We recently migrated our SAP ECC on SQL Server environment into VMware and started using Veeam for OS-level backups, and we are now observing the same behavior you describe. I've found older discussion threads indicating that this has been a known issue for some time (see https://archive.sap.com/discussions/thread/3715817, and also https://www.veeam.com/kb1681, which acknowledges the issue and leads to some further links, but basically says "sorry, not our problem, it's a VMware thing!").

    We did not notice the issue in our DEV or QAS environments, I think because in those the SAP application and SQL database reside on the same hosts (or VMs, now), whereas in PRD we have a dedicated database server. It's the connection from the separate SAP application VM to the database VM that gets interrupted during the snapshot removal and consolidation at the end of the Veeam backup of the database server.

    In our case, we are not using Veeam to back up the database directly; we use SQL Server native tools to back it up to disk, and then Veeam captures that backup file during its backup of the whole server.

    In any case, I'm still researching the issue from the SAP and SQL side of things, while our Systems Operations team is researching from the VMware and Veeam angles. We don't yet have a solution.

    Cheers,
    Matt


    • PS, I took the liberty of adding some more tags to the question, as this isn't restricted to the Business One environment, but seems to affect all SAP on SQL Server on VMware installations.

  • Mar 07 at 01:39 PM

    Hello All,

    I can't offer a 100% solution to the issue, but I can tell you some steps that I used to mitigate it for our overnight jobs.

    Our scenario:

    We moved our PRD database box for our SAP ECC 6.04 system from a physical server cluster hosting a SQL Server 2008 database on Windows Server 2008 to a SQL Server 2016 High Availability Group with 3 virtual nodes/replicas in total. We experienced the same issues as Michael and Matt. We are using Veeam to back up the servers directly.

    The DBIF_XXXX_SQL_ERROR (where XXXX stands for NTAB, RTAB, or REPO) would come up in the logs while the nightly and/or incremental backup was running, and the length of the disconnect seemed to vary depending on the activity level on the server. We have test and quality systems that use the same backups without error, but they are not using High Availability and have very few users (analysts and developers). Our test environments differ from Matt's in that they are distributed (app server and DB host on separate servers). The overnight backups were originally running at 2am.

    Original Physical Systems:

    • Windows Server 2008
    • SQL Server 2008

    New Virtual Systems:

    • Windows Server 2016
    • SQL Server 2016

    Veeam & VMware:

    • VMware 6.0
    • Veeam 9.5

    Actions taken to address the issue:

    1) System Activity: I asked for and received a report that showed when each of our overnight batch jobs ran. The time period with the fewest jobs starting and/or running was from 3:21am to 4:00am. The backup was rescheduled for 3:21am.

    2) Network Throughput: Added more NICs to each of the servers/nodes/replicas (using NIC Teaming - Switch Independent + Address Hash).

    3) Synchronization: Began using asynchronous synchronization between the primary and secondary replicas to reduce the lag end users were experiencing. I was initially concerned about doing this, but I've monitored it and the secondary replicas are not more than 30 minutes behind the primary replica. It doesn't appear that Michael or Matt are using High Availability, but if anyone else is, you can check the latency of the asynchronous replicas with the following query:

    -- Seconds since each database replica last hardened a log block.
    SELECT ag.name AS ag_name,
           ar.replica_server_name AS ag_replica_server,
           dr_state.database_id AS database_id,
           CASE WHEN ar_state.is_local = 1 THEN N'LOCAL' ELSE N'REMOTE' END AS is_ag_replica_local,
           CASE WHEN ar_state.role_desc IS NULL THEN N'DISCONNECTED' ELSE ar_state.role_desc END AS ag_replica_role,
           dr_state.last_hardened_lsn,
           dr_state.last_hardened_time,
           DATEDIFF(SECOND, dr_state.last_hardened_time, GETDATE()) AS [seconds behind primary]
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id
    JOIN sys.dm_hadr_availability_replica_states AS ar_state ON ar.replica_id = ar_state.replica_id
    JOIN sys.dm_hadr_database_replica_states AS dr_state
      ON ag.group_id = dr_state.group_id AND dr_state.replica_id = ar_state.replica_id;

    4) Incremental backups: We replicate offsite for DR and have 2 secondary nodes, so running incremental backups onsite has not been a priority for us.
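
    For anyone who wants to repeat step 1 without eyeballing a report, here is a minimal Python sketch of how the quietest backup window could be picked from a list of batch job run times. The function name `quietest_window`, the helper `at`, and the sample jobs are all hypothetical and made up for illustration; they are not taken from our actual job report, and the "load" measure (total job-minutes overlapping the window) is just one reasonable way to score activity.

```python
from datetime import datetime, timedelta


def quietest_window(jobs, night_start, night_end, window_minutes):
    """Return (start, load) for the window with the least job activity.

    jobs: list of (start, end) datetime pairs, one per batch job run.
    load: total job-minutes that overlap the candidate window.
    Returns (None, None) if the window does not fit in the night at all.
    """
    window = timedelta(minutes=window_minutes)
    step = timedelta(minutes=1)
    best_start, best_load = None, None
    candidate = night_start
    while candidate + window <= night_end:
        window_end = candidate + window
        load = 0.0
        for job_start, job_end in jobs:
            # Overlap between this job and the candidate window.
            overlap = min(job_end, window_end) - max(job_start, candidate)
            if overlap > timedelta(0):
                load += overlap.total_seconds() / 60
        # Strict "<" keeps the earliest window when several tie.
        if best_load is None or load < best_load:
            best_start, best_load = candidate, load
        candidate += step
    return best_start, best_load


# Hypothetical sample data (assumption, not real job history).
day = datetime(2019, 3, 1)


def at(hour, minute):
    return day.replace(hour=hour, minute=minute)


jobs = [
    (at(1, 0), at(3, 20)),
    (at(2, 0), at(3, 21)),
    (at(4, 0), at(5, 0)),
]

start, load = quietest_window(jobs, at(1, 0), at(5, 0), window_minutes=39)
print(start.strftime("%H:%M"), load)  # prints "03:21 0.0"
```

    With the made-up jobs above, the only 39-minute window with no overlapping job starts at 03:21. In practice you would feed in the real start and end times from your job scheduler and pick a window length that matches how long your backup and snapshot consolidation usually take.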

    Results:

    Once in a while, we will get a DBIF error, but it doesn't cause any of our overnight jobs to cancel. The error is no longer a nightly occurrence, and is the exception rather than the norm. I think I have seen one dump in a month.

    Caveats:

    Our database is relatively small (less than 1 TB).

    Conclusions:

    Thus far, it seems like the likelihood of a dump/disconnect is proportional to the amount of activity on the servers and the NICs at the time the backup is taken. Since we started replicating offsite, I haven't invested a lot of time into running the incremental backups onsite.

    Helpful Links:

    In addition to the links Matt shared, here are a few that others might find helpful if you are moving to this environment:

    MS NIC Teaming - pay special attention to the Tips

    SAP - SQL Server 2016 AlwaysOn
