Solved: Error starting database / nameserver

Former Member · ‎06-08-2016

Hi gurus,

After unexpected server shutdown HANA is not starting. Getting a little more deeper in the logs I´ve found the error listet bellow, it happens in nameserver startup process. Any hint?

hana001:/usr/sap/HDB/SYS/exe/hdb> ./hdbnameserver

service startup...

accepting requests at 127.0.0.1:30201; 127.0.0.2:30201

searching for master nameserver hana001:30201 ...

assign as master nameserver. assign to volume 1 started

checking for recovery request ...

loading topology ...

opening persistence ...

assign failed with persistence startup error. exception 1: no.3020046 (DataAccess/PageAccess/impl/PageImpl.cpp:399)

Wrong savepoint version: Expected 8373 but found 8361.

exception throw location:

1: 0x00007fb55f11bbbc in PageAccess::Page::verifyHeader(PageAccess::SizeClass, DataAccess::SavepointVersion const&, bool) const+0x3d8 at PageImpl.cpp:399 (libhdbdataaccess.so)

2: 0x00007fb55ee80da7 in DataAccess::SavepointImpl::loadRestartPage(PageAccess::PageIO&, unsigned long, bool, bool)+0x693 at Page.hpp:277 (libhdbdataaccess.so)

3: 0x00007fb55ee0205f in DataAccess::PersistenceManagerImpl::prepareOpen(unsigned long, bool, bool)+0x4b at PersistenceManagerImpl.cpp:5385 (libhdbdataaccess.so)

4: 0x00007fb55ee0223c in DataAccess::PersistenceManager::open(ltt::refcounted_handle<DataAccess::PersistenceConfiguration> const&, bool)+0x88 at PersistenceManagerImpl.cpp:2461 (libhdbdataaccess.so)

5: 0x00007fb56c510771 in PersistenceLayer::PersistenceSystem::initialize(NameServer::ServiceStartInfo const&, bool, PersistenceLayer::PERSISTENCE_MODE)+0x4a0 at PersistenceSystem.cpp:392 (libhdbpersistence.so)

6: 0x00007fb56c5437bc in PersistenceLayer::PersistenceFactory::initPersistence(PersistenceLayer::PERSISTENCE_MODE, ltt::releasable_handle<DataAccess::LoggerFactory>&, DataAccess::TransactionCallback*, NameServer::ServiceStartInfo&, ltt::refcounted_handle<TransactionManager::TransactionControlBlockFactory>&, bool)+0x158 at PersistenceFactory.cpp:394 (libhdbpersistence.so)

7: 0x00007fb56c43375a in PersistenceController::startup(PersistenceLayer::PERSISTENCE_MODE, NameServer::ServiceStartInfo*, bool, DataAccess::TablePreloadWriteCallback*, DataAccess::TablePreloadReadCallback*, Backup::RecoverCbc_Federation*)+0x1256 at PersistenceController.cpp:559 (libhdblogger.so)

8: 0x00007fb5751eedd5 in NameServer::Topology::initPersistence(NameServer::ServiceStartInfo&, bool, bool, TREX_ERROR::TRexError*, bool, ltt_adp::basic_string<char, ltt::char_traits<char>, ltt::integral_constant<bool, true> >, NameServer::ServiceStartInfo::RequestAction)+0x421 at Topology.cpp:285 (libhdbns.so)

9: 0x00007fb5752fb0c4 in NameServer::TREXNameServer::loadTopology(NameServer::LoadTopologyMode, NameServer::ServiceStartInfo&, Backup::Backup_ExtendedRecoveryInformation*)+0x5b0 at TREXNameServer.cpp:18396 (libhdbns.so)

10: 0x00007fb575305c23 in NameServer::TREXNameServer::assign(NameServer::ServiceStartInfo&)+0x2790 at TREXNameServer.cpp:2123 (libhdbns.so)

11: 0x00007fb5762d4f17 in TRexAPI::TREXIndexServer::assign(NameServer::ServiceStartInfo&, bool, TREX_ERROR::TRexError&)+0x93 at TREXIndexServer.cpp:992 (hdbnameserver)

12: 0x00007fb57630c129 in TRexAPI::AssignThread::run(void*)+0x35 at TREXIndexServer.cpp:543 (hdbnameserver)

13: 0x00007fb56cd03db5 in TrexThreads::PoolThread::run()+0x831 at PoolThread.cpp:389 (libhdbbasement.so)

14: 0x00007fb56cd056f0 in TrexThreads::PoolThread::run(void*&)+0x10 at PoolThread.cpp:165 (libhdbbasement.so)

15: 0x00007fb55471a9f0 in Execution::Thread::staticMainImp(void**)+0x700 at Thread.cpp:461 (libhdbbasis.so)

16: 0x00007fb55471bfc8 in Execution::Thread::staticMain(void*)+0x34 at ThreadMain.cpp:26 (libhdbbasis.so)

stopping service...

persistence initialization failed -> stopping instance ...

cannot send signal. (2, No such file or directory)

prepare for shutting service down...

stop ClockMonitor thread...

stop MasterTokenLockWriter /usr/sap/HDB/SYS/global//hdb/nameserver.lck thread...

setInactive(nameserver@hana001:30201)

hana001:/usr/sap/HDB/SYS/exe/hdb>

Also bellow HDB start log:

hana001:/usr/sap/HDB/HDB02> ./HDB start

StartService

Impromptu CCC initialization by 'rscpCInit'.

See SAP note 1266393.

OK

Starting instance using: /usr/sap/HDB/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 02 -function StartWait 2700 2

08.06.2016 03:44:24

Start

OK

08.06.2016 03:44:46

StartWait

FAIL: process hdbdaemon HDB Daemon not running

Final part from Daemon trace:

[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545057 i Daemon	TrexDaemon.cpp(03513) : stdout = /usr/sap/HDB/HDB02/hana001/trace/xsuaaserver.out
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545062 i Daemon	TrexDaemon.cpp(03513) : stderr =
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545066 i Daemon	TrexDaemon.cpp(03513) : maxstdfiles = 2
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545071 i Daemon	TrexDaemon.cpp(03513) : runlevel = 4
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545075 i Daemon	TrexDaemon.cpp(03513) : flags =
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545080 i Daemon	TrexDaemon.cpp(03513) : window =
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545084 i Daemon	TrexDaemon.cpp(03513) : isolation = 0
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545088 i Daemon	TrexDaemon.cpp(03513) : userId = 4294967295
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545093 i Daemon	TrexDaemon.cpp(03513) : groupId = 4294967295
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545104 i Daemon	Daemon.cpp(00666) : runlevel 0 completely started
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545115 i Daemon	Program.cpp(00173) : line up of program group mdcdispatcher to instances <none>.
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545126 i Daemon	Program.cpp(00173) : line up of program group nameserver to instances 0.
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545460 i Daemon	TrexDaemon.cpp(02078) : starting program at '/usr/sap/HDB/HDB02/exe/hdbnameserver' with args ''
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.569590 i Daemon	RunningInstance.cpp(00191) : start 'hdbnameserver' as process 17518
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.313937 i Daemon	SignalsUNIX.cpp(00647) : signo 3 SIGQUIT from user errno 0 code 0
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314034 i Daemon	SignalsUNIX.cpp(00647) : sender pid 17518 real user id 1000 executable '/hana/shared/HDB/exe/linuxx86_64/HDB_1.00.112.02.1459851171_2840326/hdbnameserver'
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314167 i Daemon	TrexDaemon.cpp(03558) : got shutdown event (stop)
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314228 i Daemon	Daemon.cpp(00752) : comment file contains: nameserver: persistence initialization failed
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314836 i Daemon	RunningInstance.cpp(00214) : stop process hdbnameserver with pid 17518
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.315143 i Daemon	TrexDaemon.cpp(02598) : stopped child with pid 17518 (17518)
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401872 i Daemon	TrexDaemon.cpp(03778) : process hdbnameserver with pid 17518 exited normally with status 1
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401939 i Daemon	TrexDaemon.cpp(03850) : all instances in runlevel 1 stopped
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401952 i Daemon	Program.cpp(00173) : line up of program group dummyserver to instances <none>.
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.402492 i Daemon	ProgramStarter.cpp(00229) : writing started programs file /usr/sap/HDB/HDB02/hana001/lock/started_programs.txt:
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.528876 i Daemon	TrexDaemon.cpp(01616) : cleaning all
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.528931 i Daemon	TrexDaemon.cpp(02681) : found 8 segments in use with maximum id 8
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529017 i Daemon	TrexDaemon.cpp(02705) : segment 4, key 0x9f08cf81, id 3506180, sequence 107, owner 1000/1001, size 117, atime 2016-06-08 03:44:32, dtime 2016-06-08 03:44:35, ctime 2016-06-08 03:18:33, created by pid 17077, last changed by pid 17077, nattach 0
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529056 i Daemon	TrexDaemon.cpp(02705) : segment 5, key 0x00000000, id 3538949, sequence 108, owner 1000/1001, size 1060911, atime 2016-06-08 03:44:35, dtime 2016-06-08 03:44:35, ctime 2016-06-08 03:18:33, created by pid 17077, last changed by pid 17077, nattach 1 (deleted)
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529069 i Daemon	TrexDaemon.cpp(02705) : segment 6, key 0x00000000, id 3571718, sequence 109, owner 1000/113, size 1024, atime 2016-06-08 03:44:26, dtime 2016-06-08 03:44:26, ctime 2016-06-08 03:44:25, created by pid 17494, last changed by pid 17494, nattach 1
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529157 i Daemon	Main.cpp(01530) : exit hdbdaemon
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529171 i Daemon	Main.cpp(01541) : Success: /sapmnt/ld7272/a/HDB/jenkins_prod/workspace/FA_CO_LIN64GCC48HAPPY_rel_fa~newdb100_rel/sys/src/TrexDaemon/Main.cpp:1541 in "int main(int, char**)". Leaving the "main"
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.530121 i Network	NetworkListener.cpp(00805) : closing listen socket 7 bound to 127.0.0.1
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.530283 i Network	NetworkListener.cpp(00805) : closing listen socket 8 bound to 127.0.0.2

n_vorstermans2 · ‎06-08-2016

Hi,

Quickly looking at the log I experiences a similar issue on an IBM blade before, where after a HANA crash it couldn't find the right save point

Please have a look at this note: 2203059 - HANA persistence corruption because of GPFS rapid repair feature

Solution

Upgrade to at least GPFS 4.1.1-1 (GPFS 4.1.1-2 or later recommended). Please see SAP Note 1880960 for current recommended minimum GPFS versions
Restore HANA DB to the last possible point in time if you are already in the situation with the error.

I updated the GPFS which resolved the issue. As I cannot comment on what hardware you are running this, you would either need to check the GPFS or worst case do a system restore to the last point in time or full backup.

Regards,

Nick

former_member183326 · ‎06-10-2016

Do you know if the datavolume was being exchanged by an older datavolume? You should see something similair to

[15067]{-1}[-1/-1] 2016-05-19 13:58:22.353111 i Logger SavepointImpl.cpp(03893) : AnchorPage with SPV=37047 loaded. Persistence created at <DATA -- TIME>.

I would also check the var/log/messages for any suspicious activity.

What is your file system, is it XFS? If so there is a known issue with this which is detailed in SAP Note 2246163

n_vorstermans2 · ‎06-09-2016

Have you first checked if any PID's are still running? Looking at the screenshots I get the idea your system still has processes running in the background (especially on the last message). As this is not Production or is it live, you can run "HDB kill" on the OS to make sure everything is down before trying again.

The lock files was based on your earlier screenshot where it had an ID mismatch. They are generally found in /sapmnt/<SID>/HDB<Instance#>/<host>/lock

hdbadm 25990 25965 0.0 4944 616	\_ ps fx -U hdbadm -o user,pid,ppid,pcpu,vsz,rss,args
hdbadm	3155	1 0.0 171056 22000 /usr/sap/HDB/HDB02/exe/sapstartsrv pf=/usr/sap/HDB/SYS/profile/HDB_HDB02_poahnehp -D -u hdbadm

Error starting database / nameserver

Accepted Solutions (1)

Accepted Solutions (1)

Answers (2)

Answers (2)

RAP Node Extensibility for Unmanaged BO

Re: fuzzy search with multiple name fields

Re: ABAP2XLSX in an old Abap version

Re: fuzzy search with multiple name fields

Cap list output from query not working