on 06-08-2016 7:55 AM
Hi gurus,
After unexpected server shutdown HANA is not starting. Getting a little more deeper in the logs I´ve found the error listet bellow, it happens in nameserver startup process. Any hint?
hana001:/usr/sap/HDB/SYS/exe/hdb> ./hdbnameserver
service startup...
accepting requests at 127.0.0.1:30201; 127.0.0.2:30201
searching for master nameserver hana001:30201 ...
assign as master nameserver. assign to volume 1 started
checking for recovery request ...
loading topology ...
opening persistence ...
assign failed with persistence startup error. exception 1: no.3020046 (DataAccess/PageAccess/impl/PageImpl.cpp:399)
Wrong savepoint version: Expected 8373 but found 8361.
exception throw location:
1: 0x00007fb55f11bbbc in PageAccess::Page::verifyHeader(PageAccess::SizeClass, DataAccess::SavepointVersion const&, bool) const+0x3d8 at PageImpl.cpp:399 (libhdbdataaccess.so)
2: 0x00007fb55ee80da7 in DataAccess::SavepointImpl::loadRestartPage(PageAccess::PageIO&, unsigned long, bool, bool)+0x693 at Page.hpp:277 (libhdbdataaccess.so)
3: 0x00007fb55ee0205f in DataAccess::PersistenceManagerImpl::prepareOpen(unsigned long, bool, bool)+0x4b at PersistenceManagerImpl.cpp:5385 (libhdbdataaccess.so)
4: 0x00007fb55ee0223c in DataAccess::PersistenceManager::open(ltt::refcounted_handle<DataAccess::PersistenceConfiguration> const&, bool)+0x88 at PersistenceManagerImpl.cpp:2461 (libhdbdataaccess.so)
5: 0x00007fb56c510771 in PersistenceLayer::PersistenceSystem::initialize(NameServer::ServiceStartInfo const&, bool, PersistenceLayer::PERSISTENCE_MODE)+0x4a0 at PersistenceSystem.cpp:392 (libhdbpersistence.so)
6: 0x00007fb56c5437bc in PersistenceLayer::PersistenceFactory::initPersistence(PersistenceLayer::PERSISTENCE_MODE, ltt::releasable_handle<DataAccess::LoggerFactory>&, DataAccess::TransactionCallback*, NameServer::ServiceStartInfo&, ltt::refcounted_handle<TransactionManager::TransactionControlBlockFactory>&, bool)+0x158 at PersistenceFactory.cpp:394 (libhdbpersistence.so)
7: 0x00007fb56c43375a in PersistenceController::startup(PersistenceLayer::PERSISTENCE_MODE, NameServer::ServiceStartInfo*, bool, DataAccess::TablePreloadWriteCallback*, DataAccess::TablePreloadReadCallback*, Backup::RecoverCbc_Federation*)+0x1256 at PersistenceController.cpp:559 (libhdblogger.so)
8: 0x00007fb5751eedd5 in NameServer::Topology::initPersistence(NameServer::ServiceStartInfo&, bool, bool, TREX_ERROR::TRexError*, bool, ltt_adp::basic_string<char, ltt::char_traits<char>, ltt::integral_constant<bool, true> >, NameServer::ServiceStartInfo::RequestAction)+0x421 at Topology.cpp:285 (libhdbns.so)
9: 0x00007fb5752fb0c4 in NameServer::TREXNameServer::loadTopology(NameServer::LoadTopologyMode, NameServer::ServiceStartInfo&, Backup::Backup_ExtendedRecoveryInformation*)+0x5b0 at TREXNameServer.cpp:18396 (libhdbns.so)
10: 0x00007fb575305c23 in NameServer::TREXNameServer::assign(NameServer::ServiceStartInfo&)+0x2790 at TREXNameServer.cpp:2123 (libhdbns.so)
11: 0x00007fb5762d4f17 in TRexAPI::TREXIndexServer::assign(NameServer::ServiceStartInfo&, bool, TREX_ERROR::TRexError&)+0x93 at TREXIndexServer.cpp:992 (hdbnameserver)
12: 0x00007fb57630c129 in TRexAPI::AssignThread::run(void*)+0x35 at TREXIndexServer.cpp:543 (hdbnameserver)
13: 0x00007fb56cd03db5 in TrexThreads::PoolThread::run()+0x831 at PoolThread.cpp:389 (libhdbbasement.so)
14: 0x00007fb56cd056f0 in TrexThreads::PoolThread::run(void*&)+0x10 at PoolThread.cpp:165 (libhdbbasement.so)
15: 0x00007fb55471a9f0 in Execution::Thread::staticMainImp(void**)+0x700 at Thread.cpp:461 (libhdbbasis.so)
16: 0x00007fb55471bfc8 in Execution::Thread::staticMain(void*)+0x34 at ThreadMain.cpp:26 (libhdbbasis.so)
stopping service...
persistence initialization failed -> stopping instance ...
cannot send signal. (2, No such file or directory)
prepare for shutting service down...
stop ClockMonitor thread...
stop MasterTokenLockWriter /usr/sap/HDB/SYS/global//hdb/nameserver.lck thread...
setInactive(nameserver@hana001:30201)
hana001:/usr/sap/HDB/SYS/exe/hdb>
Also bellow HDB start log:
hana001:/usr/sap/HDB/HDB02> ./HDB start
StartService
Impromptu CCC initialization by 'rscpCInit'.
See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/HDB/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 02 -function StartWait 2700 2
08.06.2016 03:44:24
Start
OK
08.06.2016 03:44:46
StartWait
FAIL: process hdbdaemon HDB Daemon not running
Final part from Daemon trace:
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545057 i Daemon | TrexDaemon.cpp(03513) : stdout = /usr/sap/HDB/HDB02/hana001/trace/xsuaaserver.out |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545062 i Daemon | TrexDaemon.cpp(03513) : stderr = |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545066 i Daemon | TrexDaemon.cpp(03513) : maxstdfiles = 2 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545071 i Daemon | TrexDaemon.cpp(03513) : runlevel = 4 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545075 i Daemon | TrexDaemon.cpp(03513) : flags = |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545080 i Daemon | TrexDaemon.cpp(03513) : window = |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545084 i Daemon | TrexDaemon.cpp(03513) : isolation = 0 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545088 i Daemon | TrexDaemon.cpp(03513) : userId = 4294967295 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545093 i Daemon | TrexDaemon.cpp(03513) : groupId = 4294967295 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545104 i Daemon | Daemon.cpp(00666) : runlevel 0 completely started |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545115 i Daemon | Program.cpp(00173) : line up of program group mdcdispatcher to instances <none>. |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545126 i Daemon | Program.cpp(00173) : line up of program group nameserver to instances 0. |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.545460 i Daemon | TrexDaemon.cpp(02078) : starting program at '/usr/sap/HDB/HDB02/exe/hdbnameserver' with args '' |
[17502]{-1}[-1/-1] 2016-06-08 03:44:28.569590 i Daemon | RunningInstance.cpp(00191) : start 'hdbnameserver' as process 17518 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.313937 i Daemon | SignalsUNIX.cpp(00647) : signo 3 SIGQUIT from user errno 0 code 0 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314034 i Daemon | SignalsUNIX.cpp(00647) : sender pid 17518 real user id 1000 executable '/hana/shared/HDB/exe/linuxx86_64/HDB_1.00.112.02.1459851171_2840326/hdbnameserver' |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314167 i Daemon | TrexDaemon.cpp(03558) : got shutdown event (stop) |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314228 i Daemon | Daemon.cpp(00752) : comment file contains: nameserver: persistence initialization failed |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.314836 i Daemon | RunningInstance.cpp(00214) : stop process hdbnameserver with pid 17518 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.315143 i Daemon | TrexDaemon.cpp(02598) : stopped child with pid 17518 (17518) |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401872 i Daemon | TrexDaemon.cpp(03778) : process hdbnameserver with pid 17518 exited normally with status 1 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401939 i Daemon | TrexDaemon.cpp(03850) : all instances in runlevel 1 stopped |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.401952 i Daemon | Program.cpp(00173) : line up of program group dummyserver to instances <none>. |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.402492 i Daemon | ProgramStarter.cpp(00229) : writing started programs file /usr/sap/HDB/HDB02/hana001/lock/started_programs.txt: |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.528876 i Daemon | TrexDaemon.cpp(01616) : cleaning all |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.528931 i Daemon | TrexDaemon.cpp(02681) : found 8 segments in use with maximum id 8 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529017 i Daemon | TrexDaemon.cpp(02705) : segment 4, key 0x9f08cf81, id 3506180, sequence 107, owner 1000/1001, size 117, atime 2016-06-08 03:44:32, dtime 2016-06-08 03:44:35, ctime 2016-06-08 03:18:33, created by pid 17077, last changed by pid 17077, nattach 0 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529056 i Daemon | TrexDaemon.cpp(02705) : segment 5, key 0x00000000, id 3538949, sequence 108, owner 1000/1001, size 1060911, atime 2016-06-08 03:44:35, dtime 2016-06-08 03:44:35, ctime 2016-06-08 03:18:33, created by pid 17077, last changed by pid 17077, nattach 1 (deleted) |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529069 i Daemon | TrexDaemon.cpp(02705) : segment 6, key 0x00000000, id 3571718, sequence 109, owner 1000/113, size 1024, atime 2016-06-08 03:44:26, dtime 2016-06-08 03:44:26, ctime 2016-06-08 03:44:25, created by pid 17494, last changed by pid 17494, nattach 1 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529157 i Daemon | Main.cpp(01530) : exit hdbdaemon |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.529171 i Daemon | Main.cpp(01541) : Success: /sapmnt/ld7272/a/HDB/jenkins_prod/workspace/FA_CO_LIN64GCC48HAPPY_rel_fa~newdb100_rel/sys/src/TrexDaemon/Main.cpp:1541 in "int main(int, char**)". Leaving the "main" |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.530121 i Network | NetworkListener.cpp(00805) : closing listen socket 7 bound to 127.0.0.1 |
[17502]{-1}[-1/-1] 2016-06-08 03:44:35.530283 i Network | NetworkListener.cpp(00805) : closing listen socket 8 bound to 127.0.0.2 |
Hi,
Quickly looking at the log I experiences a similar issue on an IBM blade before, where after a HANA crash it couldn't find the right save point
Please have a look at this note: 2203059 - HANA persistence corruption because of GPFS rapid repair feature
Solution
I updated the GPFS which resolved the issue. As I cannot comment on what hardware you are running this, you would either need to check the GPFS or worst case do a system restore to the last point in time or full backup.
Regards,
Nick
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Nick,
Thanks for your reply. I´ve already saw the note, my hardware isn´t from IBM. This is happening in a sandbox server, I´ve tried to restore de database using HANA Studio, unfortunately during the restore process the nameserver is restarted/started and since this issue is happening when starting it I cannot proceed, hdbsql is not an option since the database is offline.
How am I supossed to restore the database in this situation?
Are you trying to recover a Point-in-Time? If so are you trying the last snapshot? You could attempt one snapshot before if possible.
In case that doesn't work either:
On OS level could you do under <SID>adm a HDB stop and then check HDB info if anything is still running for the nameserver? Also check whether the lock files are released. If the lock files are not released, please delete them manually.
In case this still doesn't work, you can also backup and delete the nameserver.ini to ensure it uses the system default found under /sapmnt/<SID>/global/hdb/custom/config
Worst case you could restore your latest full backup.
Regards,
Nick
Yes, Poin-in-Time. No snapshot is displayed for selection on screen, I supossed this was due to nameserver error. Isn´t it?
I also tried to restore using full backup, unfortunately after starting the restore process the nameserver is restarted/restarted automatically and since it doesn´t start restore doesn´t start.
Bellow HDB info after HDB stop:
USER | PID PPID %CPU | VSZ RSS COMMAND |
hdbadm 25824 25823 0.0 13884 2804 -sh
hdbadm 25965 25824 17.3 12900 1720 \_ /bin/sh /usr/sap/HDB/HDB02/HDB info
hdbadm 25990 25965 0.0 4944 616 | \_ ps fx -U hdbadm -o user,pid,ppid,pcpu,vsz,rss,args | |
hdbadm | 3155 | 1 0.0 171056 22000 /usr/sap/HDB/HDB02/exe/sapstartsrv pf=/usr/sap/HDB/SYS/profile/HDB_HDB02_poahnehp -D -u hdbadm |
Where can I find the mentioned lock files?
Thanks!
Do you know if the datavolume was being exchanged by an older datavolume? You should see something similair to
[15067]{-1}[-1/-1] 2016-05-19 13:58:22.353111 i Logger SavepointImpl.cpp(03893) : AnchorPage with SPV=37047 loaded. Persistence created at <DATA -- TIME>.
I would also check the var/log/messages for any suspicious activity.
What is your file system, is it XFS? If so there is a known issue with this which is detailed in SAP Note 2246163
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Have you first checked if any PID's are still running? Looking at the screenshots I get the idea your system still has processes running in the background (especially on the last message). As this is not Production or is it live, you can run "HDB kill" on the OS to make sure everything is down before trying again.
The lock files was based on your earlier screenshot where it had an ID mismatch. They are generally found in /sapmnt/<SID>/HDB<Instance#>/<host>/lock
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
87 | |
10 | |
10 | |
9 | |
7 | |
6 | |
6 | |
5 | |
4 | |
3 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.