on 12-10-2013 2:34 PM
Hello all
I have a problem with Oracle 11 on SUSE. We are trying to upgrade SUSE from 11.1 to 11.2. Our scenario is:
We install a new system (SLES for SAP Applications 11.2 (x86_64)) and copy all files from the old SAP system (SLES 10). All OS settings are copied from 11.1 to 11.2. We used this scenario on all other systems and it worked properly (on ERP, EP, XI dev & test), but not on this system (production XI on NW 7.0).
So, after copying all files from the original system to the new one we start Oracle, but Oracle does not work properly; after some time it crashes. I found that the problem occurs when the system tries to switch the logfile (on the original system it works correctly). See the steps to reproduce the crash:
SQL> startup
ORACLE instance started.
Total System Global Area 2472083456 bytes
Fixed Size 2230792 bytes
Variable Size 872416760 bytes
Database Buffers 1577058304 bytes
Redo Buffers 20377600 bytes
Database mounted.
Database opened.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
and now Oracle does nothing; the system can't finish this step. When I shut Oracle down in abort mode (it can't be stopped any other way), at the OS level there is still a running ora_lgwr process, which I can't kill, because this process is waiting for I/O (I think for a disk write).
I have a snapshot of this installation, and I tried different things (changing OS limits, changing the size/number of redo logs, changing the redo log destination, changing the disk block size), and the system always crashes.
In the alert log I can't find anything interesting, and the same at the OS level.
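As a quick OS-level check of that I/O suspicion, the scheduler state of the log writer can be inspected. This is only a sketch: the ora_lgwr pattern is the usual Oracle background process naming, and the PID falls back to the current shell so the snippet stays runnable anywhere.

```shell
# Print pid, state and kernel wait channel of the log writer.
# State "D" = uninterruptible (disk) sleep, which matches an I/O hang.
pid=$(pgrep -f ora_lgwr | head -n1)   # assumes default Oracle process naming
pid=${pid:-$$}                        # fallback keeps the sketch runnable
ps -o pid=,stat=,wchan= -p "$pid"
```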
Can anybody help me?
Best regards
Paweł
Hello all
On SUSE 11.2 we had the same problem (on physical and virtual machines). The parameter log_archive_start=true did not solve it. I think we have some problem with the Linux kernel, but why only on this one server I don't know.
We solved the problem by installing a newer version of SUSE, 11.3. Now everything works OK.
Thank you very much for your help
Best regards
Paweł
Hi,
I hope your problem will be resolved.
Please check the parameter
log_archive_start=true
and also try the queries below:
alter database clear unarchived logfile group 1;
alter database clear unarchived logfile group 1 unrecoverable datafile;
alter system switch logfile;
Thank you
Mahendran
Hi,
Please try to drop your database control file and recreate it. Then issue alter database open resetlogs and startup force.
Then check and let me know the status.
Thank you
Mahendran
I have done this several times and the status is the same. I recreated it using the scripts:
CREATE CONTROLFILE SET DATABASE "XIP" RESETLOGS ARCHIVELOG
....
and
alter database open resetlogs
The situation is the same.
I moved the control files and log files to another disk and changed the size of the log files. The status is the same.
Best regards
Paweł
Hi,
Could you please provide the full alert log?
With the full alert log we can suggest a solution.
Thank you
Mahendran
After reinstalling the system I found the error below in XIP_m000_11027.trc:
more XIP_m000_11027.trc
Trace file /oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace/XIP_m000_11027.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/XIP/112_64
System name: Linux
Node name: xi-ap2
Release: 3.0.13-0.27-default
Version: #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b)
Machine: x86_64
VM name: VMWare Version: 6
Instance name: XIP
Redo thread mounted by this instance: 1
Oracle process number: 23
Unix process pid: 11027, image: oracle@xi-ap2 (M000)
*** 2013-12-16 09:24:38.745
*** SESSION ID:(222.17) 2013-12-16 09:24:38.745
*** CLIENT ID:() 2013-12-16 09:24:38.745
*** SERVICE NAME:(SYS$BACKGROUND) 2013-12-16 09:24:38.745
*** MODULE NAME:(MMON_SLAVE) 2013-12-16 09:24:38.745
*** ACTION NAME:(Auto-Flush Slave Action) 2013-12-16 09:24:38.745
DDE rules only execution for: ORA 12751
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
Executing ASYNC actions
----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
Runtime exceeded 900 seconds
Time limit violation detected at:
ksedsts()+461<-kspol_12751_dump()+145<-dbgdaExecuteAction()+1065<-dbgerRunAction()+109<-dbgerRunActions()+4053<-dbgexPhaseII()+1873<-dbgexProcessError()+2680<-dbgeExecuteForError()+88<-dbgePostErrorKGE()+2143<-dbkePostKGE_kgsf()+71<-kgeselv()+276<-ksesecl0()+162<-ksucin()+224<-kcrfws()+1677<-kcbchg1_main()+12927<-kcbchg1()+1068<-ktuchg2()+1417<-ktbchg2()+281<-kdtchg()+1702<-kdtwrp()+6508<-kdtInsRow()+3093<-qerltcNoKdtBufferedInsRowCBK()+795<-qerltcFetch()+3707<-insexe()+1634<-opiexe()+5572<-kpoal8()+2224<-opiodr()+916<-kpoodrc()+36<-rpiswu2()+1618<-kpoodr()+622<-upirtrc()+2497<-kpurcsc()+98<-kpuexec()+10807<-OCIStmtExecute()+39<-kewrgwxf1_gwrsql_exft_1()+387<-kewrgwxf_gwrsql_exft()+1388<-kewrfos_flush_onesnap()+697<-kewrfsc_flush_snapshot_c()+923<-kewrafs_auto_flush_slave()+882<-kebm_slave_main()+580<-ksvrdp()+1779<-opirip()+674<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+230<-_start()+36
Current Wait Stack:
0: waiting for 'log file switch (checkpoint incomplete)'
=0x0, =0x0, =0x0
wait_id=4 seq_num=7 snap_id=2
wait times: snap=7 min 13 sec, exc=15 min 6 sec, total=15 min 6 sec
wait times: max=infinite, heur=15 min 6 sec
wait counts: calls=9021 os=9021
in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 95, ser: 1
Dumping final blocker:
inst: 1, sid: 95, ser: 1
There are 1 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 219, ser: 1
wait event: 'row cache lock'
p1: 'cache id'=0x16
p2: 'mode'=0x0
p3: 'request'=0x3
row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
min_blocked_time: 291 secs, waiter_cache_ver: 566
Wait State:
fixed_waits=0 flags=0x23 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.000000 sec since current wait
0: waited for 'latch: redo writing'
address=0x60022628, number=0xb9, tries=0x0
wait_id=5 seq_num=6 snap_id=1
wait times: snap=0.001148 sec, exc=0.001148 sec, total=0.001148 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
1: waited for 'log file switch (checkpoint incomplete)'
=0x0, =0x0, =0x0
wait_id=4 seq_num=5 snap_id=1
wait times: snap=7 min 52 sec, exc=7 min 52 sec, total=7 min 52 sec
wait times: max=infinite
wait counts: calls=4712 os=4712
occurred after 0.000371 sec of elapsed time
2: waited for 'db file sequential read'
file#=0x3, block#=0x1141, blocks=0x1
wait_id=3 seq_num=4 snap_id=1
wait times: snap=0.016183 sec, exc=0.016183 sec, total=0.016183 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000161 sec of elapsed time
3: waited for 'db file sequential read'
file#=0x3, block#=0x1142, blocks=0x1
wait_id=2 seq_num=3 snap_id=1
wait times: snap=0.015223 sec, exc=0.015223 sec, total=0.015223 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000019 sec of elapsed time
4: waited for 'Disk file operations I/O'
FileOperation=0x2, fileno=0x3, filetype=0x2
wait_id=1 seq_num=2 snap_id=1
wait times: snap=0.020978 sec, exc=0.020978 sec, total=0.020978 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.004954 sec of elapsed time
5: waited for 'class slave wait'
slave id=0xffffffffa52ff548, =0x0, =0x0
wait_id=0 seq_num=1 snap_id=1
wait times: snap=0.000087 sec, exc=0.000087 sec, total=0.000087 sec
wait times: max=infinite
wait counts: calls=1 os=1
occurred after 0.000000 sec of elapsed time
Sampled Session History of session 222 serial 17
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[121 samples, 09:22:38 - 09:24:38]
waited for 'log file switch (checkpoint incomplete)', seq_num: 7
p1: ''=0x0
p2: ''=0x0
p3: ''=0x0
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'log file switch (checkpoint incomplete)'
[121 samples, 09:22:38 - 09:24:38]
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
----- END DDE Action: 'ORA_12751_DUMP' (SUCCESS, 10 csec) -----
----- END DDE Actions Dump (total 10 csec) -----
*** KEWROCISTMTEXEC - encountered error: (ORA-12751: cpu time or run time policy violation)
*** SQLSTR: total-len=342, dump-len=240,
STR={insert into WRM$_SNAPSHOT (snap_id, dbid, instance_number, startup_time, begin_interval_time, end_interval_time, snap_level, status, error_count, bl_moved, snap_flag, snap_timezone)
values (:snap_id, :dbid, :instance_number, :sta}
DDE: Problem Key 'ORA 12751' was flood controlled (0x2) (no incident)
ORA-12751: cpu time or run time policy violation
*** KEWRAFS: Error=13509 encountered by Auto Flush Slave.
KEBM: MMON slave action policy violation. kewrmafsa_; viol=1; err=13509
Dear Friend,
Could you please share your alert log, SAP syslogs and OS event viewer logs?
Also, please share your pfile so we can analyse the problem.
Also, is your OS Windows or Unix?
Thank you.
Mahendran
See the alert.log:
more alert_XIP.log
Mon Dec 16 08:56:12 2013
Starting ORACLE instance (normal)
****************** Large Pages Information *****************
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
RECOMMENDATION:
Total Shared Global Region size is 2370 MB. For optimal performance,
prior to the next instance restart increase the number
of unused Large Pages by atleast 1185 2048 KB Large Pages (2370 MB)
system wide to get 100% of the Shared
Global Region allocated with Large pages
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =27
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /oracle/XIP/112_64
System name: Linux
Node name: xi-ap2
Release: 3.0.13-0.27-default
Version: #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b)
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side spfile /oracle/XIP/112_64/dbs/spfileXIP.ora
System parameters with non-default values:
processes = 150
sessions = 248
event = "10027"
event = "10028"
event = "10142"
event = "10183"
event = "10191"
event = "10443"
event = "10995 level 2"
event = "31991"
event = "38068 level 100"
event = "38085"
event = "38087"
event = "44951 level 1024"
shared_pool_size = 800M
filesystemio_options = "setall"
control_files = "/oracle/XIP/origlogA/cntrl/cntlrXIP.dbf"
control_files = "/oracle/XIP/origlogB/cntrl/cntrlXIP.dbf"
control_files = "/oracle/XIP/sapdata1/cntrl/cntrlXIP.dbf"
control_file_record_keep_time= 30
db_block_size = 8192
db_cache_size = 1504M
compatible = "11.2.0"
log_archive_dest_1 = "location=/oracle/XIP/oraarch/XIParch"
log_archive_format = "%t_%s_%r.dbf"
log_buffer = 14245888
log_checkpoint_interval = 0
db_files = 254
log_checkpoints_to_alert = TRUE
dml_locks = 4000
undo_tablespace = "PSAPUNDO"
recyclebin = "off"
_mutex_wait_time = 10
_mutex_wait_scheme = 1
remote_os_authent = TRUE
remote_login_passwordfile= "EXCLUSIVE"
local_listener = "LISTENER_XIP"
job_queue_processes = 1
parallel_max_servers = 40
parallel_execution_message_size= 16384
_fix_control = "5099019:ON"
_fix_control = "5705630:ON"
_fix_control = "6055658:OFF"
_fix_control = "6120483:OFF"
_fix_control = "6399597:ON"
_fix_control = "6430500:ON"
_fix_control = "6440977:ON"
_fix_control = "6626018:ON"
_fix_control = "6972291:ON"
_fix_control = "7168184:OFF"
_fix_control = "8937971:ON"
_fix_control = "9196440:ON"
_fix_control = "9495669:ON"
_fix_control = "13077335:ON"
_fix_control = "13627489:ON"
_fix_control = "14255600:ON"
_fix_control = "14595273:ON"
audit_file_dest = "/oracle/XIP/saptrace/audit"
db_name = "XIP"
open_cursors = 800
star_transformation_enabled= "TRUE"
_optimizer_extended_cursor_sharing_rel= "NONE"
_optimizer_adaptive_cursor_sharing= FALSE
parallel_threads_per_cpu = 1
query_rewrite_enabled = "FALSE"
_optim_peek_user_binds = FALSE
pga_aggregate_target = 800M
optimizer_dynamic_sampling= 6
_optimizer_use_feedback = FALSE
diagnostic_dest = "/oracle/XIP/saptrace"
max_dump_file_size = "20000"
Deprecated system parameters with specified values:
remote_os_authent
End of deprecated system parameter listing
Mon Dec 16 08:56:14 2013
PMON started with pid=2, OS id=10356
Mon Dec 16 08:56:14 2013
PSP0 started with pid=3, OS id=10361
Mon Dec 16 08:56:16 2013
VKTM started with pid=4, OS id=10366 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Dec 16 08:56:16 2013
GEN0 started with pid=5, OS id=10373
Mon Dec 16 08:56:16 2013
DIAG started with pid=6, OS id=10378
Mon Dec 16 08:56:16 2013
DBRM started with pid=7, OS id=10383
Mon Dec 16 08:56:16 2013
DIA0 started with pid=8, OS id=10388
Mon Dec 16 08:56:16 2013
MMAN started with pid=9, OS id=10393
Mon Dec 16 08:56:16 2013
DBW0 started with pid=10, OS id=10398
Mon Dec 16 08:56:16 2013
LGWR started with pid=11, OS id=10403
Mon Dec 16 08:56:16 2013
CKPT started with pid=12, OS id=10408
Mon Dec 16 08:56:16 2013
SMON started with pid=13, OS id=10413
Mon Dec 16 08:56:16 2013
RECO started with pid=14, OS id=10418
Mon Dec 16 08:56:16 2013
MMON started with pid=15, OS id=10423
Mon Dec 16 08:56:16 2013
MMNL started with pid=16, OS id=10428
ORACLE_BASE from environment = /oracle
Mon Dec 16 08:56:16 2013
ALTER DATABASE MOUNT
Successful mount of redo thread 1, with mount id 29035568
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Mon Dec 16 08:56:21 2013
ALTER DATABASE OPEN
LGWR: STARTING ARCH PROCESSES
Mon Dec 16 08:56:23 2013
ARC0 started with pid=18, OS id=10451
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Mon Dec 16 08:56:24 2013
ARC1 started with pid=19, OS id=10456
Mon Dec 16 08:56:24 2013
ARC2 started with pid=20, OS id=10461
Mon Dec 16 08:56:24 2013
ARC3 started with pid=21, OS id=10466
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
Thread 1 opened at log sequence 224843
Current log# 2 seq# 224843 mem# 0: /oracle/XIP/origlogB/log_g12m1.dbf
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Incremental checkpoint up to RBA [0x36e4b.18f.0], current log tail at RBA [0x36e4b.18f.0]
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
[10446] Successfully onlined Undo Tablespace 1.
Undo initialization finished serial:0 start:235693726 end:235694266 diff:540 (5 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is UTF8
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Mon Dec 16 08:56:27 2013
QMNC started with pid=22, OS id=10471
Completed: ALTER DATABASE OPEN
Mon Dec 16 08:56:32 2013
Starting background process CJQ0
Mon Dec 16 08:56:32 2013
CJQ0 started with pid=28, OS id=10507
Mon Dec 16 08:56:45 2013
System State dumped to trace file /oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace/XIP_ora_10446.trc
Mon Dec 16 08:57:34 2013
Beginning log switch checkpoint up to RBA [0x36e4c.2.10], SCN: 1956998197
Thread 1 advanced to log sequence 224844 (LGWR switch)
Current log# 4 seq# 224844 mem# 0: /oracle/XIP/origlogB/log_g14m1.dbf
Mon Dec 16 08:57:35 2013
Archived Log entry 224395 added for thread 1 sequence 224843 ID 0xf731cd0c dest 1:
Beginning log switch checkpoint up to RBA [0x36e4d.2.10], SCN: 1956998200
Thread 1 advanced to log sequence 224845 (LGWR switch)
Current log# 1 seq# 224845 mem# 0: /oracle/XIP/origlogA/log_g11m1.dbf
Mon Dec 16 08:57:36 2013
Archived Log entry 224396 added for thread 1 sequence 224844 ID 0xf731cd0c dest 1:
Beginning log switch checkpoint up to RBA [0x36e4e.2.10], SCN: 1956998203
Thread 1 advanced to log sequence 224846 (LGWR switch)
Current log# 3 seq# 224846 mem# 0: /oracle/XIP/origlogA/log_g13m1.dbf
Mon Dec 16 08:57:38 2013
Archived Log entry 224397 added for thread 1 sequence 224845 ID 0xf731cd0c dest 1:
Thread 1 cannot allocate new log, sequence 224847
Checkpoint not complete
Current log# 3 seq# 224846 mem# 0: /oracle/XIP/origlogA/log_g13m1.dbf
Mon Dec 16 08:57:42 2013
Completed checkpoint up to RBA [0x36e4c.2.10], SCN: 1956998197
xi-ap2:/oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace #
Hello
This is the full alert.log.
See below all the steps I performed:
SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 2472083456 bytes
Fixed Size 2230792 bytes
Variable Size 872416760 bytes
Database Buffers 1577058304 bytes
Redo Buffers 20377600 bytes
Database mounted.
Database opened.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL>
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
alter system switch logfile
and now the system is waiting.
Best regards
Paweł
Hi Pawel,
> So, after copy all file from orginal system to new one we start oracle, but oracle not working properly, after some time, its crash.
Does it really crash (core dump / seg fault) or does it just hang? In the latter case - are you able to logon anymore or does the logon also hang?
Depending on the exact scenario you can perform (several) systemstate dumps and crosscheck the reason for that "hanging" issue. I have written a blog post about creating and using systemstate dumps. You can use the prelim option if you are not able to log on anymore.
Or you can use strace / pstack for further troubleshooting. Please check out my other blog post about that:
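For completeness, the prelim variant looks roughly like this (a sketch of the usual sqlplus/oradebug sequence, not taken from the blog posts above; level 266 = 10 + 256 also captures the short call stacks):

```
sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
```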
Regards
Stefan
Thanks Stefan for your answer.
The system/Oracle does not dump; it just keeps running in the state "alter system switch logfile". There is nothing in the alert log. When I strace the lgwr process in this situation, the console looks like it is under control but I can't do anything; from another console I can only kill that session.
The lgwr process looks like it is connected to some device (ps auxwww | grep lgwr shows status D+), and I can't kill it until the system disconnects the process, but that never happens.
I tried to check the lgwr process as described in:
but when the system starts to switch the log, it can't finish the scripts.
When I reinstall the system I'll check "library cache lock"; maybe this is a good way to solve my problem.
Best regards
Paweł
Hi Pawel,
so your system just hangs.
LGWR may not show anything (via strace) in this situation, as it is also blocked.
I don't get the full picture of your issue right now (whether or not you are able to log on with SQL*Plus in another session when this hanging issue occurs), but I provided you two links on how to create a systemstate dump in both situations.
The library cache lock issue that I described in my blog has nothing to do with your issue, but that blog post includes a guide on how to create a systemstate dump with the different kinds of information.
The systemstate dump shows us the blocking scenario, and afterwards you can do further analysis based on that result.
Regards
Stefan
Hello
I don't know what pstack is. How can I create these logs?
See:
ID 10403 is the pid of the lgwr process:
xi-ap2:/proc/10403 # more status
Name: oracle
State: D (disk sleep)
Tgid: 10403
Pid: 10403
PPid: 1
TracerPid: 0
Uid: 1018 1018 1018 1018
Gid: 1002 1002 1002 1002
FDSize: 512
Groups: 1000 1002 1003
VmPeak: 2687152 kB
VmSize: 2687152 kB
VmLck: 0 kB
VmHWM: 36344 kB
VmRSS: 36344 kB
VmData: 3820 kB
VmStk: 136 kB
VmExe: 186556 kB
VmLib: 14944 kB
VmPTE: 540 kB
VmSwap: 0 kB
Threads: 1
SigQ: 0/111382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000006004207
SigCgt: 00000001c9803cf8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: ff
Cpus_allowed_list: 0-7
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 1719
nonvoluntary_ctxt_switches: 96
Best regards
Paweł
Hi Pawel,
well your issue (not the root cause) seems to be pretty obvious.
PROCESS 11: LGWR
…
SO: 0xf0612868, type: 10, owner: 0xf0861770, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0xf34f2580, name=FileOpenBlock, file=ksfd.h LINE:6448, pg=0
(FOB) 0xf0612868 flags=264202 fib=0xf0861770 incno=0 pending i/o cnt=1
fname=/oracle/XIP/origlogA/log_g13m1.dbf
fno=0 lblksz=512 fsiz=102400
…
SO: 0xf221aea8, type: 14, owner: 0xf34f2580, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0xf34f2580, name=channel handle, file=ksr2.h LINE:367, pg=0
(broadcast handle) 0xf221aea8 flag: (2) ACTIVE SUBSCRIBER,
owner: 0xf34f2580 - ospid: 10403
event: 7, last message event: 14,
last message waited event: 14,
next message: (nil)(0), messages read: 1
channel: (0xf2235600) scumnt mount lock [name: 157]
scope: 1, event: 19, last mesage event: 14,
publishers/subscribers: 0/17,
messages published: 1
heuristic msg queue length: 0
…
SO: 0xf370ecb0, type: 4, owner: 0xf34f2580, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0xf34f2580, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 95 ser: 1 trans: (nil), creator: 0xf34f2580
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'log file parallel write'
files=0x1, blocks=0x6f, requests=0x1
wait_id=1592 seq_num=1593 snap_id=1
wait times: snap=6 min 44 sec, exc=6 min 44 sec, total=6 min 44 sec
wait times: max=infinite, heur=6 min 44 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0
There are 3 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 33, ser: 3
wait event: 'switch logfile command'
p1: ''=0x0
p2: ''=0x0
p3: ''=0x0
row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
min_blocked_time: 403 secs, waiter_cache_ver: 163
Unfortunately you have not included the short stack dumps (adding level 256) in that trace file, but it should be obvious that writing to a redo log file should not take 6 min 44 sec (and still waiting).
What is your I/O storage subsystem? Do you use NFS? What kind of file system (and mount options) do you use? Please include the short stack traces as well, or just execute pstack on LGWR when this issue occurs.
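When pstack or gdb is not at hand, a crude substitute is to sample the kernel wait channel of the process a few times; if it never changes, the process is pinned in one kernel function. A sketch only: the ora_lgwr pattern is an assumption about the process name, and the PID falls back to the current shell so the snippet stays runnable.

```shell
# Take three samples of the kernel function the process is sleeping in.
pid=$(pgrep -f ora_lgwr | head -n1)   # assumed Oracle background process name
pid=${pid:-$$}                        # fallback keeps the sketch self-contained
for i in 1 2 3; do
  echo "sample $i: wchan=$(cat /proc/$pid/wchan 2>/dev/null)"
  sleep 1
done
```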
Regards
Stefan
Stack file for lgwr:
[<ffffffffa04e9ee5>] ocfs2_file_aio_write+0x3a5/0x960 [ocfs2]
[<ffffffff81191ba4>] aio_rw_vect_retry+0x74/0x1d0
[<ffffffff81193962>] aio_run_iocb+0x82/0x170
[<ffffffff81194370>] io_submit_one+0x180/0x260
[<ffffffff811945dd>] do_io_submit+0x18d/0x250
[<ffffffff81449692>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Best regards
Paweł
Hi Pawel,
this is not pstack / gdb output. It does not include the application call stack. Please check my blog post about this once again.
In addition we need several stack samples so that we can verify whether it hangs at a specific function or OS call. FYI, pstack is just a wrapper script for gdb and looks like this (e.g. on OEL):
#!/bin/sh
if test $# -ne 1; then
echo "Usage: `basename $0 .sh` <process-id>" 1>&2
exit 1
fi
if test ! -r /proc/$1; then
echo "Process $1 not found." 1>&2
exit 1
fi
# GDB doesn't allow "thread apply all bt" when the process isn't
# threaded; need to peek at the process to determine if that or the
# simpler "bt" should be used.
backtrace="bt"
if test -d /proc/$1/task ; then
# Newer kernel; has a task/ directory.
if test `/bin/ls /proc/$1/task | /usr/bin/wc -l` -gt 1 2>/dev/null ; then
backtrace="thread apply all bt"
fi
elif test -f /proc/$1/maps ; then
# Older kernel; go by it loading libpthread.
if /bin/grep -e libpthread /proc/$1/maps > /dev/null 2>&1 ; then
backtrace="thread apply all bt"
fi
fi
GDB=${GDB:-/usr/bin/gdb}
if $GDB -nx --quiet --batch --readnever > /dev/null 2>&1; then
readnever=--readnever
else
readnever=
fi
# Run GDB, strip out unwanted noise.
$GDB --quiet $readnever -nx /proc/$1/exe $1 <<EOF 2>&1 |
set width 0
set height 0
set pagination no
$backtrace
EOF
/bin/sed -n \
-e 's/^\((gdb) \)*//' \
-e '/^#/p' \
-e '/^Thread/p'
In addition you are running this on VMware and OCFS2, which can become very tricky. Just out of curiosity: why are you using OCFS2? I know you said that you have the same issue with EXT3, but we don't have that call stack either :-((
You are hiding a lot of details, which makes it hard to help you. I also requested several more details (like mount options, etc.), as there are known issues within this call stack.
Regards
Stefan
I use OCFS2 because this system is in a cluster, but in our scenario the cluster is now off. As I said, I copied all data from OCFS2 to ext3 (in this SP of SUSE, OCFS2 is only supported for SAP installations), but the result is the same. So I hope the problem is not in OCFS2 (but maybe; I'll try next time). See the mount points for Oracle:
xi-ap2:~ # mount |grep oracle
/dev/sdf1 on /oracle type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdg1 on /oracle/XIP type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdh1 on /oracle/XIP/102_64 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdi1 on /oracle/XIP/sapdata type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdj1 on /oracle/XIP/origlogA type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdk1 on /oracle/XIP/origlogB type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdl1 on /oracle/XIP/oraarch type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdv1 on /oracle/XIP/11203 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdw1 on /oracle/stage/11203 type ocfs2 (rw,_netdev,heartbeat=local)
When I restore the snapshot, I'll send the dump file etc.
Best regards
Paweł
See the stack_XIP.txt file.
When I run strace it looks like this:
xi-ap2:~ # strace -cf -p 7592
Process 7592 attached - interrupt to quit
^CProcess 7592 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.016001 1333 12 io_submit
0.00 0.000000 0 8 read
0.00 0.000000 0 10 write
0.00 0.000000 0 16 open
0.00 0.000000 0 16 close
0.00 0.000000 0 6 stat
0.00 0.000000 0 9 lstat
0.00 0.000000 0 3 mmap
0.00 0.000000 0 3 munmap
0.00 0.000000 0 5 rt_sigprocmask
0.00 0.000000 0 1 1 rt_sigreturn
0.00 0.000000 0 19 pread
0.00 0.000000 0 2 pwrite
0.00 0.000000 0 3 select
0.00 0.000000 0 5 semctl
0.00 0.000000 0 6 fcntl
0.00 0.000000 0 164 getrusage
0.00 0.000000 0 350 times
0.00 0.000000 0 1 getuid
0.00 0.000000 0 1 geteuid
0.00 0.000000 0 1 getppid
0.00 0.000000 0 12 io_getevents
0.00 0.000000 0 84 82 semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00 0.016001 737
but when the system hangs, strace is suspended and shows nothing.
Best regards
Paweł
Hi Pawel,
thanks for the additional system state dump with stack traces (now the SO objects are missing due to level 256 + 1 instead of 256 + 10, but currently it is not so important). However, the process is stuck so long that even oradebug is not able to retrieve the call stack (check my blog for the difference between pstack / gdb and oradebug) by sending a SIGUSR2 signal to the process.
*** 2013-12-17 13:18:25.781
Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 7592, image: oracle@xi-ap2 (LGWR)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 30000 ms
...
Current Wait Stack:
0: waiting for 'log file parallel write'
files=0x1, blocks=0x1, requests=0x1
wait_id=1421 seq_num=1422 snap_id=1
wait times: snap=2 min 20 sec, exc=2 min 20 sec, total=2 min 20 sec
wait times: max=infinite, heur=2 min 20 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0
You do not need to restore the database several times. The problem will be persistent as it is very low level (within OS kernel functions) and it seems like the process is totally stuck.
> but when the system hangs, strace is suspended and shows nothing.
Not very surprising, as the LGWR OS process is stuck in the following call stack: "ocfs2_file_aio_write <- aio_rw_vect_retry <- aio_run_iocb <- io_submit_one <- do_io_submit <- system_call_fastpath". Strace is not showing anything, as it is waiting for the kernel function to return, but it is stuck. If the process continued (theoretically) at some point in the future, you would see that horrible timing with strace.
The RPM package "perf" is also available for SLES, which can dig deeper into system calls "on-the-fly" (e.g. something like "shell> perf record -g -e cpu-clock -p <PID>" and "shell> perf report" afterwards).
From the process status output (that you provided on Dec 16, 2013 9:41 AM) you can see that the process is in the state "D – Disk Sleep (uninterruptible sleep)". If the values voluntary_ctxt_switches and nonvoluntary_ctxt_switches are not increasing (by running several more cat commands on the status file), then the process is totally stuck. You can use the "syscall" file (in /proc/<PID>) and the corresponding kernel header files to interpret it further.
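That counter check can be scripted; if neither counter moves between samples, the process never gets scheduled again. A sketch only: the ora_lgwr pattern is an assumption about the process name, and the PID falls back to the current shell so the snippet stays runnable.

```shell
# Compare the context-switch counters across a two-second gap.
# Unchanged numbers mean the process is fully stuck in the kernel.
pid=$(pgrep -f ora_lgwr | head -n1)   # assumed Oracle background process name
pid=${pid:-$$}                        # fallback keeps the sketch runnable
grep ctxt_switches "/proc/$pid/status"
sleep 2
grep ctxt_switches "/proc/$pid/status"
```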
However, based on the information you provided and (my) interpretation of this scenario, the process is fully stuck in OCFS2 kernel functions. I have no system with OCFS2, so I cannot verify how it works correctly and dig deeper into that.
That's so far what we can do here for you; otherwise we would need access to your system. Maybe you want to check out my consulting services (my website is in my SCN profile) and engage me for further troubleshooting, if this is a critical issue.
Without knowing anything specific about your exact kernel code (I am using OEL), you may want to check and verify the following kernel code fix: http://list-archives.org/2012/06/27/linux-kernel-vger-kernel-org/v4-fix-ocfs2-aio-dio-writing-proces...
Regards
Stefan
Hello
Thanks for your help.
Now I will try installing the OS not on VMware infrastructure but on a physical machine, with an ext3 file system. After this I can send the next logs and create OSS messages.
It is strange: I have several installations with the same OS configuration and the same Oracle binary version, and they work OK (dev and test systems). Only this one database is bad (this is the first production installation I upgraded, and I am worried about further production systems).
I think the problem is some limit in the kernel, but I can't find which limit.
Best regards
Paweł
Hi Pawel,
Could you stop Oracle and do a
$ORACLE_HOME/bin/relink all
Once this is done, start Oracle and reproduce the error; if the error persists please upload the Oracle alert_SID.log.
Best Regards
Marius