
problem with oracle

Former Member

Hello all

I have a problem with Oracle 11 on SUSE. We are trying to upgrade SUSE from 11.1 to 11.2. Our scenario is:

We installed a new system (SLES for SAP Applications 11.2 (x86_64)) and copied all files from the old SAP system (on SLES 10). All OS settings were copied from 11.1 to 11.2. We used this scenario on all our other systems and it worked properly (on ERP, EP, XI dev & test), but not on this one (production XI on NW 7.0).

So, after copying all files from the original system to the new one we started Oracle, but it did not work properly; after some time it crashed. I found that the problem occurs when the system tries to switch the logfile (on the original system it works correctly). See the steps to reproduce the crash:

SQL> startup

ORACLE instance started.

Total System Global Area 2472083456 bytes
Fixed Size                  2230792 bytes
Variable Size             872416760 bytes
Database Buffers         1577058304 bytes
Redo Buffers               20377600 bytes
Database mounted.
Database opened.
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

and now Oracle does nothing; the system cannot finish this step. When I shut Oracle down in abort mode (otherwise the system cannot stop it), at the OS level there is still a running ora_lgwr process, which I cannot kill because it is waiting for I/O (I think for a disk write).
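A quick way to confirm from the shell that a process is wedged in uninterruptible I/O sleep (state "D") is to read its /proc status; a minimal sketch (the lgwr pid shown is hypothetical):

```shell
# Print the one-letter scheduler state (R/S/D/Z/T) of a pid.
# "D" (uninterruptible sleep) means the process is blocked inside the
# kernel, usually on I/O, and cannot be killed until the I/O returns.
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status"
}

proc_state $$        # state of the current shell, e.g. "S" or "R"
# proc_state 10403   # hypothetical ora_lgwr pid; "D" would confirm the hang
```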

I have a snapshot of this installation and I have tried different things (changing OS limits, changing the size/number of redo logs, changing the redo log destination, changing the block size on disk), and the system always crashes.

In the alert log I cannot find anything interesting, and the same at the OS level.

Can anybody help me?

Best regards

Paweł

Accepted Solutions (0)

Answers (7)


Former Member

Hello all

On Suse 11.2 we had the same problem (on a physical and a virtual machine). The parameter log_archive_start=true did not solve it. I think we have some problem with the Linux kernel, but I don't know why it happens only on this one server.

We solved the problem by installing a newer version of Suse - 11.3. Now everything works OK.
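For reference, the installed service-pack level on SLES 11 can be read from /etc/SuSE-release; a small sketch (the file format shown is the SLES 11 one, and the parsing helper is my own):

```shell
# Extract the PATCHLEVEL value from /etc/SuSE-release-style input.
sp_level() {
    awk -F= '/^PATCHLEVEL/ { gsub(/ /, ""); print $2 }'
}

# On a live SLES 11 box: sp_level < /etc/SuSE-release
printf 'VERSION = 11\nPATCHLEVEL = 3\n' | sp_level   # prints: 3
```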

Thank you very much for your help

Best regards

Paweł

Former Member

Hi,

I hope your problem will be resolved.

Please check the parameter.

log_archive_start=true

and also try the statements below:

alter database clear unarchived logfile group 1;

alter database clear unarchived logfile group 1 unrecoverable datafile;

alter system switch logfile;

Thank you

Mahendran

Former Member

Hi,

Please try to drop your database control file and recreate it. Then issue alter database open resetlogs, followed by startup force.

Then check and let me know the status.

Thank you

Mahendran

Former Member

I did this. I created a new controlfile, with new log files on a different disk. The status is the same.

Best regards

Paweł

Former Member

I have done this a few times and the status is the same. I recreated the database using the scripts

CREATE CONTROLFILE SET DATABASE"XIP" RESETLOGS  ARCHIVELOG

....

and

alter database open resetlogs

and the situation is the same.
I moved the controlfiles and log files to another disk and changed the size of the log files. The status is the same.

Best regards

Paweł

Former Member

Hi,

Could you please provide the full alert log?

If you provide the full alert log, we can propose a solution.

Thank you

Mahendran

Former Member

After reinstalling the system I found the error below in XIP_m000_11027.trc:

more XIP_m000_11027.trc
Trace file /oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace/XIP_m000_11027.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/XIP/112_64
System name:    Linux
Node name:      xi-ap2
Release:        3.0.13-0.27-default
Version:        #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b)
Machine:        x86_64
VM name:        VMWare Version: 6
Instance name: XIP
Redo thread mounted by this instance: 1
Oracle process number: 23
Unix process pid: 11027, image: oracle@xi-ap2 (M000)

*** 2013-12-16 09:24:38.745
*** SESSION ID:(222.17) 2013-12-16 09:24:38.745
*** CLIENT ID:() 2013-12-16 09:24:38.745
*** SERVICE NAME:(SYS$BACKGROUND) 2013-12-16 09:24:38.745
*** MODULE NAME:(MMON_SLAVE) 2013-12-16 09:24:38.745
*** ACTION NAME:(Auto-Flush Slave Action) 2013-12-16 09:24:38.745

DDE rules only execution for: ORA 12751
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
Executing ASYNC actions
----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
Runtime exceeded 900 seconds
Time limit violation detected at:
ksedsts()+461<-kspol_12751_dump()+145<-dbgdaExecuteAction()+1065<-dbgerRunAction()+109<-dbgerRunActions()+4053<-dbgexPhaseII()+1873<-dbgexProcessError()+2680<-dbgeExecuteForError()+88<-dbgePostErrorKGE()+2143<-dbkePostKGE_kgsf()+71<-kgeselv()+276<-ksesecl0()+162<-ksucin()+224<-kcrfws()+1677<-kcbchg1_main()+12927<-kcbchg1()+1068<-ktuchg2()+1417<-ktbchg2()+281<-kdtchg()+1702<-kdtwrp()+6508<-kdtInsRow()+3093<-qerltcNoKdtBufferedInsRowCBK()+795<-qerltcFetch()+3707<-insexe()+1634<-opiexe()+5572<-kpoal8()+2224<-opiodr()+916<-kpoodrc()+36<-rpiswu2()+1618<-kpoodr()+622<-upirtrc()+2497<-kpurcsc()+98<-kpuexec()+10807<-OCIStmtExecute()+39<-kewrgwxf1_gwrsql_exft_1()+387<-kewrgwxf_gwrsql_exft()+1388<-kewrfos_flush_onesnap()+697<-kewrfsc_flush_snapshot_c()+923<-kewrafs_auto_flush_slave()+882<-kebm_slave_main()+580<-ksvrdp()+1779<-opirip()+674<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+230<-_start()+36
Current Wait Stack:
0: waiting for 'log file switch (checkpoint incomplete)'
    =0x0, =0x0, =0x0
    wait_id=4 seq_num=7 snap_id=2
    wait times: snap=7 min 13 sec, exc=15 min 6 sec, total=15 min 6 sec
    wait times: max=infinite, heur=15 min 6 sec
    wait counts: calls=9021 os=9021
    in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
  Dumping 1 direct blocker(s):
    inst: 1, sid: 95, ser: 1
  Dumping final blocker:
    inst: 1, sid: 95, ser: 1
There are 1 sessions blocked by this session.
Dumping one waiter:
  inst: 1, sid: 219, ser: 1
  wait event: 'row cache lock'
    p1: 'cache id'=0x16
    p2: 'mode'=0x0
    p3: 'request'=0x3
  row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
  min_blocked_time: 291 secs, waiter_cache_ver: 566
Wait State:
  fixed_waits=0 flags=0x23 boundary=(nil)/-1
Session Wait History:
    elapsed time of 0.000000 sec since current wait
0: waited for 'latch: redo writing'
    address=0x60022628, number=0xb9, tries=0x0
    wait_id=5 seq_num=6 snap_id=1
    wait times: snap=0.001148 sec, exc=0.001148 sec, total=0.001148 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000000 sec of elapsed time
1: waited for 'log file switch (checkpoint incomplete)'
    =0x0, =0x0, =0x0
    wait_id=4 seq_num=5 snap_id=1
    wait times: snap=7 min 52 sec, exc=7 min 52 sec, total=7 min 52 sec
    wait times: max=infinite
    wait counts: calls=4712 os=4712
    occurred after 0.000371 sec of elapsed time
2: waited for 'db file sequential read'
    file#=0x3, block#=0x1141, blocks=0x1
    wait_id=3 seq_num=4 snap_id=1
    wait times: snap=0.016183 sec, exc=0.016183 sec, total=0.016183 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000161 sec of elapsed time
3: waited for 'db file sequential read'
    file#=0x3, block#=0x1142, blocks=0x1
    wait_id=2 seq_num=3 snap_id=1
    wait times: snap=0.015223 sec, exc=0.015223 sec, total=0.015223 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000019 sec of elapsed time
4: waited for 'Disk file operations I/O'
    FileOperation=0x2, fileno=0x3, filetype=0x2
    wait_id=1 seq_num=2 snap_id=1
    wait times: snap=0.020978 sec, exc=0.020978 sec, total=0.020978 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.004954 sec of elapsed time
5: waited for 'class slave wait'
    slave id=0xffffffffa52ff548, =0x0, =0x0
    wait_id=0 seq_num=1 snap_id=1
    wait times: snap=0.000087 sec, exc=0.000087 sec, total=0.000087 sec
    wait times: max=infinite
    wait counts: calls=1 os=1
    occurred after 0.000000 sec of elapsed time
Sampled Session History of session 222 serial 17
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples  in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 120 sec
---------------------------------------------------
  [121 samples,                                            09:22:38 - 09:24:38]
    waited for 'log file switch (checkpoint incomplete)', seq_num: 7
      p1: ''=0x0
      p2: ''=0x0
      p3: ''=0x0
      time_waited: >= 120 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
  longest_non_idle_wait: 'log file switch (checkpoint incomplete)'
  [121 samples, 09:22:38 - 09:24:38]
      time_waited: >= 120 sec (still in wait)
---------------------------------------------------
----- END DDE Action: 'ORA_12751_DUMP' (SUCCESS, 10 csec) -----
----- END DDE Actions Dump (total 10 csec) -----
*** KEWROCISTMTEXEC - encountered error: (ORA-12751: cpu time or run time policy violation
)
  *** SQLSTR: total-len=342, dump-len=240,
      STR={insert into WRM$_SNAPSHOT   (snap_id, dbid, instance_number, startup_time,    begin_interval_time, end_interval_time, snap_level,    status, error_count, bl_moved, snap_flag, snap_timezone)
  values   (:snap_id, :dbid, :instance_number, :sta}
DDE: Problem Key 'ORA 12751' was flood controlled (0x2) (no incident)
ORA-12751: cpu time or run time policy violation
*** KEWRAFS: Error=13509 encountered by Auto Flush Slave.
KEBM: MMON slave action policy violation. kewrmafsa_; viol=1; err=13509

Former Member


Dear Friend,

Could you please share your alert log, SAP syslogs, and OS event viewer logs?

Also, please share your pfile so we can analyse the problem.

Also, is your OS Windows or Unix?

Thank you.

Mahendran

Former Member

Now we are reinstalling the system (for checking), so we don't have these logs anymore. When we finish I'll send the logs.

Former Member

see alert.log:


more alert_XIP.log
Mon Dec 16 08:56:12 2013
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 0 KB (0%)

Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total Shared Global Region size is 2370 MB. For optimal performance,
  prior to the next instance restart increase the number
  of unused Large Pages by atleast 1185 2048 KB Large Pages (2370 MB)
  system wide to get 100% of the Shared
  Global Region allocated with Large pages
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =27
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /oracle/XIP/112_64
System name:    Linux
Node name:      xi-ap2
Release:        3.0.13-0.27-default
Version:        #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b)
Machine:        x86_64
VM name:        VMWare Version: 6
Using parameter settings in server-side spfile /oracle/XIP/112_64/dbs/spfileXIP.ora
System parameters with non-default values:
  processes                = 150
  sessions                 = 248
  event                    = "10027"
  event                    = "10028"
  event                    = "10142"
  event                    = "10183"
  event                    = "10191"
  event                    = "10443"
  event                    = "10995 level 2"
  event                    = "31991"
  event                    = "38068 level 100"
  event                    = "38085"
  event                    = "38087"
  event                    = "44951 level 1024"
  shared_pool_size         = 800M
  filesystemio_options     = "setall"
  control_files            = "/oracle/XIP/origlogA/cntrl/cntlrXIP.dbf"
  control_files            = "/oracle/XIP/origlogB/cntrl/cntrlXIP.dbf"
  control_files            = "/oracle/XIP/sapdata1/cntrl/cntrlXIP.dbf"
  control_file_record_keep_time= 30
  db_block_size            = 8192
  db_cache_size            = 1504M
  compatible               = "11.2.0"
  log_archive_dest_1       = "location=/oracle/XIP/oraarch/XIParch"
  log_archive_format       = "%t_%s_%r.dbf"
  log_buffer               = 14245888
  log_checkpoint_interval  = 0
  db_files                 = 254
  log_checkpoints_to_alert = TRUE
  dml_locks                = 4000
  undo_tablespace          = "PSAPUNDO"
  recyclebin               = "off"
  _mutex_wait_time         = 10
  _mutex_wait_scheme       = 1
  remote_os_authent        = TRUE
  remote_login_passwordfile= "EXCLUSIVE"
  local_listener           = "LISTENER_XIP"
  job_queue_processes      = 1
  parallel_max_servers     = 40
  parallel_execution_message_size= 16384
  _fix_control             = "5099019:ON"
  _fix_control             = "5705630:ON"
  _fix_control             = "6055658:OFF"
  _fix_control             = "6120483:OFF"
  _fix_control             = "6399597:ON"
  _fix_control             = "6430500:ON"
  _fix_control             = "6440977:ON"
  _fix_control             = "6626018:ON"
  _fix_control             = "6972291:ON"
  _fix_control             = "7168184:OFF"
  _fix_control             = "8937971:ON"
  _fix_control             = "9196440:ON"
  _fix_control             = "9495669:ON"
  _fix_control             = "13077335:ON"
  _fix_control             = "13627489:ON"
  _fix_control             = "14255600:ON"
  _fix_control             = "14595273:ON"
  audit_file_dest          = "/oracle/XIP/saptrace/audit"
  db_name                  = "XIP"
  open_cursors             = 800
  star_transformation_enabled= "TRUE"
  _optimizer_extended_cursor_sharing_rel= "NONE"
  _optimizer_adaptive_cursor_sharing= FALSE
  parallel_threads_per_cpu = 1
  query_rewrite_enabled    = "FALSE"
  _optim_peek_user_binds   = FALSE
  pga_aggregate_target     = 800M
  optimizer_dynamic_sampling= 6
  _optimizer_use_feedback  = FALSE
  diagnostic_dest          = "/oracle/XIP/saptrace"
  max_dump_file_size       = "20000"
Deprecated system parameters with specified values:
  remote_os_authent
End of deprecated system parameter listing
Mon Dec 16 08:56:14 2013
PMON started with pid=2, OS id=10356
Mon Dec 16 08:56:14 2013
PSP0 started with pid=3, OS id=10361
Mon Dec 16 08:56:16 2013
VKTM started with pid=4, OS id=10366 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Dec 16 08:56:16 2013
GEN0 started with pid=5, OS id=10373
Mon Dec 16 08:56:16 2013
DIAG started with pid=6, OS id=10378
Mon Dec 16 08:56:16 2013
DBRM started with pid=7, OS id=10383
Mon Dec 16 08:56:16 2013
DIA0 started with pid=8, OS id=10388
Mon Dec 16 08:56:16 2013
MMAN started with pid=9, OS id=10393
Mon Dec 16 08:56:16 2013
DBW0 started with pid=10, OS id=10398
Mon Dec 16 08:56:16 2013
LGWR started with pid=11, OS id=10403
Mon Dec 16 08:56:16 2013
CKPT started with pid=12, OS id=10408
Mon Dec 16 08:56:16 2013
SMON started with pid=13, OS id=10413
Mon Dec 16 08:56:16 2013
RECO started with pid=14, OS id=10418
Mon Dec 16 08:56:16 2013
MMON started with pid=15, OS id=10423
Mon Dec 16 08:56:16 2013
MMNL started with pid=16, OS id=10428
ORACLE_BASE from environment = /oracle
Mon Dec 16 08:56:16 2013
ALTER DATABASE   MOUNT
Successful mount of redo thread 1, with mount id 29035568
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Mon Dec 16 08:56:21 2013
ALTER DATABASE OPEN
LGWR: STARTING ARCH PROCESSES
Mon Dec 16 08:56:23 2013
ARC0 started with pid=18, OS id=10451
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Mon Dec 16 08:56:24 2013
ARC1 started with pid=19, OS id=10456
Mon Dec 16 08:56:24 2013
ARC2 started with pid=20, OS id=10461
Mon Dec 16 08:56:24 2013
ARC3 started with pid=21, OS id=10466
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
Thread 1 opened at log sequence 224843
  Current log# 2 seq# 224843 mem# 0: /oracle/XIP/origlogB/log_g12m1.dbf
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Incremental checkpoint up to RBA [0x36e4b.18f.0], current log tail at RBA [0x36e4b.18f.0]
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
[10446] Successfully onlined Undo Tablespace 1.
Undo initialization finished serial:0 start:235693726 end:235694266 diff:540 (5 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is UTF8
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Mon Dec 16 08:56:27 2013
QMNC started with pid=22, OS id=10471
Completed: ALTER DATABASE OPEN
Mon Dec 16 08:56:32 2013
Starting background process CJQ0
Mon Dec 16 08:56:32 2013
CJQ0 started with pid=28, OS id=10507
Mon Dec 16 08:56:45 2013
System State dumped to trace file /oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace/XIP_ora_10446.trc
Mon Dec 16 08:57:34 2013
Beginning log switch checkpoint up to RBA [0x36e4c.2.10], SCN: 1956998197
Thread 1 advanced to log sequence 224844 (LGWR switch)
  Current log# 4 seq# 224844 mem# 0: /oracle/XIP/origlogB/log_g14m1.dbf
Mon Dec 16 08:57:35 2013
Archived Log entry 224395 added for thread 1 sequence 224843 ID 0xf731cd0c dest 1:
Beginning log switch checkpoint up to RBA [0x36e4d.2.10], SCN: 1956998200
Thread 1 advanced to log sequence 224845 (LGWR switch)
  Current log# 1 seq# 224845 mem# 0: /oracle/XIP/origlogA/log_g11m1.dbf
Mon Dec 16 08:57:36 2013
Archived Log entry 224396 added for thread 1 sequence 224844 ID 0xf731cd0c dest 1:
Beginning log switch checkpoint up to RBA [0x36e4e.2.10], SCN: 1956998203
Thread 1 advanced to log sequence 224846 (LGWR switch)
  Current log# 3 seq# 224846 mem# 0: /oracle/XIP/origlogA/log_g13m1.dbf
Mon Dec 16 08:57:38 2013
Archived Log entry 224397 added for thread 1 sequence 224845 ID 0xf731cd0c dest 1:
Thread 1 cannot allocate new log, sequence 224847
Checkpoint not complete
  Current log# 3 seq# 224846 mem# 0: /oracle/XIP/origlogA/log_g13m1.dbf
Mon Dec 16 08:57:42 2013
Completed checkpoint up to RBA [0x36e4c.2.10], SCN: 1956998197
xi-ap2:/oracle/XIP/saptrace/diag/rdbms/xip/XIP/trace #

Former Member

Hello

This is the full alert.log.

See below all the steps I performed:

SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 2472083456 bytes
Fixed Size                  2230792 bytes
Variable Size             872416760 bytes
Database Buffers         1577058304 bytes
Redo Buffers               20377600 bytes
Database mounted.
Database opened.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL>
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;
alter system switch logfile

and now the system is waiting.

Best regards

Paweł

stefan_koehler
Active Contributor

Hi Pawel,

> So, after copy all file from orginal system to new one we start oracle, but oracle not working properly, after some time, its crash.


Does it really crash (core dump / seg fault) or does it just hang? In the latter case - are you able to logon anymore or does the logon also hang?

Depending on the exact scenario you can perform (several) systemstate dumps and crosscheck the reason for that "hanging" issue. I have written a blog post about creating and using systemstate dumps. You can use the prelim option if you are not able to log on anymore.

Or you can use strace / pstack for further troubleshooting; please check out my other blog post about that.

Regards

Stefan

Former Member

Thanks Stefan for your answer.

The system/Oracle does not dump; it just keeps hanging in the "alter system switch logfile" statement. There is nothing in the alert log. When I ran strace on the lgwr process during this situation, the console looked like it was under control but I couldn't do anything; from another console I could only kill that session.

The lgwr process looks like it is connected to some device (ps auxwww | grep lgwr shows status D+), and it can't be killed until the system process disconnects, but that never happens.

I tried to check the lgwr process as described in:

http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d0fcfca8-fd3f-3110-98b8-9c47a2440...

but when the system starts the log switch it hangs, and the scripts can't finish.

When I reinstall the system I'll check "library cache lock"; maybe this is a good way to solve my problem.

Best regards

Paweł

stefan_koehler
Active Contributor

Hi Pawel,

so your system just hangs.

LGWR may not show anything (via strace) in this situation, as it is also blocked.

I don't get the full picture of your issue right now (whether you are able to log on with SQL*Plus in another session or not when this hanging issue occurs), but I provided you two links on how to create a systemstate dump in both situations.

The library cache lock issue that I described in my blog has nothing to do with your issue, but that blog post includes a guide on how to create a systemstate dump with different kinds of information.

The systemstate dump will show us the blocking scenario, and afterwards you can do further analysis based on that result.

Regards

Stefan

Former Member

Can you attach the pstack output for the lgwr process?

Former Member

See the Oracle dump file.

Best regards

Paweł

Former Member

Hello

I don't know what pstack is. How can I create these logs?

See:

10403 is the pid of the lgwr process:

xi-ap2:/proc/10403 # more status
Name:   oracle
State:  D (disk sleep)
Tgid:   10403
Pid:    10403
PPid:   1
TracerPid:      0
Uid:    1018    1018    1018    1018
Gid:    1002    1002    1002    1002
FDSize: 512
Groups: 1000 1002 1003
VmPeak:  2687152 kB
VmSize:  2687152 kB
VmLck:         0 kB
VmHWM:     36344 kB
VmRSS:     36344 kB
VmData:     3820 kB
VmStk:       136 kB
VmExe:    186556 kB
VmLib:     14944 kB
VmPTE:       540 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/111382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000006004207
SigCgt: 00000001c9803cf8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ff
Cpus_allowed_list:      0-7
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        1719
nonvoluntary_ctxt_switches:     96

Best regards

Paweł

stefan_koehler
Active Contributor

Hi Pawel,

well, your issue (though not the root cause) seems to be pretty obvious.


PROCESS 11: LGWR
      SO: 0xf0612868, type: 10, owner: 0xf0861770, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
       proc=0xf34f2580, name=FileOpenBlock, file=ksfd.h LINE:6448, pg=0
      (FOB) 0xf0612868 flags=264202 fib=0xf0861770 incno=0 pending i/o cnt=1
       fname=/oracle/XIP/origlogA/log_g13m1.dbf
       fno=0 lblksz=512 fsiz=102400
    SO: 0xf221aea8, type: 14, owner: 0xf34f2580, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0xf34f2580, name=channel handle, file=ksr2.h LINE:367, pg=0
    (broadcast handle) 0xf221aea8 flag: (2) ACTIVE SUBSCRIBER,
                       owner: 0xf34f2580 - ospid: 10403
                       event: 7, last message event: 14,
                       last message waited event: 14,
                       next message: (nil)(0), messages read: 1
                      channel: (0xf2235600) scumnt mount lock [name: 157]
                               scope: 1, event: 19, last mesage event: 14,
                               publishers/subscribers: 0/17,
                               messages published: 1
                               heuristic msg queue length: 0
   SO: 0xf370ecb0, type: 4, owner: 0xf34f2580, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xf34f2580, name=session, file=ksu.h LINE:12624, pg=0
    (session) sid: 95 ser: 1 trans: (nil), creator: 0xf34f2580
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
    0: waiting for 'log file parallel write'
        files=0x1, blocks=0x6f, requests=0x1
        wait_id=1592 seq_num=1593 snap_id=1
        wait times: snap=6 min 44 sec, exc=6 min 44 sec, total=6 min 44 sec
        wait times: max=infinite, heur=6 min 44 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x5a0
    There are 3 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 33, ser: 3
      wait event: 'switch logfile command'
        p1: ''=0x0
        p2: ''=0x0
        p3: ''=0x0
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 403 secs, waiter_cache_ver: 163

Unfortunately you have not included the short stack dumps (adding level 256) in that trace file, but it should be obvious that writing to a redo log file should not take 6 min 44 sec (and still counting).

What is your I/O storage subsystem? Do you use NFS? What kind of file system (and mount options) do you use? Please include the short stack traces as well or just execute pstack on LGWR when this issue occurs.
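To collect the file-system types and mount options in a reviewable form, something like the sketch below works against /proc/mounts (the helper name and the ocfs2 example are illustrative, not part of the original post):

```shell
# Read /proc/mounts-format lines on stdin and print mounts of the
# given file-system type together with their options.
mounts_of_type() {
    awk -v t="$1" '$3 == t { print $1, "on", $2, "type", $3, "("$4")" }'
}

# e.g. on the affected host: mounts_of_type ocfs2 < /proc/mounts
printf '/dev/sdf1 /oracle ocfs2 rw,_netdev 0 0\n' | mounts_of_type ocfs2
# prints: /dev/sdf1 on /oracle type ocfs2 (rw,_netdev)
```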

Regards

Stefan

Former Member

Hello Stefan

This is an OCFS2 file system, but for a test I copied all of the data to an ext3 file system, and the result was the same.

Best regards

Paweł

Former Member

Please provide output of

# pstack <lgwr pid>

Former Member

Stack file for lgwr:

[<ffffffffa04e9ee5>] ocfs2_file_aio_write+0x3a5/0x960 [ocfs2]
[<ffffffff81191ba4>] aio_rw_vect_retry+0x74/0x1d0
[<ffffffff81193962>] aio_run_iocb+0x82/0x170
[<ffffffff81194370>] io_submit_one+0x180/0x260
[<ffffffff811945dd>] do_io_submit+0x18d/0x250
[<ffffffff81449692>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Best regards

Paweł

stefan_koehler
Active Contributor

Hi Pawel,

this is not pstack / gdb output; it does not include the application call stack. Please check my blog post about this once again.

In addition we need several stack samples so that we can verify that it hangs at a specific function or OS call. FYI, pstack is just a wrapper script for gdb and looks like this (e.g. with OEL):


#!/bin/sh
if test $# -ne 1; then
    echo "Usage: `basename $0 .sh` <process-id>" 1>&2
    exit 1
fi
if test ! -r /proc/$1; then
    echo "Process $1 not found." 1>&2
    exit 1
fi
# GDB doesn't allow "thread apply all bt" when the process isn't
# threaded; need to peek at the process to determine if that or the
# simpler "bt" should be used.
backtrace="bt"
if test -d /proc/$1/task ; then
    # Newer kernel; has a task/ directory.
    if test `/bin/ls /proc/$1/task | /usr/bin/wc -l` -gt 1 2>/dev/null ; then
    backtrace="thread apply all bt"
    fi
elif test -f /proc/$1/maps ; then
    # Older kernel; go by it loading libpthread.
    if /bin/grep -e libpthread /proc/$1/maps > /dev/null 2>&1 ; then
    backtrace="thread apply all bt"
    fi
fi
GDB=${GDB:-/usr/bin/gdb}
if $GDB -nx --quiet --batch --readnever > /dev/null 2>&1; then
    readnever=--readnever
else
    readnever=
fi
# Run GDB, strip out unwanted noise.
$GDB --quiet $readnever -nx /proc/$1/exe $1 <<EOF 2>&1 |
set width 0
set height 0
set pagination no
$backtrace
EOF
/bin/sed -n \
    -e 's/^\((gdb) \)*//' \
    -e '/^#/p' \
    -e '/^Thread/p'

In addition, you are running this on VMware and OCFS2, which can become very tricky. Just out of curiosity: why are you using OCFS2? I know you said that you have the same issue with EXT3, but we don't have that call stack either :-((

You are hiding a lot of details, which makes it hard to help you. I also requested some more information (like mount options, etc.), as there are known issues within this call stack.

Regards

Stefan

Former Member

I use OCFS2 because this system is in a cluster, but in our scenario the cluster is now off. As I said, I copied all the data from OCFS2 to ext3 (in this SP of Suse, OCFS2 is only supported for SAP installations), but the result is the same. So I hope the problem is not in OCFS2 (but maybe; I'll try next time). See the mount points for Oracle:

xi-ap2:~ # mount |grep oracle
/dev/sdf1 on /oracle type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdg1 on /oracle/XIP type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdh1 on /oracle/XIP/102_64 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdi1 on /oracle/XIP/sapdata type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdj1 on /oracle/XIP/origlogA type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdk1 on /oracle/XIP/origlogB type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdl1 on /oracle/XIP/oraarch type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdv1 on /oracle/XIP/11203 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdw1 on /oracle/stage/11203 type ocfs2 (rw,_netdev,heartbeat=local)

When I restore the snapshot, I'll send the dump file etc.

Best regards

Paweł

Former Member

See the stack_XIP.txt file.

When I run strace it looks like this:

xi-ap2:~ # strace -cf -p 7592
Process 7592 attached - interrupt to quit
^CProcess 7592 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.016001        1333        12           io_submit
  0.00    0.000000           0         8           read
  0.00    0.000000           0        10           write
  0.00    0.000000           0        16           open
  0.00    0.000000           0        16           close
  0.00    0.000000           0         6           stat
  0.00    0.000000           0         9           lstat
  0.00    0.000000           0         3           mmap
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         5           rt_sigprocmask
  0.00    0.000000           0         1         1 rt_sigreturn
  0.00    0.000000           0        19           pread
  0.00    0.000000           0         2           pwrite
  0.00    0.000000           0         3           select
  0.00    0.000000           0         5           semctl
  0.00    0.000000           0         6           fcntl
  0.00    0.000000           0       164           getrusage
  0.00    0.000000           0       350           times
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0        12           io_getevents
  0.00    0.000000           0        84        82 semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00    0.016001                   737

but when the system crashes, strace just hangs and shows nothing.

Best regards

Paweł

stefan_koehler
Active Contributor
0 Kudos

Hi Pawel,

thanks for the additional system state dump with stack traces (the SO objects are now missing because level 256 + 1 was used instead of 256 + 10, but that is not so important right now). However, the process has been stuck for so long that even oradebug is not able to retrieve the call stack by sending a SIGUSR2 signal to the process (check my blog for the difference between pstack / gdb and oradebug).


*** 2013-12-17 13:18:25.781

Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 7592, image: oracle@xi-ap2 (LGWR)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 30000 ms

...

Current Wait Stack:

0: waiting for 'log file parallel write'

    files=0x1, blocks=0x1, requests=0x1

    wait_id=1421 seq_num=1422 snap_id=1

    wait times: snap=2 min 20 sec, exc=2 min 20 sec, total=2 min 20 sec

    wait times: max=infinite, heur=2 min 20 sec

    wait counts: calls=0 os=0

    in_wait=1 iflags=0x5a0

You do not need to restore the database several times. The problem will persist, as it is very low level (within OS kernel functions) and the process appears to be totally stuck.

> but when the system crashes, strace just hangs and shows nothing.

Not very surprising, as the LGWR OS process is stuck in the following kernel call stack: "ocfs2_file_aio_write <- aio_rw_vect_retry <- aio_run_iocb <- io_submit_one <- do_io_submit <- system_call_fastpath". Strace shows nothing because it is waiting for the kernel function to return, and it never does. If the process did (theoretically) continue at some point in the future, you would see that horrible timing in the strace output.
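For readers following along: the scheduler state and the kernel-side call stack of such a process can also be read straight from procfs. A sketch (the PID defaults to the current shell only so the commands run as-is; substitute the LGWR PID, 7592 in this thread, and note that /proc/<pid>/stack usually needs root):

```shell
#!/bin/sh
# Sketch: inspect a possibly stuck process via procfs.
# PID defaults to this shell so the commands are runnable;
# replace it with the LGWR pid (7592 in this thread).
PID=${1:-$$}
grep '^State:' /proc/$PID/status   # "D" means uninterruptible sleep, i.e. stuck in I/O
cat /proc/$PID/stack 2>/dev/null \
  || echo "reading /proc/$PID/stack needs root"
```

A task shown as "D (disk sleep)" that never leaves that state matches exactly the "cannot kill ora_lgwr" symptom described earlier.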

The RPM package "perf" is also available for SLES and can dig deeper into system calls on the fly (e.g. something like "shell> perf record -g -e cpu-clock -p <PID>" followed by "shell> perf report").

Depending on the output of the process status (that you provided on

Regards

Stefan

Former Member
0 Kudos

Can you attach output of

# cat /proc/<lgwr_pid>/syscall
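For reference, a sketch of what that file contains, shown for the current shell since the LGWR PID differs per system: the first field is the number of the syscall the task is blocked in (on x86_64, io_submit should be syscall 209, which would match the stack above), followed by the syscall arguments and the stack/instruction pointers.

```shell
# Sketch: /proc/<pid>/syscall for the current shell; substitute
# the LGWR pid. First field = syscall number, then the six
# arguments, then stack pointer and program counter.
cat /proc/$$/syscall
```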

Former Member
0 Kudos

Hello

Thanks for your help

Now I will try installing the OS on a physical machine instead of the VMware infrastructure, with an ext3 file system. After that I can send the next logs and create an OSS message.

It is strange: I have several installations with the same OS configuration and the same Oracle binary version, and they work fine (dev and test systems). Only this one database is broken (it is the first production installation I upgraded, and I am afraid for the remaining production systems).

I think the problem is some limit in the kernel, but I cannot find which one.
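If it is a kernel limit, the system-wide async-I/O counters are one concrete suspect to rule out. A sketch (note the caveat: hitting this cap makes io_submit fail with EAGAIN rather than hang, so this only excludes one candidate):

```shell
# Sketch: current number of in-flight aio requests vs. the
# system-wide cap; if aio-nr has reached aio-max-nr, further
# io_submit calls start failing with EAGAIN.
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr
```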

Best regards

Paweł

former_member206552
Active Contributor
0 Kudos

Hi Pawel,

Could you stop Oracle and run

$ORACLE_HOME/bin/relink all

Once this is done, start Oracle and try to reproduce the error. If the error persists, please upload the Oracle alert_SID.log.

Best Regards

Marius

Former Member
0 Kudos

Thanks for your answer.

I did this step; nothing helped. I also copied all the Oracle binaries from a system that works on this OS (the ERP system we copied before) and relinked them; the status is the same. I checked the Oracle installation and whether the OS settings are OK, and they are.

Best regards

Paweł