DonFreeman (United States) asked:
ORA-01115: IO error reading block from file; ORA-27091: skgfqio: unable to queue I/O

We have started experiencing this error in the last few weeks and are trying to identify the cause.  We opened a ticket with Oracle and one with HP (the SAN vendor).  I went through the logs from the web servers and found that this ORA-01115 error was logged by the web application six times yesterday, between 2:49 and 3:10 PM.  We're not sure if it's both nodes or just one.  The last time it had been logged previously was 12/22/05.  On four of those occasions it was the same file our batch process had a problem with -- PARTIES_DATA01.dbf.  On the other two it was PARTIES_DATA02.dbf and LOCATION_DATA01.dbf.

HP says the SAN is healthy, Windows hasn't logged any issues, and Oracle says it isn't their problem if there is no block corruption.  If anyone has a suggestion for diagnosing this, I'm all ears.  We have tested the datafiles for corruption with RMAN and haven't found anything.  The error is hopping around to different datafiles, redo log files and blocks.  HP says its monitoring doesn't detect anything, thinks Oracle ought to wait longer, and is suggesting raising the SCSI timeout from 60 to 90 seconds.  But if we had experienced a SCSI timeout it would be in the event log, and there are no errors in the event logs.
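For anyone following along, the corruption check we ran was roughly this (RMAN syntax from memory, so double-check it against your version's docs before running it on prod):

```sql
RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;

-- Afterwards, any blocks the validation flagged show up here (SQL*Plus):
SQL> SELECT file#, block#, blocks, corruption_type
     FROM   v$database_block_corruption;
```

In our case the validation pass came back clean, which is why Oracle is pushing this back to the storage layer.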

Other servers connected to the SAN aren't experiencing any problems, but none of them are under the same kind of load.  This is our prod transactional db.

I'm not sure how to check what the maximum wait time for access to the most-affected datafile is.  Statspack doesn't show anything unusual over the period the errors occurred, and I'm aware that you can't identify or troubleshoot a discrete error with aggregate data.  The disks are in RAID 10 configuration (SAME -- stripe and mirror everything).
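Since posting this I've been pointed at V$FILESTAT, which tracks per-datafile maximum read/write times (in centiseconds, since instance startup) rather than Statspack-style aggregates -- assuming timed_statistics is enabled.  Something along these lines, if I have the columns right:

```sql
-- MAXIORTM / MAXIOWTM: worst-case read/write times per file,
-- in hundredths of a second; requires timed_statistics = true.
SELECT d.name,
       f.phyrds, f.phywrts,
       f.maxiortm AS max_read_cs,
       f.maxiowtm AS max_write_cs
FROM   v$filestat f
JOIN   v$datafile d ON d.file# = f.file#
WHERE  d.name LIKE '%PARTIES_DATA01%';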

The only identifiable node that the error occurs on is Node 2, because our batch process only connects to that one.  The other connections are load balanced, so we're not sure whether it's occurring on one node or both.

Where do I go from here?  All the FC switch log analysis is done by HP.  I don't think we have access to any logs on the SAN Appliance or a reader to decode them.  I have a hand tied behind my back with HP.  Is there some sort of diagnostic program I could run?   Do I need to enable some sort of logging?
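One thing I'm considering asking Oracle Support about (this is a guess on my part -- verify with them before touching prod): you can usually make the instance dump a trace file whenever a specific error is raised, which would at least capture the call stack at the moment the I/O queue fails.  Something like:

```sql
-- Dump an error stack trace whenever ORA-27091 fires.
-- Syntax from memory; confirm with Oracle Support first.
ALTER SYSTEM SET EVENTS '27091 trace name errorstack level 3';
```

The resulting trace in bdump/udump might show what the process was doing when the OS refused the I/O.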

Here are examples:

ORA-01115: IO error reading block from file 18 (block # 182340)
ORA-01110: data file 18: 'O:\ORADATA\NEDSSPC\PARTIES_DATA01.DBF'
ORA-27091: skgfqio: unable to queue I/O
Fri May 12 23:48:59 2006
Errors in file c:\oracle\admin\nedsspc\bdump\nedsspc2_arc0_3836.trc:
ORA-00333: redo log read error block 135169 count 2048
ORA-00312: online log 6 thread 2: 'O:\ORADATA\NEDSSPC\REDO06.LOG'
ORA-27091: skgfqio: unable to queue I/O
ARC0: Completed archiving log 6 thread 2 sequence 18565
Fri May 12 23:49:04 2006

Sat May 13 23:49:10 2006
Errors in file c:\oracle\admin\nedsspc\bdump\nedsspc2_arc0_3836.trc:
ORA-00333: redo log read error block 129025 count 2048
ORA-00312: online log 7 thread 2: 'O:\ORADATA\NEDSSPC\REDO07.LOG'
ORA-27091: skgfqio: unable to queue I/O
ARC0: Completed archiving log 7 thread 2 sequence 18675

Drive information
~~~~~~~~~~~~~~~~~
redologs at: O:\ORADATA\NEDSSPC\

Drive O:
Description Local Fixed Disk
Compressed No
File System OraCFS
Size 99.99 GB (107,364,544,512 bytes)
Free Space 16.62 GB (17,850,949,632 bytes)

vishal68 (India):
Hi

We faced the same problems when we moved to RAC on Windows about a year and a half back.  We were on 9iR2 at the time.  We ran into a number of problems before we managed to convince management to move to a Unix box, and we have since moved.

As for your problem: we were hitting ORA-27091: skgfqio: unable to queue I/O mainly on the redo log files.  Oracle Support was also onsite to help.  We finally moved the redo logs onto a completely separate set of disks, and that resolved the problem.

HTH
Vishal

DonFreeman (asker):

Hmmmm.....It seemed obvious to us that as a last resort we could start moving files around to see what would happen.  We have datafiles involved as well.  I'm not too sure exactly how we're going to ensure that we get everything moved around to fix the problem.  And, this type of thing is exactly what they promised we wouldn't have to do.  Storage management was supposed to become completely seamless.....
ASKER CERTIFIED SOLUTION: Netminder

This solution is only available to members.
fran_aro:
Hi guys,
Where is the solution for this issue?
This seems unprofessional.
Very, very bad.

Thanks,
Francis
Wow, it's been a long time.  We finally decided we were overloading our SAN, despite what the vendor and everybody else said.  We removed non-production storage from the SAN and the problem went away.  I'm not sure about the configuration of the LUNs.  All our storage was striped and mirrored, so I figured the probability that any individual disk was being used by more than one instance at a time was pretty high.  The error is simply saying the I/O queue is full and the disk is not available for writing; the only thing that could make that happen is something else using it.

I hope my memory and reasoning are correct, and that this is helpful to you.