Solaris and ATA and SCSI errors

After upgrading from snv_88 to snv_92 (Solaris 11 or Nevada b92), my home server started spewing the following errors:

Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: abort request, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: abort device, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: reset target, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: reset bus, target=0 lun=0
Jun 25 14:23:54 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):
Jun 25 14:23:54 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:23:54 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:23:54 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: abort request, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: abort device, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: reset target, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: reset bus, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: early timeout, target=0 lun=0
Jun 25 14:24:31 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
Jun 25 14:24:31 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
Jun 25 14:24:31 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
Jun 25 14:24:31 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3

These were followed by UFS/ZFS errors, broken logging, fsck runs and eventually a reboot. My first thought was a failing disk; however, iostat -E showed no errors, and the “bad sectors” had suddenly appeared on all four disks at once. That pointed to a software error rather than a hardware one.
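To see at a glance whether the warnings are confined to one disk or spread across several, the log can be summarized per disk. The following is only a rough sketch: the function name count_disk_warnings is made up for illustration, and it assumes the log lines look like the gda warnings quoted above, with the disk label in parentheses at the end of the line.

```shell
# Count gda (disk) WARNING lines per disk label, e.g. "Disk0".
# $1: path to a syslog file (hypothetical usage: /var/adm/messages)
count_disk_warnings() {
    # Keep only the gda warning headers, such as:
    #   ... gda: [ID 107833 kern.warning] WARNING: /pci@0,0/.../cmdk@0,0 (Disk1):
    grep ' gda: .*WARNING:' "$1" |
        # Reduce each line to the label inside the trailing parentheses.
        sed 's/.*(\([^)]*\)):.*$/\1/' |
        # Count occurrences and print them as "label count".
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running count_disk_warnings /var/adm/messages prints one line per disk with the number of warnings. Several disks on different controllers failing at the same moment is more consistent with a driver or kernel problem than with bad media.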

Indeed, it turned out to be a bug with a not-so-obvious workaround, namely disabling the “Intel Microcode Update” feature:

# rm -rf /platform/i86pc/ucode
# reboot
or
# mv /platform/i86pc/ucode /platform/i86pc/orig.ucode
# reboot
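The rename variant is the safer of the two because it can be undone. A slightly more defensive version might look like the sketch below. The function name disable_ucode is hypothetical, the backup name orig.ucode is taken from the command above, and the code assumes a POSIX shell such as ksh or bash (the legacy Bourne /bin/sh may not accept this syntax).

```shell
# Move a microcode directory aside so it can be restored later.
# $1: path to the ucode directory (on my system: /platform/i86pc/ucode)
disable_ucode() {
    ucode_dir=$1
    # The backup lives next to the original, under the name "orig.ucode".
    backup_dir="$(dirname "$ucode_dir")/orig.ucode"

    if [ ! -d "$ucode_dir" ]; then
        echo "nothing to do: $ucode_dir does not exist" >&2
        return 1
    fi
    if [ -e "$backup_dir" ]; then
        # Do not clobber a backup left by an earlier run.
        echo "refusing to overwrite existing $backup_dir" >&2
        return 1
    fi
    mv "$ucode_dir" "$backup_dir"
}
```

It would be called as root with disable_ucode /platform/i86pc/ucode, followed by a reboot. Moving orig.ucode back to ucode restores the original state.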

The alternative workaround is to disable multiprocessing, which is absolutely unacceptable. Removing the Intel microcode did help. According to the bug description (see Sources below), the bug was fixed in snv_72; however, I never saw the problem before snv_92.

Update 2008-07-07 @00:05:13: Looks like this bug is fixed in snv_93

Update 2008-07-14 @18:33:56: Nope. The bug is not fixed, and the workaround does not work either. The system hangs and crashes three or four times a day regardless of the microcode. I would say, if you have snv_88 installed, don’t rush to upgrade.

Update 2008-07-18 @14:03:38: I suspect it’s related to a zone with an exclusive TCP/IP stack. I shut the zone down for several days and there were no crashes. It is probably also related to X-forwarding from the zone, heavy traffic or memory use. I noticed that the speed of the interface dedicated to the zone (rge0) has dropped from 1 Gb/s to 100 Mb/s since I upgraded from b88 to b92.

Update 2008-08-23 @15:11:26: Upgrading to b95 solved the ATA/SCSI problem. I also figured out that the I/OAT module crashes the system. Not sure if the bugs are related, though.

Sources:
http://www.opensolaris.org/jive/thread.jspa?messageID=234154
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6586621
