IOAT fatal failures

See I/OAT description here.

Solaris (at least snv_93 and snv_95) panics if a large amount of traffic is sent across an interface dedicated to a zone with an exclusive TCP/IP stack. For example, start Firefox and forward the X traffic to the global zone.

# more /var/adm/messages
. . .
Aug  5 17:26:02 xeon ioat: [ID 850821 kern.warning] WARNING: channel(1) fatal failure! chanstat_lo=0x1048583; chanerr=0x2
Aug  5 17:26:13 xeon unix: [ID 836849 kern.notice]
Aug  5 17:26:13 xeon ^Mpanic[cpu3]/thread=ffffff01eac13020:
Aug  5 17:26:13 xeon genunix: [ID 103648 kern.notice] mutex_exit: not owner, lp=ffffff01cd046f18 owner=0 thread=ffffff01eac13020
Aug  5 17:26:13 xeon unix: [ID 100000 kern.notice]
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903660 unix:mutex_panic+73 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903690 unix:mutex_vector_exit+41 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903710 ioat:ioat_cmd_post+227 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903740 dcopy:dcopy_cmd_post+58 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff00089037e0 genunix:uioamove+1b4 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903840 genunix:struioainit+61 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903a20 genunix:strget+270 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903b30 genunix:kstrgetmsg+2ea ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903c40 sockfs:sotpi_recvmsg+350 ()
Aug  5 17:26:13 xeon genunix: [ID 655072 kern.notice] ffffff0008903cd0 sockfs:socktpi_read+79 ()
Aug  5 17:26:14 xeon genunix: [ID 655072 kern.notice] ffffff0008903d40 genunix:fop_read+69 ()
Aug  5 17:26:14 xeon genunix: [ID 655072 kern.notice] ffffff0008903e90 genunix:read+28b ()
Aug  5 17:26:14 xeon genunix: [ID 655072 kern.notice] ffffff0008903ec0 genunix:read32+1e ()
Aug  5 17:26:14 xeon genunix: [ID 655072 kern.notice] ffffff0008903f10 unix:brand_sys_sysenter+1e6 ()
Aug  5 17:26:14 xeon unix: [ID 100000 kern.notice]
Aug  5 17:26:14 xeon genunix: [ID 672855 kern.notice] syncing file systems…
Aug  5 17:26:14 xeon genunix: [ID 733762 kern.notice]  38
Aug  5 17:26:15 xeon genunix: [ID 733762 kern.notice]  8
Aug  5 17:26:16 xeon genunix: [ID 733762 kern.notice]  1
Aug  5 17:26:37 xeon last message repeated 20 times
Aug  5 17:26:38 xeon genunix: [ID 622722 kern.notice]  done (not all i/o completed)
Aug  5 17:26:39 xeon genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c1d0s1, offset 860356608, content: kernel
Aug  5 17:26:51 xeon genunix: [ID 409368 kern.notice] ^M100% done: 216800 pages dumped, compression ratio 4.42,
Aug  5 17:26:51 xeon genunix: [ID 851671 kern.notice] dump succeeded
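
To check whether a machine has already logged the same channel failure, the messages file can be searched for the ioat warning shown above. This is only a minimal sketch: the function name count_ioat_failures is hypothetical, and the pattern assumes the warning text matches the log line above.

```shell
# Count ioat "fatal failure" warnings in a syslog-style file.
# count_ioat_failures is a hypothetical helper name, not a Solaris tool.
count_ioat_failures() {
    logfile=${1:-/var/adm/messages}
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'ioat: .*fatal failure' "$logfile"
}
```

Called without an argument, it reads /var/adm/messages; a rotated log can be passed as the first argument instead.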

# cd /var/crash/`uname -n`
# mdb -k *.10
Loading modules: [ unix genunix specfs dtrace cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba uhci fctl nca lofs zfs sd md cpc random crypto fcip fcp smbsrv nfs logindmux ptm sppp nsctl sdbc sv ii rdc nsmb ipc ]
> $c
vpanic()
mutex_panic+0x73(fffffffffb9051d8, ffffff01cd046968)
mutex_vector_exit+0x41(ffffff01cd046968)
ioat_cmd_post+0x227(ffffff01cdae84b8, ffffff01f1e89c40)
dcopy_cmd_post+0x58(ffffff01f1e89c40)
uioamove+0x1b4(ffffff01e1eb2040, bc, 0, ffffff021d9a3830)
struioainit+0x61(ffffff02011b3aa8, ffffff021d9a37f8, ffffff021d9a3830)
strget+0x270(ffffff01fab44de0, ffffff02011b3aa8, ffffff021d9a3830, 1, ffffff0008ae1af8)
kstrgetmsg+0x2ea(ffffff01fe052b00, ffffff0008ae1b90, ffffff021d9a3830, ffffff0008ae1c20, ffffff0008ae1c1c, ffffffffffffffff)
sotpi_recvmsg+0x350(ffffff01fe0603c0, ffffff0008ae1c70, ffffff0008ae1e20)
socktpi_read+0x79(ffffff01fe052b00, ffffff0008ae1e20, 0, ffffff0214b9bec8, 0)
fop_read+0x69(ffffff01fe052b00, ffffff0008ae1e20, 0, ffffff0214b9bec8, 0)
read+0x28b(13, 8043cd4, 4000)
read32+0x1e(13, 8043cd4, 4000)
_sys_sysenter_post_swapgs+0x14b()
> $q

I have tried the following:

  • From http://bugs.opensolaris.org/view_bug.do?bug_id=6722595

    Added -D disable-ioat=true to the kernel line in the GRUB menu.lst.

    Did not help.
  • Replaced the rge (Realtek) network interface with an e1000g (Intel PRO/1000), because IOAT is supported on the Intel platform only. Did not help.
  • Moved the e1000g to another PCI-X slot to get rid of IRQ16 errors. Did not help.

And now the working workarounds:

  • Unload the ioat module (live system):
    # modinfo | fgrep ioat
    135 fffffffff7d49000 2e70 291 1 ioat (ioat driver v1.2)
    # modunload -i 135
    # modinfo | fgrep ioat
    #
  • Remove the SUNWdcopy package:
    # pkginfo SUNWdcopy
    system SUNWdcopy Sun dcopy DMA drivers
    # pkgrm SUNWdcopy
    . . .
  • Simply remove (move to another directory) the files:
    /platform/i86pc/kernel/drv/amd64/ioat
    /platform/i86pc/kernel/drv/ioat
    /platform/i86pc/kernel/drv/ioat.conf
    /platform/i86xpv/kernel/drv/amd64/ioat
    /platform/i86xpv/kernel/drv/ioat
    /platform/i86xpv/kernel/drv/ioat.conf
  • Edit the /etc/system file to prevent the driver from loading:
    # vi /etc/system
    . . .
    exclude: ioat
    . . .
    # reboot
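
The module-unload and file-move workarounds above can be scripted roughly as follows. This is a sketch, not a tested procedure: both function names, the alternate-root argument, and the default backup directory /var/tmp/ioat.disabled are hypothetical, and the modinfo parsing assumes the column layout shown in the output above (module name in the sixth field).

```shell
# Both helper names below are hypothetical, not Solaris commands.

# Print the module ID of the ioat driver from modinfo output on stdin.
# Assumes the modinfo columns: Id Loadaddr Size Info Rev ModuleName.
ioat_module_id() {
    awk '$6 == "ioat" { print $1 }'
}

# Move the ioat driver files into a backup directory instead of deleting them.
# $1: optional alternate root (empty means /); $2: backup directory.
move_ioat_files() {
    root=${1:-}
    backup=${2:-/var/tmp/ioat.disabled}
    for f in \
        platform/i86pc/kernel/drv/amd64/ioat \
        platform/i86pc/kernel/drv/ioat \
        platform/i86pc/kernel/drv/ioat.conf \
        platform/i86xpv/kernel/drv/amd64/ioat \
        platform/i86xpv/kernel/drv/ioat \
        platform/i86xpv/kernel/drv/ioat.conf
    do
        # Skip files that are not present on this system.
        if [ -f "$root/$f" ]; then
            dir=`dirname "$f"`
            mkdir -p "$backup/$dir" || return 1
            mv "$root/$f" "$backup/$f" || return 1
        fi
    done
}

# Example use on a live system (as root):
#   id=`modinfo | ioat_module_id`
#   [ -n "$id" ] && modunload -i "$id"
#   move_ioat_files
```

Keeping the relative paths under the backup directory makes it straightforward to move the files back if the driver is needed again.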
