Debugging mpt3sas, LSI-9305-16i, and SSDs

Hovsep Levi hovsep.sanjay.levi at gmail.com
Wed Apr 15 16:29:33 UTC 2020


Hello all,


I'm having trouble with an LSI-9305-16i connected to an SSD software RAID
array and would like to find the best mailing list to discuss it.  (LKML
seems appropriate?)

During moderate to heavy I/O the controller faults and the driver resets
the PCI card and the attached drives.  Sometimes only a few of the drives
are reset.  While the reset is in progress I/O stalls and performance
suffers.
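
To put a number on the stall, the kernel timestamps in the error logs at the
end of this message can be compared.  The following is only a minimal sketch:
the function name reset_windows and the choice of start and end markers
("fault_state(" and "mpt3sas_base_hard_reset_handler: SUCCESS") are my own
assumptions about which log lines bracket a reset, not anything defined by
the driver.

```python
import re

# "[20595.176624] message" -> timestamp in seconds since boot, then the message.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")

# Assumed markers for the start and end of one controller reset.
FAULT_MARK = "fault_state("
DONE_MARK = "mpt3sas_base_hard_reset_handler: SUCCESS"


def reset_windows(lines):
    """Return one (start, end, duration) tuple per reset found in the log lines."""
    windows = []
    started = None
    for line in lines:
        match = STAMP.search(line)
        if match is None:
            continue  # wrapped continuation line or unrelated text
        stamp, message = float(match.group(1)), match.group(2)
        if FAULT_MARK in message and started is None:
            started = stamp
        elif DONE_MARK in message and started is not None:
            windows.append((started, stamp, stamp - started))
            started = None
    return windows


# Three lines taken from the error logs in this message.
sample = [
    "kernel: [20595.176624] mpt3sas_cm0: fault_state(0x4203)!",
    "kernel: [20595.178033] mpt3sas_cm0: sending diag reset !!",
    "kernel: [20605.191636] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS",
]

for begin, end, length in reset_windows(sample):
    print(f"reset from {begin:.3f} to {end:.3f} took {length:.2f} s")
```

For the reset logged below, the two timestamps are roughly 10 seconds apart,
most of it spent between "sending port enable" and "port enable: SUCCESS".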

The drives are Samsung 860 QVO SSDs.


While troubleshooting this issue I have:

- Updated the firmware and BIOS
- Updated the kernel to mainline with the most recent mpt3sas driver
- Tried different combinations of software RAID array sizes
- Disabled MSI-X, then later re-enabled it
- Physically reduced the HBA drive count from 12 to 6


So far there seem to be fewer resets when using several smaller raid0
arrays instead of a single larger array.  For example, 3 raid0 arrays of 2
drives each vs a single array of 6 or more drives.  The intent behind that
approach was to experiment with the underlying behavior of SATA/SAS in hopes
of finding some sort of threshold.  I've read many past bug reports for
mpt3sas and have found a few similar ones.  It seems a diagnostic reset is
common for certain underlying faults. [1]

I'd like to know what "fault_state(0x4203)" corresponds to, so I've
downloaded the kernel source to read through.
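
As a starting point for reading the source, the place where the driver prints
that string can be located with a small search.  This is a rough sketch only:
the helper name find_in_sources and the checkout path
"linux/drivers/scsi/mpt3sas" are assumptions about where the tree was
unpacked, and finding the print site does not by itself explain the value.

```python
from pathlib import Path


def find_in_sources(root, needle, suffixes=(".c", ".h")):
    """Return (path, line number, text) for every source line containing needle."""
    root = Path(root)
    if not root.is_dir():
        return []  # nothing to search if the tree is not where we expected
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical checkout location; adjust to the real path of the source tree.
    for path, number, line in find_in_sources("linux/drivers/scsi/mpt3sas",
                                              "fault_state("):
        print(f"{path}:{number}: {line}")
```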

Any advice is appreciated.  Thanks!


[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810781

-Hovsep



Error logs:

kernel: [20595.175129] mpt3sas_cm0 fault info from func: _base_make_ioc_ready
kernel: [20595.176624] mpt3sas_cm0: fault_state(0x4203)!
kernel: [20595.178033] mpt3sas_cm0: sending diag reset !!
kernel: [20596.141963] mpt3sas_cm0: diag reset: SUCCESS
kernel: [20596.206234] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
kernel: [20596.360571] mpt3sas_cm0: _base_display_fwpkg_version: complete
kernel: [20596.360854] mpt3sas_cm0: LSISAS3224: FWVersion(16.00.01.00), ChipRevision(0x01), BiosVersion(08.37.00.00)
kernel: [20596.360856] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
kernel: [20596.360911] mpt3sas_cm0: sending port enable !!
kernel: [20605.190288] mpt3sas_cm0: port enable: SUCCESS
kernel: [20605.190380] mpt3sas_cm0: search for end-devices: start
kernel: [20605.191349] scsi target10:0:1: handle(0x0019), sas_addr(0x300062b200f275a4)
kernel: [20605.191352] scsi target10:0:1: enclosure logical id(0x500062b200f275a0), slot(7)
kernel: [20605.191392] scsi target10:0:2: handle(0x001a), sas_addr(0x300062b200f275a5)
kernel: [20605.191394] scsi target10:0:2: enclosure logical id(0x500062b200f275a0), slot(6)
kernel: [20605.191434] scsi target10:0:0: handle(0x001b), sas_addr(0x300062b200f275a6)
kernel: [20605.191436] scsi target10:0:0: enclosure logical id(0x500062b200f275a0), slot(4)
kernel: [20605.191475] scsi target10:0:3: handle(0x001c), sas_addr(0x300062b200f275a7)
kernel: [20605.191477] scsi target10:0:3: enclosure logical id(0x500062b200f275a0), slot(5)
kernel: [20605.191517] scsi target10:0:4: handle(0x001d), sas_addr(0x300062b200f275b2)
kernel: [20605.191519] scsi target10:0:4: enclosure logical id(0x500062b200f275a0), slot(8)
kernel: [20605.191560] scsi target10:0:5: handle(0x001e), sas_addr(0x300062b200f275b3)
kernel: [20605.191562] scsi target10:0:5: enclosure logical id(0x500062b200f275a0), slot(9)
kernel: [20605.191625] mpt3sas_cm0: search for end-devices: complete
kernel: [20605.191628] mpt3sas_cm0: search for expanders: start
kernel: [20605.191629] mpt3sas_cm0: search for expanders: complete
kernel: [20605.191636] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
kernel: [20605.191638] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
kernel: [20605.191644] mpt3sas_cm0: removing unresponding devices: start
kernel: [20605.191644] mpt3sas_cm0: removing unresponding devices: end-devices
kernel: [20605.191646] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
kernel: [20605.191647] mpt3sas_cm0: removing unresponding devices: expanders
kernel: [20605.191648] mpt3sas_cm0: removing unresponding devices: complete
kernel: [20605.191651] mpt3sas_cm0: scan devices: start
kernel: [20605.191926] mpt3sas_cm0:     scan devices: expanders start
kernel: [20605.191979] mpt3sas_cm0:     break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
kernel: [20605.191980] mpt3sas_cm0:     scan devices: expanders complete
kernel: [20605.191981] mpt3sas_cm0:     scan devices: end devices start
kernel: [20605.193151] mpt3sas_cm0:     break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
kernel: [20605.193152] mpt3sas_cm0:     scan devices: end devices complete
kernel: [20605.193152] mpt3sas_cm0:     scan devices: pcie end devices start
kernel: [20605.193168] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
kernel: [20605.193186] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
kernel: [20605.193189] mpt3sas_cm0:     break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
kernel: [20605.193190] mpt3sas_cm0:     pcie devices: pcie end devices complete
kernel: [20605.193191] mpt3sas_cm0: scan devices: complete
kernel: [20605.316241] sd 10:0:0:0: Power-on or device reset occurred
kernel: [20605.316297] sd 10:0:3:0: Power-on or device reset occurred
kernel: [20605.316361] sd 10:0:1:0: Power-on or device reset occurred
kernel: [20605.316443] sd 10:0:4:0: Power-on or device reset occurred
kernel: [20605.316570] sd 10:0:2:0: Power-on or device reset occurred
kernel: [20605.318309] sd 10:0:5:0: Power-on or device reset occurred
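
The log_info and loginfo values in the logs above can be split into fields
by hand.  The sketch below assumes the layout implied by the driver's own
decoded line for 0x3003011d (originator IOP, code 0x03, sub_code 0x011d):
low 16 bits sub-code, next 8 bits code, next 4 bits originator.  Treating the
top 4 bits as a bus type, and the originator names PL and IR for values 1 and
2, are assumptions from my reading of the driver and not confirmed by these
logs.  The function names are mine.

```python
# Originator 0 is shown as IOP in the log; 1 and 2 are assumed names.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info value into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # assumed meaning of the top nibble
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def describe(value):
    """Format a value the same way the driver's log_info line appears in the log."""
    fields = decode_log_info(value)
    name = ORIGINATORS.get(fields["originator"], "unknown")
    return (f"log_info(0x{value:08x}): originator({name}), "
            f"code(0x{fields['code']:02x}), sub_code(0x{fields['sub_code']:04x})")


# The two distinct values that appear in the logs above.
for raw in (0x3003011D, 0x310F0400):
    print(describe(raw))
```

The first value reproduces the driver's own decoded line, which is a useful
sanity check on the layout.  Under the same assumptions, the second value
(0x310f0400, reported during the expander and end device scans) has a
different originator than the first, with code 0x0f and sub_code 0x0400.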