Kevin Brosius on Wed, 12 Mar 2003 09:11:05 -0500



Re: [PLUG] Hard Drives Crashing


Kris Reilly wrote:
> 
> Hello All!
> 
> I am hoping that someone else has encountered this problem and been able
> to diagnose it effectively.
> 
> Under heavy load I have been experiencing a 50% failure rate.  The
> problem has appeared in machines configured with both SCSI and IDE
> drives.  The test configuration in question is the IDE setup.
> 
> We pound the machines with web requests, we generate large logs then we
> crunch them.  Crunching is very disk intensive and the drives stop
> responding.  Errors that appear in the logs are attached below.
> 
> The machines are P4 Xeon 1.2GHz x 4 with 6GB RAM.  The drives that are
> crashing are 120G IDE.  They fail as secondary on IDE0 and also as
> primary on IDE1.  They experience the same failure using both ext2 and
> ext3.  The machines are running RedHat 7.3, kernel version
> 2.4.18-18.7x.bigmem.  I have just updated one of the boxes to
> 2.4.18-24.7x, custom compiling the kernel and leaving out any
> unnecessary cruft and am waiting to see when it crashes again.
> 
> My next approach is to use hdparm and/or muck with the proc fs though
> the logs seem to suggest that this problem is directly related to
> hardware and not operating system limitations.
> 
> Does anyone have any suggestions?
> 
> Thanks!
> Kris Reilly
> 
> **Disks that are crashing:
> 
> http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=703&p_created=1037222838
> 
> **Disks crash with this error in the logs:
> 
> Message from syslogd@105 at Fri Mar  7 19:10:18 2003 ...
> 105 kernel: Assertion failure in do_get_write_access() at transaction.c:737:
> "(((jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0)"
> 
> Message from syslogd@103 at Sat Mar  8 06:19:41 2003 ...
> 103 kernel: Assertion failure in do_get_write_access() at transaction.c:737:
> "(((jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0)"
> 
> **Just before the crash this is what dmesg has:
> 
> ))
> [<c0146534>] bread [kernel] 0x24 (0xd58f3d2c))
> [<f881e5a5>] ext3_get_branch [ext3] 0x55 (0xd58f3d50))
> [<f880dd6f>] journal_get_write_access_Rsmp_78dc75e5 [jbd] 0x3f (0xd58f3d68))
> [<f881ed55>] ext3_get_block_handle [ext3] 0x205 (0xd58f3d7c))
> [<f880e241>] journal_dirty_metadata_Rsmp_fb9ecae4 [jbd] 0x61 (0xd58f3de4))
> [<c0146772>] create_buffers [kernel] 0x62 (0xd58f3de8))
> [<f881ee7c>] ext3_get_block [ext3] 0x5c (0xd58f3e0c))
> [<c0146d19>] __block_prepare_write [kernel] 0xe9 (0xd58f3e2c))
> [<f8821555>] ext3_mark_iloc_dirty [ext3] 0x25 (0xd58f3e5c))
> [<f8816310>] .rodata.str1.1 [jbd] 0x30 (0xd58f3e6c))
> [<c0147675>] block_prepare_write [kernel] 0x25 (0xd58f3e80))
> [<f881ee20>] ext3_get_block [ext3] 0x0 (0xd58f3e94))
> [<f880d39d>] journal_start_Rsmp_171b1921 [jbd] 0x7d (0xd58f3ea0))
> [<f881f3a5>] ext3_prepare_write [ext3] 0xd5 (0xd58f3eb0))
> [<f881ee20>] ext3_get_block [ext3] 0x0 (0xd58f3ec0))
> [<c01343ed>] generic_file_write [kernel] 0x4ed (0xd58f3ee8))
> [<c0156be4>] fcntl_setlk [kernel] 0x1a4 (0xd58f3f3c))
> [<f881cc32>] ext3_file_write [ext3] 0x22 (0xd58f3f5c))
> [<c01440f6>] sys_write [kernel] 0x96 (0xd58f3f7c))
> [<c0152e9d>] sys_fcntl64 [kernel] 0x8d (0xd58f3fac))
> [<c0108c73>] system_call [kernel] 0x33 (0xd58f3fc0))
> 
> Code: 0f 0b e1 02 f0 62 81 f8 83 c4 14 8b 44 24 34 8b 08 b8 00 e0
>  end_request: I/O error, dev 03:41 (hdb), sector 67895384
> EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> inode block - inode=4243506, block=8486923
> end_request: I/O error, dev 03:41 (hdb), sector 181670032
> end_request: I/O error, dev 03:41 (hdb), sector 181670040
> end_request: I/O error, dev 03:41 (hdb), sector 181670096
> end_request: I/O error, dev 03:41 (hdb), sector 181670128
> end_request: I/O error, dev 03:41 (hdb), sector 0
> EXT3-fs error (device ide0(3,65)) in ext3_reserve_inode_write: IO
> failure
> end_request: I/O error, dev 03:41 (hdb), sector 18752
> end_request: I/O error, dev 03:41 (hdb), sector 37224528
> EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> inode block - inode=2326537, block=4653066
> EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> inode block - inode=2326532, block=4653066
> end_request: I/O error, dev 03:41 (hdb), sector 63941840
> end_request: I/O error, dev 03:41 (hdb), sector 176160848
> EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> inode block - inode=11010060, block=22020106
> end_request: I/O error, dev 03:41 (hdb), sector 181669984
> end_request: I/O error, dev 03:41 (hdb), sector 181670016
> end_request: I/O error, dev 03:41 (hdb), sector 181670032
> end_request: I/O error, dev 03:41 (hdb), sector 181670040
> end_request: I/O error, dev 03:41 (hdb), sector 181670096
> end_request: I/O error, dev 03:41 (hdb), sector 0
> EXT3-fs error (device ide0(3,65)) in ext3_reserve_inode_write: IO
> failure
> end_request: I/O error, dev 03:41 (hdb), sector 181670016
> end_request: I/O error, dev 03:41 (hdb), sector 181670032
> end_request: I/O error, dev 03:41 (hdb), sector 181670040
> 
> ... many more of these end_request errors ...


What does the vendor say about the drive failures?  Worst case, some
drives aren't rated for a 100% duty cycle and you'll need better drives.
Best case, you just aren't keeping them cool enough.  Do the systems have
adequate airflow and cooling for the drive bays?
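
Before calling the vendor it may help to know whether the I/O errors
cluster in one area of the disk or are scattered across it.  Here is a
minimal Python sketch that tallies the end_request lines quoted above by
device and sector.  The function name summarize_io_errors and the regular
expression are my own illustration, written against the log format shown
in this thread only; other kernels may word the message differently.

```python
import re
from collections import Counter

# Matches lines in the format quoted in this thread, e.g.:
#   end_request: I/O error, dev 03:41 (hdb), sector 67895384
# The pattern is an assumption based only on those quoted lines.
IO_ERROR = re.compile(
    r"end_request: I/O error, dev [0-9a-fA-F]+:[0-9a-fA-F]+ "
    r"\((\w+)\), sector (\d+)"
)


def summarize_io_errors(lines):
    """Count I/O errors per (device name, sector) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)  # search() tolerates a leading "> "
        if match:
            device, sector = match.group(1), int(match.group(2))
            counts[(device, sector)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "> end_request: I/O error, dev 03:41 (hdb), sector 67895384",
        "> end_request: I/O error, dev 03:41 (hdb), sector 181670032",
        "> end_request: I/O error, dev 03:41 (hdb), sector 181670040",
        "> end_request: I/O error, dev 03:41 (hdb), sector 0",
        "> end_request: I/O error, dev 03:41 (hdb), sector 181670032",
    ]
    counts = summarize_io_errors(sample)
    sectors = sorted(sector for (_, sector) in counts)
    print("distinct failing sectors:", len(sectors))
    print("lowest:", sectors[0], "highest:", sectors[-1])
    for (device, sector), n in counts.most_common(3):
        print(device, sector, n)
```

In the quoted log the failures run from sector 0 up past sector 181670000,
which could mean the whole drive stopped answering rather than one bad
patch of the platter, though that is only a guess from a short excerpt.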

-- 
Kevin Brosius
_________________________________________________________________________
Philadelphia Linux Users Group        --       http://www.phillylinux.org
Announcements - http://lists.netisland.net/mailman/listinfo/plug-announce
General Discussion  --   http://lists.netisland.net/mailman/listinfo/plug