Kris Reilly on Wed, 12 Mar 2003 09:29:12 -0500



Re: [PLUG] Hard Drives Crashing


On Wed, 2003-03-12 at 09:10, Kevin Brosius wrote:
> Kris Reilly wrote:
> > 
> > Hello All!
> > 
> > I am hoping that someone else has encountered this problem and been able
> > to diagnose it effectively.
> > 
> > Under heavy load I have been experiencing a 50% failure rate.  The
> > problem has appeared in machines configured with both SCSI and IDE
> > drives.  The test configuration in question is the IDE setup.
> > 
> > We pound the machines with web requests, generate large logs, and then
> > crunch them.  Crunching is very disk intensive, and the drives stop
> > responding.  Errors that appear in the logs are attached below.
> > 
> > The machines are P4 Xeon 1.2GHz x 4 with 6GB RAM.  The drives that are
> > crashing are 120G IDE.  They fail as secondary on IDE0 and also as
> > primary on IDE1.  They experience the same failure using both ext2 and
> > ext3.  The machines are running RedHat 7.3, kernel version
> > 2.4.18-18.7x.bigmem.  I have just updated one of the boxes to
> > 2.4.18-24.7x, custom compiling the kernel and leaving out any
> > unnecessary cruft and am waiting to see when it crashes again.
> > 
> > My next approach is to use hdparm and/or muck with the proc fs, though
> > the logs seem to suggest that this problem is directly related to
> > hardware and not operating system limitations.
> > 
> > Does anyone have any suggestions?
> > 
> > Thanks!
> > Kris Reilly
> > 
> > **Disks that are crashing:
> > 
> > http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=703&p_created=1037222838
> > 
> > **Disks crash with this error in the logs:
> > 
> > Message from syslogd@105 at Fri Mar  7 19:10:18 2003 ...
> > 105 kernel: Assertion failure in do_get_write_access() at
> > transaction.c:737:
> > "((
> > (jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0)"
> > 
> > Message from syslogd@103 at Sat Mar  8 06:19:41 2003 ...
> > 103 kernel: Assertion failure in do_get_write_access() at
> > transaction.c:737:
> > "((
> > (jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0)"
> > 
> > **Just before the crash this is what dmesg has:
> > 
> > ))
> > [<c0146534>] bread [kernel] 0x24 (0xd58f3d2c))
> > [<f881e5a5>] ext3_get_branch [ext3] 0x55 (0xd58f3d50))
> > [<f880dd6f>] journal_get_write_access_Rsmp_78dc75e5 [jbd] 0x3f
> > (0xd58f3d68))
> > [<f881ed55>] ext3_get_block_handle [ext3] 0x205 (0xd58f3d7c))
> > [<f880e241>] journal_dirty_metadata_Rsmp_fb9ecae4 [jbd] 0x61
> > (0xd58f3de4))
> > [<c0146772>] create_buffers [kernel] 0x62 (0xd58f3de8))
> > [<f881ee7c>] ext3_get_block [ext3] 0x5c (0xd58f3e0c))
> > [<c0146d19>] __block_prepare_write [kernel] 0xe9 (0xd58f3e2c))
> > [<f8821555>] ext3_mark_iloc_dirty [ext3] 0x25 (0xd58f3e5c))
> > [<f8816310>] .rodata.str1.1 [jbd] 0x30 (0xd58f3e6c))
> > [<c0147675>] block_prepare_write [kernel] 0x25 (0xd58f3e80))
> > [<f881ee20>] ext3_get_block [ext3] 0x0 (0xd58f3e94))
> > [<f880d39d>] journal_start_Rsmp_171b1921 [jbd] 0x7d (0xd58f3ea0))
> > [<f881f3a5>] ext3_prepare_write [ext3] 0xd5 (0xd58f3eb0))
> > [<f881ee20>] ext3_get_block [ext3] 0x0 (0xd58f3ec0))
> > [<c01343ed>] generic_file_write [kernel] 0x4ed (0xd58f3ee8))
> > [<c0156be4>] fcntl_setlk [kernel] 0x1a4 (0xd58f3f3c))
> > [<f881cc32>] ext3_file_write [ext3] 0x22 (0xd58f3f5c))
> > [<c01440f6>] sys_write [kernel] 0x96 (0xd58f3f7c))
> > [<c0152e9d>] sys_fcntl64 [kernel] 0x8d (0xd58f3fac))
> > [<c0108c73>] system_call [kernel] 0x33 (0xd58f3fc0))
> > 
> > Code: 0f 0b e1 02 f0 62 81 f8 83 c4 14 8b 44 24 34 8b 08 b8 00 e0
> >  end_request: I/O error, dev 03:41 (hdb), sector 67895384
> > EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> > inode block - inode=4243506, block=8486923
> > end_request: I/O error, dev 03:41 (hdb), sector 181670032
> > end_request: I/O error, dev 03:41 (hdb), sector 181670040
> > end_request: I/O error, dev 03:41 (hdb), sector 181670096
> > end_request: I/O error, dev 03:41 (hdb), sector 181670128
> > end_request: I/O error, dev 03:41 (hdb), sector 0
> > EXT3-fs error (device ide0(3,65)) in ext3_reserve_inode_write: IO
> > failure
> > end_request: I/O error, dev 03:41 (hdb), sector 18752
> > end_request: I/O error, dev 03:41 (hdb), sector 37224528
> > EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> > inode block - inode=2326537, block=4653066
> > EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> > inode block - inode=2326532, block=4653066
> > end_request: I/O error, dev 03:41 (hdb), sector 63941840
> > end_request: I/O error, dev 03:41 (hdb), sector 176160848
> > EXT3-fs error (device ide0(3,65)): ext3_get_inode_loc: unable to read
> > inode block - inode=11010060, block=22020106
> > end_request: I/O error, dev 03:41 (hdb), sector 181669984
> > end_request: I/O error, dev 03:41 (hdb), sector 181670016
> > end_request: I/O error, dev 03:41 (hdb), sector 181670032
> > end_request: I/O error, dev 03:41 (hdb), sector 181670040
> > end_request: I/O error, dev 03:41 (hdb), sector 181670096
> > end_request: I/O error, dev 03:41 (hdb), sector 0
> > EXT3-fs error (device ide0(3,65)) in ext3_reserve_inode_write: IO
> > failure
> > end_request: I/O error, dev 03:41 (hdb), sector 181670016
> > end_request: I/O error, dev 03:41 (hdb), sector 181670032
> > end_request: I/O error, dev 03:41 (hdb), sector 181670040
> > 
> > ... many more of these end_request errors ...
> 
> 
> What does the vendor say about the drive failures?  Worst case, some
> drives aren't rated for 100% usage and you'll need better drives.  Best
> case, you aren't keeping them cool enough.  Do the systems have adequate
> circulation & cooling for the drive bays?

We are waiting to hear what the vendor has to say.

The boxes don't seem to be any warmer than other boxes doing the same or
less work.

I agree that the hardware may be faulty or just plain inadequate for
this application.  I have new drives on order.  I am just concerned
because we have seen this problem on different boxes, in both SCSI and
IDE configurations.
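
One way to triage logs like the ones quoted above is to tally the
end_request errors by disk and by region of the disk.  The following is
only a minimal sketch: the names ERR_RE, parse_io_errors and summarize,
and the bucket size of one million sectors, are made up for illustration,
and the pattern assumes the 2.4-era message format shown in the quoted
dmesg output.  It also decodes the "dev 03:41" field, which is printed in
hex (major 3, minor 65, matching the "ide0(3,65)" in the EXT3 messages).

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted dmesg output:
#   end_request: I/O error, dev 03:41 (hdb), sector 67895384
ERR_RE = re.compile(
    r"end_request: I/O error, dev ([0-9a-fA-F]{2}):([0-9a-fA-F]{2}) "
    r"\((\w+)\), sector (\d+)"
)

def parse_io_errors(text):
    """Return a list of (major, minor, disk, sector) tuples found in text."""
    errors = []
    for m in ERR_RE.finditer(text):
        major = int(m.group(1), 16)  # device numbers are printed in hex
        minor = int(m.group(2), 16)
        errors.append((major, minor, m.group(3), int(m.group(4))))
    return errors

def summarize(errors, bucket=1_000_000):
    """Count errors per (disk, sector // bucket).

    The bucket size is an arbitrary choice for illustration only.
    """
    counts = Counter()
    for _major, _minor, disk, sector in errors:
        counts[(disk, sector // bucket)] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (hypothetical): dmesg | python tally_errors.py
    for (disk, region), n in sorted(summarize(parse_io_errors(sys.stdin.read())).items()):
        print(f"{disk} sectors {region}M-{region + 1}M: {n} errors")
```

Because the pattern is searched anywhere in the text, quoted lines with
"> > " prefixes are matched too.  If the counts pile up in one or two
buckets, that may point toward a bad region on the platter; errors
scattered across the whole disk, sector 0 included, may point elsewhere,
though either reading is an inference and not a diagnosis.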

-- 
Kris Reilly <kar@ramblingredneck.com>
