Mark Dominus on 6 Dec 2007 16:16:05 -0000



Re: [PLUG] [plug-announce] December 5th, 2007: "What's a file?" presented by Mark Jason Dominus


Jonathan Bringhurst:
> I just wanted to thank Mark for a great presentation. 

Thanks!  I am glad you liked it.  I hope I did not harass any one
person unduly.  If so, I apologize.

> As for the use of O_SYNC to prevent the kernel from buffering data
> to reduce context switches, I just wanted to mention that it's only
> for writing, I dunno why I was thinking it was for read()s
> there. It's only used in obscure cases anyway.

Yeah.  We did discuss this a bit during the talk.  The basic issue is
that when your process asks the kernel to write data:

        int bytes_written = write(file_descriptor,
                                  buffer,
                                  n_bytes);

the kernel normally copies the data from your buffer into a kernel
buffer and then reports success back to the process immediately, even
though the data is not on the disk yet.

Normally, the kernel writes out the buffer in due time, and the data
makes it to the disk, and you are happy because your process got to go
ahead and do some more work without having to wait for the disk, which
could take milliseconds.  ("A long time", as I so quaintly called it
yesterday.)  If some other process reads the data before it is
written, that is okay, because the kernel can give it the updated data
out of the buffer.

But if there is a catastrophe, say a power failure, then this
asynchronous writing technique has a serious problem:  you find out
that the data, which your process thought had been written, has been
lost.  

So there are a number of mechanisms in place to deal with this.  The
oldest is the "sync()" system call, which marks all the kernel buffers
to be written out ASAP.  All unix systems run a program called "init",
and one of init's primary duties is to call sync() every thirty
seconds or so, to make sure that the kernel buffers get flushed to
disk at least every thirty seconds, so that no crash will lose more
than about thirty seconds' worth of data.
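
In outline, that periodic flushing amounts to a loop like the sketch
below.  (This is just an illustration, not init's actual source.)

        #include <unistd.h>

        /* Schedule all dirty kernel buffers for writing, then wait
           about thirty seconds and do it again. */
        int main(void) {
            for (;;) {
                sync();
                sleep(30);
            }
        }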

(There is also a command-line program "sync" which just does a sync()
call and then exits, and old-time Unix sysadmins are in the habit of
halting the system with
        # sync
        # sync
        # sync
        # halt
because typing the second and third syncs gives the kernel time to
finish writing out the buffers that the first "sync" scheduled.  Although I
suspect that few of them know why they do this.  I swear I am not
making this up.)

But for really crucial data, sync() is not enough, because, although
it schedules the dirty kernel buffers to be written out, it *still*
returns before the data has actually reached the disk.

So there is also an fsync() call.  The process gives it a file
descriptor, and fsync() makes the process wait until all the buffers
associated with that descriptor have been written to disk, returning
success only if they have:

        if (fsync(fd)) {
          /* uh-oh, couldn't write the data! */
        } else { 
          /* data is now on the disk */
        }

The mail delivery agent will use this when it is writing your email to
your mailbox, to make sure that no mail is lost.
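
In outline the delivery code does something like the sketch below.
(The mailbox path and the deliver() function are made up for the
example; a real delivery agent also locks the mailbox and so on.)

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        int deliver(const char *message) {
            int fd = open("/var/mail/mjd", O_WRONLY | O_APPEND);
            if (fd < 0)
                return -1;            /* couldn't even open the mailbox */
            if (write(fd, message, strlen(message)) < 0 || fsync(fd)) {
                close(fd);
                return -1;            /* the mail is NOT safely on the disk */
            }
            return close(fd);         /* the data made it to the disk */
        }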

Then there's an O_SYNC flag that the process can supply when it opens
the file for writing:

        int fd = open("blookus", O_WRONLY | O_SYNC);

This sets the O_SYNC flag in the file pointer object; whenever data is
written to this file pointer, the kernel, contrary to its usual
practice, will implicitly fsync() the descriptor after each write.
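
So, roughly speaking, writing through an O_SYNC descriptor behaves
like an ordinary write followed by an fsync().  Something like this
sketch, reusing buffer and n_bytes from the earlier example:

        /* With O_SYNC, write() does not return until the data is
           safely on the disk... */
        int fd = open("blookus", O_WRONLY | O_SYNC);
        write(fd, buffer, n_bytes);

        /* ...which is roughly equivalent to: */
        int fd2 = open("blookus", O_WRONLY);
        write(fd2, buffer, n_bytes);
        fsync(fd2);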

There's an interesting question that arises with this:  suppose you
fsync() a file.  That guarantees that the data will be written.  But
does it also guarantee that the mtime and the file extent of the file
will be updated?

On most systems, yes.  But on recent versions of Linux's ext2
filesystem, no.  Linus himself broke this as a sacrifice to the false
god of efficiency, a very bad decision in my opinion.

> Another random thing is the catching of interrupts when doing a system
> call, most notably EINTR. 

Have you ever read Richard Gabriel's  essay on "Worse is Better"?
This is his big example of how Unix is Worse.  

> if the return from the syscall is < 0 you need to check errno and handle
> it. 

Ah, but you don't need to check that if your program never catches
signals.  And that is why it sucks for programs to catch signals.
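
For the record, the usual defensive idiom looks something like the
sketch below; write_all() is just a name I made up for the example.

        #include <errno.h>
        #include <unistd.h>

        /* Keep writing until all n bytes are out, retrying when a
           signal interrupts the system call partway through. */
        ssize_t write_all(int fd, const char *buf, size_t n) {
            size_t done = 0;
            while (done < n) {
                ssize_t got = write(fd, buf + done, n - done);
                if (got < 0) {
                    if (errno == EINTR)
                        continue;     /* interrupted; just try again */
                    return -1;        /* a real error */
                }
                done += got;
            }
            return (ssize_t) done;
        }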
