Endless Paradigm

Full Version: 25 year old bug fixed - seekdir()
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
When seekdir() Won't Seek to the Right Position

Quote:The other day, I got an email from Edd, an OpenBSD user, claiming that Samba would crash when serving files off an MS-DOS filesystem. This was Samba built from sources and not the one from ports. Since I use myself Samba a lot and for a quite large user base, I got interested in the issue and started investigating it.

What I found out in the end is a surprise and was not expected: A bug that has been there in all BSDs for almost all the time, since the 4.2BSD times or for roughly 25 years...
The Samba Directory Cache

Edd CCed his email to Volker and Jeremy, two Samba developers, so I got in contact with the right people quickly. Jeremy told me that Samba indeed uses a workaround, or replacement code, to access directories on the BSDs since the directory reading code in all BSDs was flawed. A claim that I could hardly believe at first and of course my first reaction was to blame Samba.

(Technically, Edd got hit by Samba's replacement code calling abort() when it hits a corner case. A patch in the OpenBSD (and FreeBSD) port removed that abort() call, so the Samba from ports would not stop working, but it would not work correctly either.)

Samba makes use of telldir()/readir(), seekdir()/readdir() to build an internal cache of (large) directories to speed up directory accesses by CIFS clients (Windows machines). To build up the cache, Samba interates over a directory using readdir() and storing the returned values in an internal data structure. The values are later used to seekdir() to the directory position and use readdir() to read in the data. Now, when between the filling of the cache and using the cache later, some files are unlinked from the directory, it can happen sometimes that a cache entry no longer points to the right file in the directory. This means that values returned by telldir() can become, under rare cicumstances, invalid when some files were unlinked. Very strange, indeed.
The Problem With seekdir()

In Samba's replacement code I found the following comment, stating the problem from the Samba point of view:

"This is needed because the existing directory handling in FreeBSD
and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink()
on files in a directory where telldir() has been used. On a block
boundary it will occasionally miss a file when seekdir() is used to
return to a position previously recorded with telldir().

This also fixes a severe performance and memory usage problem with
telldir() on BSD systems. Each call to telldir() in BSD adds an
entry to a linked list, and those entries are cleaned up on
closedir(). This means with a large directory closedir() can take an
arbitrary amount of time, causing network timeouts as millions of
telldir() entries are freed
"

Apparently there are two problems: seekdir() not returning to the position initially retrieved using telldir() and a performance problem.

Before digging to much into source code, I decided to check the documentation of said functions, just to make sure, they were being used like they are meant to be used.

The OpenBSD manual page is clear:

The seekdir() function sets the position of the next readdir() operation on the named directory stream dirp. The new position reverts to the one associated with the directory stream when the telldir() operation was performed. Values returned by telldir() are good only for the lifetime of the DIR pointer, dirp, from which they are derived. If the directory is closed and then reopened, the telldir() value may be invalidated due to undetected directory compaction.

If you don't close the directory stream, seekdir() will take you back to the position previsouly obtained using telldir(). If the system behaves differently, then it's either a bug or the documentation is wrong. If the system indeed does not take you back to the right position, why do wee have these functions then in the first place?

A look at the closedir() function in OpenBSD reveals that the statement made by the Samba people about performance in closedir() is wrong: In OpenBSD there is no linked list, but a smarter memory handling scheme implemented by Otto Moerbeek (otto@). FreeBSD, however, uses a linked list indeed, but they can switch to our code at any time. So the performance problem really is a non-issue.
Hunting the seekdir() Bug

As the original author of the *dir() library, you probably fixed
one of my bugs :-) Prior to the *dir() commands, programs just
opened, read, and interpreted directories directly. I had to
update a shocking 22 programs (a large percentage of the
programs available on UNIX at the time) to replace their direct
interpretation of directories with the *dir() library calls.
(Kirk McKusick; private communication)


What I needed is a test program to exercise these functions an eventually trigger the behaviour that prevented the Samba people from enabling a directory cache on BSD systems. I wrote a C program that approximately does the following steps:

   1. Create a directory and populate it with a certain number of files
   2. Iterate over the directory using readdir(), recording the position of the entry using telldir() before readdir(), and storing the obtained values in an array
   3. Delete a random number of files and also mark them as deleted in the in-memory array
   4. Iterate over the in-memory array, skipping the entries marked as deleted, and seekdir()
      to the others, doing a readdir() and compare the returned values with the in-memory copy
   5. Output a message when the in-memory copy and the value returned by readdir() are different

Jeremy told me that the problem occurs with large directories, so I began my tests with really large directories (up to 250'000 entries) and quite quickly I hit the issue. seekdir() won't return me to the recorded position. This kind of confirmed the problem the Samba people were seeing for more than three years.

I tweaked the values of my test program, to see if I can find a pattern. But looking at thousands of directory positions, filenames, inode numbers where more likely to turn me mad than to spot the problem... I started lowering the numbers and to my surprise, I could trigger the problem with as little as 10 or 20 files and deleting just one of them. Suddenly, I had a case that shows the problem on every run, no more randomness: Create 28 files, delete file 25 and seekdir to file 26: You end up at file 27!

Staring at the output of my program I suddenly saw the pattern as clear as can be:

Creating the directory with 28 files had created a directory that spans more than one block on the disk (2 in this case). File 25 was the first entry of the second block. Obviously the problem occured when you delete the first entry in a block of a directory and then return to the recorded position of the second entry in the same block. This would actually get you one entry to far.

By that time I had involved Otto to give me a hand and to confirm my findings. His first reaction: An interesting problem... ;) Wee investigated further and I began looking at the kernel code that removes a directory entry as well as the library code in libc that implements seekdir() and friends.
How the Kernel Removed Directory Entries

The kernel function to remove a directory entry, ufs_dirremove(), indeed treats entries that are located at the beginning of a block differently than others:

Code:
        if (dp->i_count == 0) {
                /*
                 * First entry in block: set d_ino to zero.
                 */
                ep->d_ino = 0;
        } else {
                /*
                 * Collapse new free space into previous entry.
                 */
                ep->d_reclen += dp->i_reclen;
        }


If it is the first entry in a block, the inode number is set to zero, thus invalidating the entry. For all other entries, the record length of the to-be-deleted record is added to the record length of the previous entry.

Since the library uses the record length field of an entry to proceed to the next entry in readdir(), this effectively means that the removed entry, while it is still in the block on disc, will no longer be used.
How the C Library Accesses Directory Entries

The C library implements the opendir(), telldir(), readdir(), seekdir(), and closedir() functions. These functions were written in the 4.2BSD times so that UNIX programs don't need to handle directories by themselves.

Code:
/*
 * get next entry in a directory.
 */
int
_readdir_unlocked(DIR *dirp, struct dirent **result)
{
        struct dirent *dp;

        *result = NULL;
        for (;;) {
                if (dirp->dd_loc >= dirp->dd_size)
                        dirp->dd_loc = 0;
                if (dirp->dd_loc == 0) {
                        dirp->dd_size = getdirentries(dirp->dd_fd,
                            dirp->dd_buf, dirp->dd_len, &dirp->dd_seek);
                        if (dirp->dd_size == 0)
                                return (0);
                        if (dirp->dd_size < 0)
                                return (-1);
                }
                dp = (struct dirent *)(dirp->dd_buf + dirp->dd_loc);
                if ((long)dp & 03)      /* bogus pointer check */
                        return (-1);
                if (dp->d_reclen <= 0 ||
                    dp->d_reclen > dirp->dd_len + 1 - dirp->dd_loc)
                        return (-1);
                dirp->dd_loc += dp->d_reclen;
                 if (dp->d_ino == 0)
                        continue;
                *result = dp;
                return (0);
        }
}


At first sight, this code looks correct. It will skip deleted entries that have their inode number set to zero and it will use the record lenght otherwise. And since a directory traversal using readdir() works just fine, this is code effectively works.

A close look at the seekdir() library implementation finally reveals the problem:

Code:
/*
 * seek to an entry in a directory.
 * Only values returned by "telldir" should be passed to seekdir.
 */
void
__seekdir(DIR *dirp, long loc)
{
        struct ddloc *lp;
        struct dirent *dp;

        if (loc < 0 || loc >= dirp->dd_td->td_loccnt)
                return;
        lp = &dirp->dd_td->td_locs[loc];
        dirp->dd_td->td_last = loc;
        if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek)
                return;
        (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
        dirp->dd_seek = lp->loc_seek;
        dirp->dd_loc = 0;
        while (dirp->dd_loc < lp->loc_loc) {
                _readdir_unlocked(dirp, &dp, 0);
                if (dp == NULL)
                        break;
        }
}


This code will not work as expected when seeking to the second entry of a block where the first has been deleted: seekdir() calls readdir() which happily skips the first entry (it has inode set to zero), and advance to the second entry. When the user now calls readdir() to read the directory entry to which he just seekdir()ed, he does not get the second entry but the third.

Much to my surprise I not only found this problem in all other BSDs or BSD derived systems like Mac OS X, but also in very old BSD versions. I first checked 4.4BSD Lite 2, and Otto confirmed it is also in 4.2BSD. The bug has been around for roughly 25 years or more.
The Solution

The fix is surprisingly simple, not to say trivial: _readdir_unlocked() must not skip directory entries with inode set to zero when it is called from __seekdir().

q.e.d.
(and sorry that it took us almost twenty-five years to fix it)



Leigh Lundin Wrote:I'm impressed!

Marc's report didn't mention it, but I imagine the bug's effect is cumulative. An in-service utility handling a case of 250,000 files hit with a number of deletions must have created a real mess as iterations progressed.

Marc seems nonplussed that the fix was 'surprisingly simple', but it's often like that– a flaw so tiny that it gets overlooked, often only a bit or two.

Source
Not reading through all of that, but interesting that the bug was found quite a while ago, but never reported...
Reference URL's