From: Chuck Lever It's been clear since the inception of VFS support for the O_DIRECT open flag that the NFS client implementation would not fit neatly into its constraints. for example: 1. 2.4 O_DIRECT has buffer alignment restrictions that are not necessary for NFS 2. 2.4 and 2.6 take the inode semaphor and i_alloc semaphor and play with i_size to protect local file systems, but that doesn't fit well when the local i_size represents only a cached copy of the real i_size Case in point: an NFS server should be exposed to all application direct read requests, not just the ones that fit inside the client's cached i_size. otherwise, the client has to do a GETATTR (or equivalent) before each on-the-wire read in order to determine whether the read should be allowed if the file size has changed on the server. and even with GETATTRs the read can race with an extending write and fail unnecessarily. I created an implementation of NFS direct I/O in RHEL AS 2.1's kernel called NFS uncached I/O that does not use the VFS direct I/O implementation at all (mostly because it is not complete in 2.4.9 kernels). when the kernel invokes f_op->read on an uncached file, it immediately invokes the NFS client's own internal direct I/O logic rather than going through the VFS generic routines (likewise for write). this appears to work much better than the 2.4.22 (and 2.6) NFS O_DIRECT implementation for the reasons outlined above. My proposal is that support for NFS O_DIRECT in 2.6 should work the same way as uncached I/O does in RHEL AS 2.1. the VFS O_DIRECT logic, even in 2.6, is still too restrictrive for NFS; it should switch O_DIRECT files straight to its own support and skip the generic support for NFS files. Other options here include the ability for a file system to disable the i_sem and i_size shenanigans in the VFS direct read and write paths entirely, or to move this logic out of the VFS and into the file systems that need it. --- 25-akpm/fs/nfs/direct.c | 168 ++++++++++++++++++++++++++++++++++++++--- 25-akpm/fs/nfs/file.c | 11 ++ 25-akpm/include/linux/nfs_fs.h | 4 3 files changed, 173 insertions(+), 10 deletions(-) diff -puN fs/nfs/direct.c~nfs-O_DIRECT-fixes fs/nfs/direct.c --- 25/fs/nfs/direct.c~nfs-O_DIRECT-fixes 2004-04-19 12:30:42.813750248 -0700 +++ 25-akpm/fs/nfs/direct.c 2004-04-19 12:30:42.821749032 -0700 @@ -32,6 +32,7 @@ * 18 Dec 2001 Initial implementation for 2.4 --cel * 08 Jul 2002 Version for 2.4.19, with bug fixes --trondmy * 08 Jun 2003 Port to 2.5 APIs --cel + * 31 Mar 2004 Handle direct I/O without VFS support --cel * */ @@ -409,12 +410,6 @@ nfs_direct_write(struct inode *inode, st * file_offset: offset in file to begin the operation * nr_segs: size of iovec array * - * Usually a file system implements direct I/O by calling out to - * blockdev_direct_IO. The NFS client doesn't have a backing block - * device, so we do everything by hand instead. - * - * The inode's i_sem is no longer held by the VFS layer before it calls - * this function to do a write. */ ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, @@ -431,10 +426,6 @@ nfs_direct_IO(int rw, struct kiocb *iocb if (!is_sync_kiocb(iocb)) goto out; - result = nfs_revalidate_inode(NFS_SERVER(inode), inode); - if (result < 0) - goto out; - switch (rw) { case READ: dprintk("NFS: direct_IO(read) (%s) off/no(%Lu/%lu)\n", @@ -458,3 +449,160 @@ out: dprintk("NFS: direct_IO result=%d\n", result); return result; } + +/** + * nfs_file_direct_read - file direct read operation for NFS files + * @iocb: target I/O control block + * @buf: user's buffer into which to read data + * count: number of bytes to read + * pos: byte offset in file where reading starts + * + * We use this function for direct reads instead of calling + * generic_file_aio_read() in order to avoid gfar's check to see if + * the request starts before the end of the file. For that check + * to work, we must generate a GETATTR before each direct read, and + * even then there is a window between the GETATTR and the subsequent + * READ where the file size could change. So our preference is simply + * to do all reads the application wants, and the server will take + * care of managing the end of file boundary. + * + * This function also eliminates unnecessarily updating the file's + * atime locally, as the NFS server sets the file's atime, and this + * client must read the updated atime from the server back into its + * cache. + */ +ssize_t +nfs_file_direct_read(struct kiocb *iocb, char *buf, size_t count, loff_t pos) +{ + ssize_t retval = -EINVAL; + loff_t *ppos = &iocb->ki_pos; + struct file *file = iocb->ki_filp; + struct dentry *dentry = file->f_dentry; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + struct iovec iov = { .iov_base = buf, .iov_len = count }; + + dprintk("nfs: direct read(%s/%s, %lu@%lu)\n", + dentry->d_parent->d_name.name, dentry->d_name.name, + (unsigned long) count, (unsigned long) pos); + + if (!is_sync_kiocb(iocb)) + goto out; + if (count < 0) + goto out; + retval = -EFAULT; + if (!access_ok(VERIFY_WRITE, iov.iov_base, iov.iov_len)) + goto out; + retval = 0; + if (!count) + goto out; + + /* XXX: why do this for reads? */ + if (mapping->nrpages) { + retval = filemap_fdatawrite(mapping); + if (retval == 0) + retval = filemap_fdatawait(mapping); + if (retval) + goto out; + } + + retval = nfs_direct_read(inode, file, &iov, pos, 1); + if (retval > 0) + *ppos = pos + retval; + +out: + return retval; +} + +/** + * nfs_file_direct_write - file direct write operation for NFS files + * @iocb: target I/O control block + * @buf: user's buffer from which to write data + * count: number of bytes to write + * pos: byte offset in file where writing starts + * + * We use this function for direct writes instead of calling + * generic_file_aio_write() in order to avoid taking the inode + * semaphor and updating the i_size. The NFS server will set + * the new i_size and this client must read the updated size + * back into its cache. We let the server do generic write + * parameter checking and report problems. + * + * We also avoid an unnecessary invocation of generic_osync_inode(), + * as it is fairly meaningless to sync the metadata of an NFS file. + * + * And we eliminate local atime updates, see direct read above. + * + * Note that O_APPEND is not supported for NFS direct writes, as there + * is no atomic O_APPEND write facility in the NFS protocol. + */ +ssize_t +nfs_file_direct_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos) +{ + ssize_t retval = -EINVAL; + loff_t *ppos = &iocb->ki_pos; + struct file *file = iocb->ki_filp; + struct dentry *dentry = file->f_dentry; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; + unsigned long limit = current->rlim[RLIMIT_FSIZE].rlim_cur; + + dfprintk(VFS, "nfs: direct write(%s/%s(%ld), %lu@%lu)\n", + dentry->d_parent->d_name.name, dentry->d_name.name, + inode->i_ino, (unsigned long) count, (unsigned long) pos); + + if (!is_sync_kiocb(iocb)) + goto out; + if (count < 0) + goto out; + if (pos < 0) + goto out; + retval = -EFAULT; + if (!access_ok(VERIFY_READ, iov.iov_base, iov.iov_len)) + goto out; + if (file->f_error) { + retval = file->f_error; + file->f_error = 0; + goto out; + } + retval = -EFBIG; + if (limit != RLIM_INFINITY) { + if (pos >= limit) { + send_sig(SIGXFSZ, current, 0); + goto out; + } + if (count > limit - (unsigned long) pos) + count = limit - (unsigned long) pos; + } + retval = 0; + if (!count) + goto out; + + /* + * The server's file system may not be POSIX-compliant, so we want + * to ensure that the suid bit is gone. + */ + down(&inode->i_sem); + retval = remove_suid(dentry); + up(&inode->i_sem); + if (retval) + goto out; + + if (mapping->nrpages) { + retval = filemap_fdatawrite(mapping); + if (retval == 0) + retval = filemap_fdatawait(mapping); + if (retval) + goto out; + } + + retval = nfs_direct_write(inode, file, &iov, pos, 1); + if (mapping->nrpages) + invalidate_inode_pages2(mapping); + if (retval > 0) + *ppos = pos + retval; + +out: + return retval; +} diff -puN fs/nfs/file.c~nfs-O_DIRECT-fixes fs/nfs/file.c --- 25/fs/nfs/file.c~nfs-O_DIRECT-fixes 2004-04-19 12:30:42.814750096 -0700 +++ 25-akpm/fs/nfs/file.c 2004-04-19 12:30:42.821749032 -0700 @@ -154,6 +154,11 @@ nfs_file_read(struct kiocb *iocb, char * struct inode * inode = dentry->d_inode; ssize_t result; +#ifdef CONFIG_NFS_DIRECTIO + if (iocb->ki_filp->f_flags & O_DIRECT) + return nfs_file_direct_read(iocb, buf, count, pos); +#endif + dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n", dentry->d_parent->d_name.name, dentry->d_name.name, (unsigned long) count, (unsigned long) pos); @@ -268,6 +273,11 @@ nfs_file_write(struct kiocb *iocb, const struct inode * inode = dentry->d_inode; ssize_t result; +#ifdef CONFIG_NFS_DIRECTIO + if (iocb->ki_filp->f_flags & O_DIRECT) + return nfs_file_direct_write(iocb, buf, count, pos); +#endif + dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n", dentry->d_parent->d_name.name, dentry->d_name.name, inode->i_ino, (unsigned long) count, (unsigned long) pos); @@ -275,6 +285,7 @@ nfs_file_write(struct kiocb *iocb, const result = -EBUSY; if (IS_SWAPFILE(inode)) goto out_swapfile; + result = nfs_revalidate_inode(NFS_SERVER(inode), inode); if (result) goto out; diff -puN include/linux/nfs_fs.h~nfs-O_DIRECT-fixes include/linux/nfs_fs.h --- 25/include/linux/nfs_fs.h~nfs-O_DIRECT-fixes 2004-04-19 12:30:42.816749792 -0700 +++ 25-akpm/include/linux/nfs_fs.h 2004-04-19 12:30:42.822748880 -0700 @@ -306,6 +306,10 @@ nfs_file_cred(struct file *file) */ extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t, unsigned long); +extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char *buf, + size_t count, loff_t pos); +extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char *buf, + size_t count, loff_t pos); /* * linux/fs/nfs/dir.c _