Skip to content

Commit

Permalink
fs: introduce new truncate sequence
Browse files Browse the repository at this point in the history
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.

simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.

simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).

To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
  the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
  the event of write failure after allocation, so this must be performed
  in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
  cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
  variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
  to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.

Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed.  This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).

Cc: Christoph Hellwig <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Al Viro <[email protected]>
  • Loading branch information
[email protected] authored and Al Viro committed May 28, 2010
1 parent 7000d3c commit 7bb46a6
Show file tree
Hide file tree
Showing 8 changed files with 300 additions and 63 deletions.
7 changes: 6 additions & 1 deletion Documentation/filesystems/vfs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -401,11 +401,16 @@ otherwise noted.
started might not be in the page cache at the end of the
walk).

truncate: called by the VFS to change the size of a file. The
truncate: Deprecated. This will not be called if ->setsize is defined.
Called by the VFS to change the size of a file. The
i_size field of the inode is set to the desired size by the
VFS before this method is called. This method is called by
the truncate(2) system call and related functionality.

Note: ->truncate and vmtruncate are deprecated. Do not add new
instances/calls of these. Filesystems should be converted to do their
truncate sequence via ->setattr().

permission: called by the VFS to check for access rights on a POSIX-like
filesystem.

Expand Down
50 changes: 40 additions & 10 deletions fs/attr.c
Original file line number Diff line number Diff line change
Expand Up @@ -67,14 +67,14 @@ EXPORT_SYMBOL(inode_change_ok);
* @offset: the new size to assign to the inode
* @Returns: 0 on success, -ve errno on failure
*
* inode_newsize_ok must be called with i_mutex held.
*
* inode_newsize_ok will check filesystem limits and ulimits to check that the
* new inode size is within limits. inode_newsize_ok will also send SIGXFSZ
* when necessary. Caller must not proceed with inode size change if failure is
* returned. @inode must be a file (not directory), with appropriate
* permissions to allow truncate (inode_newsize_ok does NOT check these
* conditions).
*
* inode_newsize_ok must be called with i_mutex held.
*/
int inode_newsize_ok(const struct inode *inode, loff_t offset)
{
Expand Down Expand Up @@ -104,17 +104,25 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
}
EXPORT_SYMBOL(inode_newsize_ok);

int inode_setattr(struct inode * inode, struct iattr * attr)
/**
* generic_setattr - copy simple metadata updates into the generic inode
* @inode: the inode to be updated
* @attr: the new attributes
*
* generic_setattr must be called with i_mutex held.
*
* generic_setattr updates the inode's metadata with that specified
* in attr. Noticably missing is inode size update, which is more complex
* as it requires pagecache updates. See simple_setsize.
*
* The inode is not marked as dirty after this operation. The rationale is
* that for "simple" filesystems, the struct inode is the inode storage.
* The caller is free to mark the inode dirty afterwards if needed.
*/
void generic_setattr(struct inode *inode, const struct iattr *attr)
{
unsigned int ia_valid = attr->ia_valid;

if (ia_valid & ATTR_SIZE &&
attr->ia_size != i_size_read(inode)) {
int error = vmtruncate(inode, attr->ia_size);
if (error)
return error;
}

if (ia_valid & ATTR_UID)
inode->i_uid = attr->ia_uid;
if (ia_valid & ATTR_GID)
Expand All @@ -135,6 +143,28 @@ int inode_setattr(struct inode * inode, struct iattr * attr)
mode &= ~S_ISGID;
inode->i_mode = mode;
}
}
EXPORT_SYMBOL(generic_setattr);

/*
* note this function is deprecated, the new truncate sequence should be
* used instead -- see eg. simple_setsize, generic_setattr.
*/
int inode_setattr(struct inode *inode, const struct iattr *attr)
{
unsigned int ia_valid = attr->ia_valid;

if (ia_valid & ATTR_SIZE &&
attr->ia_size != i_size_read(inode)) {
int error;

error = vmtruncate(inode, attr->ia_size);
if (error)
return error;
}

generic_setattr(inode, attr);

mark_inode_dirty(inode);

return 0;
Expand Down
123 changes: 98 additions & 25 deletions fs/buffer.c
Original file line number Diff line number Diff line change
Expand Up @@ -1949,14 +1949,11 @@ static int __block_commit_write(struct inode *inode, struct page *page,
}

/*
* block_write_begin takes care of the basic task of block allocation and
* bringing partial write blocks uptodate first.
*
* If *pagep is not NULL, then block_write_begin uses the locked page
* at *pagep rather than allocating its own. In this case, the page will
* not be unlocked or deallocated on failure.
* Filesystems implementing the new truncate sequence should use the
* _newtrunc postfix variant which won't incorrectly call vmtruncate.
* The filesystem needs to handle block truncation upon failure.
*/
int block_write_begin(struct file *file, struct address_space *mapping,
int block_write_begin_newtrunc(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block)
Expand Down Expand Up @@ -1992,20 +1989,50 @@ int block_write_begin(struct file *file, struct address_space *mapping,
unlock_page(page);
page_cache_release(page);
*pagep = NULL;

/*
* prepare_write() may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
*/
if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
}
}

out:
return status;
}
EXPORT_SYMBOL(block_write_begin_newtrunc);

/*
* block_write_begin takes care of the basic task of block allocation and
* bringing partial write blocks uptodate first.
*
* If *pagep is not NULL, then block_write_begin uses the locked page
* at *pagep rather than allocating its own. In this case, the page will
* not be unlocked or deallocated on failure.
*/
int block_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block)
{
int ret;

ret = block_write_begin_newtrunc(file, mapping, pos, len, flags,
pagep, fsdata, get_block);

/*
* prepare_write() may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
*
* Filesystems which pass down their own page also cannot
* call into vmtruncate here because it would lead to lock
* inversion problems (*pagep is locked). This is a further
* example of where the old truncate sequence is inadequate.
*/
if (unlikely(ret) && *pagep == NULL) {
loff_t isize = mapping->host->i_size;
if (pos + len > isize)
vmtruncate(mapping->host, isize);
}

return ret;
}
EXPORT_SYMBOL(block_write_begin);

int block_write_end(struct file *file, struct address_space *mapping,
Expand Down Expand Up @@ -2324,7 +2351,7 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
* For moronic filesystems that do not allow holes in file.
* We may have to extend the file.
*/
int cont_write_begin(struct file *file, struct address_space *mapping,
int cont_write_begin_newtrunc(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block, loff_t *bytes)
Expand All @@ -2345,11 +2372,30 @@ int cont_write_begin(struct file *file, struct address_space *mapping,
}

*pagep = NULL;
err = block_write_begin(file, mapping, pos, len,
err = block_write_begin_newtrunc(file, mapping, pos, len,
flags, pagep, fsdata, get_block);
out:
return err;
}
EXPORT_SYMBOL(cont_write_begin_newtrunc);

int cont_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block, loff_t *bytes)
{
int ret;

ret = cont_write_begin_newtrunc(file, mapping, pos, len, flags,
pagep, fsdata, get_block, bytes);
if (unlikely(ret)) {
loff_t isize = mapping->host->i_size;
if (pos + len > isize)
vmtruncate(mapping->host, isize);
}

return ret;
}
EXPORT_SYMBOL(cont_write_begin);

int block_prepare_write(struct page *page, unsigned from, unsigned to,
Expand Down Expand Up @@ -2381,7 +2427,7 @@ EXPORT_SYMBOL(block_commit_write);
*
* We are not allowed to take the i_mutex here so we have to play games to
* protect against truncate races as the page could now be beyond EOF. Because
* vmtruncate() writes the inode size before removing pages, once we have the
* truncate writes the inode size before removing pages, once we have the
* page lock we can determine safely if the page is beyond EOF. If it is not
* beyond EOF, then the page is guaranteed safe against truncation until we
* unlock the page.
Expand Down Expand Up @@ -2464,10 +2510,11 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
}

/*
* On entry, the page is fully not uptodate.
* On exit the page is fully uptodate in the areas outside (from,to)
* Filesystems implementing the new truncate sequence should use the
* _newtrunc postfix variant which won't incorrectly call vmtruncate.
* The filesystem needs to handle block truncation upon failure.
*/
int nobh_write_begin(struct file *file, struct address_space *mapping,
int nobh_write_begin_newtrunc(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block)
Expand Down Expand Up @@ -2500,8 +2547,8 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
unlock_page(page);
page_cache_release(page);
*pagep = NULL;
return block_write_begin(file, mapping, pos, len, flags, pagep,
fsdata, get_block);
return block_write_begin_newtrunc(file, mapping, pos, len,
flags, pagep, fsdata, get_block);
}

if (PageMappedToDisk(page))
Expand Down Expand Up @@ -2605,8 +2652,34 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
page_cache_release(page);
*pagep = NULL;

if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
return ret;
}
EXPORT_SYMBOL(nobh_write_begin_newtrunc);

/*
* On entry, the page is fully not uptodate.
* On exit the page is fully uptodate in the areas outside (from,to)
*/
int nobh_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block)
{
int ret;

ret = nobh_write_begin_newtrunc(file, mapping, pos, len, flags,
pagep, fsdata, get_block);

/*
* prepare_write() may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
*/
if (unlikely(ret)) {
loff_t isize = mapping->host->i_size;
if (pos + len > isize)
vmtruncate(mapping->host, isize);
}

return ret;
}
Expand Down
61 changes: 40 additions & 21 deletions fs/direct-io.c
Original file line number Diff line number Diff line change
Expand Up @@ -1134,27 +1134,8 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
return ret;
}

/*
* This is a library function for use by filesystem drivers.
*
* The locking rules are governed by the flags parameter:
* - if the flags value contains DIO_LOCKING we use a fancy locking
* scheme for dumb filesystems.
* For writes this function is called under i_mutex and returns with
* i_mutex held, for reads, i_mutex is not held on entry, but it is
* taken and dropped again before returning.
* For reads and writes i_alloc_sem is taken in shared mode and released
* on I/O completion (which may happen asynchronously after returning to
* the caller).
*
* - if the flags value does NOT contain DIO_LOCKING we don't use any
* internal locking but rather rely on the filesystem to synchronize
* direct I/O reads/writes versus each other and truncate.
* For reads and writes both i_mutex and i_alloc_sem are not held on
* entry and are never taken.
*/
ssize_t
__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
__blockdev_direct_IO_newtrunc(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
dio_submit_t submit_io, int flags)
Expand Down Expand Up @@ -1247,22 +1228,60 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
nr_segs, blkbits, get_block, end_io,
submit_io, dio);

out:
return retval;
}
EXPORT_SYMBOL(__blockdev_direct_IO_newtrunc);

/*
* This is a library function for use by filesystem drivers.
*
* The locking rules are governed by the flags parameter:
* - if the flags value contains DIO_LOCKING we use a fancy locking
* scheme for dumb filesystems.
* For writes this function is called under i_mutex and returns with
* i_mutex held, for reads, i_mutex is not held on entry, but it is
* taken and dropped again before returning.
* For reads and writes i_alloc_sem is taken in shared mode and released
* on I/O completion (which may happen asynchronously after returning to
* the caller).
*
* - if the flags value does NOT contain DIO_LOCKING we don't use any
* internal locking but rather rely on the filesystem to synchronize
* direct I/O reads/writes versus each other and truncate.
* For reads and writes both i_mutex and i_alloc_sem are not held on
* entry and are never taken.
*/
ssize_t
__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
dio_submit_t submit_io, int flags)
{
ssize_t retval;

retval = __blockdev_direct_IO_newtrunc(rw, iocb, inode, bdev, iov,
offset, nr_segs, get_block, end_io, submit_io, flags);
/*
* In case of error extending write may have instantiated a few
* blocks outside i_size. Trim these off again for DIO_LOCKING.
* NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in
* their own manner. This is a further example of where the old
* truncate sequence is inadequate.
*
* NOTE: filesystems with their own locking have to handle this
* on their own.
*/
if (flags & DIO_LOCKING) {
if (unlikely((rw & WRITE) && retval < 0)) {
loff_t isize = i_size_read(inode);
loff_t end = offset + iov_length(iov, nr_segs);

if (end > isize)
vmtruncate(inode, isize);
}
}

out:
return retval;
}
EXPORT_SYMBOL(__blockdev_direct_IO);
Loading

0 comments on commit 7bb46a6

Please sign in to comment.