Although it is expected nowadays that every new sysfs attribute is
documented under Documentation/ABI, this is not yet mentioned in the
kernel documentation. This patch adds a note in the sysfs
documentation about that requirement.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
For a long time now orlov is the default block allocator in the ext3. It
performs better than the old one and no one seems to claim otherwise so
we can safely drop it and make oldalloc and orlov mount option
deprecated.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Remove the name of Sergey Kostyliov as maintainer of befs.
In the MAINTAINERS file, befs is orphaned.
Signed-off-by: Marcos Souza <marcos.mage@gmail.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
ext3.txt: update the links in the section "useful links" to the latest ones
ext3: Fix data corruption in inodes with journalled data
ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
ext3: Fix compilation with -DDX_DEBUG
quota: Remove unused declaration
jbd: Use WRITE_SYNC in journal checkpoint.
jbd: Fix oops in journal_remove_journal_head()
ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
ext3/ioctl.c: silence sparse warnings about different address spaces
ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
ext3: Improve truncate error handling
ext3: use proper little-endian bitops
ext2: include fs.h into ext2_fs.h
ext3: Fix oops in ext3_try_to_allocate_with_rsv()
jbd: fix a bug of leaking jh->b_jcount
jbd: remove dependency on __GFP_NOFAIL
ext3: Convert ext3 to new truncate calling convention
jbd: Add fixed tracepoints
ext3: Add fixed tracepoints
Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
new fixed tracepoints.
Following commits a904937 and 0c1aa9a update the d_splice_alias
desciption.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
updated Documentation/ja_JP/SubmittingPatches
debugfs: add documentation for debugfs_create_x64
uio: uio_pdrv_genirq: Add OF support
firmware: gsmi: remove sysfs entries when unload the module
Documentation/zh_CN: Fix messy code file email-clients.txt
driver core: add more help description for "path to uevent helper"
driver-core: modify FIRMWARE_IN_KERNEL help message
driver-core: Kconfig grammar corrections in firmware configuration
DOCUMENTATION: Replace create_device() with device_create().
DOCUMENTATION: Update overview.txt in Doc/driver-model.
pti: pti_tty_install documentation mispelling.
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: Make ZLIB compression support optional
Squashfs: Update documentation for XZ and add squashfs-tools devel tree
* 'for-3.1' of git://linux-nfs.org/~bfields/linux:
nfsd: don't break lease on CLAIM_DELEGATE_CUR
locks: rename lock-manager ops
nfsd4: update nfsv4.1 implementation notes
nfsd: turn on reply cache for NFSv4
nfsd4: call nfsd4_release_compoundargs from pc_release
nfsd41: Deny new lock before RECLAIM_COMPLETE done
fs: locks: remove init_once
nfsd41: check the size of request
nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
nfsd4: fix file leak on open_downgrade
nfsd4: remember to put RW access on stateid destruction
NFSD: Added TEST_STATEID operation
NFSD: added FREE_STATEID operation
svcrpc: fix list-corrupting race on nfsd shutdown
rpc: allow autoloading of gss mechanisms
svcauth_unix.c: quiet sparse noise
svcsock.c: include sunrpc.h to quiet sparse noise
nfsd: Remove deprecated nfsctl system call and related code.
NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In Documentation/filesystems/ext3.txt, the section "useful links"
provides two links which link to ibm developerworks articles. While
the second one can be redirected to the latest one, the first one
http://www.ibm.com/developerworks/library/l-fs7.html
fails to be redirected.
Update the 2 links to the latest ones.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (32 commits)
MAINTAINERS: change e-mail of Adrian Hunter
UBIFS: fix master node recovery
UBIFS: improve power cut emulation testing
UBIFS: rename recovery testing variables
UBIFS: remove custom list of superblocks
UBIFS: stop re-defining UBI operations
UBIFS: switch to I/O helpers
UBIFS: switch to ubifs_leb_write
UBIFS: switch to ubifs_leb_read
UBIFS: introduce more I/O helpers
UBIFS: always print stacktrace when switching to R/O mode
UBIFS: remove unused and unneeded debugging function
UBIFS: add global debugfs knobs
UBIFS: introduce debugfs helpers
UBIFS: re-arrange debugging code a bit
UBIFS: be more informative in failure mode
UBIFS: switch self-check knobs to debugfs
UBIFS: lessen amount of debugging check types
UBIFS: introduce helper functions for debugging checks and tests
UBIFS: amend debugging inode size check function prototype
...
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
using fiemap in things like cp cause more problems than it solves, so lets try
and give userspace an interface that doesn't suck. We need to match solaris
here, and the definitions are
*o* If /whence/ is SEEK_HOLE, the offset of the start of the
next hole greater than or equal to the supplied offset
is returned. The definition of a hole is provided near
the end of the DESCRIPTION.
*o* If /whence/ is SEEK_DATA, the file pointer is set to the
start of the next non-hole file region greater than or
equal to the supplied offset.
So in the generic case the entire file is data and there is a virtual hole at
the end. That means we will just return i_size for SEEK_HOLE and will return
the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
the same way.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep econmies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.
To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.
To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.
Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects freeable in the
cache), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Resize feature was supported by the commit 4e33f9eab0 but it was not
reflected to the list of unsupported features in nilfs2.txt file.
This updates the list to fix discrepancy.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Add an FS-Cache helper to bulk uncache pages on an inode. This will
only work for the circumstance where the pages in the cache correspond
1:1 with the pages attached to an inode's page cache.
This is required for CIFS and NFS: When disabling inode cookie, we were
returning the cookie and setting cifsi->fscache to NULL but failed to
invalidate any previously mapped pages. This resulted in "Bad page
state" errors and manifested in other kind of errors when running
fsstress. Fix it by uncaching mapped pages when we disable the inode
cookie.
This patch should fix the following oops and "Bad page state" errors
seen during fsstress testing.
------------[ cut here ]------------
kernel BUG at fs/cachefiles/namei.c:201!
invalid opcode: 0000 [#1] SMP
Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
Stack:
0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
Call Trace:
cachefiles_lookup_object+0x78/0xd4 [cachefiles]
fscache_lookup_object+0x131/0x16d [fscache]
fscache_object_work_func+0x1bc/0x669 [fscache]
process_one_work+0x186/0x298
worker_thread+0xda/0x15d
kthread+0x84/0x8c
kernel_thread_helper+0x4/0x10
RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
---[ end trace 1d481c9af1804caa ]---
I tested the uncaching by the following means:
(1) Create a big file on my NFS server (104857600 bytes).
(2) Read the file into the cache with md5sum on the NFS client. Look in
/proc/fs/fscache/stats:
Pages : mrk=25601 unc=0
(3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
again:
Pages : mrk=25601 unc=25601
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UBIFS has many built-in self-check functions which can be enabled using the
debug_chks module parameter or the corresponding sysfs file
(/sys/module/ubifs/parameters/debug_chks). However, this is not flexible enough
because it is not per-filesystem. This patch moves this to debugfs interfaces.
We already have debugfs support, so this patch just adds more debugfs files.
While looking at debugfs support I've noticed that it is racy WRT file-system
unmount, and added a TODO entry for that. This problem has been there for long
time and it is quite standard debugfs PITA. The plan is to fix this later.
This patch is simple, but it is large because it changes many places where we
check if a particular type of checks is enabled or disabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We have too many different debugging checks - lessen the amount by merging all
index-related checks into one. At the same time, move the "force in-the-gap"
test to the "index checks" class, because it is too heavy for the "general"
class.
This patch merges TNC, Old index, and Index size check and calles this just
"index checks".
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Bh and nobh mount option has been deprecated in ext4
(206f7ab4f4) and in ext3
(4c4d390122)
so remove those options from documentation.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread)
introduced performance regression. In an AIM7 test, this commit degraded
performance by about 40%.
The commit runs rcu callbacks in a kthread instead of softirq. We observed
high rate of context switch which is caused by this. Out test system has
64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
which is caused by RCU's per-CPU kthread. A trace showed that most of
the time the RCU per-CPU kthread doesn't actually handle any callbacks,
but instead just does a very small amount of work handling grace periods.
This means that RCU's per-CPU kthreads are making the scheduler do quite
a bit of work in order to allow a very small amount of RCU-related
processing to be done.
Alex Shi's analysis determined that this slowdown is due to lock
contention within the scheduler. Unfortunately, as Peter Zijlstra points
out, the scheduler's real-time semantics require global action, which
means that this contention is inherent in real-time scheduling. (Yes,
perhaps someone will come up with a workaround -- otherwise, -rt is not
going to do well on large SMP systems -- but this patch will work around
this issue in the meantime. And "the meantime" might well be forever.)
This patch therefore re-introduces softirq processing to RCU, but only
for core RCU work. RCU callbacks are still executed in kthread context,
so that only a small amount of RCU work runs in softirq context in the
common case. This should minimize ksoftirqd execution, allowing us to
skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Tested-by: "Alex,Shi" <alex.shi@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Change all "arch/i386" to "arch/x86" in Documentaion/,
since the directory has changed.
Also update the files which have changed their filename
in the meantime accordingly.
Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
[jkosina@suse.cz: reword changelog]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
* 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs:
Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file> to Documentation/security/<moved_file>.
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.
This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.
Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When configfs_register_subsystem() fails, we unregister too many
subsystems in configfs_example_init. Decrement i by one to not unregister
non-registered subsystem.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (28 commits)
Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
Ocfs2/move_extents: Set several trivial constraints for threshold.
Ocfs2/move_extents: Let defrag handle partial extent moving.
Ocfs2/move_extents: move/defrag extents within a certain range.
Ocfs2/move_extents: helper to calculate the defraging length in one run.
Ocfs2/move_extents: move entire/partial extent.
Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
Ocfs2/move_extents: helper to validate and adjust moving goal.
Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
Ocfs2/move_extents: defrag a range of extent.
Ocfs2/move_extents: move a range of extent.
Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
Ocfs2/move_extents: Add basic framework and source files for extent moving.
Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent
xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec
xfs: fix up asserts in xfs_iflush_fork
xfs: do not do pointer arithmetic on extent records
xfs: do not use unchecked extent indices in xfs_bunmapi
xfs: do not use unchecked extent indices in xfs_bmapi
xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*
xfs: remove if_lastex
xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag
xfs: do not discard alloc btree blocks
xfs: add online discard support
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
jbd2: Add MAINTAINERS entry
jbd2: fix a potential leak of a journal_head on an error path
ext4: teach ext4_ext_split to calculate extents efficiently
ext4: Convert ext4 to new truncate calling convention
ext4: do not normalize block requests from fallocate()
ext4: enable "punch hole" functionality
ext4: add "punch hole" flag to ext4_map_blocks()
ext4: punch out extents
ext4: add new function ext4_block_zero_page_range()
ext4: add flag to ext4_has_free_blocks
ext4: reserve inodes and feature code for 'quota' feature
ext4: add support for multiple mount protection
ext4: ensure f_bfree returned by ext4_statfs() is non-negative
ext4: protect bb_first_free in ext4_trim_all_free() with group lock
ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
jbd2: Fix comment to match the code in jbd2__journal_start()
ext4: fix waiting and sending of a barrier in ext4_sync_file()
jbd2: Add function jbd2_trans_will_send_data_barrier()
jbd2: fix sending of data flush on journal commit
ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: update Documentation pointers
net/9p: enable 9p to work in non-default network namespace
net/9p: p9_idpool_get return -1 on error
fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
9p: Small cleanup in <net/9p/9p.h>
9p: remove experimental tag from tested configurations
9p: typo fixes and minor cleanups
net/9p: Change linuxdoc names to match functions.
Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the
cpu count is large.
Setting smp affinity to cpus 256 to 263 would be:
echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity
instead of:
echo 256-263 > smp_affinity_list
Think about what it looks like for cpus around say, 4088 to 4095.
We already have many alternate "list" interfaces:
/sys/devices/system/cpu/cpuX/indexY/shared_cpu_list
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list
/sys/devices/system/cpu/cpuX/topology/core_siblings_list
/sys/devices/system/node/nodeX/cpulist
/sys/devices/pci***/***/local_cpulist
Add a companion interface, smp_affinity_list to use cpu lists instead of
cpu maps. This conforms to other companion interfaces where both a map
and a list interface exists.
This required adding a bitmap_parselist_user() function in a manner
similar to the bitmap_parse_user() function.
[akpm@linux-foundation.org: make __bitmap_parselist() static]
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update documentation pointers to include virtfs publication, 9p RFC
as well as updated list of servers and alternative clients.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
UBIFS: switch to dynamic printks
UBIFS: fix kernel-doc comments
UBIFS: fix extremely rare mount failure
UBIFS: simplify LEB recovery function further
UBIFS: always cleanup the recovered LEB
UBIFS: clean up LEB recovery function
UBIFS: fix-up free space on mount if flag is set
UBIFS: add the fixup function
UBIFS: add a superblock flag for free space fix-up
UBIFS: share the next_log_lnum helper
UBIFS: expect corruption only in last journal head LEBs
UBIFS: synchronize write-buffer before switching to the next bud
UBIFS: remove BUG statement
UBIFS: change bud replay function conventions
UBIFS: substitute the replay tree with a replay list
UBIFS: simplify replay
UBIFS: store free and dirty space in the bud replay entry
UBIFS: remove unnecessary stack variable
UBIFS: double check that buds are replied in order
UBIFS: make 2 functions static
...
Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.
The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard. Note that we don't bother
supporting discard for the non-delaylog mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Switch to debugging using dynamic printk (pr_debug()). There is no good reason
to carry custom debugging prints if there is so cool and powerful generic
dynamic printk infrastructure, see Documentation/dynamic-debug-howto.txt. With
dynamic printks we can switch on/of individual prints, per-file, per-function
and per format messages. This means that instead of doing old-fashioned
echo 1 > /sys/module/ubifs/parameters/debug_msgs
to enable general messages, we can do:
echo 'format "UBIFS DBG gen" +ptlf' > control
to enable general messages and additionally ask the dynamic printk
infrastructure to print process ID, line number and function name. So there is
no reason to keep UBIFS-specific crud if there is more powerful generic thing.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
move LSM-, credentials-, and keys-related files from Documentation/
to Documentation/security/,
add Documentation/security/00-INDEX, and
update all occurrences of Documentation/<moved_file>
to Documentation/security/<moved_file>.
UBIFS can force itself to use the 'in-the-gaps' commit method - the last resort
method which is normally invoced very very rarely. Currently this "force
int-the-gaps" debugging feature is a separate test mode. But it is a bit saner
to make it to be the "general" self-test check instead.
This patch is just a clean-up which should make the debugging code look a bit
nicer and easier to use - we have way too many debugging options.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.
But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.
Also add comments and properly synchronized all accesses to
rcu_cpu_kthread_task, as suggested by Lai Jiangshan.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
ext4: fix a BUG in mb_mark_used during trim.
ext4: unused variables cleanup in fs/ext4/extents.c
ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
ext4: add more tracepoints and use dev_t in the trace buffer
ext4: don't kfree uninitialized s_group_info members
ext4: add missing space in printk's in __ext4_grp_locked_error()
ext4: add FITRIM to compat_ioctl.
ext4: handle errors in ext4_clear_blocks()
ext4: unify the ext4_handle_release_buffer() api
ext4: handle errors in ext4_rename
jbd2: add COW fields to struct jbd2_journal_handle
jbd2: add the b_cow_tid field to journal_head struct
ext4: Initialize fsync transaction ids in ext4_new_inode()
ext4: Use single thread to perform DIO unwritten convertion
ext4: optimize ext4_bio_write_page() when no extent conversion is needed
ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
ext4: use the nblocks arg to ext4_truncate_restart_trans()
ext4: fix missing iput of root inode for some mount error paths
ext4: make FIEMAP and delayed allocation play well together
ext4: suppress verbose debugging information if malloc-debug is off
...
Fi up conflicts in fs/ext4/super.c due to workqueue changes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: simplify iget & friends
fs: pull inode->i_lock up out of writeback_single_inode
fs: rename inode_lock to inode_hash_lock
fs: move i_wb_list out from under inode_lock
fs: move i_sb_list out from under inode_lock
fs: remove inode_lock from iput_final and prune_icache
fs: Lock the inode LRU list separately
fs: factor inode disposal
fs: protect inode->i_state with inode->i_lock
autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
autofs4 - remove autofs4_lock
autofs4 - fix d_manage() return on rcu-walk
autofs4 - fix autofs4_expire_indirect() traversal
autofs4 - fix dentry leak in autofs4_expire_direct()
autofs4 - reinstate last used update on access
vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
Now that inode state changes are protected by the inode->i_lock and
the inode LRU manipulations by the inode_lru_lock, we can remove the
inode_lock from prune_icache and the initial part of iput_final().
instead of using the inode_lock to protect the inode during
iput_final, use the inode->i_lock instead. This protects the inode
against new references being taken while we change the inode state
to I_FREEING, as well as preventing prune_icache from grabbing the
inode while we are manipulating it. Hence we no longer need the
inode_lock in iput_final prior to setting I_FREEING on the inode.
For prune_icache, we no longer need the inode_lock to protect the
LRU list, and the inodes themselves are protected against freeing
races by the inode->i_lock. Hence we can lift the inode_lock from
prune_icache as well.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: Use vmalloc rather than kmalloc for zlib workspace
Squashfs: handle corruption of directory structure
Squashfs: wrap squashfs_mount() definition
Squashfs: xz_wrapper doesn't need to include squashfs_fs_i.h anymore
Squashfs: Update documentation to include compression options
Squashfs: Update Kconfig help text to include xz compression
Squashfs: add compression options support to xz decompressor
Squashfs: extend decompressor framework to handle compression options
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: deprecate the commands pending counter
exofs: Write sbi->s_nextid as part of the Create command
exofs: Add option to mount by osdname
exofs: Override read-ahead to align on stripe_size
exofs: simple fsync race fix
exofs: Optimize read_4_write
exofs: Trivial: fix some indentation and debug prints
exofs: Remove redundant unlikely()
ADFS (FileCore) storage complies with the RISC OS filetype specification
(12 bits of file type information is stored in the file load address,
rather than using a file extension). The existing driver largely ignores
this information and does not present it to the end user.
It is desirable that stored filetypes be made visible to the end user to
facilitate a precise copy of data and metadata from a hard disc (or image
thereof) into a RISC OS emulator (such as RPCEmu) or to a network share
which can be accessed by real Acorn systems.
This patch implements a per-mount filetype suffix option (use -o
ftsuffix=1) to present any filetype as a ,xyz hexadecimal suffix on each
file. This type suffix is compatible with that used by RISC OS systems
that access network servers using NFS client software and by RPCemu's host
filing system.
Signed-off-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (23 commits)
xfs: don't name variables "panic"
xfs: factor agf counter updates into a helper
xfs: clean up the xfs_alloc_compute_aligned calling convention
xfs: kill support/debug.[ch]
xfs: Convert remaining cmn_err() callers to new API
xfs: convert the quota debug prints to new API
xfs: rename xfs_cmn_err_fsblock_zero()
xfs: convert xfs_fs_cmn_err to new error logging API
xfs: kill xfs_fs_mount_cmn_err() macro
xfs: kill xfs_fs_repair_cmn_err() macro
xfs: convert xfs_cmn_err to xfs_alert_tag
xfs: Convert xlog_warn to new logging interface
xfs: Convert linux-2.6/ files to new logging interface
xfs: introduce new logging API.
xfs: zero proper structure size for geometry calls
xfs: enable delaylog by default
xfs: more sensible inode refcounting for ialloc
xfs: stop using xfs_trans_iget in the RT allocator
xfs: check if device support discard in xfs_ioc_trim()
xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: call security_d_instantiate in d_obtain_alias V2
lose 'mounting_here' argument in ->d_manage()
don't pass 'mounting_here' flag to follow_down()
change the locking order for namespace_sem
fix deadlock in pivot_root()
vfs: split off vfsmount-related parts of vfs_kern_mount()
Some fixes for pstore
kill simple_set_mnt()
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (25 commits)
UBIFS: clean-up commentaries
UBIFS: save 128KiB or more RAM
UBIFS: allocate orphans scan buffer on demand
UBIFS: allocate lpt dump buffer on demand
UBIFS: allocate ltab checking buffer on demand
UBIFS: allocate scanning buffer on demand
UBIFS: allocate dump buffer on demand
UBIFS: do not check data crc by default
UBIFS: simplify UBIFS Kconfig menu
UBIFS: print max. index node size
UBIFS: handle allocation failures in UBIFS write path
UBIFS: use max_write_size during recovery
UBIFS: use max_write_size for write-buffers
UBIFS: introduce write-buffer size field
UBI: incorporate LEB offset information
UBIFS: incorporate maximum write size
UBI: provide LEB offset information
UBI: incorporate maximum write size
UBIFS: fix LEB number in printk
UBIFS: restrict world-writable debugfs files
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
* 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (54 commits)
RPC: killing RPC tasks races fixed
xprt: remove redundant check
SUNRPC: Convert struct rpc_xprt to use atomic_t counters
SUNRPC: Ensure we always run the tk_callback before tk_action
sunrpc: fix printk format warning
xprt: remove redundant null check
nfs: BKL is no longer needed, so remove the include
NFS: Fix a warning in fs/nfs/idmap.c
Cleanup: Factor out some cut-and-paste code.
cleanup: save 60 lines/100 bytes by combining two mostly duplicate functions.
NFS: account direct-io into task io accounting
gss:krb5 only include enctype numbers in gm_upcall_enctypes
RPCRDMA: Fix FRMR registration/invalidate handling.
RPCRDMA: Fix to XDR page base interpretation in marshalling logic.
NFSv4: Send unmapped uid/gids to the server when using auth_sys
NFSv4: Propagate the error NFS4ERR_BADOWNER to nfs4_do_setattr
NFSv4: cleanup idmapper functions to take an nfs_server argument
NFSv4: Send unmapped uid/gids to the server if the idmapper fails
NFSv4: If the server sends us a numeric uid/gid then accept it
NFSv4.1: reject zero layout with zeroed stripe unit
...
* 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: bury ->get_sb()
nfs: switch NFS from ->get_sb() to ->mount()
nfs: stop mangling ->mnt_devname on NFS
vfs: new superblock methods to override /proc/*/mount{s,info}
nfs: nfs_do_{ref,sub}mount() superblock argument is redundant
nfs: make nfs_path() work without vfsmount
nfs: store devname at disconnected NFS roots
nfs: propagate devname to nfs{,4}_get_root()
If /dev/osd* devices are shuffled because more devices
where added, and/or login order has changed. It is hard to
mount the FS you want.
Add an option to mount by osdname. osdname is any osd-device's
osdname as specified to the mkfs.exofs command when formatting
the osd-devices.
The new mount format is:
OPT="osdname=$UUID0,pid=$PID,_netdev"
mount -t exofs -o $OPT $DEV_OSD0 $MOUNTDIR
if "osdname=" is specified in options above $DEV_OSD0 is
ignored and can be empty.
Also while at it: Removed some old unused Opt_* enums.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Change the default UBIFS behavior WRT data CRC checking. Currently,
UBIFS checks data CRC when reading, which slows it down quite a bit,
and this is the default option. However, it looks like in average
user does not need this feature and would prefer faster read speed
over extra reliability. And this seems to be de-facto standard that
file-systems do not check data CRC every time they read from the
media.
Thus, make UBIFS default behavior so that it does not check data
CRC. This corresponds to the no_chk_data_crc mount option. Those users
who need extra protection can always enable it using the chk_data_crc
option.
Please, read more information about this feature here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_checksumming
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Add documentation for mount options and ioctls to
Documentation/filesystem/ext4.txt, which has not been udpated for some
time. Also add for ext4 sysfs tunables to the
Documentation/ABI/testing/sysfs-fs-ext4 file, and fix a few
typographical errors in that file.
https://bugzilla.kernel.org/show_bug.cgi?id=9423
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Since snprintf() may return a value that exceeds its second argument,
show() methods should use scnprintf() instead of snprintf(). This patch
updates the example in the sysfs documentation accordingly.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some time ago the way how sysfs stores a pointer to a kobject
corresponding to a directory was modified. This patch brings the
documentation again in sync with the implementation.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In ntfs_mft_record_alloc() when mapping the new extent mft record with
map_extent_mft_record() we overwrite @m with the return value and on
error, we then try to use the old @m but that is no longer there as @m
now contains an error code instead so we crash when dereferencing the
error code as if it were a pointer.
The simple fix is to use a temporary variable to store the return value
thus preserving the original @m for later use. This is a backport from
the commercial Tuxera-NTFS driver and is well tested...
Thanks go to Julia Lawall for pointing this out (whilst I had fixed it
in the commercial driver I had failed to fix it in the Linux kernel).
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
sanitize vfsmount refcounting changes
fix old umount_tree() breakage
autofs4: Merge the remaining dentry ops tables
Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Allow d_manage() to be used in RCU-walk mode
Remove a further kludge from __do_follow_link()
autofs4: Bump version
autofs4: Add v4 pseudo direct mount support
autofs4: Fix wait validation
autofs4: Clean up autofs4_free_ino()
autofs4: Clean up dentry operations
autofs4: Clean up inode operations
autofs4: Remove unused code
autofs4: Add d_manage() dentry operation
autofs4: Add d_automount() dentry operation
Remove the automount through follow_link() kludge code from pathwalk
CIFS: Use d_automount() rather than abusing follow_link()
NFS: Use d_automount() rather than abusing follow_link()
AFS: Use d_automount() rather than abusing follow_link()
Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
...
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.
This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.
To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.
d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.
This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).
autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
nfsd4: fix callback restarting
nfsd: break lease on unlink, link, and rename
nfsd4: break lease on nfsd setattr
nfsd: don't support msnfs export option
nfsd4: initialize cb_per_client
nfsd4: allow restarting callbacks
nfsd4: simplify nfsd4_cb_prepare
nfsd4: give out delegations more quickly in 4.1 case
nfsd4: add helper function to run callbacks
nfsd4: make sure sequence flags are set after destroy_session
nfsd4: re-probe callback on connection loss
nfsd4: set sequence flag when backchannel is down
nfsd4: keep finer-grained callback status
rpc: allow xprt_class->setup to return a preexisting xprt
rpc: keep backchannel xprt as long as server connection
rpc: move sk_bc_xprt to svc_xprt
nfsd4: allow backchannel recovery
nfsd4: support BIND_CONN_TO_SESSION
nfsd4: modify session list under cl_lock
Documentation: fl_mylease no longer exists
...
Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work. The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes
Fixed up trivial conflicts in Documentation/filesystems/porting
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.
Alternative considered:
* a setuid binary
* a daemon with CAP_SYS_RESOURCE
Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.
This patch updated from original patch based on feedback from David
Rientjes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is no way to find whether a process has locked its pages
in memory or not. And which of the memory regions are locked in memory.
Add a new field "Locked" to export this information via the smaps file.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
This patch simply adds documentation on how to handle the hole punching mode of
fallocate for any filesystem wishing to use it.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix writev() to not keep writing the first segment over and over again
instead of moving onto subsequent segments and update the NTFS entry in
MAINTAINERS to reflect that Tuxera Inc. now supports the NTFS driver.
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (36 commits)
serial: apbuart: Fixup apbuart_console_init()
TTY: Add tty ioctl to figure device node of the system console.
tty: add 'active' sysfs attribute to tty0 and console device
drivers: serial: apbuart: Handle OF failures gracefully
Serial: Avoid unbalanced IRQ wake disable during resume
tty: fix typos/errors in tty_driver.h comments
pch_uart : fix warnings for 64bit compile
8250: fix uninitialized FIFOs
ip2: fix compiler warning on ip2main_pci_tbl
specialix: fix compiler warning on specialix_pci_tbl
rocket: fix compiler warning on rocket_pci_ids
8250: add a UPIO_DWAPB32 for 32 bit accesses
8250: use container_of() instead of casting
serial: omap-serial: Add support for kernel debugger
serial: fix pch_uart kconfig & build
drivers: char: hvc: add arm JTAG DCC console support
RS485 documentation: add 16C950 UART description
serial: ifx6x60: fix memory leak
serial: ifx6x60: free IRQ on error
Serial: EG20T: add PCH_UART driver
...
Fixed up conflicts in drivers/serial/apbuart.c with evil merge that
makes the code look fairly sane (unlike either side).
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The ->trim_fs has been removed meanwhile, so remove it from the documentation
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mostly inspired by all the recent BKL removal changes, but a lot of older
updates also weren't properly recorded.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix panic after nfs_umount()
nfs: remove extraneous and problematic calls to nfs_clear_request
nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
NFS: Fix fcntl F_GETLK not reporting some conflicts
nfs: Discard ACL cache on mode update
NFS: Readdir cleanups
NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
NFS: Fix a memory leak in nfs_readdir
Call the filesystem back whenever a page is removed from the page cache
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
->releasepage() does not remove the page from the mapping.
Acked-by: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.
This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.
Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
[trondmy: cover the cases of invalidate_inode_pages2() and
truncate_inode_pages()]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If "p" is NULL then it will cause an oops when we pass it to
simple_strtoul(). In this case "p" can not be NULL so I removed the
check. I also changed the check a little to make it more explicit that
we are testing whether p points to the NUL char.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It allows users to see what consoles are currently known to the system
and with what flags.
It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.
Signed-off-by: Jiri Slaby <jslaby@suse.cz> [cleanups]
Signed-off-by: "Dr. Werner Fink" <werner@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This one was only used for a nasty hack in nfsd, which has recently
been removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This merges the staging-next tree to Linus's tree and resolves
some conflicts that were present due to changes in other trees that were
affected by files here.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possble. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).
We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppresed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Export the number of anonymous pages in a mapping via smaps.
Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.
Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
This patch renames the idmapper upcall program from nfs.upcall to nfs.idmap in
the NFS documentation. This is because the program has been renamed in the
nfs-utils source.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
net/sunrpc: Use static const char arrays
nfs4: fix channel attribute sanity-checks
NFSv4.1: Use more sensible names for 'initialize_mountpoint'
NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
NFS: client needs to maintain list of inodes with active layouts
NFS: create and destroy inode's layout cache
NFSv4.1: pnfs: filelayout: introduce minimal file layout driver
NFSv4.1: pnfs: full mount/umount infrastructure
NFS: set layout driver
NFS: ask for layouttypes during v4 fsinfo call
NFS: change stateid to be a union
NFSv4.1: pnfsd, pnfs: protocol level pnfs constants
SUNRPC: define xdr_decode_opaque_fixed
NFSD: remove duplicate NFS4_STATEID_SIZE
Documentation: Fix trivial typo in filesystems/sharedsubtree.txt
This typo is easy to ignore unless you have spent a great deal of time
thinking about how to eliminate duplicate dentries in unions.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (67 commits)
SUNRPC: Cleanup duplicate assignment in rpcauth_refreshcred
nfs: fix unchecked value
Ask for time_delta during fsinfo probe
Revalidate caches on lock
SUNRPC: After calling xprt_release(), we must restart from call_reserve
NFSv4: Fix up the 'dircount' hint in encode_readdir
NFSv4: Clean up nfs4_decode_dirent
NFSv4: nfs4_decode_dirent must clear entry->fattr->valid
NFSv4: Fix a regression in decode_getfattr
NFSv4: Fix up decode_attr_filehandle() to handle the case of empty fh pointer
NFS: Ensure we check all allocation return values in new readdir code
NFS: Readdir plus in v4
NFS: introduce generic decode_getattr function
NFS: check xdr_decode for errors
NFS: nfs_readdir_filler catch all errors
NFS: readdir with vmapped pages
NFS: remove page size checking code
NFS: decode_dirent should use an xdr_stream
SUNRPC: Add a helper function xdr_inline_peek
NFS: remove readdir plus limit
...
Allow a module implementing a layout type to register, and
have its mount/umount routines called for filesystems that
the server declares support it.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit f4a3e0bceb. Jiri
Sladby points out that the tty structure we're using may already be
gone, and Al Viro doesn't hold back in complaining about the random
loading of 'filp->private_data' which doesn't have to be a pointer at
all, nor does checking the magic field for TTY_MAGIC prove anything.
Belated review by Al:
"a) global variable depending on stdin of the last opener? Affecting
output of read(2)? Really?
b) iterator is broken; list should be locked in ->start(), unlocked in
->stop() and *NOT* unlocked/relocked in ->next()
c) ->show() ought to do nothing in case of ->device == NULL, instead
of skipping those in ->next()/->start()
d) regardless of the merits of the bright idea about asterisk at that
line in output *and* regardless of (a), the implementation is not
only atrociously ugly, it's actually very likely to be a roothole.
Verifying that Cthulhu knows what number happens to be address of a
tty_struct by blindly dereferencing memory at that address...
Ouch.
Please revert that crap."
And Christoph pipes in and NAK's the approach of walking fd tables etc
too. So it's pretty unanimous.
Noticed-by: Jri Slaby <jslaby@suse.cz>
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Werner Fink <werner@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (49 commits)
serial8250: ratelimit "too much work" error
serial: bfin_sport_uart: speed up sport RX sample rate to be 3% faster
serial: abstraction for 8250 legacy ports
serial/imx: check that the buffer is non-empty before sending it out
serial: mfd: add more baud rates support
jsm: Remove the uart port on errors
Alchemy: Add UART PM methods.
8250: allow platforms to override PM hook.
altera_uart: Don't use plain integer as NULL pointer
altera_uart: Fix missing prototype for registering an early console
altera_uart: Fixup type usage of port flags
altera_uart: Make it possible to use Altera UART and 8250 ports together
altera_uart: Add support for different address strides
altera_uart: Add support for getting mapbase and IRQ from resources
altera_uart: Add support for polling mode (IRQ-less)
serial: Factor out uart_poll_timeout() from 8250 driver
serial: mark the 8250 driver as maintained
serial: 8250: Don't delay after transmitter is ready.
tty: MAINTAINERS: add drivers/serial/jsm/ as maintained driver
vcs: invoke the vt update callback when /dev/vcs* is written to
...
Add a new file /proc/tty/consoles to be able to determine the registered
system console lines. If the reading process holds /dev/console open at
the regular standard input stream the active device will be marked by an
asterisk. Show possible operations and also decode the used flags of
the listed console lines.
Signed-off-by: Werner Fink <werner@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently, the default behavior of O_DIRECT writes was allowing
concurrent writing among nodes to the same file, with no cluster
coherency guaranteed (no EX lock held). This can leave stale data in
the cache for buffered reads on other nodes.
The new mount option introduce a chance to choose two different
behaviors for O_DIRECT writes:
* coherency=full, as the default value, will disallow
concurrent O_DIRECT writes by taking
EX locks.
* coherency=buffered, allow concurrent O_DIRECT writes
without EX lock among nodes, which
gains high performance at risk of
getting stale data on other nodes.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch creates a new idmapper system that uses the request-key function to
place a call into userspace to map user and group ids to names. The old
idmapper was single threaded, which prevented more than one request from running
at a single time. This means that a user would have to wait for an upcall to
finish before accessing a cached result.
The upcall result is stored on a keyring of type id_resolver. See the file
Documentation/filesystems/nfs/idmapper.txt for instructions.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
[Trond: fix up the return value of nfs_idmap_lookup_name and clean up code]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
smbfs has been scheduled for removal in 2.6.27, so
maybe we can now move it to drivers/staging on the
way out.
smbfs still uses the big kernel lock and nobody
is going to fix that, so we should be getting
rid of it soon.
This removes the 32 bit compat mount and ioctl
handling code, which is implemented in common fs
code, and moves all smbfs related files into
drivers/staging/smbfs.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As a convenience, introduce a kernel command line option to enable
NFSROOT debugging messages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The last user is gone, so we can safely remove this
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: fix checkpatch.pl warnings
Squashfs: fix filename typo
Squashfs: update Kconfig and documentation for LZO
Squashfs: fix block size use in LZO decompressor
Squashfs: Add LZO compression support
squashfs: fix filename in header comment
Squashfs: Make XATTR config name consistent with other file systems
squashfs: fix compiler inline warning
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is August 2012.
A warning will be printed to the kernel log if a task attempts to use this
interface. Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (45 commits)
nilfs2: reject filesystem with unsupported block size
nilfs2: avoid rec_len overflow with 64KB block size
nilfs2: simplify nilfs_get_page function
nilfs2: reject incompatible filesystem
nilfs2: add feature set fields to super block
nilfs2: clarify byte offset in super block format
nilfs2: apply read-ahead for nilfs_btree_lookup_contig
nilfs2: introduce check flag to btree node buffer
nilfs2: add btree get block function with readahead option
nilfs2: add read ahead mode to nilfs_btnode_submit_block
nilfs2: fix buffer head leak in nilfs_btnode_submit_block
nilfs2: eliminate inline keywords in btree implementation
nilfs2: get maximum number of child nodes from bmap object
nilfs2: reduce repetitive calculation of max number of child nodes
nilfs2: optimize calculation of min/max number of btree node children
nilfs2: remove redundant pointer checks in bmap lookup functions
nilfs2: get rid of nilfs_bmap_union
nilfs2: unify bmap set_target_v operations
nilfs2: get rid of nilfs_btree uses
nilfs2: get rid of nilfs_direct uses
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
workqueue: mark init_workqueues() as early_initcall()
workqueue: explain for_each_*cwq_cpu() iterators
fscache: fix build on !CONFIG_SYSCTL
slow-work: kill it
gfs2: use workqueue instead of slow-work
drm: use workqueue instead of slow-work
cifs: use workqueue instead of slow-work
fscache: drop references to slow-work
fscache: convert operation to use workqueue instead of slow-work
fscache: convert object to use workqueue instead of slow-work
workqueue: fix how cpu number is stored in work->data
workqueue: fix mayday_mask handling on UP
workqueue: fix build problem on !CONFIG_SMP
workqueue: fix locking in retry path of maybe_create_worker()
async: use workqueue for worker pool
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
workqueue: implement unbound workqueue
workqueue: prepare for WQ_UNBOUND implementation
libata: take advantage of cmwq and remove concurrency limitations
workqueue: fix worker management invocation without pending works
...
Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (30 commits)
PCI: update for owner removal from struct device_attribute
PCI: Fix warnings when CONFIG_DMI unset
PCI: Do not run NVidia quirks related to MSI with MSI disabled
x86/PCI: use for_each_pci_dev()
PCI: use for_each_pci_dev()
PCI: MSI: Restore read_msi_msg_desc(); add get_cached_msi_msg_desc()
PCI: export SMBIOS provided firmware instance and label to sysfs
PCI: Allow read/write access to sysfs I/O port resources
x86/PCI: use host bridge _CRS info on ASRock ALiveSATA2-GLAN
PCI: remove unused HAVE_ARCH_PCI_SET_DMA_MAX_SEGMENT_{SIZE|BOUNDARY}
PCI: disable mmio during bar sizing
PCI: MSI: Remove unsafe and unnecessary hardware access
PCI: Default PCIe ASPM control to on and require !EMBEDDED to disable
PCI: kernel oops on access to pci proc file while hot-removal
PCI: pci-sysfs: remove casts from void*
ACPI: Disable ASPM if the platform won't provide _OSC control for PCIe
PCI hotplug: make sure child bridges are enabled at hotplug time
PCI hotplug: shpchp: Removed check for hotplug of display devices
PCI hotplug: pciehp: Fixed return value sign for pciehp_unconfigure_device
PCI: Don't enable aspm before drivers have had a chance to veto it
...
Update compression types supported and add some help text for
the LZO Kconfig option.
Also add missing "default n" line and make some trivial whitespace
cleanups too.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Fix all discrepancies I know of between the sysfs implementation and its
documentation.
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
Below you will find an updated version from the original series bunching all patches into one big patch
updating broken web addresses that are located in Documentation/*
Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
Now there are also some addresses pointing to .spec files some are located, but some(after searching
on the companies site)where still no where to be found. In this case I just changed the address
to the company site this way the users can contact the company and they can locate them for the users.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Thomas Weber <weber@corscience.de>
Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
Cc: Paulo Marques <pmarques@grupopie.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
PCI sysfs resource files currently only allow mmap'ing. On x86 this
works fine for memory backed BARs, but doesn't work at all for I/O
port backed BARs. Add read/write to I/O port PCI sysfs resource
files to allow userspace access to these device regions.
Acked-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op. Warn the users about this and remove
the mount flag for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Nilfs has "discard" mount option which issues discard/TRIM commands to
underlying block device, but it lacks a complementary option and has
no way to disable the feature through remount.
This adds "nodiscard" option to resolve this imbalance.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs enables write barriers by default and has "nobarrier" mount
option to disable this feature. But it lacks the complementary option
and has no way to re-enable the feature on remount.
This adds "barrier" option to resolve this imbalance.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Make fscache object state transition callbacks use workqueue instead
of slow-work. New dedicated unbound CPU workqueue fscache_object_wq
is created. get/put callbacks are renamed and modified to take
@object and called directly from the enqueue wrapper and the work
function. While at it, make all open coded instances of get/put to
use fscache_get/put_object().
* Unbound workqueue is used.
* work_busy() output is printed instead of slow-work flags in object
debugging outputs. They mean basically the same thing bit-for-bit.
* sysctl fscache.object_max_active added to control concurrency. The
default value is nr_cpus clamped between 4 and
WQ_UNBOUND_MAX_ACTIVE.
* slow_work_sleep_till_thread_needed() is replaced with fscache
private implementation fscache_object_sleep_till_congested() which
waits on fscache_object_wq congestion.
* debugfs support is dropped for now. Tracing API based debug
facility is planned to be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Now it's possible to update the DNS record for $HOST_NAME with
ip=::::$HOST_NAME::dhcp
CC: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
the event of write failure after allocation, so this must be performed
in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The inode's i_size is not protected by the big kernel lock. Therefore it
does not make sense to recommend taking the BKL in filesystems llseek
operations. Instead it should use the inode's mutex or use just use
i_size_read() instead. Add a note that this is not protecting
file->f_pos.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
squashfs: update documentation to include description of xattr layout
squashfs: fix name reading in squashfs_xattr_get
squashfs: constify xattr handlers
squashfs: xattr fix sparse warnings
squashfs: xattr_lookup sparse fix
squashfs: add xattr support configure option
squashfs: add new extended inode types
squashfs: add support for xattr reading
squashfs: add xattr id support
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: Ensure inode allocation buffers are fully replayed
xfs: enable background pushing of the CIL
xfs: forced unmounts need to push the CIL
xfs: Introduce delayed logging core code
xfs: Delayed logging design documentation
xfs: Improve scalability of busy extent tracking
xfs: make the log ticket ID available outside the log infrastructure
xfs: clean up log ticket overrun debug output
xfs: Clean up XFS_BLI_* flag namespace
xfs: modify buffer item reference counting
xfs: allow log ticket allocation to take allocation flags
xfs: Don't reuse the same transaction ID for duplicated transactions.
Update Documentation/filesystems/tmpfs.txt to describe the interaction of
tmpfs mount option memory policy with tasks' cpuset mems_allowed.
Note: the mount(8) man page [in the util-linux-ng package] requires
similiar updates.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Document the design of the delayed logging implementation. This
includes assumptions made, dead ends followed, the reasoning behind
the structuring of the code, the layout of various structures, how
things fit together, traps and pit-falls avoided, etc. This is all
too much to document in the code itself, so do it in a separate
file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (31 commits)
dquot: Detect partial write error to quota file in write_blk() and add printk_ratelimit for quota error messages
ocfs2: Fix lock inversion in quotas during umount
ocfs2: Use __dquot_transfer to avoid lock inversion
ocfs2: Fix NULL pointer deref when writing local dquot
ocfs2: Fix estimate of credits needed for quota allocation
ocfs2: Fix quota locking
ocfs2: Avoid unnecessary block mapping when refreshing quota info
ocfs2: Do not map blocks from local quota file on each write
quota: Refactor dquot_transfer code so that OCFS2 can pass in its references
quota: unify quota init condition in setattr
quota: remove sb_has_quota_active in get/set_info
quota: unify ->set_dqblk
quota: unify ->get_dqblk
ext3: make barrier options consistent with ext4
quota: Make quota stat accounting lockless.
suppress warning: "quotatypes" defined but not used
ext3: Fix waiting on transaction during fsync
jbd: Provide function to check whether transaction will issue data barrier
ufs: add ufs speciffic ->setattr call
BKL: Remove BKL from ext2 filesystem
...
ext4 was updated to accept barrier/nobarrier mount options
in addition to the older barrier=0/1. The barrier story
is complex enough, we should help people by making the options
the same at least, even if the defaults are different.
This patch allows the barrier/nobarrier mount options for ext3,
while keeping nobarrier the default.
It also unconditionally displays barrier status in show_options,
and prints a message at mount time if barriers are not enabled,
just as ext4 does.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The first three paragraphs are almost verbatim taken from Eric's
commit message on the patch introducing network ns tags. The next
two paragraphs I wrote to be a brief high level overview. The last
section is taken from the commit message on "Implement sysfs tagged
directory support", but updated. Hopefully correctly.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (23 commits)
nilfs2: disallow remount of snapshot from/to a regular mount
nilfs2: use huge_encode_dev/huge_decode_dev
nilfs2: update comment on deactivate_super at nilfs_get_sb
nilfs2: replace MS_VERBOSE with MS_SILENT
nilfs2: add missing initialization of s_mode
nilfs2: fix misuse of open_bdev_exclusive/close_bdev_exclusive
nilfs2: enlarge s_volume_name member in nilfs_super_block
nilfs2: use checkpoint number instead of timestamp to select super block
nilfs2: add missing endian conversion on super block magic number
nilfs2: make nilfs_sc_*_ops static
nilfs2: add kernel doc comments to persistent object allocator functions
nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info
nilfs2: remove nilfs_segctor_init() in segment.c
nilfs2: insert checkpoint number in segment summary header
nilfs2: add a print message after loading nilfs2
nilfs2: cleanup multi kmem_cache_{create,destroy} code
nilfs2: move out checksum routines to segment buffer code
nilfs2: move pointer to super root block into logs
nilfs2: change default of 'errors' mount option to 'remount-ro' mode
nilfs2: Combine nilfs_btree_release_path() and nilfs_btree_free_path()
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
ocfs2: Silence a gcc warning.
ocfs2: Don't retry xattr set in case value extension fails.
ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
fs/ocfs2/dlm: Use kstrdup
fs/ocfs2/dlm: Drop memory allocation cast
Ocfs2: Optimize punching-hole code.
Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
ocfs2: Wrap signal blocking in void functions.
ocfs2/dlm: Increase o2dlm lockres hash size
ocfs2: Make ocfs2_extend_trans() really extend.
ocfs2/trivial: Code cleanup for allocation reservation.
ocfs2: make ocfs2_adjust_resv_from_alloc simple.
ocfs2: Make nointr a default mount option
ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2net: log socket state changes
ocfs2: print node # when tcp fails
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
* 'for-2.6.35' of git://linux-nfs.org/~bfields/linux: (45 commits)
Revert "nfsd4: distinguish expired from stale stateids"
nfsd: safer initialization order in find_file()
nfs4: minor callback code simplification, comment
NFSD: don't report compiled-out versions as present
nfsd4: implement reclaim_complete
nfsd4: nfsd4_destroy_session must set callback client under the state lock
nfsd4: keep a reference count on client while in use
nfsd4: mark_client_expired
nfsd4: introduce nfs4_client.cl_refcount
nfsd4: refactor expire_client
nfsd4: extend the client_lock to cover cl_lru
nfsd4: use list_move in move_to_confirmed
nfsd4: fold release_session into expire_client
nfsd4: rename sessionid_lock to client_lock
nfsd4: fix bare destroy_session null dereference
nfsd4: use local variable in nfs4svc_encode_compoundres
nfsd: further comment typos
sunrpc: centralise most calls to svc_xprt_received
nfsd4: fix unlikely race in session replay case
nfsd4: fix filehandle comment
...
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Clear CPU mask in affinity_hint when none is provided
genirq: Add CPU mask affinity hint
genirq: Remove IRQF_DISABLED from core code
genirq: Run irq handlers with interrupts disabled
genirq: Introduce request_any_context_irq()
genirq: Expose irq_desc->node in proc/irq
Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt
This is a mandatory operation. Also, here (not in open) is where we
should be committing the reboot recovery information.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.
Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.
Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.
Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.
Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.
This patch nearly undoes the effect of all these patches.
The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.
Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.
I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.
I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the introduced i_sem by an i_mutex in the filesystem locking
documentation. This was introduced [1] after all occurrences were
already replaced in the same text [2]. However, the term "inode
semaphore" has not been replaced then, and it's replaced now.
[1] afddba49d1
[2] a7bc02f4f4
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Like ext3, nilfs has 'errors' mount option to allow specifying desired
behavior on severe errors.
Currently, the default action is 'errors=continue' and has potential
to advance filesystem corruption for severe errors.
This will change the action to 'errors=remount-ro' to avoid the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:
resv_level=0 70%
resv_level=1 21%
resv_level=2 23%
resv_level=3 24%
resv_level=4 60%
resv_level=5 did not test
resv_level=6 60%
resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.
This patch also change the behavior of directory reservations - they now
track file reservations. The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix obvious cases of "it's" being used when "its" was meant.
Signed-off-by: Francis Galiegue <fgaliegue@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch adds documentation for new 9P options introduced in
2.6.34.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
New documentation should have an entry in the 00-INDEX. Correct git
urls.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
commit 3f226aa1c (mempolicy: support mpol=local tmpfs mount option) added
new mpol=local mount option. but it didn't add a documentation.
This patch does it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expose irq_desc->node as /proc/irq/*/node.
This file provides device hardware locality information for apps
desiring to include hardware locality in irq mapping decisions.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
ceph: update for write_inode API change
ceph: reset osd after relevant messages timed out
ceph: fix flush_dirty_caps race with caps migration
ceph: include migrating caps in issued set
ceph: fix osdmap decoding when pools include (removed) snaps
ceph: return EBADF if waiting for caps on closed file
ceph: set osd request message front length correctly
ceph: reset front len on return to msgpool; BUG on mismatched front iov
ceph: fix snaptrace decoding on cap migration between mds
ceph: use single osd op reply msg
ceph: reset bits on connection close
ceph: remove bogus mds forward warning
ceph: remove fragile __map_osds optimization
ceph: fix connection fault STANDBY check
ceph: invalidate_authorizer without con->mutex held
ceph: don't clobber write return value when using O_SYNC
ceph: fix client_request_forward decoding
ceph: drop messages on unregistered mds sessions; cleanup
ceph: fix comments, locking in destroy_inode
ceph: move dereference after NULL test
...
Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
doc: fix typo in comment explaining rb_tree usage
Remove fs/ntfs/ChangeLog
doc: fix console doc typo
doc: cpuset: Update the cpuset flag file
Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
Remove drivers/parport/ChangeLog
Remove drivers/char/ChangeLog
doc: typo - Table 1-2 should refer to "status", not "statm"
tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
devres/irq: Fix devm_irq_match comment
Remove reference to kthread_create_on_cpu
tree-wide: Assorted spelling fixes
tree-wide: fix 'lenght' typo in comments and code
drm/kms: fix spelling in error message
doc: capitalization and other minor fixes in pnp doc
devres: typo fix s/dev/devm/
Remove redundant trailing semicolons from macros
fix typo "definetly" -> "definitely" in comment
tree-wide: s/widht/width/g typo in comments
...
Fix trivial conflict in Documentation/laptops/00-INDEX
Make dnotify_test.c source file and add it to Makefile so that
bitrot can be prevented.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs:
[LogFS] Change magic number
[LogFS] Remove h_version field
[LogFS] Check feature flags
[LogFS] Only write journal if dirty
[LogFS] Fix bdev erases
[LogFS] Silence gcc
[LogFS] Prevent 64bit divisions in hash_index
[LogFS] Plug memory leak on error paths
[LogFS] Add MAINTAINERS entry
[LogFS] add new flash file system
Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in
fs/logfs/inode.c introduced by write_inode() being changed to use
writeback_control' by commit a9185b41a4
("pass writeback_control to ->write_inode")
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd4: fix minor memory leak
svcrpc: treat uid's as unsigned
nfsd: ensure sockets are closed on error
Revert "sunrpc: move the close processing after do recvfrom method"
Revert "sunrpc: fix peername failed on closed listener"
sunrpc: remove unnecessary svc_xprt_put
NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN
xfs_export_operations.commit_metadata
commit_metadata export operation replacing nfsd_sync_dir
lockd: don't clear sm_monitored on nsm_reboot_lookup
lockd: release reference to nsm_handle in nlm_host_rebooted
nfsd: Use vfs_fsync_range() in nfsd_commit
NFSD: Create PF_INET6 listener in write_ports
SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found"
SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt()
NFSD: Support AF_INET6 in svc_addsock() function
SUNRPC: Use rpc_pton() in ip_map_parse()
nfsd: 4.1 has an rfc number
nfsd41: Create the recovery entry for the NFSv4.1 client
nfsd: use vfs_fsync for non-directories
...
A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.
Besides we can count the number of swapents per a process by scanning
/proc/<pid>/smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)
This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc/<pid>/status file as
[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <=============== this.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.
This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.
Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.
rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...
Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.
Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.
Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not. Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
init: Open /dev/console from rootfs
mqueue: fix typo "failues" -> "failures"
mqueue: only set error codes if they are really necessary
mqueue: simplify do_open() error handling
mqueue: apply mathematics distributivity on mq_bytes calculation
mqueue: remove unneeded info->messages initialization
mqueue: fix mq_open() file descriptor leak on user-space processes
fix race in d_splice_alias()
set S_DEAD on unlink() and non-directory rename() victims
vfs: add NOFOLLOW flag to umount(2)
get rid of ->mnt_parent in tomoyo/realpath
hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Mirror MS_KERNMOUNT in ->mnt_flags
get rid of useless vfsmount_lock use in put_mnt_ns()
Take vfsmount_lock to fs/internal.h
get rid of insanity with namespace roots in tomoyo
take check for new events in namespace (guts of mounts_poll()) to namespace.c
Don't mess with generic_permission() under ->d_lock in hpfs
sanitize const/signedness for udf
nilfs: sanitize const/signedness in dealing with ->d_name.name
...
Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
* document locking
* add the missing part of data structure invariants (relationship
between mnt_share and mnt_slave lists in case of a peer group
among slaves).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: add reader's lock for cno in nilfs_ioctl_sync
nilfs2: delete unnecessary condition in load_segment_summary
nilfs2: move iterator to write log into segment buffer
nilfs2: get rid of s_dirt flag use
nilfs2: get rid of nilfs_segctor_req struct
nilfs2: delete unnecessary condition in nilfs_dat_translate
nilfs2: fix potential hang in nilfs_error on errors=remount-ro
nilfs2: use mnt_want_write in ioctls where write access is needed
nilfs2: issue discard request after cleaning segments
Fixes a typo in proc.txt documentation. Table 1-2 should refer to
"status", not "statm"
Signed-off-by: Mulyadi Santosa <mulyadi.santosa@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This adds a function to send discard requests for given array of
segment numbers, and calls the function when garbage collection
succeeded.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status. But it cause large
performance regression. Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component. Because both to
take mmap_sem and page table walk are heavily operation.
If many process run, the ps performance is,
[before d899bf7b]
% perf stat ps >/dev/null
Performance counter stats for 'ps':
4090.435806 task-clock-msecs # 0.032 CPUs
229 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
234 page-faults # 0.000 M/sec
8587565207 cycles # 2099.425 M/sec
9866662403 instructions # 1.149 IPC
3789415411 cache-references # 926.409 M/sec
30419509 cache-misses # 7.437 M/sec
128.859521955 seconds time elapsed
[after d899bf7b]
% perf stat ps > /dev/null
Performance counter stats for 'ps':
4305.081146 task-clock-msecs # 0.028 CPUs
480 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
237 page-faults # 0.000 M/sec
9021211334 cycles # 2095.480 M/sec
10605887536 instructions # 1.176 IPC
3612650999 cache-references # 839.160 M/sec
23917502 cache-misses # 5.556 M/sec
152.277819582 seconds time elapsed
Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.
Commit d899bf7b introduced two features:
1) Add the annotattion of [thread stack: xxxx] mark to
/proc/{pid}/task/{tid}/maps.
2) Add StackUsage field to /proc/{pid}/status.
I only revert (2), because I haven't seen (1) cause regression.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: update mailing list address
nilfs2: Storage class should be before const qualifier
nilfs2: trivial coding style fix
This replaces the list address for nilfs discussion to linux-nilfs at
vger.kernel.org from users at nilfs.org.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Per commit 240799cd, the option name for readahead should be
inode_readahead_blks, not inode_readahead.
Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Many struct driver_attribute descriptors are purely read-only
structures, and there's no need to change them. Therefore make
the promise not to, which will let those descriptors be put in
a ro section.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Most device_attributes are const, and are begging to be
put in a ro section. However, the create and remove
file interfaces were failing to propagate the const promise
which the only functions they call offer.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: (42 commits)
nfsd: remove pointless paths in file headers
nfsd: move most of nfsfh.h to fs/nfsd
nfsd: remove unused field rq_reffh
nfsd: enable V4ROOT exports
nfsd: make V4ROOT exports read-only
nfsd: restrict filehandles accepted in V4ROOT case
nfsd: allow exports of symlinks
nfsd: filter readdir results in V4ROOT case
nfsd: filter lookup results in V4ROOT case
nfsd4: don't continue "under" mounts in V4ROOT case
nfsd: introduce export flag for v4 pseudoroot
nfsd: let "insecure" flag vary by pseudoflavor
nfsd: new interface to advertise export features
nfsd: Move private headers to source directory
vfs: nfsctl.c un-used nfsd #includes
lockd: Remove un-used nfsd headers #includes
s390: remove un-used nfsd #includes
sparc: remove un-used nfsd #includes
parsic: remove un-used nfsd #includes
compat.c: Remove dependence on nfsd private headers
...
Using create_proc_entry() + ->proc_fops assignment is racy because
->proc_fops will be NULL for some time, use proc_create() to avoid race.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.
However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.
This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Mike Fulton <fultonm@ca.ibm.com>
Cc: Sean Foley <Sean_Foley@ca.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
ext3: Fix data / filesystem corruption when write fails to copy data
ext4: Support for 64-bit quota format
ext3: Support for vfsv1 quota format
quota: Implement quota format with 64-bit space and inode limits
quota: Move definition of QFMT_OCFS2 to linux/quota.h
ext2: fix comment in ext2_find_entry about return values
ext3: Unify log messages in ext3
ext2: clear uptodate flag on super block I/O error
ext2: Unify log messages in ext2
ext3: make "norecovery" an alias for "noload"
ext3: Don't update the superblock in ext3_statfs()
ext3: journal all modifications in ext3_xattr_set_handle
ext2: Explicitly assign values to on-disk enum of filetypes
quota: Fix WARN_ON in lookup_one_len
const: struct quota_format_ops
ubifs: remove manual O_SYNC handling
afs: remove manual O_SYNC handling
kill wait_on_page_writeback_range
vfs: Implement proper O_SYNC semantics
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (49 commits)
nilfs2: separate wait function from nilfs_segctor_write
nilfs2: add iterator for segment buffers
nilfs2: hide nilfs_write_info struct in segment buffer code
nilfs2: relocate io status variables to segment buffer
nilfs2: do not return io error for bio allocation failure
nilfs2: use list_splice_tail or list_splice_tail_init
nilfs2: replace mark_inode_dirty as nilfs_mark_inode_dirty
nilfs2: delete mark_inode_dirty in nilfs_delete_entry
nilfs2: delete mark_inode_dirty in nilfs_commit_chunk
nilfs2: change return type of nilfs_commit_chunk
nilfs2: split nilfs_unlink as nilfs_do_unlink and nilfs_unlink
nilfs2: delete redundant mark_inode_dirty
nilfs2: expand inode_inc_link_count and inode_dec_link_count
nilfs2: delete mark_inode_dirty from nilfs_set_link
nilfs2: delete mark_inode_dirty in nilfs_new_inode
nilfs2: add norecovery mount option
nilfs2: add helper to get if volume is in a valid state
nilfs2: move recovery completion into load_nilfs function
nilfs2: apply readahead for recovery on mount
nilfs2: clean up get/put function of a segment usage
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
ext4: Do not override ext2 or ext3 if built they are built as modules
jbd2: Export jbd2_log_start_commit to fix ext4 build
ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
ext4: Wait for proper transaction commit on fsync
ext4: fix incorrect block reservation on quota transfer.
ext4: quota macros cleanup
ext4: ext4_get_reserved_space() must return bytes instead of blocks
ext4: remove blocks from inode prealloc list on failure
ext4: wait for log to commit when umounting
ext4: Avoid data / filesystem corruption when write fails to copy data
ext4: Use ext4 file system driver for ext2/ext3 file system mounts
ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
ext4: remove unused parameter wbc from __ext4_journalled_writepage()
ext4: remove encountered_congestion trace
ext4: move_extent_per_page() cleanup
ext4: initialize moved_len before calling ext4_move_extents()
ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
ext4: use ext4_data_block_valid() in ext4_free_blocks()
...
Users on the list recently complained about differences across
filesystems w.r.t. how to mount without a journal replay.
In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext3.
Also show this status in /proc/mounts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Add exofs.txt to filesystems Documentation index and fix some typos,
identation and grammar.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
the description in Documentation/filesystems/proc.txt of the
procs_running entry in /proc/stat is confusing (according to that
description, it looks as if procs_running could only be a number
between 0 and the number of CPUs).
Changed it to a more accurate description in the patch attached.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (31 commits)
FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()
CacheFiles: Don't log lookup/create failing with ENOBUFS
CacheFiles: Catch an overly long wait for an old active object
CacheFiles: Better showing of debugging information in active object problems
CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
CacheFiles: Handle truncate unlocking the page we're reading
CacheFiles: Don't write a full page if there's only a partial page to cache
FS-Cache: Actually requeue an object when requested
FS-Cache: Start processing an object's operations on that object's death
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
FS-Cache: Add a retirement stat counter
FS-Cache: Handle pages pending storage that get evicted under OOM conditions
FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache: Don't delete pending pages from the page-store tracking tree
FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache: The object-available state can't rely on the cookie to be available
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Trivial cleanup of jbd compatibility layer removal
ocfs2: Refresh documentation
ocfs2: return f_fsid info in ocfs2_statfs()
ocfs2: duplicate inline data properly during reflink.
ocfs2: Move ocfs2_complete_reflink to the right place.
ocfs2: Return -EINVAL when a device is not ocfs2.
This adds "norecovery" mount option which disables temporal write
access to read-only mounts or snapshots during mount/recovery.
Without this option, write access will be even performed for those
types of mounts; the temporal write access is needed to mount root
file system read-only after an unclean shutdown.
This option will be helpful when user wants to prevent any write
access to the device.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Eric Sandeen <sandeen@redhat.com>
Since most of fs using nofoobar style option,
modified barrier=off option as nobarrier.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.
In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.
Also show this status in /proc/mounts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues. Make
this mount-time optional, and default it to off until we know
that things are working out OK.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Catch an overly long wait for an old, dying active object when we want to
replace it with a new one. The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.
What we do instead is:
(1) if there's nothing in the slow work queue, we sleep until either the dying
object has finished dying or there is something in the slow work queue
behind which we can queue our object.
(2) if there is something in the slow work queue, we return ETIMEDOUT to
fscache_lookup_object(), which then puts us back on the slow work queue,
presumably behind the deletion that we're blocked by. We are then
deferred for a while until we work our way back through the queue -
without blocking a slow-work thread unnecessarily.
A backtrace similar to the following may appear in the log without this patch:
INFO: task kslowd004:5711 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd004 D 0000000000000000 0 5711 2 0x00000080
ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
Call Trace:
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
[<ffffffff81353153>] __wait_on_bit+0x43/0x76
[<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
[<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
[<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
[<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
[<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
1 lock held by kslowd004/5711:
#0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]
Signed-off-by: David Howells <dhowells@redhat.com>
Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.
Furthermore, make sure that read and allocation operations handle being woken
up on a dead object. Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.
The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.
Signed-off-by: David Howells <dhowells@redhat.com>
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache. Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.
The problem is typified by the following trace of a stuck process:
kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
In the above backtrace, the following is happening:
(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).
(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).
(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.
(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.
(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).
(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).
(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).
(8) This blocks a slow-work processing thread - possibly against itself.
The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.
To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.
The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache doesn't correctly handle the netfs requesting a read from the cache
on an object that failed or was withdrawn by the cache. A trace similar to
the following might be seen:
CacheFiles: Lookup failed error -105
[exe ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING]
[exe ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING]
[exe ] objflags=0
[exe ] objevent=9 [fffffffffffffffb]
[exe ] ops=0 inp=0 exc=0
Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50
Call Trace:
[<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache]
[<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache]
[<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs]
[<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs]
[<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf
[<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs]
[<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec
[<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec
[<ffffffff810903bd>] ra_submit+0x1c/0x20
[<ffffffff810906bb>] ondemand_readahead+0x227/0x23a
[<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19
[<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0
[<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs]
[<ffffffff810b2fa2>] do_sync_read+0xe3/0x120
[<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f
[<ffffffff81352bdb>] ? thread_return+0x3e/0x101
[<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13
[<ffffffff810b3b06>] vfs_read+0xaa/0x16f
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff810b3c84>] sys_read+0x45/0x6c
[<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1b
The object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.
This should be handled by simply rejecting the new operation with ENOBUFS.
There's no need to log an error for it. Events of this type now appear in the
stats file under Ops:rej.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state. Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.
Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock. Cache operations, on the other
hand, approach from the object side, and get the object lock first. It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:
(1) increment the cookie usage counter, drop the object lock and then get both
locks in order, or
(2) simply hold the object lock as certain parts of the cookie may not be
altered whilst the object lock is held.
It is also not permitted to follow either pointer without holding the lock at
the end you start with. To break the pointers between the cookie and the
object, both locks must be held.
fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it. This is so that it can access the pending page store tree without
interference from __fscache_write_page().
This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.
The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.
Signed-off-by: David Howells <dhowells@redhat.com>
Permit the operations to retrieve data from the cache or to allocate space in
the cache for future writes to be interrupted whilst they're waiting for
permission for the operation to proceed. Typically this wait occurs whilst the
cache object is being looked up on disk in the background.
If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the read operation of the netfs routine with none of the requested pages
having been read or in any way marked as known by the cache.
This means that the initial wait is done interruptibly rather than
uninterruptibly.
In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.
Signed-off-by: David Howells <dhowells@redhat.com>
Count entries to and exits from cache operation table functions. Maintain
these as a single counter that's added to or removed from as appropriate.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow the current state of all fscache objects to be dumped by doing:
cat /proc/fs/fscache/objects
By default, all objects and all fields will be shown. This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):
keyctl add user fscache:objlist "<restrictions>" @s
The <restrictions> are:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
And paired restrictions:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have slow work queued
s Show objects that don't have slow work queued
If neither side of a restriction pair is given, then both are implied. For
example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
Signed-off-by: David Howells <dhowells@redhat.com>
This reverts commit d0646f7b63, as
requested by Eric Sandeen.
It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.
Quoth Eric:
"My hand-wavy hunch about what is happening is that we're finding a
bad checksum on the last partially-written transaction, which is
not surprising, but if we have a wrapped log and we're doing the
initial scan for head/tail, and we abort scanning on that bad
checksum, then we are essentially running an unrecovered filesystem.
But that's hand-wavy and I need to go look at the code.
We lived without journal checksums on by default until now, and at
this point they're doing more harm than good, so we should revert
the default-changing commit until we can fix it and do some good
power-fail testing with the fixes in place."
See
http://bugzilla.kernel.org/show_bug.cgi?id=14354
for all the gory details.
Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're adding enough nfs documentation that it may as well have its own
subdirectory.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.
This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.
And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.
The original discussions can be found here:
http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.htmlhttp://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html
Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256314810-7897-1-git-send-email-ozaki.ryota@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix time encoding with extra epoch bits
ext4: Add a stub for mpage_da_data in the trace header
jbd2: Use tracepoints for history file
ext4: Use tracepoints for mb_history trace file
ext4, jbd2: Drop unneeded printks at mount and unmount time
ext4: Handle nested ext4_journal_start/stop calls without a journal
ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
ext4: Avoid updating the inode table bh twice in no journal mode
ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
ext4: async direct IO for holes and fallocate support
ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
ext4: Split uninitialized extents for direct I/O
ext4: release reserved quota when block reservation for delalloc retry
ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
ext4: Fix hueristic which avoids group preallocation for closed files
ext4: Use ext4_msg() for ext4_da_writepage() errors
ext4: Update documentation about quota mount options
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
fat: Check s_dirt in fat_sync_fs()
vfat: change the default from shortname=lower to shortname=mixed
fat/nls: Fix handling of utf8 invalid char
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.
By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ca_maxresponsesize and ca_maxrequest size include the RPC header.
sv_max_mesg is sv_max_payolad plus a page for overhead and is used in
svc_init_buffer to allocate server buffer space for both the request and reply.
Note that this means we can service an RPC compound that requires
ca_maxrequestsize (MAXWRITE) or ca_max_responsesize (MAXREAD) but that we do
not support an RPC compound that requires both ca_maxrequestsize and
ca_maxresponsesize.
Signed-off-by: Andy Adamson <andros@netapp.com>
[bfields@citi.umich.edu: more documentation updates]
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Documentation/filesystems/sharedsubtree.txt needs updating because the
mount command in util-linux package is well aware of shared subtree
features now. The patch also fixes two typos in sharedsubtree.txt.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mount(8) handles shared subtrees just fine, so remove the smount program
from Documentation/filesystems/sharedsubtree.txt.
Fix annoying "Lets" -> "Let's".
Insert space between '#' prompt and "mount" command.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the documentation to describe FS-Cache related
caching parameters. This patch also updates the pointers
to 9p-related papers and adds pointer to the Wiki.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.
Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.
There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.
A sample output of /proc/<pid>/task/<tid>/maps looks like:
08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.
A sample output of /proc/self/status looks like:
Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB
I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.
[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits)
nfsd4: nfsv4 clients should cross mountpoints
nfsd: revise 4.1 status documentation
sunrpc/cache: avoid variable over-loading in cache_defer_req
sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req
nfsd: return success for non-NFS4 nfs4_state_start
nfsd41: Refactor create_client()
nfsd41: modify nfsd4.1 backchannel to use new xprt class
nfsd41: Backchannel: Implement cb_recall over NFSv4.1
nfsd41: Backchannel: cb_sequence callback
nfsd41: Backchannel: Setup sequence information
nfsd41: Backchannel: Server backchannel RPC wait queue
nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
nfsd41: Backchannel: callback infrastructure
nfsd4: use common rpc_cred for all callbacks
nfsd4: allow nfs4 state startup to fail
SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous
nfsd4: fix null dereference creating nfsv4 callback client
nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel
sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
oom-killer kills a process, not task. Then oom_score should be calculated
as per-process too. it makes consistency more and makes speed up
select_bad_process().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing. This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).
The patch adds anonymous and file backed filters to the clear_refs interface.
echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only
Any other value is ignored
Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We added a new column in cpuX lines of /proc/stat, to show the amount of
time spent by a cpu servicing a guest, without updating
Documentation/filesystems/proc.txt
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some small updates, a caveat about the minorversion control interface,
and an attempt to put missing features in context.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Truncating metadata pages is not safe right now before
we haven't audited all file systems.
To enable truncation only for data address space define
a new address_space callback error_remove_page.
This is used for memory_failure.c memory error handling.
This can be then set to truncate_inode_page()
This patch just defines the new operation and adds documentation.
Callers and users come in followon patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: Whitespace fixes
GFS2: Remove unused sysfs file
GFS2: Be extra careful about deallocating inodes
GFS2: Remove no_formal_ino generating code
GFS2: Rename eattr.[ch] as xattr.[ch]
GFS2: Clean up of extended attribute support
GFS2: Add explanation of extended attr on-disk format
GFS2: Add "-o errors=panic|withdraw" mount options
GFS2: jumping to wrong label?
GFS2: free disk inode which is deleted by remote node -V2
GFS2: Add a document explaining GFS2's uevents
GFS2: Add sysfs link to device
GFS2: Replace assertion with proper error handling
GFS2: Improve error handling in inode allocation
GFS2: Add some more info to uevents
GFS2: Add online uevent to GFS2
Small error in the "dd" command example, "out=" should be "of=".
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: update documentation pointers
9p: remove unnecessary v9fses->options which duplicates the mount string
net/9p: insulate the client against an invalid error code sent by a 9p server
9p: Add missing cast for the error return value in v9fs_get_inode
9p: Remove redundant inode uid/gid assignment
9p: Fix possible regressions when ->get_sb fails.
9p: Fix v9fs show_options
9p: Fix possible memleak in v9fs_inode_from fid.
9p: minor comment fixes
9p: Fix possible inode leak in v9fs_get_inode.
9p: Check for error in return value of v9fs_fid_add
The NFSv4 and NFSv4.1 protocols both allow for the redirection of a client
from one server to another in order to support filesystem migration and
replication. For full protocol support, we need to add the ability to
convert a DNS host name into an IP address that we can feed to the RPC
client.
We'll reuse the sunrpc cache, now that it has been converted to work with
rpc_pipefs.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix some issues with the AFS documentation, found when testing AFS on ppc64:
- Update AFS features: reading/writing, local caching
- Typo in kafs sysfs debug file
- Use modprobe instead of insmod in example
- Update IPs for grand.central.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct. It was a very good first step for sanitize OOM.
However Paul Menage reported the commit makes regression to his job
scheduler. Current OOM logic can kill OOM_DISABLED process.
Why? His program has the code of similar to the following.
...
set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
...
if (vfork() == 0) {
set_oom_adj(0); /* Invoked child can be killed */
execve("foo-bar-cmd");
}
....
vfork() parent and child are shared the same mm_struct. then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.
Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.
Then, this patch revert commit 2ff05b2b and related commit.
Reverted commit list
---------------------
- commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b57 (mm: copy over oom_adj value at fork time)
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will be essential reading for anybody who wants to
understand how GFS2 interacts with the userland gfs_controld,
and the details of recovery.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Because, with "shortname=lower", copying one FAT filesystem tree to
another FAT filesystem tree using Linux results in semantically
different filesystems. (E.g.: Filenames which were once "all
uppercase" are now "all lowercase").
So, this changes the default of "shortname=lower" to "shortname=mixed".
Signed-off-by: Paul Wise <pabs3@bonedaddy.net>
[change fat_show_options()]
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
The original text suggested that sysfs is mandatory and always
compiled in the kernel.
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The rules for locking in many superblock operations has changed
significantly, so update the documentation for it. Also correct some
older updates and ommissions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.31' of git://fieldses.org/git/linux-nfsd: (60 commits)
SUNRPC: Fix the TCP server's send buffer accounting
nfsd41: Backchannel: minorversion support for the back channel
nfsd41: Backchannel: cleanup nfs4.0 callback encode routines
nfsd41: Remove ip address collision detection case
nfsd: optimise the starting of zero threads when none are running.
nfsd: don't take nfsd_mutex twice when setting number of threads.
nfsd41: sanity check client drc maxreqs
nfsd41: move channel attributes from nfsd4_session to a nfsd4_channel_attr struct
NFS: kill off complicated macro 'PROC'
sunrpc: potential memory leak in function rdma_read_xdr
nfsd: minor nfsd_vfs_write cleanup
nfsd: Pull write-gathering code out of nfsd_vfs_write
nfsd: track last inode only in use_wgather case
sunrpc: align cache_clean work's timer
nfsd: Use write gathering only with NFSv2
NFSv4: kill off complicated macro 'PROC'
NFSv4: do exact check about attribute specified
knfsd: remove unreported filehandle stats counters
knfsd: fix reply cache memory corruption
knfsd: reply cache cleanups
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: clean up jbd2_journal_try_to_free_buffers()
ext4: Don't update ctime for non-extent-mapped inodes
ext4: Fix up whitespace issues in fs/ext4/inode.c
ext4: Fix 64-bit block type problem on 32-bit platforms
ext4: teach the inode allocator to use a goal inode number
ext4: Use a hash of the topdir directory name for the Orlov parent group
ext4: document the "abort" mount option
ext4: move the abort flag from s_mount_opts to s_mount_flags
ext4: update the s_last_mounted field in the superblock
ext4: change s_mount_opt to be an unsigned int
ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
ext4: avoid unnecessary spinlock in critical POSIX ACL path
ext3: avoid unnecessary spinlock in critical POSIX ACL path
ext4: convert instrumentation from markers to tracepoints
jbd2: convert instrumentation from markers to tracepoints
So far, permissions set via 'mode' and/or 'dmode' mount options were
effective only if the medium had no rock ridge extensions (or was mounted
without them). Add 'overriderockmode' mount option to indicate that these
options should override permissions set in rock ridge extensions. Maybe
this should be default but the current behavior is there since mount
options were created so I think we should not change how they behave.
Cc: <Hans-Joachim.Baader@cjt.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext2.txt says that dirs can have 32,768 subdirs, but the actual value of
EXT2_LINK_MAX is 32000.
ext3 is the same, but the doc does not mention it. One of ext4's features
is to "fix 32000 subdirectory limit".
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An update for the "Process-Specific Subdirectories" section to reflect the
changes till kernel 2.6.30.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export statistics for softirq in /proc/softirqs and /proc/stat.
1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.
2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.
[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
get rid of BKL in fs/sysv
get rid of BKL in fs/minix
get rid of BKL in fs/efs
befs ->pust_super() doesn't need BKL
Cleanup of adfs headers
9P doesn't need BKL in ->umount_begin()
fuse doesn't need BKL in ->umount_begin()
No instance of ->bmap() needs BKL
remove unlock_kernel() left accidentally
ext4: avoid unnecessary spinlock in critical POSIX ACL path
ext3: avoid unnecessary spinlock in critical POSIX ACL path
* akpm: (182 commits)
fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
fbdev: *bfin*: fix __dev{init,exit} markings
fbdev: *bfin*: drop unnecessary calls to memset
fbdev: bfin-t350mcqb-fb: drop unused local variables
fbdev: blackfin has __raw I/O accessors, so use them in fb.h
fbdev: s1d13xxxfb: add accelerated bitblt functions
tcx: use standard fields for framebuffer physical address and length
fbdev: add support for handoff from firmware to hw framebuffers
intelfb: fix a bug when changing video timing
fbdev: use framebuffer_release() for freeing fb_info structures
radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
s3c-fb: CPUFREQ frequency scaling support
s3c-fb: fix resource releasing on error during probing
carminefb: fix possible access beyond end of carmine_modedb[]
acornfb: remove fb_mmap function
mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
mb862xxfb: restrict compliation of platform driver to PPC
Samsung SoC Framebuffer driver: add Alpha Channel support
atmel-lcdc: fix pixclock upper bound detection
offb: use framebuffer_alloc() to allocate fb_info struct
...
Manually fix up conflicts due to kmemcheck in mm/slab.c
The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm. If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct. This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.
This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.
Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.
Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already. They simply report an
oom_adj value of OOM_DISABLE.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (22 commits)
nilfs2: support contiguous lookup of blocks
nilfs2: add sync_page method to page caches of meta data
nilfs2: use device's backing_dev_info for btree node caches
nilfs2: return EBUSY against delete request on snapshot
nilfs2: modify list of unsupported features in caveats
nilfs2: enable sync_page method
nilfs2: set bio unplug flag for the last bio in segment
nilfs2: allow future expansion of metadata read out via get info ioctl
NILFS2: Pagecache usage optimization on NILFS2
nilfs2: remove nilfs_btree_operations from btree mapping
nilfs2: remove nilfs_direct_operations from direct mapping
nilfs2: remove bmap pointer operations
nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct
nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function
nilfs2: move get block functions in bmap.c into btree codes
nilfs2: remove nilfs_bmap_delete_block
nilfs2: remove nilfs_bmap_put_block
nilfs2: remove header file for segment list operations
nilfs2: eliminate removal list of segments
nilfs2: add sufile function that can modify multiple segment usages
...
* 'docs-next' of git://git.lwn.net/linux-2.6:
Document the debugfs API
Documentation: Add "how to write a good patch summary" to SubmittingPatches
SubmittingPatches: fix typo
docs: Encourage better changelogs in the development process document
Document Reported-by in SubmittingPatches
This is an updated document covering the internal API for the debugfs
filesystem. Thanks to Shen Feng for suggesting that I put this text here
and noting that the old LWN version was rather out of date.
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Reported-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
On severe errors FAT remounts itself in read-only mode. Allow to
specify FAT fs desired behavior through 'errors' mount option:
panic, continue or remount read-only.
`mount -t [fat|vfat] -o errors=[panic,remount-ro,continue] \
<bdev> <mount point>`
This is analog to ext2 fs 'errors' mount option.
Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
My old address will shut down in a few days time: remove it from the tree,
and add a tmpfs (shmem filesystem) maintainer entry with the new address.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty. This allows the filesystem to have
full control of page dirtying events coming from the VM.
Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.
The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite. It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).
In this window, the VM could write out the page, clearing page-dirty. The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.
It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail. The
filesystem may need to allocate some data structure, for example.
And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.
This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window. This provides the filesystem with a strong
synchronisation against the VM here.
- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
under dirty pages might be susceptible to i_size changing under partial page
at the end of file (we also have a buffer.c-side problem here, but it cannot
be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
page_mkwrite functions themselves.
[ This also moves page_mkwrite another step closer to fault, which should
eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
filesystem calldown and page lock/unlock cycle in __do_fault. ]
[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust the CacheFiles documentation to use the correct names of the credential
pointers in task_struct.
The documentation was using names from the old versions of the credentials
patches.
Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/filesystems/vfs.txt incorrectly states that the kernel is
locked during the call to statfs (Documentation/filesystems/Locking
correctly says it is not). This patch removes the offending sentence.
remove reference to BKL being held in statfs
Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The sketch file is a file to mark checkpoints with user data. It was
experimentally introduced in the original implementation, and now
obsolete. The file was handled differently with regular files; the file
size got truncated when a checkpoint was created.
This stops the special treatment and will treat it as a regular file.
Most users are not affected because mkfs.nilfs2 no longer makes this file.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a document describing the features, mount options, userland
tools, usage, disk format, and related URLs for the nilfs2 file system.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits)
nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4
nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
nfsd41: Documentation/filesystems/nfs41-server.txt
nfsd41: CREATE_EXCLUSIVE4_1
nfsd41: SUPPATTR_EXCLCREAT attribute
nfsd41: support for 3-word long attribute bitmask
nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
nfsd41: pass writable attrs mask to nfsd4_decode_fattr
nfsd41: provide support for minor version 1 at rpc level
nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
nfsd41: access_valid
nfsd41: clientid handling
nfsd41: check encode size for sessions maxresponse cached
nfsd41: stateid handling
nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
nfsd41: destroy_session operation
nfsd41: non-page DRC for solo sequence responses
nfsd41: Add a create session replay cache
nfsd41: create_session operation
...
Initial nfs41 server write up describing the status of the linux
server implementation.
[nfsd41: document unenforced nfs41 compound ordering rules.]
[get rid of CONFIG_NFSD_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
trivial: Update my email address
trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
trivial: Fix misspelling of "Celsius".
trivial: remove unused variable 'path' in alloc_file()
trivial: fix a pdlfush -> pdflush typo in comment
trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
trivial: wusb: Storage class should be before const qualifier
trivial: drivers/char/bsr.c: Storage class should be before const qualifier
trivial: h8300: Storage class should be before const qualifier
trivial: fix where cgroup documentation is not correctly referred to
trivial: Give the right path in Documentation example
trivial: MTD: remove EOL from MODULE_DESCRIPTION
trivial: Fix typo in bio_split()'s documentation
trivial: PWM: fix of #endif comment
trivial: fix typos/grammar errors in Kconfig texts
trivial: Fix misspelling of firmware
trivial: cgroups: documentation typo and spelling corrections
trivial: Update contact info for Jochen Hein
trivial: fix typo "resgister" -> "register"
...
This patch includes POHMELFS design and implementation description.
Separate file includes mount options, default parameters and usage examples.
Signed-off-by: Eveniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (41 commits)
NFS: Add mount options to enable local caching on NFS
NFS: Display local caching state
NFS: Store pages from an NFS inode into a local cache
NFS: Read pages from FS-Cache into an NFS inode
NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
NFS: Add read context retention for FS-Cache to call back with
NFS: FS-Cache page management
NFS: Add some new I/O counters for FS-Cache doing things for NFS
NFS: Invalidate FsCache page flags when cache removed
NFS: Use local disk inode cache
NFS: Define and create inode-level cache objects
NFS: Define and create superblock-level objects
NFS: Define and create server-level objects
NFS: Register NFS for caching and retrieve the top-level index
NFS: Permit local filesystem caching to be enabled for NFS
NFS: Add FS-Cache option bit and debug bit
NFS: Add comment banners to some NFS functions
FS-Cache: Make kAFS use FS-Cache
CacheFiles: A cache that backs onto a mounted filesystem
CacheFiles: Export things for CacheFiles
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Don't write integrity descriptor too often
udf: Try anchor in block 256 first
udf: Some type fixes and cleanups
udf: use hardware sector size
udf: fix novrs mount option
udf: Fix oops when invalid character in filename occurs
udf: return f_fsid for statfs(2)
udf: Add checks to not underflow sector_t
udf: fix default mode and dmode options handling
udf: fix sparse warnings:
udf: unsigned last[i] cannot be less than 0
udf: implement mode and dmode mounting options
udf: reduce stack usage of udf_get_filename
udf: reduce stack usage of udf_load_pvoldesc
Fix the udf code not to pass structs on stack where possible.
Remove struct typedefs from fs/udf/ecma_167.h et al.
Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.
CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin. The source for the daemon can be downloaded from:
http://people.redhat.com/~dhowells/cachefs/cachefilesd.c
And an example configuration from:
http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf
The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communication with the daemon. Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.
CacheFiles is currently limited to a single cache.
CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.
============
REQUIREMENTS
============
The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:
- dnotify.
- extended attributes (xattrs).
- openat() and friends.
- bmap() support on files in the filesystem (FIBMAP ioctl).
- The use of bmap() to detect a partial page at the end of the file.
It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.
=============
CONFIGURATION
=============
The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up cache ready for use. The following script commands are available:
(*) brun <N>%
(*) bcull <N>%
(*) bstop <N>%
(*) frun <N>%
(*) fcull <N>%
(*) fstop <N>%
Configure the culling limits. Optional. See the section on culling
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.
(*) dir <path>
Specify the directory containing the root of the cache. Mandatory.
(*) tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".
(*) debug <mask>
Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:
1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())
This mask can also be set through sysfs, eg:
echo 5 >/sys/modules/cachefiles/parameters/debug
==================
STARTING THE CACHE
==================
The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.
The daemon is run as follows:
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
The flags are:
(*) -d
Increase the debugging level. This can be specified multiple times and
is cumulative with itself.
(*) -s
Send messages to stderr instead of syslog.
(*) -n
Don't daemonise and go into background.
(*) -f <configfile>
Use an alternative configuration file rather than the default one.
===============
THINGS TO AVOID
===============
Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.
Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.
Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).
Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.
Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.
Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly.
=============
CACHE CULLING
=============
The cache may need culling occasionally to make space. This involves
discarding objects from the cache that have been used less recently than
anything else. Culling is based on the access time of data objects. Empty
directories are culled if not in use.
Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six
"limits":
(*) brun
(*) frun
If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off.
(*) bcull
(*) fcull
If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.
(*) bstop
(*) fstop
If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.
These must be configured thusly:
0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100
Note that these are percentages of available space and available files, and do
_not_ appear as 100 minus the percentage displayed by the "df" program.
The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order. A new scan of the cache is
started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.
===============
CACHE STRUCTURE
===============
The CacheFiles module will create two directories in the directory it was
given:
(*) cache/
(*) graveyard/
The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.
The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.
The module represents index objects as directories with the filename "I..." or
"J...". Note that the "cache/" directory is itself a special index.
Data objects are represented as files if they have no children, or directories
if they do. Their filenames all begin "D..." or "E...". If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.
Special objects are similar to data objects, except their filenames begin
"S..." or "T...".
If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
objects:
INDEX INDEX INDEX DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
'+' prepended:
J1223/@23/+xy...z/+kl...m/Epqr
Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.
To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:
OBJECT TYPE PRINTABLE ENCODED
=============== =============== ===============
Index "I..." "J..."
Data "D..." "E..."
Special "S..." "T..."
Intermediate directories are always "@" or "+" as appropriate.
Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs. The latter is used to detect stale objects in the cache and update
or retire them.
Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).
==========================
SECURITY MODEL AND SELINUX
==========================
CacheFiles is implemented to deal properly with the LSM security features of
the Linux kernel and the SELinux facility.
One of the problems that CacheFiles faces is that it is generally acting on
behalf of a process, and running in that process's context, and that includes a
security context that is not appropriate for accessing the cache - either
because the files in the cache are inaccessible to that process, or because if
the process creates a file in the cache, that file may be inaccessible to other
processes.
The way CacheFiles works is to temporarily change the security context (fsuid,
fsgid and actor security label) that the process acts as - without changing the
security context of the process when it the target of an operation performed by
some other process (so signalling and suchlike still work correctly).
When the CacheFiles module is asked to bind to its cache, it:
(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
this is:
cachefiles_var_t
(2) Finds the security label of the process which issued the bind request
(presumed to be the cachefilesd daemon), which by default will be:
cachefilesd_t
and asks LSM to supply a security ID as which it should act given the
daemon's label. By default, this will be:
cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID
based on a rule of this form in the policy.
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
For instance:
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
The module's security ID gives it permission to create, move and remove files
and directories in the cache, to find and access directories and files in the
cache, to set and access extended attributes on cache objects, and to read and
write files in the cache.
The daemon's security ID gives it only a very restricted set of permissions: it
may scan directories, stat files and erase files and directories. It may
not read or write files in the cache, and so it is precluded from accessing the
data cached therein; nor is it permitted to create new files in the cache.
There are policy source files available in:
http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
and later versions. In that tarball, see the files:
cachefilesd.te
cachefilesd.fc
cachefilesd.if
They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own
directory and run:
make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp
You will need checkpolicy and selinux-policy-devel installed prior to the
build.
By default, the cache is located in /var/fscache, but if it is desirable that
it should be elsewhere, than either the above policy files must be altered, or
an auxiliary policy must be installed to label the alternate location of the
cache.
For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see:
/usr/share/doc/cachefilesd-*/move-cache.txt
When the cachefilesd rpm is installed; alternatively, the document can be found
in the sources.
==================
A NOTE ON SECURITY
==================
CacheFiles makes use of the split security in the task_struct. It allocates
its own task_security structure, and redirects current->act_as to point to it
when it acts on behalf of another process, in that process's context.
The reason it does this is that it calls vfs_mkdir() and suchlike rather than
bypassing security and calling inode ops directly. Therefore the VFS and LSM
may deny the CacheFiles access to the cache data because under some
circumstances the caching code is running in the security context of whatever
process issued the original syscall on the netfs.
Furthermore, should CacheFiles create a file or directory, the security
parameters with that object is created (UID, GID, security label) would be
derived from that process that issued the system call, thus potentially
preventing other processes from accessing the cache - including CacheFiles's
cache management daemon (cachefilesd).
What is required is to temporarily override the security of the process that
issued the system call. We can't, however, just do an in-place change of the
security data as that affects the process as an object, not just as a subject.
This means it may lose signals or ptrace events for example, and affects what
the process looks like in /proc.
So CacheFiles makes use of a logical split in the security between the
objective security (task->sec) and the subjective security (task->act_as). The
objective security holds the intrinsic security properties of a process and is
never overridden. This is what appears in /proc, and is what is used when a
process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
may be overridden. This is not seen externally, and is used whan a process
acts upon another object, for example SIGKILLing another process or opening a
file.
LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.
This documentation is added by the patch to:
Documentation/filesystems/caching/cachefiles.txt
Signed-Off-By: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Add and document asynchronous operation handling for use by FS-Cache's data
storage and retrieval routines.
The following documentation is added to:
Documentation/filesystems/caching/operations.txt
================================
ASYNCHRONOUS OPERATIONS HANDLING
================================
========
OVERVIEW
========
FS-Cache has an asynchronous operations handling facility that it uses for its
data storage and retrieval routines. Its operations are represented by
fscache_operation structs, though these are usually embedded into some other
structure.
This facility is available to and expected to be be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.
To make use of this facility, <linux/fscache-cache.h> should be #included.
===============================
OPERATION RECORD INITIALISATION
===============================
An operation is recorded in an fscache_operation struct:
struct fscache_operation {
union {
struct work_struct fast_work;
struct slow_work slow_work;
};
unsigned long flags;
fscache_operation_processor_t processor;
...
};
Someone wanting to issue an operation should allocate something with this
struct embedded in it. They should initialise it by calling:
void fscache_operation_init(struct fscache_operation *op,
fscache_operation_release_t release);
with the operation to be initialised and the release function to use.
The op->flags parameter should be set to indicate the CPU time provision and
the exclusivity (see the Parameters section).
The op->fast_work, op->slow_work and op->processor flags should be set as
appropriate for the CPU time provision (see the Parameters section).
FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards.
==========
PARAMETERS
==========
There are a number of parameters that can be set in the operation record's flag
parameter. There are three options for the provision of CPU time in these
operations:
(1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
may decide it wants to handle an operation itself without deferring it to
another thread.
This is, for example, used in read operations for calling readpages() on
the backing filesystem in CacheFiles. Although readpages() does an
asynchronous data fetch, the determination of whether pages exist is done
synchronously - and the netfs does not proceed until this has been
determined.
If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
before submitting the operation, and the operating thread must wait for it
to be cleared before proceeding:
wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
fscache_wait_bit, TASK_UNINTERRUPTIBLE);
(2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
will be given to keventd to process. Such an operation is not permitted
to sleep on I/O.
This is, for example, used by CacheFiles to copy data from a backing fs
page to a netfs page after the backing fs has read the page in.
If this option is used, op->fast_work and op->processor must be
initialised before submitting the operation:
INIT_WORK(&op->fast_work, do_some_work);
(3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
will be given to the slow work facility to process. Such an operation is
permitted to sleep on I/O.
This is, for example, used by FS-Cache to handle background writes of
pages that have just been fetched from a remote server.
If this option is used, op->slow_work and op->processor must be
initialised before submitting the operation:
fscache_operation_init_slow(op, processor)
Furthermore, operations may be one of two types:
(1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
conjunction with any other operation on the object being operated upon.
An example of this is the attribute change operation, in which the file
being written to may need truncation.
(2) Shareable. Operations of this type may be running simultaneously. It's
up to the operation implementation to prevent interference between other
operations running at the same time.
=========
PROCEDURE
=========
Operations are used through the following procedure:
(1) The submitting thread must allocate the operation and initialise it
itself. Normally this would be part of a more specific structure with the
generic op embedded within.
(2) The submitting thread must then submit the operation for processing using
one of the following two functions:
int fscache_submit_op(struct fscache_object *object,
struct fscache_operation *op);
int fscache_submit_exclusive_op(struct fscache_object *object,
struct fscache_operation *op);
The first function should be used to submit non-exclusive ops and the
second to submit exclusive ones. The caller must still set the
FSCACHE_OP_EXCLUSIVE flag.
If successful, both functions will assign the operation to the specified
object and return 0. -ENOBUFS will be returned if the object specified is
permanently unavailable.
The operation manager will defer operations on an object that is still
undergoing lookup or creation. The operation will also be deferred if an
operation of conflicting exclusivity is in progress on the object.
If the operation is asynchronous, the manager will retain a reference to
it, so the caller should put their reference to it by passing it to:
void fscache_put_operation(struct fscache_operation *op);
(3) If the submitting thread wants to do the work itself, and has marked the
operation with FSCACHE_OP_MYTHREAD, then it should monitor
FSCACHE_OP_WAITING as described above and check the state of the object if
necessary (the object might have died whilst the thread was waiting).
When it has finished doing its processing, it should call
fscache_put_operation() on it.
(4) The operation holds an effective lock upon the object, preventing other
exclusive ops conflicting until it is released. The operation can be
enqueued for further immediate asynchronous processing by adjusting the
CPU time provisioning option if necessary, eg:
op->flags &= ~FSCACHE_OP_TYPE;
op->flags |= ~FSCACHE_OP_FAST;
and calling:
void fscache_enqueue_operation(struct fscache_operation *op)
This can be used to allow other things to have use of the worker thread
pools.
=====================
ASYNCHRONOUS CALLBACK
=====================
When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. This should then get at the
container struct by using container_of():
static void fscache_write_op(struct fscache_operation *_op)
{
struct fscache_storage *op =
container_of(_op, struct fscache_storage, op);
...
}
The caller holds a reference on the operation, and will invoke
fscache_put_operation() when the processor function returns. The processor
function is at liberty to call fscache_enqueue_operation() or to take extra
references.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Implement the cache object management state machine.
The following documentation is added to illuminate the working of this state
machine. It will also be added as:
Documentation/filesystems/caching/object.txt
====================================================
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
====================================================
==============
REPRESENTATION
==============
FS-Cache maintains an in-kernel representation of each object that a netfs is
currently interested in. Such objects are represented by the fscache_cookie
struct and are referred to as cookies.
FS-Cache also maintains a separate in-kernel representation of the objects that
a cache backend is currently actively caching. Such objects are represented by
the fscache_object struct. The cache backends allocate these upon request, and
are expected to embed them in their own representations. These are referred to
as objects.
There is a 1:N relationship between cookies and objects. A cookie may be
represented by multiple objects - an index may exist in more than one cache -
or even by no objects (it may not be cached).
Furthermore, both cookies and objects are hierarchical. The two hierarchies
correspond, but the cookies tree is a superset of the union of the object trees
of multiple caches:
NETFS INDEX TREE : CACHE 1 : CACHE 2
: :
: +-----------+ :
+----------->| IObject | :
+-----------+ | : +-----------+ :
| ICookie |-------+ : | :
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
| : | : +-----------+
| : V : |
| : +-----------+ : |
V +----------->| IObject | : |
+-----------+ | : +-----------+ : |
| ICookie |-------+ : | : V
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
+-----+-----+ : | : +-----------+
| | : | : |
V | : V : |
+-----------+ | : +-----------+ : |
| ICookie |------------------------->| IObject | : |
+-----------+ | : +-----------+ : |
| V : | : V
| +-----------+ : | : +-----------+
| | ICookie |-------------------------------->| IObject |
| +-----------+ : | : +-----------+
V | : V : |
+-----------+ | : +-----------+ : |
| DCookie |------------------------->| DObject | : |
+-----------+ | : +-----------+ : |
| : : |
+-------+-------+ : : |
| | : : |
V V : : V
+-----------+ +-----------+ : : +-----------+
| DCookie | | DCookie |------------------------>| DObject |
+-----------+ +-----------+ : : +-----------+
: :
In the above illustration, ICookie and IObject represent indices and DCookie
and DObject represent data storage objects. Indices may have representation in
multiple caches, but currently, non-index objects may not. Objects of any type
may also be entirely unrepresented.
As far as the netfs API goes, the netfs is only actually permitted to see
pointers to the cookies. The cookies themselves and any objects attached to
those cookies are hidden from it.
===============================
OBJECT MANAGEMENT STATE MACHINE
===============================
Within FS-Cache, each active object is managed by its own individual state
machine. The state for an object is kept in the fscache_object struct, in
object->state. A cookie may point to a set of objects that are in different
states.
Each state has an action associated with it that is invoked when the machine
wakes up in that state. There are four logical sets of states:
(1) Preparation: states that wait for the parent objects to become ready. The
representations are hierarchical, and it is expected that an object must
be created or accessed with respect to its parent object.
(2) Initialisation: states that perform lookups in the cache and validate
what's found and that create on disk any missing metadata.
(3) Normal running: states that allow netfs operations on objects to proceed
and that update the state of objects.
(4) Termination: states that detach objects from their netfs cookies, that
delete objects from disk, that handle disk and system errors and that free
up in-memory resources.
In most cases, transitioning between states is in response to signalled events.
When a state has finished processing, it will usually set the mask of events in
which it is interested (object->event_mask) and relinquish the worker thread.
Then when an event is raised (by calling fscache_raise_event()), if the event
is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()).
PROVISION OF CPU TIME
---------------------
The work to be done by the various states is given CPU time by the threads of
the slow work facility (see Documentation/slow-work.txt). This is used in
preference to the workqueue facility because:
(1) Threads may be completely occupied for very long periods of time by a
particular work item. These state actions may be doing sequences of
synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
getxattr, truncate, unlink, rmdir, rename).
(2) Threads may do little actual work, but may rather spend a lot of time
sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
workqueues don't necessarily have the right numbers of threads.
LOCKING SIMPLIFICATION
----------------------
Because only one worker thread may be operating on any particular object's
state machine at once, this simplifies the locking, particularly with respect
to disconnecting the netfs's representation of a cache object (fscache_cookie)
from the cache backend's representation (fscache_object) - which may be
requested from either end.
=================
THE SET OF STATES
=================
The object state machine has a set of states that it can be in. There are
preparation states in which the object sets itself up and waits for its parent
object to transit to a state that allows access to its children:
(1) State FSCACHE_OBJECT_INIT.
Initialise the object and wait for the parent object to become active. In
the cache, it is expected that it will not be possible to look an object
up from the parent object, until that parent object itself has been looked
up.
There are initialisation states in which the object sets itself up and accesses
disk for the object metadata:
(2) State FSCACHE_OBJECT_LOOKING_UP.
Look up the object on disk, using the parent as a starting point.
FS-Cache expects the cache backend to probe the cache to see whether this
object is represented there, and if it is, to see if it's valid (coherency
management).
The cache should call fscache_object_lookup_negative() to indicate lookup
failure for whatever reason, and should call fscache_obtained_object() to
indicate success.
At the completion of lookup, FS-Cache will let the netfs go ahead with
read operations, no matter whether the file is yet cached. If not yet
cached, read operations will be immediately rejected with ENODATA until
the first known page is uncached - as to that point there can be no data
to be read out of the cache for that file that isn't currently also held
in the pagecache.
(3) State FSCACHE_OBJECT_CREATING.
Create an object on disk, using the parent as a starting point. This
happens if the lookup failed to find the object, or if the object's
coherency data indicated what's on disk is out of date. In this state,
FS-Cache expects the cache to create
The cache should call fscache_obtained_object() if creation completes
successfully, fscache_object_lookup_negative() otherwise.
At the completion of creation, FS-Cache will start processing write
operations the netfs has queued for an object. If creation failed, the
write ops will be transparently discarded, and nothing recorded in the
cache.
There are some normal running states in which the object spends its time
servicing netfs requests:
(4) State FSCACHE_OBJECT_AVAILABLE.
A transient state in which pending operations are started, child objects
are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
lookup data is freed.
(5) State FSCACHE_OBJECT_ACTIVE.
The normal running state. In this state, requests the netfs makes will be
passed on to the cache.
(6) State FSCACHE_OBJECT_UPDATING.
The state machine comes here to update the object in the cache from the
netfs's records. This involves updating the auxiliary data that is used
to maintain coherency.
And there are terminal states in which an object cleans itself up, deallocates
memory and potentially deletes stuff from disk:
(7) State FSCACHE_OBJECT_LC_DYING.
The object comes here if it is dying because of a lookup or creation
error. This would be due to a disk error or system error of some sort.
Temporary data is cleaned up, and the parent is released.
(8) State FSCACHE_OBJECT_DYING.
The object comes here if it is dying due to an error, because its parent
cookie has been relinquished by the netfs or because the cache is being
withdrawn.
Any child objects waiting on this one are given CPU time so that they too
can destroy themselves. This object waits for all its children to go away
before advancing to the next state.
(9) State FSCACHE_OBJECT_ABORT_INIT.
The object comes to this state if it was waiting on its parent in
FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
(10) State FSCACHE_OBJECT_RELEASING.
(11) State FSCACHE_OBJECT_RECYCLING.
The object comes to one of these two states when dying once it is rid of
all its children, if it is dying because the netfs relinquished its
cookie. In the first state, the cached data is expected to persist, and
in the second it will be deleted.
(12) State FSCACHE_OBJECT_WITHDRAWING.
The object transits to this state if the cache decides it wants to
withdraw the object from service, perhaps to make space, but also due to
error or just because the whole cache is being withdrawn.
(13) State FSCACHE_OBJECT_DEAD.
The object transits to this state when the in-memory object record is
ready to be deleted. The object processor shouldn't ever see an object in
this state.
THE SET OF EVENTS
-----------------
There are a number of events that can be raised to an object state machine:
(*) FSCACHE_OBJECT_EV_UPDATE
The netfs requested that an object be updated. The state machine will ask
the cache backend to update the object, and the cache backend will ask the
netfs for details of the change through its cookie definition ops.
(*) FSCACHE_OBJECT_EV_CLEARED
This is signalled in two circumstances:
(a) when an object's last child object is dropped and
(b) when the last operation outstanding on an object is completed.
This is used to proceed from the dying state.
(*) FSCACHE_OBJECT_EV_ERROR
This is signalled when an I/O error occurs during the processing of some
object.
(*) FSCACHE_OBJECT_EV_RELEASE
(*) FSCACHE_OBJECT_EV_RETIRE
These are signalled when the netfs relinquishes a cookie it was using.
The event selected depends on whether the netfs asks for the backing
object to be retired (deleted) or retained.
(*) FSCACHE_OBJECT_EV_WITHDRAW
This is signalled when the cache backend wants to withdraw an object.
This means that the object will have to be detached from the netfs's
cookie.
Because the withdrawing releasing/retiring events are all handled by the object
state machine, it doesn't matter if there's a collision with both ends trying
to sever the connection at the same time. The state machine can just pick
which one it wants to honour, and that effects the other.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Make FS-Cache create its /proc interface and present various statistical
information through it. Also provide the functions for updating this
information.
These features are enabled by:
CONFIG_FSCACHE_PROC
CONFIG_FSCACHE_STATS
CONFIG_FSCACHE_HISTOGRAM
The /proc directory for FS-Cache is also exported so that caching modules can
add their own statistics there too.
The FS-Cache module is loadable at this point, and the statistics files can be
examined by userspace:
cat /proc/fs/fscache/stats
cat /proc/fs/fscache/histogram
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Add the API for a generic facility (FS-Cache) by which caches may declare them
selves open for business, and may obtain work to be done from network
filesystems. The header file is included by:
#include <linux/fscache-cache.h>
Documentation for the API is also added to:
Documentation/filesystems/caching/backend-api.txt
This API is not usable without the implementation of the utility functions
which will be added in further patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Add the API for a generic facility (FS-Cache) by which filesystems (such as AFS
or NFS) may call on local caching capabilities without having to know anything
about how the cache works, or even if there is a cache:
+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+
General documentation and documentation of the netfs specific API are provided
in addition to the header files.
As this patch stands, it is possible to build a filesystem against the facility
and attempt to use it. All that will happen is that all requests will be
immediately denied as if no cache is present.
Further patches will implement the core of the facility. The facility will
transfer requests from networking filesystems to appropriate caches if
possible, or else gracefully deny them.
If this facility is disabled in the kernel configuration, then all its
operations will trivially reduce to nothing during compilation.
WHY NOT I_MAPPING?
==================
I have added my own API to implement caching rather than using i_mapping to do
this for a number of reasons. These have been discussed a lot on the LKML and
CacheFS mailing lists, but to summarise the basics:
(1) Most filesystems don't do hole reportage. Holes in files are treated as
blocks of zeros and can't be distinguished otherwise, making it difficult
to distinguish blocks that have been read from the network and cached from
those that haven't.
(2) The backing inode must be fully populated before being exposed to
userspace through the main inode because the VM/VFS goes directly to the
backing inode and does not interrogate the front inode's VM ops.
Therefore:
(a) The backing inode must fit entirely within the cache.
(b) All backed files currently open must fit entirely within the cache at
the same time.
(c) A working set of files in total larger than the cache may not be
cached.
(d) A file may not grow larger than the available space in the cache.
(e) A file that's open and cached, and remotely grows larger than the
cache is potentially stuffed.
(3) Writes go to the backing filesystem, and can only be transferred to the
network when the file is closed.
(4) There's no record of what changes have been made, so the whole file must
be written back.
(5) The pages belong to the backing filesystem, and all metadata associated
with that page are relevant only to the backing filesystem, and not
anything stacked atop it.
OVERVIEW
========
FS-Cache provides (or will provide) the following facilities:
(1) Caches can be added / removed at any time, even whilst in use.
(2) Adds a facility by which tags can be used to refer to caches, even if
they're not available yet.
(3) More than one cache can be used at once. Caches can be selected
explicitly by use of tags.
(4) The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (1)).
(5) A netfs may annotate cache objects that belongs to it. This permits the
storage of coherency maintenance data.
(6) Cache objects will be pinnable and space reservations will be possible.
(7) The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.
(8) Cookies are used to represent indices, files and other objects to the
netfs. The simplest cookie is just a NULL pointer - indicating nothing
cached there.
(9) The netfs is allowed to propose - dynamically - any index hierarchy it
desires, though it must be aware that the index search function is
recursive, stack space is limited, and indices can only be children of
indices.
(10) Indices can be used to group files together to reduce key size and to make
group invalidation easier. The use of indices may make lookup quicker,
but that's cache dependent.
(11) Data I/O is effectively done directly to and from the netfs's pages. The
netfs indicates that page A is at index B of the data-file represented by
cookie C, and that it should be read or written. The cache backend may or
may not start I/O on that page, but if it does, a netfs callback will be
invoked to indicate completion. The I/O may be either synchronous or
asynchronous.
(12) Cookies can be "retired" upon release. At this point FS-Cache will mark
them as obsolete and the index hierarchy rooted at that point will get
recycled.
(13) The netfs provides a "match" function for index searches. In addition to
saying whether a match was made or not, this can also specify that an
entry should be updated or deleted.
FS-Cache maintains a virtual index tree in which all indices, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches.
FSDEF
|
+------------------------------------+
| |
NFS AFS
| |
+--------------------------+ +-----------+
| | | |
homedir mirror afs.org redhat.com
| | |
+------------+ +---------------+ +----------+
| | | | | |
00001 00002 00007 00125 vol00001 vol00002
| | | | |
+---+---+ +-----+ +---+ +------+------+ +-----+----+
| | | | | | | | | | | | |
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
| |
PG0 +-------+
| |
00001 00003
|
+---+---+
| | |
PG0 PG1 PG2
In the example above, two netfs's can be seen to be backed: NFS and AFS. These
have different index hierarchies:
(*) The NFS primary index will probably contain per-server indices. Each
server index is indexed by NFS file handles to get data file objects.
Each data file objects can have an array of pages, but may also have
further child objects, such as extended attributes and directory entries.
Extended attribute objects themselves have page-array contents.
(*) The AFS primary index contains per-cell indices. Each cell index contains
per-logical-volume indices. Each of volume index contains up to three
indices for the read-write, read-only and backup mirrors of those volumes.
Each of these contains vnode data file objects, each of which contains an
array of pages.
The very top index is the FS-Cache master index in which individual netfs's
have entries.
Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.
The FS-Cache overview can be found in:
Documentation/filesystems/caching/fscache.txt
The netfs API to FS-Cache can be found in:
Documentation/filesystems/caching/netfs-api.txt
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
Now /proc/sys is described in many places and much information is
redundant. This patch updates the proc.txt and move the /proc/sys
desciption out to the files in Documentation/sysctls.
Details are:
merge
- 2.1 /proc/sys/fs - File system data
- 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
- 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
with Documentation/sysctls/fs.txt.
remove
- 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
since it's not better then the Documentation/binfmt_misc.txt.
merge
- 2.3 /proc/sys/kernel - general kernel parameters
with Documentation/sysctls/kernel.txt
remove
- 2.5 /proc/sys/dev - Device specific parameters
since it's obsolete the sysfs is used now.
remove
- 2.6 /proc/sys/sunrpc - Remote procedure calls
since it's not better then the Documentation/sysctls/sunrpc.txt
move
- 2.7 /proc/sys/net - Networking stuff
- 2.9 Appletalk
- 2.10 IPX
to newly created Documentation/sysctls/net.txt.
remove
- 2.8 /proc/sys/net/ipv4 - IPV4 settings
since it's not better then the Documentation/networking/ip-sysctl.txt.
add
- Chapter 3 Per-Process Parameters
to descibe /proc/<pid>/xxx parameters.
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"dmode" allows overriding permissions of directories and
"mode" allows overriding permissions of files.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
ext4: Regularize mount options
ext4: fix locking typo in mballoc which could cause soft lockup hangs
ext4: fix typo which causes a memory leak on error path
jbd2: Update locking coments
ext4: Rename pa_linear to pa_type
ext4: add checks of block references for non-extent inodes
ext4: Check for an valid i_mode when reading the inode from disk
ext4: Use WRITE_SYNC for commits which are caused by fsync()
ext4: Add auto_da_alloc mount option
ext4: Use struct flex_groups to calculate get_orlov_stats()
ext4: Use atomic_t's in struct flex_groups
ext4: remove /proc tuning knobs
ext4: Add sysfs support
ext4: Track lifetime disk writes
ext4: Fix discard of inode prealloc space with delayed allocation.
ext4: Automatically allocate delay allocated blocks on rename
ext4: Automatically allocate delay allocated blocks on close
ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
ext4: Simplify delalloc code by removing mpage_da_writepages()
ext4: Save stack space by removing fake buffer heads
...
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (88 commits)
PCI: fix HT MSI mapping fix
PCI: don't enable too much HT MSI mapping
x86/PCI: make pci=lastbus=255 work when acpi is on
PCI: save and restore PCIe 2.0 registers
PCI: update fakephp for bus_id removal
PCI: fix kernel oops on bridge removal
PCI: fix conflict between SR-IOV and config space sizing
powerpc/PCI: include pci.h in powerpc MSI implementation
PCI Hotplug: schedule fakephp for feature removal
PCI Hotplug: rename legacy_fakephp to fakephp
PCI Hotplug: restore fakephp interface with complete reimplementation
PCI: Introduce /sys/bus/pci/devices/.../rescan
PCI: Introduce /sys/bus/pci/devices/.../remove
PCI: Introduce /sys/bus/pci/rescan
PCI: Introduce pci_rescan_bus()
PCI: do not enable bridges more than once
PCI: do not initialize bridges more than once
PCI: always scan child buses
PCI: pci_scan_slot() returns newly found devices
PCI: don't scan existing devices
...
Fix trivial append-only conflict in Documentation/feature-removal-schedule.txt
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added some documentation in exofs.txt, as well as a BUGS file.
For further reading, operation instructions, example scripts
and up to date infomation and code please see:
http://open-osd.org
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ext3 has quite unexpected semantics or "ro" and defaults are
not what they are documented to be, due to mkfs override.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Add support for using the mount options "barrier" and "nobarrier", and
"auto_da_alloc" and "noauto_da_alloc", which is more consistent than
"barrier=<0|1>" or "auto_da_alloc=<0|1>". Most other ext3/ext4 mount
options use the foo/nofoo naming convention. We allow the old forms
of these mount options for backwards compatibility.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Document the format and semantics of the /proc/fs/nfsd/pool_stats file.
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'bkl-removal' of git://git.lwn.net/linux-2.6:
Rationalize fasync return values
Move FASYNC bit handling to f_op->fasync()
Use f_lock to protect f_flags
Rename struct file->f_ep_lock
Revert the change to the orphan dates of Windows 95, DOS, compression.
Add a new orphan date for OS/2.
Signed-off-by: Jody McIntyre <scjody@sun.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits)
ucc_geth: Fix oops when using fixed-link support
dm9000: locking bugfix
net: update dnet.c for bus_id removal
dnet: DNET should depend on HAS_IOMEM
dca: add missing copyright/license headers
nl80211: Check that function pointer != NULL before using it
sungem: missing net_device_ops
be2net: fix to restore vlan ids into BE2 during a IF DOWN->UP cycle
be2net: replenish when posting to rx-queue is starved in out of mem conditions
bas_gigaset: correctly allocate USB interrupt transfer buffer
smsc911x: reset last known duplex and carrier on open
sh_eth: Fix mistake of the address of SH7763
sh_eth: Change handling of IRQ
netns: oops in ip[6]_frag_reasm incrementing stats
net: kfree(napi->skb) => kfree_skb
net: fix sctp breakage
ipv6: fix display of local and remote sit endpoints
net: Document /proc/sys/net/core/netdev_budget
tulip: fix crash on iface up with shirq debug
virtio_net: Make virtio_net support carrier detection
...
This patch adds an attribute named "remove" to a PCI device's sysfs
directory. Writing a non-zero value to this attribute will remove the PCI
device and any children of it.
Trent Piepho wrote the original implementation and documentation.
Thanks to Vegard Nossum for testing under kmemcheck and finding locking
issues with the sysfs interface.
Cc: Trent Piepho <xyzzy@speakeasy.org>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The NAPI poll parameter netdev_budget is not documented in
kernel-docs. Since it may have a substantial effect on at least some
network loads, it should be.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing the BKL from FASYNC handling ran into the challenge of keeping the
setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
the underlying fasync() function. Andi Kleen suggested moving the handling
of that bit into fasync(); this patch does exactly that. As a result, we
have a couple of internal API changes: fasync() must now manage the FASYNC
bit, and it will be called without the BKL held.
As it happens, every fasync() implementation in the kernel with one
exception calls fasync_helper(). So, if we make fasync_helper() set the
FASYNC bit, we can avoid making any changes to the other fasync()
functions - as long as those functions, themselves, have proper locking.
Most fasync() implementations do nothing but call fasync_helper() - which
has its own lock - so they are easily verified as correct. The BKL had
already been pushed down into the rest.
The networking code has its own version of fasync_helper(), so that code
has been augmented with explicit FASYNC bit handling.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Trivial patch to fix bad links in the ext2 and ext3 documentation.
Signed-off-by: Jody McIntyre <scjody@sun.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been
replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Some poeple are reading the ext4 feature list too literally and create
dubious test cases involving very long filenames and 1k blocksize and
then complain when they run into an htree-imposed limit. So add fine
print to the "fix 32000 subdirectory limit" ext4 feature.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix descriptions of device attributes to be consistent with the actual
implementations in include/linux/device.h
Signed-off-by: Mike Murphy <mamurph[at]cs.clemson.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the ROM reading code return an error to user space if
the size of the ROM read is equal to 0.
The patch also emits a warnings if the contents of the ROM are invalid,
and documents the effects of the "enable" file on ROM reading.
Signed-off-by: Timothy S. Nelson <wayland@wayland.id.au>
Acked-by: Alex Villacis-Lasso <a_villacis@palosanto.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: remove fast unmounting
UBIFS: return sensible error codes
UBIFS: remount ro fixes
UBIFS: spelling fix 'date' -> 'data'
UBIFS: sync wbufs after syncing inodes and pages
UBIFS: fix LPT out-of-space bug (again)
UBIFS: fix no_chk_data_crc
UBIFS: fix assertions
UBIFS: ensure orphan area head is initialized
UBIFS: always clean up GC LEB space
UBIFS: add re-mount debugging checks
UBIFS: fix LEB list freeing
UBIFS: simplify locking
UBIFS: document dark_wm and dead_wm better
UBIFS: do not treat all data as short term
UBIFS: constify operations
UBIFS: do not commit twice
This UBIFS feature has never worked properly, and it was a mistake
to add it because we simply have no use-cases. So, lets still accept
the fast_unmount mount option, but ignore it. This does not change
much, because UBIFS commit in sync_fs anyway, and sync_fs is called
while unmounting.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Update the NFS/RDMA documentation to use the new port number assigned
by IANA.
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
More specifically, the section on /proc/sys/vm in
Documentation/filesystems/proc.txt was removed and a link to
Documentation/sysctl/vm.txt added.
Most of the verbiage from proc.txt was simply moved in vm.txt, with new
addtional text for "swappiness" and "stat_interval".
Signed-off-by: Peter W Morreale <pmorreale@novell.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests. So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.
In many case, a commercial filesystem (e.g. VxFS) has the freeze feature
and it would be used to get the consistent backup.
If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.
So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
or the snapshot.
This patch:
VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.
ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.
reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits)
Btrfs: explicitly mark the tree log root for writeback
Btrfs: Drop the hardware crc32c asm code
Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING
Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook
Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies
Btrfs: tree logging checksum fixes
Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents
Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation
Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
Btrfs: drop EXPORT symbols from extent_io.c
Btrfs: Fix checkpatch.pl warnings
Btrfs: Fix free block discard calls down to the block layer
Btrfs: avoid orphan inode caused by log replay
Btrfs: avoid potential super block corruption
Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super
Btrfs: fix a memory leak in btrfs_get_sb
Btrfs: Fix typo in clear_state_cb
Btrfs: Fix memset length in btrfs_file_write
Btrfs: update directory's size when creating subvol/snapshot
Btrfs: add permission checks to the ioctls
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
ext4: Remove "extents" mount option
block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
ext4: Make printk's consistently prefixed with "EXT4-fs: "
ext4: Add sanity checks for the superblock before mounting the filesystem
ext4: Add mount option to set kjournald's I/O priority
jbd2: Submit writes to the journal using WRITE_SYNC
jbd2: Add pid and journal device name to the "kjournald2 starting" message
ext4: Add markers for better debuggability
ext4: Remove code to create the journal inode
ext4: provide function to release metadata pages under memory pressure
ext3: provide function to release metadata pages under memory pressure
add releasepage hooks to block devices which can be used by file systems
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
ext4: Init the complete page while building buddy cache
ext4: Don't allow new groups to be added during block allocation
ext4: mark the blocks/inode bitmap beyond end of group as used
ext4: Use new buffer_head flag to check uninit group bitmaps initialization
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
ext4: code cleanup
...
f_op->poll is the only vfs operation which is not allowed to sleep. It's
because poll and select implementation used task state to synchronize
against wake ups, which doesn't have to be the case anymore as wait/wake
interface can now use custom wake up functions. The non-sleep restriction
can be a bit tricky because ->poll is not called from an atomic context
and the result of accidentally sleeping in ->poll only shows up as
temporary busy looping when the timing is right or rather wrong.
This patch converts poll/select to use custom wake up function and use
separate triggered variable to synchronize against wake up events. The
only added overhead is an extra function call during wake up and
negligible.
This patch removes the one non-sleep exception from vfs locking rules and
is beneficial to userland filesystem implementations like FUSE, 9p or
peculiar fs like spufs as it's very difficult for those to implement
non-sleeping poll method.
While at it, make the following cosmetic changes to make poll.h and
select.c checkpatch friendly.
* s/type * symbol/type *symbol/ : three places in poll.h
* remove blank line before EXPORT_SYMBOL() : two places in select.c
Oleg: spotted missing barrier in poll_schedule_timeout()
Davide: spotted missing write barrier in pollwake()
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Brad Boyer <flar@allandria.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.
dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.
With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.
dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system. If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.
When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value. For example, if dirty_bytes is written to
be 8096, 8K of memory is required to commence direct writeback.
dirty_ratio is then functionally equivalent to 8K / the amount of
dirtyable memory:
dirtyable_memory = free pages + mapped pages + file cache
dirty_background_bytes = dirty_background_ratio * dirtyable_memory
-or-
dirty_background_ratio = dirty_background_bytes / dirtyable_memory
AND
dirty_bytes = dirty_ratio * dirtyable_memory
-or-
dirty_ratio = dirty_bytes / dirtyable_memory
Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified. When one sysctl is written, the other appears as 0 when read.
The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.
Prior to this change, the minimum dirty_ratio was 5 as implemented by
get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
written value between 0 and 100. This restriction is maintained, but
dirty_bytes has a lower limit of only one page.
Also prior to this change, the dirty_background_ratio could not equal or
exceed dirty_ratio. This restriction is maintained in addition to
restricting dirty_background_bytes. If either background threshold equals
or exceeds that of the dirty threshold, it is implicitly set to half the
dirty threshold.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based file even though the extents feature flag
was not eabled. The simplest thing to do is to nuke the mount option
entirely. It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds the Kconfig option "CONFIG_OCFS2_FS_POSIX_ACL"
and mount options "acl" to enable acls in Ocfs2.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
/proc/*/stack adds the ability to query a task's stack trace. It is more
useful than /proc/*/wchan as it provides full stack trace instead of single
depth. Example output:
$ cat /proc/self/stack
[<c010a271>] save_stack_trace_tsk+0x17/0x35
[<c01827b4>] proc_pid_stack+0x4a/0x76
[<c018312d>] proc_single_show+0x4a/0x5e
[<c016bdec>] seq_read+0xf3/0x29f
[<c015a004>] vfs_read+0x6d/0x91
[<c015a0c1>] sys_read+0x3b/0x60
[<c0102eda>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff
[add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan]
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
This code has been obsolete in quite some time, since the supported
method for adding a journal inode is to use tune2fs (or to creating
new filesystem with a journal via mke2fs or mkfs.ext4).
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (33 commits)
UBIFS: add more useful debugging prints
UBIFS: print debugging messages properly
UBIFS: fix numerous spelling mistakes
UBIFS: allow mounting when short of space
UBIFS: fix writing uncompressed files
UBIFS: fix checkpatch.pl warnings
UBIFS: fix sparse warnings
UBIFS: simplify make_free_space
UBIFS: do not lie about used blocks
UBIFS: restore budg_uncommitted_idx
UBIFS: always commit on unmount
UBIFS: use ubi_sync
UBIFS: always commit in sync_fs
UBIFS: fix file-system synchronization
UBIFS: fix constants initialization
UBIFS: avoid unnecessary calculations
UBIFS: re-calculate min_idx_size after the commit
UBIFS: use nicer 64-bit math
UBIFS: fix available blocks count
UBIFS: various comment improvements and fixes
...
Changelog [v2]:
- Add note indicating strict isolation is not possible unless all
mounts of devpts use the 'newinstance' mount option.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the hopelessly misguided ->dir_notify(). The only instance (cifs)
has been broken by design from the very beginning; the objects it creates
are never destroyed, keep references to struct file they can outlive, nothing
that could possibly evict them exists on close(2) path *and* no locking
whatsoever is done to prevent races with close(), should the previous, er,
deficiencies someday be dealt with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Documentation/filesystems/files.txt was not updated when
f_count became an atomic_long_t.
atomic_long_inc_not_zero() is now used instead of atomic_inc_not_zero()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
no function named d_put(), it should be dput().
Impact: fix document and comment, no functionality changed
Signed-off-by: Zhao Lei <zhaolei@cn.fuijtsu.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is very handy to be able to change default UBIFS compressor
via mount options. Introduce -o compr=<name> mount option support.
Currently only "none", "lzo" and "zlib" compressors are supported.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds. To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM. As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A very minor patch on ramfs-rootfs-initramfs.txt: update the location
where CONFIG_INITRAMFS_SOURCE lives in menuconfig
Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>
Acked-by: Rob Landley <rob@landley.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FAT has the ATTR_RO (read-only) attribute. But on Windows, the ATTR_RO
of the directory will be just ignored actually, and is used by only
applications as flag. E.g. it's setted for the customized folder by
Explorer.
http://msdn2.microsoft.com/en-us/library/aa969337.aspx
This adds "rodir" option. If user specified it, ATTR_RO is used as
read-only flag even if it's the directory. Otherwise, inode->i_mode
is not used to hold ATTR_RO (i.e. fat_mode_can_save_ro() returns 0).
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While debugging a sync mount regression on vfat I noticed that there were
mount options parsed by the driver that were not documented.
[hirofumi@mail.parknet.co.jp: fix some parts]
Signed-off-by: Bart Trojanowski <bart@jukie.net>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched with a synchronous write transaction.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Nothing uses prepare_write or commit_write. Remove them from the tree
completely.
[akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On Linux all filesystems are supposed to be operating under Posix'
restricted chown. Restricted chown means it restricts chown to the owner
unless you have CAP_FOWNER.
NOTE: that 2 files outside of fs/xfs have been modified too for this
change.
Reviewed-by: Dave Chinner <david@fromorbit.com>
SGI-PV: 988919
SGI-Modid: 2.6.x-xfs-melb:linux:32413b
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (25 commits)
UBIFS: fix ubifs_compress commentary
UBIFS: amend printk
UBIFS: do not read unnecessary bytes when unpacking bits
UBIFS: check buffer length when scanning for LPT nodes
UBIFS: correct condition to eliminate unecessary assignment
UBIFS: add more debugging messages for LPT
UBIFS: fix bulk-read handling uptodate pages
UBIFS: improve garbage collection
UBIFS: allow for sync_fs when read-only
UBIFS: commit on sync_fs
UBIFS: correct comment for commit_on_unmount
UBIFS: update dbg_dump_inode
UBIFS: fix commentary
UBIFS: fix races in bit-fields
UBIFS: ensure data read beyond i_size is zeroed out correctly
UBIFS: correct key comparison
UBIFS: use bit-fields when possible
UBIFS: check data CRC when in error state
UBIFS: improve znode splitting rules
UBIFS: add no_chk_data_crc mount option
...
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently. Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error. It's scary for mission critical systems. On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable. So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.
If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error. If you mount it with data_err=ignore, it doesn't abort, just
call printk(). data_err=ignore is the default.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current documentation of dirty_ratio and dirty_background_ratio is a
bit misleading.
In the documentation we say that they are "a percentage of total system
memory", but the current page writeback policy, intead, is to apply the
percentages to the dirtyable memory, that means free pages + reclaimable
pages.
Better to be more explicit to clarify this concept.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped. But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g. vmalloc, sound related).
Thus hugepages are never dumped and it can't be debugged easily. Many
developers want hugepages to be included into core-dump.
However, We can't read generic VM_RESERVED area because this area is often
IO mapping area. then these area reading may change device state. it is
definitly undesiable side-effect.
So adding a hugepage specific bit to the coredump filter is better. It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.
In additional, libhugetlb use hugetlb private mapping pages as anonymous
page. Then, hugepage private mapping pages should be core dumped by
default.
Then, /proc/[pid]/core_dump_filter has two new bits.
- bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
- bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no)
I tested by following method.
% ulimit -c unlimited
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include "hugetlbfs.h"
int main(int argc, char** argv){
char* p;
int ch;
int mmap_flags = MAP_SHARED;
int fd;
int nr_pages;
while((ch = getopt(argc, argv, "p")) != -1) {
switch (ch) {
case 'p':
mmap_flags &= ~MAP_SHARED;
mmap_flags |= MAP_PRIVATE;
break;
default:
/* nothing*/
break;
}
}
argc -= optind;
argv += optind;
if (argc == 0){
printf("need # of pages\n");
exit(1);
}
nr_pages = atoi(argv[0]);
if (nr_pages < 2) {
printf("nr_pages must >2\n");
exit(1);
}
fd = hugetlbfs_unlinked_fd();
p = mmap(NULL, nr_pages * gethugepagesize(),
PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
sleep(2);
*(p + gethugepagesize()) = 1; /* COW */
sleep(2);
/* crash! */
*(int*)0 = 1;
return 0;
}
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since Ext4 is supposed to be stable in 2.6.28-rc, ext4's documentation
file should be updated.
[ More updates also added by Theodore Ts'o. ]
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add documentation for the miscellaneous device module of autofs4.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds "panic_on_unrecovered_nmi" sysctl to
Documentation/filesystems/proc.txt. The text is mainly taken from
http://readlist.com/lists/vger.kernel.org/linux-kernel/43/217998.html.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove Andrew Morton's http://www.zip.com.au/~akpm/ urls, update to new
ones when necessary, delete references otherwise.
There are still instances of that living in:
Documentation/zh_CN/HOWTO
Documentation/zh_CN/SubmittingPatches
Documentation/ko_KR/HOWTO
Documentation/ja_JP/SubmittingPatches
Signed-off-by: Francois Cami <francois.cami@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a missing word to the explanation of the purpose of the zdisk and
bzdisk make targets.
Signed-off-by: Shane McDonald <mcdonald.shane@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
First a file hello.c is created, then the file hello2.c is compiled.
Change this to hello.c
Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit. This parallels XFS. 64bit systems can handle the
larger inode numbers.
[ Added description of inode64 mount option in ocfs2.txt. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
proc: remove kernel.maps_protect
proc: remove now unneeded ADDBUF macro
[PATCH] proc: show personality via /proc/pid/personality
[PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock()
proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig
proc: make grab_header() static
proc: remove unused get_dma_list()
proc: remove dummy vmcore_open()
proc: proc_sys_root tweak
proc: fix return value of proc_reg_open() in "too late" case
Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h
If the journal doesn't abort when it gets an IO error in file data
blocks, the file data corruption will spread silently. Because
most of applications and commands do buffered writes without fsync(),
they don't notice the IO error. It's scary for mission critical
systems. On the other hand, if the journal aborts whenever it gets
an IO error in file data blocks, the system will easily become
inoperable. So this patch introduces a filesystem option to
determine whether it aborts the journal or just call printk() when
it gets an IO error in file data.
If you mount an ext4 fs with data_err=abort option, it aborts on file
data write error. If you mount it with data_err=ignore, it doesn't
abort, just call printk(). data_err=ignore is the default.
Here is the corresponding patch of the ext3 version:
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix. Also remove the requirement for the TEST_FILESYS
flag.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After commit 831830b5a2 aka
"restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace"
sysctl stopped being relevant because commit moved security checks from ->show
time to ->start time (mm_for_maps()).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Kees Cook <kees.cook@canonical.com>
Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
inode operation.
Userspace can get extent information on a file via fiemap ioctl. As input,
the fiemap ioctl takes a struct fiemap which includes an array of struct
fiemap_extent (fm_extents). Size of the extent array is passed as
fm_extent_count and number of extents returned will be written into
fm_mapped_extents. Offset and length fields on the fiemap structure
(fm_start, fm_length) describe a logical range which will be searched for
extents. All extents returned will at least partially contain this range.
The actual extent offsets and ranges returned will be unmodified from their
offset and range on-disk.
The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
is set. If errno is equal to EBADR, then fm_flags will contain those flags
which were passed in which the kernel did not understand. On all other
errors, the contents of fm_extents is undefined.
As fiemap evolved, there have been many authors of the vfs patch. As far as
I can tell, the list includes:
Kalpak Shah <kalpak.shah@sun.com>
Andreas Dilger <adilger@sun.com>
Eric Sandeen <sandeen@redhat.com>
Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: linux-api@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block. So request readahead for adjacent inode table
blocks to reduce the time it takes when iterating over directories
(especially when doing this in htree sort order) in a cold cache case.
With this patch, the time it takes to run "git status" on a kernel
tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
is reduced by 21%.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
UBIFS read performance can be improved by skipping the CRC
check when data nodes are read. This option can be used if
the underlying media is considered to be highly reliable.
Note that CRCs are always checked for metadata.
Read speed on Arm platform with OneNAND goes from 19 MiB/s
to 27 MiB/s with data CRC checking disabled.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Some flash media are capable of reading sequentially at faster rates.
UBIFS bulk-read facility is designed to take advantage of that, by
reading in one go consecutive data nodes that are also located
consecutively in the same LEB.
Read speed on Arm platform with OneNAND goes from 17 MiB/s to
19 MiB/s.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
There is no description of bit 4 of coredump_filter in the
documentation. This patch adds it.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the 2.6.27 circle ->fasync lost the BKL, and the last remaining
->open variant that takes the BKL is also gone. ->get_sb and ->kill_sb
didn't have BKL forever, so updated the entries while we're at that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update Documentation/filesystems/proc.txt: it describes the file
auto_msgmni intoduced to enable/disable msgmni automatic recomputing upon
memory add/remove (see thread http://lkml.org/lkml/2008/7/4/27). Also
added a description for msgmni (this filex is only listed in
Documentation/sysctl/kernel.txt).
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the location of the NTFS homepage in several files.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Update documentation to remind users to update mke2fs.conf
ext4: Fix small file fragmentation
ext4: Initialize writeback_index to 0 when allocating a new inode
ext4: make sure ext4_has_free_blocks returns 0 for ENOSPC
ext4: journal credit fix for the delayed allocation's writepages() function
ext4: Rework the ext4_da_writepages() function
ext4: journal credits reservation fixes for DIO, fallocate
ext4: journal credits reservation fixes for extent file writepage
ext4: journal credits calulation cleanup and fix for non-extent writepage
ext4: Fix bug where we return ENOSPC even though we have plenty of inodes
ext4: don't try to resize if there are no reserved gdt blocks left
ext4: Use ext4_discard_reservations instead of mballoc-specific call
ext4: Fix ext4_dx_readdir hash collision handling
ext4: Fix delalloc release block reservation for truncate
ext4: Fix potential truncate BUG due to i_prealloc_list being non-empty
ext4: Handle unwritten extent properly with delayed allocation
Currently source files in the Documentation/ sub-dir can easily bit-rot
since they are not generally buildable, either because they are hidden in
text files or because there are no Makefile rules for them. This needs to
be fixed so that the source files remain usable and good examples of code
instead of bad examples.
Add the ability to build source files that are in the Documentation/ dir.
Add to Kconfig as "BUILD_DOCSRC" config symbol.
Use "CONFIG_BUILD_DOCSRC=1 make ..." to build objects from the
Documentation/ sources. Or enable BUILD_DOCSRC in the *config system.
However, this symbol depends on HEADERS_CHECK since the header files need
to be installed (for userspace builds).
Built (using cross-tools) for x86-64, i386, alpha, ia64, sparc32,
sparc64, powerpc, sh, m68k, & mips.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysfs has the _ATTR() and _ATTR_RO() macros to make defining extended
form attributes easier. configfs should have something similiar.
- _CONFIGFS_ATTR() and _CONFIGFS_ATTR_RO() are the counterparts to the
sysfs macros.
- CONFIGFS_ATTR_STRUCT() creates the extended form attribute structure.
- CONFIGFS_ATTR_OPS() defines the show_attribute()/store_attribute()
operations that call the show()/store() operations of the extended
form configfs_attributes.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
These patches add the Optimized MPEG Filesystem, a proprietary filesystem used
by the embedded devices Rio Karma and ReplayTV, which are no longer
manufactured. This filesystem module enables people to access files on these
devices.
This patch:
OMFS is a proprietary filesystem created for the ReplayTV and also used by the
Rio Karma. It uses hash tables with unordered, unbounded lists in each bucket
for directories, extents for data blocks, 64-bit addressing for blocks, with
up to 8K blocks (only 2K of a given block is ever used for metadata, so the FS
still works with 4K pages).
Document the filesystem usage and structures.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allows one to create and use a channel with no associated files. Files
can be initialized later. This is useful in scenarios such as logging in
early code, before VFS is up. Therefore, such channels can be created and
used as soon as kmem_cache_init() completed.
This is needed by kmemtrace to do tracing in early kernel code.
[kosaki.motohiro@jp.fujitsu.com: build fix]
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to be able to debug things like the X server and programs using
the PPC Cell SPUs, the debugger needs to be able to access device memory
through ptrace and /proc/pid/mem.
This patch:
Add the generic_access_phys access function and put the hooks in place
to allow access_process_vm to access device or PPC Cell SPU memory.
[riel@redhat.com: Add documentation for the vm_ops->access function]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Benjamin Herrensmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
remove CONFIG_KMOD from core kernel code
remove CONFIG_KMOD from lib
remove CONFIG_KMOD from sparc64
rework try_then_request_module to do less in non-modular kernels
remove mention of CONFIG_KMOD from documentation
make CONFIG_KMOD invisible
modules: Take a shortcut for checking if an address is in a module
module: turn longs into ints for module sizes
Shrink struct module: CONFIG_UNUSED_SYMBOLS ifdefs
module: reorder struct module to save space on 64 bit builds
module: generic each_symbol iterator function
module: don't use stop_machine for waiting rmmod
Also includes a few Kconfig files (xtensa, blackfin)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Why?:
There are occasions where userspace would like to access sysfs
attributes for a device but it may not know how sysfs has named the
device or the path. For example what is the sysfs path for
/dev/disk/by-id/ata-ST3160827AS_5MT004CK? With this change a call to
stat(2) returns the major:minor then userspace can see that
/sys/dev/block/8:32 links to /sys/block/sdc.
What are the alternatives?:
1/ Add an ioctl to return the path: Doable, but sysfs is meant to reduce
the need to proliferate ioctl interfaces into the kernel, so this
seems counter productive.
2/ Use udev to create these symlinks: Also doable, but it adds a
udev dependency to utilities that might be running in a limited
environment like an initramfs.
3/ Do a full-tree search of sysfs.
[kay.sievers@vrfy.org: fix duplicate registrations]
[kay.sievers@vrfy.org: cleanup suggestions]
Cc: Neil Brown <neilb@suse.de>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Reviewed-by: SL Baur <steve@xemacs.org>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Mark Lord <lkml@rtr.ca>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
nfsd: nfs4xdr.c do-while is not a compound statement
nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
lockd: Pass "struct sockaddr *" to new failover-by-IP function
lockd: get host reference in nlmsvc_create_block() instead of callers
lockd: minor svclock.c style fixes
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
lockd: nlm_release_host() checks for NULL, caller needn't
file lock: reorder struct file_lock to save space on 64 bit builds
nfsd: take file and mnt write in nfs4_upgrade_open
nfsd: document open share bit tracking
nfsd: tabulate nfs4 xdr encoding functions
nfsd: dprint operation names
svcrdma: Change WR context get/put to use the kmem cache
svcrdma: Create a kmem cache for the WR contexts
svcrdma: Add flush_scheduled_work to module exit function
svcrdma: Limit ORD based on client's advertised IRD
svcrdma: Remove unused wait q from svcrdma_xprt structure
svcrdma: Remove unneeded spin locks from __svc_rdma_free
svcrdma: Add dma map count and WARN_ON
...
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values. These errors are
bubbled up appropriately. NULL returns are changed to -ENOMEM for
compatibility.
Also updated are the in-kernel users of configfs.
This is a rework of reverted commit 11c3b79218.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
[PATCH] ocfs2: fix oops in mmap_truncate testing
configfs: call drop_link() to cleanup after create_link() failure
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
configfs: Fix failing mkdir() making racing rmdir() fail
configfs: Fix deadlock with racing rmdir() and rename()
configfs: Make configfs_new_dirent() return error code instead of NULL
configfs: Protect configfs_dirent s_links list mutations
configfs: Introduce configfs_dirent_lock
ocfs2: Don't snprintf() without a format.
ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
ocfs2/net: Silence build warnings on sparc64
ocfs2: Handle error during journal load
ocfs2: Silence an error message in ocfs2_file_aio_read()
ocfs2: use simple_read_from_buffer()
ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
[PATCH 2/2] ocfs2: Instrument fs cluster locks
[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
* 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6:
UBIFS: include to compilation
UBIFS: add new flash file system
UBIFS: add brief documentation
MAINTAINERS: add UBIFS section
do_mounts: allow UBI root device name
VFS: export sync_sb_inodes
VFS: move inode_lock into sync_sb_inodes
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return an int.
Also updated are the in-kernel users of configfs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Adding some documentations for delayed allocation and new ordered mode.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Some of the information in Documentation/filesystems/ext4.txt is out
of date and in need of an update.
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Update the NFS/RDMA documentation to clarify how to run mount.nfs.
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
For the ranges with IORESOURCE_PREFETCH, export a new resource_wc interface in
pci /sysfs along with resource (which is uncached).
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Current IRQ affinity interface does not provide a way to set affinity
for the IRQs that will be allocated/activated in the future.
This patch creates /proc/irq/default_smp_affinity that lets users set
default affinity mask for the newly allocated IRQs. Changing the default
does not affect affinity masks for the currently active IRQs, they
have to be changed explicitly.
Updated based on Paul J's comments and added some more documentation.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: a.p.zijlstra@chello.nl
Cc: tglx@linutronix.de
Cc: rdunlap@xenotime.net
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I can't think of any valid reason for ext4 to not use barriers when
they are available; I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.
An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.
SuSE has carried such a patch for ext3 for quite some time now.
Also document the mount option while we're at it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
And with that last patch to affs killing the last put_inode instance we
can finally, after many years of transition kill this racy and awkward
interface.
(It's kinda funny that even the description in
Documentation/filesystems/vfs.txt was entirely wrong..)
Also remove a very misleading comment above the defintion of
struct super_operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A few fields in /proc/meminfo were not documented. Fix.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normally utime(2) checks current process is owner of the file, or it
has CAP_FOWNER capability. But FAT filesystem doesn't have uid/gid as
on disk info, so normal check is too unflexible.
With this option you can relax it.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Updates Documentation/vm/numa_memory_policy.txt and
Documentation/filesystems/tmpfs.txt to describe optional mempolicy mode flags.
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nothing in the tree uses nopage any more. Remove support for it in the
core mm code and documentation (and a few stray references to it in
comments).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://linux-nfs.org/~bfields/linux:
nfsd: don't allow setting ctime over v4
Update to NFS/RDMA documentation
locks: don't call ->copy_lock methods on return of conflicting locks
lockd: unlock lockd locks held for a certain filesystem
lockd: unlock lockd locks associated with a given server ip
leases: remove unneeded variable from fcntl_setlease().
leases: move lock allocation earlier in generic_setlease()
leases: when unlocking, skip locking-related steps
leases: fix a return-value mixup
Update to the NFS/RDMA documentation to clarify how to configure the
exports file.
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2.6.26 adds a SEQ_SKIP return value for the seq_file show() function;
update the documentation to match.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Add some instructions for using the new NFS/RDMA features.
Signed-off-by: James Lentini <jlentini@netapp.com>
Cc: Roland Dreier <rdreier@cisco.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Show peer group ID of nearest dominating group that has intersection
with the mount's namespace.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mszeredi@suse.cz] rewrite and split big patch into managable chunks
/proc/mounts in its current form lacks important information:
- propagation state
- root of mount for bind mounts
- the st_dev value used within the filesystem
- identifier for each mount and it's parent
It also suffers from the following problems:
- not easily extendable
- ambiguity of mountpoints within a chrooted environment
- doesn't distinguish between filesystem dependent and independent options
- doesn't distinguish between per mount and per super block options
This patch introduces /proc/<pid>/mountinfo which attempts to address
all these deficiencies.
Code shared between /proc/<pid>/mounts and /proc/<pid>/mountinfo is
extracted into separate functions.
Thanks to Al Viro for the help in getting the design right.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Requiring userspace to close and re-open sysfs attributes has been the
policy since before 2.6.12. It allows userspace to get a consistent
snapshot of kernel state and consume it with incremental reads and seeks.
Now, if the file position is zero the kernel assumes userspace wants to see
the new value. The application for this change is to allow a userspace
RAID metadata handler to check the state of an array without causing any
memory allocations. Thus not causing writeback to a raid array that might
be blocked waiting for userspace to take action.
Cc: Neil Brown <neilb@suse.de>
Acked-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Mention how DMAPI affects default for noikeep.
Slightly modified since Josef's patch was based on
an old xfs.txt prior to Dave's (dgc) checkin which
missed going to oss.
Signed-off-by: Josef Sipek <jeffpc@josefsipek.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Update xfs docs for:
* In memory inode hashes has been removed.
* noikeep is now the default.
SGI-PV: 969561
SGI-Modid: 2.6.x-xfs-melb:linux:29481b
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tim Shimmin <tes@sgi.com>
A couple of typos crept into the newly added document about the seq_file
interface. This patch corrects those typos and simultaneously deletes
unnecessary trailing spaces.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This file is nfs-related. (Maybe Documentation/filesystems/ would
benefit from a separate nfs/ directory at some point.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Documentation/ is a little large, and filesystems/ seems an obvious
place for this file.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
On Friday 2008-03-28 19:20, Jonathan Corbet wrote:
>commit 9756ccfda31b4c4544aa010aacf71b6672d668e8
>Date: Fri Mar 28 11:19:56 2008 -0600
>
> Add the seq_file documentation
patch on top:
- add const qualifiers
- remove void* casts
- use proper specifier (%Ld is not valid)
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Move laptop-mode.txt into the laptops/ sub-directory to consolidate
laptop doc files there.
Update references to the file's location.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This series addresses the problem of showing mount options in
/proc/mounts.
Several filesystems which use mount options, have not implemented a
.show_options superblock operation. Several others have implemented
this callback, but have not kept it fully up to date with the parsed
options.
Q: Why do we need correct option showing in /proc/mounts?
A: We want /proc/mounts to fully replace /etc/mtab. The reasons for
this are:
- unprivileged mounters won't be able to update /etc/mtab
- /etc/mtab doesn't work with private mount namespaces
- /etc/mtab can become out-of-sync with reality
Q: Can't this be done, so that filesystems need not bother with
implementing a .show_mounts callback, and keeping it up to date?
A: Only in some cases. Certain filesystems allow modification of a
subset of options in their remount_fs method. It is not possible
to take this into account without knowing exactly how the
filesystem handles options.
For the simple case (no remount or remount resets all options) the
patchset introduces two helpers:
generic_show_options()
save_mount_options()
These can also be used to emulate the old /etc/mtab behavior, until
proper support is added. Even if this is not 100% correct, it's still
better than showing no options at all.
The following patches fix up most in-tree filesystems, some have been
compile tested only, some have been reviewed and acked by the
maintainer.
Table displaying status of all in-kernel filesystems:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
legend:
none - fs has options, but doesn't define ->show_options()
some - fs defines ->show_options(), but some only options are shown
good - fs shows all options
noopt - fs does not have options
patch - a patch will be posted
merged - a patch has been merged by subsystem maintainer
9p good
adfs patch
affs patch
afs patch
autofs patch
autofs4 patch
befs patch
bfs noopt
cifs some
coda noopt
configfs noopt
cramfs noopt
debugfs noopt
devpts patch
ecryptfs good
efs noopt
ext2 patch
ext3 good
ext4 merged
fat patch
freevxfs noopt
fuse patch
fusectl noopt
gfs2 good
gfs2meta noopt
hfs good
hfsplus good
hostfs patch
hpfs patch
hppfs noopt
hugetlbfs patch
isofs patch
jffs2 noopt
jfs merged
minix noopt
msdos ->fat
ncpfs patch
nfs some
nfsd noopt
ntfs good
ocfs2 good
ocfs2/dlmfs noopt
openpromfs noopt
proc noopt
qnx4 noopt
ramfs noopt
reiserfs patch
romfs noopt
smbfs good
sysfs noopt
sysv noopt
udf patch
ufs good
vfat ->fat
xfs good
mm/shmem.c patch
drivers/oprofile/oprofilefs.c noopt
drivers/infiniband/hw/ipath/ipath_fs.c noopt
drivers/misc/ibmasm/ibmasmfs.c noopt
drivers/usb/core (usbfs) merged
drivers/usb/gadget (gadgetfs) noopt
drivers/isdn/capi/capifs.c patch
kernel/cpuset.c noopt
fs/binfmt_misc.c noopt
net/sunrpc/rpc_pipe.c noopt
arch/powerpc/platforms/cell/spufs patch
arch/s390/hypfs good
ipc/mqueue.c noopt
security (securityfs) noopt
security/selinux/selinuxfs.c noopt
kernel/cgroup.c good
security/smack/smackfs.c noopt
in -mm:
reiser4 some
unionfs good
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
This patch:
Document the rules for handling mount options in the .show_options
super operation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement dmode option for iso9660 filesystem to allow setting of access
rights for directories on the filesystem.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Ilya N. Golubev" <gin@mo.msk.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the old iget() call and the read_inode() superblock operation it uses
as these are really obsolete, and the use of read_inode() does not produce
proper error handling (no distinction between ENOMEM and EIO when marking an
inode bad).
Furthermore, this removes the temptation to use iget() to find an inode by
number in a filesystem from code outside that filesystem.
iget_locked() should be used instead. A new function is added in an earlier
patch (iget_failed) that is to be called to mark an inode as bad, unlock it
and release it should the get routine fail. Mark iget() and read_inode() as
being obsolete and remove references to them from the documentation.
Typically a filesystem will be modified such that the read_inode function
becomes an internal iget function, for example the following:
void thingyfs_read_inode(struct inode *inode)
{
...
}
would be changed into something like:
struct inode *thingyfs_iget(struct super_block *sp, unsigned long ino)
{
struct inode *inode;
int ret;
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
return inode;
...
unlock_new_inode(inode);
return inode;
error:
iget_failed(inode);
return ERR_PTR(ret);
}
and then thingyfs_iget() would be called rather than iget(), for example:
ret = -EINVAL;
inode = iget(sb, ino);
if (!inode || is_bad_inode(inode))
goto error;
becomes:
inode = thingyfs_iget(sb, ino);
if (IS_ERR(inode)) {
ret = PTR_ERR(inode);
goto error;
}
Note that is_bad_inode() does not need to be called. The error returned by
thingyfs_iget() should render it unnecessary.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a function to register failure in an inode construction path. This
includes marking the inode under construction as bad, unlocking it and
releasing it.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This documentation is also vfs-related.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm inclined to think dnotify belongs in filesystems/.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.
Unfortunatly some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.
Changing NR_OPEN is not considered safe because of vmalloc space potential
exhaust.
This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.
[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Though the lower_zone_protection was changed to lowmem_reserve_ratio, the
document has been not changed. The lowmem_reserve_ratio seems quite hard
to estimate, but there is no guidance. This patch is to change document
for it.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add vm.highmem_is_dirtyable toggle
A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.
Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.
[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana <brong@fastmail.fm>
Cc: Ethan Solomita <solo@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the normal, expected mountpoint in the relay(fs) example
for debugfs.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
execve arguments can be quite large. There is no limit on the number of
arguments and a 4G limit on the size of an argument.
this patch prints those aruguments in bite sized pieces. a userspace size
limitation of 8k was discovered so this keeps messages around 7.5k
single arguments larger than 7.5k in length are split into multiple records
and can be identified as aX[Y]=
Signed-off-by: Eric Paris <eparis@redhat.com>
Current ip route cache implementation is not suited to large caches.
We can consume a lot of CPU when cache must be invalidated, since we
currently need to evict all cache entries, and this eviction is
sometimes asynchronous. min_delay & max_delay can somewhat control this
asynchronism behavior, but whole thing is a kludge, regularly triggering
infamous soft lockup messages. When entries are still in use, this also
consumes a lot of ram, filling dst_garbage.list.
A better scheme is to use a generation identifier on each entry,
so that cache invalidation can be performed by changing the table
identifier, without having to scan all entries.
No more delayed flushing, no more stalling when secret_interval expires.
Invalidated entries will then be freed at GC time (controled by
ip_rt_gc_timeout or stress), or when an invalidated entry is found
in a chain when an insert is done.
Thus we keep a normal equilibrium.
This patch :
- renames rt_hash_rnd to rt_genid (and makes it an atomic_t)
- Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit)
- Checks entry->rt_genid at appropriate places :
The journal checksum feature adds two new flags i.e
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.
For recovery in pass scan checksums are verified to ensure the sanity
and completeness(in case of _ASYNC_COMMIT) of every transaction.
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Hook up ocfs2_flock(), using the new flock lock type in dlmglue.c. A new
mount option, "localflocks" is added so that users can revert to old
functionality as need be.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Local alloc is a performance optimization in ocfs2 in which a node
takes a window of bits from the global bitmap and then uses that for
all small local allocations. This window size is fixed to 8MB currently.
This patch allows users to specify the window size in MB including
disabling it by passing in 0. If the number specified is too large,
the fs will use the default value of 8MB.
mount -o localalloc=X /dev/sdX /mntpoint
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Mostly taken from ext3. This allows the user to set the jbd commit interval,
in seconds. The default of 5 seconds stays the same, but now users can
easily increase the commit interval. Typically, this would be increased in
order to benefit performance at the expense of data-safety.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Remove 'readpages' from the list in ocfs2.txt. Instead of having two
identical lists, I just removed the list in the OCFS2 section of fs/Kconfig
and added a pointer to Documentation/filesystems/ocfs2.txt.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This adds a transport to 9p for communicating between guests and a host
using a virtio based transport.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Update documentation to the current state of affairs. Remove duplicated
method descruptions in exportfs.h and point to Documentation/filesystems/
Exporting instead. Add a little file header comment in expfs.c describing
what's going on and mentioning Neils and my copyright [1].
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix some grammar in the explanation of the Journal Block Device layer.
Signed-off-by: Shaun Zinck <shaun.zinck@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
rcuref_inc_lf() is not used anymore. Replace it by atomic_inc_not_zero()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Most of these fixes were already submitted for old kernel versions, and were
approved, but for some reason they never made it into the releases.
Because this is a consolidation of a couple old missed patches, it touches both
Kconfigs and documentation texts.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
The 9P2000 protocol requires the authentication and permission checks to be
done in the file server. For that reason every user that accesses the file
server tree has to authenticate and attach to the server separately.
Multiple users can share the same connection to the server.
Currently v9fs does a single attach and executes all I/O operations as a
single user. This makes using v9fs in multiuser environment unsafe as it
depends on the client doing the permission checking.
This patch improves the 9P2000 support by allowing every user to attach
separately. The patch defines three modes of access (new mount option
'access'):
- attach-per-user (access=user) (default mode for 9P2000.u)
If a user tries to access a file served by v9fs for the first time, v9fs
sends an attach command to the server (Tattach) specifying the user. If
the attach succeeds, the user can access the v9fs tree.
As there is no uname->uid (string->integer) mapping yet, this mode works
only with the 9P2000.u dialect.
- allow only one user to access the tree (access=<uid>)
Only the user with uid can access the v9fs tree. Other users that attempt
to access it will get EPERM error.
- do all operations as a single user (access=any) (default for 9P2000)
V9fs does a single attach and all operations are done as a single user.
If this mode is selected, the v9fs behavior is identical with the current
one.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Change the names of 'uid' and 'gid' parameters to the more appropriate
'dfltuid' and 'dfltgid'. This also sets the default uid/gid to -2
(aka nfsnobody)
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch abstracts out the interfaces to underlying transports so that
new transports can be added as modules. This should also allow kernel
configuration of transports without ifdef-hell.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Add missing IRQs and IRQ descriptions to /proc/interrupts.
/proc/interrupts is most useful when it displays every IRQ vector in use by
the system, not just those somebody thought would be interesting.
This patch inserts the following vector displays to the i386 and x86_64
platforms, as appropriate:
rescheduling interrupts
TLB flush interrupts
function call interrupts
thermal event interrupts
threshold interrupts
spurious interrupts
A threshold interrupt occurs when ECC memory correction is occuring at too
high a frequency. Thresholds are used by the ECC hardware as occasional
ECC failures are part of normal operation, but long sequences of ECC
failures usually indicate a memory chip that is about to fail.
Thermal event interrupts occur when a temperature threshold has been
exceeded for some CPU chip. IIRC, a thermal interrupt is also generated
when the temperature drops back to a normal level.
A spurious interrupt is an interrupt that was raised then lowered by the
device before it could be fully processed by the APIC. Hence the apic sees
the interrupt but does not know what device it came from. For this case
the APIC hardware will assume a vector of 0xff.
Rescheduling, call, and TLB flush interrupts are sent from one CPU to
another per the needs of the OS. Typically, their statistics would be used
to discover if an interrupt flood of the given type has been occuring.
AK: merged v2 and v4 which had some more tweaks
AK: replace Local interrupts with Local timer interrupts
AK: Fixed description of interrupt types.
[ tglx: arch/x86 adaptation ]
[ mingo: small cleanup ]
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement sending of quota messages via netlink interface. The advantage
is that in userspace we can better decide what to do with the message - for
example display a dialogue in your X session or just write the message to
the console. As a bonus, we can get rid of problems with console locking
deep inside filesystem code once we remove the old printing mechanism.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
GFS2 were converted to the new aops, so we can make some simplifications
for that.
[michal.k.k.piotrowski@gmail.com: fix warning]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'locks' of git://linux-nfs.org/~bfields/linux:
nfsd: remove IS_ISMNDLCK macro
Rework /proc/locks via seq_files and seq_list helpers
fs/locks.c: use list_for_each_entry() instead of list_for_each()
NFS: clean up explicit check for mandatory locks
AFS: clean up explicit check for mandatory locks
9PFS: clean up explicit check for mandatory locks
GFS2: clean up explicit check for mandatory locks
Cleanup macros for distinguishing mandatory locks
Documentation: move locks.txt in filesystems/
locks: add warning about mandatory locking races
Documentation: move mandatory locking documentation to filesystems/
locks: Fix potential OOPS in generic_setlease()
Use list_first_entry in locks_wake_up_blocks
locks: fix flock_lock_file() comment
Memory shortage can result in inconsistent flocks state
locks: kill redundant local variable
locks: reverse order of posix_locks_conflict() arguments
Big thanks go to Mathias Kolehmainen for reporting the bug, providing
debug output and testing the patches I sent him to get it working.
The fix was to stop calling ntfs_attr_set() at mount time as that causes
balance_dirty_pages_ratelimited() to be called which on systems with
little memory actually tries to go and balance the dirty pages which tries
to take the s_umount semaphore but because we are still in fill_super()
across which the VFS holds s_umount for writing this results in a
deadlock.
We now do the dirty work by hand by submitting individual buffers. This
has the annoying "feature" that mounting can take a few seconds if the
journal is large as we have clear it all. One day someone should improve
on this by deferring the journal clearing to a helper kernel thread so it
can be done in the background but I don't have time for this at the moment
and the current solution works fine so I am leaving it like this for now.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mandatory file locking implementation has long-standing races that
probably render it useless. I know of no plans to fix them. Till we
do, we should at least warn people.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Shouldn't this mandatory-locking documentation be in the
Documentation/filesystems directory?
Give it a more descriptive name while we're at it, and update 00-INDEX
with a more inclusive description of Documentation/filesystems (which
has already talked about more than just individual filesystems).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Fix calculation of i_blocks during truncate
[PATCH] ocfs2: Fix a wrong cluster calculation.
[PATCH] ocfs2: fix mount option parsing
ocfs2: update docs for new features
ecryptfs.txt moved into filesystems, make 00-INDEX follow.
Signed-off-by: Rob Landley <rob@landley.net>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update documentation listing ocfs2 features to reflect the current state of
the file system. Add missing descriptions for some mount options which ocfs2
supports.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Updates to the MAINTAINERS file and documentation for 9p to point to the
swik wiki versus the outdated sf.net page. Also updated some email addresses
and added pointers to papers which better describe the implementation and
application of the Linux 9p client.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).
Here is a short excerpt of the semantic patch performing
this transformation:
@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@
x =
- kmalloc
+ kzalloc
(E1,E2)
... when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);
@@
expression E1,E2,E3;
@@
- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)
[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of audit_bprm() is to log the argv array to a userspace daemon at
the end of the execve system call. Since user-space hasn't had time to run,
this array is still in pristine state on the process' stack; so no need to
copy it, we can just grab it from there.
In order to minimize the damage to audit_log_*() copy each string into a
temporary kernel buffer first.
Currently the audit code requires that the full argument vector fits in a
single packet. So currently it does clip the argv size to a (sysctl) limit,
but only when execve auditing is enabled.
If the audit protocol gets extended to allow for multiple packets this check
can be removed.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ollie Wild <aaw@google.com>
Cc: <linux-audit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.
struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.
The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There seems to be very little documentation about this callback in general.
The locking in particular is a bit tricky, so it's worth having this in
writing.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.
->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).
Having the populate handler install the pte itself is likewise a nasty thing
to be doing.
This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.
The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.
After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.
NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.
[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the kernelcore= parameter for x86.
Once all patches are applied, a new command-line parameter exist and a new
sysctl. This patch adds the necessary documentation.
From: Yasunori Goto <y-goto@jp.fujitsu.com>
When "kernelcore" boot option is specified, kernel can't boot up on ia64
because of an infinite loop. In addition, the parsing code can be handled
in an architecture-independent manner.
This patch uses common code to handle the kernelcore= parameter. It is
only available to architectures that support arch-independent zone-sizing
(i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
ignore the boot parameter.
[bunk@stusta.de: make cmdline_parse_kernelcore() static]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits)
[PATCH] ocfs2: zero_user_page conversion
ocfs2: Support xfs style space reservation ioctls
ocfs2: support for removing file regions
ocfs2: update truncate handling of partial clusters
ocfs2: btree support for removal of arbirtrary extents
ocfs2: Support creation of unwritten extents
ocfs2: support writing of unwritten extents
ocfs2: small cleanup of ocfs2_write_begin_nolock()
ocfs2: btree changes for unwritten extents
ocfs2: abstract btree growing calls
ocfs2: use all extent block suballocators
ocfs2: plug truncate into cached dealloc routines
ocfs2: simplify deallocation locking
ocfs2: harden buffer check during mapping of page blocks
ocfs2: shared writeable mmap
ocfs2: factor out write aops into nolock variants
ocfs2: rework ocfs2_buffered_write_cluster()
ocfs2: take ip_alloc_sem during entire truncate
ocfs2: Add "preferred slot" mount option
[KJ PATCH] Replacing memset(<addr>,0,PAGE_SIZE) with clear_page() in fs/ocfs2/dlm/dlmrecovery.c
...
Update the description of struct file_system_type and get_sb() in
Documentation/filesystems/vfs.txt to match the current code.
Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that
region item is removed with rmdir(2), the ocfs2 mount must BUG or go
readonly. Not happy.
This provides two additional API calls: configfs_depend_item() and
configfs_undepend_item(). A client driver can call
configfs_depend_item() on an existing item to tell configfs that it is
depended on. configfs will then return -EBUSY from rmdir(2) for that
item. When the item is no longer depended on, the client driver calls
configfs_undepend_item() on it.
These API cannot be called underneath any configfs callbacks, as
they will conflict. They can block and allocate. A client driver
probably shouldn't calling them of its own gumption. Rather it should
be providing an API that external subsystems call.
How does this work? Imagine the ocfs2 mount process. When it mounts,
it asks for a heart region item. This is done via a call into the
heartbeat code. Inside the heartbeat code, the region item is looked
up. Here, the heartbeat code calls configfs_depend_item(). If it
succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
[ Fixed some bad whitespace in configfs.txt. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Add a notification callback, ops->disconnect_notify(). It has the same
prototype as ->drop_item(), but it will be called just before the item
linkage is broken. This way, configfs users who want to do work while
the object is still in the heirarchy have a chance.
Client drivers will still need to config_item_put() in their
->drop_item(), if they implement it. They need do nothing in
->disconnect_notify(). They don't have to provide it if they don't
care. But someone who wants to be notified before ci_parent is set to
NULL can now be notified.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Convert the su_sem member of struct configfs_subsystem to a struct
mutex, as that's what it is. Also convert all the users and update
Documentation/configfs.txt and Documentation/configfs_example.c
accordingly.
[ Conflict in fs/dlm/config.c with commit
3168b0780d manually resolved. --Mark ]
Inspired-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
offline node, crashes as soon as data is allocated upon it. Now restrict it
to online nodes, where before it restricted to MAX_NUMNODES.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. it got changed to 'i_mutex' some time ago.
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch substitutes i_sem by i_mutex in
Documentation/filesystems/Locking.
The patch also removes a couple of trailing white-spaces.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Fix various typos in kernel docs and Kconfigs, 2.6.21-rc4.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
JFS: Fix race waking up jfsIO kernel thread
JFS: use __set_current_state()
Copy i_flags to jfs inode flags on write
JFS: document uid, gid, and umask mount options in jfs.txt
It seems that the recent Windows changed specification, and it's
undocumented. Windows doesn't update ->free_clusters correctly.
This patch doesn't use ->free_clusters by default. (instead, add "usefree"
for forcing to use it)
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Juergen Beisert <juergen127@kreuzholzen.de>
Cc: Andreas Schwab <schwab@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) Introduces a new method in 'struct dentry_operations'. This method
called d_dname() might be called from d_path() to build a pathname for
special filesystems. It is called without locks.
Future patches (if we succeed in having one common dentry for all
pipes/sockets) may need to change prototype of this method, but we now
use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);
2) Adds a dynamic_dname() helper function that eases d_dname() implementations
3) Defines d_dname method for sockets : No more sprintf() at socket
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
4) Defines d_dname method for pipes : No more sprintf() at pipe
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a
*nice* speedup on my Pentium(M) 1.6 Ghz :
3.090 s instead of 3.450 s
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes. Issues:
- maps should not be world-readable, especially if programs expect any
kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
check the maps when %n is in a *printf call, and a setuid(getuid())
process wouldn't be able to read its own maps file. (For reference
see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
non-root applications that depend on access to the maps contents.
This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents. To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
pte_mkold() and ClearPageReferenced() is called for each pte and its
corresponding page, respectively, in that task's VMAs. This file is only
writable by the user who owns the task.
It is now possible to measure _approximately_ how much memory a task is using
by clearing the reference bits with
echo 1 > /proc/pid/clear_refs
and checking the reference count for each VMA from the /proc/pid/smaps output
at a measured time interval. For example, to observe the approximate change
in memory footprint for a task, write a script that clears the references
(echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
extracts the size in kB. Add the sizes for each VMA together for the total
referenced footprint. Moments later, repeat the process and observe the
difference.
For example, using an efficient Mozilla:
accumulated time referenced memory
---------------- -----------------
0 s 408 kB
1 s 408 kB
2 s 556 kB
3 s 1028 kB
4 s 872 kB
5 s 1956 kB
6 s 416 kB
7 s 1560 kB
8 s 2336 kB
9 s 1044 kB
10 s 416 kB
This is a valuable tool to get an approximate measurement of the memory
footprint for a task.
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
[akpm@linux-foundation.org: build fixes]
[mpm@selenic.com: rename for_each_pmd]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Covert network warning messages from a compile time to runtime choice.
Removes kernel config option and replaces it with new /proc/sys/net/core/warnings.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add some documentation for the new and very useful io-accounting feature.
It's being added to Documentation/filesystems/proc.txt
Signed-off-by: Roland Kletzing <devzero@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
simple_prepare_write leaks uninitialised kernel data. This happens because
the it leaves an uninitialised "hole" over the part of the page that the
write is expected to go to. This is fine, but it then marks the page
uptodate, which means a concurrent read can come in and copy the
uninitialised memory into userspace before it written to.
Fix it by simply marking it uptodate in simple_commit_write instead, after
the hole has been filled in. This could theoretically break an fs that
uses simple_prepare_write and not simple_commit_write, and that relies on
the incorrect simple_prepare_write behaviour. Luckily, none of those
exists in the tree.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: implement optional loose read cache
9p: Use kthread_stop instead of sending a SIGKILL.
While cacheing is generally frowned upon in the 9p world, it has its
place -- particularly in situations where the remote file system is
exclusive and/or read-only. The vacfs views of venti content addressable
store are a real-world instance of such a situation. To facilitate higher
performance for these workloads (and eventually use the fscache patches),
we have enabled a "loose" cache mode which does not attempt to maintain
any form of consistency on the page-cache or dcache. This results in over
two orders of magnitude performance improvement for cacheable block reads
in the Bonnie benchmark. The more aggressive use of the dcache also seems
to improve metadata operational performance.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
These series of patches add UFS2 write-support. UFS2 - is default file system
for recent versions of FreeBSD.
The main differences from UFS1 from write support point of view
are:
1)Not all inodes are allocated during formatation of disk.
2)All meta-data(pointer to data blocks) are 64bit(in UFS1 they
are 32bit).
So patch series consist of
1)make possible mount UFS2 in read-write mode
2)code to write ufs2 inodes and code to initialize inodes chunks.
3)work with 64bit meta-data
I made simple testing like create/deleting/writing/reading/truncating, also I
ran fsx-linux and untar and build kernel on UFS1 and UFS2, after that FreeBSD
fsck do not find any errors in fs.
This patch makes possible to mount ufs2 "rw", and updates UFS2 documentation:
remove note about bug(it fixed by reallocate blocks on the fly patch) and add
me in the list of people who want receive bug reports.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathieu originally needed to add this for tracing Xen, but it's something
that's needed for any application that can be tracing while cpus are added.
unplug isn't supported by this patch. The thought was that at minumum a new
buffer needs to be added when a cpu comes up, but it wasn't worth the effort
to remove buffers on cpu down since they'd be freed soon anyway when the
channel was closed.
[zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the documentation to cover using Inferno as a server for 9p and to
include information about spfs (a stable single-threaded stand-alone 9p
server).
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix deadlock in fs/ntfs/inode.c::ntfs_put_inode(). Thanks to Sergey
Vlasov for the report and detailed analysis of the deadlock. The fix
involved getting rid of ntfs_put_inode() altogether and hence NTFS no
longer has a ->put_inode super operation.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
NFS: Fix race in nfs_release_page()
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.
Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.
[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As Adrian pointed out recently, there were still a couple of places where
I should have fixed my email address.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We forgot to document the atime_quantum mount option in ocfs2.txt. This adds
a proper description of how it works.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Remove two different changelog files from fs/sysv/ and merges the INTRO
file into Documentation/filesystems/sysv-fs.txt
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add 'blksize' option for block device based filesystems. During
initialization this is used to set the block size on the device and the super
block. The default block size is 512bytes.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I never intended this, but people started using fuse to implement block device
based "real" filesystems (ntfs-3g, zfs).
The following four patches add better support for these kinds of filesystems.
Unlike "normal" fuse filesystems, using this feature should require superuser
privileges (enforced by the fusermount utility).
Thanks to Szabolcs Szakacsits for the input and testing.
This patch adds a 'fuseblk' filesystem type, which is only different from the
'fuse' filesystem type in how the 'dev_name' mount argument is interpreted.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes typos in various Documentation txts. The patch addresses some
misc words.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses some
+words starting with the letters 'U-Z'.
Looks like I made it through the alphabet...just in time to start over again
+too! Maybe I can fit more profound fixes into the next round...? Time will
+tell. :)
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses some
+words starting with the letter 'T'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Change Documentation/filesystems/udf.txt from saying that read/write mounts
on cd media are not supported to instead state the current level of
support. Specifically that it works fine on dvd+rw media and can be made
to work on cd-rw media via the pktcdvd device.
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This file, ext4.txt, was put together with information from Andrew Morton,
Andreas Dilger, Suparna Bhattacharya, and Ted Ts'o.
I copied the mount options, with the exception of "extents", from ext3.txt,
so if anyone is aware of anything out-of-date, please let me know.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove many duplicated words under Documentation/ and do other small
cleanups.
Examples:
"and and" --> "and"
"in in" --> "in"
"the the" --> "the"
"the the" --> "to the"
...
Signed-off-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letter 'S'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letters 'Q'-'R'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Randy brought it to my attention that in proper english "can not" should always
be written "cannot". I donot see any reason to argue, even if I mightnot
understand why this rule exists. This patch fixes "can not" in several
Documentation files as well as three Kconfigs.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letters 'N'-'P'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letters 'H'-'M'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letters 'F'-'G'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. This patch addresses
some words starting with the letters 'D'-'E'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts. This patch addresses some
words starting with the letters 'B'-'C'. There are also a few grammar fixes
thrown in for Randy. ;)
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch fixes typos in various Documentation txts.
This patch addresses some words starting with the letter 'A'.
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Now that devfs is removed, there's no longer any need to document how to
do this or that with devfs.
This patch includes some improvements by Joe Perches.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I was looking for the a way around an OOM-problem, and found a couple of
undocumented new features for tuning the OOM-score of individual processes.
Here's a small documentation patch for /proc/<pid>/oom_adj and
/proc/<pid>/oom_score.
Signed-off-by: Jan-Frode Myklebust <mykleb@no.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds a new /proc/sys/kernel/nmi_watchdog call that will enable/disable the
nmi watchdog.
By entering a non-zero value here, a user can enable the nmi watchdog to
monitor the online cpus in the system. By entering a zero value here, a
user can disable the nmi watchdog and free up a performance counter which
could then be utilized by the oprofile subsystem, otherwise oprofile may be
short a counter when in use.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Here's updated documentation for the relay interface, rewritten to match
the relayfs->relay changes. It also moves relayfs.txt to relay.txt in the
process.
It includes the changes to relayfs.txt previously posted by Randy Dunlap,
thanks for those.
The relay-apps examples have also been updated to match, and can be found
on the sourceforge relayfs website.
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As I was looking over the get_sb() changes, I stumbled across a little
mistake in the documentation updates. Unless we're getting into an
interesting new object-oriented realm, I doubt that get_sb() should really
return "struct int"...
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: remove redundant NULL checks in ocfs2_direct_IO_get_blocks()
ocfs2: clean up some osb fields
ocfs2: fix init of uuid_net_key
ocfs2: silence a debug print
ocfs2: silence ENOENT during lookup of broken links
ocfs2: Cleanup message prints
ocfs2: silence -EEXIST from ocfs2_extent_map_insert/lookup
[PATCH] fs/ocfs2/dlm/dlmrecovery.c: make dlm_lockres_master_requery() static
ocfs2: warn the user on a dead timeout mismatch
ocfs2: OCFS2_FS must depend on SYSFS
ocfs2: Compile-time disabling of ocfs2 debugging output.
configfs: Clear up a few extra spaces where there should be TABs.
configfs: Release memory in configfs_example.
The configfs_example module was missing a ->release().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch adds "-o bh" option to force use of buffer_heads. This option
is needed when we make "nobh" as default - and if we run into problems.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (51 commits)
nfs: remove nfs_put_link()
nfs-build-fix-99
git-nfs-build-fixes
Merge branch 'odirect'
NFS: alloc nfs_read/write_data as direct I/O is scheduled
NFS: Eliminate nfs_get_user_pages()
NFS: refactor nfs_direct_free_user_pages
NFS: remove user_addr, user_count, and pos from nfs_direct_req
NFS: "open code" the NFS direct write rescheduler
NFS: Separate functions for counting outstanding NFS direct I/Os
NLM: Fix reclaim races
NLM: sem to mutex conversion
locks.c: add the fl_owner to nlm_compare_locks
NFS: Display the chosen RPCSEC_GSS security flavour in /proc/mounts
NFS: Split fs/nfs/inode.c
NFS: Fix typo in nfs_do_clone_mount()
NFS: Fix compile errors introduced by referrals patches
NFSv4: Ensure that referral mounts bind to a reserved port
NFSv4: A root pathname is sent as a zero component4
NFSv4: Follow a referral
...
New section on creating an external initramfs image using cpio (with
script), a warning about bad advice in the cpio man page, a bit of
debugging advice (hello world and rdinit=/bin/sh), and a few minor tweaks
to other parts of it.
Signed-off-by: Rob Landley <rob@landley.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add synchronous request interruption. This is needed for file locking
operations which have to be interruptible. However filesystem may implement
interruptibility of other operations (e.g. like NFS 'intr' mount option).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a control filesystem to fuse, replacing the attributes currently exported
through sysfs. An empty directory '/sys/fs/fuse/connections' is still created
in sysfs, and mounting the control filesystem here provides backward
compatibility.
Advantages of the control filesystem over the previous solution:
- allows the object directory and the attributes to be owned by the
filesystem owner, hence letting unpriviled users abort the
filesystem connection
- does not suffer from module unload race
[akpm@osdl.org: fix this fs for recent dhowells depredations]
[akpm@osdl.org: fix 64-bit printk warnings]
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't put requests into the background when a fatal interrupt occurs while the
request is in userspace. This removes a major wart from the implementation.
Backgrounding of requests was introduced to allow breaking of deadlocks.
However now the same can be achieved by aborting the filesystem through the
'abort' sysfs attribute.
This is a change in the interface, but should not cause problems, since these
kinds of deadlocks never happen during normal operation.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.
This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.
linux/mount.h has been added where necessary to make allyesconfig build
successfully.
Interest has also been expressed for use with the FUSE and XFS filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update kernel documentation to include a description of the inotify
kernel API.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace all module uses with the new vfs_kern_mount() interface, and fix up
simple_pin_fs().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds the new splice_write and splice_read file operations to
Documentation/filesystems/vfs.txt.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jens Axboe <axboe@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits)
BUG_ON() Conversion in drivers/video/
BUG_ON() Conversion in drivers/parisc/
BUG_ON() Conversion in drivers/block/
BUG_ON() Conversion in sound/sparc/cs4231.c
BUG_ON() Conversion in drivers/s390/block/dasd.c
BUG_ON() Conversion in lib/swiotlb.c
BUG_ON() Conversion in kernel/cpu.c
BUG_ON() Conversion in ipc/msg.c
BUG_ON() Conversion in block/elevator.c
BUG_ON() Conversion in fs/coda/
BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
BUG_ON() Conversion in input/serio/hil_mlc.c
BUG_ON() Conversion in md/dm-hw-handler.c
BUG_ON() Conversion in md/bitmap.c
The comment describing how MS_ASYNC works in msync.c is confusing
rcu: undeclared variable used in documentation
fix typos "wich" -> "which"
typo patch for fs/ufs/super.c
Fix simple typos
tabify drivers/char/Makefile
...
As Pekka Enberg pointed out, with the if still following the else, you can
still get a null uid written to the disk if you specify a default uid= without
uid=forget. In other words, if the desktop user is uid=1000 and the mount
option uid=1000 is given ( which is done on ubuntu automatically and probably
other distributions that use hal ), then if any other user besides uid 1000
owns a file then a 0 will be written to the media as the owning uid instead.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Flesh out the description of the address_space operations.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Avishay Traeger <atraeger@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix documentation to match current implementation.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There were a number of conflicting naming schemes used in the v9fs project.
The directory was fs/9p, but MAINTAINERS and Documentation referred to
v9fs. The module name itself was 9p2000, and the file system type was 9P.
This patch attempts to clean that up, changing all references to 9p in
order to match the directory name. We'll also start using 9p instead of
v9fs as our patch prefix.
There is also a minor consistency cleanup in the options changing the name
option to uname in order to more closely match the Plan 9 options.
Signed-off-by: Eric Van Hensbergevan <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
akpm points out that switching to a non-NUMA kernel could be irritating
if mounting tmpfs fails on an mpol option: tmpfs.txt recommend remount.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I've been dissatisfied with the mpol_nodelist mount option which was
added to tmpfs earlier in -rc. Replace it by mpol=policy:nodelist.
And it was broken: a nodelist is a comma-separated list of numbers and
ranges; the mount options are a comma-separated list of token=values.
Whoops, blindly strsep'ing on commas doesn't work so well: since we've
no numeric tokens, and unlikely to add them, use that to distinguish.
Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject
all its options as invalid if not NUMA. /proc shows MPOL_PREFERRED
as "prefer", so use that name for the policy instead of "preferred".
Enforce that mpol=default has no nodelist; that mpol=prefer has one
node only; that mpol=bind has a nodelist; but let mpol=interleave use
node_online_map if no nodelist given. Describe this in tmpfs.txt.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Robin Holt <holt@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Minor updates to the documentation to bring them into sync with current
websites and available features. The debug flag was switched back to hex
to match the documentation.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
configfs always made item and attribute ownership root.root and
permissions based on a umask of 022. Add ->setattr() to allow
chown(2)/chmod(2), and persist the changes for the lifetime of the
items and attributes.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Update ocfs2.txt to add "cluster aware lockf" under missing features.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Add documentation for new attributes in sysfs. Also describe the filesystem.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anything that writes into a tmpfs filesystem is liable to disproportionately
decrease the available memory on a particular node. Since there's no telling
what sort of application (e.g. dd/cp/cat) might be dropping large files
there, this lets the admin choose the appropriate default behavior for their
site's situation.
Introduce a tmpfs mount option which allows specifying a memory policy and
a second option to specify the nodelist for that policy. With the default
policy, tmpfs will behave as it does today. This patch adds support for
preferred, bind, and interleave policies.
The default policy will cause pages to be added to tmpfs files on the node
which is doing the writing. Some jobs expect a single process to create
and manage the tmpfs files. This results in a node which has a
significantly reduced number of free pages.
With this patch, the administrator can specify the policy and nodes for
that policy where they would prefer allocations.
This patch was originally written by Brent Casavant and Hugh Dickins. I
added support for the bind and preferred policies and the mpol_nodelist
mount option.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Undocument the non-working resize= mount option in ext3, and add some
references to the ext2resize package instead, which appears to be the only
proper way of doing online resizing of ext3 filesystems.
Signed-off-by: Tore Anderson <tore@fud.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
idr gently pointed out today that not only is the sysfs rom file
interface somewhat unintuitive (despite my efforts and initial
implementation), but it's also undocumented! This patch to
Documentation/filesystems/sysfs-pci.txt corrects the latter problem; the
former is a userland ABI now though, so we're stuck with it for awhile
at least.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Based on questions people have asked me. Repeatedly.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch below adds a new mount option to allow the external journal
device to be specified.
The syntax is as follows:
# mount -t ext3 -o journal_dev=0x0820 ...
where 0x0820 means major=8 and minor=32.
Signed-off-by: Johann Lombardi <johann.lombardi@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
librelay and relay-app.h have been retired - update Documentation to reflect
that.
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Documentation update for creating global buffers.
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Documentation update for creating relay files in other filesystems.
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can. THis
operation requires root permissions.
It won't drop dirty data, so the user should run `sync' first.
Caveats:
a) Holds inode_lock for exorbitant amounts of time.
b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.
This is a debugging feature: useful for getting consistent results between
filesystem benchmarks. We could possibly put it under a config option, but
it's less than 300 bytes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
dlmfs: A minimal dlm userspace interface implemented via a virtual
file system.
Most of the OCFS2 tools make use of this to take cluster locks when
doing operations on the file system.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Configfs, a file system for userspace-driven kernel object configuration.
The OCFS2 stack makes extensive use of this for propagation of cluster
configuration information into kernel.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Remove non-existing entry for fat_cvf.txt (was it ever supported?).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Reported by Jacques de Mer and Daniel Drake <dsd@gentoo.org>.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Correct lots of URLs in Documentation/ Also a few minor whitespace cleanups
and typo/spello fixes. Sadly there are still a lot of bad URLs remaining.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The CONFIG_EXT{2,3}_CHECK options where were never available, and all they
did was to implement a subset of e2fsck in the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Docs for ramfs, rootfs, and initramfs.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits dentry locking documentation from
Documentation/filesystems/vfs.txt to a separate file. The dentry locking
bits are useful but do not fit into the VFS overview document as is.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch updates the Documentation/filesystems/vfs.txt document. I
rearranged and rewrote parts of the introduction chapter and added better
headings for each section. I also added a description for the inode
rename() operation which was missing and added links to some useful
external VFS documentation.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
file operations ->write(), ->aio_write(), and ->writev() for regular
files. This replaces the old use of generic_file_write(), et al and
the address space operations ->prepare_write and ->commit_write.
This means that both sparse and non-sparse (unencrypted and
uncompressed) files can now be extended using the normal write(2)
code path. There are two limitations at present and these are that
we never create sparse files and that we only have limited support
for highly fragmented files, i.e. ones whose data attribute is split
across multiple extents. When such a case is encountered,
EOPNOTSUPP is returned.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Make data caching behavior selectable on a per-open basis instead of
per-mount. Compatibility for the old mount options 'kernel_cache' and
'direct_io' is retained in the userspace library (version 2.4.0-pre1 or
later).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds the FUSE device handling functions.
This contains the following files:
o dev.c
- fuse device operations (read, write, release, poll)
- registers misc device
- support for sending requests to userspace
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch brings the now out-of-date Documentation/filesystems/vfs.txt
back to life. Thanks to Carsten Otte, Trond Myklebust, and Anton
Altaparmakov for their help on updating this documentation.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Someone complained about the docs for vm_overcommit_memory being wrong.
This patch copies the text from the vm documentation into procfs.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OVERVIEW
V9FS is a distributed file system for Linux which provides an
implementation of the Plan 9 resource sharing protocol 9P. It can be
used to share all sorts of resources: static files, synthetic file servers
(such as /proc or /sys), devices, and application file servers (such as
FUSE).
BACKGROUND
Plan 9 (http://plan9.bell-labs.com/plan9) is a research operating
system and associated applications suite developed by the Computing
Science Research Center of AT&T Bell Laboratories (now a part of
Lucent Technologies), the same group that developed UNIX , C, and C++.
Plan 9 was initially released in 1993 to universities, and then made
generally available in 1995. Its core operating systems code laid the
foundation for the Inferno Operating System released as a product by
Lucent Bell-Labs in 1997. The Inferno venture was the only commercial
embodiment of Plan 9 and is currently maintained as a product by Vita
Nuova (http://www.vitanuova.com). After updated releases in 2000 and
2002, Plan 9 was open-sourced under the OSI approved Lucent Public
License in 2003.
The Plan 9 project was started by Ken Thompson and Rob Pike in 1985.
Their intent was to explore potential solutions to some of the
shortcomings of UNIX in the face of the widespread use of high-speed
networks to connect machines. In UNIX, networking was an afterthought
and UNIX clusters became little more than a network of stand-alone
systems. Plan 9 was designed from first principles as a seamless
distributed system with integrated secure network resource sharing.
Applications and services were architected in such a way as to allow
for implicit distribution across a cluster of systems. Configuring an
environment to use remote application components or services in place
of their local equivalent could be achieved with a few simple command
line instructions. For the most part, application implementations
operated independent of the location of their actual resources.
Commercial operating systems haven't changed much in the 20 years
since Plan 9 was conceived. Network and distributed systems support is
provided by a patchwork of middle-ware, with an endless number of
packages supplying pieces of the puzzle. Matters are complicated by
the use of different complicated protocols for individual services,
and separate implementations for kernel and application resources.
The V9FS project (http://v9fs.sourceforge.net) is an attempt to bring
Plan 9's unified approach to resource sharing to Linux and other
operating systems via support for the 9P2000 resource sharing
protocol.
V9FS HISTORY
V9FS was originally developed by Ron Minnich and Maya Gokhale at Los
Alamos National Labs (LANL) in 1997. In November of 2001, Greg Watson
setup a SourceForge project as a public repository for the code which
supported the Linux 2.4 kernel.
About a year ago, I picked up the initial attempt Ron Minnich had
made to provide 2.6 support and got the code integrated into a 2.6.5
kernel. I then went through a line-for-line re-write attempting to
clean-up the code while more closely following the Linux Kernel style
guidelines. I co-authored a paper with Ron Minnich on the V9FS Linux
support including performance comparisons to NFSv3 using Bonnie and
PostMark - this paper appeared at the USENIX/FREENIX 2005
conference in April 2005:
( http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html ).
CALL FOR PARTICIPATION/REQUEST FOR COMMENTS
Our 2.6 kernel support is stabilizing and we'd like to begin pursuing
its integration into the official kernel tree. We would appreciate any
review, comments, critiques, and additions from this community and are
actively seeking people to join our project and help us produce
something that would be acceptable and useful to the Linux community.
STATUS
The code is reasonably stable, although there are no doubt corner cases
our regression tests haven't discovered yet. It is in regular use by several
of the developers and has been tested on x86 and PowerPC
(32-bit and 64-bit) in both small and large (LANL cluster) deployments.
Our current regression tests include fsx, bonnie, and postmark.
It was our intention to keep things as simple as possible for this
release -- trying to focus on correctness within the core of the
protocol support versus a rich set of features. For example: a more
complete security model and cache layer are in the road map, but
excluded from this release. Additionally, we have removed support for
mmap operations at Al Viro's request.
PERFORMANCE
Detailed performance numbers and analysis are included in the FREENIX
paper, but we show comparable performance to NFSv3 for large file
operations based on the Bonnie benchmark, and superior performance for
many small file operations based on the PostMark benchmark. Somewhat
preliminary graphs (from the FREENIX paper) are available
(http://v9fs.sourceforge.net/perf/index.html).
RESOURCES
The source code is available in a few different forms:
tarballs: http://v9fs.sf.net
CVSweb: http://cvs.sourceforge.net/viewcvs.py/v9fs/linux-9p/
CVS: :pserver:anonymous@cvs.sourceforge.net:/cvsroot/v9fs/linux-9p
Git: rsync://v9fs.graverobber.org/v9fs (webgit: http://v9fs.graverobber.org)
9P: tcp!v9fs.graverobber.org!6564
The user-level server is available from either the Plan 9 distribution
or from http://v9fs.sf.net
Other support applications are still being developed, but preliminary
version can be downloaded from sourceforge.
Documentation on the protocol has historically been the Plan 9 Man
pages (http://plan9.bell-labs.com/sys/man/5/INDEX.html), but there is
an effort under way to write a more complete Internet-Draft style
specification (http://v9fs.sf.net/rfc).
There are a couple of mailing lists supporting v9fs, but the most used
is v9fs-developer@lists.sourceforge.net -- please direct/cc your
comments there so the other v9fs contibutors can participate in the
conversation. There is also an IRC channel: irc://freenode.net/#v9fs
This part of the patch contains Documentation, Makefiles, and configuration
file changes.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's the latest version of relayfs, against linux-2.6.11-mm2. I'm hoping
you'll consider putting this version back into your tree - the previous
rounds of comment seem to have shaken out all the API issues and the number
of comments on the code itself have also steadily dwindled.
This patch is essentially the same as the relayfs redux part 5 patch, with
some minor changes based on reviewer comments. Thanks again to Pekka
Enberg for those. The patch size without documentation is now a little
smaller at just over 40k. Here's a detailed list of the changes:
- removed the attribute_flags in relay open and changed it to a
boolean specifying either overwrite or no-overwrite mode, and removed
everything referencing the attribute flags.
- added a check for NULL names in relayfs_create_entry()
- got rid of the unnecessary multiple labels in relay_create_buf()
- some minor simplification of relay_alloc_buf() which got rid of a
couple params
- updated the Documentation
In addition, this version (through code contained in the relay-apps tarball
linked to below, not as part of the relayfs patch) tries to make it as easy
as possible to create the cooperating kernel/user pieces of a typical and
common type of logging application, one where kernel logging is kicked off
when a user space data collection app starts and stops when the collection
app exits, with the data being automatically logged to disk in between. To
create this type of application, you basically just include a header file
(relay-app.h, included in the relay-apps tarball) in your kernel module,
define a couple of callbacks and call an initialization function, and on
the user side call a single function that sets up and continuously monitors
the buffers, and writes data to files as it becomes available. Channels
are created when the collection app is started and destroyed when it exits,
not when the kernel module is inserted, so different channel buffer sizes
can be specified for each separate run via command-line options. See the
README in the relay-apps tarball for details.
Also included in the relay-apps tarball are a couple examples
demonstrating how you can use this to create quick and dirty kernel
logging/debugging applications. They are:
- tprintk, short for 'tee printk', which temporarily puts a kprobe on
printk() and writes a duplicate stream of printk output to a relayfs
channel. This could be used anywhere there's printk() debugging code
in the kernel which you'd like to exercise, but would rather not have
your system logs cluttered with debugging junk. You'd probably want
to kill klogd while you do this, otherwise there wouldn't be much
point (since putting a kprobe on printk() doesn't change the output
of printk()). I've used this method to temporarily divert the packet
logging output of the iptables LOG target from the system logs to
relayfs files instead, for instance.
- klog, which just provides a printk-like formatted logging function
on top of relayfs. Again, you can use this to keep stuff out of your
system logs if used in place of printk.
The example applications can be found here:
http://prdownloads.sourceforge.net/dprobes/relay-apps.tar.gz?download
From: Christoph Hellwig <hch@lst.de>
avoid lookup_hash usage in relayfs
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change filemode to use defines in stead of 0644,
based on suggestions by Walter Harms and Domen Puncer.
Signed-off-by: Jan Veldeman <Jan.Veldeman@advalvas.be>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix whitespace after comma between parameters.
Signed-off-by: Jan Veldeman <Jan.Veldeman@advalvas.be>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a "smaps" entry to /proc/pid: show howmuch memory is resident in each
mapping.
People that want to perform a memory consumption analysing can use it
mainly if someone needs to figure out which libraries can be reduced for
embedded systems. So the new features are the physical size of shared and
clean [or dirty]; private and clean [or dirty].
Take a look the example below:
# cat /proc/4576/smaps
08048000-080dc000 r-xp /bin/bash
Size: 592 KB
Rss: 500 KB
Shared_Clean: 500 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
080dc000-080e2000 rw-p /bin/bash
Size: 24 KB
Rss: 24 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 24 KB
080e2000-08116000 rw-p
Size: 208 KB
Rss: 208 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 208 KB
b7e2b000-b7e34000 r-xp /lib/tls/libnss_files-2.3.2.so
Size: 36 KB
Rss: 12 KB
Shared_Clean: 12 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
...
(Includes a cleanup from "Richard Purdie" <rpurdie@rpsys.net>)
From: Torsten Foertsch <torsten.foertsch@gmx.net>
show_smap calls first show_map and then prints its additional information to
the seq_file. show_map checks if all it has to print fits into the buffer and
if yes marks the current vma as written. While that is correct for show_map
it is not for show_smap. Here the vma should be marked as written only after
the additional information is also written.
The attached patch cures the problem. It moves the functionality of the
show_map function to a new function show_map_internal that is called with an
additional struct mem_size_stats* argument. Then show_map calls
show_map_internal with NULL as struct mem_size_stats* whereas show_smap calls
it with a real pointer. Now the final
if (m->count < m->size) /* vma is copied successfully */
m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
is done only if the whole entry fits into the buffer.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up and expand some of the inotify documentation.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The situation: VFS inode X on a mounted ntfs volume is dirty. For
same inode X, the ntfs_inode is dirty and thus corresponding on-disk
inode, i.e. mft record, which is in a dirty PAGE_CACHE_PAGE belonging
to the table of inodes, i.e. $MFT, inode 0.
What happens:
Process 1: sys_sync()/umount()/whatever... calls
__sync_single_inode() for $MFT -> do_writepages() -> write_page for
the dirty page containing the on-disk inode X, the page is now locked
-> ntfs_write_mst_block() which clears PageUptodate() on the page to
prevent anyone else getting hold of it whilst it does the write out.
This is necessary as the on-disk inode needs "fixups" applied before
the write to disk which are removed again after the write and
PageUptodate is then set again. It then analyses the page looking
for dirty on-disk inodes and when it finds one it calls
ntfs_may_write_mft_record() to see if it is safe to write this
on-disk inode. This then calls ilookup5() to check if the
corresponding VFS inode is in icache(). This in turn calls ifind()
which waits on the inode lock via wait_on_inode whilst holding the
global inode_lock.
Process 2: pdflush results in a call to __sync_single_inode for the
same VFS inode X on the ntfs volume. This locks the inode (I_LOCK)
then calls write-inode -> ntfs_write_inode -> map_mft_record() ->
read_cache_page() for the page (in page cache of table of inodes
$MFT, inode 0) containing the on-disk inode. This page has
PageUptodate() clear because of Process 1 (see above) so
read_cache_page() blocks when it tries to take the page lock for the
page so it can call ntfs_read_page().
Thus Process 1 is holding the page lock on the page containing the
on-disk inode X and it is waiting on the inode X to be unlocked in
ifind() so it can write the page out and then unlock the page.
And Process 2 is holding the inode lock on inode X and is waiting for
the page to be unlocked so it can call ntfs_readpage() or discover
that Process 1 set PageUptodate() again and use the page.
Thus we have a deadlock due to ifind() waiting on the inode lock.
The solution: The fix is to use the newly introduced
ilookup5_nowait() which does not wait on the inode's lock and hence
avoids the deadlock. This is safe as we do not care about the VFS
inode and only use the fact that it is in the VFS inode cache and the
fact that the vfs and ntfs inodes are one struct in memory to find
the ntfs inode in memory if present. Also, the ntfs inode has its
own locking so it does not matter if the vfs inode is locked.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
The current isofs treatment of hidden files is flawed in two ways. First,
it does not provide sufficient granularity; it hides both 'hidden' files
and 'associated' files (resource fork for Mac files). Second, the default
behavior to completely strip hidden files, while an admirable
implementation of the spec, is a poor choice given the real world use of
hidden files as a poor mans copy protection scheme for MSDOS and Windows
based systems. A longer description of this is available here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.3/0267.html
This patch was originally built after a few private conversations with Alan
Cox; I shamefully failed to persist in seeing it go forward, I hope to make
amends now.
This patch introduces granularity by allowing explicit control for both
hidden and associated files. It also reverses the default so that by
default, hidden files are treated as regular files on the iso9660 file
system.
This allow Wine to process Windows CDs, including those that are hybrid
Mac/Windows CDs properly and completely, without our having to go muck up
peoples fstabs as we do now. (I have tested this with such a hybrid +
hidden CD and have verified that this patch works as claimed).
Signed-off-by: Jeremy White <jwhite@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To improve shmem scalability, we allowed tmpfs instances which don't need
their blocks or inodes limited not to count them, and not to allocate any
sbinfo. Which was okay when the only use for the sbinfo was accounting
blocks and inodes; but since then a couple of unrelated projects extending
tmpfs want to store other data in the sbinfo. Whether either extension
reaches mainline is beside the point: I'm guilty of a bad design decision,
and should restore sbinfo to make any such future extensions easier.
So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
inodes. Brent Casavant verified (many months ago) that this does not
perceptibly impact the scalability (since the unlimited sbinfo cacheline is
repeatedly accessed but only once dirtied).
And merge shmem_set_size into its sole caller shmem_remount_fs.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The driver model has a "detach_state" mechanism that:
- Has never been used by any in-kernel drive;
- Is superfluous, since driver remove() methods can do the same thing;
- Became buggy when the suspend() parameter changed semantics and type;
- Could self-deadlock when called from certain suspend contexts;
- Is effectively wasted documentation, object code, and headspace.
This removes that "detach_state" mechanism; net code shrink, as well
as a per-device saving in the driver model and sysfs.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
enable bit which is set appropriately and a per inode sparse disable
bit which is preset on some system file inodes as appropriate.
- Enforce that sparse support is disabled on NTFS volumes pre 3.0.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
The patch updates the documentation for /proc. super-nr and super-max have
been dropped from the kernel since 2.4.9 due to minor numbering issues.
This change was not documented in the documentation.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!