2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

850 Commits

Author SHA1 Message Date
Jan Kara
dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Darrick J. Wong
feb0ab32a5 ext4: make block group checksums use metadata_csum algorithm
metadata_csum supersedes uninit_bg.  Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set.  Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:45:10 -04:00
Darrick J. Wong
a9c4731780 ext4: calculate and verify superblock checksum
Calculate and verify the superblock checksum.  Since the UUID and
block group number are embedded in each copy of the superblock, we
need only checksum the entire block.  Refactor some of the code to
eliminate open-coding of the checksum update call.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:29:10 -04:00
Darrick J. Wong
0441984a33 ext4: load the crc32c driver if necessary
Obtain a reference to the cryptoapi and crc32c if we mount a
filesystem with metadata checksumming enabled.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:27:10 -04:00
Darrick J. Wong
d25425f8e0 ext4: record the checksum algorithm in use in the superblock
Record the type of checksum algorithm we're using for metadata in the
superblock, in case we ever want/need to change the algorithm.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:25:10 -04:00
Eldad Zack
db7e5c668e super.c: unused variable warning without CONFIG_QUOTA
sb info is only checked with quota support.

fs/ext4/super.c: In function ‘parse_options’:
fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable]

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2012-04-23 21:44:41 -04:00
Theodore Ts'o
57f73c2c89 ext4: fix handling of journalled quota options
Commit 26092bf5 broke handling of journalled quota mount options by
trying to parse argument of every mount option as a number.  Fix this
by dealing with the quota options before we call match_int().

Thanks to Jan Kara for discovering this regression.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-04-16 18:55:26 -04:00
Theodore Ts'o
9cd70b347e ext4: address scalability issue by removing extent cache statistics
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-04-16 12:16:20 -04:00
Linus Torvalds
69e1aaddd6 Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes
The changes to export dirty_writeback_interval are from Artem's s_dirt
 cleanup patch series.  The same is true of the change to remove the
 s_dirt helper functions which never got used by anyone in-tree.  I've
 run these changes by Al Viro, and am carrying them so that Artem can
 more easily fix up the rest of the file systems during the next merge
 window.  (Originally we had hopped to remove the use of s_dirt from
 ext4 during this merge window, but his patches had some bugs, so I
 ultimately ended dropping them from the ext4 tree.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPb39rAAoJENNvdpvBGATwVz8P/3V1NqSsk20VJOLbmEE45GxL
 GDzQJ6OsFG0UiQk6ISSrSdwxfav/KTCGySsU9UtAoOdPcBwnnsf8S7wc6OggwwuC
 hBFGwwFzk6YSQaZ58sUxWRGeOJuP/FPem6Id6buC4DQ1KIcznP/hEEgEnh/ir4Ec
 vrsfexY93TR8BE2Mi23v2epDVLU0B6bY/w9nDqbTXif3xN/gh/ypoHHouuM6Bs2n
 TyWHOwD15NwfnvRHd8PfDDqQM/D29x3QI0FMrWj9McpwIz4d4cBfhN4LQ/G+yLDY
 izv5DM10GbinwHPrsOTGVAW3KIdSS9rP3jCJGVuOrJZ9ufGXosvHuIYVhI7J3SBK
 JhBu6QEsN1IsvlVYpz9q8mqVKaDXQLsz2eaTw+i4yfmyOk1kOX7nIEOxYFF78G+V
 Of/W1SpIpJQaXvLHRcDj9fDj0fZTciUZA8v7/HOFS+co2dzIl0iZbcfBFp0/56RY
 sWdQoeRlx1ciVDPR+w2TQO5w3VWQw1gT5aqux0NiPj0XFoiUHScxgNGAYbqENMQw
 v9chvyDMlorqj0rF/Vey5SssgEDi7MTdYuYTi4YyMqr7pcvOJaO85pf+wH9g2eKW
 XhW33PhPGuwCJDP5Pg8Y0Z2Hp/Q3DCqhLqhGfTyAs/NG9+hR4wgp3VWb8CUqhA1t
 C/yzNeOYqScAefCzQx2V
 =+9zk
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates for 3.4 from Ted Ts'o:
 "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

  The changes to export dirty_writeback_interval are from Artem's s_dirt
  cleanup patch series.  The same is true of the change to remove the
  s_dirt helper functions which never got used by anyone in-tree.  I've
  run these changes by Al Viro, and am carrying them so that Artem can
  more easily fix up the rest of the file systems during the next merge
  window.  (Originally we had hopped to remove the use of s_dirt from
  ext4 during this merge window, but his patches had some bugs, so I
  ultimately ended dropping them from the ext4 tree.)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
  vfs: remove unused superblock helpers
  mm: export dirty_writeback_interval
  ext4: remove useless s_dirt assignment
  ext4: write superblock only once on unmount
  ext4: do not mark superblock as dirty unnecessarily
  ext4: correct ext4_punch_hole return codes
  ext4: remove restrictive checks for EOFBLOCKS_FL
  ext4: always set then trimmed blocks count into len
  ext4: fix trimmed block count accunting
  ext4: fix start and len arguments handling in ext4_trim_fs()
  ext4: update s_free_{inodes,blocks}_count during online resize
  ext4: change some printk() calls to use ext4_msg() instead
  ext4: avoid output message interleaving in ext4_error_<foo>()
  ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
  ext4: add no_printk argument validation, fix fallout
  ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
  ext4: give more helpful error message in ext4_ext_rm_leaf()
  ext4: remove unused code from ext4_ext_map_blocks()
  ext4: rewrite punch hole to use ext4_ext_remove_space()
  jbd2: cleanup journal tail after transaction commit
  ...
2012-03-28 10:02:55 -07:00
Artem Bityutskiy
182f514f88 ext4: remove useless s_dirt assignment
Clean-up ext4 a tiny bit by removing useless s_dirt assignment in
'ext4_fill_super()' because a bit later we anyway call
'ext4_setup_super()' which writes the superblock to the media
unconditionally.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:30:06 -04:00
Artem Bityutskiy
a8e25a8324 ext4: write superblock only once on unmount
In some rather rare cases it is possible that ext4 may the superblock
to the media twice. This patch makes sure this does not happen. This
should speed up unmounting in those cases.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:29:15 -04:00
Al Viro
07c0c5d8b8 ext4: initialization of ext4_li_mtx needs to be done earlier
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 22:05:02 -04:00
Al Viro
48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Theodore Ts'o
92b9781658 ext4: change some printk() calls to use ext4_msg() instead
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:41:49 -04:00
Joe Perches
d9ee81da93 ext4: avoid output message interleaving in ext4_error_<foo>()
Using KERN_CONT means that messages from multiple threads may be
interleaved.  Avoid this by using a single printk call in
ext4_error_inode and ext4_error_file.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:15:43 -04:00
Theodore Ts'o
f70486055e ext4: try to deprecate noacl and noxattr_user mount options
No other file system allows ACL's and extended attributes to be
enabled or disabled via a mount option.  So let's try to deprecate
these options from ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 22:06:20 -05:00
Theodore Ts'o
c7198b9c1e ext4: ignore mount options supported by ext2/3 (but have since been removed)
Users who tried to use the ext4 file system driver is being used for
the ext2 or ext3 file systems (via the CONFIG_EXT4_USE_FOR_EXT23
option) could have failed mounts if their /etc/fstab contains options
recognized by ext2 or ext3 but which have since been removed in ext4.

So teach ext4 to recognize them and give a warning that the mount
option was removed.

Report: https://bbs.archlinux.org/profile.php?id=33804

Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Baechler <thomas@archlinux.org>
Cc: Tobias Powalowski <tobias.powalowski@googlemail.com>
Cc: Dave Reisner <d@falconindy.com>
2012-03-04 22:00:53 -05:00
Theodore Ts'o
66acdcf4ea ext4: add debugging /proc file showing file system options
Now that /proc/mounts is consistently showing only those mount options
which need to be specified in /etc/fstab or on the mount command line,
it is useful to have file which shows exactly which file system
options are enabled.  This can be useful when debugging a user
problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 20:21:38 -05:00
Theodore Ts'o
5a916be1b3 ext4: make ext4_show_options() be table-driven
Consistently show mount options which are the non-default, so that
/proc/mounts accurately shows the mount options that would be
necessary to mount the file system in its current mode of operation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 19:27:31 -05:00
Theodore Ts'o
2adf6da837 ext4: move ext4_show_options() after parse_options()
This commit is strictly a code movement so in preparation of changing
ext4_show_options to be table driven.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:50 -05:00
Theodore Ts'o
26092bf524 ext4: use a table-driven handler for mount options
By using a table-drive approach, we shave about 100 lines of code from
ext4, and make the code a bit more regular and factored out.  This
will also make it possible in a future patch to use this table for
displaying the mount options that were specified in /proc/mounts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:47 -05:00
Theodore Ts'o
72578c33c4 ext4: unify handling of mount options which have been removed
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 18:04:40 -05:00
Theodore Ts'o
39ef17f1b0 ext4: simplify handling of the errors=* mount options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 17:56:23 -05:00
Theodore Ts'o
c64db50e76 ext4: remove the I_VERSION mount flag and use the super_block flag instead
There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:23:11 -05:00
Theodore Ts'o
ee4a3fcd1d ext4: remove Opt_ignore
This is completely unused so let's just get rid of it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:14:24 -05:00
Theodore Ts'o
87f26807e9 ext4: remove deprecation warnings for minix_df and grpid
People complained about removing both of these features, so per
Linus's dictate, we won't be able to remove them.  Sigh...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 00:03:21 -05:00
Eric Sandeen
661aa52057 ext4: remove the resize mount option
The resize mount option seems to be of limited value,
especially in the age of online resize2fs.  Nuke it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Eric Sandeen
43e625d84f ext4: remove the journal=update mount option
The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Bobi Jam
18aadd47f8 ext4: expand commit callback and
The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Theodore Ts'o
ff9cb1c4ee Merge branch 'for_linus' into for_linus_merged
Conflicts:
	fs/ext4/ioctl.c
2012-01-10 11:54:07 -05:00
Xi Wang
d50f2ab6f0 ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01 ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.

	sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
	groups_per_flex = 1 << sbi->s_log_groups_per_flex;

	if (groups_per_flex < 2) { ... }

This patch fixes two potential issues in the previous commit.

1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount.  That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0.  This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.

2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-confirming C compiler could rewrite the check in unexpected
ways.  Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.

	groups_per_flex = 1 << sbi->s_log_groups_per_flex;
	if (groups_per_flex == 0 || groups_per_flex == 1) {

We compile the code snippet using Clang 3.0 and GCC 4.6.  Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original.  GCC keeps the check, but
there is no guarantee that future versions will do the same.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-01-10 11:51:10 -05:00
Al Viro
94bf608a18 ext4: fix failure exits
a) leaking root dentry is bad
b) in case of failed ext4_mb_init() we don't want to do ext4_mb_release()
c) OTOH, in the same case we *do* want ext4_ext_release()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09 15:57:20 -05:00
Linus Torvalds
eb59c505f8 Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
  PM / Hibernate: Implement compat_ioctl for /dev/snapshot
  PM / Freezer: fix return value of freezable_schedule_timeout_killable()
  PM / shmobile: Allow the A4R domain to be turned off at run time
  PM / input / touchscreen: Make st1232 use device PM QoS constraints
  PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
  PM / shmobile: Remove the stay_on flag from SH7372's PM domains
  PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
  PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  ARM: S3C64XX: Implement basic power domain support
  PM / shmobile: Use common always on power domain governor
  ...

Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
2012-01-08 13:10:57 -08:00
Al Viro
34c80b1d93 vfs: switch ->show_options() to struct dentry *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:19:54 -05:00
Al Viro
d8c9584ea2 vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:16:53 -05:00
Ben Hutchings
1d526fc91b ext4: Report max_batch_time option correctly
Currently the value reported for max_batch_time is really the
value of min_batch_time.

Reported-by: Russell Coker <russell@coker.com.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-04 21:22:51 -05:00
Al Viro
6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Rafael J. Wysocki
b00f4dc5ff Merge branch 'master' into pm-sleep
* master: (848 commits)
  SELinux: Fix RCU deref check warning in sel_netport_insert()
  binary_sysctl(): fix memory leak
  mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
  ipmi_watchdog: restore settings when BMC reset
  oom: fix integer overflow of points in oom_badness
  memcg: keep root group unchanged if creation fails
  nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
  nilfs2: unbreak compat ioctl
  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  evm: prevent racing during tfm allocation
  evm: key must be set once during initialization
  mmc: vub300: fix type of firmware_rom_wait_states module parameter
  Revert "mmc: enable runtime PM by default"
  mmc: sdhci: remove "state" argument from sdhci_suspend_host
  x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
  IB/qib: Correct sense on freectxts increment and decrement
  RDMA/cma: Verify private data length
  cgroups: fix a css_set not found bug in cgroup_attach_proc
  oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
  Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
  ...

Conflicts:
	kernel/cgroup_freezer.c
2011-12-21 21:59:45 +01:00
Zheng Liu
5635a62b83 ext4: add missing space to ext4_msg output in ext4_fill_super()
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-18 16:13:58 -05:00
Theodore Ts'o
fc6cb1cda5 ext4: display the correct mount option in /proc/mounts for [no]init_itable
/proc/mounts was showing the mount option [no]init_inode_table when
the correct mount option that will be accepted by parse_options() is
[no]init_itable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-12-12 22:06:18 -05:00
Rafael J. Wysocki
986b11c3ee Merge branch 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into pm-freezer
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
  freezer: fix wait_event_freezable/__thaw_task races
  freezer: kill unused set_freezable_with_signal()
  dmatest: don't use set_freezable_with_signal()
  usb_storage: don't use set_freezable_with_signal()
  freezer: remove unused @sig_only from freeze_task()
  freezer: use lock_task_sighand() in fake_signal_wake_up()
  freezer: restructure __refrigerator()
  freezer: fix set_freezable[_with_signal]() race
  freezer: remove should_send_signal() and update frozen()
  freezer: remove now unused TIF_FREEZE
  freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
  cgroup_freezer: prepare for removal of TIF_FREEZE
  freezer: clean up freeze_processes() failure path
  freezer: kill PF_FREEZING
  freezer: test freezable conditions while holding freezer_lock
  freezer: make freezing indicate freeze condition in effect
  freezer: use dedicated lock instead of task_lock() + memory barrier
  freezer: don't distinguish nosig tasks on thaw
  freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
  freezer: rename thaw_process() to __thaw_task() and simplify the implementation
  ...
2011-11-23 21:09:02 +01:00
Tejun Heo
a0acae0e88 freezer: unexport refrigerator() and update try_to_freeze() slightly
There is no reason to export two functions for entering the
refrigerator.  Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.

* Rename refrigerator() to __refrigerator() and make it return bool
  indicating whether it scheduled out for freezing.

* Update try_to_freeze() to return bool and relay the return value of
  __refrigerator() if freezing().

* Convert all refrigerator() users to try_to_freeze().

* Update documentation accordingly.

* While at it, add might_sleep() to try_to_freeze().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
2011-11-21 12:32:22 -08:00
Richard Weinberger
2397256d62 ext4: Remove kernel_lock annotations
The BKL is gone, these annotations are useless.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:50:09 -05:00
Theodore Ts'o
eb513689c9 ext4: ignore journalled data options on remount if fs has no journal
This avoids a confusing failure in the init scripts when the
/etc/fstab has data=writeback or data=journal but the file system does
not have a journal.  So check for this case explicitly, and warn the
user that we are ignoring the (pointless, since they have no journal)
data=* mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:47:42 -05:00
Eric Gouriou
6f91bc5fda ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.

In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.

Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:43:23 -04:00
Fabrice Jouhaud
d44651d0f9 ext4: fix ext4 so it works without CONFIG_PROC_FS
This fixes a bug which was introduced in dd68314ccf.  The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.

Signed-off-by: Fabrice Jouhaud <yargil@free.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 16:26:03 -04:00
Lukas Czerner
4113c4caa4 ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.

This is a part of the effort to reduce number of ext4 options hence the
test matrix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:34:47 -04:00
Tao Ma
dcf2d804ed ext4: Free resources in some error path in ext4_fill_super
Some of the error path in ext4_fill_super don't release the
resouces properly. So this patch just try to release them
in the right way.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 12:10:11 -04:00
Theodore Ts'o
5dee54372c ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:10:51 -04:00
Theodore Ts'o
021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Aditya Kali
7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o
f975d6bcc7 ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:00:51 -04:00
Theodore Ts'o
24aaa8ef4e ext4: convert the free_blocks field in s_flex_groups to be free_clusters
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:58:51 -04:00
Theodore Ts'o
5704265188 ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:56:51 -04:00
Theodore Ts'o
bab08ab964 ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl.  This simplifies the initial implementation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:36:51 -04:00
Theodore Ts'o
281b599597 ext4: read-only support for bigalloc file systems
This adds supports for bigalloc file systems.  It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:34:51 -04:00
Theodore Ts'o
7c2e70879f ext4: add ext4-specific kludge to avoid an oops after the disk disappears
The del_gendisk() function uninitializes the disk-specific data
structures, including the bdi structure, without telling anyone
else.  Once this happens, any attempt to call mark_buffer_dirty()
(for example, by ext4_commit_super), will cause a kernel OOPS.

Fix this for now until we can fix things in an architecturally correct
way.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:28:51 -04:00
Theodore Ts'o
56889787cf ext4: improve handling of conflicting mount options
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).

Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported.  Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.

Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03 18:22:38 -04:00
Jiaying Zhang
2581fdc810 ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-08-13 12:17:13 -04:00
Mathias Krause
db9481c047 ext4: use kzalloc in ext4_kzalloc()
Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced
a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-03 14:57:11 -04:00
Theodore Ts'o
f18a5f21c2 ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:38 -04:00
Theodore Ts'o
9933fc0ac1 ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:02 -04:00
Yongqiang Yang
8f82f840ec ext4: prevent parallel resizers by atomic bit ops
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.

To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:35:44 -04:00
Dan Ehrenberg
3eb0865843 ext4: ignore a stripe width of 1
If the stripe width was set to 1, then this patch will ignore
that stripe width and ext4 will act as if the stripe width
were 0 with respect to optimizing allocations.

Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-17 21:18:51 -04:00
Theodore Ts'o
12706394bc ext4: add tracepoint for ext4_journal_start
This will help debug who is responsible for starting a jbd2 transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-10 22:37:50 -04:00
Lukas Czerner
f17722f917 ext4: Fix max file size and logical block counting of extent format file
Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
format and fill the tail of file up to its end. We will hit the BUG_ON
when we write the last block (2^32-1) into the sparse file.

The root cause of the problem lies in the fact that we specifically set
s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
which is 32 bit long. However, we are not storing start and end block
number, but rather start block number and length in blocks. It means
that in order to cover extent from 0 to EXT_MAX_BLOCK we need
EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
and it does not.

The only way to fix it without changing the meaning of the struct
ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
by one fs block so we can cover the whole extent we can get by the
on-disk extent format.

Also in many places EXT_MAX_BLOCK is used as length instead of maximum
logical block number as the name suggests, it is all a bit messy. So
this commit renames it to EXT_MAX_BLOCKS and change its usage in some
places to actually be maximum number of blocks in the extent.

The bug which this commit fixes can be reproduced as follows:

 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
 sync
 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-06 00:05:17 -04:00
Linus Torvalds
f8d613e2a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
  xen: cleancache shim to Xen Transcendent Memory
  ocfs2: add cleancache support
  ext4: add cleancache support
  btrfs: add cleancache support
  ext3: add cleancache support
  mm/fs: add hooks to support cleancache
  mm: cleancache core ops functions and config
  fs: add field to superblock to support cleancache
  mm/fs: cleancache documentation

Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26 10:50:56 -07:00
Dan Magenheimer
7abc52c2ed ext4: add cleancache support
This seventh patch of eight in this cleancache series "opts-in"
cleancache for ext4.  Filesystems must explicitly enable cleancache
by calling cleancache_init_fs anytime an instance of the filesystem
is mounted. For ext4, all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:02:03 -06:00
Johann Lombardi
c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Kazuya Mio
d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Vivek Haldar
77f4135f2a ext4: count hits/misses of extent cache and expose in sysfs
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:24:16 -04:00
Theodore Ts'o
373cd5c53d ext4: don't show mount options in /proc/mounts if there is no journal
After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 16:12:35 -04:00
Lukas Czerner
1bb933fb1f ext4: fix possible use-after-free in ext4_remove_li_request()
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:29 -04:00
Lukas Czerner
51ce651156 ext4: fix the mount option "init_itable=n" to work as expected for n=0
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition.  The 'n'
should be also properly used on remount.

When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:16 -04:00
Lukas Czerner
e1290b3e62 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:51 -04:00
Lukas Czerner
4ed5c033c1 ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.

This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:04 -04:00
Tao Ma
ed3ce80a52 ext4: don't warn about mnt_count if it has been disabled
Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to
zero, and mount it for the first time, we will get the warning:

	maximal mount count reached, running e2fsck is recommended

It is really misleading. So change the check so that it won't warn in
that case.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:29:57 -04:00
Amir Goldstein
0b26859027 ext4: fix oops in ext4_quota_off()
If quota is not enabled when ext4_quota_off() is called, we must not
dereference quota file inode since it is NULL.  Check properly for
this.

This fixes a bug in commit 21f976975c (ext4: remove unnecessary
[cm]time update of quota file), which was merged for 2.6.39-rc3.

Reported-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-16 09:59:13 -04:00
Amerigo Wang
66bb82798d ext4: remove redundant #ifdef in super.c
There is already an #ifdef CONFIG_QUOTA some lines above,
so this one is totally useless.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:30:41 -04:00
Tao Ma
55ff3840a2 ext4: remove redundant check for first_not_zeroed in ext4_register_li_request
We have checked first_not_zeroed == ngroups already above, so remove
this redundant check.

sbi->s_li_request = NULL above is also removed since it is NULL
already.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:28:41 -04:00
Theodore Ts'o
2035e77605 ext4: check for ext[23] file system features when mounting as ext[23]
Provide better emulation for ext[23] mode by enforcing that the file
system does not have any unsupported file system features as defined
by ext[23] when emulating the ext[23] file system driver when
CONFIG_EXT4_USE_FOR_EXT23 is defined.

This causes the file system type information in /proc/mounts to be
correct for the automatically mounted root file system.  This also
means that "mount -t ext2 /dev/sda /mnt" will fail if /dev/sda
contains an ext3 or ext4 file system, just as one would expect if the
original ext2 file system driver were in use.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-18 17:29:14 -04:00
Linus Torvalds
a97b52022a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix data corruption regression by reverting commit 6de9843dab
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section
2011-04-11 15:45:47 -07:00
Yongqiang Yang
be4f27d324 ext4: allow an active handle to be started when freezing
ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen.  Otherwise, deadlock is easy to happen, below
is a situation.

================================================
     freeze         |       truncate
================================================
                    |  ext4_ext_truncate()
    freeze_super()  |   starts a handle
    sets s_frozen   |
                    |  ext4_ext_truncate()
                    |  holds i_data_sem
  ext4_freeze()     |
  waits for updates |
                    |  ext4_free_blocks()
                    |  calls dquot_free_block()
                    |
                    |  dquot_free_blocks()
                    |  calls ext4_dirty_inode()
                    |
                    |  ext4_dirty_inode()
                    |  trys to start an active
                    |  handle
                    |
                    |  block due to s_frozen
================================================

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@users.sf.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2011-04-10 22:06:07 -04:00
Tao Ma
0449641130 ext4: init timer earlier to avoid a kernel panic in __save_error_info
During mount, when we fail to open journal inode or root inode, the
__save_error_info will mod_timer. But actually s_err_report isn't
initialized yet and the kernel oops. The detailed information can
be found https://bugzilla.kernel.org/show_bug.cgi?id=32082.

The best way is to check whether the timer s_err_report is initialized
or not. But it seems that in include/linux/timer.h, we can't find a
good function to check the status of this timer, so this patch just
move the initializtion of s_err_report earlier so that we can avoid
the kernel panic. The corresponding del_timer is also added in the
error path.

Reported-by: Sami Liedes <sliedes@cc.hut.fi>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-05 19:55:28 -04:00
Tao Ma
46e4690bbd ext4: fix a double free in ext4_register_li_request
In ext4_register_li_request, we malloc a ext4_li_request and
inserts it into ext4_li_info->li_request_list. In case of any
error later, we free it in the end.  But if we have some error
in ext4_run_lazyinit_thread, the whole li_request_list will be
dropped and freed in it. So we will double free this ext4_li_request.

This patch just sets elr to NULL after it is inserted to the list
so that the latter kfree won't double free it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-04-04 16:00:49 -04:00
Jan Kara
21f976975c ext4: remove unnecessary [cm]time update of quota file
It is not necessary to update [cm]time of quota file on each quota
file write and it wastes journal space and IO throughput with inode
writes. So just remove the updating from ext4_quota_write() and only
update times when quotas are being turned off. Userspace cannot get
anything reliable from quota files while they are used by the kernel
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-04 15:33:39 -04:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds
ae005cbed1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
  ext4: fix a BUG in mb_mark_used during trim.
  ext4: unused variables cleanup in fs/ext4/extents.c
  ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
  ext4: add more tracepoints and use dev_t in the trace buffer
  ext4: don't kfree uninitialized s_group_info members
  ext4: add missing space in printk's in __ext4_grp_locked_error()
  ext4: add FITRIM to compat_ioctl.
  ext4: handle errors in ext4_clear_blocks()
  ext4: unify the ext4_handle_release_buffer() api
  ext4: handle errors in ext4_rename
  jbd2: add COW fields to struct jbd2_journal_handle
  jbd2: add the b_cow_tid field to journal_head struct
  ext4: Initialize fsync transaction ids in ext4_new_inode()
  ext4: Use single thread to perform DIO unwritten convertion
  ext4: optimize ext4_bio_write_page() when no extent conversion is needed
  ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
  ext4: use the nblocks arg to ext4_truncate_restart_trans()
  ext4: fix missing iput of root inode for some mount error paths
  ext4: make FIEMAP and delayed allocation play well together
  ext4: suppress verbose debugging information if malloc-debug is off
  ...

Fi up conflicts in fs/ext4/super.c due to workqueue changes
2011-03-25 09:57:41 -07:00
Robin Dong
21149d611e ext4: add missing space in printk's in __ext4_grp_locked_error()
When we do performence-testing on ext4 filesystem, we observed a
warning like this:

EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd

instead, it should be

"group 2598, 25901 blocks in bitmap, 26057 in gd"

Reviewed-by: Coly Li <bosong.ly@taobao.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 20:39:22 -04:00
Linus Torvalds
bd2895eead Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix build failure introduced by s/freezeable/freezable/
  workqueue: add system_freezeable_wq
  rds/ib: use system_wq instead of rds_ib_fmr_wq
  net/9p: replace p9_poll_task with a work
  net/9p: use system_wq instead of p9_mux_wq
  xfs: convert to alloc_workqueue()
  reiserfs: make commit_wq use the default concurrency level
  ocfs2: use system_wq instead of ocfs2_quota_wq
  ext4: convert to alloc_workqueue()
  scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
  scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
  misc/iwmc3200top: use system_wq instead of dedicated workqueues
  i2o: use alloc_workqueue() instead of create_workqueue()
  acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
  fs/aio: aio_wq isn't used in memory reclaim path
  input/tps6507x-ts: use system_wq instead of dedicated workqueue
  cpufreq: use system_wq instead of dedicated workqueues
  wireless/ipw2x00: use system_wq instead of dedicated workqueues
  arm/omap: use system_wq in mailbox
  workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
2011-03-16 08:20:19 -07:00
Aneesh Kumar K.V
f2fa2ffc20 ext4: Copy fs UUID to superblock
File system UUID is made available to application
via  /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Mingming Cao
198868f35d ext4: Use single thread to perform DIO unwritten convertion
While running ext4 testing on multiple core, we found there are per
cpu ext4-dio-unwritten threads processing conversion from unwritten
extents to written for IOs completed from async direct IO patch.  Per
filesystem is enough, we don't need per cpu threads to work on
conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2011-03-05 11:52:45 -05:00
Amir Goldstein
d39195c33b ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.

This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-28 00:53:45 -05:00
Manish Katiyar
32a9bb57d7 ext4: fix missing iput of root inode for some mount error paths
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including call sync_filesystem, which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.

https://bugzilla.kernel.org/show_bug.cgi?id=26752

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 20:42:06 -05:00
Theodore Ts'o
6fd7a46781 ext4: enable mblk_io_submit by default
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 13:53:09 -05:00
Eric Sandeen
ea66333694 ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel.  At some point
we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:51:51 -05:00
Lukas Czerner
0b75a84012 ext4: mark file-local functions and variables as static
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov
5dbd571d87 ext4: allow inode_readahead_blks=0 (linux-2.6.37)
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):

# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks 
bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:33:21 -05:00
Peter Huewe
7dc576158d ext4: Fix sparse warning: Using plain integer as NULL pointer
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:01:42 -05:00
Tejun Heo
43d133c18b Merge branch 'master' into for-2.6.39 2011-02-21 09:43:56 +01:00
Eric Sandeen
e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
Theodore Ts'o
dd68314ccf ext4: fix up ext4 error handling
Make sure we the correct cleanup happens if we die while trying to
load the ext4 file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:49 -05:00
Lukas Czerner
8f021222c1 ext4: unregister features interface on module unload
Ext4 features interface was not properly unregistered which led to
problems while unloading/reloading ext4 module. This commit fixes that by
adding proper kobject unregistration code into ext4_exit_fs() as well as
fail-path of ext4_init_fs()

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-02-03 14:33:33 -05:00
Eric Sandeen
8f1f745331 ext4: fix panic on module unload when stopping lazyinit thread
https://bugzilla.kernel.org/show_bug.cgi?id=27652

If the lazyinit thread is running, the teardown function
ext4_destroy_lazyinit_thread() has problems:

        ext4_clear_request_list();
        while (ext4_li_info->li_task) {
                wake_up(&ext4_li_info->li_wait_daemon);
                wait_event(ext4_li_info->li_wait_task,
                           ext4_li_info->li_task == NULL);
        }

Clearing the request list will cause the thread to exit and free
ext4_li_info, so then we're waiting on something which is getting
freed.

Fix this up by making the thread respond to kthread_stop, and exit,
without the need to wait for that exit in some other homegrown way.

Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:15 -05:00
Tejun Heo
fd89d5f203 ext4: convert to alloc_workqueue()
Convert create_workqueue() to alloc_workqueue().  This is an identity
conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
2011-02-01 11:42:42 +01:00
Linus Torvalds
4843456c5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix deadlock during path resolution
2011-01-21 07:33:37 -08:00
Linus Torvalds
275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Jan Kara
f00c9e44ad quota: Fix deadlock during path resolution
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-12 19:14:55 +01:00
Linus Torvalds
e9688f6aca Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: fix trimming starting with block 0 with small blocksize
  ext4: revert buggy trim overflow patch
  ext4: don't pass entire map to check_eofblocks_fl
  ext4: fix memory leak in ext4_free_branches
  ext4: remove ext4_mb_return_to_preallocation()
  ext4: flush the i_completed_io_list during ext4_truncate
  ext4: add error checking to calls to ext4_handle_dirty_metadata()
  ext4: fix trimming of a single group
  ext4: fix uninitialized variable in ext4_register_li_request
  ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
  ext4: drop i_state_flags on architectures with 64-bit longs
  ext4: reorder ext4_inode_info structure elements to remove unneeded padding
  ext4: drop ec_type from the ext4_ext_cache structure
  ext4: use ext4_lblk_t instead of sector_t for logical blocks
  ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
  ext4: fix 32bit overflow in ext4_ext_find_goal()
  ext4: add more error checks to ext4_mkdir()
  ext4: ext4_ext_migrate should use NULL not 0
  ext4: Use ext4_error_file() to print the pathname to the corrupted inode
  ext4: use IS_ERR() to check for errors in ext4_error_file
  ...
2011-01-11 14:37:31 -08:00
Andrew Morton
6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o
8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o
f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Theodore Ts'o
f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter
f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Joe Perches
0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches
94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Theodore Ts'o
a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o
673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o
fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Theodore Ts'o
1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Lukas Czerner
93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Darrick J. Wong
5a9ae68a34 ext4: ext4_fill_super shouldn't return 0 on corruption
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret.  However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail.  This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.

A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 09:56:44 -05:00
Dan Carpenter
f4c8cc652d ext4: missing unlock in ext4_clear_request_list()
If the the li_request_list was empty then it returned with the lock
held.  Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:25 -05:00
Tejun Heo
d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo
e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Theodore Ts'o
7ff9c073dd ext4: Add new ext4 inode tracepoints
Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
ext4_begin_ordered_truncate()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:51:33 -05:00
Dmitry Monakhov
87009d86dc ext4: do not try to grab the s_umount semaphore in ext4_quota_off
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2010-11-08 13:47:33 -05:00
Theodore Ts'o
f7ad6d2e92 ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename).  In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed.  If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.

To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed.  That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).

Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.

  kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
  Call Trace:
   [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
   [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
   [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
   [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
   [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
   [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
   [<ffffffff81087910>] do_writepages+0x1c/0x25
   [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
   [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
   [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
   [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
   [<ffffffff810c14a3>] evict+0x22/0x92
   [<ffffffff810c1a3d>] iput+0x212/0x249
   [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
   [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
   [<ffffffff810be613>] dput+0x13a/0x147
   [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
   [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
   [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
   [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
   [<ffffffff810b99c6>] sys_rename+0x16/0x18
   [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
2010-11-08 13:43:33 -05:00
Theodore Ts'o
ce7e010aef ext4: initialize the percpu counters before replaying the journal
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-03 12:03:21 -04:00
Theodore Ts'o
b2c78cd09b ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
Newer GCC's reported the following build warning:

   fs/ext4/super.c: In function 'ext4_lazyinit_thread':
   fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function

Fix it by removing the need for the ret variable in the first place.

Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:19:30 -04:00
Lukas Czerner
f4245bd4eb ext4: fix lazyinit hang after removing request
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:07:17 -04:00
Al Viro
152a083666 new helper: mount_bdev()
... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:13 -04:00
Theodore Ts'o
a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Nicolas Kaiser
beed5ecbaa ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:42 -04:00
Theodore Ts'o
1f109d5a17 ext4: make various ext4 functions be static
These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
5dabfc78dc ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
7f93cff90f ext4: fix kernel oops if the journal superblock has a non-zero j_errno
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Lukas Czerner
27ee40df2b ext4: add batched_discard into ext4 feature list
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner
7360d1731e ext4: Add batched discard support for ext4
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Theodore Ts'o
bd2d0210cf ext4: use bio layer instead of buffer layer in mpage_da_submit_io
Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Maciej Żenczykowski
c41303ced6 ext4: don't update sb journal_devnum when RO dev
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner
857ac889cc ext4: add interface to advertise ext4 features in sysfs
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner
bfff68738f ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Sergey Senozhatsky
a1c6c5698d ext4: fix NULL pointer dereference in print_daily_error_info()
Fix NULL pointer dereference in print_daily_error_info, when   
called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error 
reporting timer in ext4_put_super.

Google-Bug-Id: 3017663

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Linus Torvalds
79f14b7c56 Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
  BKL: remove BKL from freevxfs
  BKL: remove BKL from qnx4
  autofs4: Only declare function when CONFIG_COMPAT is defined
  autofs: Only declare function when CONFIG_COMPAT is defined
  ncpfs: Lock socket in ncpfs while setting its callbacks
  fs/locks.c: prepare for BKL removal
  BKL: Remove BKL from ncpfs
  BKL: Remove BKL from OCFS2
  BKL: Remove BKL from squashfs
  BKL: Remove BKL from jffs2
  BKL: Remove BKL from ecryptfs
  BKL: Remove BKL from afs
  BKL: Remove BKL from USB gadgetfs
  BKL: Remove BKL from autofs4
  BKL: Remove BKL from isofs
  BKL: Remove BKL from fat
  BKL: Remove BKL from ext2 filesystem
  BKL: Remove BKL from do_new_mount()
  BKL: Remove BKL from cgroup
  BKL: Remove BKL from NTFS
  ...
2010-10-22 10:52:01 -07:00
Jan Blunck
f2143c4e2e BKL: Remove BKL from ext4 filesystem
The BKL is still used in ext4_put_super(), ext4_fill_super() and
ext4_remount(). All three calles are protected against concurrent calls by
the s_umount rw semaphore of struct super_block.

Therefore the BKL is protecting nothing in this case.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:38 +02:00
Jan Blunck
db71922217 BKL: Explicitly add BKL around get_sb/fill_super
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
       don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-04 21:10:10 +02:00
Patrick J. LoPresti
30ca22c70e ext3/ext4: Factor out disk addressability check
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.

An identical check already appears in ext3 and ext4.  This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.

[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 08:41:42 -07:00
Linus Torvalds
5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Al Viro
0930fcc1ee convert ext4 to ->evict_inode()
pretty much brute-force...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:30 -04:00
Linus Torvalds
09dc942c2a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Adding error check after calling ext4_mb_regular_allocator()
  ext4: Fix dirtying of journalled buffers in data=journal mode
  ext4: re-inline ext4_rec_len_(to|from)_disk functions
  jbd2: Remove t_handle_lock from start_this_handle()
  jbd2: Change j_state_lock to be a rwlock_t
  jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
  ext4: Add mount options in superblock
  ext4: force block allocation on quota_off
  ext4: fix freeze deadlock under IO
  ext4: drop inode from orphan list if ext4_delete_inode() fails
  ext4: check to make make sure bd_dev is set before dereferencing it
  jbd2: Make barrier messages less scary
  ext4: don't print scary messages for allocation failures post-abort
  ext4: fix EFBIG edge case when writing to large non-extent file
  ext4: fix ext4_get_blocks references
  ext4: Always journal quota file modifications
  ext4: Fix potential memory leak in ext4_fill_super
  ext4: Don't error out the fs if the user tries to make a file too big
  ext4: allocate stripe-multiple IOs on stripe boundaries
  ext4: move aio completion after unwritten extent conversion
  ...

Fix up conflicts in fs/ext4/inode.c as per Ted.

Fix up xfs conflicts as per earlier xfs merge.
2010-08-07 13:03:53 -07:00
Theodore Ts'o
a931da6ac9 jbd2: Change j_state_lock to be a rwlock_t
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores.  So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:35:12 -04:00
Theodore Ts'o
8b67f04ab9 ext4: Add mount options in superblock
Allow mount options to be stored in the superblock.  Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 23:14:20 -04:00
Dmitry Monakhov
ca0e05e4b1 ext4: force block allocation on quota_off
Perform full sync procedure so that any delayed allocation blocks are
allocated so quota will be consistent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:48:36 -04:00
Eric Sandeen
437f88cc03 ext4: fix freeze deadlock under IO
Commit 6b0310fbf0 caused a regression resulting in deadlocks
when freezing a filesystem which had active IO; the vfs_check_frozen
level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
through.  Duh.

Changing the test to FREEZE_TRANS should let the normal freeze
syncing get through the fs, but still block any transactions from
starting once the fs is completely frozen.

I tested this by running fsstress in the background while periodically
snapshotting the fs and running fsck on the result.  I ran into
occasional deadlocks, but different ones.  I think this is a
fine fix for the problem at hand, and the other deadlocky things
will need more investigation.

Reported-by: Phillip Susi <psusi@cfl.rr.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:33:29 -04:00
Theodore Ts'o
f613dfcb33 ext4: check to make make sure bd_dev is set before dereferencing it
There are some drivers which may not set bdev->bd_dev.  So make sure
it is non-NULL before dereferencing it.

Google-Bug-Id: 1773557

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Jan Kara
62d2b5f2dc ext4: Always journal quota file modifications
When journaled quota options are not specified, we do writes
to quota files just in data=ordered mode. This actually causes
warnings from JBD2 about dirty journaled buffer because ext4_getblk
unconditionally treats a block allocated by it as metadata. Since
quota actually is filesystem metadata, the easiest way to get rid
of the warning is to always treat quota writes as metadata...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Cyrill Gorcunov
dcc7dae3cb ext4: Fix potential memory leak in ext4_fill_super
Under heavy memory pressure we may hit out of memory
situation and as result kstrdup'ed options will not be
freed. Fix it.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Theodore Ts'o
66e61a9e95 ext4: Once a day, printk file system error information to dmesg
This allows us to grab any file system error messages by scraping
/var/log/messages.  This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
1c13d5c087 ext4: Save error information to the superblock for analysis
Save number of file system errors, and the time function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:03 -04:00
Theodore Ts'o
c398eda0e4 ext4: Pass line numbers to ext4_error() and friends
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:40 -04:00
Theodore Ts'o
90c7201b97 ext4: Pass line number to ext4_journal_abort_handle()
This allows the error messages to include the line number

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 14:53:24 -04:00
Theodore Ts'o
e29136f80e ext4: Enhance ext4_grp_locked_error() to take block and function numbers
Also use a macro definition so that __func__ and __LINE__ is implicit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 12:54:28 -04:00
Theodore Ts'o
c67d859e39 ext4: clean up ext4_abort() so __func__ is now implicit
Use a macro definition for ext4_abort() to clean up the .c files a wee
bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:07:07 -04:00
Jiri Kosina
f1bbbb6912 Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
Uwe Kleine-König
421f91d21a fix typos concerning "initiali[zs]e"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-06-16 18:05:05 +02:00
Christoph Hellwig
206f7ab4f4 ext4: remove vestiges of nobh support
The nobh option was only supported for writeback mode, but given that all
write paths actually create buffer heads it effectively was a no-op already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 14:42:49 -04:00
Linus Torvalds
d28619f156 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Convert quota statistics to generic percpu_counter
  ext3 uses rb_node = NULL; to zero rb_root.
  quota: Fixup dquot_transfer
  reiserfs: Fix resuming of quotas on remount read-write
  pohmelfs: Remove dead quota code
  ufs: Remove dead quota code
  udf: Remove dead quota code
  quota: rename default quotactl methods to dquot_
  quota: explicitly set ->dq_op and ->s_qcop
  quota: drop remount argument to ->quota_on and ->quota_off
  quota: move unmount handling into the filesystem
  quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
  quota: move remount handling into the filesystem
  ocfs2: Fix use after free on remount read-only

Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
2010-05-30 09:11:11 -07:00
Christoph Hellwig
287a80958c quota: rename default quotactl methods to dquot_
Follow the dquot_* style used elsewhere in dquot.c.

[Jan Kara: Fixed up missing conversion of ext2]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:17 +02:00
Christoph Hellwig
307ae18a56 quota: drop remount argument to ->quota_on and ->quota_off
Remount handling has fully moved into the filesystem, so all this is
superflous now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
e0ccfd959c quota: move unmount handling into the filesystem
Currently the VFS calls into the quotactl interface for unmounting
filesystems.  This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unount.  Instead move the responsibily of the unmount handling
into the filesystem to be consistent with all other dquot handling.

Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem.  But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
0f0dd62fdd quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly.  Also rename vfs_quota_disable to
dquot_disable while we're at it.

[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:40 +02:00
Christoph Hellwig
c79d967de3 quota: move remount handling into the filesystem
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw.  Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself.  This gets rid of overloading
the quotactl calls and allows to unify the VFS and XFS codepaths in
that area later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:39 +02:00
Theodore Ts'o
60e6679e28 ext4: Drop whitespace at end of lines
This patch was generated using:

#!/usr/bin/perl -i
while (<>) {
    s/[ 	]+$//;
    print;
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 07:00:00 -04:00
Dmitry Monakhov
12e9b89200 ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 22:00:00 -04:00
Jan Kara
39a4bade8c ext4: Show journal_checksum option
We failed to show journal_checksum option in /proc/mounts. Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 17:00:00 -04:00
Curt Wohlgemuth
fbe845ddf3 ext4: Remove extraneous newlines in ext4_msg() calls
Addresses-Google-Bug: #2562325

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 13:00:00 -04:00
Curt Wohlgemuth
d4c402d9fd ext4: Print mount options in when mounting and add a remount message
This adds a "re-mounted" message to ext4_remount(), and both it and
the mount message in ext4_fill_super() now have the original mount
options data string.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 12:00:00 -04:00
Dmitry Monakhov
84061e07c5 ext4: init statistics after journal recovery
Currently block/inode/dir counters initialized before journal was
recovered. In fact after journal recovery this info will probably
change. And freeblocks it critical for correct delalloc mode
accounting.

https://bugzilla.kernel.org/show_bug.cgi?id=15768

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 08:00:00 -04:00
Eric Sandeen
6b0310fbf0 ext4: don't return to userspace after freezing the fs with a mutex held
ext4_freeze() used jbd2_journal_lock_updates() which takes
the j_barrier mutex, and then returns to userspace.  The
kernel does not like this:

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
 #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0

Use vfs_check_frozen() added to ext4_journal_start_sb() and
ext4_force_commit() instead.

Addresses-Red-Hat-Bugzilla: #568503

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 02:00:00 -04:00
Linus Torvalds
39f1cd635c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
  ext4: Don't use delayed allocation by default when used instead of ext3
  ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
  ext4: Fix estimate of # of blocks needed to write indirect-mapped files
2010-03-25 14:10:53 -07:00
Jan Kara
ba69f9ab7d ext4: Don't use delayed allocation by default when used instead of ext3
When ext4 driver is used to mount a filesystem instead of the ext3 file
system driver (through CONFIG_EXT4_USE_FOR_EXT23), do not enable delayed
allocation by default since some ext3 users and application writers have
developed unfortunate expectations about the safety of writing files on
systems subject to sudden and violent death without using fsync().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:18:37 -04:00
Theodore Ts'o
37f328eb60 ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
Oops.  (Blush.)

Thanks to Sedat Dilek for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:06:41 -04:00
Linus Torvalds
c32da02342 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
  doc: fix typo in comment explaining rb_tree usage
  Remove fs/ntfs/ChangeLog
  doc: fix console doc typo
  doc: cpuset: Update the cpuset flag file
  Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
  Remove drivers/parport/ChangeLog
  Remove drivers/char/ChangeLog
  doc: typo - Table 1-2 should refer to "status", not "statm"
  tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
  No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
  devres/irq: Fix devm_irq_match comment
  Remove reference to kthread_create_on_cpu
  tree-wide: Assorted spelling fixes
  tree-wide: fix 'lenght' typo in comments and code
  drm/kms: fix spelling in error message
  doc: capitalization and other minor fixes in pnp doc
  devres: typo fix s/dev/devm/
  Remove redundant trailing semicolons from macros
  fix typo "definetly" -> "definitely" in comment
  tree-wide: s/widht/width/g typo in comments
  ...

Fix trivial conflict in Documentation/laptops/00-INDEX
2010-03-12 16:04:50 -08:00
Jiri Kosina
318ae2edc3 Merge branch 'for-next' into for-linus
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
2010-03-08 16:55:37 +01:00
Emese Revfy
52cf25d0ab Driver core: Constify struct sysfs_ops in struct kobj_type
Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared
   (referenced) by many other structure instances
   at runtime

 * detects/prevents accidental (but not intentional)
   modification attempts on archs that enforce
   read-only kernel data at runtime

 * potentially better optimized code as the compiler
   can assume that the const data cannot be changed

 * the compiler/linker move const data into .rodata
   and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:49 -08:00
Linus Torvalds
e213e26ab3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
  quota: stop using QUOTA_OK / NO_QUOTA
  dquot: cleanup dquot initialize routine
  dquot: move dquot initialization responsibility into the filesystem
  dquot: cleanup dquot drop routine
  dquot: move dquot drop responsibility into the filesystem
  dquot: cleanup dquot transfer routine
  dquot: move dquot transfer responsibility into the filesystem
  dquot: cleanup inode allocation / freeing routines
  dquot: cleanup space allocation / freeing routines
  ext3: add writepage sanity checks
  ext3: Truncate allocated blocks if direct IO write fails to update i_size
  quota: Properly invalidate caches even for filesystems with blocksize < pagesize
  quota: generalize quota transfer interface
  quota: sb_quota state flags cleanup
  jbd: Delay discarding buffers in journal_unmap_buffer
  ext3: quota_write cross block boundary behaviour
  quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
  quota: split out compat_sys_quotactl support from quota.c
  quota: split out netlink notification support from quota.c
  quota: remove invalid optimization from quota_sync_all
  ...

Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
2010-03-05 13:20:53 -08:00
Christoph Hellwig
871a293155 dquot: cleanup dquot initialize routine
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
9f75475802 dquot: cleanup dquot drop routine
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
257ba15ced dquot: move dquot drop responsibility into the filesystem
Currently clear_inode calls vfs_dq_drop directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the drop inside the ->clear_inode
superblock operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
b43fa8284d dquot: cleanup dquot transfer routine
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
63936ddaa1 dquot: cleanup inode allocation / freeing routines
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.

Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Christoph Hellwig
5dd4056db8 dquot: cleanup space allocation / freeing routines
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.

Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not.  Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Dmitry Monakhov
67eeb5685d ext4: Fix ext4_quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext4_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal
credits reservation, and later a BUG_ON.  As soon this never happen
the boundary cross loop is NOOP.  In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 08:08:51 -05:00
Frank Mayhar
273df556b6 ext4: Convert BUG_ON checks to use ext4_error() instead
Convert a bunch of BUG_ONs to emit a ext4_error() message and return
EIO.  This is a first pass and most notably does _not_ cover
mballoc.c, which is a morass of void functions.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 11:46:09 -05:00
Jiaying Zhang
744692dc05 ext4: use ext4_get_block_write in buffer write
Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 16:14:02 -05:00
Jiaying Zhang
c7064ef13b ext4: mechanical rename some of the direct I/O get_block's identifiers
This commit renames some of the direct I/O's block allocation flags,
variables, and functions introduced in Mingming's "Direct IO for holes
and fallocate" patches so that they can be used by ext4's buffered
write path as well.  Also changed the related function comments
accordingly to cover both direct write and buffered write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:28:44 -05:00
Dmitry Monakhov
437ca0fda3 ext4: deprecate obsoleted mount options
Declare following list of mount options as deprecated:
 - bsddf, miniddf
 - grpid, bsdgroups, nogrpid, sysvgroups

Declare following list of default mount options as deprecated:
 - bsdgroups

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 22:29:21 -05:00
Dmitry Monakhov
56c50f11f4 ext4: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:28:41 -05:00
Dmitry Monakhov
482a74258f ext4: mount flags manipulation cleanup
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 11:35:32 -05:00
Eric Sandeen
12062dddda ext4: move __func__ into a macro for ext4_warning, ext4_error
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 14:19:27 -05:00
Thadeu Lima de Souza Cascardo
d6b198bc8a fix ext3/ext4 comment typo compain -> complain
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:36 +01:00
Theodore Ts'o
9d0be50230 ext4: Calculate metadata requirements more accurately
In the past, ext4_calc_metadata_amount(), and its sub-functions
ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount()
badly over-estimated the number of metadata blocks that might be
required for delayed allocation blocks.  This didn't matter as much
when functions which managed the reserved metadata blocks were more
aggressive about dropping reserved metadata blocks as delayed
allocation blocks were written, but unfortunately they were too
aggressive.  This was fixed in commit 0637c6f, but as a result the
over-estimation by ext4_calc_metadata_amount() would lead to reserving
2-3 times the number of pending delayed allocation blocks as
potentially required metadata blocks.  So if there are 1 megabytes of
blocks which have been not yet been allocation, up to 3 megabytes of
space would get reserved out of the user's quota and from the file
system free space pool until all of the inode's data blocks have been
allocated.

This commit addresses this problem by much more accurately estimating
the number of metadata blocks that will be required.  It will still
somewhat over-estimate the number of blocks needed, since it must make
a worst case estimate not knowing which physical blocks will be
needed, but it is much more accurate than before.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 02:41:30 -05:00
Andrew Morton
a6b43e3825 ext4: fix unsigned long long printk warning in super.c
sparc64 allmodconfig:

fs/ext4/super.c: In function `lifetime_write_kbytes_show':
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:48:08 -05:00
Theodore Ts'o
51b7e3c9fb ext4: add module aliases for ext2 and ext3
Add module aliases for ext2 and ext3 when CONFIG_EXT4_USE_FOR_EXT23 is
set.  This makes the existing user-space stuff like mkinitrd working
as is.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-21 10:56:09 -05:00
Dmitry Monakhov
a9e7f44720 ext4: Convert to generic reserved quota's space management.
This patch also fixes write vs chown race condition.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:33:55 +01:00
André Goddard Rosa
e7d2860b69 tree-wide: convert open calls to remove spaces to skip_spaces() lib function
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
   text    data     bss     dec     hex filename
  64688     584     592   65864   10148 (TOTALS-BEFORE)
  64641     584     592   65817   10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
    drivers/leds/led-class.c
    drivers/leds/ledtrig-timer.c
    drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str &&  isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:32 -08:00
Linus Torvalds
3126c136bc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
  ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
  ext3: Fix data / filesystem corruption when write fails to copy data
  ext4: Support for 64-bit quota format
  ext3: Support for vfsv1 quota format
  quota: Implement quota format with 64-bit space and inode limits
  quota: Move definition of QFMT_OCFS2 to linux/quota.h
  ext2: fix comment in ext2_find_entry about return values
  ext3: Unify log messages in ext3
  ext2: clear uptodate flag on super block I/O error
  ext2: Unify log messages in ext2
  ext3: make "norecovery" an alias for "noload"
  ext3: Don't update the superblock in ext3_statfs()
  ext3: journal all modifications in ext3_xattr_set_handle
  ext2: Explicitly assign values to on-disk enum of filetypes
  quota: Fix WARN_ON in lookup_one_len
  const: struct quota_format_ops
  ubifs: remove manual O_SYNC handling
  afs: remove manual O_SYNC handling
  kill wait_on_page_writeback_range
  vfs: Implement proper O_SYNC semantics
  ...
2009-12-11 15:31:13 -08:00
Jan Kara
5a20bdfcdc ext4: Support for 64-bit quota format
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Theodore Ts'o
a214238d3b ext4: Do not override ext2 or ext3 if built they are built as modules
The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
ext2 or ext3 file systems if the those file system drivers are
configured to be built as mdoules.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:09:58 -05:00
Eric Sandeen
15121c18a2 ext4: Fix optional-arg mount options
We have 2 mount options, "barrier" and "auto_da_alloc" which may or
may not take a 1/0 argument.  This causes the ext4 superblock mount
code to subtract uninitialized pointers and pass the result to
kmalloc, which results in very noisy failures.

Per Ted's suggestion, initialize the args struct so that
we know whether match_token() found an argument for the
option, and skip match_int() if not.

Also, return error (0) from parse_options if we thought
we found an argument, but match_int() Fails.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 20:17:55 -05:00
Jan Kara
b436b9bef8 ext4: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 23:51:10 -05:00
Josef Bacik
d4edac314e ext4: wait for log to commit when umounting
There is a potential race when a transaction is committing right when
the file system is being umounting.  This could reduce in a race
because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
before the commit code calls a callback so the mballoc code can
release freed blocks in the transaction, resulting in a panic trying
to access the freed s_group_info.

The fix is to wait for the transaction to finish committing before we
shutdown the multiblock allocator.  

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:48:58 -05:00
Theodore Ts'o
24b584240a ext4: Use ext4 file system driver for ext2/ext3 file system mounts
Add a new config option, CONFIG_EXT4_USE_FOR_EXT23 which if enabled,
will cause ext4 to be used for either ext2 or ext3 file system mounts
when ext2 or ext3 is not enabled in the configuration.

This allows minimalist kernel fanatics to drop to file system drivers
from their compiled kernel with out losing functionality.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-07 14:08:51 -05:00
Eric Sandeen
e3bb52ae2b ext4: make "norecovery" an alias for "noload"
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:28:50 -05:00
Eric Sandeen
5328e63531 ext4: make trim/discard optional (and off by default)
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues.  Make
this mount-time optional, and default it to off until we know
that things are working out OK.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:25:42 -05:00
Theodore Ts'o
3f8fb9490e ext4: don't update the superblock in ext4_statfs()
commit a71ce8c6c9 updated ext4_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:52 -05:00
Theodore Ts'o
cf40db137c ext4: remove failed journal checksum check
Now that we are checking for failed journal checksums in the jbd2
layer, we don't need to check in the ext4 mount path --- since a
checksum fail will result in ext4_load_journal() returning an error,
causing the file system to refuse to be mounted until e2fsck can deal
with the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:01 -05:00
Theodore Ts'o
503358ae01 ext4: avoid divide by zero when trying to mount a corrupted file system
If s_log_groups_per_flex is greater than 31, then groups_per_flex will
will overflow and cause a divide by zero error.  This can cause kernel
BUG if such a file system is mounted.

Thanks to Nageswara R Sastry for analyzing the failure and providing
an initial patch.

http://bugzilla.kernel.org/show_bug.cgi?id=14287

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:46 -05:00
Linus Torvalds
d4da6c9ccf Revert "ext4: Remove journal_checksum mount option and enable it by default"
This reverts commit d0646f7b63, as
requested by Eric Sandeen.

It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.

Quoth Eric:

   "My hand-wavy hunch about what is happening is that we're finding a
    bad checksum on the last partially-written transaction, which is
    not surprising, but if we have a wrapped log and we're doing the
    initial scan for head/tail, and we abort scanning on that bad
    checksum, then we are essentially running an unrecovered filesystem.

    But that's hand-wavy and I need to go look at the code.

    We lived without journal checksums on by default until now, and at
    this point they're doing more harm than good, so we should revert
    the default-changing commit until we can fix it and do some good
    power-fail testing with the fixes in place."

See

	http://bugzilla.kernel.org/show_bug.cgi?id=14354

for all the gory details.

Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-02 10:15:27 -08:00
Eric Sandeen
f0e2dfa7f3 ext4: drop ext4dev compat
Kconfig & super.c promised it'd be gone by 2.6.31, so it's
about time to drop it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-10-01 02:21:07 -04:00
Theodore Ts'o
296c355cd6 ext4: Use tracepoints for mb_history trace file
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.  

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:42 -04:00
Theodore Ts'o
90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
Curt Wohlgemuth
d3d1faf6a7 ext4: Handle nested ext4_journal_start/stop calls without a journal
This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 11:01:03 -04:00
Mingming Cao
8d5d02e6b1 ext4: async direct IO for holes and fallocate support
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.

But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.

Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue.  When fsync() is called, it will go
through the list and do the conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-28 15:48:29 -04:00
Mingming Cao
4c0425ff68 ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
Currently the DIO VFS code passes create = 0 when writing to the
middle of file.  It does this to avoid block allocation for holes, so
as not to expose stale data out when there is a parallel buffered read
(which does not hold the i_mutex lock).  Direct I/O writes into holes
falls back to buffered IO for this reason.

Since preallocated extents are treated as holes when doing a
get_block() look up (buffer is not mapped), direct IO over fallocate
also falls back to buffered IO.  Thus ext4 actually silently falls
back to buffered IO in above two cases, which is undesirable.

To fix this, this patch creates unitialized extents when a direct I/O
write into holes in sparse files, and registering an end_io callback which
converts the uninitialized extent to an initialized extent after the
I/O is completed.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:48:41 -04:00
Theodore Ts'o
55138e0bc2 ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small.  This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time.  So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode.  We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 13:31:31 -04:00
Alexey Dobriyan
0d54b217a2 const: make struct super_block::s_qcop const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan
61e225dc34 const: make struct super_block::dq_op const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Eric Sandeen
fb0a387dcd ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.

This should address RH Bug 519471, ext4 bitmap allocator 
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes.  Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:45:10 -04:00
Theodore Ts'o
3661d28615 ext4: Fix include/trace/events/ext4.h to work with Systemtap
Using relative pathnames in #include statements interacts badly with
SystemTap, since the fs/ext4/*.h header files are not packaged up as
part of a distribution kernel's header files.  Since systemtap doesn't
use TP_fast_assign(), we can use a blind structure definition and then
make sure the needed header files are defined before the ext4 source
files #include the trace/events/ext4.h header file.

https://bugzilla.redhat.com/show_bug.cgi?id=512478

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-14 22:59:50 -04:00
Theodore Ts'o
7ad9bb651f ext4: Fix initalization of s_flex_groups
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg.  By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.  

The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.

Thanks to Damien Guibouret for pointing out this problem!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 16:51:28 -04:00
Theodore Ts'o
71290b368a ext4: Don't update superblock write time when filesystem is read-only
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 17:31:04 -04:00
Theodore Ts'o
d0646f7b63 ext4: Remove journal_checksum mount option and enable it by default
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 12:50:43 -04:00
Eric Sandeen
a13fb1a453 ext4: Add feature set check helper for mount & remount paths
A user reported that although his root ext4 filesystem was mounting
fine, other filesystems would not mount, with the:

"Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF"

error on his 32-bit box built without CONFIG_LBDAF.  This is because
the test at mount time for this situation was not being re-checked
on remount, and the normal boot process makes an ro->rw transition,
so this was being missed.

Refactor to make a common helper function to test the filesystem
features against the type of mount request (RO vs. RW) so that we 
stay consistent.

Addresses Red-Hat-Bugzilla: #517650

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-18 00:20:23 -04:00
Eric Sandeen
bf43d84b18 ext4: reject too-large filesystems on 32-bit kernels
ext4 will happily mount a > 16T filesystem on a 32-bit box, but
this is not safe; writes to the block device will wrap past 16T
and the page cache can't index past 16T (232 index * 4k pages).

Adding another test to the existing "too many sectors" test
should do the trick.

Add a comment, a relevant return value, and fix the reference
to the CONFIG_LBD(AF) option as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 23:48:51 -04:00
Theodore Ts'o
78f1ddbb49 ext4: Avoid null pointer dereference when decoding EROFS w/o a journal
We need to check to make sure a journal is present before checking the
journal flags in ext4_decode_error().

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 23:09:47 -04:00
Al Viro
d4bfe2f76d switch ext4 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Linus Torvalds
31583d6acf Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix kernel-doc parameter name typo in blk-settings.c:
  block: rename CONFIG_LBD to CONFIG_LBDAF
  block: Fix bounce_pfn setting
  hd: stop defining MAJOR_NR
2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz
90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Andreas Dilger
11013911da ext4: teach the inode allocator to use a goal inode number
Enhance the inode allocator to take a goal inode number as a
paremeter; if it is specified, it takes precedence over Orlov or
parent directory inode allocation algorithms.

The extents migration function uses the goal inode number so that the
extent trees allocated the migration function use the correct flex_bg.
In the future, the goal inode functionality will also be used to
allocate an adjacent inode for the extended attributes.

Also, for testing purposes the goal inode number can be specified via
/sys/fs/{dev}/inode_goal.  This can be useful for testing inode
allocation beyond 2^32 blocks on very large filesystems.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:45:35 -04:00
Theodore Ts'o
4ab2f15b7f ext4: move the abort flag from s_mount_opts to s_mount_flags
We're running out of space in the mount options word, and
EXT4_MOUNT_ABORT isn't really a mount option, but a run-time flag.  So
move it to become EXT4_MF_FS_ABORTED in s_mount_flags.

Also remove bogus ext2_fs.h / ext4.h simultaneous #include protection,
which can never happen.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:36 -04:00
Theodore Ts'o
7f4520cc62 ext4: change s_mount_opt to be an unsigned int
We can only fit 32 options in s_mount_opt because an unsigned long is
32-bits on a x86 machine.  So use an unsigned int to save space on
64-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:41 -04:00
Alessio Igor Bogani
337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Christoph Hellwig
ebc1ac1645 ->write_super lock_super pushdown
Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

 * bfs, nilfs2 - no other uses of s_lock and have internal locks in
	->write_super
 * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
 * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
 	->write_super
 * xfs - no other uses of s_lock and uses internal lock (buffer lock on
	superblock buffer) to serialize ->write_super.  Also xfs_fs_write_super
	is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Al Viro
bbd6851a32 Push lock_super() into the ->remount_fs() of filesystems that care about it
Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own.  Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Christoph Hellwig
6cfd014842 push BKL down into ->put_super
Move BKL into ->put_super from the only caller.  A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment.  Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
a9e220f832 No need to do lock_super() for exclusion in generic_shutdown_super()
We can't run into contention on it.  All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held.  So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it.  The rest of filesystems do not care at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Christoph Hellwig
8c85e12512 remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.

Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.

Exceptions:

 - affs already has identical copy & pasted code at the beginning of
   affs_put_super so no need to do it twice.
 - xfs does the right thing without it and I have changes pending for
   the xfs tree touching this are so I don't really need conflicts
   here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Eric Sandeen
b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Andreas Dilger
0b8e58a140 ext4: super.c whitespace cleanup
Cleanup of whitespace and formatting.  Initially driven by confusing indents
for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd
cleanup some other 80-column wrapping and other indenting problems at the
same time.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03 17:59:28 -04:00
Theodore Ts'o
88b6edd17c ext4: Clean up calls to ext4_get_group_desc()
If the caller isn't planning on modifying the block group descriptors,
there's no need to pass in a pointer to a struct buffer_head.  Nuking
this saves a tiny amount of CPU time and stack space usage.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:50:39 -04:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Manish Katiyar
f68301656b ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:44 -04:00
Theodore Ts'o
6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Aneesh Kumar K.V
955ce5f5be ext4: Convert ext4_lock_group to use sb_bgl_lock
We have sb_bgl_lock() and ext4_group_info.bb_state
bit spinlock to protech group information. The later is only
used within mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoid
confusion with two locks protecting same info.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 20:35:09 -04:00
Curt Wohlgemuth
f40339031b ext4: Make the length of the mb_history file tunable
In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.

This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameter/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 20:27:20 -04:00
Theodore Ts'o
bb23c20a85 ext4: Move fs/ext4/group.h into ext4.h
Move the function prototypes in group.h into ext4.h so they are all
defined in one place.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 19:44:44 -04:00
Theodore Ts'o
596397b77c ext4: Move fs/ext4/namei.h into ext4.h
The fs/ext4/namei.h header file had only a single function
declaration, and should have never been a standalone file.  Move it
into ext4.h, where should have been from the beginning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:49:15 -04:00
Theodore Ts'o
9bffad1ed2 ext4: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:48:11 -04:00
Theodore Ts'o
32ed5058ce ext4: Replace lock/unlock_super() with an explicit lock for resizing
Use a separate lock to protect s_groups_count and the other block
group descriptors which get changed via an on-line resize operation,
so we can stop overloading the use of lock_super().
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:53:39 -04:00
Theodore Ts'o
3b9d4ed266 ext4: Replace lock/unlock_super() with an explicit lock for the orphan list
Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:54:04 -04:00
Theodore Ts'o
a63c9eb2ce ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
The function ext4_mark_recovery_complete() is called from two call
paths: either (a) while mounting the filesystem, in which case there's
no danger of any other CPU calling write_super() until the mount is
completed, and (b) while remounting the filesystem read-write, in
which case the fs core has already locked the superblock.  This also
allows us to take out a very vile unlock_super()/lock_super() pair in
ext4_remount().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 01:59:42 -04:00
Theodore Ts'o
114e9fc907 ext4: Remove outdated comment about lock_super()
ext4_fill_super() is no longer called by read_super(), and it is no
longer called with the superblock locked.  The
unlock_super()/lock_super() is no longer present, so this comment is
entirely superfluous.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 15:48:07 -04:00
Theodore Ts'o
8df9675f8b ext4: Avoid races caused by on-line resizing and SMP memory reordering
Ext4's on-line resizing adds a new block group and then, only at the
last step adjusts s_groups_count.  However, it's possible on SMP
systems that another CPU could see the updated the s_group_count and
not see the newly initialized data structures for the just-added block
group.  For this reason, it's important to insert a SMP read barrier
after reading s_groups_count and before reading any (for example) the
new block group descriptors allowed by the increased value of
s_groups_count.

Unfortunately, we rather blatently violate this locking protocol
documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use just-added block group before any
memory ordering issues resolve themselves.  So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 08:50:38 -04:00
Theodore Ts'o
9ca92389c5 ext4: Use separate super_operations structure for no_journal filesystems
By using a separate super_operations structure for filesystems that
have and don't have journals, we can simply ext4_write_super() ---
which is only needed when no journal is present --- and ext4_freeze(),
ext4_unfreeze(), and ext4_sync_fs(), which are only needed when the
journal is present.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 12:52:25 -04:00
Theodore Ts'o
7234ab2a55 ext4: Fix and simplify s_dirt handling
The s_dirt flag wasn't completely handled correctly, but it didn't
really matter when journalling was enabled.  It turns out that when
ext4 runs without a journal, we don't clear s_dirt in places where we
should have, with the result that the high-level write_super()
function was writing the superblock when it wasn't necessary.

So we fix this by making ext4_commit_super() clear the s_dirt flag,
and removing many of the other places where s_dirt is manipulated.
When journalling is enabled, the s_dirt flag might be left set more
often, but s_dirt really doesn't matter when journalling is enabled.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-30 21:24:04 -04:00
Theodore Ts'o
e2d670523c ext4: Simplify ext4_commit_super()'s function signature
The ext4_commit_super() function took both a struct super_block * and
a struct ext4_super_block *, but the struct ext4_super_block can be
derived from the struct super_block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 00:33:44 -04:00
Theodore Ts'o
f7c439504c ext4: Use is_power_of_2() for clarity
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 23:31:59 -04:00
Theodore Ts'o
c5ca7c7636 ext4: Fallback to vmalloc if kmalloc can't allocate s_flex_groups array
For very large filesystems, the s_flex_groups array can get quite big.
For example, a filesystem that can be resized up to 16TB will have
8192 flex groups (assuming the default flex_bg size of 16), so the
array is 96k, which is *very* marginal for kmalloc().  On the other
hand, a 160GB filesystem without the resize_inode feature will only
require 960 bytes.  So we try to allocate the array first using
kmalloc(), and if that fails, we'll try to use vmalloc() instead.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27 22:48:48 -04:00
From: Thiemo Nagel
0f2ddca66d ext4: check block device size on mount
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-07 14:07:47 -04:00
Theodore Ts'o
06705bff91 ext4: Regularize mount options
Add support for using the mount options "barrier" and "nobarrier", and
"auto_da_alloc" and "noauto_da_alloc", which is more consistent than
"barrier=<0|1>" or "auto_da_alloc=<0|1>".  Most other ext3/ext4 mount
options use the foo/nofoo naming convention.  We allow the old forms
of these mount options for backwards compatibility.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-28 10:59:57 -04:00
Theodore Ts'o
afd4672dc7 ext4: Add auto_da_alloc mount option
Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file.  This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code.  This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-16 23:12:23 -04:00
Theodore Ts'o
7d39db14a4 ext4: Use struct flex_groups to calculate get_orlov_stats()
Instead of looping over all of the block groups in a flex group
summing their summary statistics, start tracking used_dirs in struct
flex_groups, and use struct flex_groups instead.  This should save a
bit of CPU for mkdir-heavy workloads.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:31:53 -05:00
Theodore Ts'o
9f24e4208f ext4: Use atomic_t's in struct flex_groups
Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's
to track the number of free blocks and inodes in each flex_group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:09:10 -05:00
Theodore Ts'o
b713a5ec55 ext4: remove /proc tuning knobs
Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been
replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:11:14 -04:00
Theodore Ts'o
3197ebdb13 ext4: Add sysfs support
Add basic sysfs support so that information about the mounted
filesystem and various tuning parameters can be accessed via
/sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:10:09 -04:00
Theodore Ts'o
afc32f7ee9 ext4: Track lifetime disk writes
Add a new superblock value which tracks the lifetime amount of writes
to the filesystem.  This is useful in estimating the amount of wear on
solid state drives (SSD's) caused by writes to the filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 19:39:58 -05:00
Pekka Enberg
705895b611 ext4: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext4_sb_info
enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of
32 KB saving 14 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 18:07:52 -05:00
Jan Kara
a269eb1829 ext4: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
2009-03-26 02:18:36 +01:00
Mingming Cao
60e58e0f30 ext4: quota reservation for delayed allocation
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Jan Kara
edf7245362 ext4: Remove unnecessary quota functions
ext4_dquot_initialize() and ext4_dquot_drop() is no longer
needed because of modified quota locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Theodore Ts'o
8b1a8ff8b3 ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()
Commit c4be0c1d added error checking to ext4_freeze() when calling
ext4_commit_super().  Unfortunately the patch failed to remove the
original call to ext4_commit_super(), with the net result that when
freezing the filesystem, the superblock gets written twice, the first
time without error checking.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 00:08:53 -05:00
Jan Kara
9eddacf9e9 Revert "ext4: wait on all pending commits in ext4_sync_fs()"
This undoes commit 14ce0cb411.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
2009-02-10 06:46:05 -05:00
Takashi Sato
c4be0c1dc4 filesystem freeze: add error handling of write_super_lockfs/unlockfs
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests.  So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.

In many case, a commercial filesystem (e.g.  VxFS) has the freeze feature
and it would be used to get the consistent backup.

If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.

So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
   with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
   or the snapshot.

This patch:

VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.

ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.

reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.

Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:42 -08:00
Linus Torvalds
2150edc6c5 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
  jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
  ext4: Remove "extents" mount option
  block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
  ext4: Make printk's consistently prefixed with "EXT4-fs: "
  ext4: Add sanity checks for the superblock before mounting the filesystem
  ext4: Add mount option to set kjournald's I/O priority
  jbd2: Submit writes to the journal using WRITE_SYNC
  jbd2: Add pid and journal device name to the "kjournald2 starting" message
  ext4: Add markers for better debuggability
  ext4: Remove code to create the journal inode
  ext4: provide function to release metadata pages under memory pressure
  ext3: provide function to release metadata pages under memory pressure
  add releasepage hooks to block devices which can be used by file systems
  ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
  ext4: Init the complete page while building buddy cache
  ext4: Don't allow new groups to be added during block allocation
  ext4: mark the blocks/inode bitmap beyond end of group as used
  ext4: Use new buffer_head flag to check uninit group bitmaps initialization
  ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
  ext4: code cleanup
  ...
2009-01-08 17:14:59 -08:00
Theodore Ts'o
83982b6f47 ext4: Remove "extents" mount option
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based file even though the extents feature flag
was not eabled.  The simplest thing to do is to nuke the mount option
entirely.  It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 14:53:16 -05:00
Theodore Ts'o
abda141892 ext4: Make printk's consistently prefixed with "EXT4-fs: "
Previously, some were "ext4: ", and some were "EXT4: "; change them to
be consistent with most ext4 printk's, which is to use "EXT4-fs: ".

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 00:20:32 -05:00
Theodore Ts'o
4ec1102813 ext4: Add sanity checks for the superblock before mounting the filesystem
This avoids insane superblock configurations that could lead to kernel
oops due to null pointer derefences.

http://bugzilla.kernel.org/show_bug.cgi?id=12371

Thanks to David Maciejak at Fortinet's FortiGuard Global Security
Research Team who discovered this bug independently (but at
approximately the same time) as Thiemo Nagel, who submitted the patch.

Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-06 14:53:26 -05:00
Theodore Ts'o
b3881f74b3 ext4: Add mount option to set kjournald's I/O priority
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
2009-01-05 22:46:26 -05:00
Jan Kara
a5b5ee3201 ext4: Add default allocation routines for quota structures
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:26 -08:00
Jan Kara
17bd13b31c ext4: Use sb_any_quota_loaded() instead of sb_any_quota_enabled()
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:36:56 -08:00
Theodore Ts'o
c319106723 ext4: Remove code to create the journal inode
This code has been obsolete in quite some time, since the supported
method for adding a journal inode is to use tune2fs (or to creating
new filesystem with a journal via mke2fs or mkfs.ext4).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 11:14:25 -05:00
Toshiyuki Okajima
c39a7f84d7 ext4: provide function to release metadata pages under memory pressure
Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which calls
jbd2_journal_try_to_free_buffers() function to free the metadata, and
which is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
2009-01-05 22:38:48 -05:00
Aneesh Kumar K.V
560671a0d3 ext4: Use high 16 bits of the block group descriptor's free counts fields
Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:20:24 -05:00
Aneesh Kumar K.V
5d1b1b3f49 ext4: fix BUG when calling ext4_error with locked block group
The mballoc code likes to call ext4_error while it is holding locked
block groups.  This can causes a scheduling in atomic context BUG.  We
can't just unlock the block group and relock it after/if ext4_error
returns since that might result in race conditions in the case where
the filesystem is set to continue after finding errors.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:19:52 -05:00
Jens Axboe
b3a6ffe16b Get rid of CONFIG_LSF
We have two seperate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.

Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-29 08:29:51 +01:00
Aneesh Kumar K.V
3a06d778df ext4: sparse fixes
* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-22 15:04:59 -05:00
Theodore Ts'o
a9df9a4910 ext4: Make ext4_group_t be an unsigned int
Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:18:16 -05:00
Theodore Ts'o
30773840c1 ext4: add fsync batch tuning knobs
Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched with a synchronous write transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03 20:27:38 -05:00
Theodore Ts'o
fde4d95ad8 ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning()
This removes annoying blank syslog entries emitted by ext4_error() or
ext4_warning(), since these functions add their own newline.

Signed-off-by: Nick Warne <nick@ukfsn.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:17:35 -05:00
Frank Mayhar
0390131ba8 ext4: Allow ext4 to run without a journal
A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-07 00:06:22 -05:00
Roel Kluin
23475e264c ext4: Use simple_strtol() instead of simple_strtoul() in ext4_ui_proc_open
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26 02:23:19 -05:00
Aneesh Kumar K.V
565a9617b2 ext4: avoid ext4_error when mounting a fs with a single bg
Remove some completely unneeded code which which caused an ext4_error
to be generated when mounting a file system with only a single block
group.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:51:07 -05:00
Theodore Ts'o
14ce0cb411 ext4: wait on all pending commits in ext4_sync_fs()
In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.

In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super.  Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.

To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
2008-11-03 18:10:55 -05:00
Aneesh Kumar K.V
d94e99a64c ext4: Convert to host order before using the values.
Use le16_to_cpu to read the s_reserved_gdt_blocks values
from super block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-04 09:11:26 -05:00
Theodore Ts'o
f99b25897a ext4: Add support for non-native signed/unsigned htree hash algorithms
The original ext3 hash algorithms assumed that variables of type char
were signed, as God and K&R intended.  Unfortunately, this assumption
is not true on some architectures.  Userspace support for marking
filesystems with non-native signed/unsigned chars was added two years
ago, but the kernel-side support was never added (until now).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-28 13:21:44 -04:00
Hidehiro Kawai
ef2cabf7c6 ext4: fix a bug accessing freed memory in ext4_abort
Vegard Nossum reported a bug which accesses freed memory (found via
kmemcheck).  When journal has been aborted, ext4_put_super() calls
ext4_abort() after freeing the journal_t object, and then ext4_abort()
accesses it.  This patch fix it.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-27 22:53:05 -04:00
Linus Torvalds
2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Al Viro
8264613def [PATCH] switch quota_on-related stuff to kern_path()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-23 05:12:44 -04:00
Al Viro
9a1c354276 [PATCH] pass fmode_t to blkdev_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:58 -04:00
Theodore Ts'o
f287a1a561 ext4: Remove automatic enabling of the HUGE_FILE feature flag
If the HUGE_FILE feature flag is not set, don't allow the creation of
large files, instead of automatically enabling the feature flag.
Recent versions of mke2fs will set the HUGE_FILE flag automatically
anyway for ext4 filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-16 22:50:48 -04:00
Theodore Ts'o
01436ef2e4 ext4: Remove unused mount options: nomballoc, mballoc, nocheck
These mount options don't actually do anything any more, so remove
them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 07:22:35 -04:00
Eric Sesterhenn
5128273a32 ext4: Add missing newlines to printk messages
There are some newlines missing in ext4_check_descriptors, which
cause the printk level to be printed out when the next printk call
is made:

[  778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0
not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted!
[  802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0
not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted!

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 09:16:19 -04:00
Aneesh Kumar K.V
c2774d84fd ext4: Do mballoc init before doing filesystem recovery
During filesystem recovery we may be doing a truncate
which expects some of the mballoc data structures to
be initialized. So do ext4_mb_init before recovery.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:07:20 -04:00
Steven Whitehouse
a447c09324 vfs: Use const for kernel parser table
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.

This was posted for review some time ago and I believe its been in -mm
since then.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 10:10:37 -07:00
Alexander Beregalov
3244fcb1ae ext4: fix build failure without procfs
fs/ext4/super.c: In function 'ext4_fill_super':
fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use
in this function)
fs/ext4/super.c:2226: error: (Each undeclared identifier is reported
only once
fs/ext4/super.c:2226: error: for each function it appears in.)

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-12 17:27:49 -04:00
Hidehiro Kawai
5bf5683a33 ext4: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data
blocks, the file data corruption will spread silently.  Because
most of applications and commands do buffered writes without fsync(),
they don't notice the IO error.  It's scary for mission critical
systems.  On the other hand, if the journal aborts whenever it gets
an IO error in file data blocks, the system will easily become
inoperable.  So this patch introduces a filesystem option to
determine whether it aborts the journal or just call printk() when
it gets an IO error in file data.

If you mount an ext4 fs with data_err=abort option, it aborts on file
data write error.  If you mount it with data_err=ignore, it doesn't
abort, just call printk().  data_err=ignore is the default.

Here is the corresponding patch of the ext3 version:
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 22:12:43 -04:00
Hidehiro Kawai
7ffe1ea894 ext4: add checks for errors from jbd2
If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space.  Otherwise, the
filesystem will lose uncheckpointed metadata completely and
become inconsistent.  To avoid this, we need to keep needs_recovery
flag if checkpoint has failed.

With this patch, ext4_put_super() detects a checkpointing failure
from the return value of journal_destroy(), then it invokes
ext4_abort() to make the filesystem read only and keep
needs_recovery flag.  Errors from jbd2_journal_flush() are also
handled by this patch in some places.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:29:21 -04:00
Theodore Ts'o
03010a3350 ext4: Rename ext4dev to ext4
The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix.  Also remove the requirement for the TEST_FILESYS
flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 20:02:48 -04:00
Andi Kleen
39d80c33a0 ext4: Avoid double dirtying of super block in ext4_put_super()
While reading code I noticed that ext4_put_super() dirties the 
superblock bh twice. It is always done in ext4_commit_super()
too. Remove the redundant dirty operation.
Should be a nop semantically.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-10-06 21:37:44 -04:00
Theodore Ts'o
ede86cc473 ext4: Add debugging markers that can be used by systemtap
This debugging markers are designed to debug problems such as the
random filesystem latency problems reported by Arjan.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-05 20:50:06 -04:00
Theodore Ts'o
c2ea3fde61 ext4: Remove old legacy block allocator
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 09:40:52 -04:00
Theodore Ts'o
240799cdf2 ext4: Use readahead when reading an inode from the inode table
With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request readahead for adjacent inode table
blocks to reduce the time it takes when iterating over directories
(especially when doing this in htree sort order) in a cold cache case.
With this patch, the time it takes to run "git status" on a kernel
tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-09 23:53:47 -04:00
Theodore Ts'o
5e8814f2f7 ext4: Combine proc file handling into a single set of functions
Previously mballoc created a separate set of functions for each proc
file.  This combines the tunables into a single set of functions which
gets used for all of the per-superblock proc files, saving
approximately 2k of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-23 18:07:35 -04:00
Theodore Ts'o
9f6200bbfc ext4: move /proc setup and teardown out of mballoc.c
...and into the core setup/teardown code in fs/ext4/super.c so that
other parts of ext4 can define tuning parameters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-23 09:18:24 -04:00
Theodore Ts'o
914258bf2c ext4/jbd2: Avoid WARN() messages when failing to write to the superblock
This fixes some very common warnings reported by kerneloops.org

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-06 21:35:40 -04:00
Li Zefan
7ee1ec4ca3 ext4: add missing unlock in ext4_check_descriptors() on error path
If there group descriptors are corrupted we need unlock the block
group lock before returning from the function; else we will oops when
freeing a spinlock which is still being held.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:47:19 -04:00
Theodore Ts'o
05496769e5 jbd2: clean up how the journal device name is printed
Calculate the journal device name once and stash it away in the
journal_s structure.  This avoids needing to call bdevname()
everywhere and reduces stack usage by not needing to allocate an
on-stack buffer.  In addition, we eliminate the '/' that can appear in
device names (e.g. "cciss/c0d0p9" --- see kernel bugzilla #11321) that
can cause problems when creating proc directory names, and include the
inode number to support ocfs2 which creates multiple journals with
different inode numbers.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-16 14:36:17 -04:00
Frederic Bohe
c62a11fd95 Update flex_bg free blocks and free inodes counters when resizing.
This fixes a bug which prevented the newly created inodes after a
resize from being used on filesystems with flex_bg.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:20:24 -04:00
Aneesh Kumar K.V
6bc6e63fcd ext4: Add percpu dirty block accounting.
This patch adds dirty block accounting using percpu_counters.  Delayed
allocation block reservation is now done by updating dirty block
counter.  In a later patch we switch to non delalloc mode if the
filesystem free blocks is greater than 150% of total filesystem dirty
blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 09:39:00 -04:00
Theodore Ts'o
af5bc92dde ext4: Fix whitespace checkpatch warnings/errors
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 22:25:24 -04:00
Theodore Ts'o
e5f8eab885 ext4: Fix long long checkpatch warnings
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 22:25:04 -04:00
Theodore Ts'o
4776004f54 ext4: Add printk priority levels to clean up checkpatch warnings
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:00:52 -04:00
Aneesh Kumar K.V
91246c0090 ext4: Initialize writeback_index to 0 when allocating a new inode
The write_cache_pages() function uses the mapping->writeback_index as
the starting index to write out when range_cyclic is set.  Properly
initialize writeback_index so that we start the writeout at index 0.

This was found when debugging the small file fragmentation on ext4.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19 21:14:52 -04:00
Linus Torvalds
8f616cd524 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove write-only variables from ext4_ordered_write_end
  ext4: unexport jbd2_journal_update_superblock
  ext4: Cleanup whitespace and other miscellaneous style issues
  ext4: improve ext4_fill_flex_info() a bit
  ext4: Cleanup the block reservation code path
  ext4: don't assume extents can't cross block groups when truncating
  ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
  ext4: Fix ext4_ext_journal_restart()
  ext4: fix ext4_da_write_begin error path
  jbd2: don't abort if flushing file data failed
  ext4: don't read inode block if the buffer has a write error
  ext4: Don't allow lg prealloc list to be grow large.
  ext4: Convert the usage of NR_CPUS to nr_cpu_ids.
  ext4: Improve error handling in mballoc
  ext4: lock block groups when initializing
  ext4: sync up block and inode bitmap reading functions
  ext4: Allow read/only mounts with corrupted block group checksums
  ext4: Fix data corruption when writing to prealloc area
2008-08-03 10:50:44 -07:00
Al Viro
77e69dac3c [PATCH] fix races and leaks in vfs_quota_on() users
* new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the
  pathname resolution.
* callers of vfs_quota_on() that do their own pathname resolution and
  checks based on it are switched to vfs_quota_on_path(); that way we
  avoid the races.
* reiserfs leaked dentry/vfsmount references on several failure exits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:25 -04:00
Theodore Ts'o
2b2d6d0197 ext4: Cleanup whitespace and other miscellaneous style issues
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26 16:15:44 -04:00
Alexey Dobriyan
51cc50685a SL*B: drop kmem cache argument from constructor
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres.  Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
	arch/powerpc/mm/init_64.c
	arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:07 -07:00
Li Zefan
ec05e868ac ext4: improve ext4_fill_flex_info() a bit
- use kzalloc() instead of kmalloc() + memset()
- improve a printk info

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-24 12:49:59 -04:00
Eric Sandeen
b5f10eed81 ext4: lock block groups when initializing
I noticed when filling a 1T filesystem with 4 threads using the
fs_mark benchmark:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

that I occasionally got checksum mismatch errors:

EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935

etc.  I'd reliably get 4-5 of them during the run.

It appears that the problem is likely a race to init the bg's
when the uninit_bg feature is enabled.

With the patch below, which adds sb_bgl_locking around initialization,
I was able to complete several runs with no errors or warnings.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02 21:21:08 -04:00
Theodore Ts'o
8a266467b8 ext4: Allow read/only mounts with corrupted block group checksums
If the block group checksums are corrupted, still allow the mount to
succeed, so e2fsck can have a chance to try to fix things up.  Add
code in the remount r/w path to make sure the block group checksums
are valid before allowing the filesystem to be remounted read/write.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26 14:34:21 -04:00
Eric Sandeen
e4079a11f5 ext4: do not set extents feature from the kernel
We've talked for a while about getting rid of any feature-
setting from the kernel; this gets rid of the code which would
set the INCOMPAT_EXTENTS flag on the first file write when mounted
as ext4[dev].

With this patch, if the extents feature is not already set on disk,
then mounting as ext4 will fall back to noextents with a warning,
and if -o extents is explicitly requested, the mount will fail,
also with warning.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
c07651b556 ext4: Don't allow nonextenst mount option for large filesystem
The block mapped inode format can address only blocks within 2**32. This
causes a number of issues, the biggest of which is that the block
allocator needs to be taught that certain inodes can not utilize block
numbers > 2**32.  So until this is fixed, it is simplest to fail
mounting of file systems with more than 2**32 blocks if the -o noextents
option is given.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
dd919b9822 ext4: Enable delalloc by default.
Enable delalloc by default to ensure it gets sufficient testing and
because it makes the filesystem much more efficient.  Add a nodealalloc
option to disable delayed allocation, and update ext4_show_options to
show delayed allocation off if it is disabled.

If the data=journal mount option is used, disable delayed allocation
since the delalloc code doesn't support data=journal yet.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Mingming Cao
d2a1763791 ext4: delayed allocation ENOSPC handling
This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks(data and metadata) are reserved at da_write_begin()
time, the freeblocks counter is updated by then, and the number of
reserved blocks is store in per inode counter.
        
At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the oldallocator block reservation accounting with delalloc, added
lock to guard the counters and also fix the reservation for meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-14 17:52:37 -04:00
Alex Tomas
64769240bd ext4: Add delayed allocation support in data=writeback mode
Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and
release the page from page cache if the delalloc write_begin failed, and
properly handle preallocated blocks.  Also added a fix to clear
buffer_delay in block_write_full_page() after allocating a delayed
buffer.

Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to update i_disksize properly and to add bmap support for delayed
allocation.

Updated with a fix from Valerie Clement <valerie.clement@bull.net> to
avoid filesystem corruption when the filesystem is mounted with the
delalloc option and blocksize < pagesize.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by:  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-07-11 19:27:31 -04:00
Jan Kara
678aaf4814 ext4: Use new framework for data=ordered mode in JBD2
This patch makes ext4 use inode-based implementation of data=ordered mode
in JBD2. It allows us to unify some data=ordered and data=writeback paths
(especially writepage since we don't have to start a transaction anymore)
and remove some buffer walking.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix file system hang due to corrupt jinode values.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
47b4a50beb ext4: Set journal pointer to NULL when journal is released
Set sbi->s_journal to NULL after we call journal_destroy(). This
will be later needed because after journal_destroy() is called,
ext4_clear_inode() can still be called for some inodes (e.g. root
inode) and we'll need to detect there that journal doesn't exists
anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
7477827f66 ext4: Fix sparse warning
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan
f9a8ac99fd ext4: remove redundant code in ext4_fill_super()
The previous sb_min_blocksize() has already set the block size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jose R. Santos
772cb7c83b ext4: New inode allocation for FLEX_BG meta-data groups.
This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o
7ad72ca60b ext4: Remove unused variable from ext4_show_options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
4d04e4fbf8 ext4: add missing unlock to an error path in ext4_quota_write()
When write in ext4_quota_write() fails, we have to properly release
i_mutex.  One error path has been missing the unlock...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Eric Sandeen
571640cad3 ext4: enable barriers by default
I can't think of any valid reason for ext4 to not use barriers when
they are available;  I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.

An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.

SuSE has carried such a patch for ext3 for quite some time now.

Also document the mount option while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 12:29:46 -04:00
Theodore Ts'o
cd0b6a39a1 ext4: Display the journal_async_commit mount option in /proc/mounts
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 10:28:28 -04:00
Theodore Ts'o
624080eded jbd2: If a journal checksum error is detected, propagate the error to ext4
If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-06 17:50:40 -04:00
Jan Kara
2c8be6b222 ext4: fix typos in messages and comments (journalled -> journaled)
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 21:27:55 -04:00
Jan Kara
0623543b33 ext4: fix synchronization of quota files in journal=data mode
In journal=data mode, it is not enough to do write_inode_now as done in
vfs_quota_on() to write all data to their final location (which is
needed for quota_read to work correctly).  Calling journal_flush() does
its job.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
cd59e7b978 ext4: Fix mount messages when quota disabled
When quota is disabled, we should not print 'journaled quota not
supported' when user tried to mount non-journaled quota. Also fix typo
in the message.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
dfc5d03f12 ext4: correct mount option parsing to detect when quota options can be changed
We should not allow user to change quota mount options when quota is
just suspended.  It would make mount options and internal quota state
inconsistent.  Also we should not allow user to change quota format when
quota is turned on.  On the other hand we can just silently ignore when
some option is set to the value it already has (mount does this on
remount).

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Josef Bacik
c19204b0ae ext4: don't use ext4_error in ext4_check_descriptors
Because ext4_check_descriptors is called at mount time you can't use ext4_error
as it calls ext4_commit_sb, which since the sb isn't all the way initialized
causes bad things to happen (ie a panic).  This patch changes the ext4_error's
to printk's to keep this problem from happening.  Thanks much,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:00:28 -04:00
Christoph Hellwig
3dcf54515a ext4: move headers out of include/linux
Move ext4 headers out of include/linux.  This is just the trivial move,
there's some more thing that could be done later. 

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 18:13:32 -04:00
Jan Kara
2887df139c ext4: Fix hang on umount with quotas when journal is aborted
Call dquot_drop() from ext4_dquot_drop() even if we fail to start a
transaction. Otherwise we never get to dropping references to quota structures
from the inode and umount will hang indefinitely.  Thanks to Payphone LIOU for
spotting the problem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
CC: Payphone LIOU <lioupayphone@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:02:07 -04:00
Harvey Harrison
46e665e9d2 ext4: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-17 10:38:59 -04:00
Marcin Slusarz
216c34b2b8 ext4: convert byte order of constant instead of variable
Convert byte order of constant instead of variable which can be done at
compile time (vs run time).

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-17 10:38:59 -04:00
Marcin Slusarz
e8546d0615 ext4: le*_add_cpu conversion
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
					expression_in_cpu_byteorder);
with:
	leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: sct@redhat.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: adilger@clusterfs.com
Cc: Mingming Cao <cmm@us.ibm.com>
2008-04-17 10:38:59 -04:00
Josef Bacik
f3f12faa74 ext4: fix mount option parsing
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:05:28 -04:00
Josef Bacik
97bd42b9c8 ext4: check return of ext4_orphan_get properly
This patch fix a panic while running fsfuzzer. 
We are improperly checking the return of ext4_orphan_get.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:04:56 -04:00
Jan Kara
6f28e08794 quota: ext4: make ext4 handle quotaon on remount
Update ext4 to handle quotaon on remount RW.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:33 -07:00
Jan Blunck
1d957f9bf8 Introduce path_put()
* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck
4ac9137858 Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.

Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
  <dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
  struct path in every place where the stack can be traversed
- it reduces the overall code size:

without patch series:
   text    data     bss     dec     hex filename
5321639  858418  715768 6895825  6938d1 vmlinux

with patch series:
   text    data     bss     dec     hex filename
5320026  858418  715768 6894212  693284 vmlinux

This patch:

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Theodore Tso
469108ff3d ext4: Add new "development flag" to the ext4 filesystem
This flag is simply a generic "this is a crash/burn test filesystem"
marker.  If it is set, then filesystem code which is "in development"
will be allowed to mount the filesystem.  Filesystem code which is not
considered ready for prime-time will check for this flag, and if it is
not set, it will refuse to touch the filesystem.

As we start rolling ext4 out to distro's like Fedora, et. al, this makes
it less likely that a user might accidentally start using ext4 on a
production filesystem; a bad thing, since that will essentially make it
be unfsckable until e2fsprogs catches up.

Signed-off-by: Theodore Tso <tytso@MIT.EDU>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-02-10 01:11:44 -05:00
David Howells
1d1fe1ee02 iget: stop EXT4 from using iget() and read_inode()
Stop the EXT4 filesystem from using iget() and read_inode().  Replace
ext4_read_inode() with ext4_iget(), and call that instead of iget().
ext4_iget() then uses iget_locked() directly and returns a proper error code
instead of an inode in the event of an error.

ext4_fill_super() returns any error incurred when getting the root inode
instead of EINVAL.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:27 -08:00
Akinobu Mita
197cd65acc ext[234]: use ext[234]_get_group_desc()
Use ext[234]_get_group_desc() to get group descriptor from group number.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:21 -08:00
Aneesh Kumar K.V
ce40733ce9 ext4: Check for return value from sb_set_blocksize
sb_set_blocksize validates whether the specfied block size can be used by
the file system. Make sure we fail mounting the file system if the
blocksize specfied cannot be used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Miklos Szeredi
cb45bbe44b ext4: Add stripe= option to /proc/mounts
Add stripe= option to /proc/mounts for ext4 filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
3dbd0ede4d ext4: Enable the multiblock allocator by default
Enable the multiblock allocator by default.

Fix ext4_show_options() so if it is not enabled, the nomballoc option
included in /proc/mounts.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-01-28 23:58:26 -05:00
Alex Tomas
c9de560ded ext4: Add multi block allocator for ext4
Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-01-29 00:19:52 -05:00
Aneesh Kumar K.V
aa22df2cc8 ext4: Fix ext4_show_options to show the correct mount options.
We need to look at the default value and make sure
the mount options are not set via default value
before showing them via ext4_show_options

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Jean Noel Cordenner
25ec56b518 ext4: Add inode version support in ext4
This patch adds 64-bit inode version support to ext4. The lower 32 bits
are stored in the osd1.linux1.l_i_version field while the high 32 bits
are stored in the i_version_hi field newly created in the ext4_inode.
This field is incremented in case the ext4_inode is large enough. A
i_version mount option has been added to enable the feature.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
2008-01-28 23:58:27 -05:00
Girish Shilamkar
818d276ceb ext4: Add the journal checksum feature
The journal checksum feature adds two new flags i.e
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

For recovery in pass scan checksums are verified to ensure the sanity
and completeness(in case of _ASYNC_COMMIT) of every transaction.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
0e855ac8b1 ext4: Convert truncate_mutex to read write semaphore.
We are currently taking the truncate_mutex for every read. This would have
performance impact on large CPU configuration. Convert the lock to read write
semaphore and take read lock when we are trying to read the file.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:26 -05:00
Aneesh Kumar K.V
bb4f397a1a ext4: Change the default behaviour on error
ext4 file system was by default ignoring errors and continuing. This
is not a good default as continuing on error could lead to file system
corruption. Change the default to mark the file system
readonly. Debian and ubuntu already does this as the default in their
fstab.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:26 -05:00
Eric Sandeen
e7c9559300 ext4: fix oops on corrupted ext4 mount
When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.

Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext4.  If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:

        blocks_count = (ext4_blocks_count(es) -
                        le32_to_cpu(es->s_first_data_block) +
                        EXT4_BLOCKS_PER_GROUP(sb) - 1);

This is then assigned to s_groups_count which is an unsigned long:

        sbi->s_groups_count = blocks_count;

This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:

        db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
                   EXT4_DESC_PER_BLOCK(sb);

and in this case db_count will wind up as 0 because the addition overflows
32 bits.  This in turn causes the kmalloc for group_desc to be of 0 size:

        sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
                                    GFP_KERNEL);

and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.

The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count.  We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Adrian Bunk
07620f69ef ext4/super.c: fix #ifdef's (CONFIG_EXT4_* -> CONFIG_EXT4DEV_*)
Based on a report by Robert P. J. Day.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-01-28 23:58:27 -05:00
Eric Sandeen
e2b4657453 ext4: store maxbytes for bitmapped files and return EFBIG as appropriate
Calculate & store the max offset for bitmapped files, and
catch too-large seeks, truncates, and writes in ext4, shortening
or rejecting as appropriate.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Eric Sandeen
cd2291a463 ext4: different maxbytes functions for bitmap & extent files
use 2 different maxbytes functions for bitmapped & extent-based
files.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
8180a5627d ext4: Support large files
This patch converts ext4_inode i_blocks to represent total
blocks occupied by the inode in file system block size.
Earlier the variable used to represent this in 512 byte
block size. This actually limited the total size of the file.

The feature is enabled transparently when we write an inode
whose i_blocks cannot be represnted as 512 byte units in a
48 bit variable.

inode flag  EXT4_HUGE_FILE_FL

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
0fc1b45147 ext4: Add support for 48 bit inode i_blocks.
Use the __le16 l_i_reserved1 field of the linux2 struct of ext4_inode
to represet the higher 16 bits for i_blocks. With this change max_file
size becomes (2**48 -1 )* 512 bytes.

We add a RO_COMPAT feature to the super block to indicate that inode
have i_blocks represented as a split 48 bits. Super block with this
feature set cannot be mounted read write on a kernel with CONFIG_LSF
disabled.

Super block flag EXT4_FEATURE_RO_COMPAT_HUGE_FILE

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:26 -05:00
Aneesh Kumar K.V
1d03ec984c ext4: Fix sparse warnings.
Fix sparse warnings related to static functions
and local variables.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
99e6f829a8 ext4: Introduce ext4_update_*_feature
Introduce ext4_update_*_feature and use them instead
of opencoding.


Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Avantika Mathur
fd2d42912f ext4: add ext4_group_t, and change all group variables to this type.
In many places variables for block group are of type int, which limits the
maximum number of block groups to 2^31.  Each block group can have up to
2^15 blocks, with a 4K block size,  and the max filesystem size is limited to
2^31 * (2^15 * 2^12) = 2^58  -- or 256 PB

This patch introduces a new type ext4_group_t, of type unsigned long, to
represent block group numbers in ext4.
All occurrences of block group variables are converted to type ext4_group_t.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
725d26d3f0 ext4: Introduce ext4_lblk_t
This patch adds a new data type ext4_lblk_t to represent
the logical file blocks.

This is the preparatory patch to support large files in ext4
The follow up patch with convert the ext4_inode i_blocks to
represent the number of blocks in file system block size. This
changes makes it possible to have a block number 2**32 -1 which
will result in overflow if the block number is represented by
signed long. This patch convert all the block number to type
ext4_lblk_t which is typedef to __u32

Also remove dead code ext4_ext_walk_space

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Takashi Sato
afc7cbca5b ext4: Support large blocksize up to PAGESIZE
This patch set supports large block size(>4k, <=64k) in ext4,
just enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext4 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is treat 64k rec_len
with a an impossible value like rec_len = 0xffff to handle this.

The Patch-set consists of the following 2 patches.
  [1/2]  ext4: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/2]  ext4: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext4dev, and able to handle empty directory block.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Andries E. Brouwer
b47b6f38e5 ext3, ext4: avoid divide by zero
As it turns out, the kernel divides by EXT3_INODES_PER_GROUP(s) when
mounting an ext3 filesystem.  If that number is zero, a crash follows.
Below a patch.

This crash was reported by Joeri de Ruiter, Carst Tankink and Pim Vullers.

Cc: <linux-ext4@vger.kernel.org>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
Christoph Hellwig
3965516440 exportfs: make struct export_operations const
Now that nfsd has stopped writing to the find_exported_dentry member we an
mark the export_operations const

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22 08:13:21 -07:00
Christoph Hellwig
1b961ac05a ext4: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22 08:13:20 -07:00
Aneesh Kumar K.V
308ba3ece7 ext4: Convert s_r_blocks_count and s_free_blocks_count
Convert s_r_blocks_count and s_free_blocks_count to
s_r_blocks_count_lo and s_free_blocks_count_lo

This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
6bc9feff14 ext4: Convert s_blocks_count to s_blocks_count_lo
Convert s_blocks_count to s_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
5272f83727 ext4: Convert bg_inode_bitmap and bg_inode_table
Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
and bg_inode_table_lo.  This helps in finding BUGs due to
direct partial access of these split 64 bit values

Also fix one direct partial access

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
3a14589cce ext4: Convert bg_block_bitmap to bg_block_bitmap_lo
Convert bg_block_bitmap to bg_block_bitmap_lo
This helps in catching some BUGS due to direct
partial access of these split fields.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:01 -04:00
Jose R. Santos
ce42158179 ext4: FLEX_BG Kernel support v2.
This feature relaxes check restrictions on where each block groups meta
data is located within the storage media.  This allows for the allocation
of bitmaps or inode tables outside the block group boundaries in cases
where bad blocks forces us to look for new blocks which the owning block
group can not satisfy.  This will also allow for new meta-data allocation
schemes to improve performance and scalability.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Andreas Dilger
717d50e497 Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use.  This is this the most time consuming part
of the filesystem check.  The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a a high water mark of used inodes for each block
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem.  In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users.  With performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */

bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.

bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:00 -04:00
Coly Li
f077d0d7ea ext4: Remove (partial, never completed) fragment support
Fragment support in ext2/3/4 was never implemented, and it probably will
never be implemented.   So remove it from ext4.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:59 -04:00
Mingming Cao
cd02ff0b14 jbd2: JBD_XXX to JBD2_XXX naming cleanup
change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:58 -04:00
vignesh babu
d8ea6cf899 ext2/4: use is_power_of_2()
Replace n & (n - 1) with is_power_of_2(n)

Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Miklos Szeredi
d9c9bef134 ext4: show all mount options
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Fengguang Wu
e57aa839ce convert ill defined log2() to ilog2()
It's *wrong* to have
			#define log2(n) ffz(~(n))
It should be *reversed*:
			#define log2(n) flz(~(n))
or
			#define log2(n) fls(n)
or just use
			ilog2(n) defined in linux/log2.h.

This patch follows the last solution, recommended by Andrew Morton.

Cc: <linux-ext4@vger.kernel.org>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Chris Ahna <christopher.j.ahna@intel.com>
Cc: David Mosberger-Tang <davidm@hpl.hp.com>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Christoph Lameter
4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
833f4077bf lib: percpu_counter_init error handling
alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
52d9f3b409 lib: percpu_counter_sum_positive
s/percpu_counter_sum/&_positive/

Because its consitent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Jan Kara
9c3013e9b9 quota: fix infinite loop
If we fail to start a transaction when releasing dquot, we have to call
dquot_release() anyway to mark dquot structure as inactive.  Otherwise we
end in an infinite loop inside dqput().

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: xb <xavier.bru@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:19 -07:00
Eric Sandeen
780dcdb211 fix inode_table test in ext234_check_descriptors
ext[234]_check_descriptors sanity checks block group descriptor geometry at
mount time, testing whether the block bitmap, inode bitmap, and inode table
reside wholly within the blockgroup.  However, the inode table test is off
by one so that if the last block in the inode table resides on the last
block of the block group, the test incorrectly fails.  This is because it
tests the last block as (start + length) rather than (start + length - 1).

This can be seen by trying to mount a filesystem made such as:

 mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024

which yields:

 EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 not in group (block 101)!
 EXT2-fs: group descriptors corrupted!

There is a similar bug in e2fsprogs, patch already sent for that.

(I wonder if inside(), outside(), and/or in_range() should someday be
used in this and other tests throughout the ext filesystems...)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26 11:35:17 -07:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Vignesh Babu
1330593eb2 ext4: Use is_power_of_2()
Replace (n & (n-1)) in the context of power of 2 checks with
is_power_of_2()

Signed-off-by: Vignesh Babu <vignesh.babu@wipro.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:11:02 -04:00
Kalpak Shah
ef7f38359e ext4: Add nanosecond timestamps
This patch adds nanosecond timestamps for ext4. This involves adding
*time_extra fields to the ext4_inode to extend the timestamps to
64-bits.  Creation time is also added by this patch.

These extended fields will fit into an inode if the filesystem was
formatted with large inodes (-I 256 or larger) and there are currently
no EAs consuming all of the available space. For new inodes we always
reserve enough space for the kernel's known extended fields, but for
inodes created with an old kernel this might not have been the case. So
this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature
flag(ro-compat so that older kernels can't create inodes with a smaller
extra_isize). which indicates if the fields fitting inside
s_min_extra_isize are available or not.  If the expansion of inodes if
unsuccessful then this feature will be disabled.  This feature is only
enabled if requested by the sysadmin.

None of the extended inode fields is critical for correct filesystem
operation.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:15:20 -04:00
Jose R. Santos
eb40a09c67 ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
than 32bit block sizes during mount time.  This ensure proper record
lenth when writing to the journal.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 08:37:25 -04:00
Mingming Cao
1e2462f93e ext4: Enable extents by default
Turn on extents feature by default in ext4 filesystem, to get wider
testing of extents feature in ext4dev.  This can be disabled using 
-o noextents.  

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:00:55 -04:00
Christoph Hellwig
a569425512 knfsd: exportfs: add exportfs.h header
currently the export_operation structure and helpers related to it are in
fs.h.  fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.

[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:06 -07:00
Badari Pulavarty
5e70030d4c ext4: statfs speed up
This is a patch that speeds up statfs.  It is very simple - the "overhead"
calculation, which takes a huge amount of time for large filesystems, never
changes unless the size of the filesystem itself changes.  That means we can
store it in memory and only recalculate if the filesystem has been resized
(almost never).

It also fixes a minor problem that we never update the on-disk superblock free
blocks/inodes counts until the filesystem is unmounted.  While not fatal, we
may as well update that on disk when we have the information, and it makes
things like debugfs and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:52 -07:00
Borislav Petkov
6c675bd43c ext4: fix error handling in ext4_create_journal
Fix error handling in ext4_create_journal according to kernel conventions.

Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:51 -07:00
Jan Kara
32c3773011 ext4: fix deadlock in ext4_remount() and orphan list handling
ext4_orphan_add() and ext4_orphan_del() functions lock sb->s_lock with a
transaction started with ext4_mark_recovery_complete() waits for a transaction
holding sb->s_lock, thus leading to a possible deadlock.  At the moment we
call ext4_mark_recovery_complete() from ext4_remount() we have done all the
work needed for remounting and thus we are safe to drop sb->s_lock before we
wait for transactions to commit.  Note that at this moment we are still
guarded by s_umount lock against other remounts/umounts.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Sandeen <sandeen@sandeen.net>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:48 -07:00
Vasily Averin
9f7dd93de0 ext3/ext4: orphan list check on destroy_inode
Customers claims to ext3-related errors, investigation showed that ext3
orphan list has been corrupted and have the reference to non-ext3 inode.
The following debug helps to understand the reasons of this issue.

[akpm@linux-foundation.org: update for print_hex_dump() changes]
Signed-off-by: Vasily Averin <vvs@sw.ru>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:46 -07:00
Dave Kleikamp
8c55e20411 EXT4: Fix whitespace
Replace a lot of spaces with tabs

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-05-31 16:20:14 -04:00
Christoph Lameter
a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter
50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Peter Zijlstra
f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Josef 'Jeff' Sipek
ee9b6d61a2 [PATCH] Mark struct super_operations const
This patch is inspired by Arjan's "Patch series to mark struct
file_operations and struct inode_operations const".

Compile tested with gcc & sparse.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:47 -08:00
Hugh Dickins
2e7842b887 [PATCH] fix umask when noACL kernel meets extN tuned for ACLs
Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or
ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a
kernel built without ACL support, then umask was ignored when creating
inodes - though root or user has umask 022, touch creates files as 0666,
and mkdir creates directories as 0777.

This appears to have worked right until 2.6.11, when a fix to the default
mode on symlinks (always 0777) assumed VFS applies umask: which it does,
unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
s_flags according to s_mount_opt set according to def_mount_opts.

We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
but other filesystems only set MS_POSIXACL when ACLs are configured.  We
could fix this at another level; but it seems most robust to avoid setting
the s_mount_opt flag in the first place (at the expense of more ifdefs).

Likewise don't set the XATTR_USER flag when built without XATTR support.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen
ead6596b9e [PATCH] ext4: refuse ro to rw remount of fs with orphan inodes
In the rare case where we have skipped orphan inode processing due to a
readonly block device, and the block device subsequently changes back to
read-write, disallow a remount,rw transition of the filesystem when we have an
unprocessed orphan inodes as this would corrupt the list.

Ideally we should process the orphan inode list during the remount, but that's
trickier, and this plugs the hole for now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen
a8f48a9561 [PATCH] ext3/4: don't do orphan processing on readonly devices
If you do something like:

  # touch foo
  # tail -f foo &
  # rm foo
  # <take snapshot>
  # <mount snapshot>

you'll panic, because ext3/4 tries to do orphan list processing on the
readonly snapshot device, and:

  kernel: journal commit I/O error
  kernel: Assertion failure in journal_flush_Rsmp_e2f189ce() at journal.c:1356: "!journal->j_checkpoint_transactions"
  kernel: Kernel panic: Fatal exception

for a truly readonly underlying device, it's reasonable and necessary
to just skip orphan list processing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Pekka Enberg
960cc398a7 [PATCH] ext4: fsid for statvfs
Update ext4_statfs to return an FSID that is a 64 bit XOR of the 128 bit
filesystem UUID as suggested by Andreas Dilger.  See the following Bugzilla
entry for details:

  http://bugzilla.kernel.org/show_bug.cgi?id=136

Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:31 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter
e6b4f8da3a [PATCH] slab: remove SLAB_NOFS
SLAB_NOFS is an alias of GFP_NOFS.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Andrew Morton
63f5793351 [PATCH] ext4 whitespace cleanups
Someone's tab key is emitting spaces.  Attempt to repair some of the damage.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Dmitry Mishin
ceea16bf85 [PATCH] ext4: errors behaviour fix
Current error behaviour for ext2 and ext3 filesystems does not fully
correspond to the documentation and should be fixed.

According to man 8 mount, ext2 and ext3 file systems allow to set one of 3
different on-errors behaviours:

  ---- start of quote man 8 mount ----

  errors=continue / errors=remount-ro / errors=panic

    Define the behaviour when an error is encountered.  (Either ignore
    errors and just mark the file system erroneous and continue, or remount
    the file system read-only, or panic and halt the system.) The default is
    set in the filesystem superblock, and can be changed using tune2fs(8).

  ---- end of quote ----

However EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
ERRORS_CONT is not saved on the sbi->s_mount_opt.  It leads to the incorrect
handle of errors on ext3.

Then we've checked corresponding code in ext2 and discovered that it is buggy
as well:

- EXT2_ERRORS_CONTINUE is not read from the superblock (the same);

- parse_option() does not clean the alternative values and thus something
  like (ERRORS_CONT|ERRORS_RO) can be set;

- if options are omitted, parse_option() does not set any of these options.

Therefore it is possible to set any combination of these options on the ext2:

- none of them may be set: EXT2_ERRORS_CONTINUE on superblock / empty mount
  options;

- any of them may be set using mount options;

- 2 any options may be set: by using EXT2_ERRORS_RO/EXT2_ERRORS_PANIC on the
  superblock and other value in mount options;

- and finally all three options may be set by adding third option in remount.

Currently ext2 uses these values only in ext2_error() and it is not leading to
any noticeable troubles.  However somebody may be discouraged when he will try
to workaround EXT2_ERRORS_PANIC on the superblock by using errors=continue in
mount options.

This patch:

EXT4_ERRORS_CONTINUE should be taken from the superblock as default value for
error behaviour.

Signed-off-by:	Dmitry Mishin <dim@openvz.org>
Acked-by: Vasily Averin <vvs@sw.ru>
Acked-by: Kirill Korotaev <dev@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Andrew Morton
70bbb3e0a0 [PATCH] ext4: rename logic_sb_block
I assume this means "logical sb block".  So call it that.

I still don't understand the name though.  A block is a block.  What's
different about this one?

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:18 -07:00
Andrew Morton
f4e5bc244f [PATCH] ext4 64 bit divide fix
With CONFIG_LBD=n, sector_div() expands to a plain old divide.  But ext4 is
_not_ passing in a sector_t as the first argument, so...

fs/built-in.o: In function `ext4_get_group_no_and_offset':
fs/ext4/balloc.c:39: undefined reference to `__umoddi3'
fs/ext4/balloc.c:41: undefined reference to `__udivdi3'
fs/built-in.o: In function `find_group_orlov':
fs/ext4/ialloc.c:278: undefined reference to `__udivdi3'
fs/built-in.o: In function `ext4_fill_super':
fs/ext4/super.c:1488: undefined reference to `__udivdi3'
fs/ext4/super.c:1488: undefined reference to `__umoddi3'
fs/ext4/super.c:1594: undefined reference to `__udivdi3'
fs/ext4/super.c:1601: undefined reference to `__umoddi3'

Fix that up by calling do_div() directly.

Also cast the arg to u64.  do_div() is only defined on u64, and ext4_fsblk_t
is supposed to be opaque.

Note especially the changes to find_group_orlov().  It was attempting to do

	do_div(int, unsigned long long);

which is royally screwed up.  Switched it to plain old divide.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:18 -07:00
Alexandre Ratchov
8fadc14323 [PATCH] ext4: move block number hi bits
move '_hi' bits of block numbers in the larger part of the
block group descriptor structure

Signed-off-by: Alexandre Ratchov <alexandre.ratchov@bull.net>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:18 -07:00
Alexandre Ratchov
0d1ee42f27 [PATCH] ext4: allow larger descriptor size
make block group descriptor larger.

Signed-off-by: Alexandre Ratchov <alexandre.ratchov@bull.net>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:18 -07:00
Mingming Cao
2ae0210760 [PATCH] ext4: blk_type from sector_t to unsigned long long
Change ext4 in-kernel block type (ext4_fsblk_t) from sector_t to unsigned
long long.  Remove ext4 block type string micro E3FSBLK, replaced with "%llu"

[akpm@osdl.org: build fix]
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:18 -07:00
Laurent Vivier
bd81d8eec0 [PATCH] ext4: 64bit metadata
In-kernel super block changes to support >32 bit free blocks numbers.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Alexandre Ratchov <alexandre.ratchov@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:17 -07:00
Mingming Cao
3a5b2ecdd1 [PATCH] ext4: switch fsblk to sector_t
Redefine ext3 in-kernel filesystem block type (ext3_fsblk_t) from unsigned
long to sector_t, to allow kernel to handle  >32 bit ext3 blocks.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:16 -07:00
Alex Tomas
a86c618126 [PATCH] ext3: add extent map support
On disk extents format:
/*
* this is extent on-disk structure
* it's used at the bottom of the tree
*/
struct ext3_extent {
__le32  ee_block;       /* first logical block extent covers */
__le16  ee_len;         /* number of blocks covered by extent */
__le16  ee_start_hi;    /* high 16 bits of physical block */
__le32  ee_start;       /* low 32 bigs of physical block */
};

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:16 -07:00
Mingming Cao
dab291af8d [PATCH] jbd2: enable building of jbd2 and have ext4 use it rather than jbd
Reworked from a patch by Mingming Cao and Randy Dunlap

Signed-off-By: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:16 -07:00
Mingming Cao
617ba13b31 [PATCH] ext4: rename ext4 symbols to avoid duplication of ext3 symbols
Mingming Cao originally did this work, and Shaggy reproduced it using some
scripts from her.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:15 -07:00
Dave Kleikamp
ac27a0ec11 [PATCH] ext4: initial copy of files from ext3
Start of the ext4 patch series.  See Documentation/filesystems/ext4.txt for
details.

This is a simple copy of the files in fs/ext3 to fs/ext4 and
/usr/incude/linux/ext3* to /usr/include/ex4*

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:15 -07:00