2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

56429 Commits

Author SHA1 Message Date
Olga Kornievskaia
6308bc98e8 NFSD OFFLOAD_STATUS xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
9eb190fca8 NFSD CB_OFFLOAD xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Greg Kroah-Hartman
a38523185b libnvdimm/dax for 4.19-rc6
* (2) fixes for the dax error handling updates that were merged for
 v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
 his ack but the uaccess change is of the trivial / obviously correct
 variety. The address_space_operations fixes a regression.
 
 * A filesystem-dax fix to correct the zero page lookup to be compatible
  with non-x86 (mips and s390) architectures.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbqoecAAoJEB7SkWpmfYgCaPYP/1Pf2ADt0pOskSk0ixM06UI9
 1lR2g2/ICMc505+oB+wUv9TkZOh9jcIS9o8GfLhgNvP7AU4woRvudeyr4NUc0mdw
 rtHRA6TIimbXa3+O2qMg4gqUjXRxj6urQp5oeQi8mQ/vefZv1aisEw/Klae8klVC
 HGoMFii19WGXXyPM2vNUb2L+JGZt1p/nl/Z8ydPavn1XkIIGb7c+MivDiaemjjgT
 487TmFULgLVhTCtQXlhkH7UCcCQ3+l3yxKaS1/g2hFpWE4LncBIvq8XBPwf5RQSL
 H/5rH/sd30XR2L0NMERxr0ENvCJf2iIf4bIqckODN4L9ojRE8zmBZMsSeRKmHufm
 3ZfTBLHjPUwwKWy7PlKSsFk2J8KjErsqRlfZQSMSJShpEgL1jCjYtuTEtupaegbU
 v8TwzsNDgJ1B6JuKJ7lh7hOF7vUQ/i65xG8SFACvNoiih8RGSW3llra442k2hmFu
 IEMXa9S4tvqHfXUb0J/6hLLi+xoV+KsYPWRiCuovy7t6EfAWUnNuGCldjfsQtZZv
 npHS7F7lkWlSCneDbE4cMdkkwjBKjAw0sjIWDrPVVCoITVe1j9+bEwE9reX1VOS+
 L+PB/WgcVH72MeiQnmPPTcUyEmNgCbku7NEhwrMHJngxuo9HmXE+BN8jdFDYZFfL
 uWDV25XOxviOC9xBosw4
 =mAaH
 -----END PGP SIGNATURE-----

erge tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "libnvdimm/dax for 4.19-rc6

  * (2) fixes for the dax error handling updates that were merged for
  v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
  his ack but the uaccess change is of the trivial / obviously correct
  variety. The address_space_operations fixes a regression.

  * A filesystem-dax fix to correct the zero page lookup to be compatible
   with non-x86 (mips and s390) architectures."

* tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: Add missing address_space_operations
  uaccess: Fix is_source param for check_copy_size() in copy_to_iter_mcsafe()
  filesystem-dax: Fix use of zero page
2018-09-25 21:37:41 +02:00
Wei Yongjun
69383c5913 ovl: make symbol 'ovl_aops' static
Fixes the following sparse warning:

fs/overlayfs/inode.c:507:39: warning:
 symbol 'ovl_aops' was not declared. Should it be static?

Fixes: 5b910bd615 ("ovl: fix GPF in swapfile_activate of file from overlayfs over xfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-25 20:41:23 +02:00
Chengguang Xu
2aad26fa0a ext2: remove redundant building macro check
If macro CONFIG_QUOTA is not enabled then mount option flag
of usrquota/grpquota will not be set, so we can remove some
building macro check safely in ext2_shwo_options().
Additionally, I think it's better to define EXT2_MOUNT_DAX
regardless macro CONFIG_FS_DAX is enabled just like acl/xattr.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-24 21:34:15 +02:00
Amir Goldstein
a725356b66 vfs: swap names of {do,vfs}_clone_file_range()
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.

The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.

It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.

Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
d9d150ae50 ovl: fix freeze protection bypass in ovl_clone_file_range()
Tested by doing clone on overlayfs while upper xfs+reflink is frozen:

  xfs_io -f /ovl/y
                             fsfreeze -f /xfs
  xfs_io> reflink /ovl/x

Before the fix xfs_io enters xfs_reflink_remap_range() and blocks
in xfs_trans_alloc(). After the fix, xfs_io blocks outside xfs code
in ovl_clone_file_range().

Fixes: 8ede205541 ("ovl: add reflink/copyfile/dedup support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
898cc19d8a ovl: fix freeze protection bypass in ovl_write_iter()
Tested by re-writing to an open overlayfs file while upper ext4 is frozen:

  xfs_io -f /ovl/x
  xfs_io> pwrite 0 4096
                             fsfreeze -f /ext4
  xfs_io> pwrite 0 4096

  WARNING: CPU: 0 PID: 1492 at fs/ext4/ext4_jbd2.c:53 \
           ext4_journal_check_start+0x48/0x82

After the fix, the second write blocks in ovl_write_iter() and avoids
hitting WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE) in
ext4_journal_check_start().

Fixes: 2a92e07edc ("ovl: add ovl_write_iter()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
63e1325280 ovl: fix memory leak on unlink of indexed file
The memory leak was detected by kmemleak when running xfstests
overlay/051,053

Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Dennis Zhou (Facebook)
bdc2491708 blkcg: associate writeback bios with a blkg
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg. In
this patch, the wbc_init_bio call is changed such that it must be called
after a queue has been associated with the bio.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:11 -06:00
Greg Kroah-Hartman
0eba8697bc This pull request contains fixes for UBIFS:
- A wrong UBIFS assertion in mount code
 - Fix for a NULL pointer deref in mount code
 - Revert of a bad fix for xattrs
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlukuf8WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wRnyD/40jc4PA1pP3dJcXqdqwSeN+vJZ
 YWu307rE+giWWg6vRKa8zu3z614579UC7EE/I9miSrPNPYCi9VTLNzJGNMvn8aSf
 8seMrbhW0kz29T4YZ5fESr6x3hMsSdmIWZkPobruul8oBGmNcUm3Ag3MBM4XbQTm
 5AcWUdXVQUXPROK/Xm04hNa5pJ6wkSu5CxA+mf9YM2ZYYPPdCblxxq3NsDX99vDw
 gZup7JndSzpk+IpZMQIEKUP83vh5YgCU06YPjXi5YIfXl+sQnpH0fyrWm3+1C0Yz
 4T80rd3fm+7+obXdAI29mhcoZ09fGayozfqjU/fV3edW43+pUvRWced13vuLqz9Q
 JannCFeN+v/nB6OPlj4JGkgNunZ5M3r6ZZIQBbY9RJuZgG0/FMbpYb62RY1z/jg9
 YA7e3J/R02i7tHnPbcIz6ngKE0c42VKxTdATaybXVunx5ZPip7txj9OeJbfIYUrr
 CbLMdiqIJUAAheHMUrTiF3jDNwEiMhBZA4dAGDvLmpLuyfMlN8Lwje9BdRd5VNKp
 zwEGUQkDWLxniFYLtmbceFSa4IQT6z70XJn1pqdDvgjxfp8trEqhtp9GMxA2u6TA
 adcug5gUzRWMQ2QuyoKIv6tkjmnII5N2KzhZH9hwBReRJUlKY9WQ9jowcGrPXhbw
 F2prIwBxp8drneOeKg==
 =CGLq
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs

Richard writes:
  "This pull request contains fixes for UBIFS:
   - A wrong UBIFS assertion in mount code
   - Fix for a NULL pointer deref in mount code
   - Revert of a bad fix for xattrs"

* tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs:
  Revert "ubifs: xattr: Don't operate on deleted inodes"
  ubifs: drop false positive assertion
  ubifs: Check for name being NULL while mounting
2018-09-21 15:29:44 +02:00
Chao Yu
dc4cd1257c f2fs: fix to recover inode's uid/gid during POR
Step to reproduce this bug:
1. logon as root
2. mount -t f2fs /dev/sdd /mnt;
3. touch /mnt/file;
4. chown system /mnt/file; chgrp system /mnt/file;
5. xfs_io -f /mnt/file -c "fsync";
6. godown /mnt;
7. umount /mnt;
8. mount -t f2fs /dev/sdd /mnt;

After step 8) we will expect file's uid/gid are all system, but during
recovery, these two fields were not been recovered, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:50:25 -07:00
Jaegeuk Kim
f84262b086 f2fs: avoid infinite loop in f2fs_alloc_nid
If we have an error in f2fs_build_free_nids, we're able to fall into a loop
to find free nids.

Suggested-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:49:19 -07:00
Junxiao Bi
234b69e3e0 ocfs2: fix ocfs2 read block panic
While reading block, it is possible that io error return due to underlying
storage issue, in this case, BH_NeedsValidate was left in the buffer head.
Then when reading the very block next time, if it was already linked into
journal, that will trigger the following panic.

[203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
[203748.702533] invalid opcode: 0000 [#1] SMP
[203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
[203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
[203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
[203748.703088] RIP: 0010:[<ffffffffa05e9f09>]  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.703130] RSP: 0018:ffff88006ff4b818  EFLAGS: 00010206
[203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
[203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
[203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
[203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
[203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
[203748.705871] FS:  00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
[203748.706370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
[203748.707124] Stack:
[203748.707371]  ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
[203748.707885]  0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
[203748.708399]  00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
[203748.708915] Call Trace:
[203748.709175]  [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[203748.709680]  [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
[203748.710185]  [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
[203748.710691]  [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
[203748.711204]  [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
[203748.711716]  [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
[203748.712227]  [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
[203748.712737]  [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
[203748.713003]  [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2]
[203748.713263]  [<ffffffff8121714b>] vfs_create+0xdb/0x150
[203748.713518]  [<ffffffff8121b225>] do_last+0x815/0x1210
[203748.713772]  [<ffffffff812192e9>] ? path_init+0xb9/0x450
[203748.714123]  [<ffffffff8121bca0>] path_openat+0x80/0x600
[203748.714378]  [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620
[203748.714634]  [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0
[203748.714888]  [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130
[203748.715143]  [<ffffffff81209ffc>] do_sys_open+0x12c/0x220
[203748.715403]  [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180
[203748.715668]  [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190
[203748.715928]  [<ffffffff8120a10e>] SyS_open+0x1e/0x20
[203748.716184]  [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7
[203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
[203748.717505] RIP  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.717775]  RSP <ffff88006ff4b818>

Joesph ever reported a similar panic.
Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html

Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:12 +02:00
Dominique Martinet
a1b3d2f217 fs/proc/kcore.c: fix invalid memory access in multi-page read optimization
The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.

There is no choice but to run through the list of entries for every
address if we did not find any entry in the previous iteration

Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
Fixes: bf991c2231 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Richard Weinberger
f061c1cc40 Revert "ubifs: xattr: Don't operate on deleted inodes"
This reverts commit 11a6fc3dc7.
UBIFS wants to assert that xattr operations are only issued on files
with positive link count. The said patch made this operations return
-ENOENT for unlinked files such that the asserts will no longer trigger.
This was wrong since xattr operations are perfectly fine on unlinked
files.
Instead the assertions need to be fixed/removed.

Cc: <stable@vger.kernel.org>
Fixes: 11a6fc3dc7 ("ubifs: xattr: Don't operate on deleted inodes")
Reported-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:41 +02:00
Sascha Hauer
d3bdc016c5 ubifs: drop false positive assertion
The following sequence triggers

	ubifs_assert(c, c->lst.taken_empty_lebs > 0);

at the end of ubifs_remount_fs():

mount -t ubifs /dev/ubi0_0 /mnt
echo 1 > /sys/kernel/debug/ubifs/ubi0_0/ro_error
umount /mnt
mount -t ubifs -o ro /dev/ubix_y /mnt
mount -o remount,ro /mnt

The resulting

UBIFS assert failed in ubifs_remount_fs at 1878 (pid 161)

is a false positive. In the case above c->lst.taken_empty_lebs has
never been changed from its initial zero value. This will only happen
when the deferred recovery is done.

Fix this by doing the assertion only when recovery has been done
already.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Richard Weinberger
37f31b6ca4 ubifs: Check for name being NULL while mounting
The requested device name can be NULL or an empty string.
Check for that and refuse to continue. UBIFS has to do this manually
since we cannot use mount_bdev(), which checks for this condition.

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: syzbot+38bd0f7865e5c6379280@syzkaller.appspotmail.com
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Chao Yu
1390643d1d jfs: remove redundant dquot_initialize() in jfs_evict_inode()
We don't need to call dquot_initialize() twice in jfs_evict_inode(),
remove one of them for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-20 09:28:49 -05:00
Sahitya Tummala
a7d10cf3e4 f2fs: add new idle interval timing for discard and gc paths
This helps to control the frequency of submission of discard and
GC requests independently, based on the need.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-19 15:55:14 -07:00
Toshi Kani
9e796c9db9 ext2, dax: set ext2_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext2_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext2_dax_aops', since S_DAX flag is set after
ext2_set_aops() in the open path.

Similar to ext4, change ext2_iget() to initialize i_flags before
ext2_set_aops().

Fixes: fb094c9074 ("ext2, dax: introduce ext2_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-19 15:03:04 +02:00
Andreas Gruenbacher
ffc4c92227 sysfs: Do not return POSIX ACL xattrs via listxattr
Commit 786534b92f introduced a regression that caused listxattr to
return the POSIX ACL attribute names even though sysfs doesn't support
POSIX ACLs.  This happens because simple_xattr_list checks for NULL
i_acl / i_default_acl, but inode_init_always initializes those fields
to ACL_NOT_CACHED ((void *)-1).  For example:
    $ getfattr -m- -d /sys
    /sys: system.posix_acl_access: Operation not supported
    /sys: system.posix_acl_default: Operation not supported
Fix this in simple_xattr_list by checking if the filesystem supports POSIX ACLs.

Fixes: 786534b92f ("tmpfs: listxattr should include POSIX ACL xattrs")
Reported-by:  Marc Aurèle La France <tsi@tuyoix.net>
Tested-by: Marc Aurèle La France <tsi@tuyoix.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-18 07:30:48 -04:00
Greg Kroah-Hartman
ad3273d5f1 Various ext4 bug fixes; primarily making ext4 more robust against
maliciously crafted file systems, and some DAX fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlufGncACgkQ8vlZVpUN
 gaPwuQf9FKp9yRvjBkjtnH3+s4Ps8do9r067+90y1k2DJMxKoaBUhGSW2MJJ04j+
 5F6Ndp/TZHw+LfPnzsqlrAAoP3CG5+kacfJ7xeVKR0umvACm6rLMsCUct7/rFoSl
 PgzCALFIJvQ9+9shuO9qrgmjJrfrlTVUgR9Mu3WUNEvMFbMjk3FMI8gi5kjjWemE
 G9TDYH2lMH2sL0cWF51I2gOyNXOXrihxe+vP7j6i/rUkV+YLpKZhE1ss3Sfn6pR2
 p/KjnXdupLJpgYLJne9kMrq2r8xYmDfA0S+Dec7nkox5FUOFUHssl3+q8C7cDwO9
 zl6VyVFwybjFRJ/Y59wox6eqVPlIWw==
 =1P1w
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Ted writes:
	Various ext4 bug fixes; primarily making ext4 more robust against
	maliciously crafted file systems, and some DAX fixes.

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, dax: set ext4_dax_aops for dax files
  ext4, dax: add ext4_bmap to ext4_dax_aops
  ext4: don't mark mmp buffer head dirty
  ext4: show test_dummy_encryption mount option in /proc/mounts
  ext4: close race between direct IO and ext4_break_layouts()
  ext4: fix online resizing for bigalloc file systems with a 1k block size
  ext4: fix online resize's handling of a too-small final block group
  ext4: recalucate superblock checksum after updating free blocks/inodes
  ext4: avoid arithemetic overflow that can trigger a BUG
  ext4: avoid divide by zero fault when deleting corrupted inline directories
  ext4: check to make sure the rename(2)'s destination is not freed
  ext4: add nonstring annotations to ext4.h
2018-09-17 09:13:47 +02:00
Bernd Edlinger
a75e78f21f kernfs: Fix range checks in kernfs_get_target_path
The terminating NUL byte is only there because the buffer is
allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the
range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the
returned string may not be zero-terminated if it is exactly
PATH_MAX characters long.  Furthermore also the initial loop
may theoretically exceed PATH_MAX and cause a fault.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-16 22:38:57 +02:00
Toshi Kani
cce6c9f7e6 ext4, dax: set ext4_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext4_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext4_dax_aops', since S_DAX flag is set after
ext4_set_aops() in the open path.

  New file
  --------
  lookup_open
    ext4_create
      __ext4_new_inode
        ext4_set_inode_flags   // Set S_DAX flag
      ext4_set_aops            // Set aops to ext4_dax_aops

  Existing file
  -------------
  lookup_open
    ext4_lookup
      ext4_iget
        ext4_set_aops          // Set aops to ext4_da_aops
        ext4_set_inode_flags   // Set S_DAX flag

Change ext4_iget() to initialize i_flags before ext4_set_aops().

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:37:59 -04:00
Toshi Kani
94dbb63117 ext4, dax: add ext4_bmap to ext4_dax_aops
Ext4 mount path calls .bmap to the journal inode. This currently
works for the DAX mount case because ext4_iget() always set
'ext4_da_aops' to any regular files.

In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4
DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for
DAX inodes.

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:23:41 -04:00
Li Dongyang
fe18d64989 ext4: don't mark mmp buffer head dirty
Marking mmp bh dirty before writing it will make writeback
pick up mmp block later and submit a write, we don't want the
duplicate write as kmmpd thread should have full control of
reading and writing the mmp block.
Another reason is we will also have random I/O error on
the writeback request when blk integrity is enabled, because
kmmpd could modify the content of the mmp block(e.g. setting
new seq and time) while the mmp block is under I/O requested
by writeback.

Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2018-09-15 17:11:25 -04:00
Eric Biggers
338affb548 ext4: show test_dummy_encryption mount option in /proc/mounts
When in effect, add "test_dummy_encryption" to _ext4_show_options() so
that it is shown in /proc/mounts and other relevant procfs files.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-15 14:28:26 -04:00
Linus Torvalds
3a5af36b6d fixes for four CIFS/SMB3 potential pointer overflow issues, one minor build fix, and a build warning cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlucKDwACgkQiiy9cAdy
 T1EhEgwAqgVXTujce2UVPtaFY/MaGmaIwAimh+aYOCAADxLYJHkjtRzHd5PQgf+L
 n55R2hFcMeelWxOMEb/aRmxIKLk8fmJYVWClM2+S7Z79M3GHexDbMS8+oDZnzCTB
 EknvaTbi+vOt4HGABkbJ/jiQCgonmeobon30gLWaYa3XGeYc7ZV5gR+EXL9xSdvh
 +I+x3rSDpm8fQ5njkB7RKgfB+ha4NQqZ6dXlYQzcb0vMO3/lhQ56Ypgn95Jlu5UW
 pcLxUFE1do+JeGvIU1it2SCRJ5499g180Rxucl7X1xFBQ44Qss9QOeWkFTZ8784V
 PIYVZMTUqO0Km4H22qXD8lIY5GjDuAXLYM3AddFkJpbKaw6g++ZsUXSfoA5zRIDn
 10edaPK/hDIQeFaV+ySTN5g/Qh3YFnmY4kDL3t3CRZDe4+DTW/+cmrF3sGkhZDQt
 +nDo0JxJsjNnJW6loB5Lb76lygvsng01owSsYSAChjhYwBvCDgp9/D85pDQ/oYkl
 HKD9tiF3
 =YNSx
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Fixes for four CIFS/SMB3 potential pointer overflow issues, one minor
  build fix, and a build warning cleanup"

* tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: read overflow in is_valid_oplock_break()
  cifs: integer overflow in in SMB2_ioctl()
  CIFS: fix wrapping bugs in num_entries()
  cifs: prevent integer overflow in nxt_dir_entry()
  fs/cifs: require sha512
  fs/cifs: suppress a string overflow warning
2018-09-14 19:33:42 -10:00
Linus Torvalds
589109df31 NFS client bugfixes for Linux 4.19
Stable bugfixes:
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.11+: Fix an infinite loop on I/O
 
 Other fixes:
 - Return errors if a waiting layoutget is killed
 - Don't open code clearing of delegation state
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlucGhMACgkQ18tUv7Cl
 QOsMCA/9ET0bbzus25DWcnbT10bpEdtW4p6dR/2ztGqUGe0FVyCyVT70jBKFbnAL
 cO1pqElKLj7TMPhTsxK63dDxXGELXCqDtmsHELaD8jf0h6270KAPariJBQ5+N0ud
 5U5CXswW/zbQekE9GMTDQtAAGBzfht33PavFt2+5oVYTAQ5K6Pwvq2qMoifQxMlk
 wjtVjypz0QjBy5bHBO6XGxX58JIc23EwA0/KDS4cU3vkwDXmEZcVYIUdqJF4gtCz
 JdJdnT4b9ebtbdHENx8rkot3L1VSx6JfW9pPMvxLjxn8IG1rj7zQXtc7kpnoF8RY
 WVGWuf4rn7Zo7YXf11SXNebMYgrljx5/0KcmUtSgSmCCqVUmY1e/a69e0fhkKfxn
 /W/+fYIdC1wG0JXtrCN8eJbGropYj3B8Ln5TBV4LN91hFMI8Lx/4D1lKLoK7RNLJ
 3Q6VmKIhKVMHgK6eAivWyN7X/WzIvAj37a2ix2xSjENUIP+l3ePAekc4f5XGMe8O
 wx6wxgvVSomVrsM565XGcjw+LXUyzlXowS6JhR+Zn6fYmbDkPk0ZMC5HBFidakkB
 YxO7aWjBvyYinOrBMWODeWatt4q50bI6YtQpxxWyDIdRXWbvQDjBILveTh6/sRGQ
 KbA3r6XC/DSMri4Mo/r+92fbBEYV2otnkkYraz04VgBVYYgiJts=
 =C83+
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "These are a handful of fixes for problems that Trond found. Patch #1
  and #3 have the same name, a second issue was found after applying the
  first patch.

  Stable bugfixes:
   - v4.17+: Fix tracepoint Oops in initiate_file_draining()
   - v4.11+: Fix an infinite loop on I/O

  Other fixes:
   - Return errors if a waiting layoutget is killed
   - Don't open code clearing of delegation state"

* tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Don't open code clearing of delegation state
  NFSv4.1 fix infinite loop on I/O.
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
  pNFS: Ensure we return the error if someone kills a waiting layoutget
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
2018-09-14 19:25:28 -10:00
Trond Myklebust
9f0c5124f4 NFS: Don't open code clearing of delegation state
Add a helper for the case when the nfs4 open state has been set to use
a delegation stateid, and we want to revert to using the open stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:27 -04:00
Trond Myklebust
994b15b983 NFSv4.1 fix infinite loop on I/O.
The previous fix broke recovery of delegated stateids because it assumes
that if we did not mark the delegation as suspect, then the delegation has
effectively been revoked, and so it removes that delegation irrespectively
of whether or not it is valid and still in use. While this is "mostly
harmless" for ordinary I/O, we've seen pNFS fail with LAYOUTGET spinning
in an infinite loop while complaining that we're using an invalid stateid
(in this case the all-zero stateid).

What we rather want to do here is ensure that the delegation is always
correctly marked as needing testing when that is the case. So we want
to close the loophole offered by nfs4_schedule_stateid_recovery(),
which marks the state as needing to be reclaimed, but not the
delegation that may be backing it.

Fixes: 0e3d3e5df0 ("NFSv4.1 fix infinite loop on IO BAD_STATEID error")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:11 -04:00
Trond Myklebust
2edaead69e NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
d03360aaf5 pNFS: Ensure we return the error if someone kills a waiting layoutget
If someone interrupts a wait on one or more outstanding layoutgets in
pnfs_update_layout() then return the ERESTARTSYS/EINTR error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
2a534a7473 NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:23:16 -04:00
Al Viro
e21120383f move compat handling of tty ioctls to tty_compat_ioctl()
ioctls that are
	* callable only via tty_ioctl()
	* not driver-specific
	* not demand data structure conversions
	* either always need passing arg as is or always demand compat_ptr()
get intercepted in tty_compat_ioctl() from the very beginning and
redirecter to tty_ioctl().  As the result, their entries in fs/compat_ioctl.c
(some of those had been missing, BTW) got removed, as well as
n_tty_compat_ioctl_helper() (now it's never called with any cmd it would accept).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-14 11:12:17 -04:00
Linus Torvalds
48751b562b overlayfs fixes for 4.19-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
 PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
 PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
 =RnZU
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a regression in the recent file stacking update, reported
  and fixed by Amir Goldstein. The fix is fairly trivial, but involves
  adding a fadvise() f_op and the associated churn in the vfs. As
  discussed on -fsdevel, there are other possible uses for this method,
  than allowing proper stacking for overlays.

  And there's one other fix for a syzkaller detected oops"

* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix oopses in ovl_fill_super() failure paths
  ovl: add ovl_fadvise()
  vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  vfs: add the fadvise() file operation
  Documentation/filesystems: update documentation of file_operations
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  ovl: respect FIEMAP_FLAG_SYNC flag
2018-09-13 19:21:40 -10:00
Linus Torvalds
145ea6f10d pstore fixes for v4.19-rc4:
- Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAluakCIWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJnq1D/9a9PFwxBflN3Wi9G2SMJuCF8y1
 vb+auPyMtk2gpnDnE/GbumwLfB/ZW/1JSBNncCmSnFOjeK3M0m4e0gz0XU2AQaT6
 BTNFZlVe649a6zct59nEQRILG1aPP5UdKv4eE2mJ+XOFEi3DSTMgtsmFhftrngGZ
 Gp8XYaNgf6JIL25fUtfgZd5YiwbR+ZdLZdcd+qlIliJF80rup/QftdgRsPJk0kUS
 b8yE6sWNnP1NDzlUWcAWXy231lewbbUsdYAQ11RLThl8Glogz3hlxLKIbCuKFymd
 IGwSyShnZFczO6qN6Rcvx5o8PIgbTH+1ntQc3s3PNuXOpOmOk7F2YspkAEqLY3//
 4Gy+jphSmg12JAzPy7B1m0UT8nCE0S9qfi7tnQ0Dti0rvu+vS1dyqFlqbTgN9Hb3
 Wfi8u/rNvUT5XqB0afq5v7U6On7mjHfW6PwcwqZJZtEblERqv+DOgRUPjWkack6E
 x64IA0UzlEX2AlWHiBxGcwd55kIvk+xI94VVy2K8QcbWlGeM5Ij7WDMABTkmJn5L
 3I8EwoIpg9qtoabKOywOQuj6m9Dovvo0Dv8juUtHARBUJZC0fEpakTdPZEQoiSBs
 7Vox1UTzY+QqJch1ayEvjmqJNkPsAor7zqfVX4Ub85tuECAEs+MtQDbZCZUTXR7I
 WN5N1QBDpuwC8tupCg==
 =8krP
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "This fixes a 6 year old pstore bug that everyone just got lucky in
  avoiding, likely due only using page-aligned persistent ram regions:

   - Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)"

* tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Fix incorrect persistent ram buffer mapping
2018-09-13 18:52:41 -10:00
Bin Yang
831b624df1 pstore: Fix incorrect persistent ram buffer mapping
persistent_ram_vmap() returns the page start vaddr.
persistent_ram_iomap() supports non-page-aligned mapping.

persistent_ram_buffer_map() always adds offset-in-page to the vaddr
returned from these two functions, which causes incorrect mapping of
non-page-aligned persistent ram buffer.

By default ftrace_size is 4096 and max_ftrace_cnt is nr_cpu_ids. Without
this patch, the zone_sz in ramoops_init_przs() is 4096/nr_cpu_ids which
might not be page aligned. If the offset-in-page > 2048, the vaddr will be
in next page. If the next page is not mapped, it will cause kernel panic:

[    0.074231] BUG: unable to handle kernel paging request at ffffa19e0081b000
...
[    0.075000] RIP: 0010:persistent_ram_new+0x1f8/0x39f
...
[    0.075000] Call Trace:
[    0.075000]  ramoops_init_przs.part.10.constprop.15+0x105/0x260
[    0.075000]  ramoops_probe+0x232/0x3a0
[    0.075000]  platform_drv_probe+0x3e/0xa0
[    0.075000]  driver_probe_device+0x2cd/0x400
[    0.075000]  __driver_attach+0xe4/0x110
[    0.075000]  ? driver_probe_device+0x400/0x400
[    0.075000]  bus_for_each_dev+0x70/0xa0
[    0.075000]  driver_attach+0x1e/0x20
[    0.075000]  bus_add_driver+0x159/0x230
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  driver_register+0x70/0xc0
[    0.075000]  ? init_pstore_fs+0x4d/0x4d
[    0.075000]  __platform_driver_register+0x36/0x40
[    0.075000]  ramoops_init+0x12f/0x131
[    0.075000]  do_one_initcall+0x4d/0x12c
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  kernel_init_freeable+0x19b/0x222
[    0.075000]  ? rest_init+0xbb/0xbb
[    0.075000]  kernel_init+0xe/0xfc
[    0.075000]  ret_from_fork+0x3a/0x50

Signed-off-by: Bin Yang <bin.yang@intel.com>
[kees: add comments describing the mapping differences, updated commit log]
Fixes: 24c3d2f342 ("staging: android: persistent_ram: Make it possible to use memory outside of bootmem")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-13 09:14:57 -07:00
Dan Carpenter
097f5863b1 cifs: read overflow in is_valid_oplock_break()
We need to verify that the "data_offset" is within bounds.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-09-12 17:13:34 -05:00
Chao Yu
6f5c2ed0a2 f2fs: split IO error injection according to RW
This patch adds to support injecting error for write IO, this can simulate
IO error like fail_make_request or dm_flakey does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:32 -07:00
Chao Yu
7c1a000d46 f2fs: add SPDX license identifiers
Remove the verbose license text from f2fs files and replace them with
SPDX tags.  This does not change the license of any of the code.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:10 -07:00
Chengguang Xu
4cb037ec3f f2fs: surround fault_injection related option parsing using CONFIG_F2FS_FAULT_INJECTION
It's a little bit strange when fault_injection related
options fail with -EINVAL which were already disabled
from config, so surround all fault_injection related option
parsing code using CONFIG_F2FS_FAULT_INJECTION. Meanwhile,
slightly change warning message to keep consistency with
option POSIX_ACL and FS_XATTR.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:05:33 -07:00
Arnd Bergmann
04b72322e8 media: dvb: move compat handlers into drivers
The VIDEO_STILLPICTURE is only implemented by one driver, while
VIDEO_GET_EVENT has two users in tree. In both cases, it is fairly
easy to handle the compat ioctls in the native handler rather
than relying on translation in fs/compat_ioctls.

In effect, this means that now the drivers implement both structure
layouts in both native and compat mode, but I don't see anything
wrong with that.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
8a24280b11 media: dvb: move most compat_ioctl handling into drivers
Most DVB audio and video ioctl commands are completely compatible,
and are implemented by just two drivers: ttpci and ivtv. In both
cases, we can use the same ioctl handler for both native and
compat ioctl handling, and remove the entries from the global
lookup table.

In case of ttpci, this directly hooks into the file_operations
structure, and for ivtv, we have to set the compat_ioctl32
method in v4l2_file_operations. For all I can tell, setting it
to video_ioctl2 will still do the right thing for all commands.

Note that for the VIDEO_STILLPICTURE and VIDEO_GET_EVENT commands,
a translation handler in fs/compat_ioctl.c is still used. This
works because the command numbers are different on 32-bit
systems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
e6c8320648 media: cec: move compat_ioctl handling to cec-api.c
All the CEC ioctls are compatible, and they are only implemented
in one driver, so we can simply let this driver handle them
natively.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
b5d3206112 media: dvb: dmxdev: move compat_ioctl handling to dmxdev.c
All dmx ioctls are compatible, and they are only implemented
in one file, so we can replace the list of commands in
fs/compat_ioctl.c with a single line in dmxdev.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
1ccbeeb888 media: dvb: fix compat ioctl translation
The VIDEO_GET_EVENT and VIDEO_STILLPICTURE was added back in 2005 but
it never worked because the command number is wrong.

Using the right command number means we have a better chance of them
actually doing the right thing, though clearly nobody has ever tried
it successfully.

I noticed these while auditing the remaining users of compat_time_t
for y2038 bugs. This one is fine in that regard, it just never did
anything.

Fixes: 6e87abd0b8 ("[DVB]: Add compat ioctl handling.")

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Dan Carpenter
2d204ee9d6 cifs: integer overflow in in SMB2_ioctl()
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.

Fixes: 4a72dafa19 ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
56446f218a CIFS: fix wrapping bugs in num_entries()
The problem is that "entryptr + next_offset" and "entryptr + len + size"
can wrap.  I ended up changing the type of "entryptr" because it makes
the math easier when we don't have to do so much casting.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
8ad8aa3535 cifs: prevent integer overflow in nxt_dir_entry()
The "old_entry + le32_to_cpu(pDirInfo->NextEntryOffset)" can wrap
around so I have added a check for integer overflow.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:27:57 -05:00
Matthew Wilcox
b90ca5cc32 filesystem-dax: Fix use of zero page
Use my_zero_pfn instead of ZERO_PAGE(), and pass the vaddr to it instead
of zero so it works on MIPS and s390 who reference the vaddr to select a
zero page.

Cc: <stable@vger.kernel.org>
Fixes: 91d25ba8a6 ("dax: use common 4k zero page for dax mmap reads")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-11 21:27:44 -07:00
Wang Shilong
c8e927579e f2fs: fix setattr project check upon fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Zhikang Zhang
b430f72636 f2fs: avoid sleeping under spin_lock
In the call trace below, we might sleep in function dput().

So in order to avoid sleeping under spin_lock, we remove f2fs_mark_inode_dirty_sync
from __try_update_largest_extent && __drop_largest_extent.

BUG: sleeping function called from invalid context at fs/dcache.c:796
Call trace:
	dump_backtrace+0x0/0x3f4
	show_stack+0x24/0x30
	dump_stack+0xe0/0x138
	___might_sleep+0x2a8/0x2c8
	__might_sleep+0x78/0x10c
	dput+0x7c/0x750
	block_dump___mark_inode_dirty+0x120/0x17c
	__mark_inode_dirty+0x344/0x11f0
	f2fs_mark_inode_dirty_sync+0x40/0x50
	__insert_extent_tree+0x2e0/0x2f4
	f2fs_update_extent_tree_range+0xcf4/0xde8
	f2fs_update_extent_cache+0x114/0x12c
	f2fs_update_data_blkaddr+0x40/0x50
	write_data_page+0x150/0x314
	do_write_data_page+0x648/0x2318
	__write_data_page+0xdb4/0x1640
	f2fs_write_cache_pages+0x768/0xafc
	__f2fs_write_data_pages+0x590/0x1218
	f2fs_write_data_pages+0x64/0x74
	do_writepages+0x74/0xe4
	__writeback_single_inode+0xdc/0x15f0
	writeback_sb_inodes+0x574/0xc98
	__writeback_inodes_wb+0x190/0x204
	wb_writeback+0x730/0xf14
	wb_check_old_data_flush+0x1bc/0x1c8
	wb_workfn+0x554/0xf74
	process_one_work+0x440/0x118c
	worker_thread+0xac/0x974
	kthread+0x1a0/0x1c8
	ret_from_fork+0x10/0x1c

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
e1293bdfa0 f2fs: plug readahead IO in readdir()
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
042be0f849 f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219

Reproduction way:
- mount image
- run poc code
- umount image

F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
 f2fs_allocate_data_block+0x124/0x580 [f2fs]
 do_write_page+0x78/0x150 [f2fs]
 f2fs_do_write_node_page+0x25/0xa0 [f2fs]
 __write_node_page+0x2bf/0x550 [f2fs]
 f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
 ? sync_inode_metadata+0x2f/0x40
 ? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
 ? up_write+0x1e/0x80
 f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
 ? mark_held_locks+0x5d/0x80
 ? _raw_spin_unlock_irq+0x27/0x50
 kill_f2fs_super+0x68/0x90 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---

The reason is, different log headers using the same segment, once
one log's next block address is used by another log, it will cause
panic as above.

Main area: 24 segs, 24 secs 24 zones
  - COLD  data: 0, 0, 0
  - WARM  data: 1, 1, 1
  - HOT   data: 20, 20, 20
  - Dir   dnode: 22, 22, 22
  - File   dnode: 22, 22, 22
  - Indir nodes: 21, 21, 21

So this patch adds sanity check to detect such condition to avoid
this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
4a70e25544 f2fs: fix memory leak of percpu counter in fill_super()
In fill_super -> init_percpu_info, we should destroy percpu counter
in error path, otherwise memory allcoated for percpu counter will
leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
0b2103e886 f2fs: fix memory leak of write_io in fill_super()
It needs to release memory allocated for sbi->write_io in error path,
otherwise, it will cause memory leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Al Viro
77021f8bab presence of RS485 ioctls has been unconditional since 2014
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-11 19:46:07 -04:00
Chengguang Xu
313ed62a3d f2fs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Chao Yu
1378752b99 f2fs: fix to flush all dirty inodes recovered in readonly fs
generic/417 reported as blow:

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
 ? _raw_spin_unlock+0x2c/0x50
 evict+0xa8/0x170
 dispose_list+0x34/0x40
 evict_inodes+0x118/0x120
 generic_shutdown_super+0x41/0x100
 ? rcu_read_lock_sched_held+0x97/0xa0
 kill_block_super+0x22/0x50
 kill_f2fs_super+0x6f/0x80 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]

It can simply reproduced with scripts:

Enable quota feature during mkfs.

Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs

Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
	a) open /mnt/f2fs/file;
	b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs

The reason is: during recovery, i_{c,m}time of inode will be updated, then
the inode can be set dirty w/o being tracked in sbi->inode_list[DIRTY_META]
global list, so later write_checkpoint will not flush such dirty inode into
node page.

Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncng dirty inodes due to sb_rdonly check, leaving dirty inodes
there.

To solve this issue, during umount, add remove SB_RDONLY flag in
sb->s_flags, to make sure sync_filesystem() will not be skipped.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Yunlei He
cda9cc595f f2fs: report error if quota off error during umount
Now, we depend on fsck to ensure quota file data is ok,
so we scan whole partition if checkpoint without umount
flag. It's same for quota off error case, which may make
quota file data inconsistent.

generic/019 reports below error:

 __quota_error: 1160 callbacks suppressed
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 VFS: Busy inodes after unmount of zram1. Self-destruct in 5 seconds.  Have a nice day...

If we failed in below path due to fail to write dquot block, we will miss
to release quota inode, fix it.

- f2fs_put_super
 - f2fs_quota_off_umount
  - f2fs_quota_off
   - f2fs_quota_sync   <-- failed
   - dquot_quota_off   <-- missed to call

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Eric W. Biederman
961366a019 signal: Remove the siginfo paramater from kernel_dqueue_signal
None of the callers use the it so remove it.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11 21:19:14 +02:00
Ross Zwisler
b1f382178d ext4: close race between direct IO and ext4_break_layouts()
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-11 13:31:16 -04:00
Chengguang Xu
02645bcdfc jfs: remove quota option from ignore list
We treat quota option as usrquota, so remove quota option from ignore
list.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-10 15:36:40 -05:00
Al Viro
702ec3072a hidp: fix compat_ioctl
1) no point putting it into fs/compat_ioctl.c when you handle it in
your ->compat_ioctl() anyway.
2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does
layout massage (pointer-chasing there)
3) use compat_ptr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:07 -04:00
Al Viro
89c0c24b4f cmtp: fix compat_ioctl
Use compat_ptr().  And don't mess with fs/compat_ioctl.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
cc04f6e242 bnep: fix compat_ioctl
use compat_ptr() properly and don't bother with fs/compat_ioctl.c -
it's all handled in ->compat_ioctl() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
0976d4e1dc compat_ioctl: trim the pointless includes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:40:57 -04:00
Miklos Szeredi
8c25741aaa ovl: fix oopses in ovl_fill_super() failure paths
ovl_free_fs() dereferences ofs->workbasedir and ofs->upper_mnt in cases when
those might not have been initialized yet.

Fix the initialization order for these fields.

Reported-by: syzbot+c75f181dc8429d2eb887@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc:  <stable@vger.kernel.org> # v4.15
Fixes: 95e6d4177c ("ovl: grab reference to workbasedir early")
Fixes: a9075cdb46 ("ovl: factor out ovl_free_fs() helper")
2018-09-10 12:55:49 +02:00
Stefan Metzmacher
5890184d2b fs/cifs: require sha512
This got lost in commit 0fdfef9aa7,
which removed CONFIG_CIFS_SMB311.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Fixes: 0fdfef9aa7 ("smb3: simplify code by removing CONFIG_CIFS_SMB311")
CC: Stable <stable@vger.kernel.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:04:27 -05:00
Stephen Rothwell
bcfb84a996 fs/cifs: suppress a string overflow warning
A powerpc build of cifs with gcc v8.2.0 produces this warning:

fs/cifs/cifssmb.c: In function ‘CIFSSMBNegotiate’:
fs/cifs/cifssmb.c:605:3: warning: ‘strncpy’ writing 16 bytes into a region of size 1 overflows the destination [-Wstringop-overflow=]
   strncpy(pSMB->DialectsArray+count, protocols[i].name, 16);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since we are already doing a strlen() on the source, change the strncpy
to a memcpy().

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:02:39 -05:00
David Howells
ecfe951f0c afs: Fix cell specification to permit an empty address list
Fix the cell specification mechanism to allow cells to be pre-created
without having to specify at least one address (the addresses will be
upcalled for).

This allows the cell information preload service to avoid the need to issue
loads of DNS lookups during boot to get the addresses for each cell (500+
lookups for the 'standard' cell list[*]).  The lookups can be done later as
each cell is accessed through the filesystem.

Also remove the print statement that prints a line every time a new cell is
added.

[*] There are 144 cells in the list.  Each cell is first looked up for an
    SRV record, and if that fails, for an AFSDB record.  These get a list
    of server names, each of which then has to be looked up to get the
    addresses for that server.  E.g.:

	dig srv _afs3-vlserver._udp.grand.central.org

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-07 16:39:44 -07:00
Jaegeuk Kim
5ce805869c f2fs: submit bio after shutdown
Sometimes, some merged IOs could get a chance to be submitted, resulting in
system hang in shutdown test. This issues IOs all the time after shutdown.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-07 14:26:43 -07:00
Linus Torvalds
a12ed06ba2 Two rbd patches to complete support for images within namespaces that
went into -rc1 and a use-after-free fix.
 
 The rbd changes have been sitting in a branch for quite a while but
 couldn't be included into the -rc1 pull request because of a pending
 wire protocol backwards compatibility fixup that only got committed
 early this week.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAluSrJYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi/N8B/4sZzRCJMCejvU/yRq91NlaPDrxbVHh
 nfICZ/8Fsy/fmvK8NWNyHcCIWx+nWrbCvCJMj0fxWMhk/1t75yC+TdyCJnyuhsQU
 V/CPTs9BTdwrSUiTB83/n/ukGL6mpESk0CQ1er/l1EO6FnNOXvgzHDnCqUQZLdzU
 1aRcx5JQWWo/QlCmzt2KWENhfQRMvLAtf04F5cUuR+JTrMjwWia6MAuRGuOhVQkW
 XIlFNakBKab89Vod1pmA7BrG/+sHXCpVGX6sjAp9vQUWO3WWKBRnNtVwo9dPSHah
 hBR8IzOkihw7HfTlINWVpiR69nTfM80PQHXJkFSp36E6Sfq8EShRpFIZ
 =pga5
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two rbd patches to complete support for images within namespaces that
  went into -rc1 and a use-after-free fix.

  The rbd changes have been sitting in a branch for quite a while but
  couldn't be included into the -rc1 pull request because of a pending
  wire protocol backwards compatibility fixup that only got committed
  early this week"

* tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client:
  rbd: support cloning across namespaces
  rbd: factor out get_parent_info()
  ceph: avoid a use-after-free in ceph_destroy_options()
2018-09-07 10:57:59 -07:00
Linus Torvalds
d042a240a8 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluSNfQACgkQnJ2qBz9k
 QNloSAf/RpsqUnmQvJKK7hQUVNMCQP/Kf3KND5iN5RfMbhU9r7tzERkNvqhdA6QZ
 uoPi8dEecI+ihY5F8ddyw1Chaou4MToWKdNz4ojwJXVrN6bb+pq+xj0hTvT5FjFh
 iM1JXHtSEk6W+CnXPE5CycrZppIHxJfJxeaWg7av5Zyc4nkTesxtG8PycMBxROW8
 detUcJt15VGBswi19udztf7XY/lwDwUQ9LwC0W5B+o8pKIwuN3ENMVVOeAriAyoy
 hXTpPA8twBhM7i8D/1eppDCkYLTr08bquNsDpn8kUEf2RxcxiFJuDLOeXiH3sQRq
 BZmf/QIIRA8R+SPeFiuxY/795FDC6Q==
 =CWu1
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fix from Jan Kara:
 "A small fsnotify fix from Amir"

* tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix ignore mask logic in fsnotify()
2018-09-07 10:54:46 -07:00
Dominique Martinet
b4dc44b3ca 9p locks: fix glock.client_id leak in do_lock
the 9p client code overwrites our glock.client_id pointing to a static
buffer by an allocated string holding the network provided value which
we do not care about; free and reset the value as appropriate.

This is almost identical to the leak in v9fs_file_getlock() fixed by
Al Viro in commit ce85dd58ad ("9p: we are leaking glock.client_id
in v9fs_file_getlock()"), which was returned as an error by a coverity
false positive -- while we are here attempt to make the code slightly
more robust to future change of the net/9p/client code and hopefully
more clear to coverity that there is no problem.

Link: http://lkml.kernel.org/r/1536339057-21974-5-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:52:35 +09:00
Dominique Martinet
e02a53d92e 9p: acl: fix uninitialized iattr access
iattr is passed to v9fs_vfs_setattr_dotl which does send various
values from iattr over the wire, even if it tells the server to
only look at iattr.ia_valid fields this could leak some stack data.

Link: http://lkml.kernel.org/r/1536339057-21974-2-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 1195601 ("Uninitalized scalar variable")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:51:50 +09:00
Dinu-Razvan Chis-Serban
5e172f75e5 9p locks: add mount option for lock retry interval
The default P9_LOCK_TIMEOUT can be too long for some users exporting
a local file system to a guest VM (30s), make this configurable at
mount time.

Link: http://lkml.kernel.org/r/1536295827-3181-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195727
Signed-off-by: Dinu-Razvan Chis-Serban <justcsdr@gmail.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gertjan Halkes
2803cf4379 9p: do not trust pdu content for stat item size
v9fs_dir_readdir() could deadloop if a struct was sent with a size set
to -2

Link: http://lkml.kernel.org/r/1536134432-11997-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88021
Signed-off-by: Gertjan Halkes <gertjan@google.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gustavo A. R. Silva
426d5a0f97 9p: fix spelling mistake in fall-through annotation
Replace "fallthough" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough

Link: http://lkml.kernel.org/r/20180903193806.GA11258@embeddedor.com
Addresses-Coverity-ID: 402012 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Jan Kara
1abefb0274 udf: Drop pack pragma from udf_sb.h
Drop pack pragma. The header file defines only in-memory structures.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:23 +02:00
Jan Kara
694538b5d7 udf: Drop freed bitmap / table support
We don't support Free Space Table and Free Space Bitmap as specified by
UDF standard for writing as we don't support erasing blocks before
overwriting them. Just drop the handling of these structures as
partition descriptor checking code already makes sure such filesystems
can be mounted only read-only.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
b085fbe2ef udf: Fix crash during mount
Fix a crash during an attempt to mount a filesystem that has both
Unallocated Space Table and Unallocated Space Bitmap. Such filesystem
actually violates the UDF standard so we just have to properly detect
such situation and refuse to mount such filesystem read-write. When we
are at it, verify also other constraints on the allocation information
mandated by the standard.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
a9ad01bc75 udf: Prevent write-unsupported filesystem to be remounted read-write
There are certain filesystem features which we support for reading but
not for writing. We properly refuse to mount such filesystems read-write
however for some features (such as read-only partitions), we don't check
for these features when remounting the filesystem from read-only to
read-write. Thus such filesystems could be remounted read-write leading
to strange behavior (most likely crashes).

Fix the problem by marking in superblock whether the filesystem has some
features that are supported in read-only mode and check this flag during
remount.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Linus Torvalds
c6ff25ce35 four small SMB3 fixes, three for stable, and one minor debug clarification
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluRb84ACgkQiiy9cAdy
 T1GBuAv/ZkXC5cxNpE84dRLii4ey+Y0u5ip3VIyzxwburDDxQh99zWTs8FdyYe1f
 5CYD4PKcNajVceNUg1EdfkNC4ss21I2HcxujquCo8gZJqrt5lHvZXELJ31d8ovXG
 Y2MRl5+KZKLx1sBgzsGJf3aZOneObp7EEAnL/bjeziX7caD6uO/F52MXcMgWrpoM
 krvWSMzS5iY+jRltvPLhTzUmfbaPoS86FRNIHHOiA8AgQLvx3CT4lL+kJOHv1bHX
 haZc9zKy1iUU+yK05vnNLOHVlPeZDa8j/Q8lcEfTrDVa0J/je4DkI9HsB1X54vz0
 65JluAf4G4vaSYU/hnLaWt4PZ7owjTr7fJlu0c8TE6aPqAZEYYoVhjbrdXg7OI4j
 9nUsoEY1hyBH2X5qStvpb9GiZbhR19bhcvnvOAgZ7p4VnyKfexe7o2QHTCPPAxaP
 uYf8OeHjmnYF3MQI0s6rA0TPdZxc77acgrZQWUoOMPN/bwkyP1bSfA1aURJAsx/k
 OodjAog1
 =WRXo
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four small SMB3 fixes, three for stable, and one minor debug
  clarification"

* tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: connect to servername instead of IP for IPC$ share
  smb3: check for and properly advertise directory lease support
  smb3: minor debugging clarifications in rfc1001 len processing
  SMB3: Backup intent flag missing for directory opens with backupuid mounts
  fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
2018-09-06 15:39:11 -07:00
Chengguang Xu
e8d4ceeb34 jfs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-06 11:56:26 -05:00
Linus Torvalds
5404525b98 for-4.19-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
 WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
 cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
 NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
 ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
 BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
 TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
 +k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
 NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
 u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
 pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
 9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
 =fu93
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix for improper fsync after hardlink

 - fix for a corruption during file deduplication

 - use after free fixes

 - RCU warning fix

 - fix for buffered write to nodatacow file

* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
  btrfs: use after free in btrfs_quota_enable
  btrfs: btrfs_shrink_device should call commit transaction at the end
  btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
  Btrfs: fix data corruption when deduplicating between different files
  Btrfs: sync log after logging new name
  Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
2018-09-06 09:04:45 -07:00
Ilya Dryomov
8aaff15168 ceph: avoid a use-after-free in ceph_destroy_options()
syzbot reported a use-after-free in ceph_destroy_options(), called from
ceph_mount().  The problem was that create_fs_client() consumed the opt
pointer on some errors, but not on all of them.  Make sure it always
consumes both libceph and ceph options.

Reported-by: syzbot+8ab6f1042021b4eed062@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-09-06 16:18:04 +02:00
Jaegeuk Kim
0ded69f632 f2fs: avoid wrong decrypted data from disk
1. Create a file in an encrypted directory
2. Do GC & drop caches
3. Read stale data before its bio for metapage was not issued yet

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:07 -07:00
Chao Yu
22d7ea1364 Revert "f2fs: use printk_ratelimited for f2fs_msg"
Don't limit printing log, so that we will not miss any key messages.

This reverts commit a36c106dff.

In addition, we use printk_ratelimited to avoid too many log prints.
- error injection
- discard submission failure

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:06 -07:00
Sahitya Tummala
abde73c718 f2fs: fix unnecessary periodic wakeup of discard thread when dev is busy
When dev is busy, discard thread wake up timeout can be aligned with the
exact time that it needs to wait for dev to come out of busy. This helps
to avoid unnecessary periodic wakeups and thus save some power.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:32 -07:00
Chao Yu
7d20c8abb2 f2fs: fix to avoid NULL pointer dereference on se->discard_map
https://bugzilla.kernel.org/show_bug.cgi?id=200951

These is a NULL pointer dereference issue reported in bugzilla:

Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.

The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
 sda1: FAT  /boot
 sda2: F2FS /
 sda3: F2FS /home
 sda4: F2FS

The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
 mounting with "discard" option, but the device does not support discard

The USB host is USB3.0 and UASP capable. It is the one on RK3399.

Given this everything works fine, except there is no TRIM support.

In order to enable TRIM a new UDEV rule is added [1]:
 /etc/udev/rules.d/10-sata-bridge-trim.rules:
 ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
 Unable to handle kernel NULL pointer dereference

Also tested on a x86_64 system: works fine even with TRIM enabled.
 same disc
 same bridge
 different usb host controller
 different cpu architecture
 not root filesystem

Regards,
  Vicenç.

[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280

 Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
 [000000000000003e] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
 CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
 Hardware name: Sapphire-RK3399 Board (DT)
 pstate: 00000005 (nzcv daif -PAN -UAO)
 pc : update_sit_entry+0x304/0x4b0
 lr : update_sit_entry+0x108/0x4b0
 sp : ffff00000ca13bd0
 x29: ffff00000ca13bd0 x28: 000000000000003e
 x27: 0000000000000020 x26: 0000000000080000
 x25: 0000000000000048 x24: ffff8000ebb85cf8
 x23: 0000000000000253 x22: 00000000ffffffff
 x21: 00000000000535f2 x20: 00000000ffffffdf
 x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
 x17: 0000000007ce6926 x16: 000000001c83ffa8
 x15: 0000000000000000 x14: ffff8000f602df90
 x13: 0000000000000006 x12: 0000000000000040
 x11: 0000000000000228 x10: 0000000000000000
 x9 : 0000000000000000 x8 : 0000000000000000
 x7 : 00000000000535f2 x6 : ffff8000ebff3440
 x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
 x3 : 00000000ffffffff x2 : 0000000000000020
 x1 : 0000000000000000 x0 : ffff8000eb9e5800
 Process nvim (pid: 957, stack limit = 0x0000000063a78320)
 Call trace:
  update_sit_entry+0x304/0x4b0
  f2fs_invalidate_blocks+0x98/0x140
  truncate_node+0x90/0x400
  f2fs_remove_inode_page+0xe8/0x340
  f2fs_evict_inode+0x2b0/0x408
  evict+0xe0/0x1e0
  iput+0x160/0x260
  do_unlinkat+0x214/0x298
  __arm64_sys_unlinkat+0x3c/0x68
  el0_svc_handler+0x94/0x118
  el0_svc+0x8/0xc
 Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
 ---[ end trace a0f21a307118c477 ]---

The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.

So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Chengguang Xu
1618e6e297 f2fs: add additional sanity check in f2fs_acl_from_disk()
Add additinal sanity check for irregular case(e.g. corruption).
If size of extended attribution is smaller than size of acl header,
then return -EINVAL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Ryusuke Konishi
ae98043f5f nilfs2: convert to SPDX license tags
Remove the verbose license text from NILFS2 files and replace them with
SPDX tags.  This does not change the license of any of the code.

Link: http://lkml.kernel.org/r/1535624528-5982-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Alexander Popov
c8d126275a fs/proc: Show STACKLEAK metrics in the /proc file system
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04 10:35:48 -07:00
Jason A. Donenfeld
578bdaabd0 crypto: speck - remove Speck
These are unused, undesired, and have never actually been used by
anybody. The original authors of this code have changed their mind about
its inclusion. While originally proposed for disk encryption on low-end
devices, the idea was discarded [1] in favor of something else before
that could really get going. Therefore, this patch removes Speck.

[1] https://marc.info/?l=linux-crypto-vger&m=153359499015659

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:35:03 +08:00
Theodore Ts'o
5f8c10936f ext4: fix online resizing for bigalloc file systems with a 1k block size
An online resize of a file system with the bigalloc feature enabled
and a 1k block size would be refused since ext4_resize_begin() did not
understand s_first_data_block is 0 for all bigalloc file systems, even
when the block size is 1k.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:25:01 -04:00
Theodore Ts'o
f0a459dec5 ext4: fix online resize's handling of a too-small final block group
Avoid growing the file system to an extent so that the last block
group is too small to hold all of the metadata that must be stored in
the block group.

This problem can be triggered with the following reproducer:

umount /mnt
mke2fs -F -m0 -b 4096 -t ext4 -O resize_inode,^has_journal \
	-E resize=1073741824 /tmp/foo.img 128M
mount /tmp/foo.img /mnt
truncate --size 1708M /tmp/foo.img
resize2fs /dev/loop0 295400
umount /mnt
e2fsck -fy /tmp/foo.img

Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:19:43 -04:00
Amir Goldstein
d54f4fba88 fanotify: add API to attach/detach super block mark
Add another mark type flag FAN_MARK_FILESYSTEM for add/remove/flush
of super block mark type.

A super block watch gets all events on the filesystem, regardless of
the mount from which the mark was added, unless an ignore mask exists
on either the inode or the mount where the event was generated.

Only one of FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM mark type flags
may be provided to fanotify_mark() or no mark type flag for inode mark.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:16:11 +02:00
Amir Goldstein
60f7ed8c7c fsnotify: send path type events to group with super block marks
Send events to group if super block mark mask matches the event
and unless the same group has an ignore mask on the vfsmount or
the inode on which the event occurred.

Soon, fanotify backend is going to support super block marks and
fanotify backend only supports path type events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:04 +02:00
Amir Goldstein
1e6cb72399 fsnotify: add super block object type
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block
marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:01 +02:00
Jann Horn
9da3f2b740 x86/fault: BUG() when uaccess helpers fault on kernel addresses
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:

 - the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
   access_ok() checks")
 - the sg/bsg read/write APIs
 - the infiniband read/write APIs

These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().

Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
2018-09-03 15:12:09 +02:00
Amir Goldstein
9bdda4e9cf fsnotify: fix ignore mask logic in fsnotify()
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).

This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.

To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.

Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 14:57:41 +02:00
Chengguang Xu
59fed3bf8a ext2: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems.  However,
when default_acl and acl are NULL, then they keep the value of
ACL_NOT_CACHED. This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup in future.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:05:03 +02:00
Colin Ian King
849fe89ce6 udf: remove unused variables group_start and nr_groups
Variables group_start and nr_groups are being assigned but are never used
hence they are redundant and can be removed.

Cleans up clang warning:
variable 'group_start' set but not used [-Wunused-but-set-variable]
variable 'nr_groups' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:04:49 +02:00
Amir Goldstein
b833a36603 ovl: add ovl_fadvise()
Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
on an overlayfs file.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef857 ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-03 09:43:10 +02:00
Thomas Werschlein
395a2076b4 cifs: connect to servername instead of IP for IPC$ share
This patch is required allows access to a Microsoft fileserver failover
cluster behind a 1:1 NAT firewall.

The change also provides stronger context for authentication and share
connection (see MS-SMB2 3.3.5.7 and MS-SRVS 3.1.6.8) as noted by
Tom Talpey, and addresses comments about the buffer size for the UNC
made by Aurélien Aptel.

Signed-off-by: Thomas Werschlein <thomas.werschlein@geo.uzh.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Tom Talpey <ttalpey@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-02 23:21:42 -05:00
Steve French
f801568332 smb3: check for and properly advertise directory lease support
Although servers will typically ignore unsupported features,
we should advertise the support for directory leases (as
Windows e.g. does) in the negotiate protocol capabilities we
pass to the server, and should check for the server capability
(CAP_DIRECTORY_LEASING) before sending a lease request for an
open of a directory.  This will prevent us from accidentally
sending directory leases to SMB2.1 or SMB2 server for example.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
25f2573512 smb3: minor debugging clarifications in rfc1001 len processing
I ran into some cases where server was returning the wrong length
on frames but I couldn't easily match them to the command in the
network trace (or server logs) since I need the command and/or
multiplex id to find the offending SMB2/SMB3 command.  Add these
two fields to the log message. In the case of padding too much
it may not be a problem in all cases but might have correlated
to a network disconnect case in some problems we have been
looking at. In the case of frame too short is even more important.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
5e19697b56 SMB3: Backup intent flag missing for directory opens with backupuid mounts
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag needs to be set
on opens of directories (and files) but was missing in some
places causing access denied trying to enumerate and backup
servers.

Fixes kernel bugzilla #200953
https://bugzilla.kernel.org/show_bug.cgi?id=200953

Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-09-02 23:21:42 -05:00
Jon Kuhn
c15e3f19a6 fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
When a Mac client saves an item containing a backslash to a file server
the backslash is represented in the CIFS/SMB protocol as as U+F026.
Before this change, listing a directory containing an item with a
backslash in its name will return that item with the backslash
represented with a true backslash character (U+005C) because
convert_sfm_character mapped U+F026 to U+005C when interpretting the
CIFS/SMB protocol response.  However, attempting to open or stat the
path using a true backslash will result in an error because
convert_to_sfm_char does not map U+005C back to U+F026 causing the
CIFS/SMB request to be made with the backslash represented as U+005C.

This change simply prevents the U+F026 to U+005C conversion from
happenning.  This is analogous to how the code does not do any
translation of UNI_SLASH (U+F000).

Signed-off-by: Jon Kuhn <jkuhn@barracuda.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-02 23:21:42 -05:00
Linus Torvalds
501dacbc24 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A small set of updates for core code:

   - Prevent tracing in functions which are called from trace patching
     via stop_machine() to prevent executing half patched function trace
     entries.

   - Remove old GCC workarounds

   - Remove pointless includes of notifier.h"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Remove workaround for unreachable warnings from old GCC
  notifier: Remove notifier header file wherever not used
  watchdog: Mark watchdog touch functions as notrace
2018-09-02 09:41:45 -07:00
Theodore Ts'o
4274f516d4 ext4: recalucate superblock checksum after updating free blocks/inodes
When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock.  It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.

Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super().  The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum.  So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.

Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted.  However, it doesn't work in this
sequence:

mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc

With this commit, the "tune2fs -l" no longer fails.

Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-01 14:42:14 -04:00
Theodore Ts'o
bcd8e91f98 ext4: avoid arithemetic overflow that can trigger a BUG
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.

https://bugzilla.kernel.org/show_bug.cgi?id=200623

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-09-01 12:45:04 -04:00
Amir Goldstein
5b910bd615 ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.

Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:

 CPU: 0 PID: 2462 Comm: swapon Not tainted
 4.18.0-xfstests-12721-g33e17876ea4e #3402
 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
 Call Trace:
  xfs_iomap_swapfile_activate+0x1f/0x43
  __se_sys_swapon+0xb1a/0xee9

Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.

Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.

Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0 ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Amir Goldstein
80d3481081 ovl: respect FIEMAP_FLAG_SYNC flag
Stacked overlayfs fiemap operation broke xfstests that test delayed
allocation (with "_test_generic_punch -d"), because ovl_fiemap()
failed to write dirty pages when requested.

Fixes: 9e142c4102 ("ovl: add ovl_fiemap()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Linus Torvalds
f3f106dac0 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
 QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
 wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
 mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
 WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
 DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
 mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
 =E61M
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs fixes from Jan Kara:

 - make UDF to properly mount media created by Win7

 - make isofs to properly refuse devices with large physical block size

 - fix a Spectre gadget in quotactl(2)

 - fix a warning in fsnotify code hit by syzkaller

* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix mounting of Win7 created UDF filesystems
  udf: Remove dead code from udf_find_fileset()
  fs/quota: Fix spectre gadget in do_quotactl
  fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
  isofs: reject hardware sector size > 2048 bytes
  fsnotify: fix false positive warning on inode delete
2018-08-29 14:56:45 -07:00
Arnd Bergmann
4faea239e5 y2038: utimes: Rework #ifdef guards for compat syscalls
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).

As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.

Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
2018-08-29 15:42:23 +02:00
Arnd Bergmann
185cfaf764 y2038: Compile utimes()/futimesat() conditionally
There are four generations of utimes() syscalls: utime(), utimes(),
futimesat() and utimensat(), each one being a superset of the previous
one. For y2038 support, we have to add another one, which is the same
as the existing utimensat() but always passes 64-bit times_t based
timespec values.

There are currently 10 architectures that only use utimensat(), two
that use utimes(), futimesat() and utimensat() but not utime(), and 11
architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
in order to get a sys_utime implementation. Since all the new
architectures only want utimensat(), moving all the legacy entry points
into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
and ia64 grow a tiny bit as they now also get an unused sys_utime(),
but it didn't seem worth the extra complexity of adding yet another
ifdef for those.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:23 +02:00
Arnd Bergmann
a4f7a30046 y2038: Change sys_utimensat() to use __kernel_timespec
When 32-bit architectures get changed to support 64-bit time_t,
utimensat() needs to use the new __kernel_timespec structure as its
argument.

The older utime(), utimes() and futimesat() system calls don't need a
corresponding change as they are no longer used on C libraries that have
64-bit time support.

As we do for the other syscalls that have timespec arguments, we reuse
the 'compat' syscall entry points to implement the traditional four
interfaces, and only leave the new utimensat() as a native handler,
so that the same code gets used on both 32-bit and 64-bit kernels
on each syscall.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:22 +02:00
Arnd Bergmann
caf6f9c8a3 asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.

Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.

There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:21 +02:00
Arnd Bergmann
82b355d161 y2038: Remove newstat family from default syscall set
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
  lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
  replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
  other things.

We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:20 +02:00
Dominique Martinet
81c99089bc v9fs_dir_readdir: fix double-free on p9stat_read error
p9stat_read will call p9stat_free on error, we should only free the
struct content on success.

There also is no need to "p9stat_init" st as the read function will
zero the whole struct for us anyway, so clean up the code a bit while
we are here.

Link: http://lkml.kernel.org/r/1535410108-20650-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reported-by: syzbot+d4252148d198410b864f@syzkaller.appspotmail.com
2018-08-29 13:39:57 +09:00
Bob Peterson
4f36cb36c9 gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
The GFS2_RDF_UPTODATE flag in the rgrp is used to determine when
a rgrp buffer is valid. It's cleared when the glock is invalidated,
signifying that the buffer data is now invalid. But before this
patch, function update_rgrp_lvb was setting the flag when it
determined it had a valid lvb. But that's an invalid assumption:
just because you have a valid lvb doesn't mean you have valid
buffers. After all, another node may have made the lvb valid,
and this node just fetched it from the glock via dlm.

Consider this scenario:
1. The file system is mounted with RGRPLVB option.
2. In gfs2_inplace_reserve it locks the rgrp glock EX, but thanks
   to GL_SKIP, it skips the gfs2_rgrp_bh_get.
3. Since loops == 0 and the allocation target (ap->target) is
   bigger than the largest known chunk of blocks in the rgrp
   (rs->rs_rbm.rgd->rd_extfail_pt) it skips that rgrp and bypasses
   the call to gfs2_rgrp_bh_get there as well.
4. update_rgrp_lvb sees the lvb MAGIC number is valid, so bypasses
   gfs2_rgrp_bh_get, but it still sets sets GFS2_RDF_UPTODATE due
   to this invalid assumption.
5. The next time update_rgrp_lvb is called, it sees the bit is set
   and just returns 0, assuming both the lvb and rgrp are both
   uptodate. But since this is a smaller allocation, or space has
   been freed by another node, thus adjusting the lvb values,
   it decides to use the rgrp for allocations, with invalid rd_free
   due to the fact it was never updated.

This patch changes update_rgrp_lvb so it doesn't set the UPTODATE
flag anymore. That way, it has no choice but to fetch the latest
values.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-08-28 12:51:08 -05:00
Bob Peterson
72244b6bc7 gfs2: improve debug information when lvb mismatches are found
Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
but it wouldn't tell you what was actually wrong. This patch adds
more information to help us debug it. It also makes rgrp consistency
checks dump any bad rgrps, and the rgrp dump code dump any lvbs
as well as the rgrp itself.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2018-08-28 12:51:08 -05:00
Theodore Ts'o
4d982e25d0 ext4: avoid divide by zero fault when deleting corrupted inline directories
A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault.  Fix this by using the size of the inline directory instead of
dir->i_size.

Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero.  (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)

https://bugzilla.kernel.org/show_bug.cgi?id=200933

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 09:22:45 -04:00
Arnd Bergmann
9afc5eee65 y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old				new
---				---
compat_time_t			old_time32_t
struct compat_timeval		struct old_timeval32
struct compat_timespec		struct old_timespec32
struct compat_itimerspec	struct old_itimerspec32
ns_to_compat_timeval()		ns_to_old_timeval32()
get_compat_itimerspec64()	get_old_itimerspec32()
put_compat_itimerspec64()	put_old_itimerspec32()
compat_get_timespec64()		get_old_timespec32()
compat_put_timespec64()		put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27 14:48:48 +02:00
Theodore Ts'o
b50282f324 ext4: check to make sure the rename(2)'s destination is not freed
If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero.  If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.

https://bugzilla.kernel.org/show_bug.cgi?id=200931

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 01:47:09 -04:00
Theodore Ts'o
072ebb3bff ext4: add nonstring annotations to ext4.h
This suppresses some false positives in gcc 8's -Wstringop-truncation

Suggested by Miguel Ojeda (hopefully the __nonstring definition will
eventually get accepted in the compiler-gcc.h header file).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-08-27 01:15:11 -04:00
Linus Torvalds
aba16dc5cf Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax
Pull IDA updates from Matthew Wilcox:
 "A better IDA API:

      id = ida_alloc(ida, GFP_xxx);
      ida_free(ida, id);

  rather than the cumbersome ida_simple_get(), ida_simple_remove().

  The new IDA API is similar to ida_simple_get() but better named.  The
  internal restructuring of the IDA code removes the bitmap
  preallocation nonsense.

  I hope the net -200 lines of code is convincing"

* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
  ida: Change ida_get_new_above to return the id
  ida: Remove old API
  test_ida: check_ida_destroy and check_ida_alloc
  test_ida: Convert check_ida_conv to new API
  test_ida: Move ida_check_max
  test_ida: Move ida_check_leaf
  idr-test: Convert ida_check_nomem to new API
  ida: Start new test_ida module
  target/iscsi: Allocate session IDs from an IDA
  iscsi target: fix session creation failure handling
  drm/vmwgfx: Convert to new IDA API
  dmaengine: Convert to new IDA API
  ppc: Convert vas ID allocation to new IDA API
  media: Convert entity ID allocation to new IDA API
  ppc: Convert mmu context allocation to new IDA API
  Convert net_namespace to new IDA API
  cb710: Convert to new IDA API
  rsxx: Convert to new IDA API
  osd: Convert to new IDA API
  sd: Convert to new IDA API
  ...
2018-08-26 11:48:42 -07:00
Linus Torvalds
d207ea8e74 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Kernel:
   - Improve kallsyms coverage
   - Add x86 entry trampolines to kcore
   - Fix ARM SPE handling
   - Correct PPC event post processing

  Tools:
   - Make the build system more robust
   - Small fixes and enhancements all over the place
   - Update kernel ABI header copies
   - Preparatory work for converting libtraceevnt to a shared library
   - License cleanups"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
  tools arch x86: Update tools's copy of cpufeatures.h
  perf python: Fix pyrf_evlist__read_on_cpu() interface
  perf mmap: Store real cpu number in 'struct perf_mmap'
  perf tools: Remove ext from struct kmod_path
  perf tools: Add gzip_is_compressed function
  perf tools: Add lzma_is_compressed function
  perf tools: Add is_compressed callback to compressions array
  perf tools: Move the temp file processing into decompress_kmodule
  perf tools: Use compression id in decompress_kmodule()
  perf tools: Store compression id into struct dso
  perf tools: Add compression id into 'struct kmod_path'
  perf tools: Make is_supported_compression() static
  perf tools: Make decompress_to_file() function static
  perf tools: Get rid of dso__needs_decompress() call in __open_dso()
  perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
  perf tools: Get rid of dso__needs_decompress() call in read_object_code()
  tools lib traceevent: Change to SPDX License format
  perf llvm: Allow passing options to llc in addition to clang
  perf parser: Improve error message for PMU address filters
  ...
2018-08-26 11:25:21 -07:00
Linus Torvalds
2923b27e54 libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
   recovery code has specific enabling for several possible page states
   that needs new enabling to handle poison in dax mappings. Teach
   memory_failure() about ZONE_DEVICE pages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
 OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
 75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
 5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
 BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
 3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
 F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
 T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
 MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
 oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
 /kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
 nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
 =Ftop
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
 "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
  backed mappings. The recovery code has specific enabling for several
  possible page states and needs new enabling to handle poison in dax
  mappings.

  In order to support reliable reverse mapping of user space addresses:

   1/ Add new locking in the memory_failure() rmap path to prevent races
      that would typically be handled by the page lock.

   2/ Since dev_pagemap pages are hidden from the page allocator and the
      "compound page" accounting machinery, add a mechanism to determine
      the size of the mapping that encompasses a given poisoned pfn.

   3/ Given pmem errors can be repaired, change the speculatively
      accessed poison protection, mce_unmap_kpfn(), to be reversible and
      otherwise allow ongoing access from the kernel.

  A side effect of this enabling is that MADV_HWPOISON becomes usable
  for dax mappings, however the primary motivation is to allow the
  system to survive userspace consumption of hardware-poison via dax.
  Specifically the current behavior is:

     mce: Uncorrected hardware memory error in user-access at af34214200
     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     mce: [Hardware Error]: Machine check events logged
     {1}[Hardware Error]: event severity: corrected
     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
     [..]
     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
     mce: Memory error not recovered
     <reboot>

  ...and with these changes:

     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
     Memory failure: 0x20cb00: recovery action for dax page: Recovered

  Given all the cross dependencies I propose taking this through
  nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
  folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
2018-08-25 18:43:59 -07:00
Linus Torvalds
828bf6e904 libnvdimm-for-4.19_misc
Collection of misc libnvdimm patches for 4.19 submission
 * Adding support to read locked nvdimm capacity.
 
 * Change test code to make DSM failure code injection an override.
 
 * Add support for calculate maximum contiguous area for namespace.
 
 * Add support for queueing a short ARS when there is on going ARS for
   nvdimm.
 
 * Allow NULL to be passed in to ->direct_access() for kaddr and
   pfn params.
 
 * Improve smart injection support for nvdimm emulation testing.
 
 * Fix test code that supports for emulating controller temperature.
 
 * Fix hang on error before devm_memremap_pages()
 
 * Fix a bug that causes user memory corruption when data returned
   to user for ars_status.
 
 * Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
 OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
 KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
 b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
 4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
 6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
 43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
 EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
 1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
 5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
 u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
 7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
 =5p2n
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dave Jiang:
 "Collection of misc libnvdimm patches for 4.19 submission:

   - Adding support to read locked nvdimm capacity.

   - Change test code to make DSM failure code injection an override.

   - Add support for calculate maximum contiguous area for namespace.

   - Add support for queueing a short ARS when there is on going ARS for
     nvdimm.

   - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
     params.

   - Improve smart injection support for nvdimm emulation testing.

   - Fix test code that supports for emulating controller temperature.

   - Fix hang on error before devm_memremap_pages()

   - Fix a bug that causes user memory corruption when data returned to
     user for ars_status.

   - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
     fsdax"

* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: fix ars_status output length calculation
  device-dax: avoid hang on error before devm_memremap_pages()
  tools/testing/nvdimm: improve emulation of smart injection
  filesystem-dax: Do not request kaddr and pfn when not required
  md/dm-writecache: Don't request pointer dummy_addr when not required
  dax/super: Do not request a pointer kaddr when not required
  tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
  s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
  libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
  acpi/nfit: queue issuing of ars when an uc error notification comes in
  libnvdimm: Export max available extent
  libnvdimm: Use max contiguous area for namespace size
  MAINTAINERS: Add Jan Kara for filesystem DAX
  MAINTAINERS: update Ross Zwisler's email address
  tools/testing/nvdimm: Fix support for emulating controller temperature
  tools/testing/nvdimm: Make DSM failure code injection an override
  acpi, nfit: Prefer _DSM over _LSR for namespace label reads
  libnvdimm: Introduce locked DIMM capacity support
2018-08-25 18:13:10 -07:00
Linus Torvalds
db84abf5f8 This pull request contains a single fix for UBIFS:
- Remove an empty file from UBIFS source
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAluBE84WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wc73EACHsV8cp3Atkg18yZYQq3t8uOZG
 lZApUWTDr2SpWW88/OJ+RAFKGflKQqSiS1y1HYd2s8is4yYMA/XBto5pWvGqg03P
 GsFGzjMqRztZtNwg7n7JUD0Sq9WLoB1BimuMAG1b8vjTH2uuCxJZVyVi7ZF5W5Oc
 065ebVblFhziI7jddRh/Gpwa2dvx3SlKZa0VImiJd6Mw9AnjO4grU+rOofswXTjX
 AKVjeRvNp0DuxJkxaqO9mH7mug87Owi4co4DBLCJu7pPEq2kvFRHQQtn6QKCdtRk
 gyYBD/ZGkhNYz3wPoQxlHM0bGVAToKtjcT1/T61lrlXKjGJu8S8IHC/raEyCB8yb
 Cz3XxMCH7LXyS3eaov4sZAiogENsoP7i20E2AKwlGil65gx4DVR814MTAi5m0AwN
 Kpv7GHRG9oMEar378Rrv8xptRJhj6nwCxAYuBwmPFlK7d95qMQ9hLjsvYp3TLuzn
 7ihUzT8m/GZG1qrlkkk24Bjm834dCVRKPBJJ4beh3YG8/mj/h6OlQGs0NPRIF6O1
 JknbQqXOz6UZ1UsvgtDK7p8GuHztKA4K+3Qv4BWwIRFisX+gTWKB8IveGPpouUaa
 K8NMOkbw8RsatxzM4DNZBPl2YNjvHuuTGudNqMn3pytsi4Y8y5YKdkpFVgZH1V25
 QmQZbhicHSlfwWR0FQ==
 =QoL7
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs

Pull UBIFS fix from Richard Weinberger:
 "Remove an empty file from UBIFS source"

* tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs:
  ubifs: Remove empty file.h
2018-08-25 13:27:35 -07:00
Linus Torvalds
04faac10fc three small SMB3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
 T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
 0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
 AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
 dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
 brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
 sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
 wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
 NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
 7fcNZtct
 =jMLB
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small SMB3 fixes, one for stable"

* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko to 2.12
  cifs: check kmalloc before use
  cifs: check if SMB2 PDU size has been padded and suppress the warning
  cifs: create a define for how many iovs we need for an SMB2_open()
2018-08-25 13:17:53 -07:00
Colin Ian King
e0fcfe1f1a hpfs: remove unnecessary checks on the value of r when assigning error code
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2.  Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.

Detected by CoverityScan, CID#1226737 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 12:42:33 -07:00
Linus Torvalds
4def196360 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This is a set of four fairly obvious bug fixes:

   - a switch from d_find_alias to d_find_any_alias because the xattr
     code perversely takes a dentry

   - two mutex vs copy_to_user fixes from Jann Horn

   - a fix to use a sanitized size not the size userspace passed in from
     Christian Brauner"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  getxattr: use correct xattr length
  sys: don't hold uts_sem while accessing userspace memory
  userns: move user access out of the mutex
  cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
2018-08-24 09:25:39 -07:00
Misono Tomohiro
b6fdfbff07 btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.

It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.

Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-24 14:09:43 +02:00
Richard Weinberger
6e5461d774 ubifs: Remove empty file.h
This empty file sneaked into the tree by mistake.
Remove it.

Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-24 13:50:07 +02:00
Jan Kara
ee4af50ca9 udf: Fix mounting of Win7 created UDF filesystems
Win7 is creating UDF filesystems with single partition with number 8192.
Current partition descriptor scanning code does not handle this well as
it incorrectly assumes that partition numbers will form mostly contiguous
space of small numbers. This results in unmountable media due to errors
like:

UDF-fs: error (device dm-1): udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 0
UDF-fs: warning (device dm-1): udf_fill_super: No fileset found

Fix the problem by handling partition descriptors in a way that sparse
partition numbering does not matter.

Reported-and-tested-by: jean-luc malet <jeanluc.malet@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7b78fd02fb
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Jan Kara
82c82ab658 udf: Remove dead code from udf_find_fileset()
Remove dead code and slightly simplify code in udf_find_fileset().

Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Souptick Joarder
2b74030354 mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type.  As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:44 -07:00
Arnd Bergmann
a2036a1ef2 fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
Without CONFIG_MMU, we get a build warning:

  fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
   static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,

The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.

Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Souptick Joarder
0722f18620 fs/afs: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Salvatore Mesoraca
30aba6656f namei: allow restricted O_CREAT of FIFOs and regular files
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag.  The purpose
is to make data spoofing attacks harder.  This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection.  This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete.  It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector.  In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
  Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
  Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Ernesto A. Fernández
dc2572791d hfs: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernandez
aba93a92f4 hfsplus: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernández
a7ec7a4193 hfsplus: fix NULL dereference in hfsplus_lookup()
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks.  But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.

hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time.  Reorder this check to prevent a crash.

This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount).  Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297

Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Linus Torvalds
53a01c9a5f NFS client updates for Linux 4.19
Stable bufixes:
 - v3.17+: Fix an off-by-one in bl_map_stripe()
 - v4.9+: NFSv4 client live hangs after live data migration recovery
 - v4.18+: xprtrdma: Fix disconnect regression
 - v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
 - v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
 
 Features:
 - Add support for asynchronous server-side COPY operations
 
 Other bugfixes and cleanups:
 - Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
 - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
 - Immediately reschedule writeback when the server replies with an error
 - Fix excessive attribute revalidation in nfs_execute_ok()
 - Add error checking to nfs_idmap_prepare_message()
 - Use new vm_fault_t return type
 - Return a delegation when reclaiming one that the server has recalled
 - Referrals should inherit proto setting from parents
 - Make rpc_auth_create_args a const
 - Improvements to rpc_iostats tracking
 - Fix a potential reference leak when there is an error processing a callback
 - Fix rmdir / mkdir / rename nlink accounting
 - Fix updating inode change attribute
 - Fix error handling in nfsn4_sp4_select_mode()
 - Use an appropriate work queue for direct-write completion
 - Don't busy wait if NFSv4 session draining is interrupted
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
 QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
 r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
 j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
 dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
 MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
 Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
 o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
 UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
 J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
 3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
 tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
 =8+9t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "These patches include adding async support for the v4.2 COPY
  operation. I think Bruce is planning to send the server patches for
  the next release, but I figured we could get the client side out of
  the way now since it's been in my tree for a while. This shouldn't
  cause any problems, since the server will still respond with
  synchronous copies even if the client requests async.

  Features:
   - Add support for asynchronous server-side COPY operations

  Stable bufixes:
   - Fix an off-by-one in bl_map_stripe() (v3.17+)
   - NFSv4 client live hangs after live data migration recovery (v4.9+)
   - xprtrdma: Fix disconnect regression (v4.18+)
   - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
   - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)

  Other bugfixes and cleanups:
   - Optimizations and fixes involving NFS v4.1 / pNFS layout handling
   - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
   - Immediately reschedule writeback when the server replies with an
     error
   - Fix excessive attribute revalidation in nfs_execute_ok()
   - Add error checking to nfs_idmap_prepare_message()
   - Use new vm_fault_t return type
   - Return a delegation when reclaiming one that the server has
     recalled
   - Referrals should inherit proto setting from parents
   - Make rpc_auth_create_args a const
   - Improvements to rpc_iostats tracking
   - Fix a potential reference leak when there is an error processing a
     callback
   - Fix rmdir / mkdir / rename nlink accounting
   - Fix updating inode change attribute
   - Fix error handling in nfsn4_sp4_select_mode()
   - Use an appropriate work queue for direct-write completion
   - Don't busy wait if NFSv4 session draining is interrupted"

* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
  pNFS: Remove unwanted optimisation of layoutget
  pNFS/flexfiles: ff_layout_pg_init_read should exit on error
  pNFS: Treat RECALLCONFLICT like DELAY...
  pNFS: When updating the stateid in layoutreturn, also update the recall range
  NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
  NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
  NFSv4: Fix a typo in nfs4_init_channel_attrs()
  NFSv4: Don't busy wait if NFSv4 session draining is interrupted
  NFS recover from destination server reboot for copies
  NFS add a simple sync nfs4_proc_commit after async COPY
  NFS handle COPY ERR_OFFLOAD_NO_REQS
  NFS send OFFLOAD_CANCEL when COPY killed
  NFS export nfs4_async_handle_error
  NFS handle COPY reply CB_OFFLOAD call race
  NFS add support for asynchronous COPY
  NFS COPY xdr handle async reply
  NFS OFFLOAD_CANCEL xdr
  NFS CB_OFFLOAD xdr
  NFS: Use an appropriate work queue for direct-write completion
  NFSv4: Fix error handling in nfs4_sp4_select_mode()
  ...
2018-08-23 16:03:58 -07:00
Linus Torvalds
9157141c95 A mistake on my part caused me to tag my branch 6 commits too early,
missing Chuck's fixes for the problem with callbacks over GSS from
 multi-homed servers, and a smaller fix from Laura Abbott.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
 KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
 7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
 jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
 V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
 nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
 Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
 4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
 ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
 J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
 2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
 fuEjB2b9pow1Ffynat6q
 =JnLK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
  multi-homed servers.

  The only new feature is a minor bit of protocol (change_attr_type)
  which the client doesn't even use yet.

  Other than that, various bugfixes and cleanup"

* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
  sunrpc: Add comment defining gssd upcall API keywords
  nfsd: Remove callback_cred
  nfsd: Use correct credential for NFSv4.0 callback with GSS
  sunrpc: Extract target name into svc_cred
  sunrpc: Enable the kernel to specify the hostname part of service principals
  sunrpc: Don't use stack buffer with scatterlist
  rpc: remove unneeded variable 'ret' in rdma_listen_handler
  nfsd: use true and false for boolean values
  nfsd: constify write_op[]
  fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
  NFSD: Handle full-length symlinks
  NFSD: Refactor the generic write vector fill helper
  svcrdma: Clean up Read chunk path
  svcrdma: Avoid releasing a page in svc_xprt_release()
  nfsd: Mark expected switch fall-through
  sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
  nfsd: fix leaked file lock with nfs exported overlayfs
  nfsd: don't advertise a SCSI layout for an unsupported request_queue
  nfsd: fix corrupted reply to badly ordered compound
  nfsd: clarify check_op_ordering
  ...
2018-08-23 16:00:10 -07:00
Linus Torvalds
6f7948f566 This pull request contains updates for both UBI and UBIFS:
- Year 2038 preparations
 - New UBI feature to skip CRC checks of static volumes
 - A new Kconfig option to disable xattrs in UBIFS
 - Lots of fixes in UBIFS, found by our new test framework
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
 LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
 qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
 eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
 P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
 9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
 hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
 tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
 bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
 sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
 eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
 wazngueP7kW2l9bSSg==
 =gCyp
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - Year 2038 preparations

 - New UBI feature to skip CRC checks of static volumes

 - A new Kconfig option to disable xattrs in UBIFS

 - Lots of fixes in UBIFS, found by our new test framework

* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  ubifs: Set default assert action to read-only
  ubifs: Allow setting assert action as mount parameter
  ubifs: Rework ubifs_assert()
  ubifs: Pass struct ubifs_info to ubifs_assert()
  ubifs: Turn two ubifs_assert() into a WARN_ON()
  ubi: expose the volume CRC check skip flag
  ubi: provide a way to skip CRC checks
  ubifs: Use kmalloc_array()
  ubifs: Check data node size before truncate
  Revert "UBIFS: Fix potential integer overflow in allocation"
  ubifs: Add comment on c->commit_sem
  ubifs: introduce Kconfig symbol for xattr support
  ubifs: use swap macro in swap_dirty_idx
  ubifs: tnc: use monotonic znode timestamp
  ubifs: use timespec64 for inode timestamps
  ubifs: xattr: Don't operate on deleted inodes
  ubifs: gc: Fix typo
  ubifs: Fix memory leak in lprobs self-check
  ubi: Initialize Fastmap checkmapping correctly
  ubifs: Fix synced_i_size calculation for xattr inodes
  ...
2018-08-23 15:58:04 -07:00
Steve French
7753e38286 cifs: update internal module version number for cifs.ko to 2.12
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:11:10 -05:00
Nicholas Mc Guire
126c97f4d0 cifs: check kmalloc before use
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
2018-08-23 15:10:49 -05:00
Ronnie Sahlberg
e6c47dd0da cifs: check if SMB2 PDU size has been padded and suppress the warning
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.

Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :

  "SMB2 server sent bad RFC1001 len %d not %d\n"

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:46 -05:00
Ronnie Sahlberg
4d8dfafc5c cifs: create a define for how many iovs we need for an SMB2_open()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:40 -05:00
Christian Brauner
82c9a927bc getxattr: use correct xattr length
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.

A reproducer for the issue is:

  touch acl_posix

  setfacl -m user:0:rwx acl_posix

and the compile:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <attr/xattr.h>

  /* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
  int main(int argc, void **argv)
  {
          ssize_t ret1, ret2;
          char buf1[128], buf2[132];
          int fret = EXIT_SUCCESS;
          char *file;

          if (argc < 2) {
                  fprintf(stderr,
                          "Please specify a file with "
                          "\"system.posix_acl_access\" permissions set\n");
                  _exit(EXIT_FAILURE);
          }
          file = argv[1];

          ret1 = getxattr(file, "system.posix_acl_access",
                          buf1, sizeof(buf1));
          if (ret1 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          ret2 = getxattr(file, "system.posix_acl_access",
                          buf2, sizeof(buf2));
          if (ret2 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          if (ret1 != ret2) {
                  fprintf(stderr, "The value of \"system.posix_acl_"
                                  "access\" for file \"%s\" changed "
                                  "between two successive calls\n", file);
                  _exit(EXIT_FAILURE);
          }

          for (ssize_t i = 0; i < ret2; i++) {
                  if (buf1[i] == buf2[i])
                          continue;

                  fprintf(stderr,
                          "Unexpected different in byte %zd: "
                          "%02x != %02x\n", i, buf1[i], buf2[i]);
                  fret = EXIT_FAILURE;
          }

          if (fret == EXIT_SUCCESS)
                  fprintf(stderr, "Test passed\n");
          else
                  fprintf(stderr, "Test failed\n");

          _exit(fret);
  }
and run:

  ./tester acl_posix

On a non-fixed up kernel this should return something like:

  root@c1:/# ./t
  Unexpected different in byte 16: ffffffa0 != 00
  Unexpected different in byte 17: ffffff86 != 00
  Unexpected different in byte 18: 01 != 00

and on a fixed kernel:

  root@c1:~# ./t
  Test passed

Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-08-23 20:42:57 +02:00
Dan Carpenter
b9b8a41ade btrfs: use after free in btrfs_quota_enable
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path.  So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails.  That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.

Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Anand Jain
801660b040 btrfs: btrfs_shrink_device should call commit transaction at the end
Test case btrfs/164 reports use-after-free:

[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423]  btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424]  btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999]  btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800]  btrfs_ioctl+0x2b03/0x3120 [btrfs]

Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.

So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns.  Now the parent function
btrfs_rm_device calls:

        btrfs_close_bdev(device);
        call_rcu(&device->rcu, free_device_rcu);

and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.

Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.

Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Lu Fengqi
a5b7f4295e btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.

Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.

Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
de02b9f6bb Btrfs: fix data corruption when deduplicating between different files
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
  # The first byte with a value of 0xae starts at an offset (2518890)
  # which is not a multiple of the sector size.
  $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo

  # Confirm the file content is full of bytes with values 0x6b and 0xae.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Create a second file with a length not aligned to the sector size,
  # whose bytes all have the value 0x6b, so that its extent(s) can be
  # deduplicated with the first file.
  $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar

  # Now deduplicate the entire second file into a range of the first file
  # that also has all bytes with the value 0x6b. The destination range's
  # end offset must not be aligned to the sector size and must be less
  # then the offset of the first byte with the value 0xae (byte at offset
  # 2518890).
  $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo

  # The bytes in the range starting at offset 2515659 (end of the
  # deduplication range) and ending at offset 2519040 (start offset
  # rounded up to the block size) must all have the value 0xae (and not
  # replaced with 0x00 values). In other words, we should have exactly
  # the same data we had before we asked for deduplication.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Unmount the filesystem and mount it again. This guarantees any file
  # data in the page cache is dropped.
  $ umount /dev/sdb
  $ mount /dev/sdb /mnt

  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
  11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
  # value of 0xae, data corruption happened due to the deduplication
  # operation.

So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:

  1) Source file's range ends at its i_size;
  2) Source file's i_size is not aligned to the sector size;
  3) Destination range does not cross the i_size of the destination file.

Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
d4682ba03e Btrfs: sync log after logging new name
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ mkdir /mnt/A/C
  $ touch /mnt/B/foo
  $ xfs_io -c "fsync" /mnt/B/foo
  $ ln /mnt/B/foo /mnt/A/C/foo
  $ xfs_io -c "fsync" /mnt/A
  <power failure>

After the power failure, directory "A" does not exist, despite the explicit
fsync on it.

Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).

A test case for fstests follows soon.

Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Chuck Lever
a26dd64f54 nfsd: Remove callback_cred
Clean up: The global callback_cred is no longer used, so it can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
cb25e7b293 nfsd: Use correct credential for NFSv4.0 callback with GSS
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):

May  9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred

The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.

RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database.  This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...

In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.

The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.

This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
9abdda5dda sunrpc: Extract target name into svc_cred
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.

Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Linus Torvalds
fe6f0ed0da f2fs-for-4.19-rc1
In this round, we've tuned f2fs to improve general performance by serializing
 block allocation and enhancing discard flows like fstrim which avoids user IO
 contention. And we've added fsync_mode=nobarrier which gives an option to user
 where it skips issuing cache_flush commands to underlying flash storage. And
 there are many bug fixes related to fuzzed images, revoked atomic writes, quota
 ops, and minor direct IO.
 
 Enhancement:
  - add fsync_mode=nobarrier which bypasses cache_flush command
  - enhance the discarding flow which avoids user IOs and issues in LBA order
  - readahead some encrypted blocks during GC
  - enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
  - enhance nat_bits behavior
  - set -o discard by default
  - set REQ_RAHEAD to bio in ->readpages
 
 Bug fixes:
  - fix a corner case to corrupt atomic_writes revoking flow
  - revisit i_gc_rwsem to fix race conditions
  - fix some dio behaviors captured by xfstests
  - correct handling errors given by quota-related failures
  - add many sanity check flows to avoid fuzz test failures
  - add more error number propagation to their callers
  - fix several corner cases to continue fault injection w/ shutdown loop
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
 UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
 sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
 Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
 UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
 miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
 ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
 9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
 WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
 BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
 L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
 +VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
 =zgbL
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tuned f2fs to improve general performance by
  serializing block allocation and enhancing discard flows like fstrim
  which avoids user IO contention. And we've added fsync_mode=nobarrier
  which gives an option to user where it skips issuing cache_flush
  commands to underlying flash storage. And there are many bug fixes
  related to fuzzed images, revoked atomic writes, quota ops, and minor
  direct IO.

  Enhancements:
   - add fsync_mode=nobarrier which bypasses cache_flush command
   - enhance the discarding flow which avoids user IOs and issues in
     LBA order
   - readahead some encrypted blocks during GC
   - enable in-memory inode checksum to verify the blocks if
     F2FS_CHECK_FS is set
   - enhance nat_bits behavior
   - set -o discard by default
   - set REQ_RAHEAD to bio in ->readpages

  Bug fixes:
   - fix a corner case to corrupt atomic_writes revoking flow
   - revisit i_gc_rwsem to fix race conditions
   - fix some dio behaviors captured by xfstests
   - correct handling errors given by quota-related failures
   - add many sanity check flows to avoid fuzz test failures
   - add more error number propagation to their callers
   - fix several corner cases to continue fault injection w/ shutdown
     loop"

* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
  f2fs: readahead encrypted block during GC
  f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
  f2fs: fix performance issue observed with multi-thread sequential read
  f2fs: fix to skip verifying block address for non-regular inode
  f2fs: rework fault injection handling to avoid a warning
  f2fs: support fault_type mount option
  f2fs: fix to return success when trimming meta area
  f2fs: fix use-after-free of dicard command entry
  f2fs: support discard submission error injection
  f2fs: split discard command in prior to block layer
  f2fs: wake up gc thread immediately when gc_urgent is set
  f2fs: fix incorrect range->len in f2fs_trim_fs()
  f2fs: refresh recent accessed nat entry in lru list
  f2fs: fix avoid race between truncate and background GC
  f2fs: avoid race between zero_range and background GC
  f2fs: fix to do sanity check with block address in main area v2
  f2fs: fix to do sanity check with inline flags
  f2fs: fix to reset i_gc_failures correctly
  f2fs: fix invalid memory access
  f2fs: fix to avoid broken of dnode block list
  ...
2018-08-22 13:29:39 -07:00
Miklos Szeredi
6faf05c2b2 ovl: set I_CREATING on inode being created
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 13:15:25 -07:00
Linus Torvalds
cd9b44f907 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - procfs updates

 - various misc things

 - more y2038 fixes

 - get_maintainer updates

 - lib/ updates

 - checkpatch updates

 - various epoll updates

 - autofs updates

 - hfsplus

 - some reiserfs work

 - fatfs updates

 - signal.c cleanups

 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 12:34:08 -07:00
Arnd Bergmann
3e811f053a fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.

In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.

This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().

Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
d9edcbc42c adfs: use timespec64 for time conversion
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.

Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
f423420c23 fat: propagate 64-bit inode timestamps
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps.  To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.

Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.

[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
OGAWA Hirofumi
0afa962666 fat: validate ->i_start before using
On corrupted FATfs may have invalid ->i_start.  To handle it, this checks
->i_start before using, and return proper error code.

Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Wentao Wang
f663b5b38f fat: add FITRIM ioctl for FAT file system
Add FITRIM ioctl for FAT file system

[witallwang@gmail.com: use u64s]
  Link: http://lkml.kernel.org/r/87h8l37hub.fsf@mail.parknet.co.jp
[hirofumi@mail.parknet.co.jp: bug fixes, coding style fixes, add signal check]
Link: http://lkml.kernel.org/r/87fu10anhj.fsf@mail.parknet.co.jp
Signed-off-by: Wentao Wang <witallwang@gmail.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Jann Horn
a13f085d11 reiserfs: fix broken xattr handling (heap corruption, bad retval)
This fixes the following issues:

- When a buffer size is supplied to reiserfs_listxattr() such that each
  individual name fits, but the concatenation of all names doesn't fit,
  reiserfs_listxattr() overflows the supplied buffer.  This leads to a
  kernel heap overflow (verified using KASAN) followed by an out-of-bounds
  usercopy and is therefore a security bug.

- When a buffer size is supplied to reiserfs_listxattr() such that a
  name doesn't fit, -ERANGE should be returned.  But reiserfs instead just
  truncates the list of names; I have verified that if the only xattr on a
  file has a longer name than the supplied buffer length, listxattr()
  incorrectly returns zero.

With my patch applied, -ERANGE is returned in both cases and the memory
corruption doesn't happen anymore.

Credit for making me clean this code up a bit goes to Al Viro, who pointed
out that the ->actor calling convention is suboptimal and should be
changed.

Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
8b73ce6a4b reiserfs: change j_timestamp type to time64_t
This uses the deprecated time_t type but is write-only, and could be
removed, but as Jeff explains, having a timestamp can be usefule for
post-mortem analysis in crash dumps.

In order to remove one of the last instances of time_t, this changes the
type to time64_t, same as j_trans_start_time.

Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
5b1d149c89 reiserfs: remove obsolete print_time function
Before linux-2.4.6, print_time() was used to pretty-print an inode time
when running reiserfs in user space, after that it has become obsolete and
is still a bit incorrect: It behaves differently on 32-bit and 64-bit
machines, and uses a static buffer to hold a string, which could lead to
undefined behavior if we ever called this from multiple places
simultaneously.

Since we always want to treat the timestamps as 'unsigned' anyway, simply
printing them as an integer is both simpler and safer while avoiding the
deprecated time_t type.

Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
34d082604a reiserfs: use monotonic time for j_trans_start_time
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038,
and gives surprising results with a concurrent settimeofday().

This changes the reiserfs journal timestamps to use ktime_get_seconds()
instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.

In the procfs output, the monotonic timestamp needs to be converted back
to CLOCK_REALTIME to keep the existing ABI.

Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
f168d9fd63 hfsplus: drop ACL support
The HFS+ Access Control Lists have not worked at all for the past five
years, and nobody seems to have noticed.  Besides, POSIX draft ACLs are
not compatible with MacOS.  Drop the feature entirely.

Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
afd6c9e1f5 hfsplus: fix decomposition of Hangul characters
Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.

The Korean alphabet is special because its normalization is done without a
table.  The module deals with it correctly when composing, but forgets
about it for the decomposition.

Fix this using the Hangul decomposition function provided in the Unicode
Standard.  The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table.  This is actually also a bug because reordering may
still be needed, but for now leave it as it is.

The patch will cause trouble for Hangul filenames already created by the
module in the past.  This shouldn't really be concern because its main
purpose was always sharing with macOS.  If a user actually needs to access
such a file the nodecompose mount option should be enough.

Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
31651c6071 hfsplus: avoid deadlock on file truncation
After an extent is removed from the extent tree, the corresponding bits
are also cleared from the block allocation file.  This is currently done
without releasing the tree lock.

The problem is that the allocation file has extents of its own; if it is
fragmented enough, some of them may be in the extent tree as well, and
hfsplus_get_block() will try to take the lock again.

To avoid deadlock, only hold the extent tree lock during the actual tree
operations.

Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Tetsuo Handa
7464726cb5 hfsplus: don't return 0 when fill_super() failed
syzbot is reporting NULL pointer dereference at mount_fs() [1].  This is
because hfsplus_fill_super() is by error returning 0 when
hfsplus_fill_super() detected invalid filesystem image, and mount_bdev()
is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL,
and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if
root == NULL.  Fix this by returning -EINVAL when hfsplus_fill_super()
detected invalid filesystem image.

[1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0

Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com>
Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Souptick Joarder
c8ed98cd88 fs/nilfs2/file.c: use new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite handler.

Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Arnd Bergmann
21a1a52dbd nilfs2: use 64-bit superblock timstamps
The mount time field in the superblock uses a 64-bit timestamp, but
calling get_seconds() may truncate the current time to 32 bits.

This changes it to ktime_get_real_seconds() to avoid the potential
overflow.

Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
cbf6898fd6 autofs: add AUTOFS_EXP_FORCED flag
The userspace automount(8) daemon is meant to perform a forced expire when
sent a SIGUSR2.

But since the expiration is routed through the kernel and the kernel
doesn't send an expire request if the mount is busy this hasn't worked at
least since autofs version 5.

Add an AUTOFS_EXP_FORCED flag to allow implemention of the feature and
bump the protocol version so user space can check if it's implemented if
needed.

Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
e5c85e1fe1 autofs: make expire flags usage consistent with v5 params
Make the usage of the expire flags consistent by naming the expire flags
the same as it is named in the version 5 miscelaneous ioctl parameters and
only check the bit flags when needed.

Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
571bc35c42 autofs: make autofs_expire_indirect() static
autofs_expire_indirect() isn't used outside of fs/autofs/expire.c so make
it static.

Link: http://lkml.kernel.org/r/152937733512.21213.10509996499623738446.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
5d30517d67 autofs: make autofs_expire_direct() static
autofs_expire_direct() isn't used outside of fs/autofs/expire.c so make it
static.

Link: http://lkml.kernel.org/r/152937732944.21213.11821977712410930973.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
d1055565bd autofs: fix clearing AUTOFS_EXP_LEAVES in autofs_expire_indirect()
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to
should_expire() in autofs_expire_indirect() but the parameter passed in
the second call is incorrect.

Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long
time but might be needed in the future so fix it rather than remove the
expire leaves functionality.

Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
2fd9944f0f autofs: fix inconsistent use of now variable
The global variable "now" in fs/autofs/expire.c is used in an inconsistent
way, sometimes using jiffies directly, and sometimes using the "now"
variable, and setting it isn't done consistently either.

But the autofs dentry info last_used field is only updated during path
walks or during expire so jiffies can be used directly and the global
variable "now" removed.

Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
d4d79b8195 autofs: fix directory and symlink access
Depending on how it is configured the autofs user space daemon can leave
in use mounts mounted at exit and re-connect to them at start up.  But for
this to work best the state of the autofs file system needs to be left
intact over the restart.

Also, at system shutdown, mounts in an autofs file system might be
umounted exposing a mount point trigger for which subsequent access can
lead to a hang.  So recent versions of automount(8) now does its best to
set autofs file system mounts catatonic at shutdown.

When autofs file system mounts are catatonic it's currently possible to
create and remove directories and symlinks which can be a problem at
restart, as described above.

So return EACCES in the directory, symlink and unlink methods if the
autofs file system is catatonic.

Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
992991c03c fs/eventpoll.c: simplify ep_is_linked() callers
Instead of having each caller pass the rdllink explicitly, just have
ep_is_linked() pass it while the callers just need the epi pointer.  This
helper is all about the rdllink, and this change, furthermore, improves
the function's self documentation.

Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
679abf381a fs/eventpoll.c: loosen irq safety in ep_poll()
Similar to other calls, ep_poll() is not called with interrupts disabled,
and we can therefore avoid the irq save/restore dance and just disable
local irqs.  In fact, the call should never be called in irq context at
all, considering that the only path is

epoll_wait(2) -> do_epoll_wait() -> ep_poll().

When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
epoll_wait(2) microbenchmark, the following performance improvements are
seen:

    # threads       vanilla         dirty
	 1          1805587	    2106412
	 2          1854064	    2090762
	 4          1805484	    2017436
	 8          1751222	    1974475
	 16         1725299	    1962104
	 32         1378463	    1571233
	 64          787368	     900784

Which is a pretty constantly near 15%.

Also add a lockdep check such that we detect any mischief before
deadlocking.

Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
514056d506 fs/eventpoll.c: simply CONFIG_NET_RX_BUSY_POLL ifdefery
... 'tis easier on the eye.

[akpm@linux-foundation.org: use inlines rather than macros]
Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
92e6417840 s/epoll: robustify irq safety with lockdep_assert_irqs_enabled()
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
save and restore interrupts when dealing with the ep->wq.lock.  These are
ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
and ep_remove.

[akpm@linux-foundation.org: remove too-obvious comments]
Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Davidlohr Bueso
304b18b8d6 fs/epoll: loosen irq safety in epoll_insert() and epoll_remove()
Both functions are similar to the context of ep_modify(), called via
epoll_ctl(2).  Just like ep_modify(), saving and restoring interrupts is
an overkill in these calls as it will never be called with irqs disabled.
While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
be called when releasing the file, but this also complies with the above.

Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Davidlohr Bueso
002b343669 fs/epoll: loosen irq safety in ep_scan_ready_list()
Patch series "fs/epoll: loosen irq safety when possible".

Both patches replace saving+restoring interrupts when taking the ep->lock
(now the waitqueue lock), with just disabling local irqs.  This shows
immediate performance benefits in patch 1 for an epoll workload running on
Xen.  The main concern we need to have with this sort of changes in epoll
is the ep_poll_callback() which is passed to the wait queue wakeup and is
done very often under irq context, this patch does not touch this call.

Patches have been tested pretty heavily with the customer workload,
microbenchmarks, ltp testcases and two high level workloads that use epoll
under the hood: nginx and libevent benchmarks.

This patch (of 2):

Saving and restoring interrupts in ep_scan_ready_list() is an
overkill as it is never called with interrupts disabled. Loosen
this to simply disabling local irqs such that archs where managing
irqs is expensive or virtual environments. This patch yields
some throughput improvements on a workload that is epoll intensive
running on a single Xen DomU.

1 Job	 7500	-->    8800 enq/s  (+17%)
2 Jobs	14000   -->   15200 enq/s  (+8%)
3 Jobs	20500	-->   22300 enq/s  (+8%)
4 Jobs	25000   -->   28000 enq/s  (+8-12)%

On bare metal:

For a 2-socket 40-core (ht) IvyBridge on a few workloads, unfortunately I
don't have a xen environment and the results for Xen I do have (which
numbers are in patch 1) I don't have the actual workload, so cannot
compare them directly.

1) Different configurations were used for a epoll_wait (pipes io)
   microbench (http://linux-scalability.org/epoll/epoll-test.c) and shows
   around a 7-10% improvement in overall total number of times the
   epoll_wait() loops when using both regular and nested epolls, so very
   raw numbers, but measurable nonetheless.

# threads	vanilla		dirty
     1		1677717		1805587
     2		1660510		1854064
     4		1610184		1805484
     8		1577696		1751222
     16		1568837		1725299
     32		1291532		1378463
     64		 752584		 787368

   Note that stddev is pretty small.

2) Another pipe test, which shows no real measurable improvement.
   (http://www.xmailserver.org/linux-patches/pipetest.c)

Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Matthew Wilcox
c430d1e848 userfaultfd: use fault_wqh lock
The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance.  Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.

Switch to the internally locked waitqueue helpers instead.  This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.

[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Christoph Hellwig
ee8ef0a4b1 epoll: use the waitqueue lock to protect ep->wq
Patch series "waitqueue lockdep annotation", v3.

This series adds a strategic lockdep_assert_held to __wake_up_common to
ensure callers really do hold the wait_queue_head lock when calling the
unlocked wake_up variants.  It turns out epoll did not do this for a
fairly common path (hit all the time by systemd during bootup), so the
second patch fixed this instance as well.

This patch (of 3):

The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock.  Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.

Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Omar Sandoval
23c85094fe proc/kcore: add vmcoreinfo note to /proc/kcore
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps.  A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.

Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
bf991c2231 proc/kcore: optimize multiple page reads
The current code does a full search of the segment list every time for
every page.  This is wasteful, since it's almost certain that the next
page will be in the same segment.  Instead, check if the previous segment
covers the current page before doing the list search.

Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
37e949bd52 proc/kcore: clean up ELF header generation
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3.  Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort.  It's cleaner and
more efficient to handle the three separately.

Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
3673fb08db proc/kcore: hold lock during read
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path.  This is preparation for the
next change.

[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
b66fb005c9 proc/kcore: fix memory hotplug vs multiple opens race
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:

CPU0                              CPU1
// hotplug event 1
kcore_need_update = 1

open_kcore()                      open_kcore()
    kcore_update_ram()                kcore_update_ram()
        // Walk RAM                       // Walk RAM
        __kcore_update_ram()              __kcore_update_ram()
            kcore_need_update = 0

// hotplug event 2
kcore_need_update = 1
                                              kcore_need_update = 0

Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2.  The RAM entries will
therefore be stale until there is another hotplug event.

This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.

Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
0b172f845f proc/kcore: replace kclist_lock rwlock with rwsem
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.

Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
bf53183164 proc/kcore: don't grab lock for memory hotplug notifier
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram().  This is
preparation for converting kclist_lock to an rwsem.

Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
a8dd9c4df1 proc/kcore: don't grab lock for kclist_add()
Patch series "/proc/kcore improvements", v4.

This series makes a few improvements to /proc/kcore.  It fixes a couple of
small issues in v3 but is otherwise the same.  Patches 1, 2, and 3 are
prep patches.  Patch 4 is a fix/cleanup.  Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read().  Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.

This patch (of 9):

kclist_add() is only called at init time, so there's no point in grabbing
any locks.  We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.

While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.

Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
James Morse
df865e8337 fs/proc/kcore.c: use __pa_symbol() for KCORE_TEXT list entries
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.

This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.

Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.

Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Souptick Joarder
36f062042b fs/proc/vmcore.c: use new typedef vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno.  Once all instances are
converted, vm_fault_t will become a distinct type.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
9a27e97aaa proc: use "unsigned int" in /proc/stat hook
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.

Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
891ae71dc4 proc: spread "const" a bit
Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
f6d2f584d8 proc: use macro in /proc/latency hook
->latency_record is defined as

	struct latency_record[LT_SAVECOUNT];

so use the same macro whie iterating.

Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
41089b6d3e proc: save 2 atomic ops on write to "/proc/*/attr/*"
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.

Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:

	rcu_read_lock();
	task = pid_task(...);
	rcu_read_unlock();
	if (!task)
		return -ESRCH;
	if (task != current)
		return -EACCESS:

P.S.: rename "length" variable.	Code like this

	length = -EINVAL;

should not exist.

Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
a44937fe4e proc: put task earlier in /proc/*/fail-nth
Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
8d48b2e044 proc: smaller readlock section in readdir("/proc")
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.

Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Arnd Bergmann
bdf228a272 fs/proc/uptime.c: use ktime_get_boottime_ts64
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().

Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
2d6e4e822a proc: fixup PDE allocation bloat
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.

Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.

Put BUILD_BUG_ON to know when PDE size has gotten out of control.

[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
  Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Dennis Zhou (Facebook)
7e8a6304d5 /proc/meminfo: add percpu populated pages count
Currently, percpu memory only exposes allocation and utilization
information via debugfs.  This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters.  This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory.  Let's make it easier for someone to
identify how much memory is being used.

This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use.  This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level.  Metadata is excluded.  I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata.  It also makes this calculation light.

Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Andrew Morton
a670468f5e mm: zero out the vma in vma_init()
Rather than in vm_area_alloc().  To ensure that the various oddball
stack-based vmas are in a good state.  Some of the callers were zeroing
them out, others were not.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
258f669e7e mm: /proc/pid/smaps_rollup: convert to single value seq_file
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's.  However, the rollup file doesn't print anything
for each vma, only accumulate the stats.

There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.

Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.

This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally.  This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.

[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

[vbabka@suse.c: use seq_file infrastructure]
  Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
f1547959d9 mm: /proc/pid/smaps: factor out common stats printing
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.

[vbabka@suse.cz: add const, per Alexey]
  Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
8e68d689af mm: /proc/pid/smaps: factor out mem stats gathering
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.

Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
871305bb20 mm: /proc/pid/*maps remove is_pid and related wrappers
Patch series "cleanups and refactor of /proc/pid/smaps*".

The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code.  Patches 2 and 3 are
preparations for that.  Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.

Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures.  Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h).  However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.

So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.

[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

This patch (of 4):

Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps.  However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").

Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.

Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Ian Kent
0633da48f0 autofs: fix autofs_sbi() does not check super block type
autofs_sbi() does not check the superblock magic number to verify it has
been given an autofs super block.

Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net
Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Jeremy Cline
7b6924d94a fs/quota: Fix spectre gadget in do_quotactl
'type' is user-controlled, so sanitize it after the bounds check to
avoid using it in speculative execution. This covers the following
potential gadgets detected with the help of smatch:

* fs/ext4/super.c:5741 ext4_quota_read() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/ext4/super.c:5778 ext4_quota_write() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/f2fs/super.c:1552 f2fs_quota_read() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/f2fs/super.c:1608 f2fs_quota_write() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/quota/dquot.c:412 mark_info_dirty() warn: potential spectre issue
  'sb_dqopt(sb)->info' [w]
* fs/quota/dquot.c:933 dqinit_needed() warn: potential spectre issue
  'dquots' [r]
* fs/quota/dquot.c:2112 dquot_commit_info() warn: potential spectre
  issue 'dqopt->ops' [r]
* fs/quota/dquot.c:2362 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->files' [w] (local cap)
* fs/quota/dquot.c:2369 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->ops' [w] (local cap)
* fs/quota/dquot.c:2370 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->info' [w] (local cap)
* fs/quota/quota.c:110 quota_getfmt() warn: potential spectre issue
  'sb_dqopt(sb)->info' [r]
* fs/quota/quota_v2.c:84 v2_check_quota_file() warn: potential spectre
  issue 'quota_magics' [w]
* fs/quota/quota_v2.c:85 v2_check_quota_file() warn: potential spectre
  issue 'quota_versions' [w]
* fs/quota/quota_v2.c:96 v2_read_file_info() warn: potential spectre
  issue 'dqopt->info' [r]
* fs/quota/quota_v2.c:172 v2_write_file_info() warn: potential spectre
  issue 'dqopt->info' [r]

Additionally, a quick inspection indicates there are array accesses with
'type' in quota_on() and quota_off() functions which are also addressed
by this.

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-22 18:17:48 +02:00
Jeremy Cline
64d9d13828 fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
XQM_MAXQUOTAS and MAXQUOTAS are, it appears, equivalent. Replace all
usage of XQM_MAXQUOTAS and remove it along with the unused XQM_*QUOTA
definitions.

Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-22 18:17:29 +02:00
Matthew Wilcox
0f0a0e54a2 devpts: Convert to new IDA API
ida_alloc_max() matches what this driver wants to do.  Also removes a
call to ida_pre_get().  We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:17 -04:00
Matthew Wilcox
169b480e4c fs: Convert namespace IDAs to new API
We don't need to keep track of the starting value; the IDA is efficient.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:17 -04:00
Matthew Wilcox
5a66847e44 fs: Convert unnamed_dev_ida to new API
The new API is much easier for this user.  Also add kerneldoc for
get_anon_bdev().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:16 -04:00
Linus Torvalds
ad1d697358 fuse update for 4.19
This contains various bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
 PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
 kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
 =GJI7
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:
 "Various bug fixes and cleanups"

* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: reduce allocation size for splice_write
  fuse: use kvmalloc to allocate array of pipe_buffer structs.
  fuse: convert last timespec use to timespec64
  fs: fuse: Adding new return type vm_fault_t
  fuse: simplify fuse_abort_conn()
  fuse: Add missed unlock_page() to fuse_readpages_fill()
  fuse: Don't access pipe->buffers without pipe_lock()
  fuse: fix initial parallel dirops
  fuse: Fix oops at process_init_reply()
  fuse: umount should wait for all requests
  fuse: fix unlocked access to processing queue
  fuse: fix double request_end()
2018-08-21 18:47:36 -07:00
Linus Torvalds
d9a185f8b4 overlayfs update for 4.19
This contains two new features:
 
  1) Stack file operations: this allows removal of several hacks from the
     VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.
 
  2) Metadata only copy-up: when file is on lower layer and only metadata is
     modified (except size) then only copy up the metadata and continue to
     use the data from the lower file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
 PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
 fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
 =QDXH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
2018-08-21 18:19:09 -07:00
Linus Torvalds
c22fc16d17 Changes since last update:
- Fix an uninitialized variable
 - Don't use obviously garbage AG header counters to calculate
   transaction reservations
 - Trigger icount recalculation on bad icount when monting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
 tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
 XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
 CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
 17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
 nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
 ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
 M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
 YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
 wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
 f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
 19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
 =+0Eq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix an uninitialized variable

 - Don't use obviously garbage AG header counters to calculate
   transaction reservations

 - Trigger icount recalculation on bad icount when mounting

* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix WARN_ON_ONCE on uninitialized variable
  xfs: sanity check ag header values in xrep_calc_ag_resblks
  xfs: recalculate summary counters at mount time if icount is bad
2018-08-21 18:15:47 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Trond Myklebust
0af4c8be97 pNFS: Remove unwanted optimisation of layoutget
If we knew that the file was empty, we wouldn't be asking for a layout.
Any optimisation here is already done before calling pnfs_update_layout().
As it stands, we sometimes end up doing an unnecessary inband read to
the MDS even when holding a layout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-21 13:39:08 -04:00
Trond Myklebust
1c1aeaf143 pNFS/flexfiles: ff_layout_pg_init_read should exit on error
If we get an error while retrieving the layout, then we should
report it rather than falling back to I/O through the MDS.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-21 13:39:05 -04:00
Eric Sandeen
09a4e0be58 isofs: reject hardware sector size > 2048 bytes
The largest block size supported by isofs is ISOFS_BLOCK_SIZE (2048), but
isofs_fill_super calls sb_min_blocksize and sets the blocksize to the
device's logical block size if it's larger than what we ended up with after
option parsing.

If for some reason we try to mount a hard 4k device as an isofs filesystem,
we'll set opt.blocksize to 4096, and when we try to read the superblock
we found via:

        block = iso_blknum << (ISOFS_BLOCK_BITS - s->s_blocksize_bits)

with s_blocksize_bits greater than ISOFS_BLOCK_BITS, we'll have a negative
shift and the bread will fail somewhat cryptically:

  isofs_fill_super: bread failed, dev=sda, iso_blknum=17, block=-2147483648

It seems best to just catch and clearly reject mounts of such a device.

Reported-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-21 11:37:41 +02:00
Chao Yu
6aa58d8ad2 f2fs: readahead encrypted block during GC
During GC, for each encrypted block, we will read block synchronously
into meta page, and then submit it into current cold data log area.

So this block read model with 4k granularity can make poor performance,
like migrating non-encrypted block, let's readahead encrypted block
as well to improve migration performance.

To implement this, we choose meta page that its index is old block
address of the encrypted block, and readahead ciphertext into this
page, later, if readaheaded page is still updated, we will load its
data into target meta page, and submit the write IO.

Note that for OPU, truncation, deletion, we need to invalid meta
page after we invalid old block address, to make sure we won't load
invalid data from target meta page during encrypted block migration.

for ((i = 0; i < 1000; i++))
do {
        xfs_io -f /mnt/f2fs/dir/$i -c "pwrite 0 128k" -c "fsync";
} done

for ((i = 0; i < 1000; i+=2))
do {
        rm /mnt/f2fs/dir/$i;
} done

ret = ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, 0);

Before:
              gc-6549  [001] d..1 214682.212797: block_rq_insert: 8,32 RA 32768 () 786400 + 64 [gc]
              gc-6549  [001] d..1 214682.212802: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.213892: block_bio_queue: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213899: block_getrq: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213902: block_plug: [gc]
              gc-6549  [001] d..1 214682.213905: block_rq_insert: 8,32 R 4096 () 67494144 + 8 [gc]
              gc-6549  [001] d..1 214682.213908: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226405: block_bio_queue: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226412: block_getrq: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226414: block_plug: [gc]
              gc-6549  [001] d..1 214682.226417: block_rq_insert: 8,32 R 4096 () 67494152 + 8 [gc]
              gc-6549  [001] d..1 214682.226420: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226904: block_bio_queue: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226910: block_getrq: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226911: block_plug: [gc]
              gc-6549  [001] d..1 214682.226914: block_rq_insert: 8,32 R 4096 () 67494160 + 8 [gc]
              gc-6549  [001] d..1 214682.226916: block_unplug: [gc] 1

After:
              gc-5678  [003] .... 214327.025906: block_bio_queue: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025908: block_bio_backmerge: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025915: block_bio_queue: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025917: block_bio_backmerge: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025923: block_bio_queue: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025925: block_bio_backmerge: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025932: block_bio_queue: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025934: block_bio_backmerge: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025941: block_bio_queue: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025943: block_bio_backmerge: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025953: block_bio_queue: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025955: block_bio_backmerge: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025962: block_bio_queue: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025964: block_bio_backmerge: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025970: block_bio_queue: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.025972: block_bio_backmerge: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.026000: block_bio_queue: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] .... 214327.026019: block_getrq: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] d..1 214327.026021: block_rq_insert: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] d..1 214327.026023: block_unplug: [gc] 1
              gc-5678  [003] d..1 214327.026026: block_rq_issue: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] .... 214327.026046: block_plug: [gc]

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
6f8d445506 f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.

If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
853137cef4 f2fs: fix performance issue observed with multi-thread sequential read
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.

Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS

Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).

After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Linus Torvalds
7140ad3898 Updates for v4.19:
- Restructure of lockdep and latency tracers
 
    This is the biggest change. Joel Fernandes restructured the hooks
    from irqs and preemption disabling and enabling. He got rid of
    a lot of the preprocessor #ifdef mess that they caused.
 
    He turned both lockdep and the latency tracers to use trace events
    inserted in the preempt/irqs disabling paths. But unfortunately,
    these started to cause issues in corner cases. Thus, parts of the
    code was reverted back to where lockde and the latency tracers
    just get called directly (without using the trace events).
    But because the original change cleaned up the code very nicely
    we kept that, as well as the trace events for preempt and irqs
    disabling, but they are limited to not being called in NMIs.
 
  - Have trace events use SRCU for "rcu idle" calls. This was required
    for the preempt/irqs off trace events. But it also had to not
    allow them to be called in NMI context. Waiting till Paul makes
    an NMI safe SRCU API.
 
  - New notrace SRCU API to allow trace events to use SRCU.
 
  - Addition of mcount-nop option support
 
  - SPDX headers replacing GPL templates.
 
  - Various other fixes and clean ups.
 
  - Some fixes are marked for stable, but were not fully tested
    before the merge window opened.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
 c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
 =vgEr
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Restructure of lockdep and latency tracers

   This is the biggest change. Joel Fernandes restructured the hooks
   from irqs and preemption disabling and enabling. He got rid of a lot
   of the preprocessor #ifdef mess that they caused.

   He turned both lockdep and the latency tracers to use trace events
   inserted in the preempt/irqs disabling paths. But unfortunately,
   these started to cause issues in corner cases. Thus, parts of the
   code was reverted back to where lockdep and the latency tracers just
   get called directly (without using the trace events). But because the
   original change cleaned up the code very nicely we kept that, as well
   as the trace events for preempt and irqs disabling, but they are
   limited to not being called in NMIs.

 - Have trace events use SRCU for "rcu idle" calls. This was required
   for the preempt/irqs off trace events. But it also had to not allow
   them to be called in NMI context. Waiting till Paul makes an NMI safe
   SRCU API.

 - New notrace SRCU API to allow trace events to use SRCU.

 - Addition of mcount-nop option support

 - SPDX headers replacing GPL templates.

 - Various other fixes and clean ups.

 - Some fixes are marked for stable, but were not fully tested before
   the merge window opened.

* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  tracing: Fix SPDX format headers to use C++ style comments
  tracing: Add SPDX License format tags to tracing files
  tracing: Add SPDX License format to bpf_trace.c
  blktrace: Add SPDX License format header
  s390/ftrace: Add -mfentry and -mnop-mcount support
  tracing: Add -mcount-nop option support
  tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
  tracing: Handle CC_FLAGS_FTRACE more accurately
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Simplify uprobe_register() body
  tracepoints: Free early tracepoints after RCU is initialized
  uprobes: Use synchronize_rcu() not synchronize_sched()
  tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
  ftrace: Remove unused pointer ftrace_swapper_pid
  tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing/irqsoff: Handle preempt_count for different configs
  tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing: irqsoff: Account for additional preempt_disable
  trace: Use rcu_dereference_raw for hooks from trace-event subsystem
  tracing/kprobes: Fix within_notrace_func() to check only notrace functions
  ...
2018-08-20 18:32:00 -07:00
Linus Torvalds
0a78ac4b9b The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).  Also included
 y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
 Chengguang and Zheng's feature bit infrastructure for the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
 MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
 kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
 TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
 J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
 nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
 =AtgY
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main things are support for cephx v2 authentication protocol and
  basic support for rbd images within namespaces (myself).

  Also included are y2038 conversion patches from Arnd, a pile of
  miscellaneous fixes from Chengguang and Zheng's feature bit
  infrastructure for the filesystem"

* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
  ceph: don't drop message if it contains more data than expected
  ceph: support cephfs' own feature bits
  crush: fix using plain integer as NULL warning
  libceph: remove unnecessary non NULL check for request_key
  ceph: refactor error handling code in ceph_reserve_caps()
  ceph: refactor ceph_unreserve_caps()
  ceph: change to void return type for __do_request()
  ceph: compare fsc->max_file_size and inode->i_size for max file size limit
  ceph: add additional size check in ceph_setattr()
  ceph: add additional offset check in ceph_write_iter()
  ceph: add additional range check in ceph_fallocate()
  ceph: add new field max_file_size in ceph_fs_client
  libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
  libceph: check authorizer reply/challenge length before reading
  libceph: implement CEPHX_V2 calculation mode
  libceph: add authorizer challenge
  libceph: factor out encrypt_authorizer()
  libceph: factor out __ceph_x_decrypt()
  libceph: factor out __prepare_write_connect()
  libceph: store ceph_auth_handshake pointer in ceph_connection
  ...
2018-08-20 18:26:55 -07:00
Jan Kara
d3bc0fa841 fsnotify: fix false positive warning on inode delete
When inode is getting deleted and someone else holds reference to a mark
attached to the inode, we just detach the connector from the inode. In
that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
decide to recalculate mask for the inode and __fsnotify_recalc_mask()
will WARN about invalid connector type:

WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
__fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139

Actually there's no reason to warn about detached connector in
__fsnotify_recalc_mask() so just silently skip updating the mask in such
case.

Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
Fixes: 3ac70bfcde ("fsnotify: add helper to get mask from connector")
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-20 13:55:45 +02:00
Linus Torvalds
a18d783fed Driver core patches for 4.19-rc1
Here are all of the driver core and related patches for 4.19-rc1.
 
 Nothing huge here, just a number of small cleanups and the ability to
 now stop the deferred probing after init happens.
 
 All of these have been in linux-next for a while with only a merge issue
 reported.  That merge issue is in fs/sysfs/group.c and Stephen has
 posted the diff of what it should be to resolve this.  I'll follow up
 with that diff to this pull request.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g86Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynyXQCePaZSW8wft4b7nLN8RdZ98ATBru0Ani10lrJa
 HQeQJRNbWU1AZ0ym7695
 =tOaH
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here are all of the driver core and related patches for 4.19-rc1.

  Nothing huge here, just a number of small cleanups and the ability to
  now stop the deferred probing after init happens.

  All of these have been in linux-next for a while with only a merge
  issue reported"

* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
  base: core: Remove WARN_ON from link dependencies check
  drivers/base: stop new probing during shutdown
  drivers: core: Remove glue dirs from sysfs earlier
  driver core: remove unnecessary function extern declare
  sysfs.h: fix non-kernel-doc comment
  PM / Domains: Stop deferring probe at the end of initcall
  iommu: Remove IOMMU_OF_DECLARE
  iommu: Stop deferring probe at end of initcalls
  pinctrl: Support stopping deferred probe after initcalls
  dt-bindings: pinctrl: add a 'pinctrl-use-default' property
  driver core: allow stopping deferred probe after init
  driver core: add a debugfs entry to show deferred devices
  sysfs: Fix internal_create_group() for named group updates
  base: fix order of OF initialization
  linux/device.h: fix kernel-doc notation warning
  Documentation: update firmware loader fallback reference
  kobject: Replace strncpy with memcpy
  drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
  kernfs: Replace strncpy with memcpy
  device: Add #define dev_fmt similar to #define pr_fmt
  ...
2018-08-18 11:44:53 -07:00
Ingo Molnar
5804b11034 perf/core improvements ad fixes:
kernel:
 
 . kallsyms, x86: Export addresses of PTI entry trampolines (Alexander Shishkin)
 
 . kallsyms: Simplify update_iter_mod() (Adrian Hunter)
 
 . x86: Add entry trampolines to kcore (Adrian Hunter)
 
 Hardware tracing:
 
 . Fix auxtrace queue resize (Adrian Hunter)
 
 Arch specific:
 
 . Fix uninitialized ARM SPE record error variable (Kim Phillips)
 
 . Fix trace event post-processing in powerpc (Sandipan Das)
 
 Build:
 
 . Fix check-headers.sh AND list path of execution (Alexander Kapshuk)
 
 . Remove -mcet and -fcf-protection when building the python binding
   with older clang versions (Arnaldo Carvalho de Melo)
 
 . Make check-headers.sh check based on kernel dir (Jiri Olsa)
 
 . Move syscall_64.tbl check into check-headers.sh (Jiri Olsa)
 
 Infrastructure:
 
 . Check for null when copying nsinfo.  (Benno Evers)
 
 Libraries:
 
 . Rename libtraceevent prefixes, prep work for making it a shared
   library generaly available (Tzvetomir Stoyanov (VMware))
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELb9bqkb7Te0zijNb1lAW81NSqkAFAlt0OP8ACgkQ1lAW81NS
 qkD0lg/8Dcjd6bCgHzrRJYcR4VUNgHq4LnpJEp3VVvtV29AmptVW3hOCF6siuwDI
 H8rtMzE0gflMVHae310qTaUIzo1A6/ugoRwxUKLKU8aWkA1ikl9iJn6uaTttOCIG
 H4a/mExILDicGfxMk6kAPdyDbYr7r+1UoF58asrWVjPQNPxoJSALJCPtMnLK7Cn8
 qMZN68TqIL5zifbRe6UHKCH/SmDuowVmEIz4Nin3QtwKFPH+I02TtSdYkNWTC5WK
 o469/zy9cceA6a8Q+bVEUP0OD1mU8BvRnZogOiZ5SdMiYZlDkFSqG5MzgTtJUXQC
 DxOKkANdWu7zON/KywGDX8kcIBySzd5toTiXLvsHNNhxR8pT3bU57QvrscqLIMX+
 SVbCR03h5EwLeJopvDKZjwtcSwCWp1aCXCrmLP04tcB1zi+mUIRohbzf2vZDPfwo
 IRtoHvPAgV9UAA7M+UAAtNc7G0Gg/K2c6LlfiuxDhgjk9Jqg4NbOz3cn6rvXtGQX
 B4RM9pdhoNi9tqrXNYCnLzinzKtPhWjyEbb0FBlgvXTNQNiDVjhrkt3pC1fH1zK9
 GX1F6L78x24z3bZl5hUyYXmxOkktpAeprjICXAikYvwwige6aNENVLhCI/fVFHcx
 ByxXbFMom/dSIgU0tBfdqktd7ZmSLm94obSuA6BW8/htf2JIztM=
 =W1go
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-4.19-20180815' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

kernel:

- kallsyms, x86: Export addresses of PTI entry trampolines (Alexander Shishkin)

- kallsyms: Simplify update_iter_mod() (Adrian Hunter)

- x86: Add entry trampolines to kcore (Adrian Hunter)

Hardware tracing:

- Fix auxtrace queue resize (Adrian Hunter)

Arch specific:

- Fix uninitialized ARM SPE record error variable (Kim Phillips)

- Fix trace event post-processing in powerpc (Sandipan Das)

Build:

- Fix check-headers.sh AND list path of execution (Alexander Kapshuk)

- Remove -mcet and -fcf-protection when building the python binding
  with older clang versions (Arnaldo Carvalho de Melo)

- Make check-headers.sh check based on kernel dir (Jiri Olsa)

- Move syscall_64.tbl check into check-headers.sh (Jiri Olsa)

Infrastructure:

- Check for null when copying nsinfo.  (Benno Evers)

Libraries:

- Rename libtraceevent prefixes, prep work for making it a shared
  library generaly available (Tzvetomir Stoyanov (VMware))

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-18 13:11:51 +02:00
Linus Torvalds
1f7a4c73a7 Pull request for inclusion in 4.19, take two
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
 
 Contains mostly fixes (6 to be backported to stable) and a few changes,
 here is the breakdown:
  * Rework how fids are attributed by replacing some custom tracking in a
 list by an idr (f28cdf0430)
  * For packet-based transports (virtio/rdma) validate that the packet
 length matches what the header says (f984579a01)
  * A few race condition fixes found by syzkaller (9f476d7c54,
 430ac66eb4)
  * Missing argument check when NULL device is passed in sys_mount
 (10aa14527f)
  * A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
  * Some spelling and style fixes
 
 ----------------------------------------------------------------
 Chirantan Ekbote (1):
       9p/net: Fix zero-copy path in the 9p virtio transport
 
 Colin Ian King (1):
       fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
 
 Jean-Philippe Brucker (1):
       net/9p: fix error path of p9_virtio_probe
 
 Matthew Wilcox (4):
       9p: Fix comment on smp_wmb
       9p: Change p9_fid_create calling convention
       9p: Replace the fidlist with an IDR
       9p: Embed wait_queue_head into p9_req_t
 
 Souptick Joarder (1):
       fs/9p/vfs_file.c: use new return type vm_fault_t
 
 Stephen Hemminger (1):
       9p: fix whitespace issues
 
 Tomas Bortoli (5):
       net/9p/client.c: version pointer uninitialized
       net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
       net/9p/trans_fd.c: fix race by holding the lock
       9p: validate PDU length
       9p: fix multiple NULL-pointer-dereferences
 
 jiangyiwen (2):
       net/9p/virtio: Fix hard lockup in req_done
       9p/virtio: fix off-by-one error in sg list bounds check
 
 piaojun (5):
       net/9p/client.c: add missing '\n' at the end of p9_debug()
       9p/net/protocol.c: return -ENOMEM when kmalloc() failed
       net/9p/trans_virtio.c: fix some spell mistakes in comments
       fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
       net/9p/trans_virtio.c: add null terminal for mount tag
 
  fs/9p/v9fs.c            |   2 +-
  fs/9p/vfs_file.c        |   2 +-
  fs/9p/xattr.c           |   6 ++++--
  include/net/9p/client.h |  11 ++++-------
  net/9p/client.c         | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
  net/9p/protocol.c       |   2 +-
  net/9p/trans_fd.c       |  22 +++++++++++++++-------
  net/9p/trans_rdma.c     |   4 ++++
  net/9p/trans_virtio.c   |  66 +++++++++++++++++++++++++++++++++++++---------------------------
  net/9p/trans_xen.c      |   3 +++
  net/9p/util.c           |   1 -
  12 files changed, 122 insertions(+), 116 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 
 iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
 PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
 =h3T0
 -----END PGP SIGNATURE-----

Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "This contains mostly fixes (6 to be backported to stable) and a few
  changes, here is the breakdown:

   - rework how fids are attributed by replacing some custom tracking in
     a list by an idr

   - for packet-based transports (virtio/rdma) validate that the packet
     length matches what the header says

   - a few race condition fixes found by syzkaller

   - missing argument check when NULL device is passed in sys_mount

   - a few virtio fixes

   - some spelling and style fixes"

* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
  net/9p/trans_virtio.c: add null terminal for mount tag
  9p/virtio: fix off-by-one error in sg list bounds check
  9p: fix whitespace issues
  9p: fix multiple NULL-pointer-dereferences
  fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
  9p: validate PDU length
  net/9p/trans_fd.c: fix race by holding the lock
  net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
  net/9p/virtio: Fix hard lockup in req_done
  net/9p/trans_virtio.c: fix some spell mistakes in comments
  9p/net: Fix zero-copy path in the 9p virtio transport
  9p: Embed wait_queue_head into p9_req_t
  9p: Replace the fidlist with an IDR
  9p: Change p9_fid_create calling convention
  9p: Fix comment on smp_wmb
  net/9p/client.c: version pointer uninitialized
  fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
  net/9p: fix error path of p9_virtio_probe
  9p/net/protocol.c: return -ENOMEM when kmalloc() failed
  net/9p/client.c: add missing '\n' at the end of p9_debug()
  ...
2018-08-17 17:27:58 -07:00
Linus Torvalds
6ada4e2826 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - a few Y2038 fixes

 - ntfs fixes

 - arch/sh tweaks

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  mm/hmm.c: remove unused variables align_start and align_end
  fs/userfaultfd.c: remove redundant pointer uwq
  mm, vmacache: hash addresses based on pmd
  mm/list_lru: introduce list_lru_shrink_walk_irq()
  mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
  mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
  mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
  mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
  mm/sparse: delete old sparse_init and enable new one
  mm/sparse: add new sparse_init_nid() and sparse_init()
  mm/sparse: move buffer init/fini to the common place
  mm/sparse: use the new sparse buffer functions in non-vmemmap
  mm/sparse: abstract sparse buffer allocations
  mm/hugetlb.c: don't zero 1GiB bootmem pages
  mm, page_alloc: double zone's batchsize
  mm/oom_kill.c: document oom_lock
  mm/hugetlb: remove gigantic page support for HIGHMEM
  mm, oom: remove sleep from under oom_lock
  kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
  mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
  ...
2018-08-17 16:49:31 -07:00
Colin Ian King
5241d47274 fs/userfaultfd.c: remove redundant pointer uwq
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
  warning: variable 'uwq' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Kirill Tkhai
9b996468cf mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
the both of these cases, shrinker::count_objects() returns 0.

The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case.  It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
c92e8e10ca fs: propagate shrinker::id to list_lru
Add list_lru::shrinker_id field and populate it by registered shrinker
id.

This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.

Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
2b3648a6ff fs/super.c: refactor alloc_super()
Do two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in fail path is OK with this.  Next patch needs
such the order.

Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Shakeel Butt
f745c6f5fe fs, mm: account buffer_head to kmemcg
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache.  In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.

Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation.  The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated.  One
concrete example is memory reclaim.  The reclaim can trigger I/O of
pages of any memcg on the system.  So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.

[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
  Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Shakeel Butt
d46eb14b73 fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.

The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs.  All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.

The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated.  However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg.  This patch series
contains two such concrete use-cases i.e.  fsnotify and buffer_head.

The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener.  The events
are allocated in the context of the event producer.  However they should
be charged to the event consumer.  Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.

To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg.  In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged.  For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.

This patch (of 2):

A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer.  This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
  Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Jens Axboe
ac22b46a0b ext4: readpages() should submit IO as read-ahead
a_ops->readpages() is only ever used for read-ahead.  Ensure that we
pass this information down to the block layer.

Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Jens Axboe
5e9d398240 btrfs: readpages() should submit IO as read-ahead
a_ops->readpages() is only ever used for read-ahead.  Ensure that we
pass this information down to the block layer.

Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Jens Axboe
74c8164e1c mpage: mpage_readpages() should submit IO as read-ahead
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such.  Fix that up.  Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.

Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead.  Kill off that member.

Add some documentation notes on ->readpages() being purely for
read-ahead.

Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Jens Axboe
357c120652 mpage: add argument structure for do_mpage_readpage()
Patch series "Submit ->readpages() IO as read-ahead", v4.

The only caller of ->readpages() is from read-ahead, yet we don't submit
IO flagged with REQ_RAHEAD.  This means we don't see it in blktrace, for
instance, which is a shame.  Additionally, it's preventing further
functional changes in the block layer for deadling with read-ahead more
intelligently.  We already make assumptions about ->readpages() just
being for read-ahead in the mpage implementation, using
readahead_gfp_mask(mapping) as out GFP mask of choice.

This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
which takes care of file systems using mpage_readpages().  The first
patch is a prep patch, that makes do_mpage_readpage() take an argument
structure.

This patch (of 4):

We're currently passing 8 arguments to this function, clean it up a bit
by packing the arguments in an args structure we pass to it.

No intentional functional changes in this patch.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
NeilBrown
1f4aace60b fs/seq_file.c: simplify seq_file iteration code and interface
The documentation for seq_file suggests that it is necessary to be able
to move the iterator to a given offset, however that is not the case.
If the iterator is stored in the private data and is stable from one
read() syscall to the next, it is only necessary to support first/next
interactions.  Implementing this in a client is a little clumsy.

 - if ->start() is given a pos of zero, it should go to start of
   sequence.

 - if ->start() is given the name pos that was given to the most recent
   next() or start(), it should restore the iterator to state just
   before that last call

 - if ->start is given another number, it should set the iterator one
   beyond the start just before the last ->start or ->next call.

Also, the documentation says that the implementation can interpret the
pos however it likes (other than zero meaning start), but seq_file
increments the pos sometimes which does impose on the implementation.

This patch simplifies the interface for first/next iteration and
simplifies the code, while maintaining complete backward compatability.
Now:

 - if ->start() is given a pos of zero, it should return an iterator
   placed at the start of the sequence

 - if ->start() is given a non-zero pos, it should return the iterator
   in the same state it was after the last ->start or ->next.

This is particularly useful for interators which walk the multiple
chains in a hash table, e.g.  using rhashtable_walk*.  See
fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c

A large part of achieving this is to *always* call ->next after ->show
has successfully stored all of an entry in the buffer.  Never just
increment the index instead.  Also:

 - always pass &m->index to ->start() and ->next(), never a temp
   variable

 - don't clear ->from when ->count is zero, as ->from is dead when
   ->count is zero.

Some ->next functions do not increment *pos when they return NULL.  To
maintain compatability with this, we still need to increment m->index in
one place, if ->next didn't increment it.  Note that such ->next
functions are buggy and should be fixed.  A simple demonstration is

   dd if=/proc/swaps bs=1000 skip=1

Choose any block size larger than the size of /proc/swaps.  This will
always show the whole last line of /proc/swaps.

This patch doesn't work around buggy next() functions for this case.

[neilb@suse.com: ensure ->from is valid]
  Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>	[docs]
Tested-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
NeilBrown
4cdfffc872 vfs: discard ATTR_ATTR_FLAG
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1.  The flag was never set.

Let's discard it properly.

Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Tetsuo Handa
6cd00a01f0 fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot()
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
are initialized at __d_alloc(), we can't copy the whole size
unconditionally.

 WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
 636f6e66696766732e746d70000000000010000000000000020000000188ffff
  i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
                                  ^
 RIP: 0010:take_dentry_name_snapshot+0x28/0x50
 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
 FS:  00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
  take_dentry_name_snapshot+0x28/0x50
  vfs_rename+0x128/0x870
  SyS_rename+0x3b2/0x3d0
  entry_SYSCALL_64_fastpath+0x1a/0xa4
  0xffffffffffffffff

Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Colin Ian King
480bd56485 ocfs2: make several functions and variables static (and some const)
There are a variety of functions and variables that are local to the
source and do not need to be in global scope, so make them static.  Also
make a couple of char arrays static const.

Cleans up sparse warnings:
  symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static?
  symbol 'o2hb_heartbeat_mode' was not declared. Should it be static?
  symbol 'o2hb_dependent_users' was not declared. Should it be static?
  symbol 'o2hb_region_dec_user' was not declared. Should it be static?
  symbol 'o2nm_fence_method_desc' was not declared. Should it be static?
  symbol 'lockdep_keys' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
wangyan
229ba1f82a ocfs2: clean up some unnecessary code
Several functions have some unnecessary code, clean up these code.

Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Jun Piao
93f5920d86 ocfs2: return -EROFS when filesystem becomes read-only
We should return -EROFS rather than other errno if filesystem becomes
read-only.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Kees Cook
ab62ef82ea ntfs: mft: remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size stack buffer.  Existing checks already
require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <=
PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE.
Sanity checks are added for robustness.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Kees Cook
2c27ce9150 ntfs: decompress: remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
moves the stack buffer used during decompression to be allocated
externally.

The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
Since nr_pages is used for the "pages" allocation, it can similarly be
used for the "completed_pages" allocation and passed into the
decompression function.  The error paths are updated to free the new
allocation.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Kees Cook
ac4ecf968a ntfs: aops: remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this uses
the maximum size needed on the stack and adds a sanity check for
robustness: index.block_size cannot be larger than PAGE_SIZE nor less
than NTFS_BLOCK_SIZE.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Link: http://lkml.kernel.org/r/20180626172909.41453-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Sebastian Andrzej Siewior
a10dcebacd fs/ntfs/aops.c: don't disable interrupts during kmap_atomic()
ntfs_end_buffer_async_read() disables interrupts around kmap_atomic().
This is a leftover from the old kmap_atomic() implementation which
relied on fixed mapping slots, so the caller had to make sure that the
same slot could not be reused from an interrupting context.

kmap_atomic() was changed to dynamic slots long ago and commit
1ec9c5ddc1 ("include/linux/highmem.h: remove the second argument of
k[un]map_atomic()") removed the slot assignements, but the callers were
not checked for now redundant interrupt disabling.

Remove the conditional interrupt disable.

Link: http://lkml.kernel.org/r/20180611144913.gln5mklhqcrfsoom@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Arnd Bergmann
f08957d0ff fs/hpfs: extend gmt_to_local() conversion to 64-bit times
The VFS timestamps are all 64-bit now, the only missing piece for hpfs
is the internal conversion function.  One interesting bit about hpfs is
that it can already deal with moving the 136 year window of its
timestamps to support a much wider range than other file systems with
32-bit timestamps.  It also treats the timestamps as 'unsigned' on
64-bit architectures (but signed on 32-bit, because time_t always around
to negative numbers in 2038).

Changing the conversion to use time64_t makes 32-bit architectures
behave the same way as 64-bit.  For completeness, this also adds a
clamp_t call for each conversion, so we don't wrap the timestamps but
instead stay within the [0..U32_MAX] range of the on-disk timestamps.

Link: http://lkml.kernel.org/r/20180718115017.742609-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Arnd Bergmann
bcf451ecfc fs/ntfs: use timespec64 directly for timestamp conversion
Now that the VFS has been converted from timespec to timespec64
timestamps, only the conversion to/from ntfs timestamps uses 32-bit
seconds.

This changes that last missing piece to get the ntfs implementation
y2038 safe on 32-bit architectures.

Link: http://lkml.kernel.org/r/20180718115017.742609-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Arnd Bergmann
a3fda0ffea fs/ufs: use ktime_get_real_seconds for sb and cg timestamps
get_seconds() is deprecated because of the 32-bit overflow and will be
removed.  All callers in ufs also truncate to a 32-bit number, so
nothing changes during the conversion, but this should be harmless as
the superblock and cylinder group timestamps are not visible to user
space, except for checking the fs-dirty state, wich works fine across
the overflow.

This moves the call to get_seconds() into a new inline function, with a
comment explaining the constraints, while converting it to
ktime_get_real_seconds().

Link: http://lkml.kernel.org/r/20180718115017.742609-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Dave Jiang
e1fb4a0864 dax: remove VM_MIXEDMAP for fsdax and device dax
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map.  The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations.  In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test.  It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Linus Torvalds
9bd553929f This has been a large cycle for RDMA, with several major patch series
reworking parts of the core code.
 
 - Rework the so-called 'gid cache' and internal APIs to use a kref'd
   pointer to a struct instead of copying, push this upwards into the
   callers and add more stuff to the struct. The new design avoids some
   ugly races the old one suffered with. This is part of the namespace
   enablement work as the new struct is learning to be namespace aware.
 
 - Various uapi cleanups, moving more stuff to include/uapi and fixing some
   long standing bugs that have recently been discovered.
 
 - Driver updates for mlx5, mlx4 i40iw, rxe, cxgb4, hfi1, usnic, pvrdma,
   and hns
 
 - Provide max_send_sge and max_recv_sge attributes to better support HW
   where these values are asymmetric.
 
 - mlx5 user API 'devx' allows sending commands directly to the device FW,
   instead of trying to cram every wild and niche feature into the common
   API. Sort of like what GPU does.
 
 - Major write() and ioctl() API rework to cleanly support PCI device hot
   unplug and advance the ioctl conversion work
 
 - Sparse and compile warning cleanups
 
 - Add 'const' to the ib_poll_cq() signature, and permit a NULL 'bad_wr',
   which is the common use case
 
 - Various patches to avoid high order allocations across the stack
 
 - SRQ support for cxgb4, hns and qedr
 
 - Changes to IPoIB to better follow the netdev model for working with
   struct net_device liftime
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlt17oMACgkQOG33FX4g
 mxpRsQ//YZY1Gci1IoYLMuq0Rn9+/4lRHaBev+B728z1dvEFBW8m/i2DV5dPnSxO
 AUN9dZOKBYYhc08h8vphtnBdMEtYJz6Dl76F8W+mt5vSuM5D4+0ba415RYSnV1Dc
 d6Js33OTMVbQVHmYCIAXh9FNDX8lkywT346aXlMOpW3z74xoaLkkQ0cnfB0SEX0y
 q9jiu70s6eisLlu9zJsXmCCLQ1b8eUD6IZm7hX8wMheuhDWyfrOv8JBeBCQdICuI
 MASc2T7X8E++dvIePAL7Hgx/0SH/2Mit8zaJ0Sbt2OjBDcImLSs8bcple5gPoCPk
 3vnCdb2GKg8xlxe3n1S89sGC1b8MY2CtQFElSs9C6npIGCwr2XlrZDDa0tE45+8I
 miVhoswakmKW61KTCkVf2d9RXWcIh1qwUIpan1aZMsWdNnA6FYXIF054mMmJO44+
 HUi2C93zAhx3XhFuX6O2YAHkG6CSXcZPfO7U9zy++GwAoXtGU0g6OLZbaYdEfuQh
 lN8LLqxe3M5sMdDnHYc38AsLW9MmxyJXt+h2yLxtsdZ9jitypBDQxSVfAI68RNwL
 BB1qELflF9FtAousQU9qhdNHimsgwctJ9MoZ6I1Aa1+ovwcSQgmKoQlNJIHkFroB
 wUz2sz6q25OdLWDpFrGipmG7Kfnosg7xuBSYZUQMBzLmjg0HTVY=
 =F50c
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a large cycle for RDMA, with several major patch series
  reworking parts of the core code.

   - Rework the so-called 'gid cache' and internal APIs to use a kref'd
     pointer to a struct instead of copying, push this upwards into the
     callers and add more stuff to the struct. The new design avoids
     some ugly races the old one suffered with. This is part of the
     namespace enablement work as the new struct is learning to be
     namespace aware.

   - Various uapi cleanups, moving more stuff to include/uapi and fixing
     some long standing bugs that have recently been discovered.

   - Driver updates for mlx5, mlx4 i40iw, rxe, cxgb4, hfi1, usnic,
     pvrdma, and hns

   - Provide max_send_sge and max_recv_sge attributes to better support
     HW where these values are asymmetric.

   - mlx5 user API 'devx' allows sending commands directly to the device
     FW, instead of trying to cram every wild and niche feature into the
     common API. Sort of like what GPU does.

   - Major write() and ioctl() API rework to cleanly support PCI device
     hot unplug and advance the ioctl conversion work

   - Sparse and compile warning cleanups

   - Add 'const' to the ib_poll_cq() signature, and permit a NULL
     'bad_wr', which is the common use case

   - Various patches to avoid high order allocations across the stack

   - SRQ support for cxgb4, hns and qedr

   - Changes to IPoIB to better follow the netdev model for working with
     struct net_device liftime"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (312 commits)
  Revert "net/smc: Replace ib_query_gid with rdma_get_gid_attr"
  RDMA/hns: Fix usage of bitmap allocation functions return values
  IB/core: Change filter function return type from int to bool
  IB/core: Update GID entries for netdevice whose mac address changes
  IB/core: Add default GIDs of the bond master netdev
  IB/core: Consider adding default GIDs of bond device
  IB/core: Delete lower netdevice default GID entries in bonding scenario
  IB/core: Avoid confusing del_netdev_default_ips
  IB/core: Add comment for change upper netevent handling
  qedr: Add user space support for SRQ
  qedr: Add support for kernel mode SRQ's
  qedr: Add wrapping generic structure for qpidr and adjust idr routines.
  IB/mlx5: Fix leaking stack memory to userspace
  Update the e-mail address of Bart Van Assche
  IB/ucm: Fix compiling ucm.c
  IB/uverbs: Do not check for device disassociation during ioctl
  IB/uverbs: Remove struct uverbs_root_spec and all supporting code
  IB/uverbs: Use uverbs_api to unmarshal ioctl commands
  IB/uverbs: Use uverbs_alloc for allocations
  IB/uverbs: Add a simple allocator to uverbs_attr_bundle
  ...
2018-08-17 12:44:48 -07:00
Linus Torvalds
2645b9d1a4 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlt2mBQACgkQnJ2qBz9k
 QNntGQgAluTTnuJLjoUDjFfT37Fjf2x1ve8rg6xmYS3YIhYTWWA1oazUIeyBDfwa
 soutlfAZ/ix2bP1UEmeULxFhrCIXYBbWAe8s5MRqO/7s01QftNf0M72ASmd7gZRy
 rSVt2/BWpr745mWI38tEKlIF4sQJVD7IGrnc1cQslPzleeCqsCXA+uBkBPMlcDpJ
 ZWni2qK023y9E2dsg6RsJc1HemkQvrJtoLSVqRsdhty9GEuWseMbssdgz1zMXljQ
 eXIALE5BssoxISIpH6qVKZRlr7UWGxOmV4CDPmku7DFLOSiwMk/Ml0V80BwzjNNY
 hY8qfxcJOFOGZ8t82pWkVGMjgOAKjA==
 =IN6Y
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "fsnotify cleanups from Amir and a small inotify improvement"

* tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Add flag IN_MASK_CREATE for inotify_add_watch()
  fanotify: factor out helpers to add/remove mark
  fsnotify: add helper to get mask from connector
  fsnotify: let connector point to an abstract object
  fsnotify: pass connp and object type to fsnotify_add_mark()
  fsnotify: use typedef fsnotify_connp_t for brevity
2018-08-17 09:41:28 -07:00
Linus Torvalds
46e62a072a \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlt2l2MACgkQnJ2qBz9k
 QNlZMAgAwVu/bMsRR6PbXJIAYEUNLehrmgUfSdYxIFqnZPq84ZfpOMQZKDYJIO5d
 WiLz9Z9pti/ldrQ33yllbJrsalAn8R+LB911eaKUvLscXyrIsoBxsBbOOtVZc9lZ
 jaQBUMLStdPvE6LgW93f1EwIg/Z8CSTzaeCO31wlZl7s7wsBhjg3MJ3f9sR6LG0G
 OKQZnjDxGbtsbeVl8cnOeeF3sd0kqYTT5EwSh+zkMIbHJQ0dbvEjj24TM9rHdzG2
 AN35+rzFZeMHRGnfWsQ/I6il1nTuWIyPRpoc57cwV/dcYwpg1Pi6MZzrFcDsWfwx
 rHgRJIkmSqi1S6Ic8o6s9fYsn6266A==
 =ljWe
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF and ext2 update from Jan Kara.

* tag 'for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: use ktime_get_real_seconds for timestamps
  udf: convert inode stamps to timespec64
2018-08-17 09:38:39 -07:00
Robbie Ko
8ecebf4d76 Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Commit e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.

The steps leading to this problem are:

1. When it's not possible to allocate data space for a write, the
   buffered write path checks if a NOCOW write is possible.  If it is,
   it will not reserve space and success (0) is returned to user space.

2. Then when a snapshot is created, the root's will_be_snapshotted
   atomic is incremented and writeback is triggered for all inode's that
   belong to the root being snapshotted. Incrementing that atomic forces
   all previous writes to fallback to COW during writeback (running
   delalloc).

3. This results in the writeback for the inodes to fail and therefore
   setting the ENOSPC error in their mappings, so that a subsequent
   fsync on them will report the error to user space. So it's not a
   completely silent data loss (since fsync will report ENOSPC) but it's
   a very unexpected and undesirable behaviour, because if a clean
   shutdown/unmount of the filesystem happens without previous calls to
   fsync, it is expected to have the data present in the files after
   mounting the filesystem again.

So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:

1. It is incremented when we start to create a snapshot after triggering
   writeback and before waiting for writeback to finish.

2. This new atomic is now what is used by writeback (running delalloc)
   to decide whether we need to fallback to COW or not. Because we
   incremented this new atomic after triggering writeback in the
   snapshot creation ioctl, we ensure that all buffered writes that
   happened before snapshot creation will succeed and not fallback to
   COW (which would make them fail with ENOSPC).

3. The existing atomic, will_be_snapshotted, is kept because it is used
   to force new buffered writes, that start after we started
   snapshotting, to reserve data space even when NOCOW is possible.
   This makes these writes fail early with ENOSPC when there's no
   available space to allocate, preventing the unexpected behaviour of
   writeback later failing with ENOSPC due to a fallback to COW mode.

Fixes: e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-17 18:35:43 +02:00
Jason Gunthorpe
0a3173a5f0 Merge branch 'linus/master' into rdma.git for-next
rdma.git merge resolution for the 4.19 merge window

Conflicts:
 drivers/infiniband/core/rdma_core.c
   - Use the rdma code and revise with the new spelling for
     atomic_fetch_add_unless
 drivers/nvme/host/rdma.c
   - Replace max_sge with max_send_sge in new blk code
 drivers/nvme/target/rdma.c
   - Use the blk code and revise to use NULL for ib_post_recv when
     appropriate
   - Replace max_sge with max_recv_sge in new blk code
 net/rds/ib_send.c
   - Use the net code and revise to use NULL for ib_post_recv when
     appropriate

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 14:21:29 -06:00
Jason Gunthorpe
89982f7cce Linux 4.18
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
 mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
 qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
 zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
 vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
 wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
 L47IdXo=
 =sJkt
 -----END PGP SIGNATURE-----

Merge tag 'v4.18' into rdma.git for-next

Resolve merge conflicts from the -rc cycle against the rdma.git tree:

Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified
    in for-rc

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 13:12:00 -06:00
Linus Torvalds
5c60a7389d Orangefs: one cleanup and Souptick's vm_fault_t patch
1. Adding new return type vm_fault_t (Souptick Joarder)
 2. remove redundant pointer orangefs_inode (Colin Ian King)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbdZXmAAoJEM9EDqnrzg2+O+YP/AoU+NnPj9rDYKC/OImp4uhh
 aIER9LOFXFJocWULAQccFXLawRVzllBwwcWSwLlGAa2AT8DyIxpuyxJhNLIfrEKV
 axsfAQA/mU529i8PRgwnYdQJ0cKgzHR9qrQvTrBPAV+xhrlIeQI48cNlriwJikFF
 0bXkWZt5ZSn+e5FkKFm/OqiialwcrOkMGnM+Apa0B9MSvmapLcCuvGxqYYKEbSaV
 JYqnZ3DiDnBp/6RYUY/qn/Azp8gCDfrPlm05lUZnAbyFGwaidunOgNMHTbQAZ//H
 hLuGRsMWOdQqwEMr+H9vPZVBTp6DfupgH8BgB5Y5EHcwgoWK5U3sZZQKP5f8+9vh
 7StCSnc9qT5iJWTbOWIngIpSeNnVa6iF7QMXt7wxOQY2ITu5Cnot1fWhuj2UcA36
 xmf38B6YRX4VeLMc/eryQCD7d4EpBYIqdyaLAg0Qg1Y35DU9b3QkC56ca56uQrHY
 QZeQAqH63CpHiajrYCHE5wsr5zrLXbYj229Idq2KBhEqXcxCV17kwjLF3rpyEbxu
 9I4HpafzQ0Sho+zsCgakyu5DYBAfMbAYqR7pT5MGNB8yYVzxMcSEsAWSQ42Ab1qb
 P09p1ojQQxjrqApMOa6L4MrLNA7Wl75LGRnwNy7c83qkys8Y90JhdZsQlLwlp+PT
 rYnIliKQuTRY+7JV/4WL
 =3Oz+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.19-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Orangefs: one cleanup and Souptick's vm_fault_t patch:

   - add new return type vm_fault_t (Souptick Joarder)

   - remove redundant pointer (Colin Ian King)"

* tag 'for-linus-4.19-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: remove redundant pointer orangefs_inode
  orangefs: Adding new return type vm_fault_t
2018-08-16 10:53:45 -07:00
Trond Myklebust
ea51f94b45 pNFS: Treat RECALLCONFLICT like DELAY...
Yes, it is possible to get trapped in a loop, but the server should be
administratively revoking the recalled layout if it never gets returned.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-16 13:47:09 -04:00
Trond Myklebust
ecf8402603 pNFS: When updating the stateid in layoutreturn, also update the recall range
When we update the layout stateid in nfs4_layoutreturn_refresh_stateid, we
should also update the range in order to let the server know we're actually
returning everything.

Fixes: 16c278dbfa63 ("pnfs: Fix handling of NFS4ERR_OLD_STATEID replies...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-16 13:29:36 -04:00
Linus Torvalds
5bae2be4ef Just one jfs patch for 4.19
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlt0SSYACgkQNqiEXrVA
 jGTwvA//aTnRN3lPQK0iEeiGw6lzhQX/cO7eXlTTGnHMMxy62NZTSUlZhTaoTm63
 bg0MHknX13oikraMMpSEwsf4qoPcIgalTXmFd7RWmOHKw7GBd6LJznsRQQ4i3G8g
 0a1KtXLyRXT68UgJ6U0BmukWBjNC1qG9ToWbBG8SXMhrxuFbpg4uPRtMl6eRI9fV
 U0CH//x94TYXSB2D4N1eVzvRrIs4l0iJA1RxfqmYZSBQe7b7LW3GLafsIm0axQO3
 hy7XEUtjWzhuGILRVTJ+9hmyTG41YARWYrG0Rdd4h0sB5nK/jl8YRZtofGs7zuBK
 RqzJHUSNGPXza54O0bBKQk6IwTJTsjhWg2f3AXFComEP4hvTA50i/CAa/XBZYKam
 Fq99C70txMA4Ufwrmh4dN+20qtMYFwuvpsbNMiyuQjCDUxXwvey4RNLc1o6J3tWH
 1qVNNk/k+kJY704CqA+h7Ay0A1ocaa64glwPIcBDgP5Us72LE/QjDRE4NfQVg4wq
 WbVO+Rml8kB+uU2ma2U2y4XXgZIFv7JWxmQ6fxWfMe2kH0+Z6Ech2D66t/oBW6w7
 Q0kp2+YYaSpvIbKnQYzBQjW+W3kPIPAYLV4HptM89p0ZLi4DlKLi4EywbeGPsA/R
 gynd4Uxi95TA+2bAnKSNuoNse/5mq+R2F4+RuVZMUHt2DWEz2iA=
 =VxRQ
 -----END PGP SIGNATURE-----

Merge tag 'jfs-4.19' of git://github.com/kleikamp/linux-shaggy

Pull jfs update from David Kleikamp:
 "Just one jfs patch for 4.19"

* tag 'jfs-4.19' of git://github.com/kleikamp/linux-shaggy:
  jfs: use time64_t for otime
2018-08-15 22:47:23 -07:00
Linus Torvalds
2b2f2aedba gfs2 4.19 merge
Changes on top of v4.18-rc1 / iomap-4.19-merge-1:
 
 1. Iomap support for buffered writes and for direct I/O.
 2. Two patches that reduce the size of struct gfs2_inode.
 3. Lots of fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbcakSAAoJENW/n+sDE2U6AR0QAJlai+92ERML2pM+1hiuEHWP
 KizBoV/53pc+dll1dlcEOHQFys2vbcFavcCtcsTXNhLSp1wOqxyzcFQTX6ekWfSZ
 hTvJvAKTbeXu0zOWSV2DcX40JWb7SKDAxjxNb8XhL0COilgM9r+mdqoY/UNyVSel
 SVmWvs8UYt6UBnw4G8h5UlzSYxl/M64udU1pVO5D8JMQ5cxDKj3kfFoJLLKBDwLF
 vaNFxiihdTzmMwMNo3Px7GFSsb5Jnyo9LgAoDKsYd9YlzqGpAvvoYXH8itj4TuSb
 sM1KTUZK+97XvquZfuv5BniEifP7XZSq4xYIxyr9HMaOefeys0GdzaCSCb3ifFte
 7bqjowlAbHWwBNa9ofuJ1NShsAiOv0GUGDzlY+T/0IgSlqRr0JxAikJ3jLIZQ1Hf
 CwWY66XakeSi5euDTi41SuGZMcxTXaX15VbXl6/SGsv4X0dyVXleBz6RuC9Q+n2H
 7nqlGppRW2NB1WUqkJ15n9JaNLAF5I6umERTBXKGODM56p/GmZYoScCEGrqj9obN
 CntPtL6yrazASjV3+zqXA//OvTb3xfykYu17wVLKhXWD0YWQiuDfA481BLvEut2G
 aTtNU3b4VDwv5NuBf7G3wvN0+v3WyJ3gRfhTaEdFnX5PpcH3eHz5/fzU7zCJJwDu
 g53icn3efqNu7WvpAkB+
 =oGs4
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.19.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - iomap support for buffered writes and for direct I/O

 - two patches that reduce the size of struct gfs2_inode

 - lots of fixes and cleanups

* tag 'gfs2-4.19.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (25 commits)
  gfs2: eliminate update_rgrp_lvb_unlinked
  gfs2: Fix gfs2_testbit to use clone bitmaps
  gfs2: Get rid of gfs2_ea_strlen
  gfs2: cleanup: call gfs2_rgrp_ondisk2lvb from gfs2_rgrp_out
  gfs2: Special-case rindex for gfs2_grow
  GFS2: rgrp free blocks used incorrectly
  gfs2: remove redundant variable 'moved'
  gfs2: use iomap_readpage for blocksize == PAGE_SIZE
  gfs2: Use iomap for stuffed direct I/O reads
  gfs2: fallocate_chunk: Always initialize struct iomap
  GFS2: Fix recovery issues for spectators
  fs: gfs2: Adding new return type vm_fault_t
  gfs2: using posix_acl_xattr_size instead of posix_acl_to_xattr
  gfs2: Don't reject a supposedly full bitmap if we have blocks reserved
  gfs2: Eliminate redundant ip->i_rgd
  gfs2: Stop messing with ip->i_rgd in the rlist code
  gfs2: Remove gfs2_write_{begin,end}
  gfs2: iomap direct I/O support
  gfs2: gfs2_extent_length cleanup
  gfs2: iomap buffered write support
  ...
2018-08-15 22:40:03 -07:00
Linus Torvalds
72f02ba66b SCSI misc on 20180815
This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx,
 hisi_sas, smartpqi, megaraid_sas, arcmsr.  In addition, with the
 continuing absence of Nic we have target updates for tcmu and target
 core (all with reviews and acks).  The biggest observable change is
 going to be that we're (again) trying to switch to mulitqueue as the
 default (a user can still override the setting on the kernel command
 line).  Other major core stuff is the removal of the remaining
 Microchannel drivers, an update of the internal timers and some
 reworks of completion and result handling.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCW3R3niYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishauRAP4yfBKK
 dbxF81c/Bxi/Stk16FWkOOrjs4CizwmnMcpM5wD/UmM9o6ebDzaYpZgA8wIl7X/N
 o/JckEZZpIp+5NySZNc=
 =ggLB
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx,
  hisi_sas, smartpqi, megaraid_sas, arcmsr.

  In addition, with the continuing absence of Nic we have target updates
  for tcmu and target core (all with reviews and acks).

  The biggest observable change is going to be that we're (again) trying
  to switch to mulitqueue as the default (a user can still override the
  setting on the kernel command line).

  Other major core stuff is the removal of the remaining Microchannel
  drivers, an update of the internal timers and some reworks of
  completion and result handling"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits)
  scsi: core: use blk_mq_run_hw_queues in scsi_kick_queue
  scsi: ufs: remove unnecessary query(DM) UPIU trace
  scsi: qla2xxx: Fix issue reported by static checker for qla2x00_els_dcmd2_sp_done()
  scsi: aacraid: Spelling fix in comment
  scsi: mpt3sas: Fix calltrace observed while running IO & reset
  scsi: aic94xx: fix an error code in aic94xx_init()
  scsi: st: remove redundant pointer STbuffer
  scsi: qla2xxx: Update driver version to 10.00.00.08-k
  scsi: qla2xxx: Migrate NVME N2N handling into state machine
  scsi: qla2xxx: Save frame payload size from ICB
  scsi: qla2xxx: Fix stalled relogin
  scsi: qla2xxx: Fix race between switch cmd completion and timeout
  scsi: qla2xxx: Fix Management Server NPort handle reservation logic
  scsi: qla2xxx: Flush mailbox commands on chip reset
  scsi: qla2xxx: Fix unintended Logout
  scsi: qla2xxx: Fix session state stuck in Get Port DB
  scsi: qla2xxx: Fix redundant fc_rport registration
  scsi: qla2xxx: Silent erroneous message
  scsi: qla2xxx: Prevent sysfs access when chip is down
  scsi: qla2xxx: Add longer window for chip reset
  ...
2018-08-15 22:06:26 -07:00
Eric W. Biederman
84fe4cc09a signal: Don't send signals to tasks that don't exist
Recently syzbot reported crashes in send_sigio_to_task and
send_sigurg_to_task in linux-next.  Despite finding a reproducer
syzbot apparently did not bisected this or otherwise track down the
offending commit in linux-next.

I happened to see this report and examined the code because I had
recently changed these functions as part of making PIDTYPE_TGID a real
pid type so that fork would does not need to restart when receiving a
signal.  By examination I see that I spotted a bug in the code
that could explain the reported crashes.

When I took Oleg's suggestion and optimized send_sigurg and send_sigio
to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
I failed to handle pids that no longer point to tasks.  The macro
do_each_pid_task simply iterates for zero iterations.  With pid_task
an explicit NULL test is needed.

Update the code to include the missing NULL test.

Fixes: 019191342f ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
Reported-by: syzkaller-bugs@googlegroups.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-15 23:03:20 -05:00
Linus Torvalds
71f3a82fab media updates for v4.19-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbdGisAAoJEAhfPr2O5OEVmXcQAIe+wWHfAFwFDCqRGvG7yHIV
 q3Rf6bqYE/IK++sSX9qrCmPC1jXsw6oXjaYqjtdplLb6GBBkNdKo6neGtARV1phB
 Moj0JVmFwvnq6yMBZl4TFIZIZHIlfXIe/Ac6b9h5aiL1bGUdnukLhSnGVnoWteVv
 F6BRAhH/PDIf7xUpxchEic7HbwnWDWI51ALeDWZQCv5SkNgBUBrLDumWr7b2YCWF
 BcNfC14fNN1Jk4Hi9NpS7VCC4Nxvxn+D3iFsVa10Ur/A8EPApIlj+8nu/CDA8p4c
 cJa8ImvaKhK0PAfciKMi5RBs/5r5BFRlrmLADkyVz8pmz5AQqjHot/TGoC22LqaN
 h0T9lTKEmFAE8CYFQhcOfb6qcR6bfnM6yyVjeCrnAQ3Fw9TgESiPbPE7BpgnxoTU
 i3h7a5WHYs4xjkMByJUYda1GhahPD82eJJp8l4iIQ2IMlTZbrEmLFmg13WNO3950
 a+dy3HqpXUd2EYwFn9CVFzjTzWKE+63obQWIxLFWJWRirqr5XKbTmB7nPLeik6Os
 kPmn8t0eKig76SYiJ2rhKgTMUB1rXjtBYtqP/f02FYN0WSZCEIs7oKMb/SoVfTlP
 2rF6k2ZPTzB2ZJp3RzNRX/+/39f04i26QzeEwwQ6/8daxSna3YlKuLMvHzJ5UuFp
 q28aN8hD5eXU1IEWHtvH
 =ClJZ
 -----END PGP SIGNATURE-----

Merge tag 'media/v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

 - new Socionext MN88443x ISDB-S/T demodulator driver: mn88443x

 - new sensor drivers: ak7375, ov2680 and rj54n1cb0c

 - an old soc-camera sensor driver converted to the V4L2 framework:
   mt9v111

 - a new Voice-Coil Motor (VCM) driver: dw9807-vcm

 - some cleanups at cx25821, removing legacy unused code

 - some improvements at ddbridge driver

 - new platform driver: vicodec

 - some DVB API cleanups, removing ioctls and compat code for old
   out-of-tree drivers that were never merged upstream

 - improvements at DVB core to support frontents that support both
   Satellite and non-satellite delivery systems

 - got rid of the unused VIDIOC_RESERVED V4L2 ioctl

 - some cleanups/improvements at gl861 ISDB driver

 - several improvements on ov772x, ov7670 and ov5640, imx274, ov5645,
   and smiapp sensor drivers

 - fixes at em28xx to support dual TS devices

 - some cleanups at V4L2/VB2 locking logic

 - some API improvements at media controller

 - some cec core and drivers improvements

 - some uvcvideo improvements

 - some improvements at platform drivers: stm32-dcmi, rcar-vin, coda,
   reneseas-ceu, imx, vsp1, venus, camss

 - lots of other cleanups and fixes

* tag 'media/v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (406 commits)
  Revert "media: vivid: shut up warnings due to a non-trivial logic"
  siano: get rid of an unused return code for debugfs register
  media: isp: fix a warning about a wrong struct initializer
  media: radio-wl1273: fix return code for the polling routine
  media: s3c-camif: fix return code for the polling routine
  media: saa7164: fix return codes for the polling routine
  media: exynos-gsc: fix return code if mutex was interrupted
  media: mt9v111: Fix build error with no VIDEO_V4L2_SUBDEV_API
  media: xc4000: get rid of uneeded casts
  media: drxj: get rid of uneeded casts
  media: tuner-xc2028: don't use casts for printing sizes
  media: cleanup fall-through comments
  media: vivid: shut up warnings due to a non-trivial logic
  media: rtl28xxu: be sure that it won't go past the array size
  media: mt9v111: avoid going past the buffer
  media: vsp1_dl: add a description for cmdpool field
  media: sta2x11: add a missing parameter description
  media: v4l2-mem2mem: add descriptions to MC fields
  media: i2c: fix warning in Aptina MT9V111
  media: imx: shut up a false positive warning
  ...
2018-08-15 18:29:14 -07:00
Linus Torvalds
9a76aba02a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   - Gustavo A. R. Silva keeps working on the implicit switch fallthru
     changes.

   - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
     Luca Coelho.

   - Re-enable ASPM in r8169, from Kai-Heng Feng.

   - Add virtual XFRM interfaces, which avoids all of the limitations of
     existing IPSEC tunnels. From Steffen Klassert.

   - Convert GRO over to use a hash table, so that when we have many
     flows active we don't traverse a long list during accumluation.

   - Many new self tests for routing, TC, tunnels, etc. Too many
     contributors to mention them all, but I'm really happy to keep
     seeing this stuff.

   - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

   - Lots of cleanups and fixes in L2TP code from Guillaume Nault.

   - Add IPSEC offload support to netdevsim, from Shannon Nelson.

   - Add support for slotting with non-uniform distribution to netem
     packet scheduler, from Yousuk Seung.

   - Add UDP GSO support to mlx5e, from Boris Pismenny.

   - Support offloading of Team LAG in NFP, from John Hurley.

   - Allow to configure TX queue selection based upon RX queue, from
     Amritha Nambiar.

   - Support ethtool ring size configuration in aquantia, from Anton
     Mikaev.

   - Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

   - Support list based batching and stack traversal of SKBs, this is
     very exciting work. From Edward Cree.

   - Busyloop optimizations in vhost_net, from Toshiaki Makita.

   - Introduce the ETF qdisc, which allows time based transmissions. IGB
     can offload this in hardware. From Vinicius Costa Gomes.

   - Add parameter support to devlink, from Moshe Shemesh.

   - Several multiplication and division optimizations for BPF JIT in
     nfp driver, from Jiong Wang.

   - Lots of prepatory work to make more of the packet scheduler layer
     lockless, when possible, from Vlad Buslov.

   - Add ACK filter and NAT awareness to sch_cake packet scheduler, from
     Toke Høiland-Jørgensen.

   - Support regions and region snapshots in devlink, from Alex Vesker.

   - Allow to attach XDP programs to both HW and SW at the same time on
     a given device, with initial support in nfp. From Jakub Kicinski.

   - Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

   - Use PHYLIB in r8169 driver, from Heiner Kallweit.

   - All sorts of changes to support Spectrum 2 in mlxsw driver, from
     Ido Schimmel.

   - PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

   - Make TCP_USER_TIMEOUT socket option more accurate, from Jon
     Maxwell.

   - Support for templates in packet scheduler classifier, from Jiri
     Pirko.

   - IPV6 support in RDS, from Ka-Cheong Poon.

   - Native tproxy support in nf_tables, from Máté Eckl.

   - Maintain IP fragment queue in an rbtree, but optimize properly for
     in-order frags. From Peter Oskolkov.

   - Improvde handling of ACKs on hole repairs, from Yuchung Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
  bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
  hv/netvsc: Fix NULL dereference at single queue mode fallback
  net: filter: mark expected switch fall-through
  xen-netfront: fix warn message as irq device name has '/'
  cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
  net: dsa: mv88e6xxx: missing unlock on error path
  rds: fix building with IPV6=m
  inet/connection_sock: prefer _THIS_IP_ to current_text_addr
  net: dsa: mv88e6xxx: bitwise vs logical bug
  net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
  ieee802154: hwsim: using right kind of iteration
  net: hns3: Add vlan filter setting by ethtool command -K
  net: hns3: Set tx ring' tc info when netdev is up
  net: hns3: Remove tx ring BD len register in hns3_enet
  net: hns3: Fix desc num set to default when setting channel
  net: hns3: Fix for phy link issue when using marvell phy driver
  net: hns3: Fix for information of phydev lost problem when down/up
  net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
  net: hns3: Add support for serdes loopback selftest
  bnxt_en: take coredump_record structure off stack
  ...
2018-08-15 15:04:25 -07:00
Linus Torvalds
fa1b5d09d0 Consolidation of Kconfig files by Christoph Hellwig.
Move the source statements of arch-independent Kconfig files instead of
 duplicating the includes in every arch/$(SRCARCH)/Kconfig.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbdFsfAAoJED2LAQed4NsGxHsP/1tmA57OOOj8oGxO2OXhXVbr
 Q0MZqCoV4bqMvK/hgCQdl9f+tp0m+j12x4xDLdVf4OqnTXMbqvPDu3uQVKvaj/k1
 gHhsFA1tFgSbuJ8InltUsrPEQqbceeJsj50xHVAKijqI6LYeRPPSU7aE9obn+OzH
 n2nd5sLKvMI/dqdJvW6i5KPydqTH3r3iA7D+ne/XQj0s0EMXvXUPmDT1+ijTnM4a
 yfm6W5p7L/c3Ugf1Pz5PfnPl4BxBwZMfW5ie/UO8j5C6Rl0iPaOGuuHurocaaJb3
 MefR/7NEAR3G8MhJyL2+70jbbwhjpqR2b5ooz1vpuulPHxjeU45BY60XIBWq1afR
 ewsc12MMCYB695ieYWoHdaWgxD/jhffyRuajfpkXKIZEMgDxS03sMhdULXENVMx1
 M0ZQ01g/NLWt9ti9DY3eTKB3ymOhnBa1sa77nGGUHkITq4DQKwPX1J9FP/HT6RNt
 uOvzeH5kGzc7tqOlZAO0kHbwhQG1uqGcd78IYd4lgf/XfkSgDERTWjnJmnQbwr9m
 3PFuST2u8eyO+8Lh1MK76TXOEkXsHMdFugPmb6SlgtMEPKGVLDPlsj52o/LFtgzl
 eygfMiBFr2+ttkZ6IpNcpmQ4IztmDpz6XoMk3PqDAfUTUSYpCnq1gAEuff/eisCM
 Odva1ZZaeQ7WpxhsP8rr
 =gsQJ
 -----END PGP SIGNATURE-----

Merge tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kconfig consolidation from Masahiro Yamada:
 "Consolidation of Kconfig files by Christoph Hellwig.

  Move the source statements of arch-independent Kconfig files instead
  of duplicating the includes in every arch/$(SRCARCH)/Kconfig"

* tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: add a Memory Management options" menu
  kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
  kconfig: use a menu in arch/Kconfig to reduce clutter
  kconfig: include kernel/Kconfig.preempt from init/Kconfig
  Kconfig: consolidate the "Kernel hacking" menu
  kconfig: include common Kconfig files from top-level Kconfig
  kconfig: remove duplicate SWAP symbol defintions
  um: create a proper drivers Kconfig
  um: cleanup Kconfig files
  um: stop abusing KBUILD_KCONFIG
2018-08-15 13:05:12 -07:00
Linus Torvalds
3529b9703c - Add awareness of zstd compression (Geliang Tang)
-----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltx60wWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqmtD/43lXkoLGl31JsCVkVASg2K34B3
 DI6uc4vwqJmGCNo9JrGGFNOqnV9n/rBg77Ymi6qDVRImbDm6t3FofaPLTVU1G9oK
 3KYlv7x99j6DoFmP+U6sEvnRb5znqrmFR/t6Nx4yn7p68XY6hd7IFx8wZ6ZNt26i
 eNUN67Hzg+eEuuDSf1XbVLIbeY9qqIyJvjqf6WUWXD2iAxcVPBUyxpNcsEGtmqti
 9e76t/iaPq+T4V8AODthl/lfyD/xauD6g48CpIRTeDvEbkC/cBZgYS3y5frVv4+d
 xve5j8aM2YYIaWzJGeE+3j+PXY6pdq/50iSPalo0SQg8ypaDpB7IRYmnBoR08dlX
 GNQxNLkfqXGniJf5Khaxzta9l25yjr8nRC3mLkI42dM/FIhHWi3/wF6nJSMaKZXJ
 FAZPC06VZR4Amb2pseNtRd09nYRDtqydmjsq53Y/1X0aaqgDiX1vEtothroLS4J1
 Lg3VWNzKjowTA40xc0z/5H+8jxw2gUfjtJKIQvYvaNXmEEfC+BvHk9OC5ZHSizu5
 PvBHNTGD2inyOpjWZishr6bgpeS4K/i/IDJy6UkFF1Q+z3udRbQTqWEUwwVjRnO0
 IlluTw7PcEy5ctbcLT+ABjdyjtWGzp0lATx+s3YvvZ53MvM20rV6nj9CPwfk+PE1
 MyqBBtSvtGvOsl9p9g==
 =3N3c
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore update from Kees Cook:
 "This cycle has been very quiet for pstore: the only change is adding
  awareness of the zstd compression method"

* tag 'pstore-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: add zstd compression support
2018-08-15 09:22:52 -07:00
Trond Myklebust
8618289c46 NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
We must drop the lock before we can sleep in referring_call_exists().

Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Fixes: 045d2a6d07 ("NFSv4.1: Delay callback processing...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-15 12:05:15 -04:00
Trond Myklebust
d0fbb1d8a1 NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
The use of the inode->i_lock was converted to a mutex, but we forgot
to remove the old inode unlock/lock() pair that allowed the layout
segment to be put inside the loop.

Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Fixes: e824f99ada ("NFSv4: Use a mutex to protect the per-inode commit...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-15 11:43:38 -04:00
Linus Torvalds
1202f4fdbc arm64 updates for 4.19
A bunch of good stuff in here:
 
 - Wire up support for qspinlock, replacing our trusty ticket lock code
 
 - Add an IPI to flush_icache_range() to ensure that stale instructions
   fetched into the pipeline are discarded along with the I-cache lines
 
 - Support for the GCC "stackleak" plugin
 
 - Support for restartable sequences, plus an arm64 port for the selftest
 
 - Kexec/kdump support on systems booting with ACPI
 
 - Rewrite of our syscall entry code in C, which allows us to zero the
   GPRs on entry from userspace
 
 - Support for chained PMU counters, allowing 64-bit event counters to be
   constructed on current CPUs
 
 - Ensure scheduler topology information is kept up-to-date with CPU
   hotplug events
 
 - Re-enable support for huge vmalloc/IO mappings now that the core code
   has the correct hooks to use break-before-make sequences
 
 - Miscellaneous, non-critical fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJbbV41AAoJELescNyEwWM0WoEIALhrKtsIn6vqFlSs/w6aDuJL
 cMWmFxjTaKLmIq2+cJIdFLOJ3CH80Pu9gB+nEv/k+cZdCTfUVKfRf28HTpmYWsht
 bb4AhdHMC7yFW752BHk+mzJspeC8h/2Rm8wMuNVplZ3MkPrwo3vsiuJTofLhVL/y
 BihlU3+5sfBvCYIsWnuEZIev+/I/s/qm1ASiqIcKSrFRZP6VTt5f9TC75vFI8seW
 7yc3odKb0CArexB8yBjiPNziehctQF42doxQyL45hezLfWw4qdgHOSiwyiOMxEz9
 Fwwpp8Tx33SKLNJgqoqYznGW9PhYJ7n2Kslv19uchJrEV+mds82vdDNaWRULld4=
 =kQn6
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A bunch of good stuff in here. Worth noting is that we've pulled in
  the x86/mm branch from -tip so that we can make use of the core
  ioremap changes which allow us to put down huge mappings in the
  vmalloc area without screwing up the TLB. Much of the positive
  diffstat is because of the rseq selftest for arm64.

  Summary:

   - Wire up support for qspinlock, replacing our trusty ticket lock
     code

   - Add an IPI to flush_icache_range() to ensure that stale
     instructions fetched into the pipeline are discarded along with the
     I-cache lines

   - Support for the GCC "stackleak" plugin

   - Support for restartable sequences, plus an arm64 port for the
     selftest

   - Kexec/kdump support on systems booting with ACPI

   - Rewrite of our syscall entry code in C, which allows us to zero the
     GPRs on entry from userspace

   - Support for chained PMU counters, allowing 64-bit event counters to
     be constructed on current CPUs

   - Ensure scheduler topology information is kept up-to-date with CPU
     hotplug events

   - Re-enable support for huge vmalloc/IO mappings now that the core
     code has the correct hooks to use break-before-make sequences

   - Miscellaneous, non-critical fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
  arm64: alternative: Use true and false for boolean values
  arm64: kexec: Add comment to explain use of __flush_icache_range()
  arm64: sdei: Mark sdei stack helper functions as static
  arm64, kaslr: export offset in VMCOREINFO ELF notes
  arm64: perf: Add cap_user_time aarch64
  efi/libstub: Only disable stackleak plugin for arm64
  arm64: drop unused kernel_neon_begin_partial() macro
  arm64: kexec: machine_kexec should call __flush_icache_range
  arm64: svc: Ensure hardirq tracing is updated before return
  arm64: mm: Export __sync_icache_dcache() for xen-privcmd
  drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
  arm64: Add support for STACKLEAK gcc plugin
  arm64: Add stack information to on_accessible_stack
  drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
  arm64: fix ACPI dependencies
  rseq/selftests: Add support for arm64
  arm64: acpi: fix alignment fault in accessing ACPI
  efi/arm: map UEFI memory map even w/o runtime services enabled
  efi/arm: preserve early mapping of UEFI memory map longer for BGRT
  drivers: acpi: add dependency of EFI for arm64
  ...
2018-08-14 16:39:13 -07:00
Richard Weinberger
99a24e02cc ubifs: Set default assert action to read-only
Traditionally UBIFS just reported a failed assertion and moved on. The
drawback is that users will notice UBIFS bugs when it is too late, most
of the time when it is no longer about to mount. This makes bug hunting
problematic since valuable information from failing asserts is long gone
when UBIFS is dead. The other extreme, panic'ing on a failing assert is
also not worthwhile, we want users and developers give a chance to
collect as much debugging information as possible if UBIFS hits an
assert. Therefore go for the third option, switch to read-only mode when
an assert fails. That way UBIFS will not write possible bad data to the
MTD and gives users the chance to collect debugging information.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:22 +02:00
Richard Weinberger
c38c5a7f2e ubifs: Allow setting assert action as mount parameter
Expose our three options to userspace.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:21 +02:00
Richard Weinberger
2e52eb7446 ubifs: Rework ubifs_assert()
With having access to struct ubifs_info in ubifs_assert() we can
give more information when an assert is failing.
By using ubifs_err() we can tell which UBIFS instance failed.

Also multiple actions can be taken now.
We support:
 - report: This is what UBIFS did so far, just report the failure and go
   on.
 - read-only: Switch to read-only mode.
 - panic: shoot the kernel in the head.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:21 +02:00
Richard Weinberger
6eb61d587f ubifs: Pass struct ubifs_info to ubifs_assert()
This allows us to have more context in ubifs_assert()
and take different actions depending on the configuration.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:21 +02:00
Richard Weinberger
54169ddd38 ubifs: Turn two ubifs_assert() into a WARN_ON()
We are going to pass struct ubifs_info to ubifs_assert()
but while unloading the UBIFS module we don't have the info
struct anymore.
Therefore replace the asserts by a regular WARN_ON().

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:21 +02:00
Richard Weinberger
a3d218280c ubifs: Use kmalloc_array()
Since commit 6da2ec5605 ("treewide: kmalloc() -> kmalloc_array()")
we use kmalloc_array() for kmalloc() that computes the length with
a multiplication.

Cc: Kees Cook <keescook@chromium.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:20 +02:00
Richard Weinberger
95a22d2084 ubifs: Check data node size before truncate
Check whether the size is within bounds before using it.
If the size is not correct, abort and dump the bad data node.

Cc: Kees Cook <keescook@chromium.org>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:20 +02:00
Richard Weinberger
08acbdd6fd Revert "UBIFS: Fix potential integer overflow in allocation"
This reverts commit 353748a359.
It bypassed the linux-mtd review process and fixes the issue not as it
should.

Cc: Kees Cook <keescook@chromium.org>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:20 +02:00
Richard Weinberger
49d2e05fb4 ubifs: Add comment on c->commit_sem
Every single time I come across that code, I get confused
because it looks like a possible dead lock.
Help myself by adding a comment.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:20 +02:00
Stefan Agner
7e5471ce6d ubifs: introduce Kconfig symbol for xattr support
Allow to disable extended attribute support.

This aids in reliability testing, especially since some xattr
related bugs have surfaced.

Also an embedded system might not need it, so this allows for a
slightly smaller kernel (about 4KiB).

Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:14 +02:00
Gustavo A. R. Silva
1bf0572fe2 ubifs: use swap macro in swap_dirty_idx
Make use of the swap macro and remove unnecessary variable *t*. This
makes the code easier to read and maintain.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:25:08 +02:00
Adrian Hunter
6855dc41b2 x86: Add entry trampolines to kcore
Without program headers for PTI entry trampoline pages, the trampoline
virtual addresses do not map to anything.

Example before:

 sudo gdb --quiet vmlinux /proc/kcore
 Reading symbols from vmlinux...done.
 [New process 1]
 Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
 #0  0x0000000000000000 in irq_stack_union ()
 (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000:  Cannot access memory at address 0xfffffe0000006000
 (gdb) quit

After:

 sudo gdb --quiet vmlinux /proc/kcore
 [sudo] password for ahunter:
 Reading symbols from vmlinux...done.
 [New process 1]
 Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
 #0  0x0000000000000000 in irq_stack_union ()
 (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000:  swapgs
    0xfffffe0000006003:  mov    %rsp,-0x3e12(%rip)        # 0xfffffe00000021f8
    0xfffffe000000600a:  xchg   %ax,%ax
    0xfffffe000000600c:  mov    %cr3,%rsp
    0xfffffe000000600f:  bts    $0x3f,%rsp
    0xfffffe0000006014:  and    $0xffffffffffffe7ff,%rsp
    0xfffffe000000601b:  mov    %rsp,%cr3
    0xfffffe000000601e:  mov    -0x3019(%rip),%rsp        # 0xfffffe000000300c
    0xfffffe0000006025:  pushq  $0x2b
    0xfffffe0000006027:  pushq  -0x3e35(%rip)        # 0xfffffe00000021f8
    0xfffffe000000602d:  push   %r11
    0xfffffe000000602f:  pushq  $0x33
    0xfffffe0000006031:  push   %rcx
    0xfffffe0000006032:  push   %rdi
    0xfffffe0000006033:  mov    $0xffffffff91a00010,%rdi
    0xfffffe000000603a:  callq  0xfffffe0000006046
    0xfffffe000000603f:  pause
    0xfffffe0000006041:  lfence
    0xfffffe0000006044:  jmp    0xfffffe000000603f
    0xfffffe0000006046:  mov    %rdi,(%rsp)
    0xfffffe000000604a:  retq
 (gdb) quit

In addition, entry trampolines all map to the same page.  Represent that
by giving the corresponding program headers in kcore the same offset.

This has the benefit that, when perf tools uses /proc/kcore as a source
for kernel object code, samples from different CPU trampolines are
aggregated together.  Note, such aggregation is normal for profiling
i.e. people want to profile the object code, not every different virtual
address the object code might be mapped to (across different processes
for example).

Notes by PeterZ:

This also adds the KCORE_REMAP functionality.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-14 19:13:26 -03:00
Arnd Bergmann
6cff573202 ubifs: tnc: use monotonic znode timestamp
The tnc uses get_seconds() based timestamps to check the age of a znode,
which has two problems: on 32-bit architectures this may overflow in
2038 or 2106, and it gives incorrect information when the system time
is updated using settimeofday().

Using montonic timestamps with ktime_get_seconds() solves both thes
problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:16 +02:00
Arnd Bergmann
0eca0b8067 ubifs: use timespec64 for inode timestamps
Both vfs and the on-disk inode structures can deal with fine-grained
timestamps now, so this is the last missing piece to make ubifs
y2038-safe on 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:16 +02:00
Richard Weinberger
11a6fc3dc7 ubifs: xattr: Don't operate on deleted inodes
xattr operations can race with unlink and the following assert triggers:
UBIFS assert failed in ubifs_jnl_change_xattr at 1606 (pid 6256)

Fix this by checking i_nlink before working on the host inode.

Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:15 +02:00
Richard Weinberger
312c39bd6d ubifs: gc: Fix typo
UBIFS operates on LEBs, not PEBs.

Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:15 +02:00
Richard Weinberger
eef19816ad ubifs: Fix memory leak in lprobs self-check
Allocate the buffer after we return early.
Otherwise memory is being leaked.

Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:15 +02:00
Richard Weinberger
5996559320 ubifs: Fix synced_i_size calculation for xattr inodes
In ubifs_jnl_update() we sync parent and child inodes to the flash,
in case of xattrs, the parent inode (AKA host inode) has a non-zero
data_len. Therefore we need to adjust synced_i_size too.

This issue was reported by ubifs self tests unter a xattr related work
load.
UBIFS error (ubi0:0 pid 1896): dbg_check_synced_i_size: ui_size is 4, synced_i_size is 0, but inode is clean
UBIFS error (ubi0:0 pid 1896): dbg_check_synced_i_size: i_ino 65, i_mode 0x81a4, i_size 4

Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:15 +02:00
Richard Weinberger
00ee8b6010 ubifs: Fix directory size calculation for symlinks
We have to account the name of the symlink and not the target length.

Fixes: ca7f85be8d ("ubifs: Add support for encrypted symlinks")
Cc: <stable@vger.kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-15 00:06:15 +02:00
Linus Torvalds
be718b524d configfs updates for 4.19
- simplify the cide by using kvasprintf (Bart Van Assche).
  - fix a gcc 8 string truncation warning by making the code simpler
    (Guenter Roeck).
  - fix a bug in rmdir() handling (Mike Christie).
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAltxL44LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMjKxAAy9M8gy87QhrkOTVLher1A3DY48Qbvx5AEHM2jnZQ
 MekqGIcgAGYLlS3ifu54JJBB6QoM+UjKkvkjJiFaJbg685yFepkNkUUfRTy2tVqF
 2TcsxUrhZu5gz2gm7tBll321Jz34w++PWzuVoI82YefeobYGu+XxWf0JnJQYJVL3
 Q2W73AZBLwZQ1s0F5UNAFa7s+pznIpJLTMUD3iFmzoXx0Qihy7VGxcKxn6+Bzg2E
 UfojBH//K9cxk+dMZEiVaA+bXY39fd7bj+yDzcF8ThR+SPJwXjP9yLDDFqM7qsrK
 BnhcPYNV3Qjk5WCwvn/Lx22hFHgB44URjyvxaP6Bntp7CVk4IPVJqmqjUaov0fsZ
 zHNvdjk3cy6JCOTvcUECikM3qkfgpabNUbN/QSK+nKN6JT22wGrf0ojKpSYfdBb3
 nhh8lV7YFkbmVFmigN9qAw2PH/l8C8XM0P/VezRUz9gba1jW8TcLhcvL1OKs4NFb
 LSS07pHfdhipm5go1kxcCLDb93Gg0MwYEFCX3xA6YoYU7bjg7uLUa8IR7dFVNdCC
 uipcxY9fCfcu2G2AV6WUVT+TQNYjC8tgUgSzM3LI+t12NTJ5hB7MFqSQUmCZSKmJ
 nSDjoHHFbY56H0rrelEdhzjXLNGTc8UmkIRDqd3oZk0wnvxZlq8fPRkmU7DLFEh6
 tAU=
 =5AGn
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.19' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:

 - simplify the cide by using kvasprintf (Bart Van Assche)

 - fix a gcc 8 string truncation warning by making the code simpler
   (Guenter Roeck)

 - fix a bug in rmdir() handling (Mike Christie)

* tag 'configfs-for-4.19' of git://git.infradead.org/users/hch/configfs:
  configfs: fix registered group removal
  configfs: replace strncpy with memcpy
  configfs: use kvasprintf() instead of open-coding it
2018-08-14 11:44:18 -07:00
Linus Torvalds
c2fc71c9b7 JFFS2 changes:
- Support 64-bit timestamps
 
 MTD changes:
   Core changes:
   - Support sub-partitions
   - Clarify mtd_oob_ops documentation
   - Make Kconfig formatting consistent
   - Fix potential overflows in mtdchar_{write,read}()
   - Fallback to ->_{read,write}() when ->_{read,write}_oob() is missing
     and no OOB data were requested
   - Remove VLA usage in the bch lib
 
   Driver changes:
   - Use mtd_device_register() instead of mtd_device_parse_register()
     where applicable
   - Use proper printk format to print physical addresses in the
     solutionengine driver
   - Add missing mtd_set_of_node() call in the powernv driver
   - Remove unneeded variables in a few drivers
   - Plug the TRX part parser to the DT partition parsers logic
   - Check ioremap_cache() return code in the gpio-addr-flash driver
   - Stop using VMLINUX_SYMBOL_STR() in gen_probe.c
 
 SPI NOR changes:
   Core changes:
   - Apply reset hacks only when reset is explicitly marked as broken in
     the DT
 
    Driver changes:
    - Minor cleanup/fixes in the m25p80 driver
    - Release flash_np in the nxp-spifi driver
    - Add suspend/resume hooks to the atmel-quadspi driver
    - Include gpio/consumer.h instead of gpio.h in the atmel-quadspi
      driver
    - Use %pK instead of %p in the stm32-quadspi driver
    - Improve timeout handling in the cadence-quadspi driver
    - Use mtd_device_register() instead of mtd_device_parse_register()
      in the intel-spi driver
 
 NAND changes:
   Core changes:
   - Add the SPI-NAND framework.
   - Create a helper to find the best ECC configuration.
   - Create NAND controller operations.
   - Allocate dynamically ONFI parameters structure.
   - Add defines for ONFI version bits.
   - Add manufacturer fixup for ONFI parameter page.
   - Add an option to specify NAND chip as a boot device.
   - Add Reed-Solomon error correction algorithm.
   - Better name for the controller structure.
   - Remove unused caller_is_module() definition.
   - Make subop helpers return unsigned values.
   - Expose _notsupp() helpers for raw page accessors.
   - Add default values for dynamic timings.
   - Kill the chip->scan_bbt() hook.
   - Rename nand_default_bbt() into nand_create_bbt().
   - Start to clean the nand_chip structure.
   - Remove stale prototype from rawnand.h.
 
   Raw NAND controllers drivers changes:
   - Qcom: structuring cleanup.
   - Denali: use core helper to find the best ECC configuration.
   - Possible build of almost all drivers by adding a dependency on
     COMPILE_TEST for almost all of them in Kconfig, implies various
     fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
     changes in sparc64 and ia64 architectures.
   - Clean the ->probe() functions error path of a lot of drivers.
   - Migrate all drivers to use nand_scan() instead of
     nand_scan_ident()/nand_scan_tail() pair.
   - Use mtd_device_register() where applicable to simplify the code.
   - Marvell:
     * Handle on-die ECC.
     * Better clocks handling.
     * Remove bogus comment.
     * Add suspend and resume support.
   - Tegra: add NAND controller driver.
   - Atmel:
     * Add module param to avoid using dma.
     * Drop Wenyou Yang from MAINTAINERS.
   - Denali: optimize timings handling.
   - FSMC: Stop using chip->read_buf().
   - FSL:
     * Switch to SPDX license tag identifiers.
     * Fix qualifiers in MXC init functions.
 
   Raw NAND chip drivers changes:
   - Micron:
     * Add fixup for ONFI revision.
     * Update ecc_stats.corrected.
     * Make ECC activation stateful.
     * Avoid enabling/disabling ECC when it can't be disabled.
     * Get the actual number of bitflips.
     * Allow forced on-die ECC.
     * Support 8/512 on-die ECC.
     * Fix on-die ECC detection logic.
   - Hynix:
     * Fix decoding the OOB size on H27UCG8T2BTR.
     * Use ->exec_op() in hynix_nand_reg_write_op().
 -----BEGIN PGP SIGNATURE-----
 
 iQI5BAABCAAjBQJbcSYdHBxib3Jpcy5icmV6aWxsb25AYm9vdGxpbi5jb20ACgkQ
 Ze02AX4ItwA85Q//ViUNiWoluaj7q3Kpv44NZZClPjVfirDtIHEfyzob7HbkrqjT
 ivjNUu5cjKU2aqptR24UzIV2lwmaSGicIzvyANN5bNedHy8tb4BWXo1iTEflXc1P
 lYihrRz7AeHQqdrOYAiHxq0fcqgIm2NjCDxfdtkP1HJsMQH+LgvoosOkd8RyLF8E
 jV5IyreTfg10AiVS44yvnqdsYo7GDVZRgnx1dBRbdJ+WwZS05m6LSeNu0ivgfLMM
 NmISrn7pzlMYnuDkzckXwVka7STFhV9befoSCmAHedqNEHtzaGDaNMLEilaC4WzV
 1yvsdRGI3hDkdO4s44nP/Vs2XlVmCevuTpOOQ1dEK94Bbek+us7NEPjxOIAf22jE
 5MVZR6vpNN7JXJ+nvEZjjL8X9zysqZgk2r6os35Wi2zJK7iHIrFpjlMYwTJfVPzv
 JKabOQZuOUqRACbL0isTTdG0+Tzr1BxaQqFiFUteQ9QpG4nSifoHUmI+9XUCJmil
 iGTOy6rEkaMd6qu3xapPtZbxGbUbvcLiJvNrZ02ZfsELDAELowz66O0PX6rSwTV9
 NIWeVNXK9Bsp0oxbb8P8oJIB5K4F6vqncPBls34J7E/xrLO9nLk1VvFQrGqzuwwP
 HmEdgcRwKeXrvJITZz2u0u6lrKpgdDA3FTqfGCNec51PCgb1FwxcpKF/Xz0=
 =yIkF
 -----END PGP SIGNATURE-----

Merge tag 'mtd/for-4.19' of git://git.infradead.org/linux-mtd

Pull mtd updates from Boris Brezillon:
 "JFFS2 changes:
   - Support 64-bit timestamps

  MTD core changes:
   - Support sub-partitions
   - Clarify mtd_oob_ops documentation
   - Make Kconfig formatting consistent
   - Fix potential overflows in mtdchar_{write,read}()
   - Fallback to ->_{read,write}() when ->_{read,write}_oob() is missing
     and no OOB data were requested
   - Remove VLA usage in the bch lib

  MTD driver changes:
   - Use mtd_device_register() instead of mtd_device_parse_register()
     where applicable
   - Use proper printk format to print physical addresses in the
     solutionengine driver
   - Add missing mtd_set_of_node() call in the powernv driver
   - Remove unneeded variables in a few drivers
   - Plug the TRX part parser to the DT partition parsers logic
   - Check ioremap_cache() return code in the gpio-addr-flash driver
   - Stop using VMLINUX_SYMBOL_STR() in gen_probe.c

  SPI NOR core changes:
   - Apply reset hacks only when reset is explicitly marked as broken in
     the DT

   SPI NOR driver changes:
   - Minor cleanup/fixes in the m25p80 driver
   - Release flash_np in the nxp-spifi driver
   - Add suspend/resume hooks to the atmel-quadspi driver
   - Include gpio/consumer.h instead of gpio.h in the atmel-quadspi
     driver
   - Use %pK instead of %p in the stm32-quadspi driver
   - Improve timeout handling in the cadence-quadspi driver
   - Use mtd_device_register() instead of mtd_device_parse_register() in
     the intel-spi driver

  NAND core changes:
   - Add the SPI-NAND framework.
   - Create a helper to find the best ECC configuration.
   - Create NAND controller operations.
   - Allocate dynamically ONFI parameters structure.
   - Add defines for ONFI version bits.
   - Add manufacturer fixup for ONFI parameter page.
   - Add an option to specify NAND chip as a boot device.
   - Add Reed-Solomon error correction algorithm.
   - Better name for the controller structure.
   - Remove unused caller_is_module() definition.
   - Make subop helpers return unsigned values.
   - Expose _notsupp() helpers for raw page accessors.
   - Add default values for dynamic timings.
   - Kill the chip->scan_bbt() hook.
   - Rename nand_default_bbt() into nand_create_bbt().
   - Start to clean the nand_chip structure.
   - Remove stale prototype from rawnand.h.

  Raw NAND controllers drivers changes:
   - Qcom: structuring cleanup.
   - Denali: use core helper to find the best ECC configuration.
   - Possible build of almost all drivers by adding a dependency on
     COMPILE_TEST for almost all of them in Kconfig, implies various
     fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
     changes in sparc64 and ia64 architectures.
   - Clean the ->probe() functions error path of a lot of drivers.
   - Migrate all drivers to use nand_scan() instead of
     nand_scan_ident()/nand_scan_tail() pair.
   - Use mtd_device_register() where applicable to simplify the code.
   - Marvell:
      * Handle on-die ECC.
      * Better clocks handling.
      * Remove bogus comment.
      * Add suspend and resume support.
   - Tegra: add NAND controller driver.
   - Atmel:
      * Add module param to avoid using dma.
      * Drop Wenyou Yang from MAINTAINERS.
   - Denali: optimize timings handling.
   - FSMC: Stop using chip->read_buf().
   - FSL:
      * Switch to SPDX license tag identifiers.
      * Fix qualifiers in MXC init functions.

  Raw NAND chip drivers changes:
   - Micron:
      * Add fixup for ONFI revision.
      * Update ecc_stats.corrected.
      * Make ECC activation stateful.
      * Avoid enabling/disabling ECC when it can't be disabled.
      * Get the actual number of bitflips.
      * Allow forced on-die ECC.
      * Support 8/512 on-die ECC.
      * Fix on-die ECC detection logic.
   - Hynix:
      * Fix decoding the OOB size on H27UCG8T2BTR.
      * Use ->exec_op() in hynix_nand_reg_write_op()"

* tag 'mtd/for-4.19' of git://git.infradead.org/linux-mtd: (188 commits)
  mtd: rawnand: atmel: Select GENERIC_ALLOCATOR
  MAINTAINERS: drop Wenyou Yang from Atmel NAND driver support
  mtd: rawnand: allocate dynamically ONFI parameters during detection
  mtd: spi-nor: only apply reset hacks to broken hardware
  mtd: spi-nor: cadence-quadspi: fix timeout handling
  mtd: spi-nor: atmel-quadspi: Include gpio/consumer.h instead of gpio.h
  mtd: spi-nor: intel-spi: use mtd_device_register()
  mtd: spi-nor: stm32-quadspi: replace "%p" with "%pK"
  mtd: spi-nor: atmel-quadspi: add suspend/resume hooks
  mtd: rawnand: allocate model parameter dynamically
  mtd: rawnand: do not export nand_scan_[ident|tail]() anymore
  mtd: rawnand: txx9ndfmc: convert driver to nand_scan()
  mtd: rawnand: txx9ndfmc: clarify ECC parameters assignation
  mtd: rawnand: tegra: convert driver to nand_scan()
  mtd: rawnand: jz4740: convert driver to nand_scan()
  mtd: rawnand: jz4740: group nand_scan_{ident, tail} calls
  mtd: rawnand: jz4740: fix probe function error path
  mtd: rawnand: docg4: convert driver to nand_scan()
  mtd: rawnand: do not execute nand_scan_ident() if maxchips is zero
  mtd: rawnand: atmel: convert driver to nand_scan()
  ...
2018-08-14 10:57:44 -07:00
Chao Yu
dda9f4b9ca f2fs: fix to skip verifying block address for non-regular inode
generic/184 1s ... [failed, exit status 1]- output mismatch
    --- tests/generic/184.out	2015-01-11 16:52:27.643681072 +0800
     QA output created by 184 - silence is golden
    +rm: cannot remove '/mnt/f2fs/null': Bad address
    +mknod: '/mnt/f2fs/null': Bad address
    +chmod: cannot access '/mnt/f2fs/null': Bad address
    +./tests/generic/184: line 36: /mnt/f2fs/null: Bad address
    ...

F2FS-fs (zram0): access invalid blkaddr:259
EIP: f2fs_is_valid_blkaddr+0x14b/0x1b0 [f2fs]
 f2fs_iget+0x927/0x1010 [f2fs]
 f2fs_lookup+0x26e/0x630 [f2fs]
 __lookup_slow+0xb3/0x140
 lookup_slow+0x31/0x50
 walk_component+0x185/0x1f0
 path_lookupat+0x51/0x190
 filename_lookup+0x7f/0x140
 user_path_at_empty+0x36/0x40
 vfs_statx+0x61/0xc0
 __do_sys_stat64+0x29/0x40
 sys_stat64+0x13/0x20
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x53/0x86

In f2fs_iget(), we will check inode's first block address, if it is valid,
we will set FI_FIRST_BLOCK_WRITTEN flag in inode.

But we should only do this for regular inode, otherwise, like special
inode, i_addr[0] is used for storing device info instead of block address,
it will fail checking flow obviously.

So for non-regular inode, let's skip verifying address and setting flag.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-14 10:29:11 -07:00
Linus Torvalds
73ba2fb33c for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Arnd Bergmann
7fa750a163 f2fs: rework fault injection handling to avoid a warning
When CONFIG_F2FS_FAULT_INJECTION is disabled, we get a warning about an
unused label:

fs/f2fs/segment.c: In function '__submit_discard_cmd':
fs/f2fs/segment.c:1059:1: error: label 'submit' defined but not used [-Werror=unused-label]

This could be fixed by adding another #ifdef around it, but the more
reliable way of doing this seems to be to remove the other #ifdefs
where that is easily possible.

By defining time_to_inject() as a trivial stub, most of the checks for
CONFIG_F2FS_FAULT_INJECTION can go away. This also leads to nicer
formatting of the code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-14 09:49:15 -07:00
Colin Ian King
e1b437691a orangefs: remove redundant pointer orangefs_inode
Pointer orangefs_inode is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'orangefs_inode' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-08-14 12:07:14 -04:00
Souptick Joarder
8bf782f647 orangefs: Adding new return type vm_fault_t
Use new return type vm_fault_t for fault handler. For now,
this is just documenting that the function returns a VM_FAULT
value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type.

See the following
commit 1c8f422059 ("mm: change return type to vm_fault_t")

Fixed checkpatch.pl warning.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-08-14 12:07:14 -04:00
Linus Torvalds
781fca5b10 Changes for 4.19:
- Use extent maps to track pagecache page status instead of bufferhead
   state.
 - Refactor pagecache read and write paths to use the new iomap library
   functions, which enable us to drop the old bufferhead code for
   pagesize == blocksize filesystems.
 - Set up parallel per-block-per-page metadata to track subpage
   information that was tracked by buffer heads, which enables us to drop
   the old bufferhead code for pagesize > blocksize filesystems.
 - Tie a deferred ops control structure to a transaction so that we can
   take advantage of an upper-level dfops without having to plumb pointer
   passing through the code.
 - Refactor the deferred ops code to track deferred ops as part of the
   transaction structure (instead of as a separate data structure) so
   that we can simplify the scoping rules around defer_ops.
 - Refactor twisty delwri buffer submission code to avoid deadlocks.
 - Shorten and fix indenting problems in the scrub code.
 - Detect obviously bad summary counts at mount and fix them.
 - Directly associate deferred ops control structure with a transaction
   so that callers no longer have to manage it themselves.
 - Remove a couple of IRIX-era inode macros.
 - Remove the long-deprecated 'barrier' and 'nobarrier' mount options.
 - Clean up the inode fork structure a bit.
 - Check for bad fs summary counter values in the superblock.
 - Reduce COW fork lookups during writeback.
 - Refactor the deferred ops control structures into the transaction
   structure, thereby eliminating the need for transaction users to
   handle the deferred ops as a separate data structure.
 - Add the ability to repair AG headers online.
 - Fix a crash due to insufficient return value checking.
 - Various fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltwVGoACgkQ+H93GTRK
 tOt3bw//WaG1mR44Oo/mhf27MaEJK74LXeViqCH4Sdk10gujClTVnl6h33ChAyEi
 7BT4x1JtwM6xOh7nPsXQy/besVxWadjQcTtAz/3U2wJoFyOX2+I27SAawrmX6jfR
 Hi1DxXFFK7z/8YvuZqYl3vTgxMNb7bLAUybe2sYX8q+vrQaUvl9eLQlHSaT3sxrc
 /lBkog1dYmbw3yjLnWYpQtC0I6Pa3ZuG/S2vpeJ2H5MADtzrRNjuC9MHZJW7tIGm
 +rCLm0agk8yFkEA84VvS5Afee3TppY/JBaYlsvG1rp3bs0fELAJFnzS4g/QDbbsX
 HAKPcMICJksF4C9y0Xb7wXPz/4PKur5/OSuGXN4QtOivOEoAdWfh2PLInqAjo/Le
 mO92PdkBucfVqJzfEC2q2QAnGIaJlG8txhAz87wZ1YfZDQQlJDy385Z9GQXfUpy5
 /1xH7V0cze1ZBSxWSddSFg0gCtaWSerfp0CmAG3A+HWKIN6c/ZNSCrqdq0DBC99D
 qOn6ThjckZWGvz/KV5xBr/KvUYOpSeEyREtgcAN008TiUaNy4nOhWV2xgLGuPY/J
 ed4V2B9qVbq+l+sZyzukB8cmOXmcCey6omwJ7LqZzoTWTAtTQtM2MwhaQFUWtQG8
 mCqPXJp1XyL24sn0bI1t2NuKgQcs6QEQWX3zN4DA6I+N9+sTDqo=
 =2G+i
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "This is the second part of the XFS changes for 4.19.

  The biggest changes are the removal of buffer heads frm XFS, a massive
  reworking of the deferred transaction operations handling code, the
  removal of the long defunct barrier/nobarrier mount options, and the
  addition of a few more online repair functions.

  Summary:

   - Use extent maps to track pagecache page status instead of
     bufferhead state.

   - Refactor pagecache read and write paths to use the new iomap
     library functions, which enable us to drop the old bufferhead code
     for pagesize == blocksize filesystems.

   - Set up parallel per-block-per-page metadata to track subpage
     information that was tracked by buffer heads, which enables us to
     drop the old bufferhead code for pagesize > blocksize filesystems.

   - Tie a deferred ops control structure to a transaction so that we
     can take advantage of an upper-level dfops without having to plumb
     pointer passing through the code.

   - Refactor the deferred ops code to track deferred ops as part of the
     transaction structure (instead of as a separate data structure) so
     that we can simplify the scoping rules around defer_ops.

   - Refactor twisty delwri buffer submission code to avoid deadlocks.

   - Shorten and fix indenting problems in the scrub code.

   - Detect obviously bad summary counts at mount and fix them.

   - Directly associate deferred ops control structure with a
     transaction so that callers no longer have to manage it themselves.

   - Remove a couple of IRIX-era inode macros.

   - Remove the long-deprecated 'barrier' and 'nobarrier' mount options.

   - Clean up the inode fork structure a bit.

   - Check for bad fs summary counter values in the superblock.

   - Reduce COW fork lookups during writeback.

   - Refactor the deferred ops control structures into the transaction
     structure, thereby eliminating the need for transaction users to
     handle the deferred ops as a separate data structure.

   - Add the ability to repair AG headers online.

   - Fix a crash due to insufficient return value checking.

   - Various fixes and cleanups"

* tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (155 commits)
  xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
  xfs: remove b_last_holder & associated macros
  iomap: Switch to offset_in_page for clarity
  xfs: Close race between direct IO and xfs_break_layouts()
  xfs: repair the AGI
  xfs: repair the AGFL
  xfs: repair the AGF
  xfs: remove dead error handling code in xfs_dquot_disk_alloc()
  xfs: use WRITE_ONCE to update if_seq
  xfs: fix a comment in xfs_log_reserve
  xfs: only validate summary counts on primary superblock
  xfs: substitute spaces with tabs
  xfs: fold dfops into the transaction
  xfs: always defer agfl block frees
  xfs: pass transaction to xfs_defer_add()
  xfs: replace xfs_defer_ops ->dop_pending with on-stack list
  xfs: cancel dfops on xfs_defer_finish() error
  xfs: clean out superfluous dfops dop params/vars
  xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
  xfs: automatic dfops inode relogging
  ...
2018-08-14 08:56:02 -07:00
Darrick J. Wong
7d5e049e72 iomap: fix WARN_ON_ONCE on uninitialized variable
In commit 9dc55f1389 ("iomap: add support for sub-pagesize buffered
I/O without buffer heads") we moved the initialization of poff (it's
computed from pos) into a separate helper function.  Inline data only
ever deals with pos == 0, hence the WARN_ON_ONCE, but now we're testing
an uninitialized variable.

Therefore, change the test to check the parameter directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-14 08:17:02 -07:00
Darrick J. Wong
1fc25f51d7 xfs: sanity check ag header values in xrep_calc_ag_resblks
Check the values we read in from the AG headers when calculating the
block reservations for a repair transaction.  If they're obviously
wrong, substitute worst case assumptions (rather than ENOSPC on a bogus
reservation request).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-14 08:17:02 -07:00
Linus Torvalds
10f3e23f07 Convert content from the ext4 wiki to Documentation rst files so it is
more likely to be updated as we add new features to ext4.  Add 64-bit
 timestamp support to ext4's superblock fields.  And the usual bug
 fixes and cleanups, including a Spectre gadget fixup and some
 hardening against maliciously corrupted file systems.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAltx3CAACgkQ8vlZVpUN
 gaOfkAgApkNU46yPWWFqJpeTR76Go4zW00x9Piaita+KpQcTCUokK0BDXXeydOEo
 zhtNjCRl00vbhnYJ90T75Y3ce9oGQDkWpXHFZBdl+kmMBaIHQJYEECCLR2zrvxVk
 LhsixvLozQd/VFenwzmXLqYshiZjhjBZOKo7/xsoy4waer1vNidA9/SR3hGWTMOk
 9IsEVqw1h6pZt7ZB/leZmZ348H9QZZbYOUWl2ArvdmNeg2t8pWH0MD0bos+mj5lW
 xihyJEIS/OjWZRJhM2WEPMO7U4jD/aX3zmDMGePp/vC5KSqA4FGwk671xIyD+LXR
 Hd8Rn8chRe4/iyhBqHpJNkqHQRUV6g==
 =SA3p
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - Convert content from the ext4 wiki to Documentation rst files so it
   is more likely to be updated as we add new features to ext4.

 - Add 64-bit timestamp support to ext4's superblock fields.

 - ... and the usual bug fixes and cleanups, including a Spectre gadget
   fixup and some hardening against maliciously corrupted file systems.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits)
  ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
  ext4: improve code readability in ext4_iget()
  ext4: fix spectre gadget in ext4_mb_regular_allocator()
  ext4: check for NUL characters in extended attribute's name
  ext4: use ext4_warning() for sb_getblk failure
  ext4: fix race when setting the bitmap corrupted flag
  ext4: reset error code in ext4_find_entry in fallback
  ext4: handle layout changes to pinned DAX mappings
  dax: dax_layout_busy_page() warn on !exceptional
  docs: fix up the obviously obsolete bits in the new ext4 documentation
  docs: add new ext4 superblock time extension fields
  docs: create filesystem internal section
  ext4: use swap macro in mext_page_double_lock
  ext4: check allocation failure when duplicating "data" in ext4_remount()
  ext4: fix warning message in ext4_enable_quotas()
  ext4: super: extend timestamps to 40 bits
  jbd2: replace current_kernel_time64 with ktime equivalent
  ext4: use timespec64 for all inode times
  ext4: use ktime_get_real_seconds for i_dtime
  ext4: use 64-bit timestamps for mmp_time
  ...
2018-08-13 22:34:47 -07:00
Linus Torvalds
3bb37da509 smb3/cifs fixes (including 8 for stable). Improved tracing, stats, snapshots (previous version mounts work now), performance (compounding enabled for statfs)
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAltwwVkACgkQiiy9cAdy
 T1H06gv/d+b52w2X05YlLBB1GboJjv1dl3sVrFisJrMrCCr6Hotk+4+RtC8CJHh6
 /9Joq6iY8B/XLl7HrWr61vJhn8vlWOGhOQaUmHCBGeiIMX93RbaP8KvkPfHAKQiJ
 T6+7v70z/WL7mUQA2TEQ3Iz6kpCuP2y/XT6eQglTFXasqsWHy08TbsbnmP9yX2i4
 E1B6zMwpn6m4PFxEURg14eBJTrQomb/UBSLHNwxwwOQjnKzsmeH3pyiFCJSPJNFF
 Xv/lyBibgvQAsuWtLTi82fqei+c60as2LVrsFLdoWXeD6JMlx1O+SHSz0NNJL/Lk
 oxMNx9NI+LsYmekIBHgoX01PzVgCpcn6ThfoQv3JLHtqiZuk3Wlai7bPvJMIApoM
 8fgHGR9xvhRCfHgYvxx4MmP3e7xewajSe7fh8/y8BqreO+fLeoyqho5G9zDpn6bU
 wGFPtVYW/fgTEM5E3NOjsuC1zE35bjwU7U40jqjGHP1VsTs3hkLFy9fZNNq4EP1M
 ytfJPlTG
 =gzZK
 -----END PGP SIGNATURE-----

Merge tag '4.19-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
 "smb3/cifs fixes (including 8 for stable).

  Other improvements include:

   - improved tracing, improved stats

   - snapshots (previous version mounts work now over SMB3)

   - performance (compounding enabled for statfs, ~40% faster).

   - security (make it possible to build cifs.ko with insecure vers=1.0
     disabled in Kconfig)"

* tag '4.19-smb3' of git://git.samba.org/sfrench/cifs-2.6: (43 commits)
  smb3: create smb3 equivalent alias for cifs pseudo-xattrs
  smb3: allow previous versions to be mounted with snapshot= mount parm
  cifs: don't show domain= in mount output when domain is empty
  cifs: add missing support for ACLs in SMB 3.11
  smb3: enumerating snapshots was leaving part of the data off end
  cifs: update smb2_queryfs() to use compounding
  cifs: update receive_encrypted_standard to handle compounded responses
  cifs: create SMB2_open_init()/SMB2_open_free() helpers.
  cifs: add SMB2_query_info_[init|free]()
  cifs: add SMB2_close_init()/SMB2_close_free()
  smb3: display stats counters for number of slow commands
  CIFS: fix uninitialized ptr deref in smb2 signing
  smb3: Do not send SMB3 SET_INFO if nothing changed
  smb3: fix minor debug output for CONFIG_CIFS_STATS
  smb3: add tracepoint for slow responses
  cifs: add compound_send_recv()
  cifs: make smb_send_rqst take an array of requests
  cifs: update init_sg, crypt_message to take an array of rqst
  smb3: update readme to correct information about /proc/fs/cifs/Stats
  smb3: fix reset of bytes read and written stats
  ...
2018-08-13 22:32:11 -07:00
Linus Torvalds
161fa27ff2 Merge branch 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull fs iomap refactoring from Darrick Wong:
 "This is the first part of the XFS changes for 4.19.

  Christoph and Andreas coordinated some refactoring work on the iomap
  code in preparation for removing buffer heads from XFS and porting
  gfs2 to iomap. I'm sending this small pull request ahead of the main
  XFS merge to avoid holding up gfs2 unnecessarily"

* 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: add inline data support to iomap_readpage_actor
  iomap: support direct I/O to inline data
  iomap: refactor iomap_dio_actor
  iomap: add initial support for writes without buffer heads
  iomap: add an iomap-based readpage and readpages implementation
  iomap: add private pointer to struct iomap
  iomap: add a page_done callback
  iomap: generic inline data handling
  iomap: complete partial direct I/O writes synchronously
  iomap: mark newly allocated buffer heads as new
  fs: factor out a __generic_write_end helper
2018-08-13 22:29:03 -07:00
Linus Torvalds
a1a4f841ec for-4.19-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltxe7QACgkQxWXV+ddt
 WDswMA//QlRO+Ln5CH+RlT4fyf1RQUQZblWss2zxrmlo1GRI3Ljf2DNsBE3rD7P4
 NSiXfHmgkdjcQP6poPLJwHxwkNd4NFXglYg64wWO10RjHGhKglmH6ztU88wsPfr2
 2RZv271/NvYIEkEi6kdyy8ilKeWMshOfyj3+PaeapQn67uJfimyiUvDgUgbvwH3c
 yj0nVRLP1C7snNj4Atti/rjXMhG+m1UWfjRkZsmqlBp52k2UAcrtiwQK+DS5b9mL
 aWLSaGmIcJtSMkNJPQBST9GTWbJfKTpceoCzkT0o3irvQpN2e2flAJ4ireL8q4mN
 MvqJ7giPBFHNDcHEzN6VERvsaA1Rx9Vq20ieQl8JAMd4p/bi5ehN3ww+9vau5zCw
 Pc8WeKEILKrLYEAgHOnUO1wxHw994Iv5CA26roTQ0HNXQJjyEZ4m40Ch6LzmfKPm
 WKcHX14Uw22GKaFEXHTOpRZ0U0d1cMTcn5zaAajGsB9LwcaiLM+OiFSPtDkwUOB9
 QGJHklZVXAD1IH9HFPuq85uUtXTLXbxsw1g8phEJGbmaVxxCOAUAXwEk3qxuZNbz
 CHL3G5+l3JEXxfoJSbDW60kr8xic7teqQDszqqP2qlqtP15ty2xc9d5Q8MZajSTZ
 H1z9+0gfjYYHrGuAp69MtCbdQhhDSqLyivjJJm0HBaKfVNGW2Xg=
 =jBaz
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Mostly fixes and cleanups, nothing big, though the notable thing is
  the inserted/deleted lines delta -1124.

  User visible changes:
   - allow defrag on opened read-only files that have rw permissions;
     similar to what dedupe will allow on such files

  Core changes:
   - tree checker improvements, reported by fuzzing:
      * more checks for: block group items, essential trees
      * chunk type validation
      * mount time cross-checks that physical and logical chunks match
      * switch more error codes to EUCLEAN aka EFSCORRUPTED

  Fixes:
   - fsync corner case fixes

   - fix send failure when root has deleted files still open

   - send, fix incorrect file layout after hole punching beyond eof

   - fix races between mount and deice scan ioctl, found by fuzzing

   - fix deadlock when delayed iput is called from writeback on the same
     inode; rare but has been observed in practice, also removes code

   - fix pinned byte accounting, using the right percpu helpers; this
     should avoid some write IO inefficiency during low space conditions

   - don't remove block group that still has pinned bytes

   - reset on-disk device stats value after replace, otherwise this
     would report stale values for the new device

  Cleanups:
   - time64_t/timespec64 cleanups

   - remove remaining dead code in scrub handling NOCOW extents after
     disabling it in previous cycle

   - simplify fsync regarding ordered extents logic and remove all the
     related code

   - remove redundant arguments in order to reduce stack space
     consumption

   - remove support for V0 type of extents, not in use since 2.6.30

   - remove several unused structure members

   - fewer indirect function calls by inlining some callbacks

   - qgroup rescan timing fixes

   - vfs: iget cleanups"

* tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (182 commits)
  btrfs: revert fs_devices state on error of btrfs_init_new_device
  btrfs: Exit gracefully when chunk map cannot be inserted to the tree
  btrfs: Introduce mount time chunk <-> dev extent mapping check
  btrfs: Verify that every chunk has corresponding block group at mount time
  btrfs: Check that each block group has corresponding chunk at mount time
  Btrfs: send, fix incorrect file layout after hole punching beyond eof
  btrfs: Use wrapper macro for rcu string to remove duplicate code
  btrfs: simplify btrfs_iget
  btrfs: lift make_bad_inode into btrfs_iget
  btrfs: simplify IS_ERR/PTR_ERR checks
  btrfs: btrfs_iget never returns an is_bad_inode inode
  btrfs: replace: Reset on-disk dev stats value after replace
  btrfs: extent-tree: Remove unused __btrfs_free_block_rsv
  btrfs: backref: Use ERR_CAST to return error code
  btrfs: Remove redundant btrfs_release_path from btrfs_unlink_subvol
  btrfs: Remove root parameter from btrfs_unlink_subvol
  btrfs: Remove fs_info from btrfs_add_root_ref
  btrfs: Remove fs_info from btrfs_del_root_ref
  btrfs: Remove fs_info from btrfs_del_root
  btrfs: Remove fs_info from btrfs_delete_delayed_dir_index
  ...
2018-08-13 21:58:53 -07:00
Linus Torvalds
575b94386b File locking fixes and enhancements for v4.19
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbcDltAAoJEAAOaEEZVoIV9YUP/ioYw3+kIXwRa05ec8Twbn6m
 O+1HOX9UAuDWX+97P2kCJSqi9a/TkpfLQxoGxZowDv97ZkVeHdzVaJVN/0sbH1uP
 Oj86b2Y5zGyrLk3D4c5VcyHkvogr16DUvhYeRtcjPyufSaKqefftCeetTntRvr4Z
 rZ8HLKYxa4AZlP/FkgVZ/S6PmvG7tH8rVLMFCfY+rASbSU4gBxmrS0nB2UgU4ycb
 b2SVZo4Zxycw0KiP14T5AgN4puid7VEofFezBpMKjNPjCUk1wJ3mtUyLeS8jRUEx
 M6qnl7DU4BWPRtGxiXNebN85M8iaJnlaeQglowklrubrFtqnVeDQM0rz42NiY2/S
 2jk6q8b9XKvoDEev2VeKfmFztu4gGXypNR6xAolvgBnUZziKKFUJutZwJe95pZV1
 kO6CvdchuCQdLMLdHPji2kk2AA8Go5webONfioFhoWYfeSvgjRELmOFTDjP22eNv
 s1/vfCsjuU6vdyaXXXZpN/7VMenoNegSQUXFoxhmHO7BtNuTqk+ZVFNgaWsOrXcx
 I0ZIjCAmOsU+DKLn+DTT8kDtU3ihZz/OCXXZJi9JPeLFZ6gSghDRnFThwYqcZvA+
 syIn1XGGtn15Q7A5FZNvwGqYir1GtyOslrxxIKxGlhleovhT94HuKXkL1T+k74zl
 FXItKgRs0TPQfTsI8q+C
 =IMcz
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "Just a couple of patches from Konstantin to fix /proc/locks when the
  process that set the lock has exited, and a new tracepoint for the
  flock() codepath. Also threw in mailmap entries for my addresses and a
  comment cleanup"

* tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: remove misleading obsolete comment
  mailmap: remap some of my email addresses to kernel.org address
  locks: add tracepoint in flock codepath
  fs/lock: show locks taken by processes from another pidns
  fs/lock: skip lock owner pid translation in case we are in init_pid_ns
2018-08-13 21:56:50 -07:00
Linus Torvalds
4591343e35 Merge branches 'work.misc' and 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc cleanups from various folks all over the place

  I expected more fs/dcache.c cleanups this cycle, so that went into a
  separate branch. Said cleanups have missed the window, so in the
  hindsight it could've gone into work.misc instead. Decided not to
  cherry-pick, thus the 'work.dcache' branch"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: dcache: Use true and false for boolean values
  fold generic_readlink() into its only caller
  fs: shave 8 bytes off of struct inode
  fs: Add more kernel-doc to the produced documentation
  fs: Fix attr.c kernel-doc
  removed extra extern file_fdatawait_range

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill dentry_update_name_case()
2018-08-13 21:28:25 -07:00
Linus Torvalds
f2be269897 Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs aio updates from Al Viro:
 "Christoph's aio poll, saner this time around.

  This time it's pretty much local to fs/aio.c. Hopefully race-free..."

* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: allow direct aio poll comletions for keyed wakeups
  aio: implement IOCB_CMD_POLL
  aio: add a iocb refcount
  timerfd: add support for keyed wakeups
2018-08-13 20:56:23 -07:00
Linus Torvalds
4d2a073cde Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lookup() updates from Al Viro:
 "More conversions of ->lookup() to d_splice_alias().

  Should be reasonably complete now - the only leftovers are in ceph"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  afs_try_auto_mntpt(): return NULL instead of ERR_PTR(-ENOENT)
  afs_lookup(): switch to d_splice_alias()
  afs: switch dynroot lookups to d_splice_alias()
  hpfs: fix an inode leak in lookup, switch to d_splice_alias()
  hostfs_lookup: switch to d_splice_alias()
2018-08-13 20:54:14 -07:00
Linus Torvalds
0ea97a2d61 Merge branch 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs icache updates from Al Viro:

 - NFS mkdir/open_by_handle race fix

 - analogous solution for FUSE, replacing the one currently in mainline

 - new primitive to be used when discarding halfway set up inodes on
   failed object creation; gives sane warranties re icache lookups not
   returning such doomed by still not freed inodes. A bunch of
   filesystems switched to that animal.

 - Miklos' fix for last cycle regression in iget5_locked(); -stable will
   need a slightly different variant, unfortunately.

 - misc bits and pieces around things icache-related (in adfs and jfs).

* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  jfs: don't bother with make_bad_inode() in ialloc()
  adfs: don't put inodes into icache
  new helper: inode_fake_hash()
  vfs: don't evict uninitialized inode
  jfs: switch to discard_new_inode()
  ext2: make sure that partially set up inodes won't be returned by ext2_iget()
  udf: switch to discard_new_inode()
  ufs: switch to discard_new_inode()
  btrfs: switch to discard_new_inode()
  new primitive: discard_new_inode()
  kill d_instantiate_no_diralias()
  nfs_instantiate(): prevent multiple aliases for directory inode
2018-08-13 20:25:58 -07:00
Linus Torvalds
a66b4cd1e7 Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs open-related updates from Al Viro:

 - "do we need fput() or put_filp()" rules are gone - it's always fput()
   now. We keep track of that state where it belongs - in ->f_mode.

 - int *opened mess killed - in finish_open(), in ->atomic_open()
   instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

 - alloc_file() wrappers with saner calling conventions are introduced
   (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
   much simplification.

 - while we are at it, saner calling conventions for path_init() and
   link_path_walk(), simplifying things inside fs/namei.c (both on
   open-related paths and elsewhere).

* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  few more cleanups of link_path_walk() callers
  allow link_path_walk() to take ERR_PTR()
  make path_init() unconditionally paired with terminate_walk()
  document alloc_file() changes
  make alloc_file() static
  do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
  new helper: alloc_file_clone()
  create_pipe_files(): switch the first allocation to alloc_file_pseudo()
  anon_inode_getfile(): switch to alloc_file_pseudo()
  hugetlb_file_setup(): switch to alloc_file_pseudo()
  ocxlflash_getfile(): switch to alloc_file_pseudo()
  cxl_getfile(): switch to alloc_file_pseudo()
  ... and switch shmem_file_setup() to alloc_file_pseudo()
  __shmem_file_setup(): reorder allocations
  new wrapper: alloc_file_pseudo()
  kill FILE_{CREATED,OPENED}
  switch atomic_open() and lookup_open() to returning 0 in all success cases
  document ->atomic_open() changes
  ->atomic_open(): return 0 in all success cases
  get rid of 'opened' in path_openat() and the helpers downstream
  ...
2018-08-13 19:58:36 -07:00
Linus Torvalds
e5a32b5b21 Here are the main MIPS changes for 4.19.
An overview of the general architecture changes:
 
   - Massive DMA ops refactoring from Christoph Hellwig (huzzah for
     deleting crufty code!).
 
   - We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes & corresponding
     regsets to expose DSP ASE & floating point mode state respectively,
     both for live debugging & core dumps.
 
   - We better optimize our code by hard-coding cpu_has_* macros at
     compile time where their values are known due to the ISA revision
     that the kernel build is targeting.
 
   - The EJTAG exception handler now better handles SMP systems, where it
     was previously possible for CPUs to clobber a register value saved
     by another CPU.
 
   - Our implementation of memset() gained a couple of fixes for MIPSr6
     systems to return correct values in some cases where stores fault.
 
   - We now implement ioremap_wc() using the uncached-accelerated cache
     coherency attribute where supported, which is detected during boot,
     and fall back to plain uncached access where necessary. The
     MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
     ioremap_cacheable_cow() are removed.
 
   - The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
     systems by reworking the way we ensure remote CPUs that may be
     running threads within the affected process switch mode.
 
   - Systems using the MIPS Coherence Manager will now set the
     MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
     maintenance overhead when flushing the icache.
 
   - A few fixes were made for building with clang/LLVM, which
     now sucessfully builds kernels for many of our platforms.
 
   - Miscellaneous cleanups all over.
 
 And some platform-specific changes:
 
   - ar7 gained stubs for a few clock API functions to fix build failures
     for some drivers.
 
   - ath79 gained support for a few new SoCs, a few fixes & better
     gpio-keys support.
 
   - Ci20 now exposes its SPI bus using the spi-gpio driver.
 
   - The generic platform can now auto-detect a suitable value for
     PHYS_OFFSET based upon the memory map described by the device tree,
     allowing us to avoid wasting memory on page book-keeping for systems
     where RAM starts at a non-zero physical address.
 
   - Ingenic systems using the jz4740 platform code now link their
     vmlinuz higher to allow for kernels of a realistic size.
 
   - Loongson32 now builds the kernel targeting MIPSr1 rather than MIPSr2
     to avoid CPU errata.
 
   - Loongson64 gains a couple of fixes, a workaround for a write
     buffering issue & support for the Loongson 3A R3.1 CPU.
 
   - Malta now uses the piix4-poweroff driver to handle powering down.
 
   - Microsemi Ocelot gained support for its SPI bus & NOR flash, its
     second MDIO bus and can now be supported by a FIT/.itb image.
 
   - Octeon saw a bunch of header cleanups which remove a lot of
     duplicate or unused code.
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCW3G6JxUccGF1bC5idXJ0
 b25AbWlwcy5jb20ACgkQPqefrLV1AN0n/gD/Rpdgay31G/4eTTKBmBrcaju6Shjt
 /2Iu6WC5Sj4hDHUBAJSbuI+B9YjcNsjekBYxB/LLD7ImcLBl6nLMIvKmXLAL
 =cUiF
 -----END PGP SIGNATURE-----

Merge tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS updates from Paul Burton:
 "Here are the main MIPS changes for 4.19.

  An overview of the general architecture changes:

   - Massive DMA ops refactoring from Christoph Hellwig (huzzah for
     deleting crufty code!).

   - We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes &
     corresponding regsets to expose DSP ASE & floating point mode state
     respectively, both for live debugging & core dumps.

   - We better optimize our code by hard-coding cpu_has_* macros at
     compile time where their values are known due to the ISA revision
     that the kernel build is targeting.

   - The EJTAG exception handler now better handles SMP systems, where
     it was previously possible for CPUs to clobber a register value
     saved by another CPU.

   - Our implementation of memset() gained a couple of fixes for MIPSr6
     systems to return correct values in some cases where stores fault.

   - We now implement ioremap_wc() using the uncached-accelerated cache
     coherency attribute where supported, which is detected during boot,
     and fall back to plain uncached access where necessary. The
     MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
     ioremap_cacheable_cow() are removed.

   - The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
     systems by reworking the way we ensure remote CPUs that may be
     running threads within the affected process switch mode.

   - Systems using the MIPS Coherence Manager will now set the
     MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
     maintenance overhead when flushing the icache.

   - A few fixes were made for building with clang/LLVM, which now
     sucessfully builds kernels for many of our platforms.

   - Miscellaneous cleanups all over.

  And some platform-specific changes:

   - ar7 gained stubs for a few clock API functions to fix build
     failures for some drivers.

   - ath79 gained support for a few new SoCs, a few fixes & better
     gpio-keys support.

   - Ci20 now exposes its SPI bus using the spi-gpio driver.

   - The generic platform can now auto-detect a suitable value for
     PHYS_OFFSET based upon the memory map described by the device tree,
     allowing us to avoid wasting memory on page book-keeping for
     systems where RAM starts at a non-zero physical address.

   - Ingenic systems using the jz4740 platform code now link their
     vmlinuz higher to allow for kernels of a realistic size.

   - Loongson32 now builds the kernel targeting MIPSr1 rather than
     MIPSr2 to avoid CPU errata.

   - Loongson64 gains a couple of fixes, a workaround for a write
     buffering issue & support for the Loongson 3A R3.1 CPU.

   - Malta now uses the piix4-poweroff driver to handle powering down.

   - Microsemi Ocelot gained support for its SPI bus & NOR flash, its
     second MDIO bus and can now be supported by a FIT/.itb image.

   - Octeon saw a bunch of header cleanups which remove a lot of
     duplicate or unused code"

* tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (123 commits)
  MIPS: Remove remnants of UASM_ISA
  MIPS: netlogic: xlr: Remove erroneous check in nlm_fmn_send()
  MIPS: VDSO: Force link endianness
  MIPS: Always specify -EB or -EL when using clang
  MIPS: Use dins to simplify __write_64bit_c0_split()
  MIPS: Use read-write output operand in __write_64bit_c0_split()
  MIPS: Avoid using array as parameter to write_c0_kpgd()
  MIPS: vdso: Allow clang's --target flag in VDSO cflags
  MIPS: genvdso: Remove GOT checks
  MIPS: Remove obsolete MIPS checks for DST node "chosen@0"
  MIPS: generic: Remove input symbols from defconfig
  MIPS: Delete unused code in linux32.c
  MIPS: Remove unused sys_32_mmap2
  MIPS: Remove nabi_no_regargs
  mips: dts: mscc: enable spi and NOR flash support on ocelot PCB123
  mips: dts: mscc: Add spi on Ocelot
  MIPS: Loongson: Merge load addresses
  MIPS: Loongson: Set Loongson32 to MIPS32R1
  MIPS: mscc: ocelot: add interrupt controller properties to GPIO controller
  MIPS: generic: Select MIPS_AUTO_PFN_OFFSET
  ...
2018-08-13 19:24:32 -07:00
Trond Myklebust
62421cd943 NFSv4: Fix a typo in nfs4_init_channel_attrs()
The back channel size is allowed to be 1 or greater.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-13 17:23:37 -04:00
Trond Myklebust
8aafd2fde3 NFSv4: Don't busy wait if NFSv4 session draining is interrupted
Catch the ERESTARTSYS error so that it can be processed by the callers.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-13 17:23:37 -04:00
Olga Kornievskaia
e4648aa4f9 NFS recover from destination server reboot for copies
Mark the destination state to indicate a server-side copy is
happening. On detecting a reboot and recovering open state check
if any state is engaged in a server-side copy, if so, find the
copy and mark it and then signal the waiting thread. Upon wakeup,
if copy was marked then propage EAGAIN to the nfsd_copy_file_range
and restart the copy from scratch.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-13 17:04:23 -04:00
Linus Torvalds
1e45e9a95e Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timers departement more or less proudly presents:

   - More Y2038 timekeeping work mostly in the core code. The work is
     slowly, but steadily targeting the actuall syscalls.

   - Enhanced timekeeping suspend/resume support by utilizing
     clocksources which do not stop during suspend, but are otherwise
     not the main timekeeping clocksources.

   - Make NTP adjustmets more accurate and immediate when the frequency
     is set directly and not incrementally.

   - Sanitize the overrung handing of posix timers

   - A new timer driver for Mediatek SoCs

   - The usual pile of fixes and updates all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  clockevents: Warn if cpu_all_mask is used as cpumask
  tick/broadcast-hrtimer: Use cpu_possible_mask for ce_broadcast_hrtimer
  clocksource/drivers/arm_arch_timer: Fix bogus cpu_all_mask usage
  clocksource: ti-32k: Remove CLOCK_SOURCE_SUSPEND_NONSTOP flag
  timers: Clear timer_base::must_forward_clk with timer_base::lock held
  clocksource/drivers/sprd: Register one always-on timer to compensate suspend time
  clocksource/drivers/timer-mediatek: Add support for system timer
  clocksource/drivers/timer-mediatek: Convert the driver to timer-of
  clocksource/drivers/timer-mediatek: Use specific prefix for GPT
  clocksource/drivers/timer-mediatek: Rename mtk_timer to timer-mediatek
  clocksource/drivers/timer-mediatek: Add system timer bindings
  clocksource/drivers: Set clockevent device cpumask to cpu_possible_mask
  time: Introduce one suspend clocksource to compensate the suspend time
  time: Fix extra sleeptime injection when suspend fails
  timekeeping/ntp: Constify some function arguments
  ntp: Use kstrtos64 for s64 variable
  ntp: Remove redundant arguments
  timer: Fix coding style
  ktime: Provide typesafe ktime_to_ns()
  hrtimer: Improve kernel message printing
  ...
2018-08-13 13:02:31 -07:00
Linus Torvalds
de5d1b39ea Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking/atomics update from Thomas Gleixner:
 "The locking, atomics and memory model brains delivered:

   - A larger update to the atomics code which reworks the ordering
     barriers, consolidates the atomic primitives, provides the new
     atomic64_fetch_add_unless() primitive and cleans up the include
     hell.

   - Simplify cmpxchg() instrumentation and add instrumentation for
     xchg() and cmpxchg_double().

   - Updates to the memory model and documentation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  locking/atomics: Rework ordering barriers
  locking/atomics: Instrument cmpxchg_double*()
  locking/atomics: Instrument xchg()
  locking/atomics: Simplify cmpxchg() instrumentation
  locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
  tools/memory-model: Rename litmus tests to comply to norm7
  tools/memory-model/Documentation: Fix typo, smb->smp
  sched/Documentation: Update wake_up() & co. memory-barrier guarantees
  locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
  sched/core: Use smp_mb() in wake_woken_function()
  tools/memory-model: Add informal LKMM documentation to MAINTAINERS
  locking/atomics/Documentation: Describe atomic_set() as a write operation
  tools/memory-model: Make scripts executable
  tools/memory-model: Remove ACCESS_ONCE() from model
  tools/memory-model: Remove ACCESS_ONCE() from recipes
  locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
  MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
  tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
  tools/memory-model: Add litmus test for full multicopy atomicity
  locking/refcount: Always allow checked forms
  ...
2018-08-13 12:23:39 -07:00
Chao Yu
d494500a70 f2fs: support fault_type mount option
Previously, once fault injection is on, by default, all kind of faults
will be injected to f2fs, if we want to trigger single or specified
combined type during the test, we need to configure sysfs entry, it will
be a little inconvenient to integrate sysfs configuring into testsuit,
such as xfstest.

So this patch introduces a new mount option 'fault_type' to assist old
option 'fault_injection', with these two mount options, we can specify
any fault rate/type at mount-time.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
3f16ecd950 f2fs: fix to return success when trimming meta area
generic/251
    --- tests/generic/251.out	2016-05-03 20:20:11.381899000 +0800
     QA output created by 251
     Running the test: done.
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    ...
Ran: generic/251
Failures: generic/251

The reason is coverage of fstrim locates in meta area, previously we
just return -EINVAL for such case, making generic/251 failed, to fix
this problem, let's relieve restriction to return success with no
block discarded.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
6b9cb1242c f2fs: fix use-after-free of dicard command entry
As Dan Carpenter reported:

The patch 20ee438232: "f2fs: issue small discard by LBA order" from
Jul 8, 2018, leads to the following Smatch warning:

	fs/f2fs/segment.c:1277 __issue_discard_cmd_orderly()
	warn: 'dc' was already freed.

See also:
fs/f2fs/segment.c:2550 __issue_discard_cmd_range() warn: 'dc' was already freed.

In order to fix this issue, let's get error from __submit_discard_cmd(),
and release current discard command after we referenced next one.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
b83dcfe671 f2fs: support discard submission error injection
This patch adds to support discard submission error injection for testing
error handling of __submit_discard_cmd().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
35ec7d5748 f2fs: split discard command in prior to block layer
Some devices has small max_{hw,}discard_sectors, so that in
__blkdev_issue_discard(), one big size discard bio can be split
into multiple small size discard bios, result in heavy load in IO
scheduler and device, which can hang other sync IO for long time.

Now, f2fs is trying to control discard commands more elaboratively,
in order to make less conflict in between discard IO and user IO
to enhance application's performance, so in this patch, we will
split discard bio in f2fs in prior to in block layer to reduce
issuing multiple discard bios in a short time.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Sheng Yong
a690efffd1 f2fs: wake up gc thread immediately when gc_urgent is set
Fixes: 5b0e95398e ("f2fs: introduce sbi->gc_mode to determine the policy")
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
6eae269461 f2fs: fix incorrect range->len in f2fs_trim_fs()
generic/260 reported below error:

     [+] Default length with start set (should succeed)
     [+] Length beyond the end of fs (should succeed)
     [+] Length beyond the end of fs with start set (should succeed)
    +./tests/generic/260: line 94: [: 18446744073709551615: integer expression expected
    +./tests/generic/260: line 104: [: 18446744073709551615: integer expression expected
     Test done
    ...

In f2fs_trim_fs(), if there is no discard being trimmed, we need to correct
range->len before return.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
2296915808 f2fs: refresh recent accessed nat entry in lru list
Introduce nat_list_lock to protect nm_i->nat_entries list, and manage
it as a LRU list, refresh location for therein recent accessed entries
in the list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
a33c150237 f2fs: fix avoid race between truncate and background GC
Thread A				Background GC
- f2fs_setattr isize to 0
 - truncate_setsize
					- gc_data_segment
					 - f2fs_get_read_data_page page #0
					  - set_page_dirty
					  - set_cold_data
 - f2fs_truncate

- f2fs_setattr isize to 4k
- read 4k <--- hit data in cached page #0

Above race condition can cause read out invalid data in a truncated
page, fix it by i_gc_rwsem[WRITE] lock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
c7079853c8 f2fs: avoid race between zero_range and background GC
Thread A				Background GC
- f2fs_zero_range
 - truncate_pagecache_range
					- gc_data_segment
					 - get_read_data_page
					  - move_data_page
					   - set_page_dirty
					   - set_cold_data
 - f2fs_do_zero_range
  - dn->data_blkaddr = NEW_ADDR;
  - f2fs_set_data_blkaddr

Actually, we don't need to set dirty & checked flag on the page, since
all valid data in the page should be zeroed by zero_range().

Use i_gc_rwsem[WRITE] to avoid such race condition.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
91291e9998 f2fs: fix to do sanity check with block address in main area v2
This patch adds f2fs_is_valid_blkaddr() in below functions to do sanity
check with block address to avoid pentential panic:
- f2fs_grab_read_bio()
- __written_first_block()

https://bugzilla.kernel.org/show_bug.cgi?id=200465

- Reproduce

- POC (poc.c)
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/mount.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    #include <dirent.h>
    #include <errno.h>
    #include <error.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #include <linux/falloc.h>
    #include <linux/loop.h>

    static void activity(char *mpoint) {

      char *xattr;
      int err;

      err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

      char buf2[113];
      memset(buf2, 0, sizeof(buf2));
      listxattr(xattr, buf2, sizeof(buf2));

    }

    int main(int argc, char *argv[]) {
      activity(argv[1]);
      return 0;
    }

- kernel message
[  844.718738] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  846.430929] F2FS-fs (loop0): access invalid blkaddr:1024
[  846.431058] WARNING: CPU: 1 PID: 1249 at fs/f2fs/checkpoint.c:154 f2fs_is_valid_blkaddr+0x10f/0x160
[  846.431059] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.431310] CPU: 1 PID: 1249 Comm: a.out Not tainted 4.18.0-rc3+ #1
[  846.431312] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.431315] RIP: 0010:f2fs_is_valid_blkaddr+0x10f/0x160
[  846.431316] Code: 00 eb ed 31 c0 83 fa 05 75 ae 48 83 ec 08 48 8b 3f 89 f1 48 c7 c2 fc 0b 0f 8b 48 c7 c6 8b d7 09 8b 88 44 24 07 e8 61 8b ff ff <0f> 0b 0f b6 44 24 07 48 83 c4 08 eb 81 4c 8b 47 10 8b 8f 38 04 00
[  846.431347] RSP: 0018:ffff961c414a7bc0 EFLAGS: 00010282
[  846.431349] RAX: 0000000000000000 RBX: ffffc5f787b8ea80 RCX: 0000000000000000
[  846.431350] RDX: 0000000000000000 RSI: ffff89dfffd165d8 RDI: ffff89dfffd165d8
[  846.431351] RBP: ffff961c414a7c20 R08: 0000000000000001 R09: 0000000000000248
[  846.431353] R10: 0000000000000000 R11: 0000000000000248 R12: 0000000000000007
[  846.431369] R13: ffff89dff5492800 R14: ffff89dfae3aa000 R15: ffff89dff4ff88d0
[  846.431372] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.431373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.431374] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.431384] Call Trace:
[  846.431426]  f2fs_iget+0x6f4/0xe70
[  846.431430]  ? f2fs_find_entry+0x71/0x90
[  846.431432]  f2fs_lookup+0x1aa/0x390
[  846.431452]  __lookup_slow+0x97/0x150
[  846.431459]  lookup_slow+0x35/0x50
[  846.431462]  walk_component+0x1c6/0x470
[  846.431479]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.431488]  ? page_add_file_rmap+0x13/0x200
[  846.431491]  path_lookupat+0x76/0x230
[  846.431501]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.431504]  filename_lookup+0xb8/0x1a0
[  846.431534]  ? _cond_resched+0x16/0x40
[  846.431541]  ? kmem_cache_alloc+0x160/0x1d0
[  846.431549]  ? path_listxattr+0x41/0xa0
[  846.431551]  path_listxattr+0x41/0xa0
[  846.431570]  do_syscall_64+0x55/0x100
[  846.431583]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.431607] RIP: 0033:0x7f882de1c0d7
[  846.431607] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.431639] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.431641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.431642] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.431643] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.431645] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.431646] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.431648] ---[ end trace abca54df39d14f5c ]---
[  846.431651] F2FS-fs (loop0): invalid blkaddr: 1024, type: 5, run fsck to fix.
[  846.431762] WARNING: CPU: 1 PID: 1249 at fs/f2fs/f2fs.h:2697 f2fs_iget+0xd17/0xe70
[  846.431763] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.431797] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.431798] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.431800] RIP: 0010:f2fs_iget+0xd17/0xe70
[  846.431801] Code: ff ff 48 63 d8 e9 e1 f6 ff ff 48 8b 45 c8 41 b8 05 00 00 00 48 c7 c2 d8 e8 0e 8b 48 c7 c6 1d b0 0a 8b 48 8b 38 e8 f9 b4 00 00 <0f> 0b 48 8b 45 c8 f0 80 48 48 04 e9 d8 f9 ff ff 0f 0b 48 8b 43 18
[  846.431832] RSP: 0018:ffff961c414a7bd0 EFLAGS: 00010282
[  846.431834] RAX: 0000000000000000 RBX: ffffc5f787b8ea80 RCX: 0000000000000006
[  846.431835] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff89dfffd165d0
[  846.431836] RBP: ffff961c414a7c20 R08: 0000000000000000 R09: 0000000000000273
[  846.431837] R10: 0000000000000000 R11: ffff89dfad50ca60 R12: 0000000000000007
[  846.431838] R13: ffff89dff5492800 R14: ffff89dfae3aa000 R15: ffff89dff4ff88d0
[  846.431840] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.431841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.431842] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.431846] Call Trace:
[  846.431850]  ? f2fs_find_entry+0x71/0x90
[  846.431853]  f2fs_lookup+0x1aa/0x390
[  846.431856]  __lookup_slow+0x97/0x150
[  846.431858]  lookup_slow+0x35/0x50
[  846.431874]  walk_component+0x1c6/0x470
[  846.431878]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.431880]  ? page_add_file_rmap+0x13/0x200
[  846.431882]  path_lookupat+0x76/0x230
[  846.431884]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.431886]  filename_lookup+0xb8/0x1a0
[  846.431890]  ? _cond_resched+0x16/0x40
[  846.431891]  ? kmem_cache_alloc+0x160/0x1d0
[  846.431894]  ? path_listxattr+0x41/0xa0
[  846.431896]  path_listxattr+0x41/0xa0
[  846.431898]  do_syscall_64+0x55/0x100
[  846.431901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.431902] RIP: 0033:0x7f882de1c0d7
[  846.431903] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.431934] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.431936] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.431937] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.431939] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.431940] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.431941] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.431943] ---[ end trace abca54df39d14f5d ]---
[  846.432033] F2FS-fs (loop0): access invalid blkaddr:1024
[  846.432051] WARNING: CPU: 1 PID: 1249 at fs/f2fs/checkpoint.c:154 f2fs_is_valid_blkaddr+0x10f/0x160
[  846.432051] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.432085] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.432086] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.432089] RIP: 0010:f2fs_is_valid_blkaddr+0x10f/0x160
[  846.432089] Code: 00 eb ed 31 c0 83 fa 05 75 ae 48 83 ec 08 48 8b 3f 89 f1 48 c7 c2 fc 0b 0f 8b 48 c7 c6 8b d7 09 8b 88 44 24 07 e8 61 8b ff ff <0f> 0b 0f b6 44 24 07 48 83 c4 08 eb 81 4c 8b 47 10 8b 8f 38 04 00
[  846.432120] RSP: 0018:ffff961c414a7900 EFLAGS: 00010286
[  846.432122] RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000006
[  846.432123] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff89dfffd165d0
[  846.432124] RBP: ffff89dff5492800 R08: 0000000000000001 R09: 000000000000029d
[  846.432125] R10: ffff961c414a7820 R11: 000000000000029d R12: 0000000000000400
[  846.432126] R13: 0000000000000000 R14: ffff89dff4ff88d0 R15: 0000000000000000
[  846.432128] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.432130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.432131] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.432135] Call Trace:
[  846.432151]  f2fs_wait_on_block_writeback+0x20/0x110
[  846.432158]  f2fs_grab_read_bio+0xbc/0xe0
[  846.432161]  f2fs_submit_page_read+0x21/0x280
[  846.432163]  f2fs_get_read_data_page+0xb7/0x3c0
[  846.432165]  f2fs_get_lock_data_page+0x29/0x1e0
[  846.432167]  f2fs_get_new_data_page+0x148/0x550
[  846.432170]  f2fs_add_regular_entry+0x1d2/0x550
[  846.432178]  ? __switch_to+0x12f/0x460
[  846.432181]  f2fs_add_dentry+0x6a/0xd0
[  846.432184]  f2fs_do_add_link+0xe9/0x140
[  846.432186]  __recover_dot_dentries+0x260/0x280
[  846.432189]  f2fs_lookup+0x343/0x390
[  846.432193]  __lookup_slow+0x97/0x150
[  846.432195]  lookup_slow+0x35/0x50
[  846.432208]  walk_component+0x1c6/0x470
[  846.432212]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.432215]  ? page_add_file_rmap+0x13/0x200
[  846.432217]  path_lookupat+0x76/0x230
[  846.432219]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.432221]  filename_lookup+0xb8/0x1a0
[  846.432224]  ? _cond_resched+0x16/0x40
[  846.432226]  ? kmem_cache_alloc+0x160/0x1d0
[  846.432228]  ? path_listxattr+0x41/0xa0
[  846.432230]  path_listxattr+0x41/0xa0
[  846.432233]  do_syscall_64+0x55/0x100
[  846.432235]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.432237] RIP: 0033:0x7f882de1c0d7
[  846.432237] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.432269] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.432271] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.432272] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.432273] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.432274] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.432275] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.432277] ---[ end trace abca54df39d14f5e ]---
[  846.432279] F2FS-fs (loop0): invalid blkaddr: 1024, type: 5, run fsck to fix.
[  846.432376] WARNING: CPU: 1 PID: 1249 at fs/f2fs/f2fs.h:2697 f2fs_wait_on_block_writeback+0xb1/0x110
[  846.432376] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.432410] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.432411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.432413] RIP: 0010:f2fs_wait_on_block_writeback+0xb1/0x110
[  846.432414] Code: 66 90 f0 ff 4b 34 74 59 5b 5d c3 48 8b 7d 00 41 b8 05 00 00 00 89 d9 48 c7 c2 d8 e8 0e 8b 48 c7 c6 1d b0 0a 8b e8 df bc fd ff <0f> 0b f0 80 4d 48 04 e9 67 ff ff ff 48 8b 03 48 c1 e8 37 83 e0 07
[  846.432445] RSP: 0018:ffff961c414a7910 EFLAGS: 00010286
[  846.432447] RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000006
[  846.432448] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff89dfffd165d0
[  846.432449] RBP: ffff89dff5492800 R08: 0000000000000000 R09: 00000000000002d1
[  846.432450] R10: ffff961c414a7820 R11: ffff89dfad50cf80 R12: 0000000000000400
[  846.432451] R13: 0000000000000000 R14: ffff89dff4ff88d0 R15: 0000000000000000
[  846.432453] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.432454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.432455] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.432459] Call Trace:
[  846.432463]  f2fs_grab_read_bio+0xbc/0xe0
[  846.432464]  f2fs_submit_page_read+0x21/0x280
[  846.432466]  f2fs_get_read_data_page+0xb7/0x3c0
[  846.432468]  f2fs_get_lock_data_page+0x29/0x1e0
[  846.432470]  f2fs_get_new_data_page+0x148/0x550
[  846.432473]  f2fs_add_regular_entry+0x1d2/0x550
[  846.432475]  ? __switch_to+0x12f/0x460
[  846.432477]  f2fs_add_dentry+0x6a/0xd0
[  846.432480]  f2fs_do_add_link+0xe9/0x140
[  846.432483]  __recover_dot_dentries+0x260/0x280
[  846.432485]  f2fs_lookup+0x343/0x390
[  846.432488]  __lookup_slow+0x97/0x150
[  846.432490]  lookup_slow+0x35/0x50
[  846.432505]  walk_component+0x1c6/0x470
[  846.432509]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.432511]  ? page_add_file_rmap+0x13/0x200
[  846.432513]  path_lookupat+0x76/0x230
[  846.432515]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.432517]  filename_lookup+0xb8/0x1a0
[  846.432520]  ? _cond_resched+0x16/0x40
[  846.432522]  ? kmem_cache_alloc+0x160/0x1d0
[  846.432525]  ? path_listxattr+0x41/0xa0
[  846.432526]  path_listxattr+0x41/0xa0
[  846.432529]  do_syscall_64+0x55/0x100
[  846.432531]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.432533] RIP: 0033:0x7f882de1c0d7
[  846.432533] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.432565] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.432567] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.432568] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.432569] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.432570] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.432571] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.432573] ---[ end trace abca54df39d14f5f ]---
[  846.434280] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  846.434424] PGD 80000001ebd3a067 P4D 80000001ebd3a067 PUD 1eb1ae067 PMD 0
[  846.434551] Oops: 0000 [#1] SMP PTI
[  846.434697] CPU: 0 PID: 44 Comm: kworker/u5:0 Tainted: G        W         4.18.0-rc3+ #1
[  846.434805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.435000] Workqueue: fscrypt_read_queue decrypt_work
[  846.435174] RIP: 0010:fscrypt_do_page_crypto+0x6e/0x2d0
[  846.435351] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 84 24 88 00 00 00 31 c0 e8 43 c2 e0 ff 49 8b 86 48 02 00 00 85 ed c7 44 24 70 00 00 00 00 <48> 8b 58 08 0f 84 14 02 00 00 48 8b 78 10 48 8b 0c 24 48 c7 84 24
[  846.435696] RSP: 0018:ffff961c40f9bd60 EFLAGS: 00010206
[  846.435870] RAX: 0000000000000000 RBX: ffffc5f787719b80 RCX: ffffc5f787719b80
[  846.436051] RDX: ffffffff8b9f4b88 RSI: ffffffff8b0ae622 RDI: ffff961c40f9bdb8
[  846.436261] RBP: 0000000000001000 R08: ffffc5f787719b80 R09: 0000000000001000
[  846.436433] R10: 0000000000000018 R11: fefefefefefefeff R12: ffffc5f787719b80
[  846.436562] R13: ffffc5f787719b80 R14: ffff89dff4ff88d0 R15: 0ffff89dfaddee60
[  846.436658] FS:  0000000000000000(0000) GS:ffff89dfffc00000(0000) knlGS:0000000000000000
[  846.436758] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.436898] CR2: 0000000000000008 CR3: 00000001eddd0000 CR4: 00000000000006f0
[  846.437001] Call Trace:
[  846.437181]  ? check_preempt_wakeup+0xf2/0x230
[  846.437276]  ? check_preempt_curr+0x7c/0x90
[  846.437370]  fscrypt_decrypt_page+0x48/0x4d
[  846.437466]  __fscrypt_decrypt_bio+0x5b/0x90
[  846.437542]  decrypt_work+0x12/0x20
[  846.437651]  process_one_work+0x15e/0x3d0
[  846.437740]  worker_thread+0x4c/0x440
[  846.437848]  kthread+0xf8/0x130
[  846.437938]  ? rescuer_thread+0x350/0x350
[  846.438022]  ? kthread_associate_blkcg+0x90/0x90
[  846.438117]  ret_from_fork+0x35/0x40
[  846.438201] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.438653] CR2: 0000000000000008
[  846.438713] ---[ end trace abca54df39d14f60 ]---
[  846.438796] RIP: 0010:fscrypt_do_page_crypto+0x6e/0x2d0
[  846.438844] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 84 24 88 00 00 00 31 c0 e8 43 c2 e0 ff 49 8b 86 48 02 00 00 85 ed c7 44 24 70 00 00 00 00 <48> 8b 58 08 0f 84 14 02 00 00 48 8b 78 10 48 8b 0c 24 48 c7 84 24
[  846.439084] RSP: 0018:ffff961c40f9bd60 EFLAGS: 00010206
[  846.439176] RAX: 0000000000000000 RBX: ffffc5f787719b80 RCX: ffffc5f787719b80
[  846.440927] RDX: ffffffff8b9f4b88 RSI: ffffffff8b0ae622 RDI: ffff961c40f9bdb8
[  846.442083] RBP: 0000000000001000 R08: ffffc5f787719b80 R09: 0000000000001000
[  846.443284] R10: 0000000000000018 R11: fefefefefefefeff R12: ffffc5f787719b80
[  846.444448] R13: ffffc5f787719b80 R14: ffff89dff4ff88d0 R15: 0ffff89dfaddee60
[  846.445558] FS:  0000000000000000(0000) GS:ffff89dfffc00000(0000) knlGS:0000000000000000
[  846.446687] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.447796] CR2: 0000000000000008 CR3: 00000001eddd0000 CR4: 00000000000006f0

- Location
https://elixir.bootlin.com/linux/v4.18-rc4/source/fs/crypto/crypto.c#L149
	struct crypto_skcipher *tfm = ci->ci_ctfm;
Here ci can be NULL

Note that this issue maybe require CONFIG_F2FS_FS_ENCRYPTION=y to reproduce.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
bcbfbd604d f2fs: fix to do sanity check with inline flags
https://bugzilla.kernel.org/show_bug.cgi?id=200221

- Overview
BUG() in clear_inode() when mounting and un-mounting a corrupted f2fs image

- Reproduce

- Kernel message
[  538.601448] F2FS-fs (loop0): Invalid segment/section count (31, 24 x 1376257)
[  538.601458] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[  538.724091] F2FS-fs (loop0): Try to recover 2th superblock, ret: 0
[  538.724102] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  540.970834] ------------[ cut here ]------------
[  540.970838] kernel BUG at fs/inode.c:512!
[  540.971750] invalid opcode: 0000 [#1] SMP KASAN PTI
[  540.972755] CPU: 1 PID: 1305 Comm: umount Not tainted 4.18.0-rc1+ #4
[  540.974034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  540.982913] RIP: 0010:clear_inode+0xc0/0xd0
[  540.983774] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[  540.987570] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[  540.988636] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[  540.990063] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[  540.991499] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[  540.992923] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[  540.994360] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[  540.995786] FS:  00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  540.997403] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  540.998571] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[  541.000015] Call Trace:
[  541.000554]  f2fs_evict_inode+0x253/0x630
[  541.001381]  evict+0x16f/0x290
[  541.002015]  iput+0x280/0x300
[  541.002654]  dentry_unlink_inode+0x165/0x1e0
[  541.003528]  __dentry_kill+0x16a/0x260
[  541.004300]  dentry_kill+0x70/0x250
[  541.005018]  dput+0x154/0x1d0
[  541.005635]  do_one_tree+0x34/0x40
[  541.006354]  shrink_dcache_for_umount+0x3f/0xa0
[  541.007285]  generic_shutdown_super+0x43/0x1c0
[  541.008192]  kill_block_super+0x52/0x80
[  541.008978]  kill_f2fs_super+0x62/0x70
[  541.009750]  deactivate_locked_super+0x6f/0xa0
[  541.010664]  deactivate_super+0x5e/0x80
[  541.011450]  cleanup_mnt+0x61/0xa0
[  541.012151]  __cleanup_mnt+0x12/0x20
[  541.012893]  task_work_run+0xc8/0xf0
[  541.013635]  exit_to_usermode_loop+0x125/0x130
[  541.014555]  do_syscall_64+0x138/0x170
[  541.015340]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  541.016375] RIP: 0033:0x7f46624bf487
[  541.017104] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[  541.020923] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  541.022452] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[  541.023885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[  541.025318] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[  541.026755] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[  541.028186] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30
[  541.029626] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  541.039445] ---[ end trace 4ce02f25ff7d3df5 ]---
[  541.040392] RIP: 0010:clear_inode+0xc0/0xd0
[  541.041240] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[  541.045042] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[  541.046099] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[  541.047537] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[  541.048965] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[  541.050402] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[  541.051832] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[  541.053263] FS:  00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  541.054891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  541.056039] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[  541.058506] ==================================================================
[  541.059991] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[  541.061513] Read of size 8 at addr ffff8801e34a7970 by task umount/1305

[  541.063302] CPU: 1 PID: 1305 Comm: umount Tainted: G      D           4.18.0-rc1+ #4
[  541.064838] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  541.066778] Call Trace:
[  541.067294]  dump_stack+0x7b/0xb5
[  541.067986]  print_address_description+0x70/0x290
[  541.068941]  kasan_report+0x291/0x390
[  541.069692]  ? update_stack_state+0x38c/0x3e0
[  541.070598]  __asan_load8+0x54/0x90
[  541.071315]  update_stack_state+0x38c/0x3e0
[  541.072172]  ? __read_once_size_nocheck.constprop.7+0x20/0x20
[  541.073340]  ? vprintk_func+0x27/0x60
[  541.074096]  ? printk+0xa3/0xd3
[  541.074762]  ? __save_stack_trace+0x5e/0x100
[  541.075634]  unwind_next_frame.part.5+0x18e/0x490
[  541.076594]  ? unwind_dump+0x290/0x290
[  541.077368]  ? __show_regs+0x2c4/0x330
[  541.078142]  __unwind_start+0x106/0x190
[  541.085422]  __save_stack_trace+0x5e/0x100
[  541.086268]  ? __save_stack_trace+0x5e/0x100
[  541.087161]  ? unlink_anon_vmas+0xba/0x2c0
[  541.087997]  save_stack_trace+0x1f/0x30
[  541.088782]  save_stack+0x46/0xd0
[  541.089475]  ? __alloc_pages_slowpath+0x1420/0x1420
[  541.090477]  ? flush_tlb_mm_range+0x15e/0x220
[  541.091364]  ? __dec_node_state+0x24/0xb0
[  541.092180]  ? lock_page_memcg+0x85/0xf0
[  541.092979]  ? unlock_page_memcg+0x16/0x80
[  541.093812]  ? page_remove_rmap+0x198/0x520
[  541.094674]  ? mark_page_accessed+0x133/0x200
[  541.095559]  ? _cond_resched+0x1a/0x50
[  541.096326]  ? unmap_page_range+0xcd4/0xe50
[  541.097179]  ? rb_next+0x58/0x80
[  541.097845]  ? rb_next+0x58/0x80
[  541.098518]  __kasan_slab_free+0x13c/0x1a0
[  541.099352]  ? unlink_anon_vmas+0xba/0x2c0
[  541.100184]  kasan_slab_free+0xe/0x10
[  541.100934]  kmem_cache_free+0x89/0x1e0
[  541.101724]  unlink_anon_vmas+0xba/0x2c0
[  541.102534]  free_pgtables+0x101/0x1b0
[  541.103299]  exit_mmap+0x146/0x2a0
[  541.103996]  ? __ia32_sys_munmap+0x50/0x50
[  541.104829]  ? kasan_check_read+0x11/0x20
[  541.105649]  ? mm_update_next_owner+0x322/0x380
[  541.106578]  mmput+0x8b/0x1d0
[  541.107191]  do_exit+0x43a/0x1390
[  541.107876]  ? mm_update_next_owner+0x380/0x380
[  541.108791]  ? deactivate_super+0x5e/0x80
[  541.109610]  ? cleanup_mnt+0x61/0xa0
[  541.110351]  ? __cleanup_mnt+0x12/0x20
[  541.111115]  ? task_work_run+0xc8/0xf0
[  541.111879]  ? exit_to_usermode_loop+0x125/0x130
[  541.112817]  rewind_stack_do_exit+0x17/0x20
[  541.113666] RIP: 0033:0x7f46624bf487
[  541.114404] Code: Bad RIP value.
[  541.115094] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  541.116605] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[  541.118034] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[  541.119472] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[  541.120890] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[  541.122321] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30

[  541.124061] The buggy address belongs to the page:
[  541.125042] page:ffffea00078d29c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  541.126651] flags: 0x2ffff0000000000()
[  541.127418] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[  541.128963] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  541.130516] page dumped because: kasan: bad access detected

[  541.131954] Memory state around the buggy address:
[  541.132924]  ffff8801e34a7800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[  541.134378]  ffff8801e34a7880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  541.135814] >ffff8801e34a7900: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
[  541.137253]                                                              ^
[  541.138637]  ffff8801e34a7980: f1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  541.140075]  ffff8801e34a7a00: 00 00 00 00 00 00 00 00 f3 00 00 00 00 00 00 00
[  541.141509] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/inode.c#L512
	BUG_ON(inode->i_data.nrpages);

The root cause is root directory inode is corrupted, it has both
inline_data and inline_dentry flag, and its nlink is zero, so in
->evict(), after dropping all page cache, it grabs page #0 for inline
data truncation, result in panic in later clear_inode() where we will
check inode->i_data.nrpages value.

This patch adds inline flags check in sanity_check_inode, in addition,
do sanity check with root inode's nlink.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Linus Torvalds
400439275d Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Thomas Gleixner:
 "The EFI pile:

   - Make mixed mode UEFI runtime service invocations mutually
     exclusive, as mandated by the UEFI spec

   - Perform UEFI runtime services calls from a work queue so the calls
     into the firmware occur from a kernel thread

   - Honor the UEFI memory map attributes for live memory regions
     configured by UEFI as a framebuffer. This works around a coherency
     problem with KVM guests running on ARM.

   - Cleanups, improvements and fixes all over the place"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efivars: Call guid_parse() against guid_t type of variable
  efi/cper: Use consistent types for UUIDs
  efi/x86: Replace references to efi_early->is64 with efi_is_64bit()
  efi: Deduplicate efi_open_volume()
  efi/x86: Add missing NULL initialization in UGA draw protocol discovery
  efi/x86: Merge 32-bit and 64-bit UGA draw protocol setup routines
  efi/x86: Align efi_uga_draw_protocol typedef names to convention
  efi/x86: Merge the setup_efi_pci32() and setup_efi_pci64() routines
  efi/x86: Prevent reentrant firmware calls in mixed mode
  efi/esrt: Only call efi_mem_reserve() for boot services memory
  fbdev/efifb: Honour UEFI memory map attributes when mapping the FB
  efi: Drop type and attribute checks in efi_mem_desc_lookup()
  efi/libstub/arm: Add opt-in Kconfig option for the DTB loader
  efi: Remove the declaration of efi_late_init() as the function is unused
  efi/cper: Avoid using get_seconds()
  efi: Use a work queue to invoke EFI Runtime Services
  efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
  efi/x86: Clean up the eboot code
2018-08-13 10:25:08 -07:00
Yan, Zheng
0fcf6c02b2 ceph: don't drop message if it contains more data than expected
Later version mds may encode more data into messages.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Yan, Zheng
342ce1823e ceph: support cephfs' own feature bits
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Chengguang Xu
e5bc08d09f ceph: refactor error handling code in ceph_reserve_caps()
Call new helper __ceph_unreserve_caps() to reduce duplicated code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Chengguang Xu
7bf8f736c8 ceph: refactor ceph_unreserve_caps()
The code of ceph_unreserve_caps() and error handling in
ceph_reserve_caps() are duplicated, so introduce a helper
__ceph_unreserve_caps() to reduce duplicated code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu
d554849290 ceph: change to void return type for __do_request()
We do not check return code for __do_request() in all callers,
so change to void return type.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu
9da12e3a7d ceph: compare fsc->max_file_size and inode->i_size for max file size limit
In ceph_llseek(), we compare fsc->max_file_size and inode->i_size to
choose max file size limit.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu
36a4c72d1c ceph: add additional size check in ceph_setattr()
ceph_setattr() finally calls vfs function inode_newsize_ok()
to do offset validation and that is based on sb->s_maxbytes.
Because we set sb->s_maxbytes to MAX_LFS_FILESIZE to through
VFS check and do proper offset validation in cephfs level,
we need adding proper offset validation before calling
inode_newsize_ok().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Darrick J. Wong
00d22a1c36 xfs: recalculate summary counters at mount time if icount is bad
Since the sb write verifier trips on bad icounts, we should also force a
mount time recalculation of the summary counters if the icount is bad.
This helps us avoid blowing up at freeze/unmount time when the bad
counter gets written back out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-13 07:58:27 -07:00
piaojun
3111784bee fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
In my testing, v9fs_fid_xattr_set will return successfully even if the
backend ext4 filesystem has no space to store xattr key-value. That will
cause inconsistent behavior between front end and back end. The reason is
that lsetxattr will be triggered by p9_client_clunk, and unfortunately we
did not catch the error. This patch will catch the error to notify upper
caller.

p9_client_clunk (in 9p)
  p9_client_rpc(clnt, P9_TCLUNK, "d", fid->fid);
    v9fs_clunk (in qemu)
      put_fid
        free_fid
          v9fs_xattr_fid_clunk
            v9fs_co_lsetxattr
              s->ops->lsetxattr
                ext4_xattr_user_set (in host ext4 filesystem)

Link: http://lkml.kernel.org/r/5B57EACC.2060900@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:35:16 +09:00
Colin Ian King
6baaac0961 fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
fix spelling mistake in pr_info message text

Link: http://lkml.kernel.org/r/20180526150650.10562-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Souptick Joarder
fe6340e2d1 fs/9p/vfs_file.c: use new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite handler.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702154928.GA3964@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Linus Torvalds
d6dd643159 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A bunch of race fixes, mostly around lazy pathwalk.

  All of it is -stable fodder, a large part going back to 2013"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that __dentry_kill() always invalidates d_seq, unhashed or not
  fix __legitimize_mnt()/mntput() race
  fix mntput/mntput race
  root dentries need RCU-delayed freeing
2018-08-12 11:21:17 -07:00
Shan Hai
01239d77b9 xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
Fuzzing tool reports a write to null pointer error in the
xfs_bmap_extents_to_btree, fix it by bailing out on encountering
a null pointer.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Eric Sandeen
fa6c668d80 xfs: remove b_last_holder & associated macros
The old lock tracking infrastructure in xfs using the b_last_holder
field seems to only be useful if you can get into the system with a
debugger; it seems that the existing tracepoints would be the way to
go these days, and this old infrastructure can be removed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Andreas Gruenbacher
10259de1d8 iomap: Switch to offset_in_page for clarity
Instead of open-coding pos & (PAGE_SIZE - 1) and pos & ~PAGE_MASK, use
the offset_in_page macro.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Dave Jiang
e25ff835af xfs: Close race between direct IO and xfs_break_layouts()
This patch is the duplicate of ross's fix for ext4 for xfs.

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
xfs_break_layouts() => ___wait_var_event(), the waiting function
xfs_wait_dax_page() will never be called.  This means that
xfs_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Boris Brezillon
da86748bf6 NAND core changes:
- Add the SPI-NAND framework.
 - Create a helper to find the best ECC configuration.
 - Create NAND controller operations.
 - Allocate dynamically ONFI parameters structure.
 - Add defines for ONFI version bits.
 - Add manufacturer fixup for ONFI parameter page.
 - Add an option to specify NAND chip as a boot device.
 - Add Reed-Solomon error correction algorithm.
 - Better name for the controller structure.
 - Remove unused caller_is_module() definition.
 - Make subop helpers return unsigned values.
 - Expose _notsupp() helpers for raw page accessors.
 - Add default values for dynamic timings.
 - Kill the chip->scan_bbt() hook.
 - Rename nand_default_bbt() into nand_create_bbt().
 - Start to clean the nand_chip structure.
 - Remove stale prototype from rawnand.h.
 
 Raw NAND controllers drivers changes:
 - Qcom: structuring cleanup.
 - Denali: use core helper to find the best ECC configuration.
 - Possible build of almost all drivers by adding a dependency on
   COMPILE_TEST for almost all of them in Kconfig, implies various
   fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
   changes in sparc64 and ia64 architectures.
 - Clean the ->probe() functions error path of a lot of drivers.
 - Migrate all drivers to use nand_scan() instead of
   nand_scan_ident()/nand_scan_tail() pair.
 - Use mtd_device_register() where applicable to simplify the code.
 - Marvell:
   * Handle on-die ECC.
   * Better clocks handling.
   * Remove bogus comment.
   * Add suspend and resume support.
 - Tegra: add NAND controller driver.
 - Atmel:
   * Add module param to avoid using dma.
   * Drop Wenyou Yang from MAINTAINERS.
 - Denali: optimize timings handling.
 - FSMC: Stop using chip->read_buf().
 - FSL:
   * Switch to SPDX license tag identifiers.
   * Fix qualifiers in MXC init functions.
 
 Raw NAND chip drivers changes:
 - Micron:
   * Add fixup for ONFI revision.
   * Update ecc_stats.corrected.
   * Make ECC activation stateful.
   * Avoid enabling/disabling ECC when it can't be disabled.
   * Get the actual number of bitflips.
   * Allow forced on-die ECC.
   * Support 8/512 on-die ECC.
   * Fix on-die ECC detection logic.
 - Hynix:
   * Fix decoding the OOB size on H27UCG8T2BTR.
   * Use ->exec_op() in hynix_nand_reg_write_op().
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJbYYGVAAoJECVq6hhHvVaE0poIAJy+VpZl0jTPQ/oO8TQui9hE
 IZbc8LwohCvegYYhiY1cNESyMYamDfoK6M93i/0zTJF2AJAxPl25ldT8N5Wr16DO
 5Vfsdjv75V8l0JEY2SvWYmC6glOAYs0UEDdcFNJRMPqUnQz+VvBIafJOCQqzo4ZH
 SDnLx3XzOxO4PAPnztWEg50WvaqMPt7ThcqoxThHMcQaLrNjgJUsV0mN+vNEv16Q
 6gH6hl1C019k+Kj2Zu0vAifHw1K7gIYT4HvqKwstQ6HYUX2IzIzuEpRIcIze0S5z
 XKzZ57USItb3l+Y3YwFBLjgP4N+VTT5X59LxdtCOXJ+YvzgxwtKElRvalNcryYI=
 =zkEf
 -----END PGP SIGNATURE-----

Merge tag 'nand/for-4.19' of git://git.infradead.org/linux-mtd into mtd/next

Pull NAND updates from Miquel Raynal:

"
 NAND core changes:
 - Add the SPI-NAND framework.
 - Create a helper to find the best ECC configuration.
 - Create NAND controller operations.
 - Allocate dynamically ONFI parameters structure.
 - Add defines for ONFI version bits.
 - Add manufacturer fixup for ONFI parameter page.
 - Add an option to specify NAND chip as a boot device.
 - Add Reed-Solomon error correction algorithm.
 - Better name for the controller structure.
 - Remove unused caller_is_module() definition.
 - Make subop helpers return unsigned values.
 - Expose _notsupp() helpers for raw page accessors.
 - Add default values for dynamic timings.
 - Kill the chip->scan_bbt() hook.
 - Rename nand_default_bbt() into nand_create_bbt().
 - Start to clean the nand_chip structure.
 - Remove stale prototype from rawnand.h.

 Raw NAND controllers drivers changes:
 - Qcom: structuring cleanup.
 - Denali: use core helper to find the best ECC configuration.
 - Possible build of almost all drivers by adding a dependency on
   COMPILE_TEST for almost all of them in Kconfig, implies various
   fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
   changes in sparc64 and ia64 architectures.
 - Clean the ->probe() functions error path of a lot of drivers.
 - Migrate all drivers to use nand_scan() instead of
   nand_scan_ident()/nand_scan_tail() pair.
 - Use mtd_device_register() where applicable to simplify the code.
 - Marvell:
   * Handle on-die ECC.
   * Better clocks handling.
   * Remove bogus comment.
   * Add suspend and resume support.
 - Tegra: add NAND controller driver.
 - Atmel:
   * Add module param to avoid using dma.
   * Drop Wenyou Yang from MAINTAINERS.
 - Denali: optimize timings handling.
 - FSMC: Stop using chip->read_buf().
 - FSL:
   * Switch to SPDX license tag identifiers.
   * Fix qualifiers in MXC init functions.

 Raw NAND chip drivers changes:
 - Micron:
   * Add fixup for ONFI revision.
   * Update ecc_stats.corrected.
   * Make ECC activation stateful.
   * Avoid enabling/disabling ECC when it can't be disabled.
   * Get the actual number of bitflips.
   * Allow forced on-die ECC.
   * Support 8/512 on-die ECC.
   * Fix on-die ECC detection logic.
 - Hynix:
   * Fix decoding the OOB size on H27UCG8T2BTR.
   * Use ->exec_op() in hynix_nand_reg_write_op().
"
2018-08-11 12:15:19 +02:00
Steve French
c4f7173ac3 smb3: create smb3 equivalent alias for cifs pseudo-xattrs
We really, really don't want to be encouraging people to use
cifs (the dialect) since it is insecure, so to avoid confusion
we want to move them to names which include 'smb3' instead of
'cifs' - so this simply creates an alias for the pseudo-xattrs

e.g. can now do:
getfattr -n user.smb3.creationtime /mnt1/file
and
getfattr -n user.smb3.dosattrib /mnt1/file
and
getfattr -n system.smb3_acl /mnt1/file

instead of forcing you to use the string 'cifs' in
these (e.g. getfattr -n system.cifs_acl /mnt1/file)

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-10 18:46:58 -05:00
Chao Yu
3093336481 f2fs: fix to reset i_gc_failures correctly
Let's reset i_gc_failures to zero when we unset pinned state for file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:39:26 -07:00
Chao Yu
d3f07c049d f2fs: fix invalid memory access
syzbot found the following crash on:

HEAD commit:    d9bd94c0bcaa Add linux-next specific files for 20180801
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1001189c400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=cc8964ea4d04518c
dashboard link: https://syzkaller.appspot.com/bug?extid=c966a82db0b14aa37e81
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c966a82db0b14aa37e81@syzkaller.appspotmail.com

loop7: rw=12288, want=8200, limit=20
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
openvswitch: netlink: Message has 8 unknown bytes.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
CPU: 1 PID: 7615 Comm: syz-executor7 Not tainted 4.18.0-rc7-next-20180801+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS:  00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_get_valid_checkpoint+0x436/0x1ec0 fs/f2fs/checkpoint.c:860
 f2fs_fill_super+0x2d42/0x8110 fs/f2fs/super.c:2883
 mount_bdev+0x314/0x3e0 fs/super.c:1344
 f2fs_mount+0x3c/0x50 fs/f2fs/super.c:3133
 legacy_get_tree+0x131/0x460 fs/fs_context.c:729
 vfs_get_tree+0x1cb/0x5c0 fs/super.c:1743
 do_new_mount fs/namespace.c:2603 [inline]
 do_mount+0x6f2/0x1e20 fs/namespace.c:2927
 ksys_mount+0x12d/0x140 fs/namespace.c:3143
 __do_sys_mount fs/namespace.c:3157 [inline]
 __se_sys_mount fs/namespace.c:3154 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3154
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45943a
Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 bd 8a fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 9a 8a fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:00007f36a61d4a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f36a61d4b30 RCX: 000000000045943a
RDX: 00007f36a61d4ad0 RSI: 0000000020000100 RDI: 00007f36a61d4af0
RBP: 0000000020000100 R08: 00007f36a61d4b30 R09: 00007f36a61d4ad0
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000013
R13: 0000000000000000 R14: 00000000004c8ea0 R15: 0000000000000000
Modules linked in:
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace bd8550c129352286 ]---
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
openvswitch: netlink: Message has 8 unknown bytes.
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS:  00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

In validate_checkpoint(), if we failed to call get_checkpoint_version(), we
will pass returned invalid page pointer into f2fs_put_page, cause accessing
invalid memory, this patch tries to handle error path correctly to fix this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
50fa53eccf f2fs: fix to avoid broken of dnode block list
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.

By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.

Sheng Yong helps to do the test with this patch:

Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8

1 / PERSIST / Index

Base:
	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	867.82		204.15		41440.03	41370.54	680.8		1025.94		1031.08
2	871.87		205.87		41370.3		40275.2		791.14		1065.84		1101.7
3	866.52		205.69		41795.67	40596.16	694.69		1037.16		1031.48
Avg	868.7366667	205.2366667	41535.33333	40747.3		722.21		1042.98		1054.753333

After:
	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	798.81		202.5		41143		40613.87	602.71		838.08		913.83
2	805.79		206.47		40297.2		41291.46	604.44		840.75		924.27
3	814.83		206.17		41209.57	40453.62	602.85		834.66		927.91
Avg	806.4766667	205.0466667	40883.25667	40786.31667	603.3333333	837.83		922.0033333

Patched/Original:
	0.928332713	0.999074239	0.984300676	1.000957528	0.835398753	0.803303994	0.874141189

It looks like atomic write will suffer performance regression.

I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.

BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:

- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
 - writeback data: P1, P2, P3;
 - writeback node: N1, N2, N3;  <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
 - preflush + fua;
- power-cut

If we don't wait dnode writeback for atomic_write:

	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	779.91		206.03		41621.5		40333.16	716.9		1038.21		1034.85
2	848.51		204.35		40082.44	39486.17	791.83		1119.96		1083.77
3	772.12		206.27		41335.25	41599.65	723.29		1055.07		971.92
Avg	800.18		205.55		41013.06333	40472.99333	744.0066667	1071.08		1030.18

Patched/Original:
	0.92108464	1.001526693	0.987425886	0.993268102	1.030180511	1.026942031	0.976702294

SQLite's performance recovers.

Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."

Finally, we decide to keep original implementation of atomic write interface
sematics that we don't wait all dnode writeback before preflush+fua submission.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Gustavo A. R. Silva
6e45f2a59f f2fs: use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
e494c2f995 f2fs: fix to do sanity check with cp_pack_start_sum
After fuzzing, cp_pack_start_sum could be corrupted, so current log's
summary info should be wrong due to loading incorrect summary block.
Then, if segment's type in current log is exceeded NR_CURSEG_TYPE, it
can lead accessing invalid dirty_i->dirty_segmap bitmap finally.

Add sanity check for cp_pack_start_sum to fix this issue.

https://bugzilla.kernel.org/show_bug.cgi?id=200419

- Reproduce

- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)

[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121]  ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127]  ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133]  ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145]  f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151]  ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156]  ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184]  ? map_id_range_down+0x17c/0x1b0
[ 3117.584188]  ? __put_user_ns+0x30/0x30
[ 3117.584206]  ? find_next_bit+0x53/0x90
[ 3117.584237]  ? cpumask_next+0x16/0x20
[ 3117.584249]  f2fs_fill_super+0x1948/0x2b40
[ 3117.584258]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279]  ? sget_userns+0x65e/0x690
[ 3117.584296]  ? set_blocksize+0x88/0x130
[ 3117.584302]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305]  mount_bdev+0x1c0/0x200
[ 3117.584310]  mount_fs+0x5c/0x190
[ 3117.584320]  vfs_kern_mount+0x64/0x190
[ 3117.584330]  do_mount+0x2e4/0x1450
[ 3117.584343]  ? lockref_put_return+0x130/0x130
[ 3117.584347]  ? copy_mount_string+0x20/0x20
[ 3117.584357]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383]  ? _copy_from_user+0x61/0x90
[ 3117.584396]  ? memdup_user+0x3e/0x60
[ 3117.584401]  ksys_mount+0x7e/0xd0
[ 3117.584405]  __x64_sys_mount+0x62/0x70
[ 3117.584427]  do_syscall_64+0x73/0x160
[ 3117.584440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225

[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G        W         4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494]  dump_stack+0x71/0xab
[ 3117.686512]  print_address_description+0x6b/0x290
[ 3117.686517]  kasan_report+0x28e/0x390
[ 3117.686522]  ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527]  __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532]  locate_dirty_segment+0x189/0x190
[ 3117.686538]  f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543]  recover_data+0x703/0x2c20
[ 3117.686547]  ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553]  ? ksys_mount+0x7e/0xd0
[ 3117.686564]  ? policy_nodemask+0x1a/0x90
[ 3117.686567]  ? policy_node+0x56/0x70
[ 3117.686571]  ? add_fsync_inode+0xf0/0xf0
[ 3117.686592]  ? blk_finish_plug+0x44/0x60
[ 3117.686597]  ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602]  ? find_inode_fast+0xac/0xc0
[ 3117.686606]  ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618]  ? __radix_tree_lookup+0x150/0x150
[ 3117.686633]  ? dqget+0x670/0x670
[ 3117.686648]  ? pagecache_get_page+0x29/0x410
[ 3117.686656]  ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660]  ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664]  f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670]  ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674]  ? rb_insert_color+0x323/0x3d0
[ 3117.686678]  ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683]  ? proc_register+0x153/0x1d0
[ 3117.686686]  ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695]  ? f2fs_attr_store+0x50/0x50
[ 3117.686700]  ? proc_create_single_data+0x52/0x60
[ 3117.686707]  f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735]  ? sget_userns+0x65e/0x690
[ 3117.686740]  ? set_blocksize+0x88/0x130
[ 3117.686745]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748]  mount_bdev+0x1c0/0x200
[ 3117.686753]  mount_fs+0x5c/0x190
[ 3117.686758]  vfs_kern_mount+0x64/0x190
[ 3117.686762]  do_mount+0x2e4/0x1450
[ 3117.686769]  ? lockref_put_return+0x130/0x130
[ 3117.686773]  ? copy_mount_string+0x20/0x20
[ 3117.686777]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795]  ? _copy_from_user+0x61/0x90
[ 3117.686801]  ? memdup_user+0x3e/0x60
[ 3117.686804]  ksys_mount+0x7e/0xd0
[ 3117.686809]  __x64_sys_mount+0x62/0x70
[ 3117.686816]  do_syscall_64+0x73/0x160
[ 3117.686824]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003

[ 3117.687005] Allocated by task 1225:
[ 3117.687152]  kasan_kmalloc+0xa6/0xd0
[ 3117.687157]  kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161]  f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165]  f2fs_fill_super+0x1948/0x2b40
[ 3117.687168]  mount_bdev+0x1c0/0x200
[ 3117.687171]  mount_fs+0x5c/0x190
[ 3117.687174]  vfs_kern_mount+0x64/0x190
[ 3117.687177]  do_mount+0x2e4/0x1450
[ 3117.687180]  ksys_mount+0x7e/0xd0
[ 3117.687182]  __x64_sys_mount+0x62/0x70
[ 3117.687186]  do_syscall_64+0x73/0x160
[ 3117.687190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 3117.687285] Freed by task 19:
[ 3117.687412]  __kasan_slab_free+0x137/0x190
[ 3117.687416]  kfree+0x8b/0x1b0
[ 3117.687460]  ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476]  ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492]  ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507]  ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528]  process_one_work+0x2f9/0x740
[ 3117.687531]  worker_thread+0x78/0x6b0
[ 3117.687541]  kthread+0x177/0x1c0
[ 3117.687545]  ret_from_fork+0x35/0x40

[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
                which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
                192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected

[ 3117.689653] Memory state around the buggy address:
[ 3117.689816]  ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027]  ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448]                                                  ^
[ 3117.690644]  ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868]  ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G    B   W         4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077]  locate_dirty_segment+0x189/0x190
[ 3117.716891]  f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617]  recover_data+0x703/0x2c20
[ 3117.726316]  ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957]  ? ksys_mount+0x7e/0xd0
[ 3117.735573]  ? policy_nodemask+0x1a/0x90
[ 3117.740198]  ? policy_node+0x56/0x70
[ 3117.744829]  ? add_fsync_inode+0xf0/0xf0
[ 3117.749487]  ? blk_finish_plug+0x44/0x60
[ 3117.754152]  ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831]  ? find_inode_fast+0xac/0xc0
[ 3117.763448]  ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046]  ? __radix_tree_lookup+0x150/0x150
[ 3117.772603]  ? dqget+0x670/0x670
[ 3117.777159]  ? pagecache_get_page+0x29/0x410
[ 3117.781648]  ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067]  ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476]  f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790]  ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086]  ? rb_insert_color+0x323/0x3d0
[ 3117.803304]  ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563]  ? proc_register+0x153/0x1d0
[ 3117.811766]  ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947]  ? f2fs_attr_store+0x50/0x50
[ 3117.820087]  ? proc_create_single_data+0x52/0x60
[ 3117.824262]  f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432]  ? sget_userns+0x65e/0x690
[ 3117.836500]  ? set_blocksize+0x88/0x130
[ 3117.840501]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420]  mount_bdev+0x1c0/0x200
[ 3117.848275]  mount_fs+0x5c/0x190
[ 3117.852053]  vfs_kern_mount+0x64/0x190
[ 3117.855810]  do_mount+0x2e4/0x1450
[ 3117.859441]  ? lockref_put_return+0x130/0x130
[ 3117.862996]  ? copy_mount_string+0x20/0x20
[ 3117.866417]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333]  ? _copy_from_user+0x61/0x90
[ 3117.882467]  ? memdup_user+0x3e/0x60
[ 3117.885604]  ksys_mount+0x7e/0xd0
[ 3117.888700]  __x64_sys_mount+0x62/0x70
[ 3117.891742]  do_syscall_64+0x73/0x160
[ 3117.894692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0

- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
		if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
			dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL which leads to crash in test_and_clear_bit()

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Jaegeuk Kim
8d714f8aa3 f2fs: avoid f2fs_bug_on() in cp_error case
There is a subtle race condition to invoke f2fs_bug_on() in shutdown tests. I've
confirmed that the last checkpoint is preserved in consistent state, so it'd be
fine to just return error at this moment.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
66110abc4c f2fs: fix to clear PG_checked flag in set_page_dirty()
PG_checked flag will be set on data page during GC, later, we can
recognize such page by the flag and migrate page to cold segment.

But previously, we don't clear this flag when invalidating data page,
after page redirtying, we will write it into wrong log.

Let's clear PG_checked flag in set_page_dirty() to avoid this.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Darrick J. Wong
13942aa94a xfs: repair the AGI
Rebuild the AGI header items with some help from the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Darrick J. Wong
0e93d3f43e xfs: repair the AGFL
Repair the AGFL from the rmap data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Darrick J. Wong
f9ed6debca xfs: repair the AGF
Regenerate the AGF from the rmap data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Steve French
cdeaf9d04a smb3: allow previous versions to be mounted with snapshot= mount parm
mounting with the "snapshots=" mount parm allows a read-only
view of a previous version of a file system (see MS-SMB2
and "timewarp" tokens, section 2.2.13.2.6) based on the timestamp
passed in on the snapshots mount parm.

Add processing to optionally send this create context.

Example output:

/mnt1 is mounted with "snapshots=..." and will see an earlier
version of the directory, with three fewer files than /mnt2
the current version of the directory.

root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# cat /proc/mounts | grep cifs
//172.22.149.186/public /mnt1 cifs
ro,relatime,vers=default,cache=strict,username=smfrench,uid=0,noforceuid,gid=0,noforcegid,addr=172.22.149.186,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,snapshot=131748608570000000,actimeo=1

//172.22.149.186/public /mnt2 cifs
rw,relatime,vers=default,cache=strict,username=smfrench,uid=0,noforceuid,gid=0,noforcegid,addr=172.22.149.186,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1

root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt1
EmptyDir  newerdir
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt1/newerdir

root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt2
EmptyDir  file  newerdir  newestdir  timestamp-trace.cap
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt2/newerdir
new-file-not-in-snapshot

Snapshots are extremely useful for comparing previous versions of files or directories,
and recovering from data corruptions or mistakes.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-10 11:54:08 -05:00
Ronnie Sahlberg
e55954a5f7 cifs: don't show domain= in mount output when domain is empty
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-10 11:53:51 -05:00
Ronnie Sahlberg
c1777df1a5 cifs: add missing support for ACLs in SMB 3.11
We were missing the methods for get_acl and friends for the 3.11
dialect.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-10 11:53:32 -05:00
Steve French
e02789a53d smb3: enumerating snapshots was leaving part of the data off end
When enumerating snapshots, the last few bytes of the final
snapshot could be left off since we were miscalculating the
length returned (leaving off the sizeof struct SRV_SNAPSHOT_ARRAY)
See MS-SMB2 section 2.2.32.2. In addition fixup the length used
to allow smaller buffer to be passed in, in order to allow
returning the size of the whole snapshot array more easily.

Sample userspace output with a kernel patched with this
(mounted to a Windows volume with two snapshots).
Before this patch, the second snapshot would be missing a
few bytes at the end.

~/cifs-2.6# ~/enum-snapshots /mnt/file
press enter to issue the ioctl to retrieve snapshot information ...

size of snapshot array = 102
Num snapshots: 2 Num returned: 2 Array Size: 102

Snapshot 0:@GMT-2018.06.30-19.34.17
Snapshot 1:@GMT-2018.06.30-19.33.37

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-09 21:20:01 -05:00
Ronnie Sahlberg
730928c8f4 cifs: update smb2_queryfs() to use compounding
Change smb2_queryfs() to use a Create/QueryInfo/Close compound request.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-09 21:19:56 -05:00
Ronnie Sahlberg
b24df3e30c cifs: update receive_encrypted_standard to handle compounded responses
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-09 21:19:45 -05:00
Al Viro
4c0d7cd5c8 make sure that __dentry_kill() always invalidates d_seq, unhashed or not
RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq.  That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all.  Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.

We could try and be careful in the (few) places where that could
happen.  Or we could just make the final dput() invalidate the damn
thing, unhashed or not.  The latter is much simpler and easier to
backport, so let's do it that way.

Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09 18:07:15 -04:00
Al Viro
119e1ef80e fix __legitimize_mnt()/mntput() race
__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntpu() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.

Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped.  Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.

It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there.  The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...

Reported-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09 17:51:32 -04:00
Al Viro
9ea0a46ca2 fix mntput/mntput race
mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test.  Consider the following situation:
	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
	* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69.  There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
	* On CPU 42 our first mntput() finishes counting.  It observes the
decrement of component #69, but not the increment of component #0.  As the
result, the total it gets is not 1 as it should've been - it's 0.  At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down.  However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.

It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.

Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Jann Horn <jannh@google.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09 17:21:17 -04:00
Gustavo A. R. Silva
a677a78325 nfsd: use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Eric Biggers
c2cdc2ab24 nfsd: constify write_op[]
write_op[] is never modified, so make it 'const'.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
nixiaoming
5ed96bc545 fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
READ_BUF(8);
dummy = be32_to_cpup(p++);
dummy = be32_to_cpup(p++);
...
READ_BUF(4);
dummy = be32_to_cpup(p++);

Assigning value to "dummy" here, but that stored value
is overwritten before it can be used.
At the same time READ_BUF() will re-update the pointer p.

delete invalid assignment statements

Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
11b4d66ea3 NFSD: Handle full-length symlinks
I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).

I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.

There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
3fd9557aec NFSD: Refactor the generic write vector fill helper
fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.

svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_args.pages, have the caller pass in the page list.

The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Gustavo A. R. Silva
7b4d6da4bb nfsd: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Warning level 2 was used: -Wimplicit-fallthrough=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Amir Goldstein
64bed6cbe3 nfsd: fix leaked file lock with nfs exported overlayfs
nfsd and lockd call vfs_lock_file() to lock/unlock the inode
returned by locks_inode(file).

Many places in nfsd/lockd code use the inode returned by
file_inode(file) for lock manipulation. With Overlayfs, file_inode()
(the underlying inode) is not the same object as locks_inode() (the
overlay inode). This can result in "Leaked POSIX lock" messages
and eventually to a kernel crash as reported by Eddie Horng:
https://marc.info/?l=linux-unionfs&m=153086643202072&w=2

Fix all the call sites in nfsd/lockd that should use locks_inode().
This is a correctness bug that manifested when overlayfs gained
NFS export support in v4.16.

Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Fixes: 8383f17488 ("ovl: wire up NFS export operations")
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Olga Kornievskaia
6b8d84e2f1 NFS add a simple sync nfs4_proc_commit after async COPY
A COPY with unstable write data needs a simple sync commit.
Filehandle value is gotten as a part of the inner loop so in
case of a reboot retry it should get the new value.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
539f57b3e0 NFS handle COPY ERR_OFFLOAD_NO_REQS
If client sent async COPY and server replied with
ERR_OFFLOAD_NO_REQS, client should retry with a synchronous copy.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
c975c20926 NFS send OFFLOAD_CANCEL when COPY killed
When COPY is killed by the user send OFFLOAD_CANCEL to server
processing the copy.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
0f913a57d7 NFS export nfs4_async_handle_error
Make this function available to nfs42proc.c

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
bc0c9079b4 NFS handle COPY reply CB_OFFLOAD call race
It's possible that server replies back with CB_OFFLOAD call and
COPY reply at the same time such that client will process
CB_OFFLOAD before reply to COPY. For that keep a list of pending
callback stateids received and then before waiting on completion
check the pending list.

Cleanup any pending copies on the client shutdown.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
62164f3179 NFS add support for asynchronous COPY
Change xdr to always send COPY asynchronously.

Keep the list copies send in a list under a server structure.
Once copy is sent, it waits on a completion structure that will
be signalled by the callback thread that receives CB_OFFLOAD.

If CB_OFFLOAD returned an error and even if it returned partial
bytes, ignore them (as we can't commit without a verifier to
match) and return an error.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:39 -04:00
Olga Kornievskaia
67aa7444c4 NFS COPY xdr handle async reply
If server returns async reply, it must include a callback stateid,
wr_callback_id in the write_response4.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:38 -04:00
Olga Kornievskaia
cb95deea0b NFS OFFLOAD_CANCEL xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:38 -04:00
Olga Kornievskaia
5178a125f6 NFS CB_OFFLOAD xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-09 12:56:38 -04:00
Ronnie Sahlberg
1eb9fb5204 cifs: create SMB2_open_init()/SMB2_open_free() helpers.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-08 18:10:26 -05:00
Ronnie Sahlberg
296ecbae7f cifs: add SMB2_query_info_[init|free]()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
2018-08-08 18:08:47 -05:00
Ronnie Sahlberg
8eb4ecfab0 cifs: add SMB2_close_init()/SMB2_close_free()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
2018-08-08 16:49:08 -05:00
NeilBrown
46483c2ea4 NFS: Use an appropriate work queue for direct-write completion
When a direct-write completes, a work_struct is schedule to handle
the completion.
When NFS is being used for swap, the direct write might be a swap-out,
so memory allocation can block until the write completes.
The work queue currently used is not WQ_MEM_RECLAIM, so tasks
can block waiting for memory - this leads to deadlock.

So use nfsiod_workqueue instead.  This will always have a running
thread, and work items should never block waiting for memory.

Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 17:07:38 -04:00
Wei Yongjun
72bf75cfc0 NFSv4: Fix error handling in nfs4_sp4_select_mode()
Error code is set in the error handling cases but never used. Fix it.

Fixes: 937e3133cd ("NFSv4.1: Ensure we clear the SP4_MACH_CRED flags in nfs4_sp4_select_mode()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:31 -04:00
Gustavo A. R. Silva
10db5b7a2f pnfs: Use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:03 -04:00
Trond Myklebust
2230ca0d28 pnfs: pnfs_find_lseg() should not check NFS_LSEG_LAYOUTRETURN
Layout segment validity is determined only by the NFS_LSEG_VALID flag. If
it is set, the layout segment is finable. As it is, when the flexfiles
driver sets NFS_LSEG_LAYOUTRETURN to indicate that we cannot discard
the layout segment, but that it must be returned, then this can result
in an unnecessary layoutget storm.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:03 -04:00
Gustavo A. R. Silva
01e03bdc74 NFS: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Warning level 2 was used: -Wimplicit-fallthrough=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:02 -04:00
Trond Myklebust
c8d07159c9 NFSv4: Mark the inode change attribute up to date in update_changeattr()
When we update the change attribute, we should also clear the flag that
says it is out of date.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:02 -04:00
Trond Myklebust
5636ec4eb6 NFSv4: Detect nlink changes on cross-directory renames too
If the object being renamed from one directory to another is also
a directory, then 'nlink' will change for both directories.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:02 -04:00
Trond Myklebust
3c591175d6 NFSv4: bump/drop the nlink count on the parent dir when we mkdir/rmdir
Ensure that we always bump or drop the nlink count on the parent directory
when we do a mkdir or a rmdir(). This needs to be done by hand as we don't
have pre/post op attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:01 -04:00
Trond Myklebust
c16467dc03 pnfs: Fix handling of NFS4ERR_OLD_STATEID replies to layoutreturn
If the server tells us that out layoutreturn raced with another layout
update, then we must ensure that the new layout segments are not in use
before we resend with an updated layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:50:01 -04:00
Jeff Layton
da33a871ba locks: remove misleading obsolete comment
The spinlock handling in this file has changed significantly since this
comment was written, and the file_lock_lock is no more. In addition,
this overall comment no longer applies. Deleting an entry now requires
both locks.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2018-08-08 12:59:06 -04:00
Bob Peterson
f5580d0f8b gfs2: eliminate update_rgrp_lvb_unlinked
Function update_rgrp_lvb_unlinked used to do the same thing as
be32_add_cpu. This patch removes it in favor of using be32_add_cpu
directly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
2018-08-08 10:34:39 -05:00
Steve French
468d677954 smb3: display stats counters for number of slow commands
When CONFIG_CIFS_STATS2 is enabled keep counters for slow
commands (ie server took longer than 1 second to respond)
by SMB2/SMB3 command code.  This can help in diagnosing
whether performance problems are on server (instead of
client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2
4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16

1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:30:59 -05:00
Aurelien Aptel
a5c62f4833 CIFS: fix uninitialized ptr deref in smb2 signing
server->secmech.sdeschmacsha256 is not properly initialized before
smb2_shash_allocate(), set shash after that call.

also fix typo in error message

Fixes: 8de8c4608f ("cifs: Fix validation of signed data in smb2")

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-08-07 14:30:59 -05:00
Steve French
fd09b7d3b3 smb3: Do not send SMB3 SET_INFO if nothing changed
An earlier commit had a typo which prevented the
optimization from working:

commit 18dd8e1a65 ("Do not send SMB3 SET_INFO request if nothing is changing")

Thank you to Metze for noticing this.  Also clear a
reserved field in the FILE_BASIC_INFO struct we send
that should be zero (all the other fields in that
struct were set or cleared explicitly already in
cifs_set_file_info).

Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.9.x+
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:30:59 -05:00
Steve French
d258650004 smb3: fix minor debug output for CONFIG_CIFS_STATS
CONFIG_CIFS_STATS is now always enabled (to simplify the
code and since the STATS are important for some common
customer use cases and also debugging), but needed one
minor change so that STATS shows as enabled in the debug
output in /proc/fs/cifs/DebugData, otherwise it could
get confusing with STATS no longer showing up in the
"Features" list in /proc/fs/cifs/DebugData when basic
stats were in fact available.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07 14:30:59 -05:00
Steve French
020eec5f71 smb3: add tracepoint for slow responses
If responses take longer than one second from the server,
we can optionally log them to dmesg in current cifs.ko code
(CONFIG_CIFS_STATS2 must be configured and a
/proc/fs/cifs/cifsFYI flag must be set), but can be more useful
to log these via ftrace (tracepoint is smb3_slow_rsp) which
is easier and more granular (still requires CONFIG_CIFS_STATS2
to be configured in the build though).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:28:01 -05:00
Ronnie Sahlberg
e0bba0b854 cifs: add compound_send_recv()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:23:20 -05:00
Ronnie Sahlberg
1f3a8f5f7a cifs: make smb_send_rqst take an array of requests
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:23:04 -05:00
Ronnie Sahlberg
b2c96de7fe cifs: update init_sg, crypt_message to take an array of rqst
These are used for SMB3 encryption and compounded requests.
Update these functions and the other functions related to SMB3 encryption to
take an array of requests.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:21:18 -05:00
Steve French
c281bc0c74 smb3: fix reset of bytes read and written stats
echo 0 > /proc/fs/cifs/Stats is supposed to reset the stats
but there were four (see example below) that were not reset
(bytes read and witten, total vfs ops and max ops
at one time).

...
0 session 0 share reconnects
Total vfs operations: 100 maximum at one time: 2

1) \\localhost\test
SMBs: 0
Bytes read: 502092  Bytes written: 31457286
TreeConnects: 0 total 0 failed
TreeDisconnects: 0 total 0 failed
...

This patch fixes cifs_stats_proc_write to properly reset
those four.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:20:22 -05:00
Steve French
52ce1ac429 smb3: display bytes_read and bytes_written in smb3 stats
We were only displaying bytes_read and bytes_written in cifs
stats, fix smb3 stats to also display them.  Sample output
with this patch:

    cat /proc/fs/cifs/Stats:

CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 94 maximum at one time: 2

1) \\localhost\test
SMBs: 214
Bytes read: 502092  Bytes written: 31457286
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 52 total 3 failed
Closes: 48 total 0 failed
Flushes: 0 total 0 failed
Reads: 17 total 0 failed
Writes: 31 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07 14:20:22 -05:00
Steve French
fcabb89299 cifs: simple stats should always be enabled
CONFIG_CIFS_STATS should always be enabled as Pavel recently
noted.  Simple statistics are not a significant performance hit,
and removing the ifdef simplifies the code slightly.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:20:22 -05:00
Ronnie Sahlberg
9da6ec7775 cifs: use a refcount to protect open/closing the cached file handle
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: <stable@vger.kernel.org>
2018-08-07 14:20:22 -05:00
Steve French
bf1fdeb789 smb3: add reconnect tracepoints
Add tracepoints for reconnecting an smb3 session

Example output (from trace-cmd) with the patch
(showing the session marked for reconnect, the stat failing, and then
the subsequent SMB3 commands after the server comes back up).
The "smb3_reconnect" event is the new one.

           cifsd-25993 [000] .... 29635.368265: smb3_reconnect: server=localhost current_mid=0x1e
            stat-26200 [001] .... 29638.516403: smb3_enter: 	cifs_revalidate_dentry_attr: xid=22
            stat-26200 [001] .... 29648.723296: smb3_exit_err: 	cifs_revalidate_dentry_attr: xid=22 rc=-112
     kworker/0:1-22830 [000] .... 29653.850947: smb3_cmd_done: 	sid=0x0 tid=0x0 cmd=0 mid=0
     kworker/0:1-22830 [000] .... 29653.851191: smb3_cmd_err: 	sid=0x8ae4683c tid=0x0 cmd=1 mid=1 status=0xc0000016 rc=-5
     kworker/0:1-22830 [000] .... 29653.855254: smb3_cmd_done: 	sid=0x8ae4683c tid=0x0 cmd=1 mid=2
     kworker/0:1-22830 [000] .... 29653.855482: smb3_cmd_done: 	sid=0x8ae4683c tid=0x8084f30d cmd=3 mid=3

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:20:22 -05:00
Steve French
e68a932b0b smb3: add tracepoint for session expired or deleted
In debugging reconnection problems, want to be able to more easily
trace cases in which the server has marked the SMB3 session
expired or deleted (to distinguish from timeout cases).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07 14:15:57 -05:00
Steve French
06188fcf9c cifs: remove unused stats
These timers were a good idea but weren't used in current code,
and the idea was cifs specific.  Future patch will add similar timers
for SMB2/SMB3, but no sense using memory for cifs timers that
aren't used in current code.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07 14:15:57 -05:00
Steve French
22783155f4 smb3: don't request leases in symlink creation and query
Fixes problem pointed out by Pavel in discussions about commit
729c0c9dd5

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org> # 3.18.x+
2018-08-07 14:15:57 -05:00
Steve French
1995d28f84 smb3: remove per-session operations from per-tree connection stats
Remove counters from the per-tree connection /proc/fs/cifs/Stats
output that will always be zero (since they are not per-tcon ops)
ie SMB3 Negotiate, SessionSetup, Logoff, Echo, Cancel.

Also clarify "sent" to be "total" per-Pavel's suggestion
(since this "total" includes total for all operations that we try to
send whether or not succesffully sent). Sample output below:

Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

1 session 2 share reconnects
Total vfs operations: 23 maximum at one time: 2

1) \\localhost\test
SMBs: 45
TreeConnects: 2 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 13 total 2 failed
Closes: 9 total 0 failed
Flushes: 0 total 0 failed
Reads: 0 total 0 failed
Writes: 1 total 0 failed
Locks: 0 total 0 failed
IOCTLs: 3 total 1 failed
QueryDirectories: 4 total 2 failed
ChangeNotifies: 0 total 0 failed
QueryInfos: 10 total 0 failed
SetInfos: 3 total 0 failed
OplockBreaks: 0 sent 0 failed

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:15:56 -05:00
Steve French
289131e1f1 SMB3: Number of requests sent should be displayed for SMB3 not just CIFS
For SMB2/SMB3 the number of requests sent was not displayed
in /proc/fs/cifs/Stats unless CONFIG_CIFS_STATS2 was
enabled (only number of failed requests displayed). As
with earlier dialects, we should be displaying these
counters if CONFIG_CIFS_STATS is enabled. They
are important for debugging.

e.g. when you cat /proc/fs/cifs/Stats (before the patch)
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 690 maximum at one time: 2

1) \\localhost\test
SMBs: 975
Negotiates: 0 sent 0 failed
SessionSetups: 0 sent 0 failed
Logoffs: 0 sent 0 failed
TreeConnects: 0 sent 0 failed
TreeDisconnects: 0 sent 0 failed
Creates: 0 sent 2 failed
Closes: 0 sent 0 failed
Flushes: 0 sent 0 failed
Reads: 0 sent 0 failed
Writes: 0 sent 0 failed
Locks: 0 sent 0 failed
IOCTLs: 0 sent 1 failed
Cancels: 0 sent 0 failed
Echos: 0 sent 0 failed
QueryDirectories: 0 sent 63 failed

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07 14:15:56 -05:00
Steve French
8a69e96e61 smb3: snapshot mounts are read-only and make sure info is displayable about the mount
snapshot mounts were not marked as read-only and did not display the snapshot
time (in /proc/mounts) specified on mount

With this patch - note that can not write to the snapshot mount (see "ro" in
/proc/mounts line) and also the missing snapshot timewarp token time is
dumped.  Sample line from /proc/mounts with the patch:

//127.0.0.1/scratch /mnt2 smb3 ro,relatime,vers=default,cache=strict,username=testuser,domain=,uid=0,noforceuid,gid=0,noforcegid,addr=127.0.0.1,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,noperm,rsize=1048576,wsize=1048576,echo_interval=60,snapshot=1234567,actimeo=1 0 0

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
2018-08-07 14:15:56 -05:00
Steve French
c3ed44026c smb3: remove noisy warning message on mount
Some servers, like Samba, don't support the fsctl for
query_network_interface_info so don't log a noisy warning
message on mount for this by default unless the error is more serious.
Lower the error to an FYI level so it does not get logged by
default.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:56 -05:00
Steve French
0fdfef9aa7 smb3: simplify code by removing CONFIG_CIFS_SMB311
We really, really want to be encouraging use of secure dialects,
and SMB3.1.1 offers useful security features, and will soon
be the recommended dialect for many use cases. Simplify the code
by removing the CONFIG_CIFS_SMB311 ifdef so users don't disable
it in the build, and create compatibility and/or security issues
with modern servers - many of which have been supporting this
dialect for multiple years.

Also clarify some of the Kconfig text for cifs.ko about
SMB3.1.1 and current supported features in the module.

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07 14:15:56 -05:00
Steve French
950132afd5 cifs: add missing debug entries for kconfig options
/proc/fs/cifs/DebugData displays the features (Kconfig options)
used to build cifs.ko but it was missing some, and needed comma
separator.  These can be useful in debugging certain problems
so we know which optional features were enabled in the user's build.
Also clarify them, by making them more closely match the
corresponding CONFIG_CIFS_* parm.

Old format:
Features: dfs fscache posix spnego xattr acl

New format:
Features: DFS,FSCACHE,SMB_DIRECT,STATS,DEBUG2,ALLOW_INSECURE_LEGACY,CIFS_POSIX,UPCALL(SPNEGO),XATTR,ACL

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
CC: Stable <stable@vger.kernel.org>
2018-08-07 14:15:56 -05:00
Steve French
2d30421783 smb3: add support for statfs for smb3.1.1 posix extensions
Output now matches expected stat -f output for all fields
except for Namelen and ID which were addressed in a companion
patch (which retrieves them from existing SMB3 mechanisms
and works whether POSIX enabled or not)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:15:41 -05:00
Steve French
21ba3845b5 smb3: fill in statfs fsid and correct namelen
Fil in the correct namelen (typically 255 not 4096) in the
statfs response and also fill in a reasonably unique fsid
(in this case taken from the volume id, and the creation time
of the volume).

In the case of the POSIX statfs all fields are now filled in,
and in the case of non-POSIX mounts, all fields are filled
in which can be.

Signed-off-by: Steve French <stfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:15:41 -05:00
Paulo Alcantara
a12d0c590c cifs: Make sure all data pages are signed correctly
Check if every data page is signed correctly in sigining helper.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:41 -05:00
Aurelien Aptel
256b4c3f03 CIFS: fix memory leak and remove dead code
also fixes error code in smb311_posix_mkdir() (where
the error assignment needs to go before the goto)
a typo that Dan Carpenter and Paulo and Gustavo
pointed out.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:41 -05:00
Steve French
7420451f6a cifs: allow disabling insecure dialects in the config
allow disabling cifs (SMB1 ie vers=1.0) and vers=2.0 in the
config for the build of cifs.ko if want to always prevent mounting
with these less secure dialects.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
2018-08-07 14:15:41 -05:00
Steve French
8505c8bfd8 smb3: if server does not support posix do not allow posix mount option
If user specifies "posix" on an SMB3.11 mount, then fail the mount
if server does not return the POSIX negotiate context indicating
support for posix.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07 14:15:41 -05:00
Arnd Bergmann
cbedeadf9c cifs: use 64-bit timestamps for fscache
In the fscache, we just need the timestamps as cookies to check for
changes, so we don't really care about the overflow, but it's better
to stop using the deprecated timespec so we don't have to go through
explicit conversion functions.

To avoid comparing uninitialized padding values that are copied
while assigning the timespec values, this rearranges the members of
cifs_fscache_inode_auxdata to avoid padding, and assigns them
individually.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:41 -05:00
Arnd Bergmann
95390201e7 cifs: use timespec64 internally
In cifs, the timestamps are stored in memory in the cifs_fattr structure,
which uses the deprecated 'timespec' structure. Now that the VFS code
has moved on to 'timespec64', the next step is to change over the fattr
as well.

This also makes 32-bit and 64-bit systems behave the same way, and
no longer overflow the 32-bit time_t in year 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:41 -05:00
Dan Carpenter
ff361fda55 cifs: Silence uninitialized variable warning
This is not really a runtime issue but Smatch complains that:

    fs/cifs/smb2ops.c:1740 smb2_query_symlink()
    error: uninitialized symbol 'resp_buftype'.

The warning is right that it can be uninitialized...  Also "err_buf"
would be NULL at this point and we're not supposed to pass NULLs to
free_rsp_buf() or it might trigger some extra output if we turn on
debugging.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07 14:15:41 -05:00
Brian Foster
73971b172a xfs: remove dead error handling code in xfs_dquot_disk_alloc()
Colin Ian King reports that commit 82ff27bc52 ("xfs: automatic dfops
buffer relogging") leaves around some dead error handling code in
xfs_dquot_disk_alloc(). This was discovered via Coverity scan.

Since the associated commit eliminates the act of joining a buffer
to a dfops, this intermediate error state is no longer possible and
the error handling code can be removed. Since the caller cancels the
transaction on error, which cancels the dfops, eliminate the
unnecessary xfs_defer_cancel() call and error handling labels.

Fixes: 82ff27bc52 ("xfs: automatic dfops buffer relogging")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-07 10:57:13 -07:00
Christoph Hellwig
2ba090d521 xfs: use WRITE_ONCE to update if_seq
This adds ordering of the updates and makes sure we always see the if_seq
update before the extent tree is modified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-07 10:57:12 -07:00
Bob Peterson
dffe12a828 gfs2: Fix gfs2_testbit to use clone bitmaps
Function gfs2_testbit is called in three places. Two of those places,
gfs2_alloc_extent and gfs2_unaligned_extlen, should be using the clone
bitmaps, not the "real" bitmaps. Function gfs2_unaligned_extlen is used
by the block reservations scheme to determine the length of an extent of
free blocks. Before this patch, it wasn't using the clone bitmap, which
means recently-freed blocks were treated as free blocks for the purposes
of an allocation.

This patch adds a new parameter to gfs2_testbit to indicate whether or
not the clone bitmaps should be used (if available).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-08-07 10:07:00 -05:00
Jeff Layton
c883da313e locks: add tracepoint in flock codepath
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2018-08-06 13:15:16 -04:00
Al Viro
90bad5e05b root dentries need RCU-delayed freeing
Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.

It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock().  If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.

Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-06 09:13:32 -04:00
Naohiro Aota
39379faaad btrfs: revert fs_devices state on error of btrfs_init_new_device
When btrfs hits error after modifying fs_devices in
btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it
leaves everything as is, but frees allocated btrfs_device. As a result,
fs_devices->devices and fs_devices->alloc_list contain already freed
btrfs_device, leading to later use-after-free bug.

Error path also messes the things like ->num_devices. While they go back
to the original value by unscanning btrfs devices, it is safe to revert
them here.

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:04 +02:00
Qu Wenruo
64f64f43c8 btrfs: Exit gracefully when chunk map cannot be inserted to the tree
It's entirely possible that a crafted btrfs image contains overlapping
chunks.

Although we can't detect such problem by tree-checker, it's not a
catastrophic problem, current extent map can already detect such problem
and return -EEXIST.

We just only need to exit gracefully and fail the mount.

Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Qu Wenruo
cf90d884b3 btrfs: Introduce mount time chunk <-> dev extent mapping check
This patch will introduce chunk <-> dev extent mapping check, to protect
us against invalid dev extents or chunks.

Since chunk mapping is the fundamental infrastructure of btrfs, extra
check at mount time could prevent a lot of unexpected behavior (BUG_ON).

Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Qu Wenruo
7ef49515fa btrfs: Verify that every chunk has corresponding block group at mount time
If a crafted image has missing block group items, it could cause
unexpected behavior and breaks the assumption of 1:1 chunk<->block group
mapping.

Although we have the block group -> chunk mapping check, we still need
chunk -> block group mapping check.

This patch will do extra check to ensure each chunk has its
corresponding block group.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Qu Wenruo
514c7dca85 btrfs: Check that each block group has corresponding chunk at mount time
A crafted btrfs image with incorrect chunk<->block group mapping will
trigger a lot of unexpected things as the mapping is essential.

Although the problem can be caught by block group item checker
added in "btrfs: tree-checker: Verify block_group_item", it's still not
sufficient.  A sufficiently valid block group item can pass the check
added by the mentioned patch but could fail to match the existing chunk.

This patch will add extra block group -> chunk mapping check, to ensure
we have a completely matching (start, len, flags) chunk for each block
group at mount time.

Here we reuse the original helper find_first_block_group(), which is
already doing the basic bg -> chunk checks, adding further checks of the
start/len and type flags.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Filipe Manana
22d3151c2c Btrfs: send, fix incorrect file layout after hole punching beyond eof
When doing an incremental send, if we have a file in the parent snapshot
that has prealloc extents beyond EOF and in the send snapshot it got a
hole punch that partially covers the prealloc extents, the send stream,
when replayed by a receiver, can result in a file that has a size bigger
than it should and filled with zeroes past the correct EOF.

For example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send -f /tmp/1.send /mnt/snap1

  $ xfs_io -c "fpunch 1M 2M" /mnt/foobar

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2

  $ stat --format %s /mnt/snap2/foobar
  1048576
  $ md5sum /mnt/snap2/foobar
  d31659e82e87798acd4669a1e0a19d4f  /mnt/snap2/foobar

  $ umount /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ btrfs receive -f /mnt/1.snap /mnt
  $ btrfs receive -f /mnt/2.snap /mnt

  $ stat --format %s /mnt/snap2/foobar
  3145728
  # --> should be 1Mb and not 3Mb (which was the end offset of hole
  #     punch operation)
  $ md5sum /mnt/snap2/foobar
  117baf295297c2a995f92da725b0b651  /mnt/snap2/foobar
  # --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs

This issue actually happens only since commit ffa7c4296e ("Btrfs: send,
do not issue unnecessary truncate operations"), but before that commit we
were issuing a write operation full of zeroes (to "punch" a hole) which
was extending the file size beyond the correct value and then immediately
issue a truncate operation to the correct size and undoing the previous
write operation. Since the send protocol does not support fallocate, for
extent preallocation and hole punching, fix this by not even attempting
to send a "hole" (regular write full of zeroes) if it starts at an offset
greater then or equals to the file's size. This approach, besides being
much more simple then making send issue the truncate operation, adds the
benefit of avoiding the useless pair of write of zeroes and truncate
operations, saving time and IO at the receiver and reducing the size of
the send stream.

A test case for fstests follows soon.

Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
CC: stable@vger.kernel.org # 4.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:03 +02:00
Misono Tomohiro
672d599041 btrfs: Use wrapper macro for rcu string to remove duplicate code
Cleanup patch and no functional changes.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:02 +02:00
Al Viro
f5b3a4173f btrfs: simplify btrfs_iget
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL),
move handling of positive return values of btrfs_lookup_inode() from
btrfs_read_locked_inode() to btrfs_iget() and kill now obviously
pointless ASSERT() in there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:02 +02:00
Al Viro
9bc2ceff66 btrfs: lift make_bad_inode into btrfs_iget
We don't need to check is_bad_inode() after the call of
btrfs_read_locked_inode() - it's exactly the same as checking return
value for being non-zero.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:02 +02:00
Al Viro
8d9e220ca0 btrfs: simplify IS_ERR/PTR_ERR checks
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:02 +02:00
Al Viro
2e19f1f9d3 btrfs: btrfs_iget never returns an is_bad_inode inode
Just get rid of pointless checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:02 +02:00
Misono Tomohiro
1e7e1f9e3a btrfs: replace: Reset on-disk dev stats value after replace
on-disk devs stats value is updated in btrfs_run_dev_stats(),
which is called during commit transaction, if device->dev_stats_ccnt
is not zero.

Since current replace operation does not touch dev_stats_ccnt,
on-disk dev stats value is not updated. Therefore "btrfs device stats"
may return old device's value after umount/mount
(Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish).

Fix this by just incrementing dev_stats_ccnt in
btrfs_dev_replace_finishing() when replace is succeeded and this will
update the values.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Misono Tomohiro
85c3954819 btrfs: extent-tree: Remove unused __btrfs_free_block_rsv
There is no user of this function anymore.

This was forgotten to be removed in commit a575ceeb13
("Btrfs: get rid of unused orphan infrastructure").

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Misono Tomohiro
afc6961ffd btrfs: backref: Use ERR_CAST to return error code
Use ERR_CAST() instead of void * to make meaning clear.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Lu Fengqi
5b7d687ad5 btrfs: Remove redundant btrfs_release_path from btrfs_unlink_subvol
Although it is safe to call this on already released paths with no locks
held or extent buffers, removing the redundant btrfs_release_path is
reasonable.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Lu Fengqi
401b3b19d5 btrfs: Remove root parameter from btrfs_unlink_subvol
All callers pass the root tree of dir, we can push that down to the
function itself.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:01 +02:00
Lu Fengqi
6025c19fb2 btrfs: Remove fs_info from btrfs_add_root_ref
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
Lu Fengqi
3ee1c5530e btrfs: Remove fs_info from btrfs_del_root_ref
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
Lu Fengqi
ab9ce7d42b btrfs: Remove fs_info from btrfs_del_root
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
Lu Fengqi
9add29457a btrfs: Remove fs_info from btrfs_delete_delayed_dir_index
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
Lu Fengqi
4465c8b422 btrfs: Remove fs_info from btrfs_insert_delayed_dir_index
It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:13:00 +02:00
David Sterba
b5851021f1 btrfs: extent-tree: remove unused member walk_control::for_reloc
Leftover after fix e339a6b097 ("Btrfs: __btrfs_mod_ref should always
use no_quota"), that removed it from the function calls but not the
structure.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
Filipe Manana
46b2f4590a Btrfs: fix send failure when root has deleted files still open
The more common use case of send involves creating a RO snapshot and then
use it for a send operation. In this case it's not possible to have inodes
in the snapshot that have a link count of zero (inode with an orphan item)
since during snapshot creation we do the orphan cleanup. However, other
less common use cases for send can end up seeing inodes with a link count
of zero and in this case the send operation fails with a ENOENT error
because any attempt to generate a path for the inode, with the purpose
of creating it or updating it at the receiver, fails since there are no
inode reference items. One use case it to use a regular subvolume for
a send operation after turning it to RO mode or turning a RW snapshot
into RO mode and then using it for a send operation. In both cases, if a
file gets all its hard links deleted while there is an open file
descriptor before turning the subvolume/snapshot into RO mode, the send
operation will encounter an inode with a link count of zero and then
fail with errno ENOENT.

Example using a full send with a subvolume:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ btrfs subvolume create /mnt/sv1
  $ touch /mnt/sv1/foo
  $ touch /mnt/sv1/bar

  # keep an open file descriptor on file bar
  $ exec 73</mnt/sv1/bar
  $ unlink /mnt/sv1/bar

  # Turn the subvolume to RO mode and use it for a full send, while
  # holding the open file descriptor.
  $ btrfs property set /mnt/sv1 ro true

  $ btrfs send -f /tmp/full.send /mnt/sv1
  At subvol /mnt/sv1
  ERROR: send ioctl failed with -2: No such file or directory

Example using an incremental send with snapshots:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ btrfs subvolume create /mnt/sv1
  $ touch /mnt/sv1/foo
  $ touch /mnt/sv1/bar

  $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1

  $ echo "hello world" >> /mnt/sv1/bar

  $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2

  # Turn the second snapshot to RW mode and delete file foo while
  # holding an open file descriptor on it.
  $ btrfs property set /mnt/snap2 ro false
  $ exec 73</mnt/snap2/foo
  $ unlink /mnt/snap2/foo

  # Set the second snapshot back to RO mode and do an incremental send.
  $ btrfs property set /mnt/snap2 ro true

  $ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2
  At subvol /mnt/snap2
  ERROR: send ioctl failed with -2: No such file or directory

So fix this by ignoring inodes with a link count of zero if we are either
doing a full send or if they do not exist in the parent snapshot (they
are new in the send snapshot), and unlink all paths found in the parent
snapshot when doing an incremental send (and ignoring all other inode
items, such as xattrs and extents).

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Reported-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
Filipe Manana
0d836392ca Btrfs: fix mount failure after fsync due to hard link recreation
If we end up with logging an inode reference item which has the same name
but different index from the one we have persisted, we end up failing when
replaying the log with an errno value of -EEXIST. The error comes from
btrfs_add_link(), which is called from add_inode_ref(), when we are
replaying an inode reference item.

Example scenario where this happens:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foo
  $ ln /mnt/foo /mnt/bar

  $ sync

  # Rename the first hard link (foo) to a new name and rename the second
  # hard link (bar) to the old name of the first hard link (foo).
  $ mv /mnt/foo /mnt/qwerty
  $ mv /mnt/bar /mnt/foo

  # Create a new file, in the same parent directory, with the old name of
  # the second hard link (bar) and fsync this new file.
  # We do this instead of calling fsync on foo/qwerty because if we did
  # that the fsync resulted in a full transaction commit, not triggering
  # the problem.
  $ touch /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  <power fail>

  $ mount /dev/sdb /mnt
  mount: mount /dev/sdb on /mnt failed: File exists

So fix this by checking if a conflicting inode reference exists (same
name, same parent but different index), removing it (and the associated
dir index entries from the parent inode) if it exists, before attempting
to add the new reference.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
Josef Bacik
4559b0a717 btrfs: don't leak ret from do_chunk_alloc
If we're trying to make a data reservation and we have to allocate a
data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if
it allocated a chunk.  Since the end of the function is the success path
just return 0.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
David Sterba
84db5ccf42 btrfs: merge free_fs_root helpers
The exported helper just calls the static one. There's no obvious reason
to have them separate eg. for performance reasons where the static one
could be better optimized in the same unit. There's a slight decrease in
code size and stack consumption.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
David Sterba
2ffad70ed3 btrfs: constify strings passed to assertion helper
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:59 +02:00
David Sterba
e9539cff04 btrfs: dev-replace: remove unused members of btrfs_dev_replace
Lock owner and nesting level have been unused since day 1, probably
copy&pasted from the extent_buffer locking scheme without much thinking.
The locking of device replace is simpler and does not need any lock
nesting.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
David Sterba
e17385ca29 btrfs: remove unused member btrfs_root::name
Added in 58176a9604 ("Btrfs: Add per-root block accounting and sysfs
entries") in 2007, the roots had names exported in sysfs. The code
was commented out in 4df27c4d5c ("Btrfs: change how subvolumes
are organized") and cleaned by 182608c829 ("btrfs: remove old
unused commented out code").

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Adam Borowski
616d374efa btrfs: allow defrag on a file opened read-only that has rw permissions
Requiring a read-write descriptor conflicts both ways with exec,
returning ETXTBSY whenever you try to defrag a program that's currently
being run, or causing intermittent exec failures on a live system being
defragged.

As defrag doesn't change the file's contents in any way, there's no
reason to consider it a rw operation.  Thus, let's check only whether
the file could have been opened rw.  Such access control is still needed
as currently defrag can use extra disk space, and might trigger bugs.

We return EINVAL when the request is invalid; here it's ok but merely
the user has insufficient privileges.  Thus, the EPERM return value
reflects the error better -- as discussed in the identical case for
dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the EPERM patch from Adam ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Josef Bacik
3c4276936f Btrfs: fix btrfs_write_inode vs delayed iput deadlock
We recently ran into the following deadlock involving
btrfs_write_inode():

[  +0.005066]  __schedule+0x38e/0x8c0
[  +0.007144]  schedule+0x36/0x80
[  +0.006447]  bit_wait+0x11/0x60
[  +0.006446]  __wait_on_bit+0xbe/0x110
[  +0.007487]  ? bit_wait_io+0x60/0x60
[  +0.007319]  __inode_wait_for_writeback+0x96/0xc0
[  +0.009568]  ? autoremove_wake_function+0x40/0x40
[  +0.009565]  inode_wait_for_writeback+0x21/0x30
[  +0.009224]  evict+0xb0/0x190
[  +0.006099]  iput+0x1a8/0x210
[  +0.006103]  btrfs_run_delayed_iputs+0x73/0xc0
[  +0.009047]  btrfs_commit_transaction+0x799/0x8c0
[  +0.009567]  btrfs_write_inode+0x81/0xb0
[  +0.008008]  __writeback_single_inode+0x267/0x320
[  +0.009569]  writeback_sb_inodes+0x25b/0x4e0
[  +0.008702]  wb_writeback+0x102/0x2d0
[  +0.007487]  wb_workfn+0xa4/0x310
[  +0.006794]  ? wb_workfn+0xa4/0x310
[  +0.007143]  process_one_work+0x150/0x410
[  +0.008179]  worker_thread+0x6d/0x520
[  +0.007490]  kthread+0x12c/0x160
[  +0.006620]  ? put_pwq_unlocked+0x80/0x80
[  +0.008185]  ? kthread_park+0xa0/0xa0
[  +0.007484]  ? do_syscall_64+0x53/0x150
[  +0.007837]  ret_from_fork+0x29/0x40

Writeback calls:

btrfs_write_inode
  btrfs_commit_transaction
    btrfs_run_delayed_iputs

If iput() is called on that same inode, evict() will wait for writeback
forever.

btrfs_write_inode() was originally added way back in 4730a4bc5b
("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode()
hasn't been used for O_SYNC since 148f948ba8 ("vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so
btrfs_write_inode() is actually unnecessary (and leads to a bunch of
unnecessary commits). Get rid of it, which also gets rid of the
deadlock.

CC: stable@vger.kernel.org # 3.2+
Signed-off-by: Josef Bacik <jbacik@fb.com>
[Omar: new commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Nikolay Borisov
97aff912a2 btrfs: Remove fs_info from btrfs_finish_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Nikolay Borisov
f4208794d0 btrfs: Remove fs_info form btrfs_free_chunk
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
4f5ad7bd63 btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdev
This function is always passed a well-formed tgtdevice so the fs_info
can be referenced from there.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
d6507cf1e2 btrfs: Remove fs_info from btrfs_assign_next_active_device
It can be referenced from the passed 'device' argument which is always
a well-formed device.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
5495f195fc btrfs: remove fs_info argument from update_dev_stat_item
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
68a9db5f23 btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdev
It can be referenced from the passed srcdev argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
8e87e85627 btrfs: Remove fs_info argument from btrfs_add_dev_item
It can be referenced form the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
Qu Wenruo
5e23a6fea6 btrfs: extent-tree: Remove dead alignment check
In find_free_extent() under checks: label, we have the following code:

		search_start = ALIGN(offset, fs_info->stripesize);
		/* move on to the next group */
		if (search_start + num_bytes >
		    block_group->key.objectid + block_group->key.offset) {
			btrfs_add_free_space(block_group, offset, num_bytes);
			goto loop;
		}
		if (offset < search_start)
			btrfs_add_free_space(block_group, offset,
					     search_start - offset);
		BUG_ON(offset > search_start);

However ALIGN() is rounding up, thus @search_start >= @offset and that
BUG_ON() will never be triggered.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
Filipe Manana
ca5d2ba1ae Btrfs: remove unused key assignment when doing a full send
At send.c:full_send_tree() we were setting the 'key' variable in the loop
while never using it later. We were also using two btrfs_key variables
to store the initial key for search and the key found in every iteration
of the loop. So remove this useless key assignment and use the same
btrfs_key variable to store the initial search key and the key found in
each iteration. This was introduced in the initial send commit but was
never used (commit 31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for
btrfs send/receive").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
5cdc84bfde btrfs: drop extent_io_ops::set_range_writeback callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
00032d38ea btrfs: drop extent_io_ops::merge_bio_hook callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
05912a3c04 btrfs: drop extent_io_ops::tree_fs_info callback
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
e288c080dd btrfs: unify end_io callbacks of async_submit_bio
The end_io callbacks passed to btrfs_wq_submit_bio
(btrfs_submit_bio_done and btree_submit_bio_done) are effectively the
same code, there's no point to do the indirection. Export
btrfs_submit_bio_done and call it directly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
d7cbfafc4b btrfs: remove unused member async_submit_bio::bio_flags
After splitting the start and end hooks in a758781d4b ("btrfs:
separate types for submit_bio_start and submit_bio_done"), some of
the function arguments were dropped but not removed from the structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
d7e8555b1d btrfs: remove unused member async_submit_bio::fs_info
Introduced by c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data") to be used in run_one_async_done where it got
unused after 736cd52e0c ("Btrfs: remove nr_async_submits and
async_submit_draining").

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Gu Jinxiang
315409b009 btrfs: validate type when reading a chunk
Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an
image that has an invalid chunk type but does not return an error.

Add chunk type check in btrfs_check_chunk_valid, to detect the wrong
type combinations.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839
Reported-by: Xu Wen <wen.xu@gatech.edu>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Nikolay Borisov
b0132a3be5 btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_UNMAPPED
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Nikolay Borisov
07e21c4dad btrfs: Document locking requirement via lockdep_assert_held
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
David Sterba
55ac01396a btrfs: rename btrfs_release_extent_buffer_page
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page").  Update the name and comment.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
d64766fdf9 btrfs: Refactor loop in btrfs_release_extent_buffer_page
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.

The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
b16d011e79 btrfs: Reword dodgy comments in alloc_extent_buffer
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
28187ae569 btrfs: Simplify page unlocking in alloc_extent_buffer
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").

However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Qu Wenruo
8b9b6f2554 btrfs: scrub: cleanup the remaining nodatasum fixup code
Remove the remaining code that misused the page cache pages during
device replace and could cause data corruption for compressed nodatasum
extents. Such files do not normally exist but there's a bug that allows
this combination and the corruption was exposed by device replace fixup
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
David Sterba
46df06b85e btrfs: refactor block group replication factor calculation to a helper
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Anand Jain
321a4bf72b btrfs: use the assigned fs_devices instead of the dereference
We have assigned the %fs_info->fs_devices in %fs_devices as its not
modified just use it for the mutex_lock().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
62088ca742 btrfs: qgroup: Drop fs_info parameter from qgroup_rescan_leaf
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
a937742250 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
280f8bd2cb btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8696d76045 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
deb4062743 btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree
The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8d38d7eb7b btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
a95f3aafd6 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
f0042d5e92 btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
3efbee1d00 btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
49a05ecde3 btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
39616c2735 btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
6b36f1aa5c btrfs: qgroup: Drop fs_info parameter from __del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
9f8a6ce6ba btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
2e980acdd8 btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item
They can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
3e07e9a09f btrfs: qgroup: Drop root parameter from update_qgroup_info_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
ac8a866af1 btrfs: qgroup: Drop root parameter from update_qgroup_limit_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
69104618f4 btrfs: qgroup: Drop quota_root parameter from del_qgroup_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
99d7f09ac0 btrfs: qgroup: Drop quota_root parameter from del_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
711169c40f btrfs: qgroup: Drop quota_root parameter from add_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Anand Jain
fa59f27c8c btrfs: rename btrfs_parse_early_options
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
btrfs_parse_early_options() parses the -o device options and scan the
device provided. So this rename specifies its action. Also the function
name is in line with btrfs_parse_subvol_options().
No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Lu Fengqi
c8389d4c0d btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit
Since commit 0b246afa62 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get
fs_info::nodesize.  In fact, it can be dropped after commit 707e8a0715
("btrfs: use nodesize everywhere, kill leafsize").

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Qu Wenruo
031f24da2c btrfs: Use btrfs_mark_bg_unused to replace open code
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Nikolay Borisov
2556fbb0be btrfs: Rewrite retry logic in do_chunk_alloc
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Ethan Lien
dec59fa3a7 btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.

[Test]
We tested the patch on a 4-cores x86 machine:

1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:            82             82
        0.21           0.61          29.46           0.36         298340
      635973           0.09          11.01      173476.25           0.27

patched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:             1              1
        0.62           0.62           0.62           0.62          13601
       31542           0.14           9.61       11016.90           0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Ethan Lien
d814a49198 btrfs: use correct compare function of dirty_metadata_bytes
We use customized, nodesize batch value to update dirty_metadata_bytes.
We should also use batch version of compare function or we will easily
goto fast path and get false result from percpu_counter_compare().

Fixes: e2d845211e ("Btrfs: use percpu counter for dirty metadata count")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Gu Jinxiang
36350e95a2 btrfs: return device pointer from btrfs_scan_one_device
Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.

And since btrfs_fs_devices can be obtained from btrfs_device, return that.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Gu Jinxiang
d64dcbd183 btrfs: make fs_devices a local variable in btrfs_parse_early_options
fs_devices is always passed to btrfs_scan_one_device which overrides it.
In the call stack below fs_devices is passed to btrfs_scan_one_device
from btrfs_mount_root.  In btrfs_mount_root the output fs_devices of
this call stack is not used.

btrfs_mount_root
  btrfs_parse_early_options
    btrfs_scan_one_device

So, it is not necessary to pass fs_devices from btrfs_mount_root, using
a local variable in btrfs_parse_early_options is enough.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
81ffd56b57 btrfs: fix mount and ioctl device scan ioctl race
Technically this extends the critical section covered by uuid_mutex to:

- parse early mount options -- here we can call device scan on paths
  that can be passed as 'device=/dev/...'

- scan the device passed to mount

- open the devices related to the fs_devices -- this increases
  fs_devices::opened

The race can happen when mount calls one of the scans and there's
another one called eg. by mkfs or 'btrfs dev scan':

Mount                                  Scan
-----                                  ----
scan_one_device (dev1, fsid1)
                                       scan_one_device (dev2, fsid1)
				           add the device
					   free stale devices
					       fsid1 fs_devices::opened == 0
					           find fsid1:dev1
					           free fsid1:dev1
					           if it's the last one,
					            free fs_devices of fsid1
						    too

open_devices (dev1, fsid1)
   dev1 not found

When fixed, the uuid mutex will make sure that mount will increase
fs_devices::opened and this will not be touched by the racing scan
ioctl.

Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
399f7f4c42 btrfs: reorder initialization before the mount locks uuid_mutex
In preparation to take a big lock, move resource initialization before
the critical section. It's not obvious from the diff, the desired order
is:

- initialize mount security options
- allocate temporary fs_info
- allocate superblock buffers

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
5139cff598 btrfs: lift uuid_mutex to callers of btrfs_parse_early_options
Prepartory work to fix race between mount and device scan.

btrfs_parse_early_options calls the device scan from mount and we'll
need to let mount completely manage the critical section.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
David Sterba
f5194e34ca btrfs: lift uuid_mutex to callers of btrfs_open_devices
Prepartory work to fix race between mount and device scan.

The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
David Sterba
899f9307c3 btrfs: lift uuid_mutex to callers of btrfs_scan_one_device
Prepartory work to fix race between mount and device scan.

The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
7bcb8164ad btrfs: use device_list_mutex when removing stale devices
btrfs_free_stale_devices() finds a stale (not opened) device matching
path in the fs_uuid list. We are already under uuid_mutex so when we
check for each fs_devices, hold the device_list_mutex too.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
fa6d2ae540 btrfs: rename local devices for fs_devices in btrfs_free_stale_devices(
Over the years we named %fs_devices and %devices to represent the
struct btrfs_fs_devices and the struct btrfs_device. So follow the same
scheme here too. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
9c6d173ea6 btrfs: extend locked section when adding a new device in device_list_add
Make sure the device_list_lock is held the whole time:

* when the device is being looked up
* new device is initialized and put to the list
* the list counters are updated (fs_devices::opened, fs_devices::total_devices)

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Anand Jain
4306a97449 btrfs: do btrfs_free_stale_devices outside of device_list_add
btrfs_free_stale_devices() looks for device path reused for another
filesystem, and deletes the older fs_devices::device entry.

In preparation to handle locking in device_list_add, move
btrfs_free_stale_devices outside as these two functions serve a
different purpose.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Nikolay Borisov
959b1c0467 btrfs: close devices without offloading to a temporary list
Since commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e ("btrfs: fix a
possible umount deadlock") aimed to fix no longer exists, we can free
the devices immediatelly and remove the code that does the pending work.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Qu Wenruo
621567a28c btrfs: Remove unused function btrfs_account_dev_extents_size
This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317 ("btrfs: obsolete and remove
mount option alloc_start").

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Gu Jinxiang
93b9bcdf9f btrfs: remove unused parameter from btrfs_parse_subvol_options
Since parameter flags is no more used since commit d740760656 ("btrfs:
split parse_early_options() in two"), remove it.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Anand Jain
b4993e64f7 btrfs: fix in-memory value of total_devices after seed device deletion
In case of deleting the seed device the %cur_devices (seed) and the
%fs_devices (parent) are different. Now, as the parent
fs_devices::total_devices also maintains the total number of devices
including the seed device, so decrement its in-memory value for the
successful seed delete. We are already updating its corresponding
on-disk btrfs_super_block::number_devices value.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
Nikolay Borisov
340f1aa27f btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:

- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed

This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.

This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.

Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
c7b562c548 btrfs: raid56: catch errors from full_stripe_write
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
176571a1f6 btrfs: raid56: merge rbio_is_full helpers
There's only one call site of the unlocked helper so it can be folded
into the caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
a81b747d0f btrfs: raid56: use new helper for async_scrub_parity
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
e66d8d5a41 btrfs: raid56: use new helper for async_read_rebuild
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
cf6a4a7587 btrfs: raid56: use new helper for async_rmw_stripe
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
ac63885907 btrfs: raid56: add new helper for starting async work
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
ebcc326316 btrfs: open-code bio_set_op_attrs
The helper is trivial and marked as deprecated.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
cc5e31a477 btrfs: switch types to int when counting eb pages
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
8791d43207 btrfs: use round_up wrapper in num_extent_pages
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
65ad010488 btrfs: pass only eb to num_extent_pages
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
d7f663fa3f btrfs: prune unused includes
Remove includes if none of the interfaces and exports is used in the
given source file.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
69d2480456 btrfs: use copy_page for copying pages instead of memcpy
Use the helper that's possibly optimized for full page copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
3ffbd68c48 btrfs: simplify pointer chasing of local fs_info variables
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.

If the transaction handle is available and not NULL it's used instead.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
3750851562 btrfs: simplify some assignments of inode numbers
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
Zhihui Zhang
8f6c72a9e0 Btrfs: free space cache: make sure there is always room for generation number
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.

Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Anand Jain
694c51fb2e btrfs: drop unnecessary variable in btrfs_init_new_device
There is only usage of the declared devices variable, instead use its
value directly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Anand Jain
5da54bc138 btrfs: use a temporary variable for fs_devices in btrfs_init_new_device
There are many instances of the %fs_info->fs_devices pointer
dereferences, use a temporary variable instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
389305b2aa btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.

It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
ba480dd4db btrfs: tree-checker: Detect invalid and empty essential trees
A crafted image has empty root tree block, which will later cause NULL
pointer dereference.

The following trees should never be empty:
1) Tree root
   Must contain at least root items for extent tree, device tree and fs
   tree

2) Chunk tree
   Or we can't even bootstrap as it contains the mapping.

3) Fs tree
   At least inode item for top level inode (.).

4) Device tree
   Dev extents for chunks

5) Extent tree
   Must have corresponding extent for each chunk.

If any of them is empty, we are sure the fs is corrupted and no need to
mount it.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
fce466eab7 btrfs: tree-checker: Verify block_group_item
A crafted image with invalid block group items could make free space cache
code to cause panic.

We could detect such invalid block group item by checking:
1) Item size
   Known fixed value.
2) Block group size (key.offset)
   We have an upper limit on block group item (10G)
3) Chunk objectid
   Known fixed value.
4) Type
   Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
   No more than 1 bit set for profile type.
5) Used space
   No more than the block group size.

This should allow btrfs to detect and refuse to mount the crafted image.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
David Sterba
6d8ff4e458 btrfs: annotate unlikely branches after V0 extent type removal
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
ba3c2b196b btrfs: Add graceful handling of V0 extents
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.

In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
a79865c680 btrfs: Remove V0 extent support
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Chengguang Xu
4de426cd39 btrfs: remove unnecessary curly braces in btrfs_get_acl
It's only coding style fix not functinal change.  When if/else has only
one statement then the braces are not needed.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Chengguang Xu
dc7789ef87 btrfs: avoid error code override in btrfs_get_acl
It's not good to override the error code when failing from
btrfs_getxattr() in btrfs_get_acl() because it hides the real reason of
the failure.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
5ee552da50 btrfs: remove unnecessary -ERANGE check in btrfs_get_acl
There is no chance to get into -ERANGE error condition because we first
call btrfs_getxattr to get the length of the attribute, then we do a
subsequent call with the size from the first call.  Between the 2 calls
the size shouldn't change. So remove the unnecessary -ERANGE error
check.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
7e35eab958 btrfs: replace empty string with NULL when getting attribute length in btrfs_get_acl
In btrfs_get_acl() the first call of btr_getxattr() is for getting the
length of attribute, the value buffer is never used in this case. So
it's better to replace empty string with NULL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
ab3629ed86 btrfs: return error instead of crash when detecting unexpected type in btrfs_get_acl
The caller of btrfs_get_acl() checks error condition so there is no
impact from this change. In practice there is no chance to get into
default case of switch statement because VFS has already checked the
type.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Su Yue
af431dcb24 btrfs: return EUCLEAN if extent_inline_ref type is invalid
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Goldwyn Rodrigues
e4af400a9c btrfs: Use iocb to derive pos instead of passing a separate parameter
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.

generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Su Yue
893bf4b115 btrfs: print more details when checking tree block finds a problem
For easier debugging, print eb->start if level is invalid.  Also make
clear if bytenr found is not expected.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Nikolay Borisov
7b4284de93 btrfs: Streamline memory allocation failure handling in btrfs_add_delayed_tree_ref
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by simply re-arranging the code so
that allocations are the in the beginning of the function. This allows
to use simple return statements. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Qu Wenruo
4379444654 btrfs: Don't remove block group that still has pinned down bytes
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:

BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096

WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40

BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680

[CAUSE]
It's caused by race with block group auto removal:

- There is a meta block group X, which has only one tree block
  The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
  The tree block gets COWed, so the block group X is empty, and marked
  as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
  Which will call btrfs_delete_unused_bgs() to remove unused block
  groups.
  So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
  Quota needs to get the original reference of the extent, which will
  read tree blocks of commit root of 257.
  Then since the chunk map gets removed, the above warning gets
  triggered.

[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.

However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Geert Uytterhoeven
bc931c0ef8 btrfs: Refactor count handling in btrfs_unpin_free_ino
With gcc 4.1.2:

    fs/btrfs/inode-map.c: In function ‘btrfs_unpin_free_ino’:
    fs/btrfs/inode-map.c:241: warning: ‘count’ may be used uninitialized in this function

While this warning is a false-positive, it can easily be killed by
refactoring the code.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Arnd Bergmann
d3c6be6fda btrfs: use timespec64 for i_otime
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Arnd Bergmann
afd48513f0 btrfs: use monotonic time for transaction handling
The transaction times were changed to ktime_get_real_seconds to avoid
the y2038 overflow, but they still have a minor problem when they go
backwards or jump due to settimeofday() or leap seconds.

This changes the transaction handling to instead use ktime_get_seconds(),
which returns a CLOCK_MONOTONIC timestamp that has neither of those
problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Qu Wenruo
e41ca58974 btrfs: Get rid of the confusing btrfs_file_extent_inline_len
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
bc877d285c btrfs: Deduplicate extent_buffer init code
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Qu Wenruo
9912bbf644 btrfs: check-integrity: Fix NULL pointer dereference for degraded mount
Commit f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
changed how btrfsic indexes device state.

Now we need to access device->bdev->bd_dev, while for degraded mount
it's completely possible to have device->bdev as NULL, thus it will
trigger a NULL pointer dereference at mount time.

Fix it by checking if the device is degraded before accessing
device->bdev->bd_dev.

There are a lot of other places accessing device->bdev->bd_dev, however
the other call sites have either checked device->bdev, or the
device->bdev is passed from btrfsic_map_block(), so it won't cause harm.

Fixes: f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
43a7e99db6 btrfs: Remove fs_info from btrfs_force_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
c83488afc5 btrfs: Remove fs_info from btrfs_inc_block_group_ro
It can be referenced from the passed bg cache.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
61da2abfca btrfs: Remove fs_info from btrfs_alloc_logged_file_extent
It can be referenced from trans since the function is always called
within a valid transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
87cc7a8a2a btrfs: Remove fs_info from remove_extent_backref
It can be referenced directly from the transaction handle since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
5fac7f9ee1 btrfs: Remove fs_info from run_one_delayed_ref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
a639cdeba3 btrfs: Remove fs_info from insert_inline_extent_backref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
3c4da6574e btrfs: Remove fs_info from exclude_super_stripes
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
9e715da860 btrfs: Remove fs_info from free_excluded_extents
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
451a2c1303 btrfs: Remove fs_info from check_system_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
c216b2039a btrfs: Remove fs_info from btrfs_alloc_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
01458828bb btrfs: Remove fs_info from do_chunk_alloc
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
f97806f2ee btrfs: Remove fs_info from run_delayed_tree_ref
It can always be referneced from the passed transaction handle since
it's always valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
f9871eddd9 btrfs: Remove fs_info from cleanup_ref_head
fs_info can be refenreced from the transaction handle, since it's always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
c4d56d4a16 btrfs: Remove unused fs_info from cleanup_extent_op
The argument is no longer used so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
20b9a2d670 btrfs: Remove fs_info from run_delayed_extent_op
This function is always called with a valid transaction handle so
fs_info can be referenced from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2bf98ef35f btrfs: Remove fs_info from run_delayed_data_ref
This function is always called with a valid transaction from where
fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2590d0f155 btrfs: Remove fs_info argument from __btrfs_inc_extent_ref
This function already takes a transaction which holds a reference to
the fs_info struct. Use that reference and remove the extra arg. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
ef89b8245b btrfs: Remove fs_info from alloc_reserved_file_extent
fs_info can be referenced from the transaction handle, which is always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e72cb9235d btrfs: Remove fs_info from __btrfs_free_extent
This function is always called with a valid transaction handle so we
can reference the fs_info from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
5a98ec0141 btrfs: Remove fs_info from btrfs_remove_block_group
This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e7e02096d9 btrfs: Remove fs_info from btrfs_make_block_group
This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
88a979c615 btrfs: Remove fs_info from btrfs_add_delayed_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
44e1c47d5c btrfs: Remove fs_info from btrfs_add_delayed_tree_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
fbe4801b26 btrfs: Remove fs_info from lookup_extent_backref
This argument is unused. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
bd1d53ef35 btrfs: Remove fs_info argument from lookup_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
b8582eeabb btrfs: Remove fs_info argument from lookup_tree_block_ref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
61a18f1c66 btrfs: Remove fs_info argument from update_inline_extent_backref
This function always uses the leaf's extent_buffer which already
contains a reference to the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
867cc1fbeb btrfs: Remove fs_info from lookup_inline_extent_backref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
b167fa9152 btrfs: Remove fs_info from fixup_low_keys
This argument is unused. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
e9f6290d59 btrfs: Remove fs_info from remove_extent_data_ref
This function is always called with a valid transaction from where the
fs_info can be referenced. No functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
375934105c btrfs: Remove fs_info argument from insert_extent_backref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
62b895af40 btrfs: Remove fs_info from insert_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. So remove the redundant argument.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
10728404c6 btrfs: Remove fs_info from insert_tree_block_ref
This function is always called with a valid transaction so there is no
need to duplicate the fs_info, we can reference it directly from the
trans handle. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
edf57cbf2b btrfs: Fix a C compliance issue
The C programming language does not allow to use preprocessor statements
inside macro arguments (pr_info() is defined as a macro). Hence rework
the pr_info() statement in btrfs_print_mod_info() such that it becomes
compliant. This patch allows tools like sparse to analyze the BTRFS
source code.

Fixes: 62e855771d ("btrfs: convert printk(KERN_* to use pr_* calls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
acd43e3cdf btrfs: Annotate fall-through when parsing mount option
This patch avoids that the compiler complains that a fall-through
annotation is missing when building with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
bece2e8239 btrfs: Fix misleading indentation reported by smatch
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Nikolay Borisov
a9ecb653b0 btrfs: Streamline log_extent_csums a bit
Currently this function takes the root as an argument only to get the
log_root from it. Simplify this by directly passing the log root from
the caller. Also eliminate the fs_info local variable, since it's used
only once, so directly reference it from the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
David Sterba
ca5788aba3 btrfs: remove remaing full_sync logic from btrfs_sync_file
The logic to check if the inode is already in the log can now be
simplified since we always wait for the ordered extents to complete
before deciding whether the inode needs to be logged. The big comment
about it can go away too.

CC: Filipe Manana <fdmanana@suse.com>
Suggested-by: Filipe Manana <fdmanana@suse.com>
[ code and changelog copied from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Josef Bacik
5636cf7d6d btrfs: remove the logged extents infrastructure
This is no longer used anywhere, remove all of it.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
a2120a473a btrfs: clean up the left over logged_list usage
We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
e7175a6927 btrfs: remove the wait ordered logic in the log_one_extent path
Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here.  Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
b5e6c3e170 btrfs: always wait on ordered extents at fsync time
There's a priority inversion that exists currently with btrfs fsync.  In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second.  However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Nikolay Borisov
16d1c062c7 btrfs: Fix comment in lookup_inline_extent_backref
The comment wrongfully states that the owner parameter is the level of
the parent block. In fact owner is the level of the current block and
by adding 1 to it we can eventually get to the parent/root.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Nikolay Borisov
bd3c685ed9 btrfs: Document __btrfs_inc_extent_ref
Here is a doc-only patch which tires to deobfuscate the terra-incognita
that arguments for delayed refs are.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Qu Wenruo
9bebe665c3 btrfs: scrub: Remove unused copy_nocow_pages and its callchain
Since commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages
for device replace") the function is not used and we can remove all
functions down the call chain.

There was an optimization that reused inode pages to speed up device
replace, but broke when there was nodatasum and compressed page. The
potential performance gain is small so we don't loose much by removing
it and using scrub_pages same as the other pages.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Allen Pais
a944442c2b btrfs: replace get_seconds with new 64bit time API
The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Christoph Hellwig
e8693bcfa0 aio: allow direct aio poll comletions for keyed wakeups
If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:39 +02:00
Christoph Hellwig
bfe4037e72 aio: implement IOCB_CMD_POLL
Simple one-shot poll through the io_submit() interface.  To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL.  It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:33 +02:00
Christoph Hellwig
9018ccc453 aio: add a iocb refcount
This is needed to prevent races caused by the way the ->poll API works.
To avoid introducing overhead for other users of the iocbs we initialize
it to zero and only do refcount operations if it is non-zero in the
completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:28 +02:00
Christoph Hellwig
7dda712818 timerfd: add support for keyed wakeups
This prepares timerfd for use with aio poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:08 +02:00
Jens Axboe
05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
David S. Miller
c1c8626fce Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes, mostly trivial in nature.

The mlxsw conflict was resolving using the example
resolution at:

https://github.com/jpirko/linux_mlxsw/blob/combined_queue/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 13:04:31 -07:00
Gustavo A. R. Silva
7964410fcf fs: dcache: Use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:52:44 -04:00
Al Viro
808aa6c5e3 Merge branch 'work.hpfs' into work.lookup 2018-08-05 15:51:10 -04:00
Al Viro
1401a0fc2d afs_try_auto_mntpt(): return NULL instead of ERR_PTR(-ENOENT)
simpler logics in callers that way

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:50:59 -04:00
Al Viro
34b2a88fb4 afs_lookup(): switch to d_splice_alias()
->lookup() methods can (and should) use d_splice_alias() instead of
d_add().  Even if they are not going to be hit by open_by_handle(),
code does get copied around; besides, d_splice_alias() has better
calling conventions for use in ->lookup(), so the code gets simpler.

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:44:14 -04:00
Al Viro
855371bd01 afs: switch dynroot lookups to d_splice_alias()
->lookup() methods can (and should) use d_splice_alias() instead of
d_add().  Even if they are not going to be hit by open_by_handle(),
code does get copied around...

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:41:16 -04:00
Linus Torvalds
60f5a21736 - Fix JFS usercopy whitelist (it needed to cover neighboring field too) for
"overflow" inline inode data.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltlv70WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj3eD/9+goWbs9U62tlO2cIqT5lgTFX9
 3vvVpqNJ3atw/fU6SZoa0Z2nRa9TIpZEfEljgARGhCyd2p2MplLJalWo6bq/gUUA
 5aWQbVxhyVXUp8kh8m3OnZsAZz658Y5geLMk8vakXdyQ//PF43wO0cZyOFYdG3ec
 sYuWA318UKaIsxqB9tT/K8YBRjjBgJ8wjgtpoSAr+FUhgg9Qqp3NL4fjr6SOEQrv
 XWXLVLLhyUo8kQJ29E/VyzIfLysgiA67O4ClW+DEyD6rr/ZK9XOeG6yv3vwLGSHI
 06/4BXMJce23iGIYf57Jz7b5tfAaZntdzNqUiTW6up0TuvG09auwjv/hKkwfjQv2
 fNf0TVnYOCN4ZWbCm4FTEU5q31u+pZDHiRxOTJG9EbuBVEsFiV5X9B9VTLoA+dWK
 SqWmE2X9YdONt9q6K8TGpxxrdQnQXGKOOtEY23KoF04XGYQtutIfj7cdFCDDX/eC
 AjcOIfV3u8dHWjc/wKIYS5XRUzbkdpEeNOaiCQ3RN79JN2UrA0/w7lAxZTICmYCs
 HtMtaFTER5weKQGzTzg0SP3M95qKPdWhlIq5lspdhGedonZAKBNfGahMltTH2UoY
 vIb1qGT8uz8tQSzGsu0uvrR92ZIJZ0qUVjB/nhE9HxTz26xqtDDlLXpdIXTGrnU4
 hM+Omud//MNgNbsmrg==
 =pwq8
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull usercopy whitelisting fix from Kees Cook:
 "Bart Massey discovered that the usercopy whitelist for JFS was
  incomplete: the inline inode data may intentionally "overflow" into
  the neighboring "extended area", so the size of the whitelist needed
  to be raised to include the neighboring field"

* tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  jfs: Fix usercopy whitelist for inline inode data
2018-08-04 18:34:55 -07:00
Linus Torvalds
f639bef55d Changes since last update:
- Fix incorrect shifting in the iomap bmap functions.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltj7lcACgkQ+H93GTRK
 tOutzw//U9yOrUhuOTFrkmA2x+CB1ArncFZysUTXDFUsLRrMp0dwNaRV0WafyypO
 tCr4nvqgwg2VTFf21zWblOSAajstgJzH+x3FdxTsU5Gktd/BH9gCHSDV0lORnFhp
 jbqNOsBRPJjNIBzaIIAnyDGMuOF/MeWpGsbGF4eDcsgj58lr098rir7grEVgCkKM
 +eCMtqJwUXUZ2bo/grBWvZXnLms+8LMPYeRQMJSnw59DVunwxRdSbWumJwkPt9DD
 mz3Upa/qFJCZw2n9kluU1b/tnrpxkNWYuSjzp9iu0cMdo52HF+yDNHriUPCHa4PB
 3KfyGrhOvg+iC2pUAJ5mI3Dpv+NNAk7j/+4mOVtlUYIf0mi5gszIjNHbLiONKVH5
 5x8G59wM/ae6a1WcHe7tJRma7r984G+JLTIxAQnWaWhFq5AsYMOplk4oq6X7Lmir
 wAR7laDN0Xgl3WCFj6SXaKcnuzZFsE4A3SnZILtlgO/WxAujUyFEnICx1cO4Q9OY
 64txWii6ora9kdBtalolxjfGHwyScbhx6FiBQdKIznxGgBQR89X8hzxghdp7KTIx
 kxkC1hAM3KXXtokArjad2hgd8QG23jUshyBwKpfHnEwL75GiZQ3qtPdDm/oJlVWV
 ItnRSt+tGIT/fsT5szNPmLKgtIW4kQHflVuih0KR4IsGPMmZ8dU=
 =3omy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs bugfix from Darrick Wong:
 "One more patch for 4.18 to fix a coding error in the iomap_bmap()
  function introduced in -rc1: fix incorrect shifting"

* tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: fix iomap_bmap position calculation
2018-08-04 18:30:58 -07:00
zhong jiang
863c37fcb1 ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
The err is not used after initalization. So just remove the variable.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-08-04 17:34:07 -04:00
Kees Cook
961b33c244 jfs: Fix usercopy whitelist for inline inode data
Bart Massey reported what turned out to be a usercopy whitelist false
positive in JFS when symlink contents exceeded 128 bytes. The inline
inode data (i_inline) is actually designed to overflow into the "extended
area" following it (i_inline_ea) when needed. So the whitelist needed to
be expanded to include both i_inline and i_inline_ea (the whole size
of which is calculated internally using IDATASIZE, 256, instead of
sizeof(i_inline), 128).

$ cd /mnt/jfs
$ touch $(perl -e 'print "B" x 250')
$ ln -s B* b
$ ls -l >/dev/null

[  249.436410] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'jfs_ip' (offset 616, size 250)!

Reported-by: Bart Massey <bart.massey@gmail.com>
Fixes: 8d2704d382 ("jfs: Define usercopy region in jfs_ip slab cache")
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: jfs-discussion@lists.sourceforge.net
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-08-04 07:53:46 -07:00
Geliang Tang
1021bcf44d pstore: add zstd compression support
This patch added the 6th compression algorithm support for pstore: zstd.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-08-03 18:12:18 -07:00
Al Viro
c7b15a8657 jfs: don't bother with make_bad_inode() in ialloc()
We hit that when inumber allocation has failed.  In that case
the in-core inode is not hashed and since its ->i_nlink is 1
the only place where jfs checks is_bad_inode() won't be reached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:33 -04:00
Al Viro
d8e78da868 adfs: don't put inodes into icache
We never look them up in there; inode_fake_hash() will make them appear
hashed for mark_inode_dirty() purposes.  And don't leave them around
until memory pressure kicks them out - we never look them up again.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:33 -04:00
Al Viro
5bef915104 new helper: inode_fake_hash()
open-coded in a quite a few places...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:32 -04:00
Miklos Szeredi
e950564b97 vfs: don't evict uninitialized inode
iput() ends up calling ->evict() on new inode, which is not yet initialized
by owning fs.  So use destroy_inode() instead.

Add to sb->s_inodes list only if inode is not in I_CREATING state (meaning
that it wasn't allocated with new_inode(), which already does the
insertion).

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 80ea09a002 ("vfs: factor out inode_insert5()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:32 -04:00
Al Viro
a6cbedfa87 jfs: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:31 -04:00
Al Viro
2e5afe54e0 ext2: make sure that partially set up inodes won't be returned by ext2_iget()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:31 -04:00
Al Viro
5c1a68a358 udf: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:30 -04:00
Al Viro
dd54992776 ufs: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:30 -04:00
Al Viro
32955c5422 btrfs: switch to discard_new_inode()
Make sure that no partially set up inodes can be returned by
open-by-handle.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:29 -04:00
Al Viro
c2b6d621c4 new primitive: discard_new_inode()
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction.  However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.

	Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations.  That primitive unlocks new
inode, but leaves I_CREATING in place.

	iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
	insert_inode_locked() treats the same as instant -EBUSY.
	ilookup() treats those as icache miss.

[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 15:55:30 -04:00
David Howells
eb9950eb31 rxrpc: Push iov_iter up from rxrpc_kernel_recv_data() to caller
Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow
non-contiguous iovs to be passed down, thereby permitting file reading to
be simplified in the AFS filesystem in a future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 12:46:20 -07:00
Linus Torvalds
310810ae19 NFS client bugfixes for Linux 4.18
Highlights include:
 
 Bugfixes:
 - Fix a NFSv4 file locking regression
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbZFG/AAoJEA4mA3inWBJcyo8P/0MN6eRpDU2VCZheM7DgsN2L
 a7W/Phc1PjHoaqksQB4Zb3uMYQQCWAyvE/VB5nRJSF1ZQbKvv7kgkyqX+PxEG8LI
 Pp3igaVXzqkRj3eA33G1HXZCrymjf7sTb4CEdMgSUBe3wLXjrFPP4Og0RJ8YGhCG
 QBG0ZENwcseUk5bmbSpp9Ac60URy1si1pD1nB3z+zSQT2ViA8QHgzg3Hpwgm+0R7
 WG/QzIFoGoJ8J9sDX/tXdkwUT2Z3yusxXcwK7Be5dtUcu1codf+EaPxRNG55myRW
 J2fY+KIocgQK8lo3w9ok1sGTyN+YkS8eIQqTeZzg0Gty/LX7bwH/3ScCeQtbu9RH
 nAR2OJQkc/wJ8sJojmUmDnBgskvgWzdfxfxRGQwlnRMD0W3t0LUDCeIUZ/1OL69l
 4pgvFLaR5MRD/DS4sSftKcOpgH5KDTlfuUXA+PamELLAk93FWJEZTVI4hmUR02+h
 /0QoRE6FAraQ7IY9TuLd/Jj3wWmqvataL6JGuWSdmhd35PbRxxBun+5zCyj62BAM
 /h0SjrCMD+dhotcdiekHINNbNYRG6ukbswgP6zCtuq4icTCW8SMVNyI3mXUVQwF3
 hAc3FylKpdGkgSrK3unLnBSeBgGwnCy1PYtusx0MgJf/qhdPYsl0bwgZhcR1U01y
 WfyGrwoNhLEmxL6+zECQ
 =gMVX
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfix from Trond Myklebust:
 "Fix a NFSv4 file locking regression"

* tag 'nfs-for-4.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix _nfs4_do_setlk()
2018-08-03 10:42:01 -07:00
Huang Chong
a0e336ba3e xfs: fix a comment in xfs_log_reserve
Fix the comment in xfs_log_reserve to avoid confusing.

Signed-of-by: Huang Chong <huang.chong@zte.com.cn>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-03 08:17:54 -07:00
Darrick J. Wong
1f31c98d65 xfs: only validate summary counts on primary superblock
Skip the summary counter checks for secondary superblocks and inprogress
primary superblocks because mkfs has always written those out with
zeroed summary counters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-03 08:17:35 -07:00
Andreas Gruenbacher
21e2156f3c gfs2: Get rid of gfs2_ea_strlen
Function gfs2_ea_strlen is only called from ea_list_i, so inline it
there.  Remove the duplicate switch statement and the creative use of
memcpy to set a null byte.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-08-03 13:20:02 +01:00
Thomas Bianchi
c2b6e1591b xfs: substitute spaces with tabs
Inside xfs_attr_shortform_list removes spaces at the beginnig of the line
and replaces with tabs.
Issue found by checkpatch.

ERROR: code indent should use tabs where possible

Signed-off-by: Thomas Bianchi <thomas.bianchi8@gmail.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9d9e623385 xfs: fold dfops into the transaction
struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.

Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
c03edc9e49 xfs: always defer agfl block frees
The AGFL fixup code conditionally defers block frees from the free
list based on whether the current transaction has an associated
xfs_defer_ops structure. Now that dfops is embedded in the
transaction and the internal dfops is used unconditionally, this
invariant is always true.

Remove the now dead logic to check for ->t_dfops in
xfs_alloc_fix_freelist() and unconditionally defer AGFL block frees.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
0f37d1780c xfs: pass transaction to xfs_defer_add()
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.

Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.

This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
1ae093cbea xfs: replace xfs_defer_ops ->dop_pending with on-stack list
The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).

Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.

Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9b1f4e9831 xfs: cancel dfops on xfs_defer_finish() error
The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.

More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.

Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
60f31a609e xfs: clean out superfluous dfops dop params/vars
The dfops code still passes around the xfs_defer_ops pointer
superfluously in a few places. Clean this up wherever the
transaction will suffice.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
7dbddbaccd xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
a8198666fb xfs: automatic dfops inode relogging
Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.

Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
82ff27bc52 xfs: automatic dfops buffer relogging
Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.

Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
488c919a5b xfs: add missing defer ijoins for held inodes
Log items that require relogging during deferred operations
processing are explicitly joined to the associated dfops via the
xfs_defer_*join() helpers. These calls imply that the associated
object is "held" by the transaction such that when rolled, the item
can be immediately joined to a follow up transaction. For buffers,
this means the buffer remains locked and held after each roll. For
inodes, this means that the inode remains locked.

Failure to join a held item to the dfops structure means the
associated object pins the tail of the log while dfops processing
completes, because the item never relogs and is not unlocked or
released until deferred processing completes.

Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
with deferred operations are explicitly joined to the dfops. This is
not the case for inodes, however, as various contexts defer
operations to transactions with held inodes without explicit joins
to the associated dfops (and thus not relogging).

While this is not a catastrophic problem, it is not ideal. Given
that we want to eventually relog such items automatically during
dfops processing, start by explicitly adding these missing
xfs_defer_ijoin() calls. A call is added everywhere an inode is
joined to a transaction without transferring lock ownership and
said transaction runs deferred operations.

All xfs_defer_ijoin() calls will eventually be replaced by automatic
dfops inode relogging. This patch essentially implements the
behavior change that would otherwise occur due to automatic inode
dfops relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
1214f1cf66 xfs: replace dop_low with transaction flag
The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.

Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
ce356d6477 xfs: pass transaction to dfops reset/move helpers
All callers pass ->t_dfops of the associated transactions. Refactor
the helpers to receive the transactions and facilitate further
cleanups between xfs_defer_ops and xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
7279aa13b8 xfs: remove unused __xfs_defer_cancel() internal helper
With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
fbfa977d25 xfs: use transaction for intent recovery instead of raw dfops
Log intent recovery is the last user of an external (on-stack)
dfops. The pattern exists because the dfops is used to collect
additional deferred operations queued during the whole recovery
sequence. The dfops is finished with a new transaction after intent
recovery completes.

We already have a mechanism to create an empty, container-like
transaction to support the scrub infrastructure. We can reuse that
mechanism here to drop the final user of external dfops. This
facilitates folding dfops state (i.e., dop_low) into the
transaction, the elimination of now unused external dfops support
and also eliminates the only caller of __xfs_defer_cancel().

Replace the on-stack dfops with an empty transaction and pass it
around to the various helpers that queue and finish deferred
operations during intent recovery.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
98719051e7 xfs: refactor internal dfops initialization
The current transaction allocation code conditionally initializes
the ->t_dfops indirection pointer. Transaction commit/cancel check
the validity of the pointer to determine whether to finish/cancel
the internal dfops.

This disallows the ability to use the internal dfops list as a
temporary container (via xfs_trans_alloc_empty()). Refactor
transaction allocation to always initialize ->t_dfops and check
permanent reservation state on transaction commit/cancel.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Mike Rapoport
31e810aa10 userfaultfd: remove uffd flags from vma->vm_flags if UFFD_EVENT_FORK fails
The fix in commit 0cbb4b4f4c ("userfaultfd: clear the
vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
that were copied from the parent process VMA.

As the result, there is an inconsistency between the values of
vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
in userfaultfd_release().

Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
failure resolves the issue.

Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 0cbb4b4f4c ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 16:03:40 -07:00
Eric Sandeen
79b3dbe4ad fs: fix iomap_bmap position calculation
The position calculation in iomap_bmap() shifts bno the wrong way,
so we don't progress properly and end up re-mapping block zero
over and over, yielding an unchanging physical block range as the
logical block advances:

# filefrag -Be file
 ext:   logical_offset:     physical_offset: length:   expected: flags:
   0:      0..       0:      21..        21:      1:             merged
   1:      1..       1:      21..        21:      1:         22: merged
Discontinuity: Block 1 is at 21 (was 22)
   2:      2..       2:      21..        21:      1:         22: merged
Discontinuity: Block 2 is at 21 (was 22)
   3:      3..       3:      21..        21:      1:         22: merged

This breaks the FIBMAP interface for anyone using it (XFS), which
in turn breaks LILO, zipl, etc.

Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 89eb1906a9 ("iomap: add an iomap-based bmap implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 13:09:27 -07:00
Bart Van Assche
2afc9166f7 scsi: sysfs: Introduce sysfs_{un,}break_active_protection()
Introduce these two functions and export them such that the next patch
can add calls to these functions from the SCSI core.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-08-02 15:53:15 -04:00
Chengguang Xu
8687a3e2c7 ceph: add additional offset check in ceph_write_iter()
If the offset is larger or equal to both real file size and
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu
0671e9968d ceph: add additional range check in ceph_fallocate()
If the range is larger than both real file size and limit of
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu
719784ba70 ceph: add new field max_file_size in ceph_fs_client
In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:27 +02:00
Ilya Dryomov
6daca13d2e libceph: add authorizer challenge
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply).  This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge).  The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Souptick Joarder
24499847e4 ceph: adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:20 +02:00
Arnd Bergmann
0ed1e90a09 ceph: use timespec64 for r_stamp
The ceph_mds_request stamp still uses the deprecated timespec structure,
this converts it over as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:19 +02:00
Arnd Bergmann
fac02ddf91 libceph: use timespec64 for r_mtime
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

[ Remove now redundant ts variable in writepage_nounlock(). ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:14 +02:00
Arnd Bergmann
9bbeab41ce ceph: use timespec64 for inode timestamp
Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.

In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Arnd Bergmann
63ecae7e43 ceph: stop using current_kernel_time()
ceph_mdsc_create_request() is one of the last callers of the
deprecated current_kernel_time() as well as timespec_trunc().

This changes it to use the timespec64 based interfaces instead,
though we still need to convert the result until we are ready to
change over req->r_stamp.

The output of the two functions, ktime_get_coarse_real_ts64() and
current_kernel_time() is the same coarse-granular timestamp,
the only difference here is that ktime_get_coarse_real_ts64()
doesn't overflow in 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
67fcd15140 ceph: add d_drop for some error cases in ceph_symlink()
When file num exceeds quota limit, should call d_drop to drop
dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
0459871c49 ceph: add d_drop for some error cases in ceph_mknod()
When file num exceeds quota limit or fails from ceph_per_init_acls()
should call d_drop to drop dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
61ad36d47d ceph: return errors from posix_acl_equiv_mode() correctly
In order to return correct error code should replace variable ret
using err in error case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Yan, Zheng
dfeb84d4ad ceph: fix incorrect use of strncpy
GCC8 prints following warning:

 fs/ceph/mds_client.c:3683:2: warning: ‘strncpy’ output may be truncated
 copying 64 bytes from a string of length 64 [-Wstringop-truncation]

[ Change to strscpy() while at it. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov
2f56b6bae7 libceph: amend "bad option arg" error message
Don't mention "mount" -- in the rbd case it is "mapping".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu
93d35c754d ceph: restore ctime as well in the case of restoring old mode
It's better to restore ctime as well in the case of restoring old mode
in ceph_set_acl().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu
f017754d69 ceph: add retry logic for error -ERANGE in ceph_get_acl()
When the size of acl extended attribution is larger than pre-allocated
value buffer size, we will hit error '-ERANGE' and it's probabaly caused
by concurrent get/set acl from different clients. In this case, current
logic just sets acl to NULL so that we cannot get proper information but
the operation looks successful.

This patch adds retry logic for error -ERANGE and return -EIO if fail
from the retry. Additionally, print real errno when failing from
__ceph_getxattr().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Phillip Lougher
a3f94cb99a Squashfs: Compute expected length from inode size rather than block length
Previously in squashfs_readpage() when copying data into the page
cache, it used the length of the datablock read from the filesystem
(after decompression).  However, if the filesystem has been corrupted
this data block may be short, which will leave pages unfilled.

The fix for this is to compute the expected number of bytes to copy
from the inode size, and use this to detect if the block is short.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 09:34:02 -07:00
Linus Torvalds
71755ee535 squashfs: more metadata hardening
The squashfs fragment reading code doesn't actually verify that the
fragment is inside the fragment table.  The end result _is_ verified to
be inside the image when actually reading the fragment data, but before
that is done, we may end up taking a page fault because the fragment
table itself might not even exist.

Another report from Anatoly and his endless squashfs image fuzzing.

Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>,
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 09:32:23 -07:00
Liu Song
bc71652346 ext4: improve code readability in ext4_iget()
Merge the duplicated complex conditions to improve code readability.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
2018-08-02 00:11:16 -04:00
Jeremy Cline
1a5d5e5d51 ext4: fix spectre gadget in ext4_mb_regular_allocator()
'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the
derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to
index arrays which makes it a potential spectre gadget. Fix this by
sanitizing the value assigned to 'ac->ac2_order'.  This covers the
following accesses found with the help of smatch:

* fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential
  spectre issue 'grp->bb_counters' [w] (local cap)

* fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap)

* fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap)

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-02 00:03:40 -04:00
Al Viro
c971e6a006 kill d_instantiate_no_diralias()
The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success.  ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-01 23:18:53 -04:00
Trond Myklebust
6ea76bf513 NFSv4: Fix _nfs4_do_setlk()
The patch to fix the case where a lock request was interrupted ended up
changing default handling of errors such as NFS4ERR_DENIED and caused the
client to immediately resend the lock request. Let's do a partial revert
of that request so that the default is now to exit, but change the way
we handle resends to take into account the fact that the user may have
interrupted the request.

Reported-by: Kenneth Johansson <ken@kenjo.org>
Fixes: a3cf9bca2a ("NFSv4: Don't add a new lock on an interrupted wait..")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-08-01 23:17:06 -04:00
Christoph Hellwig
006477f40d kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
No need to have this in the top-level Kconfig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-08-02 08:06:55 +09:00
Chao Yu
82cf4f132e f2fs: fix to active page in lru list for read path
If config CONFIG_F2FS_FAULT_INJECTION is on, for both read or write path
we will call find_lock_page() to get the page, but for read path, it
missed to passing FGP_ACCESSED to allocator to active the page in LRU
list, result in being reclaimed in advance incorrectly, fix it.

Reported-by: Xianrong Zhou <zhouxianrong@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18767e6263 f2fs: don't keep meta pages used for block migration
For migration of encrypted inode's block, we load data of encrypted block
into meta inode's page cache, after checkpoint, those all intermediate
pages should be clean, and no one will read them again, so let's just
release them for more memory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4ddc1b28aa f2fs: fix to restrict mount condition when without CONFIG_QUOTA
Like quota_ino feature, we need to reject mounting RDWR with image
which enables project_quota feature when there is no CONFIG_QUOTA
be set in kernel.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
00960c2cd8 f2fs: quota: do not mount as RDWR without QUOTA if quota feature enabled
If quota feature is enabled, quota is on by default. However, if
CONFIG_QUOTA is not built in kernel, dquot entries will not get updated,
which leads to quota inconsistency.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
76cf05d79c f2fs: quota: fix incorrect comments
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
955ac6e523 f2fs: quota: decrease the lock granularity of statfs_project
According to fs/quota/dquot.c, `dq_data_lock' protects mem_dqinfo
structures and modifications of dquot pointers in the inode, and
`dquot->dq_dqb_lock' protects data from dq_dqb.

We should use dquot->dq_dqb_lock in statfs_project instead of
dq_dat_lock.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
970e348d98 f2fs: add proc entry to show victim_secmap bitmap
This patch adds a new proc entry to show victim_secmap information in
more detail, which is very helpful to know the get_victim candidate
status clearly, and helpful to debug problems (e.g., some sections can
not gc all of its blocks, since some blocks belong to atomic file,
leaving victim_secmap with section bit setting, in extrem case, this
will lead all bytes of victim_secmap setting with 0xff).

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
fd8c8caf7e f2fs: let checkpoint flush dnode page of regular
Fsyncer will wait on all dnode pages of regular writeback before flushing,
if there are async dnode pages blocked by IO scheduler, it may decrease
fsync's performance.

In this patch, we choose to let f2fs_balance_fs_bg() to trigger checkpoint
to flush these dnode pages of regular, so async IO of dnode page can be
elimitnated, making fsyncer only need to wait for sync IO.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
ad6672bbc5 f2fs: issue discard align to section in LFS mode
For the case when sbi->segs_per_sec > 1 with lfs mode, take
section:segment = 5 for example, if the section prefree_map is
...previous section | current section (1 1 0 1 1) | next section...,
then the start = x, end = x + 1, after start = start_segno +
sbi->segs_per_sec, start = x + 5, then it will skip x + 3 and x + 4, but
their bitmap is still set, which will cause duplicated
f2fs_issue_discard of this same section in the next write_checkpoint:

round 1: section bitmap : 1 1 1 1 1, all valid, prefree_map: 0 0 0 0 0
then rm data block NO.2, block NO.2 becomes invalid, prefree_map: 0 0 1 0 0
write_checkpoint: section bitmap: 1 1 0 1 1, prefree_map: 0 0 0 0 0,
prefree of NO.2 is cleared, and no discard issued

round 2: rm data block NO.0, NO.1, NO.3, NO.4
all invalid, but prefree bit of NO.2 is set and cleared in round 1, then
prefree_map: 1 1 0 1 1
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1, no
valid blocks of this section, so discard issued, but this time prefree
bit of NO.3 and NO.4 is skipped due to start = start_segno + sbi->segs_per_sec;

round 3:
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1 ->
0 0 0 0 0, no valid blocks of this section, so discard issued,
this time prefree bit of NO.3 and NO.4 is cleared, but the discard of
this section is sent again...

To fix this problem, we can align the start and end value to section
boundary for fstrim and real-time discard operation, and decide to issue
discard only when the whole section is invalid, which can issue discard
aligned to section size as much as possible and avoid redundant discard.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Jaegeuk Kim
455e3a5887 f2fs: don't allow any writes on aborted atomic writes
In order to prevent abusing atomic writes by abnormal users, we've added a
threshold, 20% over memory footprint, which disallows further atomic writes.
Previously, however, SQLite doesn't know the files became normal, so that
it could write stale data and commit on revoked normal database file.

Once f2fs detects such the abnormal behavior, this patch tries to avoid further
writes in write_begin().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
797c1cb56b f2fs: restrict setting up inode.i_advise
In order to give advise to f2fs to recognize hot/cold file, it is possible
that we can set specific bit in inode.i_advise through setxattr(), but
there are several bits which are used internally, such as encrypt_bit,
keep_size_bit, they should never be changed through setxattr().

So that this patch 1) adds FADVISE_MODIFIABLE_BITS to filter modifiable
bits user given, 2) supports to clear {hot,cold}_file bits.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlei He
e6b0b159cf f2fs: fix wrong kernel message when recover fsync data on ro fs
This patch fix wrong message info for recover fsync data
on readonly fs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
059c0648c6 f2fs: clean up ioctl interface naming
Romve redundant prefix 'f2fs_' in the middle of f2fs_ioc_f2fs_write_checkpoint().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2079f115e7 f2fs: clean up with f2fs_is_{atomic,volatile}_file()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
5b72d5e0df f2fs: clean up with f2fs_encrypted_inode()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
80551d1773 f2fs: clean up with get_current_nat_page
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
6122003a1a f2fs: kill EXT_TREE_VEC_SIZE
Since commit 201ef5e080 ("f2fs: improve shrink performance of extent nodes"),
there is no user of EXT_TREE_VEC_SIZE, just kill it for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Hyunchul Lee
5d3ce4f701 f2fs: avoid duplicated permission check for "trusted." xattrs
Because xattr_permission already checks CAP_SYS_ADMIN
capability, we don't need to check it.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
7735730d39 f2fs: fix to propagate error from __get_meta_page()
If caller of __get_meta_page() can handle error, let's propagate error
from __get_meta_page().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18dd6470c2 f2fs: fix to do sanity check with i_extra_isize
If inode.i_extra_isize was fuzzed to an abnormal value, when
calculating inline data size, the result will overflow, result
in accessing invalid memory area when operating inline data.

Let's do sanity check with i_extra_isize during inode loading
for fixing.

https://bugzilla.kernel.org/show_bug.cgi?id=200421

- Reproduce

- POC (poc.c)
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/mount.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    #include <dirent.h>
    #include <errno.h>
    #include <error.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #include <linux/falloc.h>
    #include <linux/loop.h>

    static void activity(char *mpoint) {

      char *foo_bar_baz;
      char *foo_baz;
      char *xattr;
      int err;

      err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
      err = asprintf(&foo_baz, "%s/foo/baz", mpoint);
      err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

      rename(foo_bar_baz, foo_baz);

      char buf2[113];
      memset(buf2, 0, sizeof(buf2));
      listxattr(xattr, buf2, sizeof(buf2));
      removexattr(xattr, "user.mime_type");

    }

    int main(int argc, char *argv[]) {
      activity(argv[1]);
      return 0;
    }

- Kernel message
Umount the image will leave the following message
[ 2910.995489] F2FS-fs (loop0): Mounted with checkpoint version = 2
[ 2918.416465] ==================================================================
[ 2918.416807] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0xcb9/0x1a80
[ 2918.417009] Read of size 4 at addr ffff88018efc2068 by task a.out/1229

[ 2918.417311] CPU: 1 PID: 1229 Comm: a.out Not tainted 4.17.0+ #1
[ 2918.417314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2918.417323] Call Trace:
[ 2918.417366]  dump_stack+0x71/0xab
[ 2918.417401]  print_address_description+0x6b/0x290
[ 2918.417407]  kasan_report+0x28e/0x390
[ 2918.417411]  ? f2fs_iget+0xcb9/0x1a80
[ 2918.417415]  f2fs_iget+0xcb9/0x1a80
[ 2918.417422]  ? f2fs_lookup+0x2e7/0x580
[ 2918.417425]  f2fs_lookup+0x2e7/0x580
[ 2918.417433]  ? __recover_dot_dentries+0x400/0x400
[ 2918.417447]  ? legitimize_path.isra.29+0x5a/0xa0
[ 2918.417453]  __lookup_slow+0x11c/0x220
[ 2918.417457]  ? may_delete+0x2a0/0x2a0
[ 2918.417475]  ? deref_stack_reg+0xe0/0xe0
[ 2918.417479]  ? __lookup_hash+0xb0/0xb0
[ 2918.417483]  lookup_slow+0x3e/0x60
[ 2918.417488]  walk_component+0x3ac/0x990
[ 2918.417492]  ? generic_permission+0x51/0x1e0
[ 2918.417495]  ? inode_permission+0x51/0x1d0
[ 2918.417499]  ? pick_link+0x3e0/0x3e0
[ 2918.417502]  ? link_path_walk+0x4b1/0x770
[ 2918.417513]  ? _raw_spin_lock_irqsave+0x25/0x50
[ 2918.417518]  ? walk_component+0x990/0x990
[ 2918.417522]  ? path_init+0x2e6/0x580
[ 2918.417526]  path_lookupat+0x13f/0x430
[ 2918.417531]  ? trailing_symlink+0x3a0/0x3a0
[ 2918.417534]  ? do_renameat2+0x270/0x7b0
[ 2918.417538]  ? __kasan_slab_free+0x14c/0x190
[ 2918.417541]  ? do_renameat2+0x270/0x7b0
[ 2918.417553]  ? kmem_cache_free+0x85/0x1e0
[ 2918.417558]  ? do_renameat2+0x270/0x7b0
[ 2918.417563]  filename_lookup+0x13c/0x280
[ 2918.417567]  ? filename_parentat+0x2b0/0x2b0
[ 2918.417572]  ? kasan_unpoison_shadow+0x31/0x40
[ 2918.417575]  ? kasan_kmalloc+0xa6/0xd0
[ 2918.417593]  ? strncpy_from_user+0xaa/0x1c0
[ 2918.417598]  ? getname_flags+0x101/0x2b0
[ 2918.417614]  ? path_listxattr+0x87/0x110
[ 2918.417619]  path_listxattr+0x87/0x110
[ 2918.417623]  ? listxattr+0xc0/0xc0
[ 2918.417637]  ? mm_fault_error+0x1b0/0x1b0
[ 2918.417654]  do_syscall_64+0x73/0x160
[ 2918.417660]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2918.417676] RIP: 0033:0x7f2f3a3480d7
[ 2918.417677] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[ 2918.417732] RSP: 002b:00007fff4095b7d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000c2
[ 2918.417744] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2f3a3480d7
[ 2918.417746] RDX: 0000000000000071 RSI: 00007fff4095b810 RDI: 000000000126a0c0
[ 2918.417749] RBP: 00007fff4095b890 R08: 000000000126a010 R09: 0000000000000000
[ 2918.417751] R10: 00000000000001ab R11: 0000000000000206 R12: 00000000004005e0
[ 2918.417753] R13: 00007fff4095b990 R14: 0000000000000000 R15: 0000000000000000

[ 2918.417853] Allocated by task 329:
[ 2918.418002]  kasan_kmalloc+0xa6/0xd0
[ 2918.418007]  kmem_cache_alloc+0xc8/0x1e0
[ 2918.418023]  mempool_init_node+0x194/0x230
[ 2918.418027]  mempool_init+0x12/0x20
[ 2918.418042]  bioset_init+0x2bd/0x380
[ 2918.418052]  blk_alloc_queue_node+0xe9/0x540
[ 2918.418075]  dm_create+0x2c0/0x800
[ 2918.418080]  dev_create+0xd2/0x530
[ 2918.418083]  ctl_ioctl+0x2a3/0x5b0
[ 2918.418087]  dm_ctl_ioctl+0xa/0x10
[ 2918.418092]  do_vfs_ioctl+0x13e/0x8c0
[ 2918.418095]  ksys_ioctl+0x66/0x70
[ 2918.418098]  __x64_sys_ioctl+0x3d/0x50
[ 2918.418102]  do_syscall_64+0x73/0x160
[ 2918.418106]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 2918.418204] Freed by task 0:
[ 2918.418301] (stack is not available)

[ 2918.418521] The buggy address belongs to the object at ffff88018efc0000
                which belongs to the cache biovec-max of size 8192
[ 2918.418894] The buggy address is located 104 bytes to the right of
                8192-byte region [ffff88018efc0000, ffff88018efc2000)
[ 2918.419257] The buggy address belongs to the page:
[ 2918.419431] page:ffffea00063bf000 count:1 mapcount:0 mapping:ffff8801f2242540 index:0x0 compound_mapcount: 0
[ 2918.419702] flags: 0x17fff8000008100(slab|head)
[ 2918.419879] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801f2242540
[ 2918.420101] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000
[ 2918.420322] page dumped because: kasan: bad access detected

[ 2918.420599] Memory state around the buggy address:
[ 2918.420764]  ffff88018efc1f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.420975]  ffff88018efc1f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.421194] >ffff88018efc2000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421406]                                                           ^
[ 2918.421627]  ffff88018efc2080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421838]  ffff88018efc2100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.422046] ==================================================================
[ 2918.422264] Disabling lock debugging due to kernel taint
[ 2923.901641] BUG: unable to handle kernel paging request at ffff88018f0db000
[ 2923.901884] PGD 22226a067 P4D 22226a067 PUD 222273067 PMD 18e642063 PTE 800000018f0db061
[ 2923.902120] Oops: 0003 [#1] SMP KASAN PTI
[ 2923.902274] CPU: 1 PID: 1231 Comm: umount Tainted: G    B             4.17.0+ #1
[ 2923.902490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2923.902761] RIP: 0010:__memset+0x24/0x30
[ 2923.902906] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2923.903446] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2923.903622] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2923.903833] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2923.904062] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2923.904273] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2923.904485] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2923.904693] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2923.904937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2923.910080] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2923.914930] Call Trace:
[ 2923.919724]  f2fs_truncate_inline_inode+0x114/0x170
[ 2923.924487]  f2fs_truncate_blocks+0x11b/0x7c0
[ 2923.929178]  ? f2fs_truncate_data_blocks+0x10/0x10
[ 2923.933834]  ? dqget+0x670/0x670
[ 2923.938437]  ? f2fs_destroy_extent_tree+0xd6/0x270
[ 2923.943107]  ? __radix_tree_lookup+0x2f/0x150
[ 2923.947772]  f2fs_truncate+0xd4/0x1a0
[ 2923.952491]  f2fs_evict_inode+0x5ab/0x610
[ 2923.957204]  evict+0x15f/0x280
[ 2923.961898]  __dentry_kill+0x161/0x250
[ 2923.966634]  shrink_dentry_list+0xf3/0x250
[ 2923.971897]  shrink_dcache_parent+0xa9/0x100
[ 2923.976561]  ? shrink_dcache_sb+0x1f0/0x1f0
[ 2923.981177]  ? wait_for_completion+0x8a/0x210
[ 2923.985781]  ? migrate_swap_stop+0x2d0/0x2d0
[ 2923.990332]  do_one_tree+0xe/0x40
[ 2923.994735]  shrink_dcache_for_umount+0x3a/0xa0
[ 2923.999077]  generic_shutdown_super+0x3e/0x1c0
[ 2924.003350]  kill_block_super+0x4b/0x70
[ 2924.007619]  deactivate_locked_super+0x65/0x90
[ 2924.011812]  cleanup_mnt+0x5c/0xa0
[ 2924.015995]  task_work_run+0xce/0xf0
[ 2924.020174]  exit_to_usermode_loop+0x115/0x120
[ 2924.024293]  do_syscall_64+0x12f/0x160
[ 2924.028479]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2924.032709] RIP: 0033:0x7fa8b2868487
[ 2924.036888] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[ 2924.045750] RSP: 002b:00007ffc39824d58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2924.050190] RAX: 0000000000000000 RBX: 00000000008ea030 RCX: 00007fa8b2868487
[ 2924.054604] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000008f4360
[ 2924.058940] RBP: 00000000008f4360 R08: 0000000000000000 R09: 0000000000000014
[ 2924.063186] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007fa8b2d7183c
[ 2924.067418] R13: 0000000000000000 R14: 00000000008ea210 R15: 00007ffc39824fe0
[ 2924.071534] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 2924.098044] CR2: ffff88018f0db000
[ 2924.102520] ---[ end trace a8e0d899985faf31 ]---
[ 2924.107012] RIP: 0010:__memset+0x24/0x30
[ 2924.111448] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2924.120724] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2924.125312] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2924.129931] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2924.134537] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2924.139175] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2924.143825] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2924.148500] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2924.153247] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2924.158003] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2924.164641] BUG: Bad rss-counter state mm:00000000fa04621e idx:0 val:4
[ 2924.170007] BUG: Bad rss-counter
tate mm:00000000fa04621e idx:1 val:2

- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/inline.c#L78
	memset(addr + from, 0, MAX_INLINE_DATA(inode) - from);
Here the length can be negative.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
66415cee3d f2fs: blk_finish_plug of submit_bio in lfs mode
Expand the blk_finish_plug action from blkzoned to normal lfs mode,
since plug will cause the out-of-order IO submission, which is not
friendly to flash in lfs mode.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
3611ce9911 f2fs: do not set free of current section
For the case when sbi->segs_per_sec > 1, take section:segment = 5 for
example, if segment 1 is just used and allocate new segment 2, and the
blocks of segment 1 is invalidated, at this time, the previous code will
use __set_test_and_free to free the free_secmap and free_sections++,
this is not correct since it is still a current section, so fix it.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Daniel Rosenberg
36b877af79 f2fs: Keep alloc_valid_block_count in sync
If we attempt to request more blocks than we have room for, we try to
instead request as much as we can, however, alloc_valid_block_count
is not decremented to match the new value, allowing it to drift higher
until the next checkpoint. This always decrements it when the requested
amount cannot be fulfilled.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
20ee438232 f2fs: issue small discard by LBA order
For small granularity discard which size is smaller than 64KB, if we
issue those kind of discards orderly by size, their IOs will be spread
into entire logical address, so that in FTL, L2P table will be updated
randomly, result bad wear rate in the table.

In this patch, we choose to issue small discard by LBA order, by this
way, we can expect that L2P table updates from adjacent discard IOs can
be merged in the cache, so it can reduce lifetime wearing of flash.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
522d1711d6 f2fs: stop issuing discard immediately if there is queued IO
For background discard policy, even if there is queued user IO, still
we will check max_requests times for next discard entry, it is unneeded,
let's just stop this round submission immediately.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4c6b56c002 f2fs: clean up with IS_INODE()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2482c4325d f2fs: detect bug_on in f2fs_wait_discard_bios
Add bug_on to detect potential non-empty discard wait list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Randy Dunlap
cb15d1e43d f2fs: fix defined but not used build warnings
Fix build warnings in f2fs when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.

../fs/f2fs/sysfs.c:519:12: warning: 'segment_info_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:546:12: warning: 'segment_bits_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:570:12: warning: 'iostat_info_seq_show' defined but not used [-Wunused-function]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
a39e536583 f2fs: enable real-time discard by default
f2fs is focused on flash based storage, so let's enable real-time
discard by default, if user don't want to enable it, 'nodiscard'
mount option should be used on mount.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
82902c06bd f2fs: fix to detect looped node chain correctly
Below dmesg was printed when testing generic/388 of fstest:

F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22
F2FS-fs (zram1): Mounted with checkpoint version = 22300d0e
F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22

The reason is that we initialize free_blocks with free blocks of
filesystem, so if filesystem is full, free_blocks can be zero,
below condition will be true, so that, it will fail recovery.

if (++loop_cnt >= free_blocks ||
	blkaddr == next_blkaddr_of_node(page))

To fix this issue, initialize free_blocks with correct value which
includes over-privision blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
c9b60788fc f2fs: fix to do sanity check with block address in main area
This patch add to do sanity check with below field:
- cp_pack_total_block_count
- blkaddr of data/node
- extent info

- Overview
BUG() in verify_block_addr() when writing to a corrupted f2fs image

- Reproduce (4.18 upstream kernel)

- POC (poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);

  int fd = open(foo_bar_baz, O_RDWR | O_TRUNC, 0777);
  if (fd >= 0) {
    write(fd, (char *)buf, sizeof(buf));
    fdatasync(fd);
    close(fd);
  }
}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel message
[  689.349473] F2FS-fs (loop0): Mounted with checkpoint version = 3
[  699.728662] WARNING: CPU: 0 PID: 1309 at fs/f2fs/segment.c:2860 f2fs_inplace_write_data+0x232/0x240
[  699.728670] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.729056] CPU: 0 PID: 1309 Comm: a.out Not tainted 4.18.0-rc1+ #4
[  699.729064] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.729074] RIP: 0010:f2fs_inplace_write_data+0x232/0x240
[  699.729076] Code: ff e9 cf fe ff ff 49 8d 7d 10 e8 39 45 ad ff 4d 8b 7d 10 be 04 00 00 00 49 8d 7f 48 e8 07 49 ad ff 45 8b 7f 48 e9 fb fe ff ff <0f> 0b f0 41 80 4d 48 04 e9 65 fe ff ff 90 66 66 66 66 90 55 48 8d
[  699.729130] RSP: 0018:ffff8801f43af568 EFLAGS: 00010202
[  699.729139] RAX: 000000000000003f RBX: ffff8801f43af7b8 RCX: ffffffffb88c9113
[  699.729142] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8802024e5540
[  699.729144] RBP: ffff8801f43af590 R08: 0000000000000009 R09: ffffffffffffffe8
[  699.729147] R10: 0000000000000001 R11: ffffed0039b0596a R12: ffff8802024e5540
[  699.729149] R13: ffff8801f0335500 R14: ffff8801e3e7a700 R15: ffff8801e1ee4450
[  699.729154] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.729156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.729159] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.729171] Call Trace:
[  699.729192]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.729203]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.729238]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.729269]  ? __radix_tree_replace+0xa3/0x120
[  699.729276]  __write_data_page+0x5c7/0xe30
[  699.729291]  ? kasan_check_read+0x11/0x20
[  699.729310]  ? page_mapped+0x8a/0x110
[  699.729321]  ? page_mkclean+0xe9/0x160
[  699.729327]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.729331]  ? invalid_page_referenced_vma+0x130/0x130
[  699.729345]  ? clear_page_dirty_for_io+0x332/0x450
[  699.729351]  f2fs_write_cache_pages+0x4ca/0x860
[  699.729358]  ? __write_data_page+0xe30/0xe30
[  699.729374]  ? percpu_counter_add_batch+0x22/0xa0
[  699.729380]  ? kasan_check_write+0x14/0x20
[  699.729391]  ? _raw_spin_lock+0x17/0x40
[  699.729403]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.729413]  ? iov_iter_advance+0x113/0x640
[  699.729418]  ? f2fs_write_end+0x133/0x2e0
[  699.729423]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.729428]  f2fs_write_data_pages+0x329/0x520
[  699.729433]  ? generic_perform_write+0x250/0x320
[  699.729438]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729454]  ? current_time+0x110/0x110
[  699.729459]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.729464]  do_writepages+0x37/0xb0
[  699.729468]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729472]  ? do_writepages+0x37/0xb0
[  699.729478]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.729483]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.729496]  ? __vfs_write+0x2b2/0x410
[  699.729501]  file_write_and_wait_range+0x66/0xb0
[  699.729506]  f2fs_do_sync_file+0x1f9/0xd90
[  699.729511]  ? truncate_partial_data_page+0x290/0x290
[  699.729521]  ? __sb_end_write+0x30/0x50
[  699.729526]  ? vfs_write+0x20f/0x260
[  699.729530]  f2fs_sync_file+0x9a/0xb0
[  699.729534]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.729548]  vfs_fsync_range+0x68/0x100
[  699.729554]  ? __fget_light+0xc9/0xe0
[  699.729558]  do_fsync+0x3d/0x70
[  699.729562]  __x64_sys_fdatasync+0x24/0x30
[  699.729585]  do_syscall_64+0x78/0x170
[  699.729595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.729613] RIP: 0033:0x7f9bf930d800
[  699.729615] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.729668] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.729673] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.729675] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.729678] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.729680] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.729683] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.729687] ---[ end trace 4ce02f25ff7d3df5 ]---
[  699.729782] ------------[ cut here ]------------
[  699.729785] kernel BUG at fs/f2fs/segment.h:654!
[  699.731055] invalid opcode: 0000 [#1] SMP KASAN PTI
[  699.732104] CPU: 0 PID: 1309 Comm: a.out Tainted: G        W         4.18.0-rc1+ #4
[  699.733684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.735611] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.736649] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.740524] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.741573] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.743006] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.744426] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.745833] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.747256] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.748683] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.750293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.751462] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.752874] Call Trace:
[  699.753386]  ? f2fs_inplace_write_data+0x93/0x240
[  699.754341]  f2fs_inplace_write_data+0xd2/0x240
[  699.755271]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.756214]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.757215]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.758209]  ? __radix_tree_replace+0xa3/0x120
[  699.759164]  __write_data_page+0x5c7/0xe30
[  699.760002]  ? kasan_check_read+0x11/0x20
[  699.760823]  ? page_mapped+0x8a/0x110
[  699.761573]  ? page_mkclean+0xe9/0x160
[  699.762345]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.763332]  ? invalid_page_referenced_vma+0x130/0x130
[  699.764374]  ? clear_page_dirty_for_io+0x332/0x450
[  699.765347]  f2fs_write_cache_pages+0x4ca/0x860
[  699.766276]  ? __write_data_page+0xe30/0xe30
[  699.767161]  ? percpu_counter_add_batch+0x22/0xa0
[  699.768112]  ? kasan_check_write+0x14/0x20
[  699.768951]  ? _raw_spin_lock+0x17/0x40
[  699.769739]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.770885]  ? iov_iter_advance+0x113/0x640
[  699.771743]  ? f2fs_write_end+0x133/0x2e0
[  699.772569]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.773680]  f2fs_write_data_pages+0x329/0x520
[  699.774603]  ? generic_perform_write+0x250/0x320
[  699.775544]  ? f2fs_write_cache_pages+0x860/0x860
[  699.776510]  ? current_time+0x110/0x110
[  699.777299]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.778279]  do_writepages+0x37/0xb0
[  699.779026]  ? f2fs_write_cache_pages+0x860/0x860
[  699.779978]  ? do_writepages+0x37/0xb0
[  699.780755]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.781746]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.782820]  ? __vfs_write+0x2b2/0x410
[  699.783597]  file_write_and_wait_range+0x66/0xb0
[  699.784540]  f2fs_do_sync_file+0x1f9/0xd90
[  699.785381]  ? truncate_partial_data_page+0x290/0x290
[  699.786415]  ? __sb_end_write+0x30/0x50
[  699.787204]  ? vfs_write+0x20f/0x260
[  699.787941]  f2fs_sync_file+0x9a/0xb0
[  699.788694]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.789572]  vfs_fsync_range+0x68/0x100
[  699.790360]  ? __fget_light+0xc9/0xe0
[  699.791128]  do_fsync+0x3d/0x70
[  699.791779]  __x64_sys_fdatasync+0x24/0x30
[  699.792614]  do_syscall_64+0x78/0x170
[  699.793371]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.794406] RIP: 0033:0x7f9bf930d800
[  699.795134] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.798960] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.800483] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.801923] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.803373] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.804798] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.806233] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.807667] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.817079] ---[ end trace 4ce02f25ff7d3df6 ]---
[  699.818068] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.819114] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.822919] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.823977] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.825436] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.826881] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.828292] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.829750] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.831192] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.832793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.833981] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.835556] ==================================================================
[  699.837029] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[  699.838462] Read of size 8 at addr ffff8801f43af970 by task a.out/1309

[  699.840086] CPU: 0 PID: 1309 Comm: a.out Tainted: G      D W         4.18.0-rc1+ #4
[  699.841603] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.843475] Call Trace:
[  699.843982]  dump_stack+0x7b/0xb5
[  699.844661]  print_address_description+0x70/0x290
[  699.845607]  kasan_report+0x291/0x390
[  699.846351]  ? update_stack_state+0x38c/0x3e0
[  699.853831]  __asan_load8+0x54/0x90
[  699.854569]  update_stack_state+0x38c/0x3e0
[  699.855428]  ? __read_once_size_nocheck.constprop.7+0x20/0x20
[  699.856601]  ? __save_stack_trace+0x5e/0x100
[  699.857476]  unwind_next_frame.part.5+0x18e/0x490
[  699.858448]  ? unwind_dump+0x290/0x290
[  699.859217]  ? clear_page_dirty_for_io+0x332/0x450
[  699.860185]  __unwind_start+0x106/0x190
[  699.860974]  __save_stack_trace+0x5e/0x100
[  699.861808]  ? __save_stack_trace+0x5e/0x100
[  699.862691]  ? unlink_anon_vmas+0xba/0x2c0
[  699.863525]  save_stack_trace+0x1f/0x30
[  699.864312]  save_stack+0x46/0xd0
[  699.864993]  ? __alloc_pages_slowpath+0x1420/0x1420
[  699.865990]  ? flush_tlb_mm_range+0x15e/0x220
[  699.866889]  ? kasan_check_write+0x14/0x20
[  699.867724]  ? __dec_node_state+0x92/0xb0
[  699.868543]  ? lock_page_memcg+0x85/0xf0
[  699.869350]  ? unlock_page_memcg+0x16/0x80
[  699.870185]  ? page_remove_rmap+0x198/0x520
[  699.871048]  ? mark_page_accessed+0x133/0x200
[  699.871930]  ? _cond_resched+0x1a/0x50
[  699.872700]  ? unmap_page_range+0xcd4/0xe50
[  699.873551]  ? rb_next+0x58/0x80
[  699.874217]  ? rb_next+0x58/0x80
[  699.874895]  __kasan_slab_free+0x13c/0x1a0
[  699.875734]  ? unlink_anon_vmas+0xba/0x2c0
[  699.876563]  kasan_slab_free+0xe/0x10
[  699.877315]  kmem_cache_free+0x89/0x1e0
[  699.878095]  unlink_anon_vmas+0xba/0x2c0
[  699.878913]  free_pgtables+0x101/0x1b0
[  699.879677]  exit_mmap+0x146/0x2a0
[  699.880378]  ? __ia32_sys_munmap+0x50/0x50
[  699.881214]  ? kasan_check_read+0x11/0x20
[  699.882052]  ? mm_update_next_owner+0x322/0x380
[  699.882985]  mmput+0x8b/0x1d0
[  699.883602]  do_exit+0x43a/0x1390
[  699.884288]  ? mm_update_next_owner+0x380/0x380
[  699.885212]  ? f2fs_sync_file+0x9a/0xb0
[  699.885995]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.886877]  ? vfs_fsync_range+0x68/0x100
[  699.887694]  ? __fget_light+0xc9/0xe0
[  699.888442]  ? do_fsync+0x3d/0x70
[  699.889118]  ? __x64_sys_fdatasync+0x24/0x30
[  699.889996]  rewind_stack_do_exit+0x17/0x20
[  699.890860] RIP: 0033:0x7f9bf930d800
[  699.891585] Code: Bad RIP value.
[  699.892268] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.893781] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.895220] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.896643] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.898069] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.899505] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000

[  699.901241] The buggy address belongs to the page:
[  699.902215] page:ffffea0007d0ebc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  699.903811] flags: 0x2ffff0000000000()
[  699.904585] raw: 02ffff0000000000 0000000000000000 ffffffff07d00101 0000000000000000
[  699.906125] raw: 0000000000000000 0000000000240000 00000000ffffffff 0000000000000000
[  699.907673] page dumped because: kasan: bad access detected

[  699.909108] Memory state around the buggy address:
[  699.910077]  ffff8801f43af800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[  699.911528]  ffff8801f43af880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  699.912953] >ffff8801f43af900: 00 00 00 00 00 00 00 00 f1 01 f4 f4 f4 f2 f2 f2
[  699.914392]                                                              ^
[  699.915758]  ffff8801f43af980: f2 00 f4 f4 00 00 00 00 f2 00 00 00 00 00 00 00
[  699.917193]  ffff8801f43afa00: 00 00 00 00 00 00 00 00 00 f3 f3 f3 00 00 00 00
[  699.918634] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.h#L644

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:32 -07:00
Linus Torvalds
cdbb65c4c7 squashfs metadata 2: electric boogaloo
Anatoly continues to find issues with fuzzed squashfs images.

This time, corrupt, missing, or undersized data for the page filling
wasn't checked for, because the squashfs_{copy,read}_cache() functions
did the squashfs_copy_data() call without checking the resulting data
size.

Which could result in the page cache pages being incompletely filled in,
and no error indication to the user space reading garbage data.

So make a helper function for the "fill in pages" case, because the
exact same incomplete sequence existed in two places.

[ I should have made a squashfs branch for these things, but I didn't
  intend to start doing them in the first place.

  My historical connection through cramfs is why I got into looking at
  these issues at all, and every time I (continue to) think it's a
  one-off.

  Because _this_ time is always the last time. Right?   - Linus ]

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-01 10:38:43 -07:00
Theodore Ts'o
7d95178c77 ext4: check for NUL characters in extended attribute's name
Extended attribute names are defined to be NUL-terminated, so the name
must not contain a NUL character.  This is important because there are
places when remove extended attribute, the code uses strlen to
determine the length of the entry.  That should probably be fixed at
some point, but code is currently really messy, so the simplest fix
for now is to simply validate that the extended attributes are sane.

https://bugzilla.kernel.org/show_bug.cgi?id=200401

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-01 12:36:52 -04:00
Wang Shilong
5ef2a69993 ext4: use ext4_warning() for sb_getblk failure
Out of memory should not be considered as critical errors; so replace
ext4_error() with ext4_warnig().

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-01 12:02:31 -04:00
Darrick J. Wong
56830d6cc1 xfs: check da node magic in _node_lookup_int
Before we start processing what we /think/ is a da3 node block, actually
check the magic to make sure that we're looking at a node block.  This
way we won't blow the asserts in _node_hdr_from_disk on corrupted
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:43 -07:00
Darrick J. Wong
611995db2c xfs: use a local variable for magic number in xfs_da3_node_lookup_int
Use a local variable for the block magic number checks instead of
abusing blk->magic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:18 -07:00
Darrick J. Wong
0c60d3aa0e xfs: refactor log recovery check
Add a predicate to decide if the log is actively in recovery and use
that instead of open-coding a pagf_init check in the attr leaf verifier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:40:48 -07:00
Darrick J. Wong
ff23f4af7e xfs: move extent busy tree initialization to xfs_initialize_perag
Move the per-AG busy extent tree initialization to the per-ag structure
initialization since we don't want online repair to leak the old tree.
We only deconstruct the tree at unmount time, so this should be safe.
This also enables us to eliminate the commented out initialization in
the xfsprogs libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
e666aa37f4 xfs: avoid COW fork extent lookups in writeback if the fork didn't change
Used the per-fork sequence counter to avoid lookups in the writeback code
unless the COW fork actually changed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
745b3f76d1 xfs: maintain a sequence count for inode fork manipulations
Add a simple 32-bit unsigned integer as the sequence count for
modifications to the extent list in the inode fork.  This will be
used to optimize away extent list lookups in the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
9e037cb797 xfs: check for unknown v5 feature bits in superblock write verifier
Make sure we never try to write the superblock with unknown feature bits
set.  We checked those at mount time, so if they're set now then memory
is corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
69775fd15d xfs: verify icount in superblock write
Add a helper predicate to check the inode count for sanity, then use it
in the superblock write verifier to inspect sb_icount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00