mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
2759 Commits

Author SHA1 Message Date
Kirill Smelkov
c5bf68fe0c *: convert stream-like files from nonseekable_open -> stream_open
Using scripts/coccinelle/api/stream_open.cocci added in 10dce8af34
("fs: stream_open - opener for stream-like files so that read and write
can run simultaneously without deadlock"), search and convert to
stream_open all in-kernel nonseekable_open users for which read and
write actually do not depend on ppos and where there are no other methods
in file_operations which assume @offset access.
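
To illustrate what the conversion changes, here is a minimal userspace sketch of the two openers. It assumes, based on the description of 10dce8af34, that nonseekable_open leaves the "atomic position" mode bit set (so read and write serialize on the file position lock), while stream_open clears it and marks the file as a stream. The struct, the `_model` function names, and the flag values are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>

/* Illustrative flag values for this model only; the kernel's real
 * FMODE_* constants are defined in include/linux/fs.h. */
#define FMODE_LSEEK      (1u << 0)
#define FMODE_PREAD      (1u << 1)
#define FMODE_PWRITE     (1u << 2)
#define FMODE_ATOMIC_POS (1u << 3)
#define FMODE_STREAM     (1u << 4)

/* Hypothetical stand-in for struct file. */
struct file_model {
	unsigned int f_mode;
};

/* Model of nonseekable_open: seeking and positional I/O are disabled,
 * but FMODE_ATOMIC_POS is left alone, so read and write would still
 * take the shared position lock. */
static int nonseekable_open_model(struct file_model *filp)
{
	assert(filp);
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
	return 0;
}

/* Model of stream_open: additionally drops FMODE_ATOMIC_POS and marks
 * the file as a stream, so a blocked read no longer holds off a write. */
static int stream_open_model(struct file_model *filp)
{
	assert(filp);
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
			  FMODE_ATOMIC_POS);
	filp->f_mode |= FMODE_STREAM;
	return 0;
}

/* Would read/write serialize on the file position lock? */
static int needs_pos_lock(const struct file_model *filp)
{
	return (filp->f_mode & FMODE_ATOMIC_POS) != 0;
}
```

In this model, a file opened through nonseekable_open_model still reports needs_pos_lock() as true, which is the precondition for the read vs write deadlock; after stream_open_model it reports false.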

I've verified each generated change manually - that it is correct to convert -
and each other nonseekable_open instance left - that it is either not correct
to convert there, or that it is not converted due to current stream_open.cocci
limitations. The script also does not convert files that should be valid to
convert, but that currently have .llseek = noop_llseek or generic_file_llseek
for an unknown reason despite the file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c).

Among cases converted 14 were potentially vulnerable to read vs write deadlock
(see details in 10dce8af34):

	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/infiniband/core/user_mad.c:988:1-17: ERROR: umad_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/input/misc/uinput.c:401:1-17: ERROR: uinput_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.

and the rest were just safe to convert to stream_open because their read and
write do not use ppos at all and corresponding file_operations do not
have methods that assume @offset file access(*):

	arch/powerpc/platforms/52xx/mpc52xx_gpt.c:631:8-24: WARNING: mpc52xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_ibox_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_ibox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_mbox_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_mbox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_wbox_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_wbox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/um/drivers/harddog_kern.c:88:8-24: WARNING: harddog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/x86/kernel/cpu/microcode/core.c:430:33-49: WARNING: microcode_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/ds1620.c:215:8-24: WARNING: ds1620_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/dtlk.c:301:1-17: WARNING: dtlk_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/ipmi/ipmi_watchdog.c:840:9-25: WARNING: ipmi_wdog_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/pcmcia/scr24x_cs.c:95:8-24: WARNING: scr24x_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/tb0219.c:246:9-25: WARNING: tb0219_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/firewire/nosy.c:306:8-24: WARNING: nosy_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/hwmon/fschmd.c:840:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/hwmon/w83793.c:1344:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/ucma.c:1747:8-24: WARNING: ucma_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/ucm.c:1178:8-24: WARNING: ucm_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/uverbs_main.c:1086:8-24: WARNING: uverbs_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/input/joydev.c:282:1-17: WARNING: joydev_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/pci/switch/switchtec.c:393:1-17: WARNING: switchtec_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/platform/chrome/cros_ec_debugfs.c:135:8-24: WARNING: cros_ec_console_log_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/rtc/rtc-ds1374.c:470:9-25: WARNING: ds1374_wdt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/rtc/rtc-m41t80.c:805:9-25: WARNING: wdt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/char/tape_char.c:293:2-18: WARNING: tape_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/char/zcore.c:194:8-24: WARNING: zcore_reipl_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/crypto/zcrypt_api.c:528:8-24: WARNING: zcrypt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/spi/spidev.c:594:1-17: WARNING: spidev_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/staging/pi433/pi433_if.c:974:1-17: WARNING: pi433_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/acquirewdt.c:203:8-24: WARNING: acq_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/advantechwdt.c:202:8-24: WARNING: advwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/alim1535_wdt.c:252:8-24: WARNING: ali_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/alim7101_wdt.c:217:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ar7_wdt.c:166:8-24: WARNING: ar7_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/at91rm9200_wdt.c:113:8-24: WARNING: at91wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ath79_wdt.c:135:8-24: WARNING: ath79_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/bcm63xx_wdt.c:119:8-24: WARNING: bcm63xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/cpu5wdt.c:143:8-24: WARNING: cpu5wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/cpwd.c:397:8-24: WARNING: cpwd_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/eurotechwdt.c:319:8-24: WARNING: eurwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/f71808e_wdt.c:528:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/gef_wdt.c:232:8-24: WARNING: gef_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/geodewdt.c:95:8-24: WARNING: geodewdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ib700wdt.c:241:8-24: WARNING: ibwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ibmasr.c:326:8-24: WARNING: asr_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/indydog.c:80:8-24: WARNING: indydog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/intel_scu_watchdog.c:307:8-24: WARNING: intel_scu_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/iop_wdt.c:104:8-24: WARNING: iop_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/it8712f_wdt.c:330:8-24: WARNING: it8712f_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ixp4xx_wdt.c:68:8-24: WARNING: ixp4xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ks8695_wdt.c:145:8-24: WARNING: ks8695wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/m54xx_wdt.c:88:8-24: WARNING: m54xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/machzwd.c:336:8-24: WARNING: zf_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mixcomwd.c:153:8-24: WARNING: mixcomwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mtx-1_wdt.c:121:8-24: WARNING: mtx1_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mv64x60_wdt.c:136:8-24: WARNING: mv64x60_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/nuc900_wdt.c:134:8-24: WARNING: nuc900wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/nv_tco.c:164:8-24: WARNING: nv_tco_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pc87413_wdt.c:289:8-24: WARNING: pc87413_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd.c:698:8-24: WARNING: pcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd.c:737:8-24: WARNING: pcwd_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_pci.c:581:8-24: WARNING: pcipcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_pci.c:623:8-24: WARNING: pcipcwd_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_usb.c:488:8-24: WARNING: usb_pcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_usb.c:527:8-24: WARNING: usb_pcwd_temperature_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pika_wdt.c:121:8-24: WARNING: pikawdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pnx833x_wdt.c:119:8-24: WARNING: pnx833x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/rc32434_wdt.c:153:8-24: WARNING: rc32434_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/rdc321x_wdt.c:145:8-24: WARNING: rdc321x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/riowd.c:79:1-17: WARNING: riowd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sa1100_wdt.c:62:8-24: WARNING: sa1100dog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc60xxwdt.c:211:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc7240_wdt.c:139:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc8360.c:274:8-24: WARNING: sbc8360_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc_epx_c3.c:81:8-24: WARNING: epx_c3_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc_fitpc2_wdt.c:78:8-24: WARNING: fitpc2_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sb_wdog.c:108:1-17: WARNING: sbwdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sc1200wdt.c:181:8-24: WARNING: sc1200wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sc520_wdt.c:261:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sch311x_wdt.c:319:8-24: WARNING: sch311x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/scx200_wdt.c:105:8-24: WARNING: scx200_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/smsc37b787_wdt.c:369:8-24: WARNING: wb_smsc_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/w83877f_wdt.c:227:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/w83977f_wdt.c:301:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wafer5823wdt.c:200:8-24: WARNING: wafwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/watchdog_dev.c:828:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdrtas.c:379:8-24: WARNING: wdrtas_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdrtas.c:445:8-24: WARNING: wdrtas_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt285.c:104:1-17: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt977.c:276:8-24: WARNING: wdt977_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt.c:424:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt.c:484:8-24: WARNING: wdt_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt_pci.c:464:8-24: WARNING: wdtpci_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt_pci.c:527:8-24: WARNING: wdtpci_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	net/batman-adv/log.c:105:1-17: WARNING: batadv_log_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/control.c:57:7-23: WARNING: snd_ctl_f_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/rawmidi.c:385:7-23: WARNING: snd_rawmidi_f_ops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/seq/seq_clientmgr.c:310:7-23: WARNING: snd_seq_f_ops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/timer.c:1428:7-23: WARNING: snd_timer_f_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.

One can also recheck/review the patch by generating it with explanatory comments included:

	$ make coccicheck MODE=patch COCCI=scripts/coccinelle/api/stream_open.cocci SPFLAGS="-D explain"

(*) This second group also contains cases with read/write deadlocks that
stream_open.cocci does not yet detect, but which are still valid to convert to
stream_open since ppos is not used. For example drivers/pci/switch/switchtec.c
calls wait_for_completion_interruptible() in its .read, but stream_open.cocci
currently detects only "wait_event*" as blocking.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James R. Van Zandt" <jrv@vanzandt.mv.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Harald Welte <laforge@gnumonks.org>
Acked-by: Lubomir Rintel <lkundrak@v3.sk> [scr24x_cs]
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Johan Hovold <johan@kernel.org>
Cc: David Herrmann <dh.herrmann@googlemail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Jean Delvare <jdelvare@suse.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>	[watchdog/* hwmon/*]
Cc: Rudolf Marek <r.marek@assembler.cz>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com>
Acked-by: Logan Gunthorpe <logang@deltatee.com> [drivers/pci/switch/switchtec]
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [drivers/pci/switch/switchtec]
Cc: Benson Leung <bleung@chromium.org>
Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> [platform/chrome]
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> [rtc/*]
Cc: Mark Brown <broonie@kernel.org>
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: bcm-kernel-feedback-list@broadcom.com
Cc: Wan ZongShun <mcuos.com@gmail.com>
Cc: Zwane Mwaikambo <zwanem@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <a@unstable.cc>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
2019-05-06 17:46:41 +03:00
Parav Pandit
943bd984b1 RDMA/core: Allow detaching gid attribute netdevice for RoCE
When there is active traffic through a GID, a QP/AH holds a reference to
this GID entry. A RoCE GID entry holds a reference to its attached
netdevice. Because of this, when the netdevice is deleted by an admin user,
its refcount is not dropped.

Therefore, while deleting a RoCE GID, wait for all users of the GID
attribute's netdev to finish accessing the netdev in RCU context. Once all
users are done accessing it, release the netdev refcount.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:03 -03:00
Parav Pandit
adb4a57a7a RDMA/cma: Use rdma_read_gid_attr_ndev_rcu to access netdev
To access the netdevice of the GID attribute, use an existing API
rdma_read_gid_attr_ndev_rcu().

This further reduces dependency on open access to netdevice of GID
attribute.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:03 -03:00
Parav Pandit
a70c07397f RDMA: Introduce and use GID attr helper to read RoCE L2 fields
Instead of RoCE drivers figuring out vlan, smac fields while working on
QP/AH, provide a helper routine to read the L2 fields such as vlan_id and
source mac address.

This moves logic from mlx5 driver to core for wider usage for RoCE ports.

This is a preparation patch to allow detaching netdev in subsequent patch.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:02 -03:00
Parav Pandit
8f97486024 IB/cm: Reduce dependency on gid attribute ndev check
GID type to path record type conversion can be done directly based on port
type and gid attribute type.  There is no need to derive it indirectly
from the GID attribute's ndev field.
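
The direct conversion described above can be sketched as a small pure function. This is a minimal illustration, assuming a port is either plain InfiniBand or RoCE and that a RoCE GID is either v1 or UDP-encapsulated (v2); all enum and function names here are hypothetical and do not mirror the kernel's identifiers.

```c
#include <assert.h>

/* Hypothetical model types; the kernel uses its own enums for these. */
enum port_proto_model { PORT_PROTO_IB, PORT_PROTO_ROCE };
enum gid_type_model { GID_TYPE_ROCE_V1, GID_TYPE_ROCE_UDP_ENCAP };
enum path_rec_type_model { PATH_REC_IB, PATH_REC_ROCE_V1, PATH_REC_ROCE_V2 };

/* Pick the path record type from the port protocol and the GID type
 * alone, without consulting any netdevice attached to the GID. */
static enum path_rec_type_model
path_rec_type_for(enum port_proto_model proto, enum gid_type_model gid_type)
{
	if (proto != PORT_PROTO_ROCE)
		return PATH_REC_IB;

	return gid_type == GID_TYPE_ROCE_UDP_ENCAP ? PATH_REC_ROCE_V2
						   : PATH_REC_ROCE_V1;
}
```

Because the function takes no netdevice argument, the result stays well defined even when the netdevice has been detached from the GID attribute.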

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:02 -03:00
Kamal Heib
dd05cb828d RDMA: Get rid of iw_cm_verbs
Integrate iw_cm_verbs data members into the ib_device_ops and ib_device
structs. This is done to achieve the following:

1) Avoid memory-related bugs during error unwind
2) Make the code cleaner
3) Reduce code duplication

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:56:56 -03:00
Parav Pandit
eb15c78b05 RDMA/core: Do not invoke init_port on compat devices
The driver interface cannot manipulate the sysfs of the compat device,
only of the full device so we must avoid calling the driver sysfs APIs on
compat devices.

This prevents an oops:

 Call Trace:
 dump_stack+0x5a/0x73
 kobject_init+0x74/0x80
 kobject_init_and_add+0x35/0xb0
 hfi1_create_port_files+0x6e/0x3c0 [hfi1]
 ib_setup_port_attrs+0x43b/0x560 [ib_core]
 add_one_compat_dev+0x16a/0x230 [ib_core]
 rdma_dev_init_net+0x110/0x160 [ib_core]
 ops_init+0x38/0xf0
 setup_net+0xcf/0x1e0
 copy_net_ns+0xb7/0x130
 create_new_namespaces+0x11a/0x1b0
 unshare_nsproxy_namespaces+0x55/0xa0
 ksys_unshare+0x1a7/0x340
 __x64_sys_unshare+0xe/0x20
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5417783eab ("RDMA/core: Support core port attributes in non init_net")
Reported-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:41:23 -03:00
Artemy Kovalyov
1a418f7764 IB/core: Set qp->real_qp before it may be accessed
real_qp should be initialized before ib_destroy_qp() is called.
ib_destroy_qp() may be called in the error flow if ib_create_qp_security()
failed.

Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:17:45 -03:00
Shamir Rabinovitch
4f33dd41b2 RDMA/uverbs: Initialize uverbs_attr_bundle ucontext in ib_uverbs_get_context
ib_uverbs_get_context does not have a uobject, so it does not call
rdma_lookup_get_uobject, which is used to set up the uverbs_attr_bundle
ucontext. For ib_uverbs_get_context we need to set this up manually before
we send the uverbs_attr_bundle down to the driver layer.

This completes the change that was done in commit 70f06b26f0 ("IB:
ucontext should be set properly for all cmd & ioctl paths").

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:09:25 -03:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Gal Pressman
f89adedaf3 RDMA/uverbs: Initialize udata struct on destroy flows
The cited commit introduced the udata parameter to different destroy flows,
but the uapi method definition does not have udata (i.e. the has_udata flag
is not set). As a result, an uninitialized udata struct is being passed
down to the driver callbacks.

Fix that by clearing the driver udata even in cases where the has_udata flag
is not set.
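
The shape of the fix can be sketched in userspace as "zero first, fill only when present". This is a simplified model under stated assumptions: the struct layouts and the prepare_driver_udata() helper are hypothetical and only stand in for the real uverbs structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the driver-visible udata struct. */
struct udata_model {
	const void *inbuf;
	void *outbuf;
	size_t inlen;
	size_t outlen;
};

/* Hypothetical stand-in for a uapi method definition. */
struct method_model {
	int has_udata;
};

/* Always clear the struct, so a method without udata still hands the
 * driver a well-defined (empty) udata rather than stack garbage. */
static void prepare_driver_udata(struct udata_model *udata,
				 const struct method_model *method,
				 const void *in, size_t inlen,
				 void *out, size_t outlen)
{
	assert(udata && method);
	memset(udata, 0, sizeof(*udata));
	if (!method->has_udata)
		return;

	udata->inbuf = in;
	udata->inlen = inlen;
	udata->outbuf = out;
	udata->outlen = outlen;
}
```

With the unconditional memset, a driver callback can check inlen and outlen safely regardless of how the method was declared.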

Fixes: c4367a2635 ("IB: Pass uverbs_attr_bundle down ib_x destroy path")
Cc: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Co-developed-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-02 17:07:02 -03:00
Shiraz Saleem
7872168a83 RDMA/umem: Handle page combining avoidance correctly in ib_umem_add_sg_table()
The flag update_cur_sg tracks whether contiguous pages from a new set of
page_list pages can be merged into the SGE passed into
ib_umem_add_sg_table(). If this flag is true, but the total segment length
exceeds the max_seg_size supported by HW, we avoid combining to this SGE
and move to a new SGE (x) and merge 'len' pages to it. However, if i <
npages, the next iteration can incorrectly merge 'len' contiguous pages
into x instead of into a new SGE since update_cur_sg is still true.

Reset update_cur_sg to false always after the check to merge pages into
the first SGE passed in to ib_umem_add_sg_table().  Also, prevent a new
SGE's segment length from ever exceeding HW max_seg_sz.

This causes a crash on hfi1, where max_seg_sz defaults to 64K. Because of
the above bug, unfolding SGEs in __ib_umem_release points to a bad page
pointer.
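
The corrected merging rule can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel function: it counts whole pages instead of bytes, represents each SGE as a (first_pfn, npages) pair, and uses hypothetical names throughout. The two points it tries to show are that the "merge into the passed-in segment" flag is cleared after the first check, and that no new segment grows beyond the maximum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one scatter-gather entry, in units of pages. */
struct seg_model {
	unsigned long first_pfn;
	unsigned long npages;
};

/* Append the pages in pfns[] to segs[], which already holds nsegs
 * entries. Returns the new number of entries. update_cur_seg says
 * whether the first run may be merged into the last existing entry. */
static size_t add_pages_model(struct seg_model *segs, size_t nsegs, size_t cap,
			      const unsigned long *pfns, size_t npages,
			      unsigned long max_seg_pages, int update_cur_seg)
{
	size_t i = 0;

	assert(max_seg_pages > 0);
	while (i < npages) {
		unsigned long len = 1;

		/* Contiguous run starting at pfns[i], capped at the maximum
		 * segment size so a new segment can never exceed it. */
		while (i + len < npages && len < max_seg_pages &&
		       pfns[i + len] == pfns[i] + len)
			len++;

		if (update_cur_seg && nsegs > 0 &&
		    segs[nsegs - 1].first_pfn + segs[nsegs - 1].npages == pfns[i] &&
		    segs[nsegs - 1].npages + len <= max_seg_pages) {
			/* The run fits: extend the passed-in segment. */
			segs[nsegs - 1].npages += len;
		} else {
			/* Otherwise start a new segment for this run. */
			assert(nsegs < cap);
			segs[nsegs].first_pfn = pfns[i];
			segs[nsegs].npages = len;
			nsegs++;
		}

		/* Always reset after the first check, so later runs are
		 * never merged into a segment they do not belong to. */
		update_cur_seg = 0;
		i += len;
	}
	return nsegs;
}
```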

 TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093
 BUG: Bad page state in process ib_write_bw  pfn:7ebca0
 page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1
 flags: 0x17ffffc0000000()
 raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000
 raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
 page dumped because: nonzero mapcount
 CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G    B             5.1.0-rc4 #1
 Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
 Call Trace:
  dump_stack+0x5a/0x73
  bad_page+0xf5/0x10f
  free_pcppages_bulk+0x62c/0x680
  free_unref_page+0x54/0x70
  __ib_umem_release+0x148/0x1a0 [ib_uverbs]
  ib_umem_release+0x22/0x80 [ib_uverbs]
  rvt_dereg_mr+0x67/0xb0 [rdmavt]
  ib_dereg_mr_user+0x37/0x60 [ib_core]
  destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
  uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
  uobj_destroy+0x4d/0x60 [ib_uverbs]
  __uobj_get_destroy+0x33/0x50 [ib_uverbs]
  __uobj_perform_destroy+0xa/0x30 [ib_uverbs]
  ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs]
  ib_uverbs_write+0x3e1/0x500 [ib_uverbs]
  vfs_write+0xad/0x1b0
  ksys_write+0x5a/0xd0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d10bcf947a ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-02 16:50:12 -03:00
Gal Pressman
923abb9d79 RDMA/core: Introduce RDMA subsystem ibdev_* print functions
Similar to the dev/netdev/etc. printk helpers, add standard printk helpers
for the RDMA subsystem.

Example output:
efa 0000:00:06.0 efa_0: Hello World!
efa_0: Hello World! (no parent device set)
(NULL ib_device): Hello World! (ibdev is NULL)
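
The three prefix forms in the example output can be reproduced with a small userspace formatter. This is only a sketch of the selection logic, assuming the prefix depends on whether the ib_device is NULL and whether it has a parent device; the struct and function names are hypothetical, and snprintf stands in for the kernel's printk machinery.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an ib_device and its optional parent. */
struct ibdev_model {
	const char *name;          /* e.g. "efa_0" */
	const char *parent_driver; /* NULL when no parent device is set */
	const char *parent_name;   /* e.g. "0000:00:06.0" */
};

/* Format msg with the prefix matching the device state. Returns the
 * snprintf result (length that would have been written). */
static int ibdev_format_model(char *buf, size_t len,
			      const struct ibdev_model *ibdev, const char *msg)
{
	assert(buf && len > 0 && msg);

	if (ibdev && ibdev->parent_driver && ibdev->parent_name)
		return snprintf(buf, len, "%s %s %s: %s", ibdev->parent_driver,
				ibdev->parent_name, ibdev->name, msg);
	if (ibdev)
		return snprintf(buf, len, "%s: %s", ibdev->name, msg);

	return snprintf(buf, len, "(NULL ib_device): %s", msg);
}
```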

Cc: Jason Baron <jbaron@akamai.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-05-01 12:29:28 -04:00
Linus Torvalds
6a5c5d26c4 rdma: fix build errors on s390 and MIPS due to bad ZERO_PAGE use
The parameter to ZERO_PAGE() was wrong, but since all architectures
except for MIPS and s390 ignore it, it wasn't noticed until 0-day
reported the build error.

Fixes: 67f269b37f ("RDMA/ucontext: Fix regression with disassociate")
Cc: stable@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-29 09:48:53 -07:00
Linus Torvalds
14f974d7f0 5.1 Third RC pull request
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "One core bug fix and a few driver ones

   - FRWR memory registration for hfi1/qib didn't work with some
     iovas causing a NFSoRDMA failure regression due to a fix in the NFS
     side

   - A command flow error in mlx5 allowed user space to send a corrupt
     command (and also smash the kernel stack we've since learned)

   - Fix a regression and some bugs with device hot unplug that was
     discovered while reviewing Andrea's patches

   - hns has a failure if the user asks for certain QP configurations"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/hns: Bugfix for mapping user db
  RDMA/ucontext: Fix regression with disassociate
  RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
  RDMA/mlx5: Do not allow the user to write to the clock page
  IB/mlx5: Fix scatter to CQE in DCT QP creation
  IB/rdmavt: Fix frwr memory registration
2019-04-28 10:00:45 -07:00
Johannes Berg
8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
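
As a rough illustration of how the four options compose, the sketch below
models them as bit flags in plain userspace C. The VALIDATE_* names,
struct attr_desc and attr_ok() are hypothetical and only mirror the
description above; they are not the kernel's definitions, and the
message-level TRAILING check is left out of the per-attribute helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, one per strictness option described above. */
enum validate_flags {
	VALIDATE_LIBERAL      = 0,
	VALIDATE_TRAILING     = 1 << 0, /* no trailing data after attributes */
	VALIDATE_MAXTYPE      = 1 << 1, /* reject attrs > max known type */
	VALIDATE_UNSPEC       = 1 << 2, /* reject attrs without a policy entry */
	VALIDATE_STRICT_ATTRS = 1 << 3, /* attribute size must match exactly */
};

/* The old *_strict() behaviour: TRAILING and MAXTYPE only. */
#define VALIDATE_DEPRECATED_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE)
/* The default for new users: everything. */
#define VALIDATE_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE | \
			 VALIDATE_UNSPEC | VALIDATE_STRICT_ATTRS)

static_assert((VALIDATE_STRICT & VALIDATE_DEPRECATED_STRICT) ==
	      VALIDATE_DEPRECATED_STRICT,
	      "full strictness must include the deprecated strict checks");

/* Hypothetical description of one parsed attribute. */
struct attr_desc {
	unsigned int type;
	size_t len;       /* payload length found in the message */
	size_t expected;  /* payload length the policy expects */
	bool has_policy;  /* false models an NLA_UNSPEC policy entry */
};

/* Return true if the attribute passes under the given strictness flags. */
bool attr_ok(const struct attr_desc *a, unsigned int max_type,
	     unsigned int flags)
{
	if (a->type > max_type)
		return !(flags & VALIDATE_MAXTYPE);
	if (!a->has_policy)
		return !(flags & VALIDATE_UNSPEC);
	if (flags & VALIDATE_STRICT_ATTRS)
		return a->len == a->expected;
	return a->len >= a->expected;
}
```

With these definitions an over-long attribute still passes under
VALIDATE_DEPRECATED_STRICT but is rejected under VALIDATE_STRICT.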

Additionally, it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then, going
forward, prevent forgetting attribute entries. The same can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().
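
The compatibility concern can be sketched in userspace C as follows. The
three NLA_* macros repeat the values from the uapi header linux/netlink.h
so the example builds without kernel headers; DEMO_ATTR_LINKS and the two
matches_*() helpers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values as in the uapi header linux/netlink.h. */
#define NLA_F_NESTED        (1 << 15)
#define NLA_F_NET_BYTEORDER (1 << 14)
#define NLA_TYPE_MASK       (~(NLA_F_NESTED | NLA_F_NET_BYTEORDER))

static_assert((NLA_F_NESTED & NLA_TYPE_MASK) == 0,
	      "the type mask must clear the nested flag");

/* Hypothetical attribute type, for illustration only. */
#define DEMO_ATTR_LINKS 3

/* Fragile: compares the raw nla_type, so it stops matching once the
 * kernel starts setting NLA_F_NESTED on the attribute. */
int matches_raw(uint16_t nla_type)
{
	return nla_type == DEMO_ATTR_LINKS;
}

/* Robust: masks out the flag bits before comparing. */
int matches_masked(uint16_t nla_type)
{
	return (nla_type & NLA_TYPE_MASK) == DEMO_ATTR_LINKS;
}
```

An application written like matches_raw() is the kind that would break if
the flag were added everywhere at once.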

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
David S. Miller
8b44836583 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easy cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25 23:52:29 -04:00
Matthew Wilcox
b9b0f34531 uverbs: Convert idr to XArray
The word 'idr' is scattered throughout the API, so I haven't changed it,
but the 'idr' variable is now an XArray.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-25 12:27:11 -03:00
Jason Gunthorpe
449a224c10 Merge branch 'rdma_mmap' into rdma.git for-next
Jason Gunthorpe says:

====================
Upon review it turns out there are some long standing problems in BAR
mapping area:
 * BAR pages intended for read-only can be switched to writable via mprotect.
 * Missing use of rdma_user_mmap_io for the mlx5 clock BAR page.
 * Disassociate causes SIGBUS when touching the pages.
 * CPU pages are being mapped through to the process via remap_pfn_range
   instead of the more appropriate vm_insert_page, causing weird behaviors
   during disassociation.

This series adds the missing VM_* flag manipulation, adds faulting a zero
page for disassociation and revises the CPU page mappings to use
vm_insert_page.
====================

For dependencies this branch is based on for-rc from
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

* branch 'rdma_mmap':
  RDMA: Remove rdma_user_mmap_page
  RDMA/mlx5: Use get_zeroed_page() for clock_info
  RDMA/ucontext: Fix regression with disassociate
  RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
  RDMA/mlx5: Do not allow the user to write to the clock page

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 16:20:34 -03:00
Jason Gunthorpe
4eb6ab13b9 RDMA: Remove rdma_user_mmap_page
Upon further research drivers that want this should simply call the core
function vm_insert_page(). The VMA holds a reference on the page and it
will be automatically freed when the last reference drops. No need for
disassociate to sequence the cleanup.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 16:18:36 -03:00
Jason Gunthorpe
67f269b37f RDMA/ucontext: Fix regression with disassociate
When this code was consolidated the intention was that the VMA would
become backed by anonymous zero pages after the zap_vma_pte - however this
very subtly relied on setting the vm_ops = NULL and clearing the VM_SHARED
bits to transform the VMA into an anonymous VMA. Since the vm_ops was
removed this broke.

Now userspace gets a SIGBUS if it touches the vma after disassociation.

Instead of converting the VMA to anonymous provide a fault handler that
puts a zero'd page into the VMA when user-space touches it after
disassociation.

Cc: stable@vger.kernel.org
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Fixes: 5f9794dc94 ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 13:32:25 -03:00
Parav Pandit
5d7ed2f27b RDMA/cma: Consider scope_id while binding to ipv6 ll address
When two netdevs have the same link local address (such as vlan and non
vlan), two rdma cm listen ids should be able to bind to the following
different addresses.

listener-1: addr=lla, scope_id=A, port=X
listener-2: addr=lla, scope_id=B, port=X

However, while comparing the addresses only addr and port are considered,
due to which the 2nd listener fails to listen.

In the example below with two listeners, the 2nd listener fails with an
address in use error.

$ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1 -p 4545&

$ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1.200 -p 4545
rdma_bind_addr: Address already in use

To overcome this, consider the scope_id as well, since it is part of what
makes an IPv6 link local address unique.
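
A minimal userspace sketch of such a comparison is shown below. It is not
the rdma_cm code; make_addr() and bind_addr_conflicts() are hypothetical
helpers, and the scope ids in the usage note are arbitrary interface
indexes.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Build a sockaddr_in6 from text, port and scope id (interface index). */
struct sockaddr_in6 make_addr(const char *text, uint16_t port,
			      uint32_t scope_id)
{
	struct sockaddr_in6 sa;
	int rc;

	memset(&sa, 0, sizeof(sa));
	sa.sin6_family = AF_INET6;
	sa.sin6_port = htons(port);
	sa.sin6_scope_id = scope_id;
	rc = inet_pton(AF_INET6, text, &sa.sin6_addr);
	assert(rc == 1); /* the sketch only accepts valid literals */
	return sa;
}

/* Two listeners conflict only if port and address match and, for link
 * local addresses, the scope id matches as well. */
bool bind_addr_conflicts(const struct sockaddr_in6 *a,
			 const struct sockaddr_in6 *b)
{
	if (a->sin6_port != b->sin6_port)
		return false;
	if (memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(a->sin6_addr)) != 0)
		return false;
	if (IN6_IS_ADDR_LINKLOCAL(&a->sin6_addr) &&
	    a->sin6_scope_id != b->sin6_scope_id)
		return false;
	return true;
}
```

Under this rule, fe80::268a:7ff:feb3:d113 on port 4545 with scope id 2 and
the same address and port with scope id 3 do not conflict.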

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 10:39:16 -03:00
Parav Pandit
823b23da71 IB/core: Allow vlan link local address based RoCE GIDs
Whether the IPv6 link local address of a VLAN netdevice resembles the
default GID is irrelevant, because the VLAN link local GID is in a
different layer 2 domain.

Now that RoCE MAD packet processing and route resolution consider the
right GID index, there is no need for an unnecessary check which prevents
the addition of vlan based IPv6 link local GIDs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 10:08:15 -03:00
Parav Pandit
2e5b8a0116 RDMA/core: Add a netlink command to change net namespace of rdma device
Provide an option to change the net namespace of a rdma device through a
netlink command. When multiple rdma devices exist in a system, and when
containers are used, this will limit rdma device visibility to a specified
net namespace.

An example command to change net namespace of mlx5_1 device to the
previously created net namespace 'foo' is:

$ ip netns add foo
$ rdma dev set mlx5_1 netns foo

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 14:44:58 -03:00
Parav Pandit
decbc7a6b0 RDMA/core: Introduce a helper function to change net namespace of rdma device
Introduce a helper function that changes a rdma device's net namespace. It
performs a mini disable/enable sequence so that the device is visible only
in the assigned net namespace.

Device unregistration, device rename and device change net namespace
may be invoked concurrently.

(a) device unregistration needs to wait if a device change (rename or net
    namespace change) operation is in progress.
(b) device net namespace change should not proceed if the unregistration
    has started.
(c) while one cpu is changing device net namespace, other cpu should not
    be able to rename or change net namespace.

To address above concurrency,
(a) Use unreg_mutex to synchronize between ib_unregister_device() and net
    namespace change operation
(b) In cases where unregister_device() has started unregistration before
    change_netns got chance to acquire unreg_mutex, validate the refcount
    - if it dropped to zero, abort the net namespace change operation.

Finally use the helper function to change net namespace of ib device to
move the device back to init_net when such net is deleted.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 14:44:58 -03:00
Parav Pandit
3042492bd1 RDMA/core: Avoid freeing netdevs in disable_device()
So we can use the disable_device() helper while changing the net namespace
of the rdma device in a subsequent patch, move free_netdevs() out of it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 14:44:58 -03:00
Andrea Arcangeli
04f5866e41 coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, even though that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular, because the growsdown and growsup can move the
vm_start/vm_end, the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and replacing them with get_dump_page() for all binary formats,
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 09:46:05 -07:00
David Ahern
1550c17193 ipv4: Prepare rtable for IPv6 gateway
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
Shiraz Saleem
d0b5c01bb4 RDMA/umem: Use correct value for SG entries in sg_copy_to_buffer()
With page combining, the assumption that the number of SG entries in the
umem SGL equals the number of system pages in the umem no longer holds.

umem->sg_nents tracks the SG entries in umem SGL. Use it in
sg_pcopy_to_buffer() as opposed to ib_umem_num_pages(umem).

Fixes: d10bcf947a ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Reported-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
68e326dea1 RDMA: Handle SRQ allocations by IB/core
Move SRQ allocation from the drivers into IB/core.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
d345691471 RDMA: Handle AH allocations by IB/core
Simplify drivers by ensuring lifetime of ib_ah object. The changes
in .create_ah() go hand in hand with relevant update in .destroy_ah().

We will use this opportunity to convert .destroy_ah() so that it cannot
fail, as was suggested a long time ago, because there is nothing to do in
case of failure during destroy.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Jason Gunthorpe
feec576a6a IB: When attrs.udata/ufile is available use that instead of uobject
The ucontext and ufile should not be accessed via the uobject. All these
cases have an attrs, so use that instead.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
9e886b39a7 RDMA/nldev: Return device protocol
Add a new RDMA_NLDEV_ATTR_DEV_PROTOCOL attribute so that UDEV rules can
create stable IB device names based on the link type protocol.  The
assumption is that devices like mlx4, with duality in their link type under
one IB device struct, won't be allowed in the future.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
c87e65cfb9 RDMA/cm: Move debug counters to be under relevant IB device
The sysfs layout created by CM incorrectly presented RDMA devices with
InfiniBand link layer. Layout of such devices represents device tree of
connections. By moving CM statistics to be under the relevant port of the
IB device, we will fix the following issues:

 * Symlink name - It used device name instead of specific identifier.
 * Target location - It was supposed to point to PCI-ID/infiniband_cm/
   instead of PCI-ID/infiniband/
 * Target name - It created extra device file under already existing
   device folder, e.g. mlx5_0/mlx5_0
 * Crash during boot with RDMA persistent naming patches.

 sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0'
 CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178
 Call Trace:
  dump_stack+0xcc/0x180
  sysfs_warn_dup.cold.3+0x17/0x2d
  sysfs_do_create_link_sd.isra.2+0xd0/0xf0
  device_add+0x7cb/0x1450
  device_create_groups_vargs+0x1ae/0x220
  device_create+0x93/0xc0
  cm_add_one+0x38f/0xf60 [ib_cm]
  add_client_context+0x167/0x210 [ib_core]
  enable_device_and_get+0x230/0x3f0 [ib_core]
  ib_register_device+0x823/0xbf0 [ib_core]
  __mlx5_ib_add+0x45/0x150 [mlx5_ib]
  mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib]
  mlx5_add_device+0x130/0x3a0 [mlx5_core]
  mlx5_register_interface+0x1a9/0x270 [mlx5_core]
  do_one_initcall+0x14f/0x5de
  do_init_module+0x247/0x7c0
  load_module+0x4c2f/0x60d0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

After this change:
[leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_duplicates
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_retries

Fixes: 110cf374a8 ("infiniband: make cm_device use a struct device and not a kobject.")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:24 -03:00
Shiraz Saleem
d10bcf947a RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs
Combine contiguous regions of PAGE_SIZE pages into single scatter list
entry while building the scatter table for a umem. This minimizes the
number of the entries in the scatter list and reduces the DMA mapping
overhead, particularly with the IOMMU.

Set default max_seg_size in core for IB devices to 2G and do not combine
if we exceed this limit.
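
The combining rule can be sketched in plain C as below. This is not the
umem code; DEMO_PAGE_SIZE, struct demo_seg and combine_pages() are
hypothetical, and the caller is assumed to provide room for one segment
per page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for one scatter list entry. */
struct demo_seg {
	uint64_t addr;
	uint64_t len;
};

/* Merge physically contiguous pages into segments no longer than
 * max_seg_size. segs must have room for npages entries. Returns the
 * number of segments produced. */
size_t combine_pages(const uint64_t *page_addrs, size_t npages,
		     uint64_t max_seg_size, struct demo_seg *segs)
{
	size_t nsegs = 0;

	assert(max_seg_size >= DEMO_PAGE_SIZE);

	for (size_t i = 0; i < npages; i++) {
		if (nsegs > 0) {
			struct demo_seg *last = &segs[nsegs - 1];

			/* Extend the previous segment if this page follows
			 * it directly and the size limit still holds. */
			if (last->addr + last->len == page_addrs[i] &&
			    last->len + DEMO_PAGE_SIZE <= max_seg_size) {
				last->len += DEMO_PAGE_SIZE;
				continue;
			}
		}
		segs[nsegs].addr = page_addrs[i];
		segs[nsegs].len = DEMO_PAGE_SIZE;
		nsegs++;
	}
	return nsegs;
}
```

For pages at 0x1000, 0x2000, 0x3000 and 0x8000 with a 2G limit this yields
two segments, so the segment count and the page count differ.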

Also, purge npages in struct ib_umem as we now DMA map the umem SGL with
sg_nents and npage computation is not needed. Drivers should now be using
ib_umem_num_pages(), so fix the last stragglers.

Move npages tracking to ib_umem_odp as ODP drivers still need it.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Gal Pressman <galpress@amazon.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:24 -03:00
Leon Romanovsky
c7252a6532 RDMA/cm: Remove useless zeroing of static global variable
Static global variables are initialized to zero by C standard,
there is no need to zero them again.
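
A tiny C illustration of the rule, using hypothetical variable names
rather than the ones touched by the patch:

```c
#include <assert.h>

/* Both objects have static storage duration, so both start at zero. */
static int demo_hits;              /* implicitly zero-initialized */
static int demo_hits_explicit = 0; /* redundant initializer */

/* Returns 0 until something modifies the counters. */
int demo_sum(void)
{
	assert(demo_hits == demo_hits_explicit);
	return demo_hits + demo_hits_explicit;
}
```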

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-04 08:29:04 -03:00
Leon Romanovsky
061ccb52d2 RDMA/cma: Set proper port number as index
The conversion from IDR to XArray missed the fact that idr_alloc() returned
the index as its return value; this index was saved in the port variable and
used as the query index later on. This caused the following error.

 BUG: KASAN: use-after-free in cma_check_port+0x86a/0xa20 [rdma_cm]
 Read of size 8 at addr ffff888069fde998 by task ucmatose/387
 CPU: 3 PID: 387 Comm: ucmatose Not tainted 5.1.0-rc2+ #253
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
 Call Trace:
  dump_stack+0x7c/0xc0
  print_address_description+0x6c/0x23c
  ? cma_check_port+0x86a/0xa20 [rdma_cm]
  kasan_report.cold.3+0x1c/0x35
  ? cma_check_port+0x86a/0xa20 [rdma_cm]
  ? cma_check_port+0x86a/0xa20 [rdma_cm]
  cma_check_port+0x86a/0xa20 [rdma_cm]
  rdma_bind_addr+0x11bc/0x1b00 [rdma_cm]
  ? find_held_lock+0x33/0x1c0
  ? cma_ndev_work_handler+0x180/0x180 [rdma_cm]
  ? wait_for_completion+0x3d0/0x3d0
  ucma_bind+0x120/0x160 [rdma_ucm]
  ? ucma_resolve_addr+0x1a0/0x1a0 [rdma_ucm]
  ucma_write+0x1f8/0x2b0 [rdma_ucm]
  ? ucma_open+0x260/0x260 [rdma_ucm]
  vfs_write+0x157/0x460
  ksys_write+0xb8/0x170
  ? __ia32_sys_read+0xb0/0xb0
  ? trace_hardirqs_off_caller+0x5b/0x160
  ? do_syscall_64+0x18/0x3c0
  do_syscall_64+0x95/0x3c0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Allocated by task 381:
   __kasan_kmalloc.constprop.5+0xc1/0xd0
   cma_alloc_port+0x4d/0x160 [rdma_cm]
   rdma_bind_addr+0x14e7/0x1b00 [rdma_cm]
   ucma_bind+0x120/0x160 [rdma_ucm]
   ucma_write+0x1f8/0x2b0 [rdma_ucm]
   vfs_write+0x157/0x460
   ksys_write+0xb8/0x170
   do_syscall_64+0x95/0x3c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 381:
   __kasan_slab_free+0x12e/0x180
   kfree+0xed/0x290
   rdma_destroy_id+0x6b6/0x9e0 [rdma_cm]
   ucma_close+0x110/0x300 [rdma_ucm]
   __fput+0x25a/0x740
   task_work_run+0x10e/0x190
   do_exit+0x85e/0x29e0
   do_group_exit+0xf0/0x2e0
   get_signal+0x2e0/0x17e0
   do_signal+0x94/0x1570
   exit_to_usermode_loop+0xfa/0x130
   do_syscall_64+0x327/0x3c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: <syzbot+2e3e485d5697ea610460@syzkaller.appspotmail.com>
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Fixes: 638267537a ("cma: Convert portspace IDRs to XArray")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:20:32 -03:00
Shamir Rabinovitch
ff23dfa134 IB: Pass only ib_udata in function prototypes
Now that ib_udata is passed to all the driver's object create/destroy APIs,
the ib_udata will carry the ib_ucontext for every user command. There is
no need to also pass the ib_ucontext via the function prototypes.

Make ib_udata the only argument passed.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 15:00:47 -03:00
Shamir Rabinovitch
bdeacabd1a IB: Remove 'uobject->context' dependency in object destroy APIs
Now that we have the udata passed to all the ib_xxx object destroy APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers in the
ib_xxx->uobject->context.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:59:35 -03:00
Shamir Rabinovitch
c4367a2635 IB: Pass uverbs_attr_bundle down ib_x destroy path
The uverbs_attr_bundle with the ucontext is sent down to the drivers ib_x
destroy path as ib_udata. The next patch will use the ib_udata to free the
drivers destroy path from the dependency in 'uobject->context' as we
already did for the create path.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:57:35 -03:00
Shamir Rabinovitch
a6a3797df2 IB: Pass uverbs_attr_bundle down uobject destroy path
Pass uverbs_attr_bundle down the uobject destroy path. The next patch will
use this to eliminate the dependency of the drivers on ib_x->uobject
pointers.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:55:36 -03:00
Shamir Rabinovitch
70f06b26f0 IB: ucontext should be set properly for all cmd & ioctl paths
An attempt to use the commit below to initialize the ucontext for the
uobject destroy path has shown that the commit is incomplete.

Parts were reverted and the ucontext set up in the uverbs_attr_bundle was
moved to rdma_lookup_get_uobject which is called from the uobj_get_XXX
macros and rdma_alloc_begin_uobject which is called when uobject is
created.

Fixes: 3d9dfd0603 ("IB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows")
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:55:03 -03:00
David Ahern
3616d08bcb ipv6: Move ipv6 stubs to a separate header file
The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:53:45 -07:00
Parav Pandit
2b34c55802 RDMA/core: Add command to set ib_core device net namspace sharing mode
Add netlink command that enables/disables sharing rdma device among
multiple net namespaces.

Using rdma tool,
$rdma sys set netns shared (default mode)

When rdma subsystem netns mode is set to shared mode, rdma devices
will be accessible in all net namespaces.

Using rdma tool,
$rdma sys set netns exclusive

When rdma subsystem netns mode is set to exclusive mode, devices
will be accessible in only one net namespace at any given
point of time.

If any net namespace other than the default init_net exists while
executing this command, it will fail and the mode cannot be changed.

To change this mode, a netlink command is used instead of sysctl, because
a netlink command allows a module to be auto loaded.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
cb7e0e1305 RDMA/core: Add interface to read device namespace sharing mode
Add an interface via netlink command to query whether rdma devices are
shared among multiple net namespaces or not. When using the rdma tool, it
can be queried as,

$rdma system show netns
netns shared

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
37eeab55ae RDMA/core: Extend ib_device_get_by_index for net namespace
Extend ib_device_get_by_index() API to check device access for
net namespace for serving netlink commands.

Also enforce net ns check on dumpit commands which iterate over all
registered rdma devices and which don't call ib_device_get_by_index().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
41c6140189 RDMA: Check net namespace access for uverbs, umad, cma and nldev
Introduce an API rdma_dev_access_netns() to check whether a rdma device
can be accessed from the specified net namespace or not.
Use rdma_dev_access_netns() while opening character uverbs, umad network
device and also check while rdma cm_id binds to rdma device.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
a56bc45b27 RDMA/core: Add module param to disable device sharing among net ns
Add a module parameter to change the sharing mode of ib_core early in the
boot process. This parameter helps those systems where a modern,
up-to-date rdma tool (iproute2) package may not be available during the
kernel upgrade cycle.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
5417783eab RDMA/core: Support core port attributes in non init_net
Now that sysfs compatibility layer for non init_net exists, add core port
attributes such as pkey and gid table to non init_net ns.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
4e0f7b9070 RDMA/core: Implement compat device/sysfs tree in net namespace
Implement compatibility layer sysfs entries of ib_core so that non
init_net net namespaces can also discover rdma devices.

Each non init_net net namespace has ib_core_device created in it.
Such ib_core_device sysfs tree resembles rdma devices found in
init_net namespace.

This allows discovering rdma devices in multiple non init_net net
namespaces via sysfs entries and is helpful to rdma-core userspace.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
62dfa7955e RDMA/core: Restrict sysfs entries view to init_net
This is a preparation patch to provide isolation of rdma device in a
network namespace.

As a first step, make the rdma device visible only in the init net
namespace. A subsequent patch will enable rdma device visibility back in
multiple net namespaces using a compat ib_core_device device/sysfs tree.

Given that the IB subsystem depends on the net stack, it needs to be
initialized after netdev, and since it supports devices, it needs to be
initialized before the device subsystem; therefore, change the initcall
sequence to fs_initcall, so that when ib_core is compiled into the kernel
image, the right init sequence is followed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Parav Pandit
cebe556bd7 RDMA/core: Introduce ib_core_device to hold device
In order to support sysfs entries in multiple net namespaces for an rdma
device, introduce an ib_core_device whose scope is limited to holding the
core device and per port sysfs related entries.

This is a preparation patch so that a subsequent patch can create an
ib_core_device in each net namespace, all of which share one ib_device.

(a) Move sysfs specific fields to ib_core_device.
(b) Make sysfs and device life cycle related routines work on
    ib_core_device.
(c) Introduce and use rdma_init_coredev() helper to initialize
    coredev fields.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:52:02 -03:00
Bart Van Assche
0080aed4e4 RDMA/uverbs: Allow the compiler to verify declaration and definition consistency
This patch prevents sparse from reporting the following warnings:

drivers/infiniband/core/uverbs_std_types_flow_action.c:442:30: warning: symbol 'uverbs_def_obj_flow_action' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_dm.c:112:30: warning: symbol 'uverbs_def_obj_dm' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_counters.c:153:30: warning: symbol 'uverbs_def_obj_counters' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_mr.c:213:30: warning: symbol 'uverbs_def_obj_mr' was not declared. Should it be static?

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 0bd01f3d09 ("RDMA/uverbs: Require all objects to have a driver destroy function")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 10:22:48 -03:00
Bart Van Assche
2dcdebff5e RDMA/uverbs: Annotate uverbs_request_next_ptr() return value as a __user pointer
This patch prevents sparse from complaining about a mismatch between the
returned value and the function return type.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: c3bea3d2dc ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 10:22:48 -03:00
Bart Van Assche
259e66bcdf RDMA/uverbs: Add a __user annotation to a pointer
This patch prevents sparse and smatch from reporting the following:

  warning: cast removes address space of expression

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 3a6532c9af ("RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 10:22:48 -03:00
Ira Weiny
2ccfbb70c2 IB/MAD: Add SMP details to MAD tracing
Decode more information from the packet and include it in the trace.

Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:52:01 -03:00
Ira Weiny
056533192a IB/UMAD: Add umad trace points
Trace MADs going to/from user space.

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:52:01 -03:00
Ira Weiny
0e65bae205 IB/MAD: Add agent trace points
Trace agent details when agents are [un]registered.  In addition, report
agent details on send/recv.

Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:52:00 -03:00
Ira Weiny
821bf1de45 IB/MAD: Add recv path trace point
Trace received MAD details.

Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:52:00 -03:00
Ira Weiny
4d60cad5db IB/MAD: Add send path trace points
Use the standard Linux trace mechanism to trace MADs being sent.  Four
trace points are added: when the MAD is posted to the qp, when the MAD is
completed, when a MAD is resent, and when the MAD completes in error.

Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:52:00 -03:00
Ira Weiny
4ae2744410 IB/core: Ensure an invalidate_range callback on ODP MR
No device supports ODP MR without an invalidate_range callback.

Warn on any device which attempts to support ODP without supplying
this callback.

Then we can remove the checks for the callback within the code.

This stems from the discussion

https://www.spinics.net/lists/linux-rdma/msg76460.html

...which concluded this code was no longer necessary.

Acked-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 16:39:40 -03:00
Matthew Wilcox
638267537a cma: Convert portspace IDRs to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 12:00:15 -03:00
Matthew Wilcox
81cc440883 ucm: Convert ctx_id_table to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 11:50:29 -03:00
Matthew Wilcox
8e5a9d61e2 ib core: Convert query_idr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 11:47:05 -03:00
Matthew Wilcox
ae78ff3a0f RDMA/cm: Convert local_id_table to XArray
Also introduce cm_local_id() to reduce the amount of boilerplate when
converting a local ID to an XArray index.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 11:44:22 -03:00
Matthew Wilcox
949a237046 IB/mad: Convert ib_mad_clients to XArray
Pull the allocation function out into its own function to reduce the
length of ib_register_mad_agent() a little and keep all the allocation
logic together.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 11:36:20 -03:00
Erez Alfasi
19b1a294b0 RDMA: Use __packed annotation instead of __attribute__ ((packed))
The "__attribute__" set of shortcut macros was standardized, making code
potentially more portable and consistent, back in v2.6.21 by commit
82ddcb040 ("[PATCH] extend the set of "__attribute__" shortcut macros").
Moreover, nowadays checkpatch.pl warns about using __attribute__((packed))
instead of __packed.
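
A minimal user-space sketch of the equivalence, assuming a GCC or Clang
compatible compiler. The struct names and the fallback definition of
__packed are hypothetical and only mirror the kernel's shortcut macro; they
are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fallback; the kernel provides its own __packed definition. */
#ifndef __packed
#define __packed __attribute__((packed))
#endif

/* Hypothetical wire header, annotated the old way. */
struct hdr_old {
	uint8_t  type;
	uint32_t qpn;
} __attribute__ ((packed));

/* The same layout annotated with the shortcut macro. */
struct hdr_new {
	uint8_t  type;
	uint32_t qpn;
} __packed;

/* Both spellings remove the padding after 'type'. */
static_assert(sizeof(struct hdr_old) == sizeof(struct hdr_new),
	      "both annotations must give the same layout");
static_assert(offsetof(struct hdr_new, qpn) == 1,
	      "qpn directly follows the one-byte type field");
```

Only the spelling changes; the resulting layout is expected to be identical.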

This patch converts all the "__attribute__ ((packed))" annotations to
"__packed" within the RDMA subsystem.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 21:14:12 -03:00
Linus Torvalds
ea295481b6 XArray updates for 5.1-rc1
Merge tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax

Pull XArray updates from Matthew Wilcox:
 "This pull request changes the xa_alloc() API. I'm only aware of one
  subsystem that has started trying to use it, and we agree on the fixup
  as part of the merge.

  The xa_insert() error code also changed to match xa_alloc() (EEXIST to
  EBUSY), and I added xa_alloc_cyclic(). Beyond that, the usual
  bugfixes, optimisations and tweaking.

  I now have a git tree with all users of the radix tree and IDR
  converted over to the XArray that I'll be feeding to maintainers over
  the next few weeks"

* tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax:
  XArray: Fix xa_reserve for 2-byte aligned entries
  XArray: Fix xa_erase of 2-byte aligned entries
  XArray: Use xa_cmpxchg to implement xa_reserve
  XArray: Fix xa_release in allocating arrays
  XArray: Mark xa_insert and xa_reserve as must_check
  XArray: Add cyclic allocation
  XArray: Redesign xa_alloc API
  XArray: Add support for 1s-based allocation
  XArray: Change xa_insert to return -EBUSY
  XArray: Update xa_erase family descriptions
  XArray tests: RCU lock prohibits GFP_KERNEL
2019-03-11 20:06:18 -07:00
John Hubbard
0c507d8f84 RDMA/umem: Revert broken 'off by one' fix
The previous attempted bug fix overlooked the fact that
ib_umem_odp_map_dma_single_page() was doing a put_page() upon hitting an
error. So there was not really a bug there.

Therefore, this reverts the off-by-one change, but keeps the change to use
release_pages() in the error path.

Fixes: 75a3e6a3c1 ("RDMA/umem: minor bug fix in error handling path")
Suggested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-06 14:42:37 -04:00
John Hubbard
75a3e6a3c1 RDMA/umem: minor bug fix in error handling path
1. Bug fix: fix an off by one error in the code that cleans up if it fails
   to dma-map a page, after having done a get_user_pages_remote() on a
   range of pages.

2. Refinement: for that same cleanup code, release_pages() is better than
   put_page() in a loop.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-04 16:41:31 -04:00
Leon Romanovsky
bb61845154 RDMA/uverbs: Don't do double free of allocated PD
There is no need to call kfree(pd) because ib_dealloc_pd() internally
frees PD.

Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-25 15:00:48 -07:00
Leon Romanovsky
a2a074ef39 RDMA: Handle ucontext allocations by IB/core
Following the PD conversion patch, do the same for ucontext allocations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-22 14:11:37 -07:00
Dan Carpenter
afc1990e08 RDMA/core: Fix a WARN() message
The first parameter of WARN_ONCE() is a condition, then following
parameters are the message.  In this case, we left out the condition so it
will just print the ops->type string.
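
The failure mode can be sketched in user space as follows. warn_once(),
warn_emit() and the message text are hypothetical stand-ins, not the
kernel's WARN_ONCE() implementation; the sketch only assumes that the first
macro argument is the condition and the rest is a printf-style message.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

static char last_warning[128];	/* most recent message, kept for inspection */

static void warn_emit(const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	va_start(ap, fmt);
	vsnprintf(last_warning, sizeof(last_warning), fmt, ap);
	va_end(ap);
	fprintf(stderr, "WARNING: %s\n", last_warning);
}

/* First argument is the condition, the remaining ones are the message. */
#define warn_once(cond, ...)				\
	do {						\
		static bool warned;			\
		if ((cond) && !warned) {		\
			warned = true;			\
			warn_emit(__VA_ARGS__);		\
		}					\
	} while (0)

/* Buggy: the format string is taken as the (always true) condition, so
 * 'type' becomes the format and only the type string is printed. */
static void link_register_buggy(const char *type)
{
	warn_once("duplicate rdma link type %s", type);
}

/* Fixed: an explicit condition comes first. */
static void link_register_fixed(const char *type, bool duplicate)
{
	warn_once(duplicate, "duplicate rdma link type %s", type);
}
```

With the condition in place, the message is printed only when a duplicate
is actually detected, and with its full text.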

Fixes: 3856ec4b93 ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-22 11:53:37 -07:00
Moni Shoua
4438ee3f13 IB/core: Abort page fault handler silently during owning process exit
It is possible that during page fault handling, the process that owns
the MR is terminating. The indication for this is a failure to get the
task_struct or to take a reference on the mm_struct. In this case just
abort the page-fault handler with an error but without a warning to the
kernel log.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 16:32:45 -07:00
Leon Romanovsky
25fd08eb2b RDMA/uverbs: Store PR pointer before it is overwritten
The IB_MR_REREG_PD command rewrites mr->pd after a successful
rereg_user_mr(). This change loses the usecnt information and
produces the following warning:

 WARNING: CPU: 1 PID: 1771 at drivers/infiniband/core/verbs.c:336 ib_dealloc_pd+0x4e/0x60 [ib_core]
 CPU: 1 PID: 1771 Comm: rereg_mr Tainted: G        W  OE 5.0.0-rc7-for-upstream-perf-2019-02-20_14-03-40-34 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
 RIP: 0010:ib_dealloc_pd+0x4e/0x60 [ib_core]
 RSP: 0018:ffffc90003923dc0 EFLAGS: 00010286
 RAX: 00000000ffffffff RBX: ffff88821f7f0400 RCX: ffff888236a40c00
 RDX: ffff88821f7f0400 RSI: 0000000000000001 RDI: 0000000000000000
 RBP: 0000000000000001 R08: ffff88835f665d80 R09: ffff8882209c90d8
 R10: ffff88835ec003e0 R11: 0000000000000000 R12: ffff888221680ba0
 R13: ffff888221680b00 R14: 00000000ffffffea R15: ffff88821f53c318
 FS:  00007f70db11e740(0000) GS:ffff88835f640000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000001dfd030 CR3: 000000029d9d8000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  uverbs_free_pd+0x2d/0x30 [ib_uverbs]
  destroy_hw_idr_uobject+0x16/0x40 [ib_uverbs]
  uverbs_destroy_uobject+0x28/0x170 [ib_uverbs]
  __uverbs_cleanup_ufile+0x6b/0x90 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x8b/0x110 [ib_uverbs]
  ib_uverbs_close+0x1f/0x80 [ib_uverbs]
  __fput+0xb1/0x220
  task_work_run+0x7f/0xa0
  exit_to_usermode_loop+0x6b/0xb2
  do_syscall_64+0xc5/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f70dad00664
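
The ordering issue can be sketched in user space as below. The struct and
function names are hypothetical and the driver hook is a stub; the sketch
only assumes that the hook overwrites mr->pd on success, as described
above.

```c
#include <assert.h>
#include <stddef.h>

struct pd { int usecnt; };
struct mr { struct pd *pd; };

/* Hypothetical driver hook: on success it rewrites mr->pd. */
static int driver_rereg_mr(struct mr *mr, struct pd *new_pd)
{
	mr->pd = new_pd;
	return 0;
}

static int rereg_pd(struct mr *mr, struct pd *new_pd)
{
	struct pd *old_pd = mr->pd;	/* save before the hook overwrites it */
	int ret;

	assert(old_pd != NULL);
	ret = driver_rereg_mr(mr, new_pd);
	if (ret)
		return ret;

	new_pd->usecnt++;	/* the MR now uses the new PD */
	old_pd->usecnt--;	/* and no longer uses the old one */
	return 0;
}
```

Reading mr->pd only after the hook returns would decrement the new PD
instead, leaving the old PD with a stale use count at dealloc time.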

Fixes: e278173fd1 ("RDMA/core: Cosmetic change - move member initialization to correct block")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 14:09:23 -07:00
Noa Osherovich
d0e02bf6cd RDMA/core: Verify that memory window type is legal
Before calling the provider's alloc_mw function, verify that the
given memory window type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.
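
A minimal sketch of such a check, written for user space. The enum, its
values, the function names and the -EINVAL return code are assumptions for
illustration and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for IB_MW_TYPE_1 and IB_MW_TYPE_2. */
enum mw_type { MW_TYPE_1 = 1, MW_TYPE_2 = 2 };

/* Provider stub: it may now assume the type is legal. */
static int provider_alloc_mw(int type)
{
	assert(type == MW_TYPE_1 || type == MW_TYPE_2);
	return 0;
}

/* Core entry point: reject illegal types before reaching the provider. */
static int core_alloc_mw(int type)
{
	if (type != MW_TYPE_1 && type != MW_TYPE_2)
		return -EINVAL;
	return provider_alloc_mw(type);
}
```

Doing the check once in the core means individual providers do not each
have to repeat it.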

Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Leon Romanovsky
1882ab8678 RDMA/iwcm: Fix string truncation error
The strlen() check at the beginning of iw_cm_map() ensures that the devname
and ifname strings are shorter than the destinations to which they are
to be copied. Change the strncpy() calls to strcpy(), because we are
protected from overflow. Zero the entire string buffer to avoid copying
uninitialized kernel stack memory to userspace.

This fixes the compilation warning below:

In file included from ./include/linux/dma-mapping.h:6,
                 from drivers/infiniband/core/iwcm.c:38:
In function 'strncpy',
    inlined from 'iw_cm_map' at drivers/infiniband/core/iwcm.c:519:2:
./include/linux/string.h:253:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
  return __builtin_strncpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
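
The pattern described above (length check first, zeroed buffer, then a
plain strcpy()) can be sketched in user space as follows. The struct, the
function name and the -EINVAL return code are hypothetical; only the bound
of 32 comes from the warning text.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define NAME_SIZE 32	/* matches the bound in the warning above */

struct map_info {
	char devname[NAME_SIZE];
	char ifname[NAME_SIZE];
};

static int fill_map_info(struct map_info *info, const char *devname,
			 const char *ifname)
{
	assert(info && devname && ifname);

	/* Reject anything that would not fit together with its NUL. */
	if (strlen(devname) >= sizeof(info->devname) ||
	    strlen(ifname) >= sizeof(info->ifname))
		return -EINVAL;

	/* Zero everything so that no stale bytes follow the strings. */
	memset(info, 0, sizeof(*info));

	/* Safe: both lengths were checked above. */
	strcpy(info->devname, devname);
	strcpy(info->ifname, ifname);
	return 0;
}
```

Because the length is validated up front, strcpy() cannot overflow here,
and the compiler has no truncating strncpy() bound to warn about.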

Fixes: d53ec8af56 ("RDMA/iwcm: Don't copy past the end of dev_name() string")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Yuval Shaia
e278173fd1 RDMA/core: Cosmetic change - move member initialization to correct block
old_pd is used only if the IB_MR_REREG_PD flag is set.
For readability, move its initialization to where it is used.

While there, rewrite the whole 'if-else' block so that on error we jump
directly to the label and there is no need for 'else'.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Steve Wise
3856ec4b93 RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support
Add support for new LINK messages to allow adding and deleting rdma
interfaces.  This will be used initially for soft rdma drivers which
instantiate device instances dynamically by the admin specifying a netdev
device to use.  The rdma_rxe module will be the first user of these
messages.

The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
with the rdma core if they provide link add/delete functions.  Each driver
registers with a unique "type" string, that is used to dispatch messages
coming from user space.  A new RDMA_NLDEV_ATTR is defined for the "type"
string.  User mode will pass 3 attributes in a NEWLINK message:
RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
link.  The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
the device to delete.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Jason Gunthorpe
ca22354b14 RDMA/rxe: Close a race after ib_register_device
Since rxe allows unregistration from other threads, the rxe pointer can
become invalid at any moment after ib_register_device returns. This could
cause a user triggered use after free.

Add another driver callback to be called right after the device becomes
registered to complete any device setup required post-registration.  This
callback has enough core locking to prevent the device from becoming
unregistered.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:18 -07:00
Jason Gunthorpe
6cc2c8e535 RDMA/rxe: Add ib_device_get_by_name() and use it in rxe
rxe has an open coded version of this that is not as safe as the core
version. This lets us eliminate the internal device list entirely from
rxe.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:18 -07:00
Jason Gunthorpe
d0899892ed RDMA/device: Provide APIs from the core code to help unregistration
These APIs are intended to support drivers that exist outside the usual
driver core probe()/remove() callbacks. Normally the driver core will
prevent remove() from running concurrently with probe(), once this safety
is lost drivers need more support to get the locking and lifetimes right.

ib_unregister_driver() is intended to be used during module_exit of a
driver using these APIs. It unregisters all the associated ib_devices.

ib_unregister_device_and_put() is to be used by a driver-specific removal
function (ie removal by name, removal from a netdev notifier, removal from
netlink)

ib_unregister_queued() is to be used from netdev notifier chains where
RTNL is held.

The locking is tricky here since once things become async it is possible
to race unregister with registration. This is largely solved by relying on
the registration refcount, unregistration will only ever work on something
that has a positive registration refcount - and then an unregistration
mutex serializes all competing unregistrations of the same device.
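
The "only proceed on a positive refcount" idea can be sketched with C11
atomics as below. This is a user-space illustration with hypothetical
names, not the kernel's refcount API, and it leaves out the unregistration
mutex mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dev {
	atomic_uint refcount;	/* registration refcount, 0 = unregistered */
};

/* Take a reference only if the count is still positive. */
static bool dev_get_unless_zero(struct dev *d)
{
	unsigned int old = atomic_load(&d->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&d->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void dev_put(struct dev *d)
{
	unsigned int old = atomic_fetch_sub(&d->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
}
```

A caller that fails to get a reference knows the device is already gone
and must not start another unregistration.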

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:18 -07:00
Jason Gunthorpe
324e227ea7 RDMA/device: Add ib_device_get_by_netdev()
Several drivers need to find the ib_device from a given netdev. rxe needs
this at speed in an unsleepable context, so choose to implement the
translation using a RCU safe hash table.

The hash table can have a many to one mapping. This is intended to support
some future case where multiple IB drivers (ie iWarp and RoCE) connect to
the same netdevs. driver_ids will need to be different to support this.

In the process this makes the struct ib_device and ib_port_data RCU safe
by deferring their kfrees.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:18 -07:00
Jason Gunthorpe
c2261dd76b RDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev
The associated netdev should not actually be very dynamic, so for most
drivers there is no reason for a callback like this. Provide an API to
inform the core code about the net dev affiliation and use a core
maintained data structure instead.

This allows the core code to be more aware of the ndev relationship which
will allow some new APIs based around this.

This also uses locking that makes some kind of sense, many drivers had a
confusing RCU lock, or missing locking which isn't right.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:18 -07:00
Jason Gunthorpe
8faea9fd4a RDMA/cache: Move the cache per-port data into the main ib_port_data
Like the other cases, there is no real reason to have another array just
for the cache. This larger conversion gets its own patch.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Jason Gunthorpe
8ceb1357b3 RDMA/device: Consolidate ib_device per_port data into one place
There is no reason to have three allocations of per-port data. Combine
them together and make the lifetime for all the per-port data match the
struct ib_device.

Following patches will require more port-specific data, now there is a
good place to put it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Jason Gunthorpe
ea1075edcb RDMA: Add and use rdma_for_each_port
We have many loops iterating over all of the end port numbers on a struct
ib_device, simplify them with a for_each helper.
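
A for_each helper of this kind can be sketched in user space as follows.
The struct, its fields and the macro name are hypothetical stand-ins and
do not reproduce the kernel's rdma_for_each_port() definition.

```c
#include <assert.h>

/* Hypothetical device with an inclusive range of port numbers. */
struct dev {
	unsigned int first_port;
	unsigned int last_port;
};

/* Iterate 'iter' over every port number of 'dev', inclusive. */
#define for_each_port(dev, iter)				\
	for ((iter) = (dev)->first_port;			\
	     (iter) <= (dev)->last_port; (iter)++)

/* Example user: count the ports flagged as active. */
static unsigned int count_active_ports(const struct dev *dev,
				       const unsigned char *active)
{
	unsigned int port, n = 0;

	assert(dev->first_port <= dev->last_port);
	for_each_port(dev, port)
		if (active[port])
			n++;
	return n;
}
```

The helper keeps the start and end bounds in one place, so callers cannot
get the inclusive upper bound wrong.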

Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Leon Romanovsky
f2a0e45f36 RDMA/nldev: Don't expose number of not-visible entries
The netlink dumpit handshake exchanges the index from which the kernel
should start returning values. In the current code, this index also
counted items not visible in this PID and so indirectly revealed the
number of entries.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Leon Romanovsky
1b8b778864 RDMA/nldev: Connect QP number to .doit callback
This patch adds ability to query specific QP based on its LQPN (local
QPN), which is assigned by HW and needs special treatment while inserting
into restrack DB.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Leon Romanovsky
c3d02788b4 RDMA/nldev: Provide parent IDs for PD, MR and QP objects
PD, MR and QP objects have parent objects: contexts and PDs.  The exposed
parent IDs allow correlating various objects and simplify debug
investigation.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Leon Romanovsky
517b773e0f RDMA/nldev: Share with user-space object IDs
Give user space tools a unique identifier for PD, MR, CQ and CM_ID
objects, so they are able to query them with .doit callbacks.

QP .doit is not supported yet, until all drivers are updated to provide
their LQPN equal to their restrack ID.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:39 -07:00
Leon Romanovsky
7c77c6a9bf RDMA/restrack: Prepare restrack_root to addition of extra fields per-type
As preparation for extending rdma_restrack_root to provide software
IDs, which will be per-type too, convert rdma_restrack_root from a
struct with arrays to an array of structs.

This conversion allows us to drop the rwsem lock in favour of the internal
XArray lock.
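
The layout change can be sketched as below. All type and field names are
hypothetical and the per-type table is reduced to an opaque pointer; the
sketch only shows why an array of structs makes adding a per-type field
(here a software ID counter) straightforward.

```c
#include <assert.h>

enum res_type { RES_PD, RES_CQ, RES_QP, RES_MAX };

/* Before: one struct holding one array per field. */
struct root_arrays {
	void *table[RES_MAX];
	/* every new per-type field needs another parallel array here */
};

/* After: one struct per type, collected in an array. */
struct res_table {
	void *table;
	unsigned int next_id;	/* new per-type field, added in one place */
};

struct root_structs {
	struct res_table tbl[RES_MAX];
};

/* Hand out the next software ID for the given resource type. */
static unsigned int alloc_id(struct root_structs *root, enum res_type type)
{
	assert(type < RES_MAX);
	return root->tbl[type].next_id++;
}
```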

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 10:13:38 -07:00
Leon Romanovsky
41eda65c61 RDMA/restrack: Hide restrack DB from IB/core
There is no need to expose internals of restrack DB to IB/core.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-18 21:04:36 -07:00
Leon Romanovsky
4811852718 RDMA/restrack: Reduce scope of synchronization lock while updating DB
XArray uses an internal lock for updates to the XArray. This means that
our external RW lock is only needed to ensure that an entry is not deleted
while we are iterating over the list.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-18 21:04:36 -07:00
Leon Romanovsky
c5dfe0ea6f RDMA/nldev: Add resource tracker doit callback
Implement doit callbacks and ensure that users won't provide port values
on resource entry allocated in per-device mode needed for .doit callback.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-18 21:04:36 -07:00
Leon Romanovsky
18c4c66f76 RDMA/restrack: Translate from ID to restrack object
Add a new general helper to get a restrack entry from its ID and
respective type.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-18 21:04:36 -07:00
Leon Romanovsky
fd47c2f99f RDMA/restrack: Convert internal DB from hash to XArray
The addition of .doit callbacks poses a new access pattern to the resource
entries by some user visible index. Back then, the legacy DB was
implemented as a hash because per-index access wasn't needed and XArray
wasn't accepted yet.

Acceptance of XArray together with per-index access requires a refresh
of the DB implementation.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-18 21:04:36 -07:00
Parav Pandit
5f8f549900 RDMA/core: Move device addition deletion to device.c
Move core device addition and removal from sysfs.c to device.c, as device.c
is a more appropriate place for device management.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 21:57:14 -07:00
Parav Pandit
5767198a14 RDMA/core: Introduce and use ib_setup_port_attrs()
Refactor code for device and port sysfs attributes for reuse.

While at it, rename the counterpart free function to ib_free_port_attrs.

Also, the attribute setup sequence is:
(a) port specific init.
(b) device stats alloc/init.

So for cleanup, follow the reverse sequence:
(a) device stats dealloc
(b) port specific cleanup

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 21:52:17 -07:00
Parav Pandit
e155755e53 RDMA/core: Use simpler device_del() instead of device_unregister()
Instead of holding extra reference using get_device() that
device_unregister() releases, simplify it as below.

device_add() balances with device_del().  device_initialize() balances
with put_device(), always via ib_dealloc_device().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 21:42:02 -07:00
Jason Gunthorpe
9a7786784d RDMA/uverbs: Fix an error flow in ib_uverbs_poll_cq
The new output_written block was wrongly placed before the ret=0, causing
the error code to be lost. uverbs_output_written is not expected to fail,
and even if it does fail it has no significant impact on the userspace
flow.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: d6f4a21f30 ("RDMA/uverbs: Mark ioctl responses with UVERBS_ATTR_F_VALID_OUTPUT")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2019-02-15 15:54:46 -07:00
Shamir Rabinovitch
3d9dfd0603 IB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows
Add ib_ucontext to the uverbs_attr_bundle sent down the ioctl and cmd flows
as soon as the flow has ib_uobject.

In addition, remove rdma_get_ucontext helper function that is only used by
ib_umem_get.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 11:16:21 -07:00
YueHaibing
36f0a1ccb3 RDMA/iwpm: Remove set but not used variable 'msg_seq'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/core/iwpm_util.c: In function 'iwpm_send_hello':
drivers/infiniband/core/iwpm_util.c:811:6: warning:
 variable 'msg_seq' set but not used [-Wunused-but-set-variable]

It has never been used since its introduction in commit b0bad9ad51 ("RDMA/IWPM:
Support no port mapping requirements")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 14:47:39 -07:00
Jason Gunthorpe
921eab1143 RDMA/devices: Re-organize device.c locking
The locking here started out with a single lock that covered everything
and then has lately veered into crazy town.

The fundamental problem is that several places need to iterate over a
linked list, but also need to drop their locks to avoid deadlock during
client callbacks.

xarray's restartable iteration offers a simple solution to the
problem. Once all the lists are xarrays we can drop locks in the places
that need that and rely on xarray to provide consistency and locking for
the data structure.

The resulting simplification is that each of the three lists has a
dedicated rwsem that must be held when working with the list it
covers. One data structure is no longer covered by multiple locks.

The sleeping semaphore is selected because the read side generally needs
to be held over something sleeping, and using RCU reader locking in those
cases is overkill.

In the process this simplifies the entire registration/unregistration flow
to be the expected list of setups and the reversed list of matching
teardowns, and the registration lock 'refcount' can now be revised to be
released after the ULPs are removed, providing a very sane semantic for
this feature.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
0df91bb673 RDMA/devices: Use xarray to store the client_data
Now that we have a small ID for each client we can use xarray instead of
linearly searching linked lists for client data. This will give much
faster and more scalable client data lookup, and will let us revise the
locking scheme.

Since xarray can store 'going_down' using a mark just entirely eliminate
the struct ib_client_data and directly store the client_data value in the
xarray. However this does require a special iterator as we must still
iterate over any NULL client_data values.

Also eliminate the client_data_lock in favour of internal xarray locking.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
e59178d895 RDMA/devices: Use xarray to store the clients
This gives each client a unique ID and will let us move client_data to use
xarray, and revise the locking scheme.

Clients have to be added/removed in strict FIFO/LIFO order as they
interdepend. To support this the client_ids are assigned to increase in
FIFO order. The existing linked list is kept to support reverse iteration
until xarray can get a reverse iteration API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
3b88afd38e RDMA/device: Use an ida instead of a free page in alloc_name
ida is the proper data structure to hold a list of clustered small
integers and then allocate an unused integer. Get rid of the convoluted
and limited open-coded bitmap.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
652432f33c RDMA/device: Get rid of reg_state
This really has no purpose anymore, refcount can be used to tell if the
device is still registered. Keeping it around just invites mis-use.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
d45f89d59b RDMA/device: Call ib_cache_release_one() only from ib_device_release()
Instead of complicated logic about when this memory is freed, always free
it during device release(). All the cache pointers start out as NULL, so
it is safe to call this before the cache is initialized.

This makes for a simpler error unwind flow, and a simpler understanding of
the lifetime of the memory allocations inside the struct ib_device.
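
As a rough userspace illustration of the "free everything in release()"
pattern described above, the sketch below uses a hypothetical struct
dev_model with plain malloc()/free() standing in for the kernel
allocators. The names are assumptions for illustration only and are not
taken from the patch. The point is that a single release path can free
every member unconditionally, because freeing a NULL pointer is a no-op.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device with lazily allocated members. */
struct dev_model {
	int *cache;    /* stays NULL until cache setup succeeds */
	int *security; /* stays NULL until security setup succeeds */
};

/* Allocate the device with every owned pointer explicitly NULL. */
static struct dev_model *dev_alloc(void)
{
	struct dev_model *dev = malloc(sizeof(*dev));

	if (!dev)
		return NULL;
	dev->cache = NULL;
	dev->security = NULL;
	return dev;
}

/* One optional setup step; it may or may not have run before release. */
static int dev_setup_cache(struct dev_model *dev, size_t entries)
{
	dev->cache = calloc(entries, sizeof(*dev->cache));
	return dev->cache ? 0 : -1;
}

/*
 * Single teardown path: free(NULL) does nothing, so this is safe no
 * matter how far setup got before an error unwind.
 */
static void dev_release(struct dev_model *dev)
{
	if (!dev)
		return;
	free(dev->cache);
	free(dev->security);
	free(dev);
}
```

With this shape, an error unwind only has to call dev_release(), whether
or not dev_setup_cache() has run.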

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
b34b269ad8 RDMA/device: Ensure that security memory is always freed
Since this only frees memory it should be done during the release
callback. Otherwise there are possible error flows where it might not get
called if registration aborts.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:56:45 -07:00
Jason Gunthorpe
e3593b568a RDMA/device: Check that the rename is nop under the lock
Since another rename could be running in parallel it is safer to check
that the name is not changing inside the lock, where we already know the
device name will not change.

Fixes: d21943dd19 ("RDMA/core: Implement IB device rename function")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2019-02-08 16:56:45 -07:00
Leon Romanovsky
21a428a019 RDMA: Handle PD allocations by IB/core
Handling PD allocations in IB/core allows us to simplify drivers and their
error flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand
in hand with the relevant update in .dealloc_pd().

We will use this opportunity to convert .dealloc_pd() so that it cannot
fail, as was suggested a long time ago; failures are not happening, as we
have never seen a WARN_ON print.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:51:04 -07:00
Leon Romanovsky
30471d4b20 RDMA/core: Share driver structure size with core
Add new macros to be used in drivers while registering ops structure and
IB/core while calling allocation routines, so drivers won't need to
perform kzalloc/kfree in their paths.

The change in the allocation stage allows us to initialize common fields
prior to calling into drivers (e.g. restrack).

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:50:58 -07:00
Daniel Jurgens
c66f67414c IB/core: Don't register each MAD agent for LSM notifier
When creating many MAD agents in a short period of time, receive packet
processing can be delayed long enough to cause timeouts while new agents
are being added to the atomic notifier chain with IRQs disabled.  Notifier
chain registration and unregistration is an O(n) operation. With large
numbers of MAD agents being created and destroyed simultaneously the CPUs
spend too much time with interrupts disabled.

Instead of each MAD agent registering for its own LSM notification,
maintain a list of agents internally and register once; this registration
already existed for handling the PKeys. This list is write mostly, so a
normal spin lock is used vs a read/write lock. All MAD agents must be
checked, so a single list is used instead of breaking them down per
device.

Notifier calls are done under rcu_read_lock, so there isn't a risk of
similar packet timeouts while checking the MAD agents security settings
when notified.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:24:44 -07:00
Daniel Jurgens
6e88e672b6 IB/core: Fix potential memory leak while creating MAD agents
If the MAD agent isn't allowed to manage the subnet, or fails to register
for the LSM notifier, the security context is leaked. Free the context in
these cases.

Fixes: 47a2b338fe ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:24:44 -07:00
Daniel Jurgens
d60667fc39 IB/core: Unregister notifier before freeing MAD security
If the notifier runs after the security context is freed an access of
freed memory can occur.

Fixes: 47a2b338fe ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:24:44 -07:00
Steve Wise
926ba19b35 RDMA/iwcm: add tos_set bool to iw_cm struct
This allows drivers to know the tos was actively set by the application.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:18:06 -07:00
Steve Wise
9491128f78 RDMA/cma: listening device cm_ids should inherit tos
If a user binds to INADDR_ANY and sets the service id, then the
device-specific cm_ids should also use this tos.  This allows an app to
do:

rdma_bind_addr(INADDR_ANY)
set_service_type()
rdma_listen()

And connections setup via this listening endpoint will use the correct
tos.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:15:40 -07:00
Danit Goldberg
2c1619edef IB/cma: Define option to set ack timeout and pack tos_set
Define a new option in 'rdma_set_option' to override the calculated QP
timeout when requested to provide QP attributes to modify a QP.

At the same time, pack tos_set to be a bitfield.

Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:14:21 -07:00
Davidlohr Bueso
b95df5e3e4 drivers/IB,core: reduce scope of mmap_sem
ib_umem_get() uses gup_longterm() and relies on the lock to stabilize the
vma_list, so we cannot really get rid of mmap_sem altogether, but now that
the counter is atomic, we can get rid of some complexity that mmap_sem
brings with only pinned_vm.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Davidlohr Bueso
70f8a3ca68 mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(e.g. infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem and
can simplify some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.
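
To make the locking change concrete, here is a minimal userspace sketch
that uses C11 <stdatomic.h> as an analogy for the kernel's atomic64
counter. struct mm_model and the pin_* helpers are hypothetical names
invented for this illustration, not kernel APIs. It shows pinned-page
accounting with a limit check and rollback, done without taking any
lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory descriptor; only the counter is modeled. */
struct mm_model {
	_Atomic int64_t pinned_vm;
};

/* Add npages to the counter and return the new total. No lock is taken. */
static int64_t pin_account(struct mm_model *mm, int64_t npages)
{
	return atomic_fetch_add(&mm->pinned_vm, npages) + npages;
}

/* Undo a previous pin_account() of npages. */
static void pin_unaccount(struct mm_model *mm, int64_t npages)
{
	atomic_fetch_sub(&mm->pinned_vm, npages);
}

/*
 * Account first, then compare against the limit, and roll back on
 * failure. Returns 0 on success and -1 if the limit would be exceeded.
 */
static int pin_try(struct mm_model *mm, int64_t npages, int64_t limit)
{
	if (pin_account(mm, npages) > limit) {
		pin_unaccount(mm, npages);
		return -1;
	}
	return 0;
}
```

Because the update is a single atomic read-modify-write, callers that
cannot sleep can adjust the counter directly instead of deferring the
accounting to a workqueue.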

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Steve Wise
a2bfd708b1 RDMA/iwpm: move kdoc comments to functions
Move the iwpm kdoc comments from the prototype declarations to above
the function bodies.  There are no functional changes in this patch.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-05 15:40:41 -07:00
Leon Romanovsky
a78e8723a5 RDMA/cma: Remove CM_ID statistics provided by rdma-cm module
Netlink statistics exported by rdma-cm never had any working user space
component published to the mailing list or to any open source
project. Canvassing various proprietary users, and the original requester,
we find that there are no real users of this interface.

This patch simply removes all occurrences of RDMA CM netlink in favour of
the modern nldev implementation, which provides the same information and
is accompanied by a widely used user space component.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-05 15:30:33 -07:00
Steve Wise
b0bad9ad51 RDMA/IWPM: Support no port mapping requirements
A soft iwarp driver that uses the host TCP stack via a kernel mode socket
does not need port mapping.  In fact, if the port map daemon, iwpmd, is
running, then iwpmd must not try and create/bind a socket to the actual
port for a soft iwarp connection, since the driver already has that socket
bound.

Yet if the soft iwarp driver wants to interoperate with hard iwarp devices
that -are- using port mapping, then the soft iwarp driver's mappings still
need to be maintained and advertised by the iwpm protocol.

This patch enhances the rdma driver<->iwcm interface to allow an iwarp
driver to specify that it does not want port mapping.  The iwpm
kernel<->iwpmd interface is also enhanced to pass up this information on
map requests.

Care is taken to interoperate with the current iwpmd version (ABI version
3) and only use the new NL attributes if iwpmd supports ABI version 4.

The ABI version define has also been created in rdma_netlink.h so both
kernel and user code can share it.  The iwcm and iwpmd negotiate the ABI
version to use with a new HELLO netlink message.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:26:02 -07:00
Steve Wise
f76903d574 RDMA/IWPM: refactor the IWPM message attribute names
In order to add new IWPM_NL attributes, the enums for the IWPM commands
attributes are refactored such that a new attribute can be added without
breaking ABI version 3. Instead of sharing nl attribute enums for both
request and response messages, we create separate enums for each IWPM
message request and reply.  This allows us to extend any given IWPM
message by adding new attributes for just that message.  These new enums
are created, though, in a way to avoid breaking ABI version 3.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:26:02 -07:00
Jason Gunthorpe
6a8a2aa62d Linux 5.0-rc5

Merge tag 'v5.0-rc5' into rdma.git for-next

Linux 5.0-rc5

Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
2019-02-04 14:53:42 -07:00
Bart Van Assche
a163afc885 IB/core: Remove ib_sg_dma_address() and ib_sg_dma_len()
Keeping single line wrapper functions is not useful. Hence remove the
ib_sg_dma_address() and ib_sg_dma_len() functions. This patch does not
change any functionality.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
52a72e2a39 IB/uverbs: Expose XRC ODP device capabilities
Expose XRC ODP capabilities as part of the extended device capabilities.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:06 -07:00
Leon Romanovsky
02da375097 RDMA/core: Use the ops infrastructure to keep all callbacks in one place
As preparation to hide rdma_restrack_root, refactor the code to use the
ops structure instead of a special callback which is hidden in
rdma_restrack_root.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:34:21 -07:00
Leon Romanovsky
5e458d3f89 RDMA/restrack: Refactor user/kernel restrack additions
Since we already know if we are user/kernel before calling restrack_add,
move type dependent code into the callers to make the flow more readable.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:20:23 -07:00
Leon Romanovsky
0ad699c0ed RDMA/core: Simplify restrack interface
In the current implementation, we have one restrack root per-device and
all users are simply providing it directly. Let's simplify the interface
and have callers provide the ib_device and internally access the
restrack_root.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:15:47 -07:00
Leon Romanovsky
659067b0b5 RDMA/nldev: Prepare CAP_NET_ADMIN checks for .doit callbacks
The .doit callbacks don't have a netlink_callback to check capabilities so
in order to use the same fill_res_func for both .dump and .doit, we need
to do the capability check outside of those functions.

For .doit callbacks, it is possible to check CAP_NET_ADMIN directly on the
received sk_buff.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:12:33 -07:00
Leon Romanovsky
8be565e65f RDMA/nldev: Factor out the PID namespace check
The PID namespace is going to be used in the .doit callback, so generalize
its implementation.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:11:45 -07:00
Leon Romanovsky
f732e7135b RDMA/nldev: Dynamically generate restrack dumpit callbacks
There is no need to manually write the same callbacks; generate them
automatically using C macros.

This macro is going to be extended to generate doit callbacks too, so use
a general name for this macro.
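
The general idea of generating near-identical callbacks from one macro
can be sketched as follows. This is a self-contained userspace
illustration; DEFINE_RES_DUMPIT, res_get_common() and the enum values are
hypothetical names chosen for the example and do not reproduce the
actual nldev code.

```c
#include <assert.h>

/* Hypothetical resource types tracked per device. */
enum res_type { RES_QP, RES_CQ, RES_MR };

/* Common worker that every generated callback forwards to. */
static int res_get_common(enum res_type type, int arg)
{
	/* Placeholder result that encodes which type was requested. */
	return (int)type * 100 + arg;
}

/*
 * Generate one thin wrapper per resource type. Token pasting (##) builds
 * the function name, so each wrapper differs only in the type it passes.
 */
#define DEFINE_RES_DUMPIT(name, type_id)                  \
	static int res_get_##name##_dumpit(int arg)       \
	{                                                 \
		return res_get_common(type_id, arg);      \
	}

DEFINE_RES_DUMPIT(qp, RES_QP)
DEFINE_RES_DUMPIT(cq, RES_CQ)
DEFINE_RES_DUMPIT(mr, RES_MR)
```

Adding a second family of wrappers later (for example doit-style
callbacks) would then only require extending the macro body, which is
why a general macro name is preferable.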

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:10:21 -07:00
Gal Pressman
6780c4fa9d RDMA: Add indication for in kernel API support to IB device
Drivers that do not provide kernel verbs support should not be used by ib
kernel clients at all.

In case a device does not implement all mandatory verbs for kverbs usage
mark it as a non kverbs provider and prevent its usage for all clients
except for uverbs.

The device is marked as a non kverbs provider using the 'kverbs_provider'
flag which should only be set by the core code.  The clients can choose
whether kverbs are requested for its usage using the 'no_kverbs_req' flag
which is currently set for uverbs only.

This patch allows drivers to remove mandatory verbs stubs and simply set
the callbacks to NULL. The IB device will be registered as a non-kverbs
provider. Note that verbs that are required for the device registration
process must be implemented.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 20:32:25 -07:00
Leon Romanovsky
459cc69fa4 RDMA: Provide safe ib_alloc_device() function
All callers to ib_alloc_device() provide a larger size than struct
ib_device and rely on the fact that struct ib_device is embedded in their
driver specific structure as the first member.

Provide a safer variant of ib_alloc_device() that checks and enforces this
approach to make sure the drivers are using it right.
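
One way such a check can be enforced at compile time is sketched below
in plain userspace C. struct core_device, struct drv_device, core_alloc()
and the drv_alloc() macro are hypothetical stand-ins invented for this
example; the sketch only demonstrates the "embedded core structure must
be the first member" check and does not claim to match the signature
introduced by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the core structure that every driver embeds. */
struct core_device {
	int index;
};

/* Hypothetical driver structure with the core device as first member. */
struct drv_device {
	struct core_device base;
	int drv_private;
};

/* Untyped allocator: takes only a size, so it cannot check the layout. */
static struct core_device *core_alloc(size_t size)
{
	if (size < sizeof(struct core_device))
		return NULL;
	return calloc(1, size);
}

/* Evaluates to 0, or fails to compile (negative array size) if e is true. */
#define MUST_BE_ZERO(e) (sizeof(char[(e) ? -1 : 1]) - 1)

/*
 * Typed wrapper: the build breaks unless `member` sits at offset 0 of
 * `drv_struct`, so the cast from the core pointer to the driver pointer
 * is known to be valid.
 */
#define drv_alloc(drv_struct, member)                                     \
	((drv_struct *)core_alloc(sizeof(drv_struct) +                    \
		MUST_BE_ZERO(offsetof(drv_struct, member) != 0)))
```

A caller would write drv_alloc(struct drv_device, base); moving base away
from the start of the structure then turns a silent runtime bug into a
compile error.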

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 15:52:30 -07:00
Yishai Hadas
7b21b69ab2 IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate
The vma->vm_mm can become impossible to get before rdma_umap_close() is
called. In that case we must not try to get an mm that is already
undergoing process exit, and there is no need to wait for anything, as
the VMA will be destroyed by another thread soon and is already
effectively 'unreachable' by userspace.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 800000012bc50067 P4D 800000012bc50067 PUD 129db5067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 1 PID: 2050 Comm: bash Tainted: G        W  OE 4.20.0-rc6+ #3
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__rb_erase_color+0xb9/0x280
 Code: 84 17 01 00 00 48 3b 68 10 0f 84 15 01 00 00 48 89
               58 08 48 89 de 48 89 ef 4c 89 e3 e8 90 84 22 00 e9 60 ff ff ff 48 8b 5d
               10 <f6> 03 01 0f 84 9c 00 00 00 48 8b 43 10 48 85 c0 74 09 f6 00 01 0f
 RSP: 0018:ffffbecfc090bab8 EFLAGS: 00010246
 RAX: ffff97616346cf30 RBX: 0000000000000000 RCX: 0000000000000101
 RDX: 0000000000000000 RSI: ffff97623b6ca828 RDI: ffff97621ef10828
 RBP: ffff97621ef10828 R08: ffff97621ef10828 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff97623b6ca838
 R13: ffffffffbb3fef50 R14: ffff97623b6ca828 R15: 0000000000000000
 FS:  00007f7a5c31d740(0000) GS:ffff97623bb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000011255a000 CR4: 00000000000006e0
 Call Trace:
  unlink_file_vma+0x3b/0x50
  free_pgtables+0xa1/0x110
  exit_mmap+0xca/0x1a0
  ? mlx5_ib_dealloc_pd+0x28/0x30 [mlx5_ib]
  mmput+0x54/0x140
  uverbs_user_mmap_disassociate+0xcc/0x160 [ib_uverbs]
  uverbs_destroy_ufile_hw+0xf7/0x120 [ib_uverbs]
  ib_uverbs_remove_one+0xea/0x240 [ib_uverbs]
  ib_unregister_device+0xfb/0x200 [ib_core]
  mlx5_ib_remove+0x51/0xe0 [mlx5_ib]
  mlx5_remove_device+0xc1/0xd0 [mlx5_core]
  mlx5_unregister_device+0x3d/0xb0 [mlx5_core]
  remove_one+0x2a/0x90 [mlx5_core]
  pci_device_remove+0x3b/0xc0
  device_release_driver_internal+0x16d/0x240
  unbind_store+0xb2/0x100
  kernfs_fop_write+0x102/0x180
  __vfs_write+0x36/0x1a0
  ? __alloc_fd+0xa9/0x170
  ? set_close_on_exec+0x49/0x70
  vfs_write+0xad/0x1a0
  ksys_write+0x52/0xc0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Cc: <stable@vger.kernel.org> # 4.19
Fixes: 5f9794dc94 ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:57:22 -07:00
Jason Gunthorpe
55c293c38e Merge branch 'devx-async' into k.o/for-next
Yishai Hadas says:

Enable DEVX asynchronous query commands

This series enables querying a DEVX object in an asynchronous mode.

The userspace application won't block when calling the firmware and it will be
able to get the response back once it is ready.

To enable the above functionality:

- DEVX asynchronous command completion FD object was introduced.
- The applicable file operations were implemented to enable using it by
  the user application.
- Query asynchronous method was added to the DEVX object; it calls the
  firmware asynchronously and manages the response on the given input FD.
- Hot unplug support was added for the FD to work properly upon
  unbind/disassociate.
- mlx5 core fence for asynchronous commands was implemented and used to
  prevent racing upon unbind/disassociate.

This branch is based on mlx5-next & v5.0-rc2 due to dependencies, from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

* branch 'devx-async':
  IB/mlx5: Implement DEVX hot unplug for async command FD
  IB/mlx5: Implement the file ops of DEVX async command FD
  IB/mlx5: Introduce async DEVX obj query API
  IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:49:31 -07:00
Yishai Hadas
6bf8f22aea IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD
Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD and its initial implementation.

This object is of type class FD and will be used to read DEVX async
command completions.

The core layer should allow the driver to set an object of type FD in a
safe mode; this option was added with a matching comment in place.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:32:43 -07:00
Yishai Hadas
f8ade8e242 IB/uverbs: Fix ioctl query port to consider device disassociation
Methods cannot peek into the ufile, the only way to get a ucontext and
hence a device is via the ib_uverbs_get_ucontext() call or inspecting a
locked uobject.

Otherwise during/after disassociation the pointers may be null or free'd.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
 PGD 800000005ece6067 P4D 800000005ece6067 PUD 5ece7067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 10631 Comm: ibv_ud_pingpong Tainted: GW  OE     4.20.0-rc6+ #3
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:ib_uverbs_handler_UVERBS_METHOD_QUERY_PORT+0x53/0x191 [ib_uverbs]
 Code: 80 00 00 00 31 c0 48 8b 47 40 48 8d 5c 24 38 48 8d 6c 24
               08 48 89 df 48 8b 40 08 4c 8b a0 18 03 00 00 31 c0 f3 48 ab 48 89
               ef <49> 83 7c 24 78 00 b1 06 f3 48 ab 0f 84 89 00 00 00 45 31  c9 31 d2
 RSP: 0018:ffffb54802ccfb10 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffffb54802ccfb48 RCX:0000000000000000
 RDX: fffffffffffffffa RSI: ffffb54802ccfcf8 RDI:ffffb54802ccfb18
 RBP: ffffb54802ccfb18 R08: ffffb54802ccfd18 R09:0000000000000000
 R10: 0000000000000000 R11: 00000000000000d0 R12:0000000000000000
 R13: ffffb54802ccfcb0 R14: ffffb54802ccfc48 R15:ffff9f736e0059a0
 FS:  00007f55a6bd7740(0000) GS:ffff9f737ba00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000078 CR3: 0000000064214000 CR4:00000000000006f0
 Call Trace:
  ib_uverbs_cmd_verbs.isra.5+0x94d/0xa60 [ib_uverbs]
  ? copy_port_attr_to_resp+0x120/0x120 [ib_uverbs]
  ? arch_tlb_finish_mmu+0x16/0xc0
  ? tlb_finish_mmu+0x1f/0x30
  ? unmap_region+0xd9/0x120
  ib_uverbs_ioctl+0xbc/0x120 [ib_uverbs]
  do_vfs_ioctl+0xa9/0x620
  ? __do_munmap+0x29f/0x3a0
  ksys_ioctl+0x60/0x90
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f55a62cb567

Fixes: 641d1207d2 ("IB/core: Move query port to ioctl")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 11:58:06 -07:00
Yishai Hadas
425784aa5b IB/uverbs: Fix OOPs upon device disassociation
The async_file might be freed before the disassociation has ended,
causing QP shutdown to hit a use-after-free on it.

Since uverbs_destroy_ufile_hw is not a fence, it returns if a
disassociation is ongoing in another thread. It has to be written this way
to avoid deadlock. However this means that the ufile FD close cannot
destroy anything that may still be used by an active kref, such as the
async_file.

To fix that, move the kref_put() into ib_uverbs_release_file().

 BUG: unable to handle kernel paging request at ffffffffba682787
 PGD bc80e067 P4D bc80e067 PUD bc80f063 PMD 1313df163 PTE 80000000bc682061
 Oops: 0003 [#1] SMP PTI
 CPU: 1 PID: 32410 Comm: bash Tainted: G           OE 4.20.0-rc6+ #3
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__pv_queued_spin_lock_slowpath+0x1b3/0x2a0
 Code: 98 83 e2 60 49 89 df 48 8b 04 c5 80 18 72 ba 48 8d
		ba 80 32 02 00 ba 00 80 00 00 4c 8d 65 14 41 bd 01 00 00 00 48 01 c7 85
		d2 <48> 89 2f 48 89 fb 74 14 8b 45 08 85 c0 75 42 84 d2 74 6b f3 90 83
 RSP: 0018:ffffc1bbc064fb58 EFLAGS: 00010006
 RAX: ffffffffba65f4e7 RBX: ffff9f209c656c00 RCX: 0000000000000001
 RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffffffffba682787
 RBP: ffff9f217bb23280 R08: 0000000000000001 R09: 0000000000000000
 R10: ffff9f209d2c7800 R11: ffffffffffffffe8 R12: ffff9f217bb23294
 R13: 0000000000000001 R14: 0000000000000000 R15: ffff9f209c656c00
 FS:  00007fac55aad740(0000) GS:ffff9f217bb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffba682787 CR3: 000000012f8e0000 CR4: 00000000000006e0
 Call Trace:
  _raw_spin_lock_irq+0x27/0x30
  ib_uverbs_release_uevent+0x1e/0xa0 [ib_uverbs]
  uverbs_free_qp+0x7e/0x90 [ib_uverbs]
  destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
  uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
  __uverbs_cleanup_ufile+0x73/0x90 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x5d/0x120 [ib_uverbs]
  ib_uverbs_remove_one+0xea/0x240 [ib_uverbs]
  ib_unregister_device+0xfb/0x200 [ib_core]
  mlx5_ib_remove+0x51/0xe0 [mlx5_ib]
  mlx5_remove_device+0xc1/0xd0 [mlx5_core]
  mlx5_unregister_device+0x3d/0xb0 [mlx5_core]
  remove_one+0x2a/0x90 [mlx5_core]
  pci_device_remove+0x3b/0xc0
  device_release_driver_internal+0x16d/0x240
  unbind_store+0xb2/0x100
  kernfs_fop_write+0x102/0x180
  __vfs_write+0x36/0x1a0
  ? __alloc_fd+0xa9/0x170
  ? set_close_on_exec+0x49/0x70
  vfs_write+0xad/0x1a0
  ksys_write+0x52/0xc0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fac551aac60

Cc: <stable@vger.kernel.org> # 4.2
Fixes: 036b106357 ("IB/uverbs: Enable device removal when there are active user space applications")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 11:58:06 -07:00
Artemy Kovalyov
a2093dd35f RDMA/umem: Add missing initialization of owning_mm
When allocating a umem leaf for implicit ODP MR during page fault the
field owning_mm was not set.

Initialize and take a reference on this field to avoid kernel panic when
trying to access this field.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
 PGD 800000022dfed067 P4D 800000022dfed067 PUD 22dfcf067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 634 Comm: kworker/u33:0 Not tainted 4.20.0-rc6+ #89
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
 RIP: 0010:ib_umem_odp_map_dma_pages+0xf3/0x710 [ib_core]
 Code: 45 c0 48 21 f3 48 89 75 b0 31 f6 4a 8d 04 33 48 89 45 a8 49 8b 44 24 60 48 8b 78 10 e8 66 16 a8 c5 49 8b 54 24 08 48 89 45 98 <8b> 42 58 85 c0 0f 84 8e 05 00 00 8d 48 01 48 8d 72 58 f0 0f b1 4a
 RSP: 0000:ffffb610813a7c20 EFLAGS: 00010202
 RAX: ffff95ace6e8ac80 RBX: 0000000000000000 RCX: 000000000000000c
 RDX: 0000000000000000 RSI: 0000000000000850 RDI: ffff95aceaadae80
 RBP: ffffb610813a7ce0 R08: 0000000000000000 R09: 0000000000080c77
 R10: ffff95acfffdbd00 R11: 0000000000000000 R12: ffff95aceaa20a00
 R13: 0000000000001000 R14: 0000000000001000 R15: 000000000000000c
 FS:  0000000000000000(0000) GS:ffff95acf7800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000058 CR3: 000000022c834001 CR4: 00000000001606f0
 Call Trace:
  pagefault_single_data_segment+0x1df/0xc60 [mlx5_ib]
  mlx5_ib_eqe_pf_action+0x7bc/0xa70 [mlx5_ib]
  ? __switch_to+0xe1/0x470
  process_one_work+0x174/0x390
  worker_thread+0x4f/0x3e0
  kthread+0x102/0x140
  ? drain_workqueue+0x130/0x130
  ? kthread_stop+0x110/0x110
  ret_from_fork+0x1f/0x30

Fixes: f27a0d50a4 ("RDMA/umem: Use umem->owning_mm inside ODP")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 09:55:48 -07:00
Yuval Avnery
535005ca8e IB/core: Destroy QP if XRC QP fails
The open-coded variant missed destroying the SELinux-created QP. Reuse
the already existing ib_destroy_qp() call and use this opportunity to
clean ib_create_qp() of double prints and unclear exit paths.

Reported-by: Parav Pandit <parav@mellanox.com>
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Moni Shoua
da6a496a34 IB/mlx5: Ranges in implicit ODP MR inherit its write access
A sub-range in an ODP implicit MR should take its write permission from
the MR rather than always being set to allow writes.

Fixes: d07d1d70ce ("IB/umem: Update on demand page (ODP) support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Bart Van Assche
f373859190 IB/core: Declare local functions 'static'
This patch prevents sparse from complaining about missing function
declarations.

Fixes: f27a0d50a4 ("RDMA/umem: Use umem->owning_mm inside ODP")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Parav Pandit
039d713a59 IB/umad: Do not check status of nonseekable_open()
As the comment block of nonseekable_open() describes, nonseekable_open()
can never fail. Several places in kernel depend on this behavior.
Therefore, simplify the umad module to depend on this basic kernel
functionality.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:29 -07:00
Parav Pandit
ee848721f6 IB/umad: Avoid additional device reference during open()/close()
ib_umad_init_port_dev() holds a reference to an ib_umad_device instance.
ib_umad_device contains a standard core device and cdev.  cdev holds a
reference to its parent core device.  file ops hold a reference to the
cdev via the core kernel.

Therefore, there is no need to hold additional reference while opening
umad related char devices.

While at it, add comments to bring clarity on releasing references to
ib_umad_device.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-23 11:37:05 -07:00
Jason Gunthorpe
d79af7242b RDMA/device: Expose ib_device_try_get()
It turns out future patches need this capability quite widely now, not
just for netlink, so provide two global functions to manage the
registration lock refcount.

This also moves the point the lock becomes 1 to within
ib_register_device() so that the semantics of the public API are very sane
and clear. Calling ib_device_try_get() will fail on devices that are only
allocated but not yet registered.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2019-01-21 14:33:08 -07:00
Parav Pandit
7527a7b157 IB/core: Simplify rdma cgroup registration
RDMA cgroup registration routine always returns success, so simplify
function to be void and run clang formatter over whole CONFIG_CGROUP_RDMA
part of core_priv.h.

This reduces unwinding error path for regular registration and future net
namespace change functionality for rdma device.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-18 13:43:10 -07:00
Jason Gunthorpe
344684e6d0 RDMA/device: Use __ib_device_get_by_name() in ib_device_rename()
No reason to open code this loop.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
2019-01-18 13:05:58 -07:00
Myungho Jung
5fc01fb846 RDMA/cma: Rollback source IP address if failing to acquire device
If cma_acquire_dev_by_src_ip() returns error in addr_handler(), the
device state changes back to RDMA_CM_ADDR_BOUND but the resolved source
IP address is still left. After that, if rdma_destroy_id() is called
after rdma_listen(), the device is freed without being removed from
listen_any_list in cma_cancel_operation(). Revert to the previous IP
address if acquiring the device fails.

Reported-by: syzbot+f3ce716af730c8f96637@syzkaller.appspotmail.com
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-14 15:28:26 -07:00
Jason Gunthorpe
d6f4a21f30 RDMA/uverbs: Mark ioctl responses with UVERBS_ATTR_F_VALID_OUTPUT
When the ioctl interface for the write commands was introduced it did
not mark the core response with UVERBS_ATTR_F_VALID_OUTPUT. This causes
rdma-core in userspace to not mark the buffers as written for valgrind.

Along the same lines it turns out we have always missed marking the driver
data. Fixing both of these makes valgrind work properly with rdma-core and
ioctl.

Fixes: 4785860e04 ("RDMA/uverbs: Implement an ioctl that can call write and write_ex handlers")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-01-14 14:02:22 -07:00
Parav Pandit
5474723115 RDMA: Introduce and use rdma_device_to_ibdev()
Introduce rdma_device_to_ibdev() and use it in ib_core and in those
drivers that register one sysfs group.

In a subsequent patch, the one-to-one device->provider_ibdev mapping no
longer holds true while accessing sysfs entries.
Therefore, introduce an API, rdma_device_to_ibdev(), that provides this
information.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-14 13:12:03 -07:00
Parav Pandit
ea4baf7f11 RDMA: Rename port_callback to init_port
Most provider routines are callback routines which ib core invokes.
_callback suffix doesn't convey information about when such callback is
invoked. Therefore, rename port_callback to init_port.

Additionally, store the init_port function pointer in ib_device_ops, so
that it can be accessed in subsequent patches when binding rdma device to
net namespace.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-14 13:05:14 -07:00
Jason Gunthorpe
b0ea0fa543 IB/{core,hw}: Have ib_umem_get extract the ib_ucontext from ib_udata
ib_umem_get() can only be called in a method callback, which always has a
udata parameter. This allows ib_umem_get() to derive the ucontext pointer
directly from the udata without requiring the drivers to find it in some
way or another.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
2019-01-10 17:07:45 -07:00
Shamir Rabinovitch
6fa8f1afd3 IB/{core,uverbs}: Move ib_umem_xxx functions from ib_core to ib_uverbs
The next patch will add a dependency from ib_umem_get into ib_uverbs, so
move the required ib_umem_xxx functionality to its correct module,
ib_uverbs, and avoid a circular dependency of the form ib_core ->
ib_uverbs -> ib_core in depmod.

Since this now requires all drivers to be built modular if uverbs is
modular, hoist the test a couple of drivers had into the main kconfig and
apply it to all drivers uniformly.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-10 17:06:44 -07:00
Steve Wise
917cb8a72a RDMA/cma: Add cm_id restrack resource based on kernel or user cm_id type
A recent regression causes a null ptr crash when dumping cm_id resources.
The cma is incorrectly adding all cm_id restrack resources as kernel mode.

Fixes: af8d70375d ("RDMA/restrack: Resource-tracker should not use uobject pointers")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-08 17:12:33 -07:00
Leon Romanovsky
13859d5df4 RDMA/mlx5: Embed into the code flow the ODP config option
Convert various places to more readable code, which embeds
CONFIG_INFINIBAND_ON_DEMAND_PAGING into the code flow.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-08 16:41:38 -07:00
Leon Romanovsky
e502b8b011 RDMA/core: Don't depend device ODP capabilities on kconfig option
Device capability bits are exposing what specific device supports from HW
perspective. Those bits are not dependent on kernel configurations and
RDMA/core should ensure that proper interfaces to users will be disabled
if CONFIG_INFINIBAND_ON_DEMAND_PAGING is not set.

Fixes: f4056bfd8c ("IB/core: Add on demand paging caps to ib_uverbs_ex_query_device")
Fixes: 8cdd312cfe ("IB/mlx5: Implement the ODP capability query verb")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-08 16:41:38 -07:00
Leon Romanovsky
a9666c1cae RDMA/nldev: Don't expose unsafe global rkey to regular user
Unsafe global rkey is considered dangerous because it exposes memory
registered for all memory in the system. Only users with a QP on the same
PD can use the rkey, and generally those QPs will already know the
value. However, out of caution, do not expose the value to unprivileged
users on the local system. Require CAP_NET_ADMIN instead.

Cc: <stable@vger.kernel.org> # 4.16
Fixes: 29cf1351d4 ("RDMA/nldev: provide detailed PD information")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-07 13:35:57 -07:00
Gustavo A. R. Silva
5aad26a7ea IB/core: Use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.
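
As a hedged userspace illustration of why a helper is preferable to the open-coded sum, the sketch below computes the same size with an explicit overflow check. demo_foo_size() and demo_foo_alloc() are hypothetical names; they are not the kernel's struct_size() macro, only an approximation of the idea that an overflowing calculation should end in a failed allocation rather than a short one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	int stuff;
	void *entry[];	/* flexible array member */
};

/* Size of struct foo plus count entries, or SIZE_MAX if that would overflow. */
static size_t demo_foo_size(size_t count)
{
	size_t base = sizeof(struct foo);
	size_t elem = sizeof(void *);

	if (count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + count * elem;
}

/* Zeroed allocation that refuses sizes that overflowed. */
static struct foo *demo_foo_alloc(size_t count)
{
	size_t bytes = demo_foo_size(count);

	if (bytes == SIZE_MAX)
		return NULL;
	return calloc(1, bytes);
}
```

With the open-coded form, a very large count silently wraps around and yields a buffer that is too small; here the caller simply sees a NULL return.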

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-07 11:59:33 -07:00
Gustavo A. R. Silva
b5c61b968d IB/cm: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-07 11:43:03 -07:00
Gal Pressman
f687ccea10 RDMA/uverbs: Fix post send success return value in case of error
If get QP object fails 'ret' must be assigned with a proper error code.

Fixes: 9a0738575f ("RDMA/uverbs: Use uverbs_response() for remaining response copying")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-07 11:42:22 -07:00
Linus Torvalds
3954e1d031 4.21 merge window 2nd pull request
Over the break a few defects were found, so this is a -rc style pull
 request of various small things that have been posted.
 
 - An attempt to shorten RCU grace period driven delays showed crashes
   during heavier testing, and has been entirely reverted
 
 - A missed merge/rebase error between the advise_mr and ib_device_ops
   series
 
 - Some small static analysis driven fixes from Julia and Aditya
 
 - Missed ability to create a XRC_INI in the devx verbs interop series

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "Over the break a few defects were found, so this is a -rc style pull
  request of various small things that have been posted.

   - An attempt to shorten RCU grace period driven delays showed crashes
     during heavier testing, and has been entirely reverted

   - A missed merge/rebase error between the advise_mr and ib_device_ops
     series

   - Some small static analysis driven fixes from Julia and Aditya

   - Missed ability to create a XRC_INI in the devx verbs interop
     series"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  infiniband/qedr: Potential null ptr dereference of qp
  infiniband: bnxt_re: qplib: Check the return value of send_message
  IB/ipoib: drop useless LIST_HEAD
  IB/core: Add advise_mr to the list of known ops
  Revert "IB/mlx5: Fix long EEH recover time with NVMe offloads"
  IB/mlx5: Allow XRC INI usage via verbs in DEVX context
2019-01-05 18:20:51 -08:00
Linus Torvalds
96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted in this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Moni Shoua
2f1927b090 IB/core: Add advise_mr to the list of known ops
We need to add advise_mr to the list of operation setters on the ib_device
or otherwise callers to ib_set_device_ops() for advise_mr operation will
not have their callback registered.

When the advise_mr series was merged with the device ops series the
SET_DEVICE_OPS() was missed.

Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-02 09:40:34 -07:00
Linus Torvalds
f346b0becb Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - large KASAN update to use arm's "software tag-based mode"

 - a few misc things

 - sh updates

 - ocfs2 updates

 - just about all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
  kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
  memcg, oom: notify on oom killer invocation from the charge path
  mm, swap: fix swapoff with KSM pages
  include/linux/gfp.h: fix typo
  mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  memory_hotplug: add missing newlines to debugging output
  mm: remove __hugepage_set_anon_rmap()
  include/linux/vmstat.h: remove unused page state adjustment macro
  mm/page_alloc.c: allow error injection
  mm: migrate: drop unused argument of migrate_page_move_mapping()
  blkdev: avoid migration stalls for blkdev pages
  mm: migrate: provide buffer_migrate_page_norefs()
  mm: migrate: move migrate_page_lock_buffers()
  mm: migrate: lock buffers before migrate_page_move_mapping()
  mm: migration: factor out code to compute expected number of page references
  mm, page_alloc: enable pcpu_drain with zone capability
  kmemleak: add config to select auto scan
  mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
  ...
2018-12-28 16:55:46 -08:00
Linus Torvalds
5d24ae67a9 4.21 merge window pull request
This has been a fairly typical cycle, with the usual sorts of driver
 updates. Several series continue to come through which improve and
 modernize various parts of the core code, and we finally are starting to
 get the uAPI command interface cleaned up.
 
 - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5,
   qib, rxe, usnic
 
 - Rework the entire syscall flow for uverbs to be able to run over
   ioctl(). Finally getting past the historic bad choice to use write()
   for command execution
 
 - More functional coverage with the mlx5 'devx' user API
 
 - Start of the HFI1 series for 'TID RDMA'
 
 - SRQ support in the hns driver
 
 - Support for new IBTA defined 2x lane widths
 
 - A big series to consolidate all the driver function pointers into
   a big struct and have drivers provide a 'static const' version of the
   struct instead of open coding initialization
 
 - New 'advise_mr' uAPI to control device caching/loading of page tables
 
 - Support for inline data in SRPT
 
 - Modernize how umad uses the driver core and creates cdev's and sysfs
   files
 
 - First steps toward removing 'uobject' from the view of the drivers

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a fairly typical cycle, with the usual sorts of driver
  updates. Several series continue to come through which improve and
  modernize various parts of the core code, and we finally are starting
  to get the uAPI command interface cleaned up.

   - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
     mlx5, qib, rxe, usnic

   - Rework the entire syscall flow for uverbs to be able to run over
     ioctl(). Finally getting past the historic bad choice to use
     write() for command execution

   - More functional coverage with the mlx5 'devx' user API

   - Start of the HFI1 series for 'TID RDMA'

   - SRQ support in the hns driver

   - Support for new IBTA defined 2x lane widths

   - A big series to consolidate all the driver function pointers into a
     big struct and have drivers provide a 'static const' version of the
     struct instead of open coding initialization

   - New 'advise_mr' uAPI to control device caching/loading of page
     tables

   - Support for inline data in SRPT

   - Modernize how umad uses the driver core and creates cdev's and
     sysfs files

   - First steps toward removing 'uobject' from the view of the drivers"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
  RDMA/srpt: Use kmem_cache_free() instead of kfree()
  RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
  IB/uverbs: Signedness bug in UVERBS_HANDLER()
  IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
  IB/umad: Start using dev_groups of class
  IB/umad: Use class_groups and let core create class file
  IB/umad: Refactor code to use cdev_device_add()
  IB/umad: Avoid destroying device while it is accessed
  IB/umad: Simplify and avoid dynamic allocation of class
  IB/mlx5: Fix wrong error unwind
  IB/mlx4: Remove set but not used variable 'pd'
  RDMA/iwcm: Don't copy past the end of dev_name() string
  IB/mlx5: Fix long EEH recover time with NVMe offloads
  IB/mlx5: Simplify netdev unbinding
  IB/core: Move query port to ioctl
  RDMA/nldev: Expose port_cap_flags2
  IB/core: uverbs copy to struct or zero helper
  IB/rxe: Reuse code which sets port state
  IB/rxe: Make counters thread safe
  IB/mlx5: Use the correct commands for UMEM and UCTX allocation
  ...
2018-12-28 14:57:10 -08:00
Jérôme Glisse
5d6527a784 mm/mmu_notifier: use structure for invalidate_range_start/end callback
Patch series "mmu notifier contextual informations", v2.

This patchset adds contextual information, why an invalidation is
happening, to the mmu notifier callback.  This is necessary for users of
mmu notifiers that wish to maintain their own data structures without
having to add new fields to struct vm_area_struct (vma).

For instance, a device can have its own page table that mirrors the
process address space.  When a vma is unmapped (munmap() syscall) the
device driver can free the device page table for the range.

Today we do not have any information on why an mmu notifier callback is
happening and thus device drivers have to assume that it is always an
munmap().  This is inefficient as it means that the driver needs to
re-allocate the device page table on the next page fault and rebuild the
whole device driver data structure for the range.

Other use cases besides munmap() also exist; for instance, it is pointless
for a device driver to invalidate the device page table when the
invalidation is for soft dirtiness tracking.  Or a device driver can
optimize away an mprotect() that changes the page table access
permissions for the range.

This patchset enables all these optimizations for device drivers.  I do not
include any of those in this series but another patchset I am posting will
leverage this.

The patchset is pretty simple from a code point of view.  The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments.  The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).

This patch (of 3):

To avoid having to change many callback definitions every time we want to
add a parameter, use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callback.  No functional changes
with this patch.
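
The refactoring pattern can be sketched as follows. This is a minimal illustration, assuming a reduced set of fields: struct demo_range, its members, and demo_range_len() are hypothetical and do not reproduce the kernel's actual structure.

```c
/* Hypothetical grouping of the invalidation arguments into one structure. */
struct demo_range {
	unsigned long start;	/* first address of the range */
	unsigned long end;	/* one past the last address */
	int blockable;		/* non-zero if the callback may sleep */
};

/*
 * A callback takes a pointer to the structure, so adding a field later
 * does not change the callback signature.
 */
static unsigned long demo_range_len(const struct demo_range *range)
{
	return range->end - range->start;
}
```

Passing one pointer instead of a growing argument list means that a new field only touches the callers that fill it in and the callbacks that read it.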

[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Dan Carpenter
573671a5f6 IB/uverbs: Signedness bug in UVERBS_HANDLER()
The "num_sge" variable needs to be signed for the error handling to work.
The uverbs_attr_ptr_get_array_size() returns int so this change is safe.
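
A small userspace sketch shows the class of bug being fixed. The helper demo_get_array_size() and both checker functions are hypothetical; the value -22 is used only to mirror a negative errno such as -EINVAL.

```c
#include <stdbool.h>

/* Hypothetical helper: returns an element count or a negative error code. */
static int demo_get_array_size(bool fail)
{
	return fail ? -22 : 4;
}

/* Buggy variant: an unsigned variable can never compare below zero. */
static bool demo_error_seen_unsigned(bool fail)
{
	unsigned int num_sge = demo_get_array_size(fail);

	return num_sge < 0;	/* always false, the error is missed */
}

/* Fixed variant: a signed variable preserves the negative error code. */
static bool demo_error_seen_signed(bool fail)
{
	int num_sge = demo_get_array_size(fail);

	return num_sge < 0;
}
```

Compilers may warn that the unsigned comparison is always false, which is typically how static analysis flags this pattern.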

Fixes: ad8a449675 ("IB/uverbs: Add support to advise_mr")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-22 16:07:13 -07:00
Parav Pandit
75bf8a2a2f IB/umad: Start using dev_groups of class
Start using the core-defined dev_groups of a class, which lets the core
kernel add the device attributes and simplifies the umad module.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-21 11:39:41 -07:00
Parav Pandit
cdb53b65ae IB/umad: Use class_groups and let core create class file
Use class->class_groups core kernel facility to create the abi version
file instead of open coding.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-21 11:39:40 -07:00
Parav Pandit
e9dd5daf88 IB/umad: Refactor code to use cdev_device_add()
Refactor code to use cdev_device_add() and do other minor refactors while
modifying these functions as below.

1. Instead of returning generic -1, return an actual error for
   ib_umad_init_port().

2. Introduce and use ib_umad_init_port_dev() for sm and umad char devices.

3. Instead of kobj, use the more lightweight kref to refcount ib_umad_device.

4. Use the modern cdev_device_add() to replace the separate cdev_add() and
   device_create() steps with a single call. This further helps to move
   device sysfs files to class attributes in a subsequent patch.

5. Remove a few empty lines while refactoring these functions.

6. Use sizeof() instead of sizeof to avoid checkpatch warning.

7. Use struct_size() for calculation of ib_umad_port.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-21 11:39:38 -07:00
Parav Pandit
cf7ad30302 IB/umad: Avoid destroying device while it is accessed
ib_umad_reg_agent2() and ib_umad_reg_agent() access the device name in
dev_notice(), while concurrently, ib_umad_kill_port() can destroy the
device using device_destroy().

        cpu-0                               cpu-1
        -----                               -----
    ib_umad_ioctl()
        [...]                            ib_umad_kill_port()
                                              device_destroy(dev)

        ib_umad_reg_agent()
            dev_notice(dev)

Therefore, first mark ib_dev as NULL, to block any further access in file
ops, unregister the mad agent and destroy the device at the end after
mutex is unlocked.

This ensures that device doesn't get destroyed, while it may get accessed.

Fixes: 0f29b46d49 ("IB/mad: add new ioctl to ABI to support new registration options")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-21 10:39:36 -07:00
Parav Pandit
900d07c12d IB/umad: Simplify and avoid dynamic allocation of class
Simplify code to have a static structure instance for umad class
allocation.

This allows class attributes to be defined along with class registration
in a subsequent patch and allows more class methods to be defined, similar
to the ib_core module.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-21 10:39:36 -07:00
Steve Wise
d53ec8af56 RDMA/iwcm: Don't copy past the end of dev_name() string
We now use dev_name(&ib_device->dev) instead of ib_device->name in iwpm
messages.  The name field in struct device is a const char *, whereas
ib_device->name is a char array of size IB_DEVICE_NAME_MAX, and it is
pre-initialized to zeros.

Since iw_cm_map() was using memcpy() to copy in the device name, and
copying IWPM_DEVNAME_SIZE bytes, it ends up copying past the end of the
source device name string and copying random bytes.  This results in iwpmd
failing the REGISTER_PID request from iwcm.  Thus port mapping is broken.

Validate the device and if names, and use strncpy() to initialize the
entire message field.
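
The difference between the two copies can be sketched in userspace C. DEMO_DEVNAME_SIZE and demo_fill_name() are hypothetical; the sketch only assumes a fixed-size message field and a NUL-terminated source string that may be shorter than the field.

```c
#include <string.h>

#define DEMO_DEVNAME_SIZE 32	/* hypothetical size of the message field */

/*
 * Fill a fixed-size field from a NUL-terminated name.
 * Returns 0 on success, -1 if the name does not fit with its terminator.
 */
static int demo_fill_name(char dst[DEMO_DEVNAME_SIZE], const char *src)
{
	if (strlen(src) >= DEMO_DEVNAME_SIZE)
		return -1;
	/*
	 * strncpy() stops reading src at its terminator and zero-pads the
	 * rest of dst, unlike memcpy(dst, src, DEMO_DEVNAME_SIZE), which
	 * would read past the end of a short string.
	 */
	strncpy(dst, src, DEMO_DEVNAME_SIZE);
	return 0;
}
```

Because the whole field is initialized, the receiver never sees stray bytes after the name.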

Fixes: 896de0090a ("RDMA/core: Use dev_name instead of ibdev->name")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-20 20:45:56 -07:00
Michael Guralnik
641d1207d2 IB/core: Move query port to ioctl
Add a method for query port under the uverbs global methods.  Current
ib_port_attr struct is passed as a single attribute and port_cap_flags2 is
added as a new attribute to the function.

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-20 15:18:24 -07:00
Michael Guralnik
4fa2813d26 RDMA/nldev: Expose port_cap_flags2
port_cap_flags2 represents IBTA PortInfo:CapabilityMask2.

The field safely extends the RDMA_NLDEV_ATTR_CAP_FLAGS operand as it was
exported as 64 bit to allow this kind of extension.

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-20 15:18:24 -07:00
Michael Guralnik
2e8039c656 IB/core: uverbs copy to struct or zero helper
Add a helper to zero fill fields before copying data to
UVERBS_ATTR_STRUCT.

As UVERBS_ATTR_STRUCT can be used as an extensible struct, we want to make
sure that if the user supplies us with a struct that has new fields that
we are not aware of, we return them zeroed to the user.

This helper should be used when using UVERBS_ATTR_STRUCT for an extendable
data structure and there is a need to make sure that extended members of
the struct, that the kernel doesn't handle, are returned zeroed to the
user. This is needed due to the fact that UVERBS_ATTR_STRUCT allows
non-zero values for members after 'last' member.
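
The intent of such a helper can be sketched in userspace C. demo_copy_or_zero() is a hypothetical name and plain memcpy()/memset() stand in for the copy to user memory; the sketch only assumes that the user's buffer may be longer than the structure the kernel knows about.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the fields the "kernel" knows about into the user's buffer and
 * zero any trailing bytes that belong to newer, unknown fields.
 */
static void demo_copy_or_zero(void *user_buf, size_t user_len,
			      const void *src, size_t src_len)
{
	size_t n = user_len < src_len ? user_len : src_len;

	memcpy(user_buf, src, n);
	if (user_len > n)
		memset((char *)user_buf + n, 0, user_len - n);
}
```

A newer userspace that passes a larger struct then reads zeros in the members an older kernel does not handle, instead of whatever it had placed there.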

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-20 15:18:18 -07:00
David S. Miller
2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, but happily all cases of overlapping
changes, parallel adds, things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
Gal Pressman
2553ba217e RDMA: Mark if destroy address handle is in a sleepable context
Introduce a 'flags' field to destroy address handle callback and add a
flag that marks whether the callback is executed in an atomic context or
not.

This will allow drivers to wait for completion instead of polling for it
when it is allowed.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-19 16:28:03 -07:00
Gal Pressman
b090c4e3a0 RDMA: Mark if create address handle is in a sleepable context
Introduce a 'flags' field to create address handle callback and add a flag
that marks whether the callback is executed in an atomic context or not.

This will allow drivers to wait for completion instead of polling for it
when it is allowed.
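
How a driver might act on such a flag can be sketched as below. DEMO_AH_SLEEPABLE, enum demo_wait_mode and demo_pick_wait_mode() are hypothetical names chosen for illustration; they are not taken from the patch.

```c
#define DEMO_AH_SLEEPABLE (1u << 0)	/* hypothetical "may sleep" flag bit */

enum demo_wait_mode {
	DEMO_WAIT_POLL,		/* atomic context: busy-poll for completion */
	DEMO_WAIT_SLEEP,	/* sleepable context: block until completion */
};

/* Choose how to wait for the hardware command based on the caller's flags. */
static enum demo_wait_mode demo_pick_wait_mode(unsigned int flags)
{
	return (flags & DEMO_AH_SLEEPABLE) ? DEMO_WAIT_SLEEP : DEMO_WAIT_POLL;
}
```

Defaulting to polling when the flag is absent keeps the callback safe for callers in atomic context.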

Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-19 16:17:19 -07:00
Shamir Rabinovitch
af8d70375d RDMA/restrack: Resource-tracker should not use uobject pointers
Having uobject pointer embedded in ib core objects is not aligned with a
future shared ib_x model. The resource tracker only does this to keep
track of user/kernel objects - track this directly instead.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-18 15:38:26 -07:00
Moni Shoua
ad8a449675 IB/uverbs: Add support to advise_mr
Add new ioctl method for the MR object - ADVISE_MR.

This command can be used by users to give advice or directions to the
kernel about an address range that belongs to memory regions.

A new ib_device callback, advise_mr(), is introduced here to support the
new command. This command takes the following arguments:

- pd:		The protection domain to which all memory regions belong
- advice: 	The type of the advice
	  	* IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH - Pre-fetch a range of
		an on-demand paging MR
	  	* IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE - Pre-fetch a range
		of an on-demand paging MR with write intention
- flags:	The properties of the advice
		* IB_UVERBS_ADVISE_MR_FLAG_FLUSH - Operation must end before
		return to the caller
- sg_list:	The list of memory ranges
- num_sge:	The number of memory ranges in the list
- attrs:	More attributes to be parsed by the provider

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-18 15:19:46 -07:00
Parav Pandit
bbc13cda37 RDMA/uverbs: Add an ioctl method to destroy an object
Add an ioctl method to destroy the PD, MR, MW, AH, flow, RWQ indirection
table and XRCD objects by handle which doesn't require any output response
during destruction.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-18 14:59:57 -07:00
Jason Gunthorpe
149d3845f4 RDMA/uverbs: Add a method to introspect handles in a context
Introduce a helper function gather_objects_handle() to copy object handles
under a spin lock.

Expose these object handles via the uverbs ioctl interface.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-12-18 14:54:54 -07:00
Parav Pandit
be5914c124 RDMA/core: Delete RoCE GID in hw when corresponding IP is deleted
Currently a RoCE GID entry is removed from the hardware when all
references to the GID entry drop to zero. This is a change in behavior
from before the fixed patch. The GID entry should be removed from the
hardware when GID entry deletion is requested. This allows the driver to
terminate ongoing traffic through the RoCE GID.

While a GID is deleted from the hardware, the GID slot in the software GID
cache is not freed. The GID slot is freed once all references to the GID
are dropped. This continues to ensure that the hardware GID slot is not
handed out to a new GID entry allocation request; it becomes available
only once all references to the GID entry are dropped.

This approach allows drivers to put a tombstone of some kind on the HW
GID index to block the traffic.
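
The split between "removed from hardware" and "slot freed" can be sketched
with a small state model. This is a userspace illustration only: struct
toy_gid_slot and the toy_gid_*() helpers are hypothetical, and the
assumption that the table itself holds one reference (dropped on deletion)
belongs to the sketch rather than to this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one GID table slot. */
struct toy_gid_slot {
	int refs;	/* outstanding references, including the table's own */
	bool in_hw;	/* GID is currently programmed in hardware */
	bool in_use;	/* software cache slot is still occupied */
};

/* Drop one reference; the slot is freed only when the last one goes. */
void toy_gid_put(struct toy_gid_slot *s)
{
	assert(s != NULL && s->refs > 0);
	if (--s->refs == 0)
		s->in_use = false;
}

/* Deletion requested: remove from hardware now, free the slot later. */
void toy_gid_del(struct toy_gid_slot *s)
{
	assert(s != NULL && s->in_use);
	s->in_hw = false;	/* traffic through this GID can be stopped */
	toy_gid_put(s);		/* assumed: drop the table's own reference */
}
```

With two references held, toy_gid_del() clears in_hw immediately while
in_use stays set until the remaining holder calls toy_gid_put().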

Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-18 14:16:44 -07:00
Jason Gunthorpe
4785860e04 RDMA/uverbs: Implement an ioctl that can call write and write_ex handlers
Now that the handlers do not process their own udata we can make a
sensible ioctl that wraps them. The ioctl follows the same format as
the write_ex() and has the user explicitly specify the core and driver
in/out opaque structures and a command number.

This works for all forms of write commands.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-18 14:12:48 -05:00
Mark Zhang
37fbd834b4 IB/core: Fix oops in netdev_next_upper_dev_rcu()
When support for bonding of RoCE devices was added, there was
necessarily a link between the RoCE device and the paired netdevice that
was part of the bond.  If you remove the mlx4_en module, that paired
association is broken (the RoCE device is still present but the paired
netdevice has been released).  We need to account for this in
is_upper_ndev_bond_master_filter() and filter out those links with a
broken pairing or else we later oops in netdev_next_upper_dev_rcu().

Fixes: 408f1242d9 ("IB/core: Delete lower netdevice default GID entries in bonding scenario")
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-12 12:14:49 -05:00
Kamal Heib
3023a1e936 RDMA: Start use ib_device_ops
Make all the required changes to start using the ib_device_ops structure.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-12 07:40:16 -07:00
Kamal Heib
521ed0d92a RDMA/core: Introduce ib_device_ops
This change introduces the ib_device_ops structure that defines all the
InfiniBand device operations in one place, so the code will be more
readable and clean, unlike today when the ops are mixed with ib_device
data members.

The providers will need to define the supported operations and assign them
using ib_set_device_ops(), which will also make the providers' code more
readable and clean.
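
A minimal userspace sketch of the pattern, assuming a toy device with only
two operations. The struct layouts, the TOY_SET_OP() macro, and
toy_set_device_ops() are hypothetical stand-ins; keeping an already
assigned pointer is a choice made for this sketch, not something stated in
this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table: all device operations live in one place. */
struct toy_device_ops {
	int (*query_port)(int port);
	int (*query_pkey)(int port, int index);
};

struct toy_device {
	struct toy_device_ops ops;	/* kept apart from data members */
};

/* Copy one op if the provider supplied it and none is set yet. */
#define TOY_SET_OP(dst, src, op)				\
	do {							\
		if ((src)->op && !(dst)->op)			\
			(dst)->op = (src)->op;			\
	} while (0)

void toy_set_device_ops(struct toy_device *dev,
			const struct toy_device_ops *ops)
{
	assert(dev != NULL && ops != NULL);
	TOY_SET_OP(&dev->ops, ops, query_port);
	TOY_SET_OP(&dev->ops, ops, query_pkey);
}

/* A provider defines only the operations it supports. */
int demo_query_port(int port)
{
	return port > 0 ? 0 : -1;
}

const struct toy_device_ops demo_ops = {
	.query_port = demo_query_port,
};
```

A provider would call toy_set_device_ops(&dev, &demo_ops) once during
setup; operations it did not define stay NULL.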

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-11 15:14:09 -07:00
Leon Romanovsky
9435ef4cae RDMA/uverbs: Optimize clearing of extra bytes in response
Clear the extra bytes in the response in one batch instead of doing it
per byte.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-11 14:38:17 -07:00
Jason Gunthorpe
28ab1bb0e8 Linux 4.20-rc6

Merge tag 'v4.20-rc6' into rdma.git for-next

For dependencies in following patches.
2018-12-11 14:24:57 -07:00
Michael Guralnik
a5a5d19936 IB/core: Add new IB rates
Add the new rates that were added to the InfiniBand spec as part of HDR
and 2x support.

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-11 13:22:45 -07:00
Saeed Mahameed
2f62747c77 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
mlx5-next shared branch with rdma subtree to avoid mlx5 rdma vs. netdev
conflicts.

Highlights:

1) RDMA ODP (On Demand Paging) improvements and moving ODP logic to
mlx5 RDMA driver
2) Improved mlx5 core driver and device events handling and provided API
for upper layers to subscribe to device events.
3) RDMA only code cleanup from mlx5 core
4) Add helper to get CQE opcode
5) Rework handling of port module events
6) shared mlx5_ifc.h updates to avoid conflicts

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-10 15:50:50 -08:00
Yuval Shaia
9af3f5cf9d RDMA/core: Validate port number in query_pkey verb
Before calling the driver's function, make sure the port is valid.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-07 14:33:09 -07:00
Yishai Hadas
04ca16cc19 IB/core: Enable getting an object type from a given uobject
Enable getting an object type from a given uobject. The type is saved
upon tree merging and is returned by a helper function.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-04 13:46:41 -05:00
Yishai Hadas
4d7e8cc574 IB/core: Introduce UVERBS_IDR_ANY_OBJECT
Introduce the UVERBS_IDR_ANY_OBJECT type to match any IDR object.

Once used, the infrastructure skips checking the IDR type; that check
becomes the driver handler's responsibility.

This enables drivers to accept, in a given method, an object of various
types.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-04 13:46:41 -05:00
Leon Romanovsky
ffd321e4b7 RDMA/nldev: Export to user space number of contexts
[leonro@server ~]$ rdma res show
1: mlx5_0: pd 3 cq 5 qp 4 cm_id 0 mr 0 ctx 0

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 14:58:25 -05:00
Leon Romanovsky
12d23a9198 RDMA/uverbs: Annotate alloc/deallloc paths with context tracking
Add restrack annotations to track allocations of ucontexts.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 14:58:25 -05:00
Leon Romanovsky
606152107b RDMA/restrack: Track ucontext
Add the ability to track allocated ib_ucontext objects, which are a
limited resource and worth making visible to users.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 14:58:25 -05:00
Jason Gunthorpe
974d6b4b2b RDMA/uverbs: Use only attrs for the write() handler signature
All of the old arguments can be derived from the uverbs_attr_bundle
structure, so get rid of the redundant arguments. Most of the prior work
has been removing users of the arguments to allow this to be a simple
patch.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
ece9ca97cc RDMA/uverbs: Do not check the input length on create_cq/qp paths
If the user did not provide a long enough command buffer then the missing
bytes are forced to zero. There is no reason to check the length if a zero
value is OK.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
c3bea3d2dc RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()
This has a very complicated memory layout, with two flex arrays. Use
the iterator API to make reading it clearer.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
335708c751 RDMA/uverbs: Add a simple iterator interface for reading the command
Several methods have a command with a trailing flex array, and they
all open code some extraction scheme. Centralize this into a simple
iterator API.
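
The general shape of such an iterator can be sketched in userspace as
below. The names struct toy_req_iter, toy_iter_start() and toy_iter_next()
are hypothetical and do not claim to match the kernel API; the -ENOSPC
return value is likewise just a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over a command buffer with a trailing array. */
struct toy_req_iter {
	const unsigned char *cur;	/* next unread byte */
	const unsigned char *end;	/* one past the last byte */
};

void toy_iter_start(struct toy_req_iter *it, const void *buf, size_t len)
{
	assert(it != NULL && buf != NULL);
	it->cur = buf;
	it->end = it->cur + len;
}

/* Copy the next len bytes into dst and advance, or fail if too short. */
int toy_iter_next(struct toy_req_iter *it, void *dst, size_t len)
{
	assert(it != NULL && dst != NULL);
	if ((size_t)(it->end - it->cur) < len)
		return -ENOSPC;
	memcpy(dst, it->cur, len);
	it->cur += len;
	return 0;
}
```

A handler would read the fixed header with one toy_iter_next() call and
then loop over the trailing elements with further calls, with every
bounds check living in one place.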

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
7eebced1ba RDMA/uverbs: Simplify ib_uverbs_ex_query_device
We truncate the response structure if there is not enough room in the
user buffer so there is no reason to have all the mess with finely managing
response_length. Just fully fill the attrs and truncate on copy.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
40efca7a46 RDMA/uverbs: Fill in the response for IB_USER_VERBS_EX_CMD_MODIFY_QP
A response struct was defined, and userspace is providing it (but not
checking it). Fill it in and write it out.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
29a29d1852 RDMA/uverbs: Use uverbs_request() and core for write_ex handlers
The write_ex handlers have this horrible boilerplate in every function to
do the zero extend/zero check and min size checks. This is now handled in
the core code via the meta-data, and the zero checks are handled by
uverbs_request(). Replace all the occurrences.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
3c2c20947d RDMA/uverbs: Use uverbs_request() for request copying
This function properly zero-extends, and zero-checks if the user
buffer is not the same size as the kernel command struct.
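
The zero-extend and zero-check behaviour described above can be sketched
in plain C. toy_request_copy() is a hypothetical userspace stand-in that
uses memcpy() where the kernel would copy from user memory, and the
-EOPNOTSUPP error code is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a user command of in_len bytes into a kernel struct of req_len
 * bytes. Hypothetical helper, not the kernel's uverbs_request().
 */
int toy_request_copy(void *req, size_t req_len,
		     const void *in, size_t in_len)
{
	const unsigned char *src = in;
	size_t n = in_len < req_len ? in_len : req_len;
	size_t i;

	assert(req != NULL && in != NULL);
	memcpy(req, src, n);

	if (in_len < req_len) {
		/* Zero-extend: fields the user did not supply read as 0. */
		memset((unsigned char *)req + in_len, 0, req_len - in_len);
		return 0;
	}

	/* Zero-check: bytes beyond what we understand must all be 0. */
	for (i = req_len; i < in_len; i++)
		if (src[i] != 0)
			return -EOPNOTSUPP;
	return 0;
}
```

An older, shorter user struct is accepted with defaults of zero, while a
newer, longer one is accepted only if it does not set anything this side
cannot interpret.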

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
9a0738575f RDMA/uverbs: Use uverbs_response() for remaining response copying
This function properly truncates and zero-fills the response which is the
standard used by the ioctl uAPI when working with user data.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
931373a118 RDMA/uverbs: Get rid of the 'callback' scheme in the compat path
There is no reason for this. For response processing we simply need to
copy, truncate, and zero fill the response into whatever output buffer
was provided. Add a function uverbs_response() that does this
consistently.
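
The copy, truncate, and zero fill steps can be sketched in userspace C.
toy_response_copy() is a hypothetical stand-in that writes to an ordinary
buffer with memcpy()/memset(); it is not the kernel's uverbs_response().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write a response of resp_len bytes into an output buffer of out_len
 * bytes. Hypothetical helper for illustration only.
 */
void toy_response_copy(void *out, size_t out_len,
		       const void *resp, size_t resp_len)
{
	size_t n = resp_len < out_len ? resp_len : out_len;

	assert(out != NULL && resp != NULL);

	/* Copy, truncating to whatever room the caller provided. */
	memcpy(out, resp, n);

	/* Zero fill any extra space beyond the response. */
	if (out_len > n)
		memset((unsigned char *)out + n, 0, out_len - n);
}
```

A short output buffer receives a truncated response, and a long one ends
with zeros rather than stale bytes.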

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 12:01:58 -05:00
Jason Gunthorpe
c2a939fda4 RDMA/uverbs: Use uverbs_attr_bundle to pass ucore for write/write_ex
This creates a consistent way to access the two core buffers across write
and write_ex handlers.

Remove the open coded ucore conversion in the write/ex compatibility
handlers.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 11:57:41 -05:00
Jason Gunthorpe
bbb28ad903 RDMA/uverbs: Remove out_len checks that are now done by the core
write() methods must work with fixed sized structures as that is the only
way to know where the udata segment starts. The common udata code now
rejects any write() that has a response buffer shorter than the core's
response.

Thus all the checks of out_len for write methods are redundant and can be
removed.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 11:57:41 -05:00
kbuild test robot
90849f4d05 RDMA/uverbs: fix ptr_ret.cocci warnings
drivers/infiniband/core/uverbs_cmd.c:1095:1-3: WARNING: PTR_ERR_OR_ZERO can be used

 Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci
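
For readers unfamiliar with the idiom, the transformation looks roughly
like this. The toy_*() helpers are simplified userspace re-implementations
written for this sketch, not the kernel's err.h, although the 4095 limit
mirrors the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *toy_err_ptr(long err)
{
	assert(err < 0 && err >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Error pointers occupy the top TOY_MAX_ERRNO addresses. */
int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Before: the open-coded form the cocci script warns about. */
long toy_open_coded(const void *p)
{
	if (toy_is_err(p))
		return toy_ptr_err(p);
	return 0;
}

/* After: a single helper expressing the same thing. */
long toy_ptr_err_or_zero(const void *p)
{
	return toy_is_err(p) ? toy_ptr_err(p) : 0;
}
```

Both forms return the encoded error for an error pointer and 0 for a
valid pointer; the helper just removes the boilerplate.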

Fixes: 7106a97697 ("RDMA/uverbs: Make write() handlers return 0 on success")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-29 15:59:40 -07:00
Jason Gunthorpe
07f05f40d9 RDMA/uverbs: Use uverbs_attr_bundle to pass udata for ioctl()
Have the core code initialize the driver_udata if the method has a udata
description. This is done using the same create_udata the handler was
supposed to call.

This makes ioctl consistent with the write and write_ex paths.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
3a6532c9af RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write
Now that we have metadata describing the command format the core code can
directly compute the udata pointers and all the really ugly
ib_uverbs_init_udata() calls can be removed from the handlers.

This means all the write() handlers are no longer sensitive to the layout
of the command buffer.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
ef87df2c7a RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write_ex
The core code needs to compute the udata so we may as well pass it in the
uverbs_attr_bundle instead of on the stack. This converts the simple case
of write_ex() which already has a core calculation.

Also change the write() path to use the attrs for ib_uverbs_init_udata()
instead of on the stack. This lets the write to write_ex compatibility
path continue to follow the lead of the _ex path.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
da0f60df7b RDMA/uverbs: Prohibit write() calls with too small buffers
The size meta-data in the prior patch describes the smallest acceptable
buffer for the write() interface. Globally check this in the core code.

This is necessary in the case of write() methods that have a driver udata
to prevent computing a negative udata buffer length.

The return code of -ENOSPC is chosen here as some of the handlers already
use this code; however, many other handlers use -EINVAL.
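
The underflow being guarded against can be illustrated with a short
sketch. toy_udata_len() and its parameters are hypothetical; only the
-ENOSPC return value is taken from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Compute how many bytes of driver udata follow the core command struct.
 * Hypothetical helper: in_len is the size of the user's write() buffer,
 * core_len is the smallest acceptable core struct size.
 */
int toy_udata_len(size_t in_len, size_t core_len, size_t *udata_len)
{
	assert(udata_len != NULL);

	/* Without this check the subtraction below would underflow. */
	if (in_len < core_len)
		return -ENOSPC;

	*udata_len = in_len - core_len;
	return 0;
}
```

Doing the check once in common code means individual handlers never see
a buffer shorter than their core struct.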

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
669dac1e00 RDMA/uverbs: Add structure size info to write commands
We need the structure sizes to compute the location of the udata in the
core code. Annotate the sizes into the new macro language.

This is generated largely by script and checked by comparing against the
similar list in rdma-core.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
15a1b4becb RDMA/uverbs: Do not pass ib_uverbs_file to ioctl methods
The uverbs_attr_bundle already contains this pointer, and most methods
don't actually need it. Get rid of the redundant function argument.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
7106a97697 RDMA/uverbs: Make write() handlers return 0 on success
Currently they return the command length, while all other handlers return
0. This makes the write path closer to the write_ex and ioctl path.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Jason Gunthorpe
8313c10fa8 RDMA/uverbs: Replace ib_uverbs_file with uverbs_attr_bundle for write
Now that we can add meta-data to the description of write() methods we
need to pass the uverbs_attr_bundle into all write based handlers so
future patches can use it as a container for any new data transferred out
of the core.

This is the first step to bringing the write() and ioctl() methods to a
common interface signature.

This is a simple search/replace, and we push the attr down into the uobj
and other APIs to keep changes minimal.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-26 16:48:07 -07:00
Artemy Kovalyov
605728e65a IB/umem: Set correct address to the invalidation function
The invalidate range was using PAGE_SIZE instead of the computed 'end',
and had the wrong transformation of page_index due to the weird
construction. This can trigger during error unwind and would cause a
malfunction.

Inline the code and correct the math.

Fixes: 403cd12e2c ("IB/umem: Add contiguous ODP support")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-26 16:28:36 -07:00
Parav Pandit
01b671170d RDMA/core: Sync unregistration with netlink commands
When the rdma device is being removed, a get resource info request can
race with device removal, as below:

      CPU-0                                  CPU-1
    --------                               --------
    rdma_nl_rcv_msg()
       nldev_res_get_cq_dumpit()
          mutex_lock(device_lock);
          get device reference
          mutex_unlock(device_lock);        [..]
                                            ib_unregister_device()
                                            /* Valid reference to
                                             * device->dev exists.
                                             */
                                             ib_dealloc_device()

          [..]
          provider->fill_res_entry();

Even though the device object is not freed, fill_res_entry() can get
called on a device which doesn't have a driver anymore. The kernel core
device reference count is not sufficient, as this only keeps the structure
valid, and doesn't guarantee the driver is still loaded.

A similar race can occur with device renaming and device removal, where
device_rename() tries to rename an unregistered device. While this is fine
for devices of a class which is not net namespace aware, it is incorrect
for the net namespace aware class coming in a subsequent series. If a
class is net namespace aware, then the call trace [1] below is observed in
the above situation.

Therefore, to avoid the race, keep a reference count and let device
unregistration wait until all netlink users drop the reference.

[1] Call trace:
kernfs: ns required in 'infiniband' for 'mlx5_0'
WARNING: CPU: 18 PID: 44270 at fs/kernfs/dir.c:842 kernfs_find_ns+0x104/0x120
libahci i2c_core mlxfw libata dca [last unloaded: devlink]
RIP: 0010:kernfs_find_ns+0x104/0x120
Call Trace:
kernfs_find_and_get_ns+0x2e/0x50
sysfs_rename_link_ns+0x40/0xb0
device_rename+0xb2/0xf0
ib_device_rename+0xb3/0x100 [ib_core]
nldev_set_doit+0x165/0x190 [ib_core]
rdma_nl_rcv_msg+0x249/0x250 [ib_core]
? netlink_deliver_tap+0x8f/0x3e0
rdma_nl_rcv+0xd6/0x120 [ib_core]
netlink_unicast+0x17c/0x230
netlink_sendmsg+0x2f0/0x3e0
sock_sendmsg+0x30/0x40
__sys_sendto+0xdc/0x160

Fixes: da5c850782 ("RDMA/nldev: add driver-specific resource tracking")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-22 12:39:26 -07:00
Parav Pandit
eeb8df87e4 RDMA/cma: Move cma module specific functions to cma_priv.h
Currently several rdma_cm module specific functions are declared in the
core_priv.h file. Now that we have a cma_priv.h file specific to the
rdma_cm kernel module, move them from core_priv.h to cma_priv.h.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-22 11:57:33 -07:00
Jason Gunthorpe
a140692a52 RDMA/uverbs: Check for NULL driver methods for every write call
Add annotations to the uverbs_api structure indicating which driver
methods are called by the implementation. If the required method
is NULL the write API will not be callable.

This effectively duplicates the cmd_mask system, however it does it by
expressing invariants required by the core code, not by delegating
decision making to the driver. This is another step toward eliminating
cmd_mask.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:33 -07:00
Jason Gunthorpe
1de751caf7 RDMA/uverbs: Make all the method functions in uverbs_cmd static
Now that we use struct uverbs_uapi to link the method functions to the
dispatcher there is no reason to have them be extern symbols.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:33 -07:00
Jason Gunthorpe
d120c3c918 RDMA/uverbs: Convert the write interface to use uverbs_api
This organizes the write commands into objects and links them to the
uverbs_api data structure. The command path is reworked to use uapi
instead of its internal structures.

The command mask is moved from a runtime check to a registration time
check in the uapi.

Since the write interface does not have the object ID as part of the
command, the radix bins are converted into linear lists to support the
lookup.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:33 -07:00
Jason Gunthorpe
6884c6c4bd RDMA/verbs: Store the write/write_ex uapi entry points in the uverbs_api
Bringing all uapi entry points into one place lets us deal with them
consistently. For instance the write, write_ex and ioctl paths can be
disabled when an API is not supported by the driver.

This will replace the uverbs_cmd_table static arrays.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:33 -07:00
Jason Gunthorpe
0bd01f3d09 RDMA/uverbs: Require all objects to have a driver destroy function
If we can't destroy the object then we certainly shouldn't allow it to be
created or used. Remove it from the uverbs_uapi in this case.

This also disables methods of other objects that have mandatory object
handle inputs, i.e. REG_DM_MR is now automatically removed if DM objects
cannot be created.

Typically drivers not supporting an interface will mark all of the
supporting functions as NULL, including destroy.

This is intended to automatically eliminate entire corner cases in the API
that are difficult to test.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:32 -07:00
Jason Gunthorpe
6829c1c2b3 RDMA/uverbs: Add helpers to mark uapi functions as unsupported
We have many cases where parts of the uapi are not supported in a driver,
need a certain protocol, or similar. It is best to reflect this directly
into the struct uverbs_api when it is built so that everything is simply
blocked off, and future introspection can report a proper supported list.

This is done by adding some additional helpers to the definition list
language that disable objects based on a 'supported' call back, and a
helper that disables based on a NULL struct ib_device function pointer.

Disablement is global. For instance, if a driver disables an object then
everything connected to that object is removed, including core methods.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:32 -07:00
Jason Gunthorpe
c27f6aa8c9 RDMA/uverbs: Factor out the add/get pattern into a helper
The next patch needs another copy of this, so provide a simple helper to
reduce the coding. uapi_add_get_elm() returns an existing entry or adds a
new one.
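
The add/get pattern itself can be sketched with a fixed-size table in
userspace. struct toy_table and toy_add_get_elm() are hypothetical, and
the flat array stands in for whatever lookup structure the real code
uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_TABLE_SLOTS 64

/* Hypothetical key -> element table, indexed directly by key. */
struct toy_table {
	void *slot[TOY_TABLE_SLOTS];
};

/*
 * Return the element stored under key, allocating a zeroed one of the
 * given size if none exists yet. *exists reports which case happened.
 * Returns NULL only if the allocation fails.
 */
void *toy_add_get_elm(struct toy_table *t, unsigned int key,
		      size_t size, bool *exists)
{
	assert(t != NULL && exists != NULL && key < TOY_TABLE_SLOTS);

	if (t->slot[key]) {
		*exists = true;
		return t->slot[key];
	}

	*exists = false;
	t->slot[key] = calloc(1, size);
	return t->slot[key];
}
```

Callers that need "find or create" semantics share this one helper
instead of repeating the lookup and insert steps.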

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:32 -07:00
Jason Gunthorpe
0cbf432db4 RDMA/uverbs: Use a linear list to describe the compiled-in uapi
The 'tree' data structure is very hard to build at compile time, and this
makes it very limited. The new radix tree based compiler can handle a more
complex input language that does not require the compiler to perfectly
group everything into a neat tree structure.

Instead use a simple list to describe the input, where the list elements
can be of various different 'opcodes' instructing the radix compiler what
to do. Start out with opcodes chaining to other definition lists and
chaining to the existing 'tree' definition.

Replace the very top level of the 'object tree' with this list type and
get rid of struct uverbs_object_tree_def and DECLARE_UVERBS_OBJECT_TREE.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-22 11:57:32 -07:00
Yuval Shaia
3eeeb7a59a IB/core: Make function ib_fmr_pool_unmap return void
Since the function always returns 0 make it void.

Reported-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-21 16:13:02 -07:00
Parav Pandit
d52ef88a9f RDMA/core: Add GIDs while changing MAC addr only for registered ndev
Currently, when the MAC address is changed, regardless of the netdev
reg_state, GID entries are removed and added to reflect the new MAC address
and new default GID entries.

When a bonding device is used and the underlying PCI device is removed,
several netdevice events are generated. Two events of interest are the
CHANGEADDR and UNREGISTER events on the lower (slave) netdevice of the bond
netdevice.

Sometimes the CHANGEADDR event is generated when the netdev state is
UNREGISTERING (after the UNREGISTER event is generated). In this scenario,
GID entries for default GIDs are added and never deleted because GID
entries are deleted only when the netdev state is < UNREGISTERED.

This leads to a non-zero reference count on the netdevice. Due to this, the
PCI device unbind operation gets stuck.

To avoid it, when changing the MAC address, add GID entries only if the
netdev is in the REGISTERED state.

Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-21 14:09:40 -07:00
Moni Shoua
b02394aa75 IB/mlx5: Improve ODP debugging messages
Add and modify debug messages to ODP related error flows.
In that context, return code EAGAIN is considered less severe and the
print level for it is set to debug instead of warn.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-11-12 22:26:02 +02:00
Linus Torvalds
da19a102ce First merge window pull request
This has been a smaller cycle with many of the commits being smallish code
 fixes and improvements across the drivers.
 
 - Driver updates for bnxt_re, cxgb4, hfi1, hns, mlx5, nes, qedr, and rxe
 
 - Memory window support in hns
 
 - mlx5 user API 'flow mutate/steering' allows accessing the full packet
   mangling and matching machinery from user space
 
 - Support inter-working with verbs API calls in the 'devx' mlx5 user API, and
   provide options to use devx with less privilege
 
 - Modernize the use of sysfs and the device interface to use attribute groups
   and cdev properly for uverbs, and clean up some of the core code's device list
   management
 
 - More progress on net namespaces for RDMA devices
 
 - Consolidate driver BAR mmapping support into core code helpers and rework
   how RDMA holds pointers to mm_struct for get_user_pages cases
 
 - First pass to use 'dev_name' instead of ib_device->name
 
 - Device renaming for RDMA devices

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a smaller cycle with many of the commits being smallish
  code fixes and improvements across the drivers.

   - Driver updates for bnxt_re, cxgb4, hfi1, hns, mlx5, nes, qedr, and
     rxe

   - Memory window support in hns

   - mlx5 user API 'flow mutate/steering' allows accessing the full
     packet mangling and matching machinery from user space

   - Support inter-working with verbs API calls in the 'devx' mlx5 user
     API, and provide options to use devx with less privilege

   - Modernize the use of sysfs and the device interface to use attribute
     groups and cdev properly for uverbs, and clean up some of the core
     code's device list management

   - More progress on net namespaces for RDMA devices

   - Consolidate driver BAR mmapping support into core code helpers and
     rework how RDMA holds pointers to mm_struct for get_user_pages
     cases

   - First pass to use 'dev_name' instead of ib_device->name

   - Device renaming for RDMA devices"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (242 commits)
  IB/mlx5: Add support for extended atomic operations
  RDMA/core: Fix comment for hw stats init for port == 0
  RDMA/core: Refactor ib_register_device() function
  RDMA/core: Fix unwinding flow in case of error to register device
  ib_srp: Remove WARN_ON in srp_terminate_io()
  IB/mlx5: Allow scatter to CQE without global signaled WRs
  IB/mlx5: Verify that driver supports user flags
  IB/mlx5: Support scatter to CQE for DC transport type
  RDMA/drivers: Use core provided API for registering device attributes
  RDMA/core: Allow existing drivers to set one sysfs group per device
  IB/rxe: Remove unnecessary enum values
  RDMA/umad: Use kernel API to allocate umad indexes
  RDMA/uverbs: Use kernel API to allocate uverbs indexes
  RDMA/core: Increase total number of RDMA ports across all devices
  IB/mlx4: Add port and TID to MAD debug print
  IB/mlx4: Enable debug print of SMPs
  RDMA/core: Rename ports_parent to ports_kobj
  RDMA/core: Do not expose unsupported counters
  IB/mlx4: Refer to the device kobject instead of ports_parent
  RDMA/nldev: Allow IB device rename through RDMA netlink
  ...
2018-10-26 07:38:19 -07:00
Linus Torvalds
bd6bf7c104 pci-v4.20-changes

Merge tag 'pci-v4.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:

 - Fix ASPM link_state teardown on removal (Lukas Wunner)

 - Fix misleading _OSC ASPM message (Sinan Kaya)

 - Make _OSC optional for PCI (Sinan Kaya)

 - Don't initialize ASPM link state when ACPI_FADT_NO_ASPM is set
   (Patrick Talbert)

 - Remove x86 and arm64 node-local allocation for host bridge structures
   (Punit Agrawal)

 - Pay attention to device-specific _PXM node values (Jonathan Cameron)

 - Support new Immediate Readiness bit (Felipe Balbi)

 - Differentiate between pciehp surprise and safe removal (Lukas Wunner)

 - Remove unnecessary pciehp includes (Lukas Wunner)

 - Drop pciehp hotplug_slot_ops wrappers (Lukas Wunner)

 - Tolerate PCIe Slot Presence Detect being hardwired to zero to work
   around broken hardware, e.g., the Wilocity switch/wireless device
   (Lukas Wunner)

 - Unify pciehp controller & slot structs (Lukas Wunner)

 - Constify hotplug_slot_ops (Lukas Wunner)

 - Drop hotplug_slot_info (Lukas Wunner)

 - Embed hotplug_slot struct into users instead of allocating it
   separately (Lukas Wunner)

 - Initialize PCIe port service drivers directly instead of relying on
   initcall ordering (Keith Busch)

 - Restore PCI config state after a slot reset (Keith Busch)

 - Save/restore DPC config state along with other PCI config state
   (Keith Busch)

 - Reference count devices during AER handling to avoid race issue with
   concurrent hot removal (Keith Busch)

 - If an Upstream Port reports ERR_FATAL, don't try to read the Port's
   config space because it is probably unreachable (Keith Busch)

 - During error handling, use slot-specific reset instead of secondary
   bus reset to avoid link up/down issues on hotplug ports (Keith Busch)

 - Restore previous AER/DPC handling that does not remove and
   re-enumerate devices on ERR_FATAL (Keith Busch)

 - Notify all drivers that may be affected by error recovery resets
   (Keith Busch)

 - Always generate error recovery uevents, even if a driver doesn't have
   error callbacks (Keith Busch)

 - Make PCIe link active reporting detection generic (Keith Busch)

 - Support D3cold in PCIe hierarchies during system sleep and runtime,
   including hotplug and Thunderbolt ports (Mika Westerberg)

 - Handle hpmemsize/hpiosize kernel parameters uniformly, whether slots
   are empty or occupied (Jon Derrick)

 - Remove duplicated include from pci/pcie/err.c and unused variable
   from cpqphp (YueHaibing)

 - Remove driver pci_cleanup_aer_uncorrect_error_status() calls (Oza
   Pawandeep)

 - Uninline PCI bus accessors for better ftracing (Keith Busch)

 - Remove unused AER Root Port .error_resume method (Keith Busch)

 - Use kfifo in AER instead of a local version (Keith Busch)

 - Use threaded IRQ in AER bottom half (Keith Busch)

 - Use managed resources in AER core (Keith Busch)

 - Reuse pcie_port_find_device() for AER injection (Keith Busch)

 - Abstract AER interrupt handling to disconnect error injection (Keith
   Busch)

 - Refactor AER injection callbacks to simplify future improvements
   (Keith Busch)

 - Remove unused Netronome NFP32xx Device IDs (Jakub Kicinski)

 - Use bitmap_zalloc() for dma_alias_mask (Andy Shevchenko)

 - Add switch fall-through annotations (Gustavo A. R. Silva)

 - Remove unused Switchtec quirk variable (Joshua Abraham)

 - Fix pci.c kernel-doc warning (Randy Dunlap)

 - Remove trivial PCI wrappers for DMA APIs (Christoph Hellwig)

 - Add Intel GPU device IDs to spurious interrupt quirk (Bin Meng)

 - Run Switchtec DMA aliasing quirk only on NTB endpoints to avoid
   useless dmesg errors (Logan Gunthorpe)

 - Update Switchtec NTB documentation (Wesley Yung)

 - Remove redundant "default n" from Kconfig (Bartlomiej Zolnierkiewicz)

 - Avoid panic when drivers enable MSI/MSI-X twice (Tonghao Zhang)

 - Add PCI support for peer-to-peer DMA (Logan Gunthorpe)

 - Add sysfs group for PCI peer-to-peer memory statistics (Logan
   Gunthorpe)

 - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan
   Gunthorpe)

 - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan
   Gunthorpe)

 - Add PCI peer-to-peer DMA driver writer's documentation (Logan
   Gunthorpe)

 - Add block layer flag to indicate driver support for PCI peer-to-peer
   DMA (Logan Gunthorpe)

 - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P
   memory (Logan Gunthorpe)

 - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan
   Gunthorpe)

 - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan
   Gunthorpe)

 - Use PCI peer-to-peer memory in nvme (Stephen Bates, Steve Wise,
   Christoph Hellwig, Logan Gunthorpe)

 - Cache VF config space size to optimize enumeration of many VFs
   (KarimAllah Ahmed)

 - Remove unnecessary <linux/pci-ats.h> include (Bjorn Helgaas)

 - Fix VMD AERSID quirk Device ID matching (Jon Derrick)

 - Fix Cadence PHY handling during probe (Alan Douglas)

 - Signal Cadence Endpoint interrupts via AXI region 0 instead of last
   region (Alan Douglas)

 - Write Cadence Endpoint MSI interrupts with 32 bits of data (Alan
   Douglas)

 - Remove redundant controller tests for "device_type == pci" (Rob
   Herring)

 - Document R-Car E3 (R8A77990) bindings (Tho Vu)

 - Add device tree support for R-Car r8a7744 (Biju Das)

 - Drop unused mvebu PCIe capability code (Thomas Petazzoni)

 - Add shared PCI bridge emulation code (Thomas Petazzoni)

 - Convert mvebu to use shared PCI bridge emulation (Thomas Petazzoni)

 - Add aardvark Root Port emulation (Thomas Petazzoni)

 - Support 100MHz/200MHz refclocks for i.MX6 (Lucas Stach)

 - Add initial power management for i.MX7 (Leonard Crestez)

 - Add PME_Turn_Off support for i.MX7 (Leonard Crestez)

 - Fix qcom runtime power management error handling (Bjorn Andersson)

 - Update TI dra7xx unaligned access errata workaround for host mode as
   well as endpoint mode (Vignesh R)

 - Fix kirin section mismatch warning (Nathan Chancellor)

 - Remove iproc PAXC slot check to allow VF support (Jitendra Bhivare)

 - Quirk Keystone K2G to limit MRRS to 256 (Kishon Vijay Abraham I)

 - Update Keystone to use MRRS quirk for host bridge instead of open
   coding (Kishon Vijay Abraham I)

 - Refactor Keystone link establishment (Kishon Vijay Abraham I)

 - Simplify and speed up Keystone link training (Kishon Vijay Abraham I)

 - Remove unused Keystone host_init argument (Kishon Vijay Abraham I)

 - Merge Keystone driver files into one (Kishon Vijay Abraham I)

 - Remove redundant Keystone platform_set_drvdata() (Kishon Vijay
   Abraham I)

 - Rename Keystone functions for uniformity (Kishon Vijay Abraham I)

 - Add Keystone device control module DT binding (Kishon Vijay Abraham
   I)

 - Use SYSCON API to get Keystone control module device IDs (Kishon
   Vijay Abraham I)

 - Clean up Keystone PHY handling (Kishon Vijay Abraham I)

 - Use runtime PM APIs to enable Keystone clock (Kishon Vijay Abraham I)

 - Clean up Keystone config space access checks (Kishon Vijay Abraham I)

 - Get Keystone outbound window count from DT (Kishon Vijay Abraham I)

 - Clean up Keystone outbound window configuration (Kishon Vijay Abraham
   I)

 - Clean up Keystone DBI setup (Kishon Vijay Abraham I)

 - Clean up Keystone ks_pcie_link_up() (Kishon Vijay Abraham I)

 - Fix Keystone IRQ status checking (Kishon Vijay Abraham I)

 - Add debug messages for all Keystone errors (Kishon Vijay Abraham I)

 - Clean up Keystone includes and macros (Kishon Vijay Abraham I)

 - Fix Mediatek unchecked return value from devm_pci_remap_iospace()
   (Gustavo A. R. Silva)

 - Fix Mediatek endpoint/port matching logic (Honghui Zhang)

 - Change Mediatek Root Port Class Code to PCI_CLASS_BRIDGE_PCI (Honghui
   Zhang)

 - Remove redundant Mediatek PM domain check (Honghui Zhang)

 - Convert Mediatek to pci_host_probe() (Honghui Zhang)

 - Fix Mediatek MSI enablement (Honghui Zhang)

 - Add Mediatek system PM support for MT2712 and MT7622 (Honghui Zhang)

 - Add Mediatek loadable module support (Honghui Zhang)

 - Detach VMD resources after stopping root bus to prevent orphan
   resources (Jon Derrick)

 - Convert pcitest build process to that used by other tools (iio, perf,
   etc) (Gustavo Pimentel)

* tag 'pci-v4.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
  PCI/AER: Refactor error injection fallbacks
  PCI/AER: Abstract AER interrupt handling
  PCI/AER: Reuse existing pcie_port_find_device() interface
  PCI/AER: Use managed resource allocations
  PCI: pcie: Remove redundant 'default n' from Kconfig
  PCI: aardvark: Implement emulated root PCI bridge config space
  PCI: mvebu: Convert to PCI emulated bridge config space
  PCI: mvebu: Drop unused PCI express capability code
  PCI: Introduce PCI bridge emulated config space common logic
  PCI: vmd: Detach resources after stopping root bus
  nvmet: Optionally use PCI P2P memory
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
  block: Add PCI P2P flag for request queue
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
  PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
  ...
2018-10-25 06:50:48 -07:00
David S. Miller
2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Logan Gunthorpe
50b7d22079 IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
In order to use PCI P2P memory the pci_p2pmem_map_sg() function must be
called to map the correct PCI bus address.

To do this, check the first page in the scatter list to see if it is P2P
memory or not.  At the moment, scatter lists that contain P2P memory must
be homogeneous so if the first page is P2P the entire SGL should be P2P.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-10-17 12:18:20 -05:00
Parav Pandit
76d865b87c RDMA/core: Fix comment for hw stats init for port == 0
When add_port() is done for port == 0, it indicates that the port's
hardware counter initialization should be skipped. Reflect this in the
comment.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17 11:43:07 -04:00
Parav Pandit
548cb4fbe8 RDMA/core: Refactor ib_register_device() function
ib_register_device() does several allocation and initialization
steps. Split it into smaller more readable functions for easy
review and maintenance.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17 11:43:07 -04:00
Parav Pandit
67fecaf8e9 RDMA/core: Fix unwinding flow in case of error to register device
If port pkey list initialization fails, free the port_immutable memory
in the cleanup path. Currently this is missed.

If cache setup fails, free the pkey list in the cleanup path.

Fixes: d291f1a65 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17 11:43:07 -04:00
Leon Romanovsky
551d315e34 RDMA/umad: Use kernel API to allocate umad indexes
Replace the custom index allocation code with the generic kernel API.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 14:36:21 -04:00
Leon Romanovsky
90f6e41cc0 RDMA/uverbs: Use kernel API to allocate uverbs indexes
Replace the custom index allocation code with the generic kernel API.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 14:36:21 -04:00
Leon Romanovsky
7d65cbf0b0 RDMA/core: Increase total number of RDMA ports across all devices
IDA adds overhead to store the ID bitmap, and the maximal IDA value
can be up to 2099202 (IDA_MAX = 0x80000000U / IDA_BITMAP_BITS - 1).

However, there is no need to support such an enormous number of devices,
and it is enough for now to limit it to 8192.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 14:36:20 -04:00
Parav Pandit
1ae4cfa039 RDMA/core: Rename ports_parent to ports_kobj
Normally kobj objects carry a kobj suffix to reflect what they are.
Rename ports_parent to ports_kobj.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 14:09:45 -04:00
Parav Pandit
0f6ef65d1c RDMA/core: Do not expose unsupported counters
If the provider driver (such as rdma_rxe) doesn't support PMA counters,
avoid exposing its directory, as is done for the optional hw_counters
directory. If the core fails to read a PMA counter, return an error so
that the user can retry later if needed.

Fixes: 35c4cbb178 ("IB/core: Create get_perf_mad function in sysfs.c")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 14:09:44 -04:00
Leon Romanovsky
05d940d3a3 RDMA/nldev: Allow IB device rename through RDMA netlink
Provide an option to rename an IB device through RDMA netlink and
limit it to users with ADMIN capability only.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 13:37:16 -04:00
Leon Romanovsky
d21943dd19 RDMA/core: Implement IB device rename function
Generic implementation of IB device rename function.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 13:37:16 -04:00
Leon Romanovsky
dbace111e5 RDMA/core: Annotate timeout as unsigned long
The ucma users supply the timeout in u32 format, which means that any
number with the most significant bit set will be converted to a negative
value by the various rdma_*, cma_* and sa_query functions that treat the
timeout as int.

At the lowest level, the timeout is converted back to unsigned long.
Remove this ambiguous conversion by updating all function signatures to
receive unsigned long.

Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 13:34:01 -04:00
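
The signedness problem described in the "RDMA/core: Annotate timeout as unsigned long" entry above can be reproduced in plain userspace C. This is a minimal sketch, not the kernel code: `timeout_as_int()` and `timeout_as_ulong()` are hypothetical stand-ins for an internal helper before and after the signature change. Converting an out-of-range value to `int` is implementation-defined in C; the negative result shown in the comment is what common two's-complement compilers such as gcc and clang produce.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an internal helper that takes the timeout as int. */
static int timeout_as_int(uint32_t user_timeout_ms)
{
	/*
	 * Implementation-defined for values above INT_MAX; on common
	 * two's-complement targets the result wraps to a negative number.
	 */
	return (int)user_timeout_ms;
}

/* Hypothetical stand-in for the same helper once it takes unsigned long. */
static unsigned long timeout_as_ulong(uint32_t user_timeout_ms)
{
	/* The C standard guarantees unsigned long holds every 32-bit value. */
	assert(sizeof(unsigned long) >= sizeof(uint32_t));
	return user_timeout_ms;
}
```

Passing `0x80000000u` to the first helper yields a negative timeout on such targets, while the second helper preserves the value the user supplied.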
Leon Romanovsky
9549c2bd09 RDMA/core: Align multiple functions to kernel coding style
This patch changes a small number of functions to be aligned with the
kernel coding style. It is needed to minimize the diffstat of the
following patch. It doesn't change any functionality.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 13:34:01 -04:00
Leon Romanovsky
d6f9125207 RDMA/cma: Remove unused timeout_ms parameter from cma_resolve_iw_route()
cma_resolve_iw_route() doesn't use timeout_ms parameter, so let's remove it.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 13:34:01 -04:00
Gustavo A. R. Silva
a3671a4f97 RDMA/ucma: Fix Spectre v1 vulnerability
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/infiniband/core/ucma.c:1686 ucma_write() warn: potential
spectre issue 'ucma_cmd_table' [r] (local cap)

Fix this by sanitizing hdr.cmd before using it to index
ucma_cmd_table.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 12:47:40 -04:00
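
The sanitizing step described in the Spectre v1 entry above can be sketched in userspace C. The kernel helper for this job is array_index_nospec() from <linux/nospec.h>; the diff itself is not shown in this log, so the code below is only an illustration of the general masking idea. `index_mask_nospec()`, `clamp_index_nospec()`, `cmd_table` and `dispatch()` are hypothetical names, and the mask relies on arithmetic right shift of negative values, which gcc and clang provide but the C standard does not guarantee. A userspace compiler is also free to reintroduce a branch, so this shows the arithmetic, not a hardened mitigation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones when index < size, zero otherwise, computed without a branch. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_LONG - 1));
}

/* Returns index when it is in bounds, 0 otherwise. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
	assert(size > 0 && size <= LONG_MAX);
	return index & index_mask_nospec(index, size);
}

typedef int (*cmd_handler)(void);

static int cmd_a(void) { return 1; }
static int cmd_b(void) { return 2; }
static int cmd_c(void) { return 3; }

/* Hypothetical command table indexed by a user-controlled value. */
static const cmd_handler cmd_table[] = { cmd_a, cmd_b, cmd_c };

static int dispatch(unsigned long cmd)
{
	size_t n = sizeof(cmd_table) / sizeof(cmd_table[0]);

	if (cmd >= n)
		return -1;
	/* Clamp after the bounds check so a mispredicted branch cannot
	 * use an out-of-range value as the table index. */
	cmd = clamp_index_nospec(cmd, n);
	return cmd_table[cmd]();
}
```

The bounds check stays in place; the clamp only ensures that the index used for the first load is in range even when the check is speculatively bypassed.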
Gustavo A. R. Silva
0295e39595 IB/ucm: Fix Spectre v1 vulnerability
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/infiniband/core/ucm.c:1127 ib_ucm_write() warn: potential
spectre issue 'ucm_cmd_table' [r] (local cap)

Fix this by sanitizing hdr.cmd before using it to index
ucm_cmd_table.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-16 11:32:40 -04:00
Leon Romanovsky
e54b6a3bcd RDMA/cm: Respect returned status of cm_init_av_by_path
Add the missing check for failure of cm_init_av_by_path().

Fixes: e1444b5a16 ("IB/cm: Fix automatic path migration support")
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-16 08:29:24 -06:00
Leon Romanovsky
fe9bc16449 RDMA/restrack: Protect from reentry to resource return path
Nullify the resource task struct pointer to ensure that subsequent calls
won't try to release task_struct again.

------------[ cut here ]------------
ODEBUG: free active (active state 1) object type: rcu_head hint:
(null)
WARNING: CPU: 0 PID: 6048 at lib/debugobjects.c:329
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 6048 Comm: syz-executor022 Not tainted
4.19.0-rc7-next-20181008+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x244/0x3ab lib/dump_stack.c:113
  panic+0x238/0x4e7 kernel/panic.c:184
  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
  report_bug+0x254/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
  do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969
RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Code: 41 88 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 92 00 00 00 48 8b 14
dd
60 02 41 88 4c 89 fe 48 c7 c7 00 f8 40 88 e8 36 2f b4 fd <0f> 0b 83 05
a9
f4 5e 06 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f
RSP: 0018:ffff8801d8c3eda8 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8164d235 RDI: 0000000000000005
RBP: ffff8801d8c3ede8 R08: ffff8801d70aa280 R09: ffffed003b5c3eda
R10: ffffed003b5c3eda R11: ffff8801dae1f6d7 R12: 0000000000000001
R13: ffffffff8939a760 R14: 0000000000000000 R15: ffffffff8840fca0
  __debug_check_no_obj_freed lib/debugobjects.c:786 [inline]
  debug_check_no_obj_freed+0x3ae/0x58d lib/debugobjects.c:818
  kmem_cache_free+0x202/0x290 mm/slab.c:3759
  free_task_struct kernel/fork.c:163 [inline]
  free_task+0x16e/0x1f0 kernel/fork.c:457
  __put_task_struct+0x2e6/0x620 kernel/fork.c:730
  put_task_struct include/linux/sched/task.h:96 [inline]
  finish_task_switch+0x66c/0x900 kernel/sched/core.c:2715
  context_switch kernel/sched/core.c:2834 [inline]
  __schedule+0x8d7/0x21d0 kernel/sched/core.c:3480
  schedule+0xfe/0x460 kernel/sched/core.c:3524
  freezable_schedule include/linux/freezer.h:172 [inline]
  futex_wait_queue_me+0x3f9/0x840 kernel/futex.c:2530
  futex_wait+0x45c/0xa50 kernel/futex.c:2645
  do_futex+0x31a/0x26d0 kernel/futex.c:3528
  __do_sys_futex kernel/futex.c:3589 [inline]
  __se_sys_futex kernel/futex.c:3557 [inline]
  __x64_sys_futex+0x472/0x6a0 kernel/futex.c:3557
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446549
Code: e8 2c b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 2b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f3a998f5da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00000000006dbc38 RCX: 0000000000446549
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000000006dbc38
RBP: 00000000006dbc30 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc3c
R13: 2f646e6162696e69 R14: 666e692f7665642f R15: 00000000006dbd2c
Kernel Offset: disabled

Reported-by: syzbot+71aff6ea121ffefc280f@syzkaller.appspotmail.com
Fixes: ed7a01fd3f ("RDMA/restrack: Release task struct which was hold by CM_ID object")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-16 08:24:36 -06:00
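
The re-entry protection described in the "RDMA/restrack: Protect from reentry to resource return path" entry above follows a common pattern: drop the reference, then clear the pointer so a second call becomes a no-op. The sketch below is a userspace illustration only; `struct task`, `struct resource`, `put_task()` and `resource_release()` are hypothetical names, not the restrack API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for task_struct. */
struct task {
	int refcount;
};

/* Hypothetical tracked resource holding one reference on a task. */
struct resource {
	struct task *task;
};

static void put_task(struct task *t)
{
	/* Dropping a reference that is not held is the bug being guarded against. */
	assert(t->refcount > 0);
	t->refcount--;
}

static void resource_release(struct resource *res)
{
	if (!res->task)
		return;		/* already released by an earlier call */

	put_task(res->task);
	res->task = NULL;	/* make any later call a no-op */
}
```

Calling `resource_release()` twice on the same resource drops the reference only once.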
Jason Gunthorpe
59bfc59a68 Merge branch 'for-rc' into rdma.git for-next
From git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

This is required to resolve dependencies of the next series of RDMA
patches.

The code motion conflicts in drivers/infiniband/core/cache.c were
resolved.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-16 00:01:02 -06:00
Denis Drozdov
5d6b0cb336 RDMA/netdev: Fix netlink support in IPoIB
IPoIB netlink support was broken by the commit below, since integrating
the rdma_netdev support relies on an allocation flow for netdevs that
was controlled by the ipoib driver, while netdev's rtnl_newlink
implementation assumes that the netdev will be allocated by netlink.
This situation leads to a crash in __ipoib_device_add when trying to
reuse the netlink device.

This patch fixes the kernel oops for both mlx4 and mlx5
devices triggered by the following command:

Fixes: cd565b4b51 ("IB/IPoIB: Support acceleration options callbacks")
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 17:58:12 -07:00
Denis Drozdov
f6a8a19bb1 RDMA/netdev: Hoist alloc_netdev_mqs out of the driver
netdev has several interfaces that expect to call alloc_netdev_mqs from
the core code, with the driver only providing the arguments.  This is
incompatible with the rdma_netdev interface that returns the netdev
directly.

Thus re-organize the API used by ipoib so that the verbs core code calls
alloc_netdev_mqs for the driver. This is done by allowing the drivers to
provide the allocation parameters via a 'get_params' callback and then
initializing an allocated netdev as a second step.

Fixes: cd565b4b51 ("IB/IPoIB: Support acceleration options callbacks")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 17:58:11 -07:00
Leon Romanovsky
ed7a01fd3f RDMA/restrack: Release task struct which was hold by CM_ID object
Tracking CM_ID resource is performed in two stages: creation of cm_id
and connecting it to the cma_dev. It is needed because rdma-cm protocol
exports two separate user-visible calls rdma_create_id and rdma_accept.

At the time of CM_ID creation, the real owner of that object is not yet
known and we need to grab task_struct. This task_struct is released or
reassigned in the attach phase later on, but the call to rdma_destroy_id
left this task_struct unreleased.

Such separation is unique to CM_ID; other restrack objects are
initialized in one shot. It means that it is safe to use the "res->valid"
check to catch an unfinished CM_ID flow and release task_struct for that
object.

Fixes: 00313983cd ("RDMA/nldev: provide detailed CM_ID information")
Reported-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Leon Romanovsky
2165fc2640 RDMA/restrack: Consolidate task name updates in one place
Unify task update and kernel name set in one place.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Leon Romanovsky
363ad35577 RDMA/restrack: Un-inline set task implementation
Prepare rdma_restrack_set_task() call to accommodate more
code by moving its implementation from *.h to *.c.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Parav Pandit
fe33507ec3 RDMA/core: Check error status of rdma_find_ndev_for_src_ip_rcu
rdma_find_ndev_for_src_ip_rcu() returns either a valid netdev pointer or
ERR_PTR().  Instead of checking for NULL, check for an error.

Fixes: caf1e3ae9f ("RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu")
Reported-by: syzbot+20c32fa6ff84a2d28c36@syzkaller.appspotmail.com
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 20:47:41 -06:00
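
The distinction drawn in the "Check error status of rdma_find_ndev_for_src_ip_rcu" entry above is that an error-encoding pointer is never NULL, so a NULL check lets it through. The sketch below re-creates the idea in userspace; `err_ptr()`, `is_err()` and `ptr_err()` are simplified, hypothetical re-implementations of the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), and `find_ndev()` is a made-up lookup, not the RDMA function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
static void *err_ptr(long error)
{
	assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct netdev {
	int ifindex;
};

static struct netdev sketch_dev = { 7 };

/* Hypothetical lookup: returns a valid pointer or an encoded error, never NULL. */
static struct netdev *find_ndev(int have_match)
{
	if (!have_match)
		return err_ptr(-ENODEV);
	return &sketch_dev;
}

static int resolve_ifindex(int have_match)
{
	struct netdev *ndev = find_ndev(have_match);

	/* A "!ndev" test here would miss the failure and dereference garbage. */
	if (is_err(ndev))
		return (int)ptr_err(ndev);
	return ndev->ifindex;
}
```

With the lookup failing, `find_ndev(0)` is non-NULL, which is why the caller has to test with the error-pointer check instead.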
Leon Romanovsky
38716732f1 RDMA/netlink: Simplify netlink listener existence check
All users of rdma_nl_chk_listeners() only want a boolean answer on whether
the netlink socket has listeners, so convert it to a boolean function and
update all callers.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:06:07 -06:00
Kamal Heib
d31131bba5 RDMA: Remove unused parameter from ib_modify_qp_is_ok()
The ll parameter is not used in ib_modify_qp_is_ok(), so remove it.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:05:46 -06:00
Jason Gunthorpe
e73798f20e RDMA/uverbs: Fix RCU annotation for radix slot deference
The uapi radix tree is a write-once data structure protected by kref.
Once we get to the ioctl() fop it is not possible for anything else
to be writing to it, so the access should use rcu_dereference_protected.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:01:40 -06:00
Parav Pandit
41ab1cb7d1 RDMA/cma: Introduce and use cma_ib_acquire_dev()
When an RDMA CM connect request arrives for IB transport, it already
contains the device, the port and (optionally) the netdevice.

Instead of traversing all the cma devices, use the cma device already
found by the cma_find_listener() for which a listener id is provided.

iWarp devices don't need to derive RoCE GIDs, therefore drop RoCE
specific checks from cma_acquire_dev() and rename it to
cma_iw_acquire_dev().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
ff11c6cd52 RDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()
Lightweight version of cma_acquire_dev() just for binding to an rdma
device based on the source IP (v4/v6) address.

This simplifies cma_acquire_dev() by avoiding listen_id specific checks
and prepares for subsequent simplification for IB vs iWarp.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
78fb282b15 RDMA/cma: Allow accepting requests for multi port rdma device
When IP failover is used between multiple ports of a given rdma device,
allow accepting CM requests from either of the ports.  This is applicable
to IPv4 and IPv6 non-link-local addressing schemes.

IPv6 link local addresses are bound. IP failover requests for listen
cm_ids bound to specific netdev interfaces cannot be supported
(similar to traditional sockets).

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
3994586f4d RDMA/core: Acquire and release mmap_sem on page range
Currently mmap_sem is read locked while pinning the memory.  In a
multi-threaded application, holding the mmap_sem lock creates contention
with other threads that might be registering memory, creating QPs or
simply doing mmap(), as such operations also need to hold the mmap_sem
write lock.

Such operations cannot make forward progress until the memory pin
operation is completed.  It gets worse if the memory is unpinned
and/or the memory registration is large (in the GB range).

Therefore, instead of holding mmap_sem for too long (for the whole region
pinning), acquire and release the lock for every few pages.  For example
on x86 with 4K page size, acquire and release mmap_sem for every 2 MB
memory chunk.
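
The arithmetic behind the 2 MB figure can be sketched as follows. This is
an illustrative userspace sketch, not the kernel code: it assumes the batch
size is bounded by how many page pointers fit in one page (512 pointers of
8 bytes in a 4K page), which is consistent with the 2 MB example but is an
assumption here. The helper names pages_per_batch(), bytes_per_batch() and
lock_cycles() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pages pinned per lock hold, assuming the batch is
 * limited to the page pointers that fit in a single page.
 */
static size_t pages_per_batch(size_t page_size, size_t ptr_size)
{
	assert(ptr_size > 0 && page_size >= ptr_size);
	return page_size / ptr_size;
}

/* Bytes of user memory covered by one acquire/release of the lock. */
static uint64_t bytes_per_batch(size_t page_size, size_t ptr_size)
{
	return (uint64_t)pages_per_batch(page_size, ptr_size) * page_size;
}

/* Number of acquire/release cycles needed to pin region_bytes. */
static uint64_t lock_cycles(uint64_t region_bytes, size_t page_size,
			    size_t ptr_size)
{
	uint64_t batch = bytes_per_batch(page_size, ptr_size);

	/* Round up so a partial final batch still counts as one cycle. */
	return (region_bytes + batch - 1) / batch;
}
```

Under these assumptions, bytes_per_batch(4096, 8) is 2 MB, and a 48 GB
registration is split into 24576 short lock holds instead of one long one.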

This allows other competing threads that wish to hold mmap_sem for a
shorter duration to make progress.

When memory registration latency is measured using [1] for memory sizes
ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many
runs no difference is seen other than run-to-run variance.

In other targeted tests of users with large memory, desired improvements
are seen due to reduced contention of mmap_sem.

[1] https://github.com/paravmellanox/rtool

$ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A

It registers pinned memory from 4K to 48GB size with 500 iterations for
each memory size.

$ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4

4 competing threads pin memory, each of 12GB size with 500 iterations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-27 12:40:20 -06:00
Alex Estrin
c8b53d0c5e IB/sa: simplify return code logic for ib_nl_send_msg()
rdma_nl_multicast() returns either negative error code
or zero if succeeded. Remove unnecessary ret code checks
and reassignments.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-26 16:35:48 -06:00
Jason Gunthorpe
896de0090a RDMA/core: Use dev_name instead of ibdev->name
These return the same thing but dev_name is a more conventional use of the
kernel API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
2018-09-26 13:51:48 -06:00
Jason Gunthorpe
43c7c851b9 RDMA/core: Use dev_err/dbg/etc instead of pr_* + ibdev->name
Any messages related to a device should be printed with the dev_*
formatters. This provides greater consistency for the user.

The core does not set pr_fmt so this has no significant change.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
2018-09-26 13:51:48 -06:00
Jason Gunthorpe
e349f858d2 RDMA: Fully setup the device name in ib_register_device
The current code has two copies of the device name, ibdev->name and
dev_name(&ibdev->dev), and they are set up at different times, which is
very confusing.

Set them both up at the same time and make dev_name() the lead name, which
is the proper use of the driver core APIs. To make it very clear that the
name is not valid until registration, pass it in to the
ib_register_device() call rather than messing with ibdev->name directly.

Also the reorganization now checks that dev_name is unique even if it does
not contain a %.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
2018-09-26 13:51:36 -06:00
Doug Ledford
c6ce580716 RDMA/umem: Fix potential addition overflow
Given a large enough memory allocation, it is possible to wrap the
pinned_vm counter.  Check for addition overflow to prevent such
eventualities.
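
A minimal sketch of such a check, assuming an unsigned page counter and a
caller-supplied limit. The function name try_account_pinned() and its
parameters are hypothetical and do not come from the patch; the point is
only that the sum is tested for wrap-around before it is compared against
the limit.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: add npages to *pinned only if the sum neither
 * wraps around nor exceeds limit. Returns true when the count was updated.
 */
static bool try_account_pinned(uint64_t *pinned, uint64_t npages,
			       uint64_t limit)
{
	uint64_t new_pinned = *pinned + npages;

	/* Unsigned addition wraps; a smaller result means overflow. */
	if (new_pinned < *pinned)
		return false;

	if (new_pinned > limit)
		return false;

	*pinned = new_pinned;
	return true;
}
```

Without the first test, a wrapped sum could fall below the limit and be
accepted even though the request is far larger than allowed.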

Fixes: 40ddacf2dd ("RDMA/umem: Don't hold mmap_sem for too long")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:19:06 -06:00
Doug Ledford
3312d1c6bd RDMA/umem: Minor optimizations
Noticed while reviewing commit d4b4dd1b97 ("RDMA/umem: Do not use
current->tgid to track the mm_struct") patch.  Why would we take a lock,
adjust a protected variable, drop the lock, and *then* check the input
into our protected variable adjustment?  Then we have to take the lock
again on our error unwind.  Let's just check the input early and skip
taking the locks needlessly if the input isn't valid.

It was also noticed that we set mm = current->mm, we then never modify
mm, but we still go back and reference current->mm a number of times
needlessly.  Be consistent in using the stored reference in mm.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:19:06 -06:00
Parav Pandit
5c5702e259 RDMA/core: Set right entry state before releasing reference
Currently add_modify_gid() for the IB link layer has the following issue
in the cache update path.

When a GID update event occurs, the core releases its reference to the GID
table without updating its state and/or entry pointer.

CPU-0                              CPU-1
------                             -----
ib_cache_update()                    IPoIB ULP
   add_modify_gid()                   [..]
      put_gid_entry()
      refcnt = 0, but
      state = valid,
      entry is valid.
      (work item is not yet executed).
                                   ipoib_create_ah()
                                     rdma_create_ah()
                                        rdma_get_gid_attr() <--
                                   	Tries to acquire gid_attr
                                        which has refcnt = 0.
                                   	This is incorrect.

The GID entry state and entry pointer provide the accurate GID entry
state. Such fields must be updated under the rwlock to protect against
readers, and they must be in a sane state before the refcount can drop
to zero. Otherwise the above race condition can happen, leading to a
use-after-free situation.

Following backtrace has been observed when cache update for an IB port
is triggered while IPoIB ULP is creating an AH.

Therefore, when updating a GID entry, first mark a valid entry as invalid
through its state and set the barrier so that no callers can acquire
the GID entry, and then release the reference to it.

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 4 PID: 29106 at lib/refcount.c:153 refcount_inc_checked+0x30/0x50
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:refcount_inc_checked+0x30/0x50
RSP: 0018:ffff8802ad36f600 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000008 RDI: ffffffff86710100
RBP: ffff8802d6e60a30 R08: ffffed005d67bf8b R09: ffffed005d67bf8b
R10: 0000000000000001 R11: ffffed005d67bf8a R12: ffff88027620cee8
R13: ffff8802d6e60988 R14: ffff8802d6e60a78 R15: 0000000000000202
FS: 0000000000000000(0000) GS:ffff8802eb200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3ab35e5c88 CR3: 00000002ce84a000 CR4: 00000000000006e0
IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Call Trace:
rdma_get_gid_attr+0x220/0x310 [ib_core]
? lock_acquire+0x145/0x3a0
rdma_fill_sgid_attr+0x32c/0x470 [ib_core]
rdma_create_ah+0x89/0x160 [ib_core]
? rdma_fill_sgid_attr+0x470/0x470 [ib_core]
? ipoib_create_ah+0x52/0x260 [ib_ipoib]
ipoib_create_ah+0xf5/0x260 [ib_ipoib]
ipoib_mcast_join_complete+0xbbe/0x2540 [ib_ipoib]
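
The ordering described above (invalidate first, drop the reference second)
can be illustrated with a small userspace sketch. It is not the ib_core
code: struct gid_slot, the slot_*() helpers and the tiny atomic_flag
spinlock standing in for the table rwlock are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one GID table slot. */
struct gid_slot {
	atomic_flag lock;	/* stands in for the table rwlock */
	bool valid;		/* entry state, protected by lock */
	atomic_int refcnt;	/* the table itself holds one reference */
};

static void slot_lock(struct gid_slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void slot_unlock(struct gid_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

static void slot_init(struct gid_slot *s)
{
	atomic_flag_clear(&s->lock);
	s->valid = true;
	atomic_init(&s->refcnt, 1);
}

/* Reader side: take a reference only while the entry is still valid. */
static bool slot_get(struct gid_slot *s)
{
	bool ok;

	slot_lock(s);
	ok = s->valid;
	if (ok)
		atomic_fetch_add(&s->refcnt, 1);
	slot_unlock(s);
	return ok;
}

/* Returns true when the caller dropped the last reference. */
static bool slot_put(struct gid_slot *s)
{
	return atomic_fetch_sub(&s->refcnt, 1) == 1;
}

/* Update side: mark invalid under the lock, then drop the table's ref. */
static bool slot_retire(struct gid_slot *s)
{
	slot_lock(s);
	s->valid = false;
	slot_unlock(s);
	return slot_put(s);
}
```

Because slot_get() increments only under the lock and only while valid is
set, every successful get happens before the table's reference is dropped,
so the count can never be incremented from zero.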

Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:01:09 -06:00
Mark Bloch
a9360abd3d IB/uverbs: Free uapi on destroy
Make sure we free struct uverbs_api once we clean the radix tree. It was
allocated by uverbs_alloc_api().

Fixes: 9ed3e5f447 ("IB/uverbs: Build the specs into a radix tree at runtime")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 14:47:33 -06:00
Jason Gunthorpe
2a3ccfdbeb RDMA/uverbs: Get rid of ucontext->tgid
Nothing uses this now, just delete it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
56ac9dd917 RDMA/umem: Avoid synchronize_srcu in the ODP MR destruction path
synchronize_rcu is slow enough that it should be avoided on the syscall
path when user space is destroying MRs. After all the rework we can now
trivially do this by having call_srcu kfree the per_mm.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
be7a57b41a RDMA/umem: Handle a half-complete start/end sequence
mmu_notifier_unregister() can race between an invalidate_start/end and
cause the invalidate_end to be skipped. This causes an imbalance in the
locking, which lockdep complains about.

This is not actually a bug, as we immediately kfree the memory holding the
lock, but it is simple enough to fix.

Mark when the notifier is being destroyed and abort the start callback.
This can be done under the lock we already obtained, and can re-purpose
the invalidate_range test we already have.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
ca748c39ea RDMA/umem: Get rid of per_mm->notifier_count
This is intrinsically racy and the scheme is simply unnecessary. New MR
registration can wait for any ongoing invalidation to fully complete.

      CPU0                              CPU1
                                  if (atomic_read())
 if (atomic_dec_and_test() &&
     !list_empty())
  { /* not taken */ }
                                       list_add()

Putting the new UMEM into some kind of purgatory until another invalidate
rolls through..

Instead hold the read side of the umem_rwsem across the pair'd start/end
and get rid of the racy 'deferred add' approach.

Since all umem's in the rbt are always ready to go, also get rid of the
mn_counters_active stuff.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
f27a0d50a4 RDMA/umem: Use umem->owning_mm inside ODP
Since ODP had a single struct mmu_notifier located in the ucontext it
could only handle a single MM at a time, and this prevented it from using
the new owning_mm system.

With the prior rework it is now simple to let ODP track multiple MMs per
ucontext, finish the job so that the per_mm is allocated on a mm by mm
basis, and freed when the last umem is dropped from the ucontext.

As a side effect the new saner locking removes the lockdep splat about
nesting the umem_rwsem between mmu_notifier_unregister and
ib_umem_odp_release.

It also makes ODP work with multiple processes, across fork, etc.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
c9990ab39b RDMA/umem: Move all the ODP related stuff out of ucontext and into per_mm
This is the first step to make ODP use the owning_mm that is now part of
struct ib_umem.

Each ODP umem is linked to a single per_mm structure, which in turn, is
linked to a single mm, via the embedded mmu_notifier. This first patch
introduces the structure and reworks everything to use it.

This also needs to introduce tgid into the ib_ucontext_per_mm, as
get_user_pages_remote() requires the originating task for statistics
tracking.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
597ecc5a09 RDMA/umem: Get rid of struct ib_umem.odp_data
This no longer has any use, we can use container_of to get to the
umem_odp, and a simple flag to indicate if this is an odp MR. Remove the
few remaining references to it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
41b4deeaa1 RDMA/umem: Make ib_umem_odp into a sub structure of ib_umem
These two structures are linked together, use the container_of pattern
instead of a double allocation to make the code simpler and easier to
follow.
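
A minimal userspace sketch of the container_of pattern referred to here.
The struct layouts are heavily simplified and the names struct umem,
struct umem_odp and to_umem_odp() are illustrative, not the kernel
definitions; container_of is redefined locally because the kernel macro is
not available outside the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local equivalent of the kernel macro, for illustration only. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified base object (hypothetical fields). */
struct umem {
	unsigned long length;
	bool is_odp;
};

/* The ODP variant embeds the base object instead of pointing to it. */
struct umem_odp {
	struct umem umem;
	unsigned int npages;
};

/* Recover the enclosing ODP object from a pointer to its embedded base. */
static struct umem_odp *to_umem_odp(struct umem *umem)
{
	return container_of(umem, struct umem_odp, umem);
}
```

With the base struct embedded, one allocation covers both objects and no
back pointer is needed; the conversion is only meaningful for a umem that
really lives inside a umem_odp, which a flag such as is_odp can indicate.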

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
b5231b019d RDMA/umem: Use ib_umem_odp in all function signatures connected to ODP
All of these functions already require the ODP version of the umem struct,
make this very clear by having the signature require it. This paves the
way to using the container_of() pattern to link umem_odp and umem
together.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Majd Dibbiny
4eeed36869 RDMA/uverbs: Fix validity check for modify QP
Uverbs shouldn't enforce QP state in the command unless the user set the QP
state bit in the attribute mask.

In addition, only copy qp attr fields which have the corresponding bit set
in the attribute mask over to the internal attr structure.

Fixes: 88de869bbe ("RDMA/uverbs: Ensure validity of current QP state value")
Fixes: bc38a6abdd ("[PATCH] IB uverbs: core implementation")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-20 16:47:30 -06:00
Jason Gunthorpe
d4b4dd1b97 RDMA/umem: Do not use current->tgid to track the mm_struct
This is just wrong, the process that calls into the reg_mr is the process
associated with the umem, and that does not have to be the same process
that created the context.

When this code was first written mmgrab() didn't exist, however these days
we can just directly hold the mm_struct pointer in the umem and have no
ambiguity when it comes to releasing the umem as to which mm it was
associated with.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
ce92db1ca8 RDMA/ucontext: Get rid of the old disassociate flow
The disassociate_ucontext function in every driver is now empty, so we
don't need this ugly and wrong code that was messing with tgids.

rdma_user_mmap_io does this same work in a better way.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
5f9794dc94 RDMA/ucontext: Add a core API for mmaping driver IO memory
To support disassociation and PCI hot unplug, we have to track all the
VMAs that refer to the device IO memory. When disassociation occurs the
VMAs have to be revised to point to the zero page, not the IO memory, to
allow the physical HW to be unplugged.

The three drivers supporting this implemented three different versions
of this algorithm, all leaving something to be desired. This new common
implementation has a few differences from the driver versions:

- Track all VMAs, including splitting/truncating/etc. Tie the lifetime of
  the private data allocation to the lifetime of the vma. This avoids any
  tricks with setting vm_ops which Linus didn't like. (see link)
- Support multiple mms, and support properly tracking mmaps triggered by
  processes other than the one first opening the uverbs fd. This makes
  fork behavior of disassociation enabled drivers the same as fork support
  in normal drivers.
- Don't use crazy get_task stuff.
- Simplify the approach to racing between vm_ops close and
  disassociation, fixing the related bugs most of the driver
  implementations had. Since we are in core code the tracking list can be
  placed in struct ib_uverbs_ufile, which has a lifetime strictly longer
  than any VMAs created by mmap on the uverbs FD.

Link: https://www.spinics.net/lists/stable/msg248747.html
Link: https://lkml.kernel.org/r/CA+55aFxJTV_g46AQPoPXen-UPiqR1HGMZictt7VpC-SMFbm3Cw@mail.gmail.com
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
0099103926 RDMA/uverbs: Fix error unwind in ib_uverbs_add_one
The error path has several mistakes:

- cdev_del should not be called if cdev_device_add fails
- We must call put_device on all the goto exit paths as that is what frees
  the uapi, SRCU and the struct itself.

While we are here consolidate all the uvdev_dev init that cannot fail at
the top.

Fixes: c5c4d92e70 ("RDMA/uverbs: Use cdev_device_add() instead of cdev_add()")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2018-09-19 10:49:22 -06:00
YueHaibing
0965cc953a RDMA/core: Properly return the error code of rdma_set_src_addr_rcu
rdma_set_src_addr_rcu should check whether copy_src_l2_addr fails, rather
than always return 0. Also copy_src_l2_addr should return 'ret' as its
return value when rdma_translate_ip fails.

Fixes: c31d4b2ddf ("RDMA/core: Protect against changing dst->dev during destination resolve")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-19 10:12:44 -06:00
Jason Gunthorpe
6ebce44746 RDMA/uverbs: Remove is_closed from ib_uverbs_file
This does nothing but indicate if the uverbs_file is in the device's list,
use list_del_init instead.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-09-19 10:07:05 -06:00
Cong Wang
5fe23f262e ucma: fix a use-after-free in ucma_resolve_ip()
There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0				CPU1
ucma_resolve_ip():		ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

        list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                mutex_lock(&mut);
                idr_remove(&ctx_idr, ctx->id);
                mutex_unlock(&mut);
		...
                mutex_lock(&mut);
                if (!ctx->closing) {
                        mutex_unlock(&mut);
                        rdma_destroy_id(ctx->cm_id);
		...
                ucma_free_ctx(ctx);

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too which tests the
refcnt and waits for the last one releasing it. The similar
pattern is already used by ucma_destroy_id().

Reported-and-tested-by: syzbot+da2591e115d57a9cbb8b@syzkaller.appspotmail.com
Reported-by: syzbot+cfe3c1e8ef634ba8964b@syzkaller.appspotmail.com
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-13 13:04:13 -04:00
Parav Pandit
0e9d2c19bf RDMA/core: Consider net ns of gid attribute for RoCE
When resolving destination address or route, when net namespace is
unavailable, refer to the net namespace of the netdevice of the SGID
attribute. This is typically the case for requests arriving from the
network for RoCE ports.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
d6b1764a8c RDMA/core: Introduce rdma_read_gid_attr_ndev_rcu() to check GID attribute
Introduce an API rdma_read_gid_attr_ndev_rcu() to return GID attribute
netdevice which is in UP state for accessing netdevice's fields such as
net namespace and ifindex.

This is useful for users who intend to access netdevice fields under rcu
lock.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
6aaecd3856 RDMA/core: Simplify roce_resolve_route_from_path()
Currently RoCE route resolve functionality is split between two
functions: roce_resolve_route_from_path() and its helper function
rdma_resolve_ip_route().

Because of this, multiple sockaddr src structures are created in both
functions, with rdma_dev_addr acting as an interface between the two for
checks.

Since there is only one user of rdma_resolve_ip_route() as RoCE, combine
the functionality of both functions to roce_resolve_route_from_path() and
further reduce the scope of rdma_dev_addr to core/addr.c

This also allows extending addr_resolve() in a subsequent patch to consider
netdev properties of the GID in a safer way under the rcu lock.

Additionally src and dst addresses were always provided, so skip the src
addr NULL pointer check as they are present on the stack now.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
c31d4b2ddf RDMA/core: Protect against changing dst->dev during destination resolve
While resolving an address, during route lookup and while performing
src address translation in case of loopback mode, hold the rcu lock so
that if the netdevice is moving to a different net namespace, or being
unregistered, it can be synchronized with net/core/dev.c, i.e.

change_net_namespace()
->dev_close_many()
  ->rt6_uncached_list_flush_dev() who would change dst->dev

to loopback device of the given net namespace.

Therefore, hold the rcu lock and sync with synchronize_net() of
change_net_namespace() to ensure that netdevice cannot get freed while
dst->dev is being used.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:16 -06:00
Parav Pandit
307edde8ef RDMA/core: Refer to network type instead of device type
Set and refer to rdma_dev_addr network type instead of dst->ndev to reduce
dependency on accessing dst netdevice.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
783793b554 RDMA/core: Use common code flow for IPv4/6 for addr resolve
Use common code flow for resolving neighbour and for finding source
addresses.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
77addc5244 RDMA/core: Rename rdma_copy_addr to rdma_copy_src_l2_addr
Now that rdma_copy_addr() only copies the source addresses and all callers
are interested in copying only source addresses, simplify it to drop the
destination address argument.

Given that it only copies source layer2 addresses, rename it to
rdma_copy_src_l2_addr for better code readability.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
a362ea1d9e RDMA/core: Introduce and use rdma_set_src_addr() between IPv4 and IPv6
rdma_translate_ip() is done while resolving address for the loopback
addresses. The current flow is convoluted with resolve neighbor being
optional.

This patch simplifies the code in following ways.

(a) Use common code between IPv4 and IPv6 for address translation,
    loopback checks and acquiring netdevice.
(b) During neigh resolve in addr_resolve_neigh(), only copy destination
    address.
(c) Always resolve the source address before the destination address,
    because it doesn't depend on resolving neigh being requested or not.

This helps to reduce 3 calls of rdma_copy_addr and rdma_translate_ip to
one and makes it easier to follow the code flow.

Now that ib_nl_fetch_ha() doesn't depend on dst, drop dst argument from
ib_nl_fetch_ha().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
89c5691cdd RDMA/core: Let protocol specific function typecast sockaddr structure
Current code typecasts destination address using extra variable but uses
source address as is.

Even though the compiler optimizes such code well, just let each protocol
specific function typecast for src and dest both and have symmetric code.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
f89b7dfa33 RDMA/core: Avoid unnecessary sa_family overwrite
addr4_resolve() and addr6_resolve() are called by checking the value of
sa_family.

Both above functions overwrite the value after typecasting, this is not
necessary.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
caf1e3ae9f RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu
This fixes two issues:
1. When address family is other than IPv4 or v6, rdma_translate_ip()
   returns success which is incorrect.
2. When address family is AF_INET6, and if the source address is not
   found, it returns success, which is also incorrect.

Therefore, introduce and use rdma_find_ndev_for_src_ip_rcu() helper
function which returns correct success or error status and is also useful
for future code refactor in addr_resolve().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Steve Wise
67e3816842 RDMA/uverbs: Atomically flush and mark closed the comp event queue
Currently a uverbs completion event queue is flushed of events in
ib_uverbs_comp_event_close() with the queue spinlock held and then
released.  Yet ev_queue->is_closed is not set until later in
uverbs_hot_unplug_completion_event_file().

In between the time ib_uverbs_comp_event_close() releases the lock and
uverbs_hot_unplug_completion_event_file() acquires the lock, a completion
event can arrive and be inserted into the event queue by
ib_uverbs_comp_handler().

This can cause a "double add" list_add warning or crash depending on the
kernel configuration, or a memory leak because the event is never dequeued
since the queue is already closed down.

So add setting ev_queue->is_closed = 1 to ib_uverbs_comp_event_close().

Cc: stable@vger.kernel.org
Fixes: 1e7710f3f6 ("IB/core: Change completion channel to use the reworked objects schema")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:43:15 -06:00
Mark Bloch
fa76d24ee0 RDMA/mlx5: Add flow actions support to raw create flow
Support attaching flow actions to a flow rule via raw create flow.
For now only the NIC RX path is supported. This change requires exporting
the flow resources management functions so we can maintain proper
bookkeeping of flow actions.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-11 09:28:07 -06:00
Mark Bloch
86e1d464a8 RDMA/uverbs: Move flow resources initialization
Use ib_set_flow() when initializing flow related resources.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-11 09:28:06 -06:00
Guy Levi
70cd20aed0 IB/uverbs: Add IDRs array attribute type to ioctl() interface
Methods sometimes need to get a flexible set of IDRs and not a strict set
as can be achieved today by the conventional IDR attribute. Add a new
IDRS_ARRAY attribute to the generic uverbs ioctl layer.

IDRS_ARRAY points to an array of idrs of the same object type and same
access rights; only write and read are supported.

Signed-off-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-11 09:28:06 -06:00
Parav Pandit
273993509f RDMA/core: Assign device ifindex before publishing the device
Even though device->ifindex is assigned before adding the device in the
list which is read by netlink flow, it is better to assign rdma device
index before publishing the device in the system to users and clients.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 14:34:38 -06:00
Parav Pandit
c715a39541 RDMA/core: Follow correct unregister order between sysfs and cgroup
During register_device(), the init sequence is:
(a) register with rdma cgroup followed by
(b) register with sysfs

Therefore, unregister_device() sequence should follow the reverse order.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 14:32:21 -06:00
Leon Romanovsky
50704e039a RDMA/umem: Restore lockdep check while downgrading lock
The lockdep engine correctly handles downgrading of locks, and it is simply
incorrect to disable lockdep checks prior to calling mmu_notifier.

Remove lockdep_off and ensure locks correctness.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 14:26:51 -06:00
Parav Pandit
e1f540c3ed RDMA/core: Define client_data_lock as rwlock instead of spinlock
Even though device registration/unregistration and client
registration/unregistration is not a performance path, define the
client_data_lock as rwlock for code clarity.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:45:38 -06:00
Parav Pandit
2d65f49ff9 RDMA/core: Use simpler spin lock irq API from blocking context
add_client_context(), ib_unregister_device() and ib_unregister_client()
are designed to be called from blocking context.  There is no need to save
and restore the last interrupt state when called from such a blocking
context.  Even though this is not a performance path, using the right spin
lock API is desired for code clarity.

To avoid checkpatch warning while removing flags, sizeof() is used.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:45:38 -06:00
Parav Pandit
4512acd0d3 RDMA/core: Remove context entries from list while unregistering device
While unregistering a device, remove the context elements from the list to
not have any stale entries. With that any errors/bugs can be checked when
device is freed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:45:38 -06:00
Parav Pandit
f7b65d9bf2 RDMA/core: Use simplified list_for_each
While traversing client_data_list in following conditions, linked list is
only read, no elements of the list are removed.  Therefore, use
list_for_each_entry(), instead of list_for_each_safe().
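
The difference between the two iterator styles can be sketched in plain
userspace C. This is only an illustration under stated assumptions: struct
ctx_node, push(), count_nodes() and remove_id() are hypothetical names, not
the kernel list API. A read-only walk may advance through the current node,
while a removing walk has to save the successor before the node is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for one entry on a client data list. */
struct ctx_node {
	int id;
	struct ctx_node *next;
};

/* Prepend a node; allocation failure simply aborts in this sketch. */
static struct ctx_node *push(struct ctx_node *head, int id)
{
	struct ctx_node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->id = id;
	n->next = head;
	return n;
}

/* Read-only walk: nothing is unlinked, so stepping via cur->next is safe. */
static int count_nodes(const struct ctx_node *head)
{
	int n = 0;

	for (const struct ctx_node *cur = head; cur; cur = cur->next)
		n++;
	return n;
}

/* Removing walk: the successor is saved before the current node is freed. */
static struct ctx_node *remove_id(struct ctx_node *head, int id)
{
	struct ctx_node **link = &head;

	while (*link) {
		struct ctx_node *cur = *link;
		struct ctx_node *next = cur->next; /* saved before free() */

		if (cur->id == id) {
			*link = next;
			free(cur);
		} else {
			link = &cur->next;
		}
	}
	return head;
}
```

When a loop body never unlinks an entry, the extra saved pointer is dead
weight, which is the case this commit addresses.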

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:45:38 -06:00
Parav Pandit
93688ddbe1 RDMA/core: No need to protect kfree with spin lock and semaphore
While unregistering a client, only context removal should be protected
with lock. There is no need to protect a freeing of such context which is
already removed from the list.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:45:38 -06:00
Parav Pandit
722c7b2bfe RDMA/{cma, core}: Avoid callback on rdma_addr_cancel()
Currently rdma_addr_cancel() is an async operation, which notifies that
cancel is done by executing the callback function given during
rdma_resolve_ip(). If the resolve_ip request is already completed, then the
callback is not executed.

Instead, rdma_resolve_addr() and rdma_addr_cancel() are now simplified in
the following ways:
1. rdma_addr_cancel() is now a synchronous method. If a request was
pending, no callback is invoked after it is cancelled.
2. rdma_resolve_addr() and the respective addr_handler() callback don't
need to hold a reference to cm_id.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:35:16 -06:00
Parav Pandit
f9d08f1e19 RDMA/core: Rate limit MAD error messages
While registering a mad agent, user space can trigger various errors
and flood the logs.

Therefore, decrease verbosity and rate limit such error messages.
While we are at it, use __func__ to print function name.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:35:16 -06:00
Parav Pandit
798bba01b4 RDMA/core: Fail early if unsupported QP is provided
When requested QP type is not supported for a {device, port}, return the
error right away before validating all parameters during mad agent
registration time.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:35:15 -06:00
Parav Pandit
954a8e3aea RDMA/cma: Protect cma dev list with lock
When AF_IB addresses are used during rdma_resolve_addr() a lock is not
held. A cma device can get removed while list traversal is in progress
which may lead to a crash, for example:

        CPU0                                     CPU1
        ====                                     ====
rdma_resolve_addr()
 cma_resolve_ib_dev()
  list_for_each()                         cma_remove_one()
    cur_dev->device                        mutex_lock(&lock)
                                            list_del();
                                           mutex_unlock(&lock);
                                           cma_process_remove();


Therefore, hold a lock while traversing the list, which avoids such a
situation.

Cc: <stable@vger.kernel.org> # 3.10
Fixes: f17df3b0de ("RDMA/cma: Add support for AF_IB to rdma_resolve_addr()")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-06 13:01:59 -06:00
Jason Gunthorpe
2c910cb75e Merge branch 'uverbs_dev_cleanups' into rdma.git for-next
For dependencies, branch based on rdma.git 'for-rc' of
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/

Pull 'uverbs_dev_cleanups' from Leon Romanovsky:

====================
Reuse the char device code interfaces to simplify ib_uverbs_device
creation and destruction. As part of this series, we are sending a fix to
the cleanup path, which was discovered during internal review.

The fix definitely can go to -rc, but it means that this series will be
dependent on rdma-rc.
====================

* branch 'uverbs_dev_cleanups':
  RDMA/uverbs: Use device.groups to initialize device attributes
  RDMA/uverbs: Use cdev_device_add() instead of cdev_add()
  RDMA/core: Depend on device_add() to add device attributes
  RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one()

Resolved conflict in ib_device_unregister_sysfs()

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:21:22 -06:00
Parav Pandit
b53b1c08a2 RDMA/uverbs: Use device.groups to initialize device attributes
Instead of explicitly adding device attribute files and handling such
error conditions, depend on the device core layer to create device
attribute files based on the NULL-terminated array of group pointers.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:19:25 -06:00
Parav Pandit
c5c4d92e70 RDMA/uverbs: Use cdev_device_add() instead of cdev_add()
Instead of doing a two-step process to add the char device and create the
underlying device, use cdev_device_add() which does both.

Currently a kobject per uverbs_device is created to keep a reference to its
holding ib_uverbs_device in addition to its underlying device 'dev'.

Instead, just use uverbs_device->dev to hold that reference.

With this change there is a single reference tracker for the
ib_uverbs_device structure.

This allows a subsequent patch to register the group attributes as well,
using the single API cdev_device_add().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:19:25 -06:00
Parav Pandit
adee9f3f3b RDMA/core: Depend on device_add() to add device attributes
Instead of adding/removing device attribute files, depend on device_add()
which considers adding these device files based on NULL terminated
attributes group array.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:19:18 -06:00
Parav Pandit
08e74be103 RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one()
If ib_uverbs_create_uapi() fails, dev_num should be freed from the bitmap.

Fixes: 7d96c9b176 ("IB/uverbs: Have the core code create the uverbs_root_spec")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:15:52 -06:00
Parav Pandit
627212c9d4 RDMA/core: Replace open-coded variant of get_device
Reuse the existing get_device() API, symmetric to the put_device() already
used in commit 924b8900a4 ("RDMA/core: Replace open-coded variant of
put_device").

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 16:04:52 -06:00
Leon Romanovsky
6ceb6331b3 RDMA/uverbs: Declare closing variable as boolean
The "closing" variable is used as a boolean and set to "true" in one
place. Update the declaration of that variable and its other
assignment to the proper type.

Fixes: e951747a08 ("IB/uverbs: Rework the locking for cleaning up the ucontext")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 15:59:06 -06:00
Jack Morgenstein
f794809a72 IB/core: Add an unbound WQ type to the new CQ API
The upstream kernel commit cited below modified the workqueue in the
new CQ API to be bound to a specific CPU (instead of being unbound).
This caused ALL users of the new CQ API to use the same bound WQ.

Specifically, MAD handling was severely delayed when the CPU bound
to the WQ was busy handling (higher priority) interrupts.

This caused a delay in the MAD "heartbeat" response handling,
which resulted in ports being incorrectly classified as "down".

To fix this, add a new "unbound" WQ type to the new CQ API, so that users
have the option to choose either a bound WQ or an unbound WQ.

For MADs, choose the new "unbound" WQ.

Fixes: b7363e67b2 ("IB/device: Convert ib-comp-wq to be CPU-bound")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.m>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 15:38:09 -06:00
Mark Bloch
841eefc5cb RDMA/uverbs: Add generic function to fill in flow action object
Refactor the initialization of a flow action object to a common function.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-05 15:23:59 -06:00
Mark Bloch
0953fffec9 RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language
This makes it clear and safe to access constants passed in from user
space. We define a consistent ABI of u64 for all constants, and verify
that the data passed in can be represented by the type the user supplies.

The expectation is this will always be used with an enum declaring the
constant values, and the user will use the enum type as input to the
accessor.

To retrieve the attribute value we introduce two helper calls - one
standard which may fail if attribute is not valid and one where caller can
provide a default value which will be used in case the attribute is not
valid (useful when attribute is optional).
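
The two accessor styles described above can be sketched in userspace C as
follows. Everything here is an assumption for illustration: struct
const_attr, enum flow_kind, get_const() and get_const_default() are
hypothetical names, and the error codes are only one plausible choice. The
point is the range check of the 64-bit wire value against the caller's
type, plus a variant that substitutes a default for a missing attribute.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute slot: the ABI always carries a 64-bit value. */
struct const_attr {
	bool valid;     /* was the attribute supplied at all? */
	uint64_t value; /* raw 64-bit value from user space */
};

/* Hypothetical enum that a handler would use for its constants. */
enum flow_kind { FLOW_KIND_A = 0, FLOW_KIND_B = 1, FLOW_KIND_MAX = FLOW_KIND_B };

/* Strict accessor: fails if the attribute is missing or out of range. */
static int get_const(const struct const_attr *attr, uint64_t max, int *out)
{
	assert(max <= INT_MAX); /* the result must fit the output type */
	if (!attr->valid)
		return -ENOENT;
	if (attr->value > max)
		return -EINVAL;
	*out = (int)attr->value;
	return 0;
}

/* Defaulting accessor: a missing attribute yields def_val, not an error. */
static int get_const_default(const struct const_attr *attr, uint64_t max,
			     int def_val, int *out)
{
	if (!attr->valid) {
		*out = def_val;
		return 0;
	}
	return get_const(attr, max, out);
}
```

In this sketch an out-of-range value is still rejected by the defaulting
variant; only absence is forgiven.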

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-09-05 15:14:58 -06:00
Artemy Kovalyov
e4ff3d22c1 IB/core: Release object lock if destroy failed
The object lock was supposed to always be released during destroy, but
when the destruction retry series was integrated with the destroy series
it created a failure path that missed the unlock.

Keep with convention, if destroy fails the caller must undo all locking.

Fixes: 87ad80abc7 ("IB/uverbs: Consolidate uobject destruction")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-04 15:07:55 -06:00
Jann Horn
0d23ba6034 RDMA/ucma: check fd type in ucma_migrate_id()
The current code grabs the private_data of whatever file descriptor
userspace has supplied and implicitly casts it to a `struct ucma_file *`,
potentially causing a type confusion.

This is probably fine in practice because the pointer is only used for
comparisons, it is never actually dereferenced; and even in the
comparisons, it is unlikely that a file from another filesystem would have
a ->private_data pointer that happens to also be valid in this context.
But ->private_data is not always guaranteed to be a valid pointer to an
object owned by the file's filesystem; for example, some filesystems just
cram numbers in there.

Check the type of the supplied file descriptor to be safe, analogous to how
other places in the kernel do it.
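
A minimal userspace sketch of this kind of check, assuming simplified
stand-ins for the kernel structures (struct file_ops, struct file,
struct ucma_like_ctx and ctx_from_file() are hypothetical and much smaller
than the real ones): the private pointer is only trusted when the file's
operations table is the one this subsystem installed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a file operations table. */
struct file_ops {
	int tag;
};

/* Hypothetical, heavily reduced stand-in for an open file. */
struct file {
	const struct file_ops *f_op;
	void *private_data;
};

struct ucma_like_ctx {
	int id;
};

static const struct file_ops ucma_like_fops = { 1 };
static const struct file_ops other_fops = { 2 };

/*
 * Return the context only if the file was opened through our own ops
 * table; any other file's private_data has an unknown type.
 */
static struct ucma_like_ctx *ctx_from_file(const struct file *f)
{
	assert(f != NULL && f->f_op != NULL);
	if (f->f_op != &ucma_like_fops)
		return NULL;
	return f->private_data;
}
```

Comparing the operations table identifies who owns private_data without
ever dereferencing the untrusted pointer.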

Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-04 15:07:55 -06:00
Michal Hocko
93065ac753 mm, oom: distinguish blockable mode for mmu notifiers
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep.  That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better though.  Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.  Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range.  Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.

This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false.  This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.

I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that.  The first
part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode.  A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom.  This can be done e.g.  after the test faults in all
the mmu notifier managed memory and sets the hard limit to something really
small.  Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Jason Gunthorpe
0a3173a5f0 Merge branch 'linus/master' into rdma.git for-next
rdma.git merge resolution for the 4.19 merge window

Conflicts:
 drivers/infiniband/core/rdma_core.c
   - Use the rdma code and revise with the new spelling for
     atomic_fetch_add_unless
 drivers/nvme/host/rdma.c
   - Replace max_sge with max_send_sge in new blk code
 drivers/nvme/target/rdma.c
   - Use the blk code and revise to use NULL for ib_post_recv when
     appropriate
   - Replace max_sge with max_recv_sge in new blk code
 net/rds/ib_send.c
   - Use the net code and revise to use NULL for ib_post_recv when
     appropriate

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 14:21:29 -06:00
Jason Gunthorpe
89982f7cce Linux 4.18
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
 mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
 qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
 zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
 vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
 wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
 L47IdXo=
 =sJkt
 -----END PGP SIGNATURE-----

Merge tag 'v4.18' into rdma.git for-next

Resolve merge conflicts from the -rc cycle against the rdma.git tree:

Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified
    in for-rc

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 13:12:00 -06:00
Parav Pandit
dd81b2c8a3 IB/core: Change filter function return type from int to bool
Filter functions return either 0 or 1, so change their return type from
int to bool to reflect that.  Additionally, some filter functions have a
_filter suffix and some don't.  Make all filter functions consistently
use the __filter suffix to improve code readability.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-15 13:33:20 -06:00
Parav Pandit
d12e2eed27 IB/core: Update GID entries for netdevice whose mac address changes
Update all GID table entries of the netdevice whose MAC address changed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-15 13:33:20 -06:00
Parav Pandit
464b79b45a IB/core: Add default GIDs of the bond master netdev
Currently the following issues exist:
1. Default GIDs of the lower (slave) netdevice are added when the bond
   netdevice is added. Instead, the default GIDs should be those of the
   bond master netdevice.
2. Due to this, when a failover event occurs, the FAILOVER event handler
   attempts to delete the GID of the upper device and tries to add the
   default GID of the lower device. This is incorrect behavior.

To have simple and correct code:
(a) Split default GIDs addition out of add_netdev_ips().  This allows
    easier removal in future if RoCE default GIDs are removed.
(b) Add default GIDs of the bond master device by using right filter and
    callback function.
(c) Remove unused function enum_netdev_default_gids().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-15 13:33:20 -06:00
Parav Pandit
a03d4d2775 IB/core: Consider adding default GIDs of bond device
Now that we correctly delete the default GIDs of lower devices during
CHANGEUPPER event, add default GIDs of the bonding master device.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-15 13:33:19 -06:00
Parav Pandit
408f1242d9 IB/core: Delete lower netdevice default GID entries in bonding scenario
When a NETDEV_CHANGEUPPER event occurs, the lower device is not yet
established as a slave of the master, and when the upper device is a bond
device, the default GID entries are not deleted.

Due to this, when bond device is fully configured, default GID entries of
bond device cannot be added as default GID entries are occupied by the
lower netdevice. This is incorrect.

Default GID entries should really be of bond netdevice because in all RoCE
GIDs (default or IP), MAC address of the bond device will be used.  It is
confusing to have default GID of netdevice which is not really used for
any purpose.

Therefore, as first step, implement
(a) filter function which filters if a CHANGEUPPER event netdevice and
    associated upper device is master device or not.
(b) callback function which deletes the default GIDs of lower (event
    netdevice).

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-15 13:33:19 -06:00
Parav Pandit
b9f09866e0 IB/core: Avoid confusing del_netdev_default_ips
Currently bond_delete_netdev_default_gids() is called by two callers.
(a) del_netdev_default_ips_join()
(b) del_netdev_default_ips()

Both of the above functions change the argument order while calling
bond_delete_netdev_default_gids().  This required a silly
del_netdev_default_ips() wrapper.

Additionally, del_netdev_default_ips() deletes default GIDs not IP based
GIDs.  del_netdev_default_ips() having _ips suffix is confusing.

Therefore, get rid of confusing del_netdev_default_ips() and simplify
bond_delete_netdev_default_gids() to follow same argument order as its
caller.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-14 16:43:55 -06:00
Parav Pandit
666e7099a4 IB/core: Add comment for change upper netevent handling
Add a comment for CHANGEUPPER netevent handling.
To improve code readability,
(a) move cmd definitions to its respective if-else branches,
(b) avoid single line structure definitions.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-14 16:43:55 -06:00
Jason Gunthorpe
486edfb103 IB/ucm: Fix compiling ucm.c
Even though this interface is marked CONFIG_BROKEN we still expect it to
compile, at least until we delete it completely.

Also mark INFINIBAND_USER_ACCESS_UCM with COMPILE_TEST so these situations
can be detected.

Fixes: e7ff98aefc ("RDMA/cma: Constify path record, ib_cm_event, listen_id pointers")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-13 20:04:37 -06:00
Linus Torvalds
de5d1b39ea Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking/atomics update from Thomas Gleixner:
 "The locking, atomics and memory model brains delivered:

   - A larger update to the atomics code which reworks the ordering
     barriers, consolidates the atomic primitives, provides the new
     atomic64_fetch_add_unless() primitive and cleans up the include
     hell.

   - Simplify cmpxchg() instrumentation and add instrumentation for
     xchg() and cmpxchg_double().

   - Updates to the memory model and documentation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  locking/atomics: Rework ordering barriers
  locking/atomics: Instrument cmpxchg_double*()
  locking/atomics: Instrument xchg()
  locking/atomics: Simplify cmpxchg() instrumentation
  locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
  tools/memory-model: Rename litmus tests to comply to norm7
  tools/memory-model/Documentation: Fix typo, smb->smp
  sched/Documentation: Update wake_up() & co. memory-barrier guarantees
  locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
  sched/core: Use smp_mb() in wake_woken_function()
  tools/memory-model: Add informal LKMM documentation to MAINTAINERS
  locking/atomics/Documentation: Describe atomic_set() as a write operation
  tools/memory-model: Make scripts executable
  tools/memory-model: Remove ACCESS_ONCE() from model
  tools/memory-model: Remove ACCESS_ONCE() from recipes
  locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
  MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
  tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
  tools/memory-model: Add litmus test for full multicopy atomicity
  locking/refcount: Always allow checked forms
  ...
2018-08-13 12:23:39 -07:00
Jason Gunthorpe
4ce719f846 IB/uverbs: Do not check for device disassociation during ioctl
Now that the ioctl path and uobjects are converted to use uverbs_api, it
is now safe to remove the disassociation protection from the common ioctl
code.

This completes the work to make destroy functions continue to work even
after device disassociation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-13 09:17:19 -06:00
Jason Gunthorpe
51d0a2b4cf IB/uverbs: Remove struct uverbs_root_spec and all supporting code
Everything now uses the uverbs_uapi data structure.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-13 09:17:19 -06:00
Jason Gunthorpe
3a863577a7 IB/uverbs: Use uverbs_api to unmarshal ioctl commands
Convert the ioctl method syscall path to use the uverbs_api data
structures. The new uapi structure includes all the same information, just
in a different and more optimal way.

 - Use attr_bkey instead of 2 level radix trees for everything related to
   attributes. This includes the attribute storage, presence, and
   detection of missing mandatory attributes.
 - Avoid iterating over all attribute storage at finish, instead use
   find_first_bit with the attr_bkey to locate only those attrs that need
   cleanup.
 - Organize things to always run, and always rely on, cleanup. This
   avoids a bunch of tricky error unwind cases.
 - Locate the method using the radix tree, and locate the attributes
   using a very efficient incremental radix tree lookup
 - Use the precomputed destroy_bkey to handle uobject destruction
 - Use the precomputed allocation sizes and precomputed 'need_stack'
   to avoid maths in the fast path. This is optimal if userspace
   does not pass (many) unsupported attributes.

Overall this results in much better codegen for the attribute accessors,
everything is now stored in bitmaps or linear arrays indexed by attr_bkey.
The compiler can compute attr_bkey values at compile time for all method
attributes, meaning things like uverbs_attr_is_valid() now compile into
single instruction bit tests.
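
Why a small per-attribute index makes these checks cheap can be sketched in
userspace C. The names below (struct attr_bundle_sketch,
attr_mark_present(), attr_is_valid(), first_bkey()) and the 64-entry cap
are assumptions for illustration, not the kernel structures: presence is
one bit per attribute, and cleanup can jump straight to the set bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ATTR_BKEY_MAX 64 /* hypothetical cap so one word holds the bitmap */

struct attr_bundle_sketch {
	uint64_t present; /* bit i set: the attribute with bkey i was supplied */
};

static void attr_mark_present(struct attr_bundle_sketch *b, unsigned int bkey)
{
	assert(bkey < ATTR_BKEY_MAX);
	b->present |= UINT64_C(1) << bkey;
}

/* With a constant bkey this reduces to a single bit test. */
static bool attr_is_valid(const struct attr_bundle_sketch *b, unsigned int bkey)
{
	assert(bkey < ATTR_BKEY_MAX);
	return (b->present >> bkey) & 1;
}

/* Return the lowest set bkey in mask, or ATTR_BKEY_MAX if none is set. */
static unsigned int first_bkey(uint64_t mask)
{
	for (unsigned int i = 0; i < ATTR_BKEY_MAX; i++)
		if ((mask >> i) & 1)
			return i;
	return ATTR_BKEY_MAX;
}
```

A cleanup pass in this sketch would call first_bkey() repeatedly, clearing
each bit after handling it, instead of visiting every attribute slot.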

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-13 09:17:16 -06:00
Jason Gunthorpe
b61815e241 IB/uverbs: Use uverbs_alloc for allocations
Several handlers need temporary allocations for the life of the method,
switch them to use the uverbs_alloc allocator.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-08-13 09:16:13 -06:00
Jason Gunthorpe
461bb2eee4 IB/uverbs: Add a simple allocator to uverbs_attr_bundle
This is similar in spirit to devm, it keeps track of any allocations
linked to this method call and ensures they are all freed when the method
exits. Further, if there is space in the internal/onstack buffer then the
allocator will hand out that memory and avoid an expensive call to
kmalloc/kfree in the syscall path.
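
The idea can be sketched in userspace C as below. This is not the kernel
implementation: struct bundle, bundle_alloc(), bundle_release() and the
64-byte inline buffer are hypothetical, and malloc()/free() stand in for
the kernel allocator. Small requests are carved out of the inline buffer,
larger ones fall back to the heap and are linked so a single release call
frees them all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUNDLE_INLINE_BYTES 64 /* hypothetical size of the inline buffer */

/* Header placed in front of every heap allocation. */
union heap_chunk {
	union heap_chunk *next;
	max_align_t align; /* keeps the payload after the header aligned */
};

struct bundle {
	_Alignas(max_align_t) unsigned char inline_buf[BUNDLE_INLINE_BYTES];
	size_t inline_used;
	union heap_chunk *heap_list; /* every heap allocation is linked here */
	int heap_count;
};

static void *bundle_alloc(struct bundle *b, size_t size)
{
	const size_t align = _Alignof(max_align_t);
	union heap_chunk *chunk;
	size_t need;

	assert(b != NULL);
	if (size > SIZE_MAX / 2)
		return NULL;
	/* Round up so the next inline allocation stays aligned. */
	need = (size + align - 1) / align * align;

	if (need <= BUNDLE_INLINE_BYTES - b->inline_used) {
		void *p = b->inline_buf + b->inline_used;

		b->inline_used += need;
		return p;
	}

	/* No room inline: fall back to the heap and remember the chunk. */
	chunk = malloc(sizeof(*chunk) + size);
	if (!chunk)
		return NULL;
	chunk->next = b->heap_list;
	b->heap_list = chunk;
	b->heap_count++;
	return chunk + 1;
}

/* Free everything handed out since the bundle was set up. */
static void bundle_release(struct bundle *b)
{
	while (b->heap_list) {
		union heap_chunk *next = b->heap_list->next;

		free(b->heap_list);
		b->heap_list = next;
	}
	b->heap_count = 0;
	b->inline_used = 0;
}
```

Callers in this sketch never free individual allocations; they rely on
bundle_release() running once at the end of the method call.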

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-08-13 09:16:08 -06:00
Jason Gunthorpe
6a1f444fef IB/uverbs: Remove the ib_uverbs_attr pointer from each attr
Memory in the bundle is valuable, do not waste it holding an 8 byte
pointer for the rare case of writing to a PTR_OUT. We can compute the
pointer by storing a small 1 byte array offset and the base address of the
uattr memory in the bundle private memory.

This also means we can access the kernel's copy of the ib_uverbs_attr, so
drop the copy of flags as well.

Since the uattr base should be private bundle information this also
de-inlines the already too big uverbs_copy_to inline and moves
create_udata into uverbs_ioctl.c so they can see the private struct
definition.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-08-10 16:06:24 -06:00
Jason Gunthorpe
4b3dd2bbf0 IB/uverbs: Provide implementation private memory for the uverbs_attr_bundle
This already existed as the anonymous 'ctx' structure, but this was not
really a useful form. Hoist this struct into bundle_priv and rework the
internal things to use it instead.

Move a bunch of the processing internal state into the priv and reduce the
excessive use of function arguments.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-08-10 16:06:24 -06:00
Jason Gunthorpe
6b0d08f4a2 IB/uverbs: Use uverbs_api to manage the object type inside the uobject
Currently the struct uverbs_obj_type stored in the ib_uobject is part of
the .rodata segment of the module that defines the object. This is a
problem if drivers define new uapi objects as we will be left with a
dangling pointer after device disassociation.

Switch the uverbs_obj_type for struct uverbs_api_object, which is
allocated memory that is part of the uverbs_api and is guaranteed to
always exist. Further this moves the 'type_class' into this memory which
means access to the IDR/FD function pointers is also guaranteed. Drivers
cannot define new types.

This makes it safe to continue to use all uobjects, including driver
defined ones, after disassociation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-10 16:06:24 -06:00
Jason Gunthorpe
9ed3e5f447 IB/uverbs: Build the specs into a radix tree at runtime
This radix tree data structure is intended to replace the 'hash' structure
used today for parsing ioctl methods during system calls. This first
commit introduces the structure and builds it from the existing .rodata
descriptions.

The so-called hash arrangement is actually a 5 level open coded radix tree.
This new version uses a 3 level radix tree built using the radix tree
library.

Overall this is much less code and much easier to build as the radix tree
API allows for dynamic modification during the building. There is a small
memory penalty to pay for this, but since the radix tree is allocated on
a per device basis, a few kb of RAM seems immaterial considering the
gained simplicity.

The radix tree is similar to the existing tree, but also has an 'attr_bkey'
concept, which is a small-valued index for each method attribute. This is
used to simplify and improve performance of everything in the next
patches.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
2018-08-10 16:06:24 -06:00
Jason Gunthorpe
7d96c9b176 IB/uverbs: Have the core code create the uverbs_root_spec
There is no reason for drivers to do this, the core code should take care of
everything. The drivers will provide their information from rodata to
describe their modifications to the core's base uapi specification.

The core uses this to build up the runtime uapi for each device.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-08-10 16:06:24 -06:00
Jason Gunthorpe
922983c2a1 IB/uverbs: Fix reading of 32 bit flags
This is missing a zeroing of the high bits of flags, and is also not
correct for big endian machines. Properly zero extend the 32 bit flags
into the 64 bit stack variable.
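
As a rough illustration of the zero extension described above (a sketch
only, not the kernel code; read_flags() and its signature are hypothetical),
the 32 bit value is loaded into a 32 bit variable first and only then
widened, so the result does not depend on byte order:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel accessor: read a flags attribute
 * that userspace supplied as either 4 or 8 bytes.
 * Returns 0 on success, -1 if the length is unsupported.
 */
static int read_flags(const void *data, size_t len, uint64_t *out)
{
	assert(data != NULL && out != NULL);

	if (len == sizeof(uint64_t)) {
		memcpy(out, data, sizeof(uint64_t));
		return 0;
	}
	if (len == sizeof(uint32_t)) {
		uint32_t v32;

		/*
		 * Copying 4 bytes straight into *out would leave the other
		 * 4 bytes stale, and on big endian it would fill the high
		 * half. Load as 32 bit, then widen.
		 */
		memcpy(&v32, data, sizeof(v32));
		*out = v32;	/* zero extends on any endianness */
		return 0;
	}
	return -1;
}
```
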

Reported-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Fixes: bccd06223f ("IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
2018-08-09 15:46:07 -06:00
Parav Pandit
58796e67d5 IB/ucm: Initialize sgid request GID attribute pointer
sgid_attr is uninitialized on the stack, initialize it to NULL.

Fixes: 398391071f ("IB/cm: Replace members of sa_path_rec with 'struct sgid_attr *'")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-07 12:56:10 -06:00
Jason Gunthorpe
0f50d88a6e IB/uverbs: Allow all DESTROY commands to succeed after disassociate
The disassociate function was broken by design because it failed all
commands. This prevents userspace from calling destroy on a uobject after
it has detected a device fatal error and thus reclaiming the resources in
userspace is prevented.

This fix is now straightforward, when anything destroys a uobject that is
not the user the object remains on the IDR with a NULL context and object
pointer. All lookup locking modes other than DESTROY will fail. When the
user ultimately calls the destroy function it is simply dropped from the
IDR while any related information is returned.
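
The lookup rule above can be modelled roughly as follows. This is a
simplified sketch under stated assumptions: the demo_* names, the struct
layout and the -EIO error code are all hypothetical and are not taken from
the kernel sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lookup modes, mirroring the idea of a dedicated DESTROY mode. */
enum lookup_mode { LOOKUP_READ, LOOKUP_WRITE, LOOKUP_DESTROY };

/* Hypothetical stand-in for a uobject that stays on the IDR. */
struct demo_uobject {
	void *context;	/* NULL once something other than the user destroyed it */
	void *object;	/* NULL once the HW object is gone */
	int id;		/* residual information that can still be returned */
};

/* Something other than the user destroys the object; the stub remains. */
static void demo_disassociate(struct demo_uobject *uobj)
{
	assert(uobj != NULL);
	uobj->context = NULL;
	uobj->object = NULL;
}

/* Every lookup mode except DESTROY fails on a stub object. */
static int demo_lookup(const struct demo_uobject *uobj, enum lookup_mode mode)
{
	assert(uobj != NULL);
	if (uobj->context == NULL && mode != LOOKUP_DESTROY)
		return -EIO;	/* assumed error code, for illustration only */
	return 0;
}
```
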

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
a9b66d6453 IB/uverbs: Do not block disassociate during write()
Now that all the callbacks are safe to run concurrently with
disassociation this test can be eliminated. The ufile core infrastructure
becomes entirely self contained and is not sensitive to disassociation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
e83f0ecdc4 IB/uverbs: Do not pass struct ib_device to the ioctl methods
This does the same as the patch before, except for ioctl. The rules are
the same, but for the ioctl methods the core code handles setting up the
uobject.

- Retrieve the ib_dev from the uobject->context->device. This is
  safe under ioctl as the core has already done rdma_alloc_begin_uobject
  and so CREATE calls are entirely protected by the rwsem.
- Retrieve the ib_dev from uobject->object
- Call ib_uverbs_get_ucontext()

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
bbd51e881f IB/uverbs: Do not pass struct ib_device to the write based methods
This is a step to get rid of the global check for disassociation. In this
model, the ib_dev is not proven to be valid by the core code and cannot be
provided to the method. Instead, every method decides if it is able to
run after disassociation and obtains the ib_dev using one of three
different approaches:

- Call srcu_dereference on the udevice's ib_dev. As before, this means
  the method cannot be called after disassociation begins.
  (eg alloc ucontext)
- Retrieve the ib_dev from the ucontext, via ib_uverbs_get_ucontext()
- Retrieve the ib_dev from the uobject->object after checking
  under SRCU if disassociation has started (eg uobj_get)

Largely, the code is all ready for this, the main work is to provide a
ib_dev after calling uobj_alloc(). The few other places simply use
ib_uverbs_get_ucontext() to get the ib_dev.

This flexibility will let the next patches allow destroy to operate
after disassociation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
cc2e14e680 IB/uverbs: Lower the test for ongoing disassociation
Commands that are reading/writing to objects can test for an ongoing
disassociation during their initial call to rdma_lookup_get_uobject.  This
directly prevents all of these commands from conflicting with an ongoing
disassociation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
1e857e65d4 IB/uverbs: Allow uobject allocation to work concurrently with disassociate
After all the recent structural changes this is now straightforward, hold
the hw_destroy_rwsem across the entire uobject creation. We already take
this semaphore on the success path, so holding it a bit longer is not
going to change the performance.

After this change none of the create callbacks require the
disassociate_srcu lock to be correct.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
7452a3c745 IB/uverbs: Allow RDMA_REMOVE_DESTROY to work concurrently with disassociate
After all the recent structural changes this is now straightforward, hoist
the hw_destroy_rwsem up out of rdma_destroy_explicit and wrap it around
the uobject write lock as well as the destroy.

This is necessary as obtaining a write lock concurrently with
uverbs_destroy_ufile_hw() will cause malfunction.

After this change none of the destroy callbacks require the
disassociate_srcu lock to be correct.

This requires introducing a new lookup mode, UVERBS_LOOKUP_DESTROY as the
IOCTL interface needs to hold an unlocked kref until all command
verification is completed.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
9867f5c669 IB/uverbs: Convert 'bool exclusive' into an enum
This is more readable, and future patches will need a 3rd lookup type.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
87ad80abc7 IB/uverbs: Consolidate uobject destruction
There are several flows that can destroy a uobject and each one is
minimized and sprinkled throughout the code base, making it difficult to
understand and very hard to modify the destroy path.

Consolidate all of these into uverbs_destroy_uobject() and call it in all
cases where a uobject has to be destroyed.

This makes one change to the lifecycle, during any abort (eg when
alloc_commit is not called) we always call out to alloc_abort, even if
remove_commit needs to be called to delete a HW object.

This also renames RDMA_REMOVE_DURING_CLEANUP to RDMA_REMOVE_ABORT to
clarify its actual usage and revises some of the comments to reflect what
the life cycle is for the type implementation.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
32ed5c00ac IB/uverbs: Make the write path destroy methods use the same flow as ioctl
The ridiculous dance with uobj_remove_commit() is not needed, the write
path can follow the same flow as ioctl - lock and destroy the HW object
then use the data left over in the uobject to form the response to
userspace.

Two helpers are introduced to make this flow straightforward for the
caller.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:48 -06:00
Jason Gunthorpe
aa72c9a5f9 IB/uverbs: Remove rdma_explicit_destroy() from the ioctl methods
The core code will destroy the HW object on behalf of the method, if the
method provides an implementation it must simply copy data from the stub
uobj into the response. Destroy methods cannot touch the HW object.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-01 14:55:37 -06:00
Parav Pandit
8546331651 RDMA/core: Prefix _ib to IB/RoCE specific functions
In rdma cm module, functions which are common between IB and iWarp
are named with cma_.
iWarp specific functions are prefixed with cma_iw.
IB specific functions are prefixed with cma_ib.

However, some functions in the request processing path didn't follow the
cma_ib convention. Prefix them with _ib for better code clarity.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
79d684f026 RDMA/core: Simplify gid type check in cma_acquire_dev()
cma_add_one() initializes the default GID regardless of device type.
listen_id is bound to a device and an IP address, its GID type is
initialized by cma_acquire_dev().

Therefore a valid default GID type is always available, it is not needed
to check port type during cma_acquire_dev().

Initialize gid type of a cm id when the cm_id is created instead of
doing conditional checks during cma_acquire_dev() and trying to
initialize to 0 during _cma_attach_to_dev().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
7582df8267 RDMA/core: Avoid holding lock while initializing fields on stack
In various functions rdma_cm_event is zero initialized on stack using
memset() while holding lock which is not necessary.
Therefore, don't hold the lock while initializing on stack.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
ca3a8ace2b RDMA/core: Return bool instead of int
Return bool for following internal and inline functions as their
underlying APIs return bool too.

1. cma_zero_addr()
2. cma_loopback_addr()
3. cma_any_addr()
4. ib_addr_any()
5. ib_addr_loopback()

While we are touching cma_loopback_addr(), remove extra white spaces
in it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
05e0b86c41 RDMA/cma: Get rid of 1 bit boolean
Arrange fields of cma_req_info structure for efficiency on
stack and get rid of one bit boolean field.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
e7ff98aefc RDMA/cma: Constify path record, ib_cm_event, listen_id pointers
Constify several pointers such as path_rec, ib_cm_event and listen_id
pointers in several functions.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
2df7dba855 RDMA/core: Constify dst_addr argument
Following APIs are not supposed to modify addr or dest_addr contents.
Therefore make those function argument const for better code
readability.

1. rdma_resolve_ip()
2. rdma_addr_size()
3. rdma_resolve_addr()

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
219d2e9dfd RDMA/cma: Simplify rdma_resolve_addr() error flow
Currently the dst address is set first and later cleared if any of the
3 error conditions is met.
However none of the APIs or checks are supposed to refer to the
destination address of the cm_id.
Therefore, set the destination address after necessary checks pass which
simplifies the error flow.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Parav Pandit
e11fef9f8d RDMA/cma: Initialize resource type in __rdma_create_id()
Currently rdma_cm_id's resource tracking fields such as owner task and
kern_name and other non resource tracking fields are initialized in
a single function, __rdma_create_id().

Therefore, initialize rdma_cm_id's resource type also in same init
function.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:49:04 -06:00
Kamal Heib
0584c47bbc RDMA/core: Check for verbs callbacks before using them
Make sure the providers implement the verbs callbacks before calling
them, otherwise return -EOPNOTSUPP.
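
A minimal sketch of this kind of check, assuming a provider ops table of
function pointers. The struct, the wrapper and the sample callback below are
hypothetical names used only for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider ops table; a provider may leave entries NULL. */
struct demo_provider_ops {
	int (*create_ah)(int port);
};

/* Sample provider callback used only to exercise the wrapper. */
static int demo_create_ah(int port)
{
	return port;
}

/* Core-side wrapper: only call the verb if the provider implements it. */
static int demo_core_create_ah(const struct demo_provider_ops *ops, int port)
{
	assert(ops != NULL);
	if (!ops->create_ah)
		return -EOPNOTSUPP;
	return ops->create_ah(port);
}
```
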

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:31:09 -06:00
Kamal Heib
7150c3d554 RDMA/core: Remove {create,destroy}_ah from mandatory verbs
{create,destroy}_ah aren't mandatory verbs, because not all providers
are implementing them.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:31:09 -06:00
Jason Gunthorpe
bccd06223f IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language
This clearly indicates that the input is a bitwise combination of values
in an enum, and identifies which enum contains the definition of the bits.

Special accessors are provided that handle the mandatory validation of the
allowed bits and enforce the correct type for bitwise flags.

If we had introduced this at the start then the kabi would have uniformly
used u64 data to pass flags, however today there is a mixture of u64 and
u32 flags. All places are converted to accept both sizes and the accessor
fixes it. This allows all existing flags to grow to u64 in future without
any hassle.

Finally all flags are, by definition, optional. If flags are not passed
the accessor does not fail, but provides a value of zero.
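
The accessor behaviour described above might look roughly like this. It is
a sketch, not the kernel accessor: struct demo_flags_attr and
demo_get_flags64() are hypothetical, and returning -EINVAL for disallowed
bits is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed attribute; value is already widened to 64 bits. */
struct demo_flags_attr {
	bool present;
	uint64_t value;
};

/*
 * Flags are optional: a missing attribute yields zero instead of an error.
 * Any bit outside allowed_bits is rejected.
 */
static int demo_get_flags64(const struct demo_flags_attr *attr,
			    uint64_t allowed_bits, uint64_t *to)
{
	assert(to != NULL);

	if (attr == NULL || !attr->present) {
		*to = 0;
		return 0;
	}
	if (attr->value & ~allowed_bits)
		return -EINVAL;	/* assumed error code */

	*to = attr->value;
	return 0;
}
```
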

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-30 20:23:29 -06:00
Bart Van Assche
d34ac5cd3a RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.

To make this possible, only one cast had to be introduced that casts away
constness, namely in rpcrdma_post_recvs(). The only way I can think of to
avoid that cast is to introduce an additional loop in that function or to
change the data type of bad_wr from struct ib_recv_wr ** into int
(an index that refers to an element in the work request list). However,
both approaches would require even more extensive changes than this
patch.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:09:34 -06:00
Parav Pandit
643d213a9a RDMA/cma: Do not ignore net namespace for unbound cm_id
Currently if the cm_id is not bound to any netdevice, then for such a cm_id
the net namespace is ignored, which is incorrect.

Regardless of cm_id bound to a netdevice or not, net namespace must
match. When a cm_id is bound to a netdevice, in such case net namespace
and netdevice both must match.

Fixes: 4c21b5bcef ("IB/cma: Add net_dev and private data checks to RDMA CM")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-26 09:47:47 -06:00
Parav Pandit
d274e45ce1 RDMA/cma: Consider netdevice for RoCE ports
When a netdevice is not found for a request, and if it is for a RoCE port,
currently it allows matching the listener as long as port number matches
by ignoring the netdevice.

Now that we always prefer to have netdevice associated with RoCE, when
netdevice is not found, don't consider RoCE ports.

In other words, a NULL netdevice with RoCE is not acceptable. Therefore,
remove this confusing RoCE port ignorance check.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-26 09:47:47 -06:00
Parav Pandit
cee104334c IB/core: Introduce and use sgid_attr in CM requests
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.

Now that we have the GID attribute available, introduce SGID attribute in
incoming CM requests and refer to the netdevice of it.  This is similar to
existing SGID attribute field in outgoing CM requests for RC and UD
transports.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-26 09:47:47 -06:00
Varsha Rao
076dd53be5 IB/core: Remove extra parentheses
Remove unnecessary parentheses to fix the clang warning of extraneous
parentheses.

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:36:29 -06:00
Jason Gunthorpe
22fa27fbc6 IB/uverbs: Fix locking around struct ib_uverbs_file ucontext
We have a parallel unlocked reader and writer with ib_uverbs_get_context()
vs everything else, and nothing guarantees this works properly.

Audit and fix all of the places that access ucontext to use one of the
following locking schemes:
- Call ib_uverbs_get_ucontext() under SRCU and check for failure
- Access the ucontext through an struct ib_uobject context member
  while holding a READ or WRITE lock on the uobject.
  This value cannot be NULL and has no race.
- Hold the ucontext_lock and check for ufile->ucontext !NULL

This also re-implements ib_uverbs_get_ucontext() in a way that is safe
against concurrent ib_uverbs_get_context() and disassociation.

As a side effect, every access to ucontext in the commands is via
ib_uverbs_get_ucontext() with an error check, or via the uobject, so there
is no longer any need for the core code to check ucontext on every command
call. These checks are also removed.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:46 -06:00
Jason Gunthorpe
aba94548c9 IB/uverbs: Move the FD uobj type struct file allocation to alloc_commit
Allocating the struct file during alloc_begin creates this strange
asymmetry with IDR, where the FD has two krefs pointing at it during the
pre-commit phase. In particular this makes the abort process for FD very
strange and confusing.

For instance abort currently calls the type's destroy_object twice, and
the fops release once if abort is done. This is very counter intuitive. No
fops should be called until alloc_commit succeeds, and destroy_object
should only ever be called once.

Moving the struct file allocation to the alloc_commit is now simple, as we
already support failure of rdma_alloc_commit_uobject, with all the
required rollback pieces.

This creates an understandable symmetry with IDR and simplifies/fixes the
abort handling for FD types.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
2c96eb7d62 IB/uverbs: Always propagate errors from rdma_alloc_commit_uobject()
The ioctl framework already does this correctly, but the write path did
not. This is trivially fixed by simply using a standard pattern to return
uobj_alloc_commit() as the last statement in every function.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
e951747a08 IB/uverbs: Rework the locking for cleaning up the ucontext
The locking here has always been a bit crazy and spread out, upon some
careful analysis we can simplify things.

Create a single function uverbs_destroy_ufile_hw() that internally handles
all locking. This pulls together pieces of this process that were
sprinkled all over the places into one place, and covers them with one
lock.

This eliminates several duplicate/confusing locks and makes the control
flow in ib_uverbs_close() and ib_uverbs_free_hw_resources() extremely
simple.

Unfortunately we have to keep an extra mutex, ucontext_lock.  This lock is
logically part of the rwsem and provides the 'down write, fail if write
locked, wait if read locked' semantic we require.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
87064277c4 IB/uverbs: Revise and clarify the rwsem and uobjects_lock
Rename 'cleanup_rwsem' to 'hw_destroy_rwsem' which is held across any call
to the type destroy function (aka 'hw' destroy). The main purpose of this
lock is to prevent normal add and destroy from running concurrently with
uverbs_cleanup_ufile()

Since the uobjects list is always manipulated under the 'hw_destroy_rwsem'
we can eliminate the uobjects_lock in the cleanup function. This allows
converting that lock to a very simple spinlock with a narrow critical
section.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
e6d5d5ddd0 IB/uverbs: Clarify and revise uverbs_close_fd
The locking requirements here have changed slightly now that we can rely
on the ib_uverbs_file always existing and containing all the necessary
locking infrastructure.

That means we can get rid of the cleanup_mutex usage (this was protecting
the check on !uobj->context).

Otherwise, follow the same pattern that IDR uses for destroy, acquire
exclusive write access, then call destroy and then undo the 'lookup'.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
5671f79b42 IB/uverbs: Revise the placement of get/puts on uobject
This wasn't wrong, but the placement of two krefs didn't make any
sense. Follow some simple rules.

- A kref is held inside uobjects_list
- A kref is held inside the IDR
- A kref is held inside file->private
- A stack based kref is passed between alloc_begin and
  alloc_abort/alloc_commit

Any place we destroy one of the above pointers, we stick a put,
or 'move' the kref into another pointer.

The key functions have sensible semantics:
- alloc_uobj fully initializes the common members in uobj, including
  the list
- Get rid of the uverbs_idr_remove_uobj helper since IDR remove
  does require put, but it depends on the situation. Later
  patches will re-consolidate this differently.
- alloc_abort always consumes the passed kref, done in the type
- alloc_commit always consumes the passed kref, done in the type
- rdma_remove_commit_uobject always pairs with a lookup_get

After it is all done the only control flow change is to:
- move a get from alloc_commit_fd_uobject to rdma_alloc_commit_uobject
- add a put to remove_commit_idr_uobject
- Consistently use rdma_lookup_put in rdma_remove_commit_uobject at
  the right place

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:22 -06:00
Jason Gunthorpe
c561c28846 IB/uverbs: Clarify the kref'ing ordering for alloc_commit
The alloc_commit callback makes the uobj visible to other threads,
and it does so using a 'move' semantic of the uobj kref on the stack
into the public storage (eg the IDR, uobject list and file_private_data)

Once this is done another thread could start up and trigger deletion
of the kref. Fortunately cleanup_rwsem happens to prevent this from
being a bug, but that is a fantastically unclear side effect.

Re-organize things so that alloc_commit is the last thing to touch
the uobj, get rid of the sneaky implicit dependency on cleanup_rwsem,
and add a comment reminding that uobj is no longer kref'd after
alloc_commit.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:21 -06:00
Jason Gunthorpe
1250c3048c IB/uverbs: Handle IDR and FD types without truncation
Our ABI for write() uses a s32 for FDs and a u32 for IDRs, but internally
we ended up implicitly casting these ABI values into an 'int'. For ioctl()
we use a s64 for FDs and a u64 for IDRs, again casting to an int.

The various casts to int are all missing range checks which can cause
userspace values that should be considered invalid to be accepted.

Fix this by making the generic lookup routine accept a s64, which does not
truncate the write API's u32/s32 or the ioctl API's s64. Then push the
detailed range checking down to the actual type implementations to be
shared by both interfaces.

Finally, change the copy of the uobj->id to sign extend into a s64, so eg,
if we ever wish to return a negative value for a FD it is carried
properly.

This ensures that userspace values are never weirdly interpreted due to
the various truncations and everything that is really out of range gets
an EINVAL.
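
To illustrate why the range check has to happen before narrowing, here is a
small sketch. The helper names and the exact accepted ranges are assumptions
made for the example, not the kernel's type implementations:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical IDR check: the generic layer hands over a 64 bit signed id,
 * and only values that fit the 32 bit handle space are accepted.
 */
static int demo_idr_id_from_s64(int64_t id, uint32_t *out)
{
	assert(out != NULL);
	if (id < 0 || id > UINT32_MAX)
		return -EINVAL;
	*out = (uint32_t)id;
	return 0;
}

/*
 * Hypothetical FD check: without it, a value such as 0x100000003 would be
 * narrowed to an int and typically end up as fd 3.
 */
static int demo_fd_from_s64(int64_t id, int *out)
{
	assert(out != NULL);
	if (id < 0 || id > INT_MAX)
		return -EINVAL;
	*out = (int)id;
	return 0;
}
```
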

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25 14:21:21 -06:00
Jason Gunthorpe
3df593bfe6 IB/uverbs: Get rid of null_obj_type
If the method fails after calling rdma_explicit_destroy (eg if
copy_to_user faults) then it will trigger a kernel oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 800000000548d067 P4D 800000000548d067 PUD 54a0067 PMD 0
SMP PTI
CPU: 0 PID: 359 Comm: ibv_rc_pingpong Not tainted 4.18.0-rc1+ #28
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
RIP: 0010:          (null)
Code: Bad RIP value.
RSP: 0018:ffffc900001a3bf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88000603bd00 RCX: 0000000000000003
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88000603bd00
RBP: 0000000000000001 R08: ffffc900001a3cf8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900001a3cf0
R13: 0000000000000000 R14: ffffc900001a3cf0 R15: 0000000000000000
FS:  00007fb00dda8700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000000548e004 CR4: 00000000003606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? rdma_lookup_put_uobject+0x22/0x50 [ib_uverbs]
 ? uverbs_finalize_object+0x3b/0x60 [ib_uverbs]
 ? uverbs_finalize_attrs+0x128/0x140 [ib_uverbs]
 ? ib_uverbs_cmd_verbs+0x698/0x7c0 [ib_uverbs]
 ? find_held_lock+0x2d/0x90
 ? __might_fault+0x39/0x90
 ? ib_uverbs_ioctl+0x111/0x1f0 [ib_uverbs]
 ? do_vfs_ioctl+0xa0/0x6d0
 ? trace_hardirqs_on_caller+0xed/0x180
 ? _raw_spin_unlock_irq+0x24/0x40
 ? syscall_trace_enter+0x138/0x1d0
 ? ksys_ioctl+0x35/0x60
 ? __x64_sys_ioctl+0x11/0x20
 ? do_syscall_64+0x5b/0x1c0
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is because the type was replaced with the null_type during explicit
destroy that cannot complete the destruction.

One of the side effects of replacing the type is to make the object
handle totally unreachable - so no other command could attempt to use
it, even though it remains on the uobject list.

We can get the same end result by just fully destroying the object inside
rdma_explicit_destroy and leaving the caller the residual kref for the
uobj with no attached HW object, and no presence in the uobjects list.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-25 14:21:21 -06:00
Bart Van Assche
1fec77bf8f RDMA/core: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 16:06:36 -06:00
Jack Morgenstein
addb8a6559 RDMA/uverbs: Expand primary and alt AV port checks
The commit cited below checked that the port numbers provided in the
primary and alt AVs are legal.

That is sufficient to prevent a kernel panic. However, it is not
sufficient for correct operation.

In Linux, AVs (both primary and alt) must be completely self-described.
We do not accept an AV from userspace without an embedded port number.
(This has been the case since kernel 3.14 commit dbf727de74
("IB/core: Use GID table in AH creation and dmac resolution")).

For the primary AV, this embedded port number must match the port number
specified with IB_QP_PORT.

We also expect the port number embedded in the alt AV to match the
alt_port_num value passed by the userspace driver in the modify_qp command
base structure.

Add these checks to modify_qp.

Cc: <stable@vger.kernel.org> # 4.16
Fixes: 5d4c05c3ee ("RDMA/uverbs: Sanitize user entered port numbers prior to access it")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 15:30:34 -06:00
Yishai Hadas
6cd080a674 IB: Support ib_flow creation in drivers
This patch considers the case that ib_flow is created by some device
driver with its specific parameters using the KABI infrastructure.

In that case both QP and ib_uflow_resources might not be applicable.
Downstream patches from this series use the above functionality.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 13:34:55 -06:00
Yishai Hadas
fd44e3853c IB/mlx5: Introduce flow steering matcher uapi object
Introduce flow steering matcher object and its create and destroy methods.

This matcher object holds some mlx5 specific driver properties that
matches the underlay device specification when an mlx5 flow steering group
is created.

It will be used in downstream patches to be part of mlx5 specific create
flow method.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 13:34:37 -06:00
Ingo Molnar
52b544bd38 Linux 4.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y
 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi
 f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e
 f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu
 VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK
 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI
 JjzNOkI=
 =ckcO
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:27:43 +02:00
Jason Gunthorpe
c012691508 IB/cm: Remove cma_multicast->igmp_joined
This variable isn't read and written to with proper locking, so it is
racy. Instead of using an unlocked bool, use presence in the mc->list.

The caller could race rdma_join_multicast with rdma_leave_multicast which
would leak a mc join and cause a use after free of mc.

Instead, do not add the mc to the list until it has completed
initialization, all mcs on the list require leaving.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-13 12:18:55 -06:00
Leon Romanovsky
1215cb7c88 RDMA/umem: Refactor exit paths in ib_umem_get
Simplify exit paths in ib_umem_get to use the standard goto unwind
pattern.
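
For readers unfamiliar with the goto unwind pattern, a generic userspace
sketch follows. It is not the ib_umem_get() code: the struct, the function
names and the fail_at test hook are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object that needs two allocations to be usable. */
struct demo_region {
	void **page_list;
	long *map;
};

static void demo_region_put(struct demo_region *r)
{
	if (!r)
		return;
	free(r->map);
	free(r->page_list);
	free(r);
}

/*
 * fail_at is a test hook: 1 or 2 simulates a failure of that allocation.
 * Each error label undoes exactly the steps that succeeded before it.
 */
static int demo_region_get(size_t npages, int fail_at, struct demo_region **out)
{
	struct demo_region *r;
	int ret;

	assert(out != NULL && npages > 0);
	*out = NULL;

	r = calloc(1, sizeof(*r));
	if (!r)
		return -ENOMEM;

	r->page_list = (fail_at == 1) ? NULL : calloc(npages, sizeof(void *));
	if (!r->page_list) {
		ret = -ENOMEM;
		goto err_free_region;
	}

	r->map = (fail_at == 2) ? NULL : calloc(npages, sizeof(long));
	if (!r->map) {
		ret = -ENOMEM;
		goto err_free_pages;
	}

	*out = r;
	return 0;

err_free_pages:
	free(r->page_list);
err_free_region:
	free(r);
	return ret;
}
```
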

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-13 12:15:05 -06:00
Leon Romanovsky
40ddacf2dd RDMA/umem: Don't hold mmap_sem for too long
DMA mapping is a time-consuming operation and doesn't need to be performed
while the mmap_sem semaphore is held.

The semaphore only needs to be held for accounting and get_user_pages
related activities.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-13 12:09:57 -06:00
Yishai Hadas
528922afd4 IB: Enable uverbs_destroy_def_handler to be used by drivers
Enable uverbs_destroy_def_handler to be used by drivers and replace
current code to use it.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-10 11:52:06 -06:00
Artemy Kovalyov
8942acea37 IB/uverbs: Pass IB_UVERBS_QPF_GRH_REQUIRED to user space
Userspace also needs to know if the port requires GRHs to properly form
the AVs it creates.

Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-10 11:13:04 -06:00
Artemy Kovalyov
b02289b3d6 RDMA: Validate grh_required when handling AVs
Extend the existing grh_required flag to check when AV's are handled that
a GRH is present.

Since we don't want to do query_port during the AV checks for performance
reasons, move the flag into the immutable_data.

Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-10 11:13:04 -06:00
Jason Gunthorpe
958200ad8e RDMA/hfi1: Move grh_required into update_sm_ah
grh_required is intended to be a global setting where all AV's will
require a GRH, not just the sm_lid. Move the special logic to the creation
of the SM AH.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-10 11:13:04 -06:00
Jason Gunthorpe
2f944c0fbf RDMA: Fix storage of PortInfo CapabilityMask in the kernel
The internal flag IP_BASED_GIDS was added to a field that was being used
to hold the port Info CapabilityMask without considering the effects this
will have. Since most drivers just use the value from the HW MAD it means
IP_BASED_GIDS will also become set on any HW that sets the IBA flag
IsOtherLocalChangesNoticeSupported - which is not intended.
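
The collision described above can be sketched as follows. This is an illustration only: the DEMO_ names, the bit position 26, and the struct demo_port layout are assumptions made for the example, not the kernel's definitions.

```c
#include <stdint.h>

/* Assumed for illustration: both flags occupy the same bit position. */
#define DEMO_IBA_OTHER_LOCAL_CHANGES (UINT32_C(1) << 26) /* from the HW MAD */
#define DEMO_INTERNAL_IP_BASED_GIDS  (UINT32_C(1) << 26) /* kernel-internal */

/* Shared field: the test cannot tell the two meanings apart. */
int demo_ip_based_gids_shared(uint32_t port_cap_flags)
{
    return (port_cap_flags & DEMO_INTERNAL_IP_BASED_GIDS) != 0;
}

/* Split storage: the mask holds only IBA bits, the internal flag lives apart. */
struct demo_port {
    uint32_t port_cap_flags;  /* IBA CapabilityMask value only */
    unsigned int ip_gids : 1; /* unrelated internal flag */
};

int demo_ip_based_gids_split(const struct demo_port *port)
{
    return port->ip_gids;
}
```

With the shared field, a device that merely reports the IBA capability also appears to use IP based GIDs; with split storage the two answers are independent.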

Fix this by keeping port_cap_flags only for the IBA CapabilityMask value
and store unrelated flags externally. Move the bit definitions for this to
ib_mad.h to make it clear what is happening.

To keep the uAPI unchanged define a new set of flags in the uapi header
that are only used by ib_uverbs_query_port_resp.port_cap_flags which match
the current flags supported in rdma-core, and the values exposed by the
current kernel.

Fixes: b4a26a2728 ("IB: Report using RoCE IP based gids in port caps")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-10 11:06:45 -06:00
Parav Pandit
07e7056aff IB/core: Simplify check for RoCE route resolve
roce_resolve_route_from_path() resolves the route based on the netdevice
of the GID attribute, so there is no point in checking again whether
the resolved route matches the interface it arrived on.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 15:11:53 -06:00
Jason Gunthorpe
97202bbe22 IB/uverbs: Do not use uverbs_cmd_mask in the ioctl path
Instead we are now checking the function pointers directly. Get rid of
both cases in ioctl and drop the nonsense idea that destroy can fail.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 13:19:01 -06:00
Bart Van Assche
222c7b1fd4 RDMA/rw: Fix rdma_rw_ctx_signature_init() kernel-doc header
Fixes: 0e353e34e1 ("IB/core: add RW API support for signature MRs")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 13:07:23 -06:00
Bart Van Assche
f8c2d2280c RDMA/core: Remove set-but-not-used variables
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 12:11:22 -06:00
Bart Van Assche
28e39894ed RDMA/core: Remove ib_find_cached_gid() and ib_find_cached_gid_by_port()
Remove these two functions since all their callers have been removed.
See also commit ea8c2d8f60 ("RDMA/core: Remove unused ib cache
functions").

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 12:11:22 -06:00
Håkon Bugge
87a37ce9e4 IB/cm: Remove unused and erroneous msg sequence encoding
In cm_form_tid(), a two bit message sequence number is OR'ed into bit
31-30 of the lower TID value.

After commit f06d265375 ("IB/cm: Randomize starting comm ID"), the
local_id is XOR'ed with a 32-bit random value. Hence, bit 31-30 in the
lower TID now has an arbitrary value and it makes no sense to OR in
the message sequence number.
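
A small sketch of why the OR'ed-in sequence number cannot be recovered once the ID has been XOR'ed with a random value. The demo_low_tid() and demo_decode_seq() helpers are hypothetical re-creations for illustration, not the cm_form_tid() source.

```c
#include <stdint.h>

/* Hypothetical re-creation of the removed encoding. */
uint32_t demo_low_tid(uint32_t local_id, uint32_t random_operand,
                      uint32_t msg_seq)
{
    /* Randomization may already set bits 31-30 of the ID. */
    uint32_t id = local_id ^ random_operand;

    /* OR can only set bits, so it cannot clear what the XOR left behind. */
    return id | ((msg_seq & 0x3u) << 30);
}

/* Attempt to read the sequence number back from bits 31-30. */
uint32_t demo_decode_seq(uint32_t low_tid)
{
    return low_tid >> 30;
}
```

For example, with local_id 5, random operand 0x80000000 and sequence number 1, the decoded value is 3 rather than 1; with a random operand of 0 the value 1 comes back intact.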

Adding to that, the evolution in use of IDR routines in cm_alloc_id()
has always had the possibility of returning a value with bit 30 set.

In addition, said bits are never checked.

Hence, remove the encoding and the corresponding enum.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 11:39:28 -06:00
Jason Gunthorpe
76bc79ccce IB/uverbs: Replace ib_ucq_object uverbs_file with the one in ib_uobject
Now that ib_uobject has a ib_uverbs_file we don't need this extra one in
ib_ucq_object.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
d0259e82e7 IB/uverbs: Remove ib_uobject_file
The only purpose for this structure was to hold the ib_uobject_file
pointer, but now that is part of the standard ib_uobject the structure
no longer makes any sense, so get rid of it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
6f258884dd IB/uverbs: Tidy up remaining references to ucontext
Unnecessary clutter, to indirect through ucontext when the ufile would do.
Generally most of the code should only be working with ufile, except
for a few places that touch the driver interface.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
2cc1e3b809 IB/uverbs: Replace file->ucontext with file in uverbs_cmd.c
The ucontext isn't needed any more, just pass the uverbs_file directly.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
6ef1c82821 IB/uverbs: Replace ib_ucontext with ib_uverbs_file in core function calls
The correct handle to refer to the idr/etc is ib_uverbs_file, revise all
the core APIs to use this instead. The user API are left as wrappers
that automatically convert a ucontext to a ufile for now.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
6a5e9c8841 IB/uverbs: Move non driver related elements from ib_ucontext to ib_ufile
The IDR is part of the ib_ufile so all the machinery to lock it, handle
closing and disassociation rightly belongs to the ufile not the ucontext.

This changes the lifetime of that data to match the lifetime of the file
descriptor which is always strictly longer than the lifetime of the
ucontext.

We need the entire locking machinery to continue to exist after ucontext
destruction to allow us to return the destroy data after a device has been
disassociated.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
c33e73af21 IB/uverbs: Add a uobj_perform_destroy helper
This consolidates a bunch of repeated code patterns into a helper.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-09 11:26:17 -06:00
Jason Gunthorpe
422e3d37ed RDMA/uverbs: Combine MIN_SZ_OR_ZERO with UVERBS_ATTR_STRUCT
After all the rework is done it is now possible to include single flags in
the type macros. Any user of UVERBS_ATTR_STRUCT needs to zero check data
past the end of the known struct to be correct, so make this mandatory,
and get rid of MIN_SZ_OR_ZERO as a user flag.

This changes UVERBS_ATTR_TYPE to refer to a struct of exact size with no
possibility of extension. Convert the few users of UVERBS_ATTR_TYPE and
MIN_SZ_OR_ZERO to use UVERBS_ATTR_STRUCT.

The one user of UVERBS_ATTR_STRUCT without MIN_SZ_OR_ZERO is just
confused. There is some padding at the end of that struct, but userspace
always provides it with the padding. The construction doesn't test if the
padding is zero, so it is pointless. Just use UVERBS_ATTR_TYPE.

Finally, rename min_sz_or_zero to zero_trailing to better reflect what it
does and hopefully avoid such mis-uses in the future.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
540cd69209 RDMA/uverbs: Use UVERBS_ATTR_MIN_SIZE correctly and uniformly
This newer macro allows specifying a lower bound on the accepted size, and
has an 'unlimited' upper bound. Due to this it never checks for trailing
zeroing so it doesn't make any sense to combine it with MIN_SZ_OR_ZERO, so
drop MIN_SZ_OR_ZERO when they are used together.

There were a couple of places that open coded this pattern; switch them to
use the clearer UVERBS_ATTR_MIN_SIZE.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
83bb444233 RDMA/uverbs: Remove UA_FLAGS
This bit of boilerplate isn't really necessary, we can use bitfields
instead of a flags enum and the macros can then individually initialize
them through the __VA_ARGS__ like everything else.
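
The idea of initializing bitfields through __VA_ARGS__ can be sketched like this. The struct demo_attr_spec type and the DEMO_ macros are hypothetical names chosen for the example; they only mirror the general shape of the approach.

```c
/* Hypothetical attribute spec; field names are illustrative only. */
struct demo_attr_spec {
    unsigned short id;
    unsigned short len;
    unsigned int mandatory : 1;
    unsigned int zero_trailing : 1;
};

/* Each "flag" is just a designated initializer for one bitfield. */
#define DEMO_MANDATORY     .mandatory = 1
#define DEMO_ZERO_TRAILING .zero_trailing = 1

/* Extra arguments land in the initializer list after the fixed fields. */
#define DEMO_ATTR(_id, _len, ...) \
    ((struct demo_attr_spec){ .id = (_id), .len = (_len), __VA_ARGS__ })

struct demo_attr_spec demo_make_required(void)
{
    return DEMO_ATTR(1, 8, DEMO_MANDATORY);
}

struct demo_attr_spec demo_make_both(void)
{
    return DEMO_ATTR(2, 16, DEMO_MANDATORY, DEMO_ZERO_TRAILING);
}
```

Fields that are not named in the initializer list are zeroed, so an unset flag needs no explicit entry and no separate flags enum.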

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
9a119cd597 RDMA/uverbs: Get rid of the & in method specifications
Hide it inside the macros. The & is confusing and interferes with using
this as a generic DSL in later patches.

Since this also touches almost every line, also run the specs through
clang-format (with 'BinPackParameters: false') to make the maintenance
easier.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
6c61d2a55c RDMA/uverbs: Simplify UVERBS_OBJECT and _TREE family of macros
Instead of the large set of indirecting macros, define the few needed
macros to directly instantiate the struct uverbs_object_tree_def and
associated objects list.

This is a small amount of code duplication but the readability is far
better.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
595c7736d4 RDMA/uverbs: Simplify method definition macros
Instead of the large set of indirecting macros, define the few needed
macros to directly instantiate the struct uverbs_method_def and associated
attributes list.

This is a small amount of code duplication but the readability is far
better.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
d108dac080 RDMA/uverbs: Simplify UVERBS_ATTR family of macros
Instead of using a complex cascade of macros, just directly provide the
initializer list each of the declarations is trying to create.

Now that the macros are simplified this also reworks the uverbs_attr_spec
to be friendly to older compilers by eliminating any unnamed
structures/unions inside, and removing the duplication of some fields. The
structure size remains at 16 bytes which was the original motivation for
some of this oddness.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
ad544cfe54 RDMA/uverbs: Split UVERBS_ATTR_FLOW_ACTION_ESP_HANDLE
Two methods are sharing the same attribute constant, but the attribute
definitions are not the same. This should not have been done; instead,
split them into two attributes with the same number.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Jason Gunthorpe
87fc2a620a RDMA/uverbs: Store the specs_root in the struct ib_uverbs_device
The specs are required to operate the uverbs file, so they belong inside
the ib_uverbs_device, not inside the ib_device. The spec passed in the
ib_device is just a communication from the driver and should not be used
during runtime.

This also changes the lifetime of the spec memory to match the
ib_uverbs_device, however at this time the spec_root can still contain
driver pointers after disassociation, so it cannot be used if ib_dev is
NULL. This is preparation for another series.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-04 13:47:01 -06:00
Dan Carpenter
c2d7c8ff89 IB/core: type promotion bug in rdma_rw_init_one_mr()
"nents" is an unsigned int, so if ib_map_mr_sg() returns a negative
error code then it's type promoted to a high unsigned int which is
treated as success.
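
The promotion problem can be reproduced in isolation. In this sketch demo_ok_buggy() and demo_ok_fixed() are hypothetical helpers, and -22 simply stands in for a negative error code; the cast spells out the conversion that C applies implicitly when an int is compared with an unsigned int.

```c
#include <stdbool.h>

/*
 * Buggy shape: comparing an int result against an unsigned count.
 * The cast makes the implicit conversion visible: a negative ret
 * becomes a very large unsigned value, so "ret < nents" is false.
 * Returns true when the caller would treat the mapping as successful.
 */
bool demo_ok_buggy(int ret, unsigned int nents)
{
    return !((unsigned int)ret < nents);
}

/* Fixed shape: reject negative error codes before the unsigned comparison. */
bool demo_ok_fixed(int ret, unsigned int nents)
{
    return !(ret < 0 || (unsigned int)ret < nents);
}
```

With ret = -22 and nents = 4, demo_ok_buggy() reports success while demo_ok_fixed() reports failure.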

Fixes: a060b5629a ("IB/core: generic RDMA READ/WRITE API")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-04 12:05:28 -06:00
Leon Romanovsky
fe48aecb4d RDMA/uverbs: Don't fail in creation of multiple flows
The conversion from offsetof() calculations to sizeof()
behaved wrongly when the exact size was missed and in scenarios with
more than one flow.

In such a scenario we got "create flow failed, flow 10: 8 bytes
left from uverb cmd" error, which is wrong because the size of
kern_spec is exactly 8 bytes, and we were not supposed to fail.

Cc: <stable@vger.kernel.org> # 3.12
Fixes: 4fae7f1704 ("RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow")
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-03 14:08:00 -06:00
Yishai Hadas
1c77483e4c IB: Improve uverbs_cleanup_ucontext algorithm
Improve uverbs_cleanup_ucontext algorithm to work properly when the
topology graph of the objects cannot be determined at compile time.  This
is the case with objects created via the devx interface in mlx5.

Typically uverbs objects must be created in a strict topologically sorted
order, so that LIFO ordering will generally cause them to be freed
properly. There are only a few cases (eg memory windows) where objects can
point to things out of the strict LIFO order.

Instead of using an explicit ordering scheme where the HW destroy is not
allowed to fail, go over the list multiple times and allow the destroy
function to fail. If progress halts then a final, desperate, cleanup is
done before leaking the memory. This indicates a driver bug.
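
The multi-pass idea can be sketched in userspace as below. The struct demo_obj type and its dependency model are hypothetical; the final forced cleanup step is left out, and the function only reports how many objects are still stuck.

```c
#include <stddef.h>

/* Hypothetical object: destroy fails while a live object depends on it. */
struct demo_obj {
    int alive;
    int depends_on; /* index of the object this one points at, or -1 */
};

/* Returns -1 if any live object still refers to objs[idx], else destroys it. */
static int demo_destroy(struct demo_obj *objs, size_t n, size_t idx)
{
    for (size_t i = 0; i < n; i++)
        if (objs[i].alive && objs[i].depends_on == (int)idx)
            return -1;
    objs[idx].alive = 0;
    return 0;
}

/*
 * Sweep the list repeatedly, newest first, until a full pass makes no
 * progress. Returns the number of objects that could not be destroyed.
 */
size_t demo_cleanup(struct demo_obj *objs, size_t n)
{
    size_t remaining;
    int progress;

    do {
        progress = 0;
        remaining = 0;
        for (size_t i = n; i-- > 0;) {
            if (!objs[i].alive)
                continue;
            if (demo_destroy(objs, n, i) == 0)
                progress = 1;
            else
                remaining++;
        }
    } while (progress && remaining);

    return remaining;
}
```

An older object that points at a newer one defeats a single LIFO pass but is handled on the second sweep; a dependency cycle halts progress and is reported as remaining objects, which is the point where a last-resort cleanup would take over.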

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-29 14:35:46 -06:00
Jason Gunthorpe
92ebb6a0a1 IB/cm: Remove now useless rcu_lock in dst_fetch_ha
This lock used to be protecting a call to dst_get_neighbour_noref,
however the below commit changed it to dst_neigh_lookup which no longer
requires rcu.

Access to nud_state, neigh_event_send or rdma_copy_addr does not require
RCU, so delete the lock.

Fixes: 02b619555a ("infiniband: Convert dst_fetch_ha() over to dst_neigh_lookup().")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-29 11:45:50 -06:00
Jason Gunthorpe
7a5c938b9e IB/core: Check for rdma_protocol_ib only after validating port_num
port_num is untrusted data from the user, so it should be checked after
calling fill_sgid_attr, which validates it.

Fixes: 8d9ec9addd ("IB/core: Add a sgid_attr pointer to struct rdma_ah_attr")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-27 15:05:58 -06:00
Leon Romanovsky
d9c44040ed RDMA/uverbs: Remove redundant check
kern_spec->reserved is checked prior to calling
kern_spec_to_ib_spec_filter() which makes this second check redundant.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-26 14:37:26 -06:00
Leon Romanovsky
3a2e791c94 RDMA/umem: Don't check for a negative return value of dma_map_sg_attrs()
dma_map_sg_attrs() returns 0 on error and can't return a negative number
(ensured by BUG_ON), so don't check.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-26 14:37:26 -06:00
Leon Romanovsky
a5cc9831af RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
The number of specs is provided by the user and in a valid case can be equal
to zero. Such an argument leads to a kcalloc() call with a zero-length request,
which returns ZERO_SIZE_PTR. This pointer is different from NULL
and makes various if (..) checks succeed.
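
A userspace sketch of the pitfall. The DEMO_ macros are stand-ins modelled on the kernel's ZERO_SIZE_PTR and ZERO_OR_NULL_PTR definitions, and demo_kcalloc() is a hypothetical imitation of the zero-length behaviour, not the real allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins modelled on the kernel's slab definitions. */
#define DEMO_ZERO_SIZE_PTR ((void *)16)
#define DEMO_ZERO_OR_NULL_PTR(p) \
    ((uintptr_t)(p) <= (uintptr_t)DEMO_ZERO_SIZE_PTR)

/* Imitates a zero-length request: no memory, but a non-NULL token. */
void *demo_kcalloc(size_t n, size_t size)
{
    if (n == 0 || size == 0)
        return DEMO_ZERO_SIZE_PTR;
    return calloc(n, size);
}

/* A plain NULL test treats the zero-size token as a real allocation. */
int demo_looks_allocated(const void *p)
{
    return p != NULL;
}
```

Code that must distinguish "nothing allocated" from "allocated" therefore needs either to skip the allocation when the count is zero or to use a check that covers both NULL and the zero-size token.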

Fixes: b6ba4a9aa5 ("IB/uverbs: Add support for flow counters")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-26 14:37:26 -06:00
Leon Romanovsky
4fae7f1704 RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
The check of cmd.flow_attr.size should take into account the size of the
reserved field (2 bytes), otherwise a user can provide a size which will
cause the slab-out-of-bounds warning below.

==================================================================
BUG: KASAN: slab-out-of-bounds in ib_uverbs_ex_create_flow+0x1740/0x1d00
Read of size 2 at addr ffff880068dff1a6 by task syz-executor775/269

CPU: 0 PID: 269 Comm: syz-executor775 Not tainted 4.18.0-rc1+ #245
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0xef/0x17e
 print_address_description+0x83/0x3b0
 kasan_report+0x18d/0x4d0
 ib_uverbs_ex_create_flow+0x1740/0x1d00
 ib_uverbs_write+0x923/0x1010
 __vfs_write+0x10d/0x720
 vfs_write+0x1b0/0x550
 ksys_write+0xc6/0x1a0
 do_syscall_64+0xa7/0x590
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x433899
Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d
89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 91 fd ff c3 66
2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc2724db58 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000020006880 RCX: 0000000000433899
RDX: 00000000000000e0 RSI: 0000000020002480 RDI: 0000000000000003
RBP: 00000000006d7018 R08: 00000000004002f8 R09: 00000000004002f8
R10: 00000000004002f8 R11: 0000000000000217 R12: 0000000000000000

R13: 000000000040cd20 R14: 000000000040cdb0 R15: 0000000000000006

Allocated by task 269:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc+0x1a9/0x510
 ib_uverbs_ex_create_flow+0x26c/0x1d00
 ib_uverbs_write+0x923/0x1010
 __vfs_write+0x10d/0x720
 vfs_write+0x1b0/0x550
 ksys_write+0xc6/0x1a0
 do_syscall_64+0xa7/0x590
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 0:
 __kasan_slab_free+0x12e/0x180
 kfree+0x159/0x630
 detach_buf+0x559/0x7a0
 virtqueue_get_buf_ctx+0x3cc/0xab0
 virtblk_done+0x1eb/0x3d0
 vring_interrupt+0x16d/0x2b0
 __handle_irq_event_percpu+0x10a/0x980
 handle_irq_event_percpu+0x77/0x190
 handle_irq_event+0xc6/0x1a0
 handle_edge_irq+0x211/0xd80
 handle_irq+0x3d/0x60
 do_IRQ+0x9b/0x220

The buggy address belongs to the object at ffff880068dff180
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 38 bytes inside of
 64-byte region [ffff880068dff180, ffff880068dff1c0)
The buggy address belongs to the page:
page:ffffea0001a37fc0 count:1 mapcount:0 mapping:ffff88006c401780
index:0x0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 ffffea0001a31100 0000001100000011 ffff88006c401780
raw: 0000000000000000 00000000802a002a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880068dff080: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
 ffff880068dff100: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>ffff880068dff180: 00 00 00 00 07 fc fc fc fc fc fc fc fb fb fb fb
                               ^
 ffff880068dff200: fb fb fb fb fc fc fc fc 00 00 00 00 00 00 fc fc
 ffff880068dff280: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc
==================================================================

Cc: <stable@vger.kernel.org> # 3.12
Fixes: f884827438 ("IB/core: clarify overflow/underflow checks on ib_create/destroy_flow")
Cc: syzkaller <syzkaller@googlegroups.com>
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-25 15:27:51 -06:00
Leon Romanovsky
940efcc888 RDMA/uverbs: Protect from attempts to create flows on unsupported QP
Flows can be created on UD and RAW_PACKET QP types. Attempts to provide
other QP types as an input cause various unpredictable failures.

The reason is that, in order to support all the various types (e.g. XRC), we
are supposed to use the real_qp handle rather than the qp handle and expect
the driver/FW to fail such (XRC) flows. The simpler and safer variant is to
ban all QP types except UD and RAW_PACKET, instead of relying on
driver/FW.

Cc: <stable@vger.kernel.org> # 3.11
Fixes: 436f2ad05a ("IB/core: Export ib_create/destroy_flow through uverbs")
Cc: syzkaller <syzkaller@googlegroups.com>
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-25 15:27:51 -06:00
Leon Romanovsky
1ccddc42da RDMA/verbs: Drop kernel variant of destroy_flow
Following the removal of ib_create_flow(), adjust the code to get rid of
ib_destroy_flow() too.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-25 15:22:01 -06:00
Leon Romanovsky
ca576fbbdc RDMA/verbs: Drop kernel variant of create_flow
There are no kernel users of this interface so let's drop it.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-25 15:22:01 -06:00
Jason Gunthorpe
e99028ad76 RDMA/uverbs: Check existence of create_flow callback
In the accepted series "Refactor ib_uverbs_write path", we presented the
roadmap to get rid of uverbs_cmd_mask and uverbs_ex_cmd_mask fields in
favor of simple check of function pointer. So let's put NULL check of
create_flow function callback despite the fact that uverbs_ex_cmd_mask
still exists.

Link: https://www.spinics.net/lists/linux-rdma/msg60753.html
Suggested-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-25 15:21:56 -06:00
Jason Gunthorpe
ea8c2d8f60 RDMA/core: Remove unused ib cache functions
Now that all users have been converted to use the version of these APIs
that returns a gid_attr pointer we can delete the old entry points.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:57 -06:00
Parav Pandit
a8872d53e9 IB/cm: Use sgid_attr from the AV
Prior patches now ensure that the AV has a sgid_attr, if one would have
been required.  Instead of querying for one, take it directly from the AH.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:57 -06:00
Parav Pandit
398391071f IB/cm: Replace members of sa_path_rec with 'struct sgid_attr *'
While processing a path record entry in CM messages the associated GID
attribute is now also supplied.

Currently for RoCE a netdevice's net namespace pointer and ifindex are
stored in the path record entry. Both of these fields of the netdev can change
at any time while processing CM messages. Additionally, storing the net
namespace without holding a reference will lead to a use-after-free crash.
Therefore it is removed. Netdevice information for RoCE is instead provided
via the referenced gid attribute in ib_cm requests.

Such a design leads to a situation where the kernel can crash when the net
pointer becomes invalid. However today it is always initialized to
init_net, which cannot become invalid. In order to support processing
packets in any arbitrary namespace of the received packet, it is necessary
to avoid such conditions.

This patch removes the dependency on the net pointer and ifindex; instead
it will rely on SGID attribute which contains a pointer to netdev.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:57 -06:00
Parav Pandit
815d456ef2 IB/cm: Pass the sgid_attr through various events
Make the sgid_attr available along with path information to the event
consumer, this allows the consumer to keep using the same GID table entry
as the event is related to.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:57 -06:00
Parav Pandit
4ed13a5f2d IB/cm: Keep track of the sgid_attr that created the cm id
Hold a reference to the sgid_attr which is used in a cm_id until the
cm_id is destroyed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:56 -06:00
Parav Pandit
aa74f4878d IB: Make init_ah_attr_grh_fields set sgid_attr
Use the sgid and other information from the path record to figure out the
sgid_attrs.

Store the selected table entry in the sgid_attr for everything else to
use.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:56 -06:00
Parav Pandit
f685c19529 IB: Make ib_init_ah_from_mcmember set sgid_attr
This is really just a CM support function, normally a multicast address
does not have a specific SGID - but the RDMA CM usage model does restrict
things to the netdevice the CM id is bound to, at least for the RoCE case.

Store the selected table entry in the sgid_attr for everything else to
use.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:56 -06:00
Parav Pandit
b740321765 IB: Make ib_init_ah_attr_from_wc set sgid_attr
The work completion is inspected to determine which dgid table entry was
used to receive the packet; a matching sgid_attr is then produced and stored
in the ah_attr.

All callers of this function are now required to release the ah_attr on
success.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-25 14:19:56 -06:00
Parav Pandit
59d4081332 IB/core: Free GID table entry during GID deletion
If we already hold the table->lock when doing the kref_put it means we are
in a context where it is safe to do the deletion synchronously, with no
need for the work queue.

This helps to eliminate issues when GID change is requested as part of MAC
address change or bonding event change where expectation is to replace the
GID almost immediately.

Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-22 09:02:59 -06:00
Parav Pandit
8814567892 RDMA/cma: Consider net namespace while leaving multicast group
When sending a multicast leave request, consider the net ns in which this
cm_id was created.

Code was duplicated in cma_leave_mc_groups() and rdma_leave_multicast(),
which is now done using a helper function cma_leave_roce_mc_group().

Fixes: bee3c3c918 ("IB/cma: Join and leave multicast groups with IGMP")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-22 09:02:59 -06:00
Jason Gunthorpe
321d7863ac IB/uverbs: Delete type and id from uverbs_obj_attr
In this context the uobject is not allowed to be NULL, so type is the same
as uobject->type, and at least for IDR, id is the same as uobject->id.

FD objects should never handle the FD number outside the uAPI boundary
code.

Suggested-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-22 09:02:59 -06:00
Mark Rutland
bfc18e389c atomics/treewide: Rename __atomic_add_unless() => atomic_fetch_add_unless()
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().

This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().
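
For readers unfamiliar with the operation, its semantics can be approximated in userspace with C11 atomics. The demo_ functions below are a sketch under that assumption and are not the kernel implementation.

```c
#include <stdatomic.h>

/*
 * Add a to *v unless *v equals u; return the value observed before the
 * attempt. Userspace approximation using a compare-and-exchange loop.
 */
int demo_fetch_add_unless(atomic_int *v, int a, int u)
{
    int old = atomic_load(v);

    while (old != u) {
        /* On failure, old is refreshed with the current value of *v. */
        if (atomic_compare_exchange_weak(v, &old, old + a))
            break;
    }
    return old;
}

/* The boolean form built on top of it: did the add happen? */
int demo_add_unless(atomic_int *v, int a, int u)
{
    return demo_fetch_add_unless(v, a, u) != u;
}
```

The fetch form returns the old value, which lets the boolean variant be expressed as a one-line wrapper; this layering is the relationship the rename makes visible in the naming.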

This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:

  ----
  git grep -w __atomic_add_unless | while read line; do
  sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
  done
  git grep -w __arch_atomic_add_unless | while read line; do
  sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
  done
  ----

Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:22:32 +02:00
Linus Torvalds
1abd8a8f39 4.18-rc
Regression and crashing bug fixes:
 
 - mlx4/5: Fixes for issues found from various checkers
 - A resource tracking and uverbs regression in the core code
 - qedr: NULL pointer regression found during testing
 - rxe: Various small bugs

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "Here are eight fairly small fixes collected over the last two weeks.

  Regression and crashing bug fixes:

   - mlx4/5: Fixes for issues found from various checkers

   - A resource tracking and uverbs regression in the core code

   - qedr: NULL pointer regression found during testing

   - rxe: Various small bugs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/rxe: Fix missing completion for mem_reg work requests
  RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
  IB/uverbs: Fix ordering of ucontext check in ib_uverbs_write
  IB/mlx4: Fix an error handling path in 'mlx4_ib_rereg_user_mr()'
  RDMA/qedr: Fix NULL pointer dereference when running over iWARP without RDMA-CM
  IB/mlx5: Fix return value check in flow_counters_set_data()
  IB/mlx5: Fix memory leak in mlx5_ib_create_flow
  IB/rxe: avoid double kfree skb
2018-06-21 07:22:30 +09:00
Yishai Hadas
7dc08dcfc8 IB/core: Expose ib_ucontext from a given ib_uverbs_file
Drivers that use the IOCTL API may have the ib_uverbs_file and need a
way to get the related ib_ucontext from it. This patch enables that.

Downstream patches from this series will use it.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Matan Barak
19b9def258 IB/uverbs: Allow an empty namespace in ioctl() framework
The ioctl parser framework wrongly assumed that each namespace is
populated. This could lead to NULL dereferences. Fix the parser to
always check that a given namespace indeed exists.

Fixes: fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Matan Barak
2d9c1bd7e1 IB/uverbs: Add a macro to define a type with no kernel known size
Sometimes the uverbs uAPI doesn't really care about the structure it gets
from user-space. All it wants to do is allocate enough space and send
it to the hardware/provider driver. Add a UVERBS_ATTR_MIN_SIZE macro that
can be used for these scenarios. We use USHRT_MAX as the kernel known
size to bypass any zero validations.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Matan Barak
8762d149e8 IB/uverbs: Add PTR_IN attributes that are allocated/copied automatically
Add the UVERBS_ATTR_SPEC_F_ALLOC_AND_COPY flag to PTR_IN attributes.
When this flag is set, the parser automatically allocates and copies the
user-space data. The handler can access this data through the
uverbs_attr_get_len and uverbs_attr_get_alloced_ptr inline accessor
functions.
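
The allocate-and-copy idea can be sketched in plain userspace C. This is an
assumption-laden illustration, not the uverbs parser: struct attr_sketch,
attr_alloc_and_copy() and attr_free() are hypothetical names, and malloc()
and memcpy() stand in for the kernel's allocation and user-copy routines.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute holding a private copy of caller data. */
struct attr_sketch {
	void *data;
	size_t len;
};

/* Allocate a buffer and copy 'len' bytes into it; returns 0 on success. */
static int attr_alloc_and_copy(struct attr_sketch *attr,
			       const void *src, size_t len)
{
	attr->data = malloc(len);
	if (!attr->data)
		return -1;

	memcpy(attr->data, src, len);
	attr->len = len;
	return 0;
}

/* Release the copy once the handler has finished with it. */
static void attr_free(struct attr_sketch *attr)
{
	free(attr->data);
	attr->data = NULL;
	attr->len = 0;
}
```

The point of the design is that the handler only sees a pointer and a
length, and never has to manage the copy itself.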

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Matan Barak
9442d8bf1d IB/uverbs: Refactor uverbs_finalize_objects
uverbs_finalize_objects is currently used only to commit or abort
objects. Since we want to add automatic allocation/free of PTR_IN
attributes, move it to uverbs_ioctl.c and rename it to
uverbs_finalize_attrs.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Matan Barak
1114b0a8a8 IB/uverbs: Export uverbs idr and fd types
Provider drivers may use the UVERBS_ATTR_FD and UVERBS_ATTR_IDR macros,
so the types they reference need to be exported.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 10:53:02 -06:00
Steve Wise
33023fb85a IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths.  For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16.  Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scatter-gather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.
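
The arithmetic in the example above is a ceiling division. A minimal sketch,
using a hypothetical helper name and assuming max_sge is non-zero:

```c
/* Hypothetical helper: work requests needed to cover nr_sge entries. */
static unsigned int wrs_needed(unsigned int nr_sge, unsigned int max_sge)
{
	/* Round up so that a partial final WR is still counted. */
	return (nr_sge + max_sge - 1) / max_sge;
}
```

With 16 entries, a limit of 4 gives 4 WRs and a limit of 16 gives 1 WR,
matching the numbers quoted above.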

Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:17:28 -06:00
Bharat Potnuri
7350cdd025 RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
A few kernel applications, such as SCST-iSER, create CQs using ib_create_cq().
Accessing these CQ structures with the rdma restrack tool leads to the NULL
pointer dereference below. This patch saves the caller's kernel module name,
similar to ib_alloc_cq().

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8132ca70>] skip_spaces+0x30/0x30
PGD 738bac067 PUD 8533f0067 PMD 0
Oops: 0000 [#1] SMP
R10: ffff88017fc03300 R11: 0000000000000246 R12: 0000000000000000
R13: ffff88082fa5a668 R14: ffff88017475a000 R15: 0000000000000000
FS:  00002b32726582c0(0000) GS:ffff88087fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000008491a1000 CR4: 00000000003607e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffffc05af69c>] ? fill_res_name_pid+0x7c/0x90 [ib_core]
 [<ffffffffc05af79f>] fill_res_cq_entry+0xef/0x170 [ib_core]
 [<ffffffffc05af4c4>] res_get_common_dumpit+0x3c4/0x480 [ib_core]
 [<ffffffffc05af5d3>] nldev_res_get_cq_dumpit+0x13/0x20 [ib_core]
 [<ffffffff815bc1e7>] netlink_dump+0x117/0x2e0
 [<ffffffff815bcb8b>] __netlink_dump_start+0x1ab/0x230
 [<ffffffffc059fead>] ibnl_rcv_msg+0x11d/0x1f0 [ib_core]
 [<ffffffffc05af5c0>] ? nldev_res_get_mr_dumpit+0x20/0x20 [ib_core]
 [<ffffffffc059fd90>] ? rdma_nl_multicast+0x30/0x30 [ib_core]
 [<ffffffff815bea49>] netlink_rcv_skb+0xa9/0xc0
 [<ffffffffc05a0018>] ibnl_rcv+0x98/0xb0 [ib_core]
 [<ffffffff815be132>] netlink_unicast+0xf2/0x1b0
 [<ffffffff815be50f>] netlink_sendmsg+0x31f/0x6a0
 [<ffffffff8156b580>] sock_sendmsg+0xb0/0xf0
 [<ffffffff816ace9e>] ? _raw_spin_unlock_bh+0x1e/0x20
 [<ffffffff8156f998>] ? release_sock+0x118/0x170
 [<ffffffff8156b731>] SYSC_sendto+0x121/0x1c0
 [<ffffffff81568340>] ? sock_alloc_file+0xa0/0x140
 [<ffffffff81221265>] ? __fd_install+0x25/0x60
 [<ffffffff8156c2ce>] SyS_sendto+0xe/0x10
 [<ffffffff816b6c2a>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffff8132ca70>] skip_spaces+0x30/0x30
RSP <ffff88072be97760>
CR2: 0000000000000000

Cc: <stable@vger.kernel.org>
Fixes: f66c8ba4c9 ("RDMA/core: Save kernel caller name when creating PD and CQ objects")
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:32:58 -06:00
willy@infradead.org
9a41e38a46 IB/mad: Use IDR for agent IDs
Allocate agent IDs from a global IDR instead of an atomic variable.
This eliminates the possibility of reusing an ID which is already in
use after 4 billion registrations.  We limit the assigned ID to be less
than 2^24 as the mlx4 driver uses the most significant byte of the agent
ID to store the slave number.  Users unlucky enough to see a collision
between agent numbers and slave numbers see messages like:

 mlx4_ib: egress mad has non-null tid msb:1 class:4 slave:0

and the MAD layer stops working.

We look up the agent under protection of the RCU lock, which means we
have to free the agent using kfree_rcu, and only increment the reference
counter if it is not 0.
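
The 2^24 limit described above can be illustrated with a small sketch. It
assumes, as stated for mlx4, that the most significant byte of a 32-bit ID
carries the slave number. The helper names are hypothetical and are not
taken from the driver.

```c
#include <stdint.h>

#define AGENT_ID_LIMIT_SKETCH (1u << 24)	/* IDs must stay below 2^24 */

/* True when the ID leaves the most significant byte free. */
static int id_fits(uint32_t id)
{
	return id < AGENT_ID_LIMIT_SKETCH;
}

/* The byte that the text above says is used for the slave number. */
static uint8_t id_slave_byte(uint32_t id)
{
	return (uint8_t)(id >> 24);
}

/* Combine a slave number with an ID that fits in the low 24 bits. */
static uint32_t id_with_slave(uint32_t id, uint8_t slave)
{
	return ((uint32_t)slave << 24) | id;
}
```

An ID of 0x01000000 does not fit: its top byte already reads as slave 1,
which is the kind of collision the message describes.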

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Tested-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:22:54 -06:00
Jason Gunthorpe
1a1f460ff1 RDMA: Hold the sgid_attr inside the struct ib_ah/qp
If the AH has a GRH then hold a reference to the sgid_attr inside the
common struct.

If the QP is modified with an AV that includes a GRH then also hold a
reference to the sgid_attr inside the common struct.

This informs the cache that the sgid_index is in-use so long as the AH or
QP using it exists.

This also means that all drivers can access the sgid_attr directly from
the ah_attr instead of querying the cache during their UD post-send paths.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 11:11:27 -06:00
Jason Gunthorpe
d97099fe53 IB{cm, core}: Introduce and use ah_attr copy, move, replace APIs
Introduce AH attribute copy, move and replace APIs to be used by core and
provider drivers.

In the CM code flow, the ah attribute might be re-initialized twice while
processing an incoming request, or initialized once from the path record
while sending out CM requests. Therefore use the rdma_move_ah_attr API to
handle such scenarios instead of memcpy().

Provider drivers keep a copy of the ah_attr during the lifetime of the ah.
Therefore, use rdma_replace_ah_attr(), which conditionally releases the
reference to the old ah_attr and holds a reference to the new attribute;
that reference is released when the AH is freed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 11:11:26 -06:00
Parav Pandit
947c99ecfc IB/core: Tidy ib_resolve_eth_dmac
No reason to call rdma_ah_retrieve_grh, tidy whitespace, and add a
function comment block.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 11:11:26 -06:00
Jason Gunthorpe
8d9ec9addd IB/core: Add a sgid_attr pointer to struct rdma_ah_attr
The sgid_attr will ultimately replace the sgid_index in the ah_attr.
This will allow for all layers to have a consistent view of what
gid table entry was selected as processing runs through all stages of the
stack.

This commit introduces the pointer and ensures it is set before calling
any driver callback that takes a struct ah_attr, allowing
future patches to adjust both the drivers and the callers to use
sgid_attr instead of sgid_index.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 11:11:26 -06:00
Parav Pandit
fb51eecaa5 IB: Ensure that all rdma_ah_attr's are zero initialized
Since we are adding some new fields to this structure it is safest if all
users reliably initialize the struct to zero.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 11:11:26 -06:00
Matthew Wilcox
0c271c433c IB/mad: Agent registration is process context only
Document this (it's implicitly true due to sleeping operations already
in use in both registration and deregistration).  Use this fact to use
spin_lock_irq instead of spin_lock_irqsave.  This improves performance
slightly.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Leon Romanovsky
de7498147d RDMA/uverbs: Refactor flow_resources_alloc() function
Simplify the flow_resources_alloc() function by reducing the
number of goto statements.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Leon Romanovsky
dd8028f1e9 RDMA/nldev: Return port capability flag for IB only
Port capability flag represents IBTA PortInfo:CapabilityMask,
but was mistakenly mixed with non-relevant fields. Return that
information for IB only.

Link: https://patchwork.kernel.org/patch/10386245/
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
1dfce29457 IB: Replace ib_query_gid/ib_get_cached_gid with rdma_query_gid
If the gid_attr argument is NULL then the functions behave identically to
rdma_query_gid. ib_query_gid just calls ib_get_cached_gid, so everything
can be consolidated to one function.

Now that all callers either use rdma_query_gid() or ib_get_cached_gid(),
ib_query_gid() API is removed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Jason Gunthorpe
83f6f8d29d IB/core: Make rdma_find_gid_by_filter support all protocols
There is no reason to restrict this function to RoCE only these days;
allow the filter function to be called on any protocol.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Jason Gunthorpe
c3d71b69a7 IB/core: Provide rdma_ versions of the gid cache API
These versions are functionally similar but all return gid_attrs and
related information via reference instead of via copy.

The old API is preserved, implemented as wrappers around the new, until
all callers can be converted.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
77e786fcbe IB/core: Replace ib_query_gid with rdma_get_gid_attr
These call sites have a use of ib_query_gid with a simple lifetime for the
struct gid_attr pointer, with an easy conversion.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
bf399c2cad IB/core: Introduce GID attribute get, put and hold APIs
This patch introduces three APIs, rdma_get_gid_attr(),
rdma_put_gid_attr(), and rdma_hold_gid_attr() which expose the reference
counting for GID table entries to the entire stack. The kref counting is
based on the struct ib_gid_attr pointer.

Later patches will convert more cache query functions to return struct
ib_gid_attrs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
f4df9a7c34 RDMA: Use GID from the ib_gid_attr during the add_gid() callback
Now that ib_gid_attr contains the GID, make use of that in the add_gid()
callback functions for the provider drivers to simplify the add_gid()
implementations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
b150c3862d IB/core: Introduce GID entry reference counts
In order to be able to expose pointers to the ib_gid_attrs in the GID
table we need to make it so the value of the pointer cannot be
changed. Thus each GID table entry gets a unique piece of kref'd memory
that is written only during initialization and remains constant for its
lifetime.

This eventually will allow the struct ib_gid_attrs to be returned without
copy from many of the query APIs, but it also provides a way to track when
all users of a HW table index go away.

For roce we no longer allow an in-use HW table index to be re-used for a
new and different entry. When a GID table entry needs to be removed it is
hidden from the find API, but remains as a valid HW index and all
ib_gid_attr pointers remain valid. The HW index is not released until all
users put the kref.

Later patches will broadly replace the use of the sgid_index integer with
the kref'd structure.

Ultimately this will prevent security problems where the OS changes the
properties of a HW GID table entry while an active user object is still
using the entry.
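
The lifetime rule described above (hide on removal, release the HW index
only on the last put) can be modelled in a few lines. This is a simplified
sketch with hypothetical names. It uses a plain int where real code would
need an atomic kref, and it is not the cache implementation.

```c
#include <stdbool.h>

/* Hypothetical model of one GID table entry; not the kernel structure. */
struct gid_entry_sketch {
	int refs;		/* plain int for brevity, not thread safe */
	bool hidden;		/* true once removed from lookups */
	bool hw_index_busy;	/* true while the HW index is reserved */
};

static void entry_init(struct gid_entry_sketch *e)
{
	e->refs = 1;		/* reference owned by the table itself */
	e->hidden = false;
	e->hw_index_busy = true;
}

/* Lookup path: hidden entries can no longer be found. */
static bool entry_get(struct gid_entry_sketch *e)
{
	if (e->hidden)
		return false;
	e->refs++;
	return true;
}

/* The HW index is released only when the last reference goes away. */
static void entry_put(struct gid_entry_sketch *e)
{
	if (--e->refs == 0)
		e->hw_index_busy = false;
}

/* Removal hides the entry and drops the table's own reference. */
static void entry_remove(struct gid_entry_sketch *e)
{
	e->hidden = true;
	entry_put(e);
}
```

A user that took a reference before entry_remove() keeps the HW index
reserved until it calls entry_put().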

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
1c36cf912a IB/core: Store default GID property per-table instead of per-entry
There are at most one or two default GIDs for RoCE. Instead of storing
a default GID property for all the GIDs, store default GID indices as
individual bits per table.

This allows a future simplification to get rid of the GID property field.
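
As a rough illustration of a per-table bitmask, the sketch below keeps one
bit per GID index. The struct and function names are hypothetical, and idx
is assumed to be smaller than the number of bits in unsigned long.

```c
#include <stdbool.h>

/* Hypothetical table: bit i set means index i holds a default GID. */
struct gid_table_sketch {
	unsigned long default_gid_indices;
};

static void table_mark_default(struct gid_table_sketch *t, unsigned int idx)
{
	t->default_gid_indices |= 1UL << idx;
}

static bool table_is_default(const struct gid_table_sketch *t,
			     unsigned int idx)
{
	return (t->default_gid_indices >> idx) & 1UL;
}
```

One word per table replaces a flag stored in every entry.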

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-17 22:32:04 -06:00
Parav Pandit
a1a4caeeba IB/core: Do not set the gid type when reserving default entries
When default GIDs are added, their gid type is set by
ib_cache_gid_set_default_gid().  There is no need to set the gid type of a
free GID entry during GID table initialization.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-17 22:32:04 -06:00
Kees Cook
fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
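
To show what the array_size() wrapping buys, here is a userspace sketch of
an overflow-checked multiply that saturates to SIZE_MAX. The name
array_size_sketch() is hypothetical, and __builtin_mul_overflow() assumes a
GCC or Clang compiler. It is not the kernel's definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Return a * b, or SIZE_MAX if the product does not fit in size_t. */
static size_t array_size_sketch(size_t a, size_t b)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;

	return bytes;
}
```

A saturated size makes the allocation fail instead of silently returning a
buffer that is smaller than the caller expects.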

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2-factor products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a, b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kcalloc(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
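
The benefit of a two-factor form is that the allocator sees both factors and
can reject a product that overflows. A userspace sketch of that idea
follows. zalloc_array_sketch() is a hypothetical name, calloc() stands in
for the kernel allocator, and __builtin_mul_overflow() assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Zeroed allocation of n elements of 'size' bytes; NULL on overflow. */
static void *zalloc_array_sketch(size_t n, size_t size)
{
	size_t bytes;

	/* Refuse the request instead of allocating a truncated size. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;

	return calloc(1, bytes);
}
```

With an open-coded n * size the wrapped product would reach the allocator
unnoticed; here the caller gets NULL.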

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2-factor products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a, b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
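
For the three-factor case, the same saturating idea extends naturally. The
sketch below uses the hypothetical name array3_size_sketch() and assumes GCC
or Clang for __builtin_mul_overflow(). It illustrates the intent of
array3_size() and is not its kernel definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Return a * b * c, or SIZE_MAX if any step of the product overflows. */
static size_t array3_size_sketch(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	if (__builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;

	return bytes;
}
```

Checking each step matters because the first product can fit while the
second one overflows.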

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Matthew Wilcox
7654cb1ba7 Convert infiniband uverbs to struct_size
The flows were hidden from the C compiler; expose them as a zero-length
array to allow struct_size to work.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Jason Gunthorpe
1eb9364ce8 IB/uverbs: Fix ordering of ucontext check in ib_uverbs_write
During disassociation the ucontext will become NULL, however due to how
the SRCU locking works the ucontext must only be examined after looking
at the ib_dev, which governs the RCU control flow.

With the wrong ordering userspace will see EINVAL instead of EIO for a
disassociated uverbs FD, which breaks rdma-core.

Cc: stable@vger.kernel.org
Fixes: 491d5c6a30 ("RDMA/uverbs: Move uncontext check before SRCU read lock")
Reported-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-12 14:39:32 -06:00
Linus Torvalds
a1cdde8c41 4.18 Merge window pull request
This has been a quiet cycle for RDMA, the big bulk is the usual smallish
 driver updates and bug fixes. About four new uAPI related things. Not as many
 syzkaller patches this time; the bugs it finds are getting harder to fix.
 
 - More work cleaning up the RDMA CM code
 - Usual driver bug fixes and cleanups for qedr, qib, hfi1, hns, i40iw, iw_cxgb4, mlx5, rxe
 - Driver specific resource tracking and reporting via netlink
 - Continued work for name space support from Parav
 - MPLS support for the verbs flow steering uAPI
 - A few tricky IPoIB fixes improving robustness
 - HFI1 driver support for the '16B' management packet format
 - Some auditing to not print kernel pointers via %llx or similar
 - Mark the entire 'UCM' user-space interface as BROKEN with the intent to remove it
   entirely. The user space side of this was long ago replaced with RDMA-CM and
   syzkaller is finding bugs in the residual UCM interface nobody wishes to fix because
   nobody uses it.
 - Purge more bogus BUG_ON's from Leon
 - 'flow counters' verbs uAPI
 - T10 fixups for iser/isert, these are Acked by Martin but going through the RDMA
   tree due to dependencies

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a quiet cycle for RDMA, the big bulk is the usual
  smallish driver updates and bug fixes. About four new uAPI related
  things. Not as many syzkaller patches this time; the bugs it finds are
  getting harder to fix.

  Summary:

   - More work cleaning up the RDMA CM code

   - Usual driver bug fixes and cleanups for qedr, qib, hfi1, hns,
     i40iw, iw_cxgb4, mlx5, rxe

   - Driver specific resource tracking and reporting via netlink

   - Continued work for name space support from Parav

   - MPLS support for the verbs flow steering uAPI

   - A few tricky IPoIB fixes improving robustness

   - HFI1 driver support for the '16B' management packet format

   - Some auditing to not print kernel pointers via %llx or similar

   - Mark the entire 'UCM' user-space interface as BROKEN with the
     intent to remove it entirely. The user space side of this was long
     ago replaced with RDMA-CM and syzkaller is finding bugs in the
     residual UCM interface nobody wishes to fix because nobody uses it.

   - Purge more bogus BUG_ON's from Leon

   - 'flow counters' verbs uAPI

   - T10 fixups for iser/isert, these are Acked by Martin but going
     through the RDMA tree due to dependencies"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (138 commits)
  RDMA/mlx5: Update SPDX tags to show proper license
  RDMA/restrack: Change SPDX tag to properly reflect license
  IB/hfi1: Fix comment on default hdr entry size
  IB/hfi1: Rename exp_lock to exp_mutex
  IB/hfi1: Add bypass register defines and replace blind constants
  IB/hfi1: Remove unused variable
  IB/hfi1: Ensure VL index is within bounds
  IB/hfi1: Fix user context tail allocation for DMA_RTAIL
  IB/hns: Use zeroing memory allocator instead of allocator/memset
  infiniband: fix a possible use-after-free bug
  iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency
  IB/isert: use T10-PI check mask definitions from core layer
  IB/iser: use T10-PI check mask definitions from core layer
  RDMA/core: introduce check masks for T10-PI offload
  IB/isert: fix T10-pi check mask setting
  IB/mlx5: Add counters read support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add counters create and destroy support
  IB/uverbs: Add support for flow counters
  ...
2018-06-07 13:04:07 -07:00
Linus Torvalds
2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Kees Cook
acafe7e302 treewide: Use struct_size() for kmalloc()-family
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
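
The arithmetic behind such a helper can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel's struct_size() macro: flex_struct_size() and foo_size() are hypothetical names, and the GCC/Clang overflow builtins are assumed to be available. The point is that both the multiplication and the addition are checked, and the result saturates to SIZE_MAX so the allocation fails rather than being undersized.

```c
#include <stddef.h>
#include <stdint.h>

struct foo {
	int stuff;
	void *entry[];	/* flexible array member */
};

/* Hypothetical helper: size of a struct with `count` trailing elements of
 * `elem_size` bytes each, saturating to SIZE_MAX on overflow. */
size_t flex_struct_size(size_t base, size_t elem_size, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem_size, count, &bytes) ||
	    __builtin_add_overflow(bytes, base, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Size needed for a struct foo carrying `count` entries. */
size_t foo_size(size_t count)
{
	return flex_struct_size(sizeof(struct foo), sizeof(void *), count);
}
```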

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06 11:15:43 -07:00
Leon Romanovsky
33edc3b2db RDMA/restrack: Change SPDX tag to properly reflect license
Resource tracking is supposed to be dual licensed: GPL-2.0 and
OpenIB, but the SPDX tag was not compliant to it. Update the tag to
properly reflect license.

Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-05 14:04:20 -06:00
Cong Wang
cb2595c139 infiniband: fix a possible use-after-free bug
ucma_process_join() will free the newly allocated "mc" struct
if there is any error after that, especially in copy_to_user().

But in parallel, ucma_leave_multicast() could find this "mc"
through idr_find() before ucma_process_join() frees it, since it
is already published.

So "mc" could be used in ucma_leave_multicast() after it has been
allocated and freed in ucma_process_join(), since we don't refcnt
it.

Fix this by separating "publish" from ID allocation, so that we
can get an ID first and publish it later after copy_to_user().
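
The reserve-then-publish idea can be sketched with a toy, single-threaded ID table in userspace C. All names here (struct id_table, id_reserve(), id_publish(), id_find(), id_release()) are hypothetical and this is not the ucma code; a real implementation would also need locking around the table.

```c
#include <stddef.h>

#define TABLE_SLOTS 8

/* Hypothetical ID table: a slot can be reserved (ID handed out) while its
 * object pointer is still hidden from lookups. */
struct id_table {
	int   reserved[TABLE_SLOTS];
	void *object[TABLE_SLOTS];
};

/* Reserve an ID without making any object visible; -1 if the table is full. */
int id_reserve(struct id_table *t)
{
	for (int id = 0; id < TABLE_SLOTS; id++) {
		if (!t->reserved[id]) {
			t->reserved[id] = 1;
			t->object[id] = NULL;
			return id;
		}
	}
	return -1;
}

/* Make the object findable; call only after every fallible step succeeded. */
void id_publish(struct id_table *t, int id, void *obj)
{
	t->object[id] = obj;
}

/* Lookup returns NULL for unknown or reserved-but-unpublished IDs. */
void *id_find(const struct id_table *t, int id)
{
	if (id < 0 || id >= TABLE_SLOTS || !t->reserved[id])
		return NULL;
	return t->object[id];
}

/* Drop a reservation, e.g. on an error path before publication. */
void id_release(struct id_table *t, int id)
{
	t->reserved[id] = 0;
	t->object[id] = NULL;
}
```

Because a lookup cannot see the object until id_publish() runs, an error path that frees the object before publication cannot race with a concurrent finder.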

Fixes: c8f6a362bf ("RDMA/cma: Add multicast communication support")
Reported-by: Noam Rathaus <noamr@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-04 09:37:03 -06:00
Jason Gunthorpe
0f45e69d62 Verbs flow counters support
This series comes to allow user space applications to monitor real time
 traffic activity and events of the verbs objects it manages, e.g.:
 ibv_qp, ibv_wq, ibv_flow.
 
 This API enables generic counters creation and defines their mapping
 to a verbs object; the current mlx5 driver uses
 this API for flow counters.
 
 With this API, an application can monitor the entire life cycle of
 object activity, defined here as a static counters attachment.
 This API also allows dynamic counters monitoring of measurement points
 for a partial period in the verbs object life cycle.
 
 In addition it presents the implementation of the generic counters interface.
 
 This will be achieved by extending flow creation by adding a new flow count
 specification type which allows the user to associate a previously created
 flow counters using the generic verbs counters interface to the created flow,
 once associated the user could read statistics by using the read function of
 the generic counters interface.
 
 The API includes:
 1. create and destroy API for new counters objects
 2. read the counters values from HW
 
 Note:
 Attaching API to allow application to define the measurement points per objects
 is a user space only API and this data is passed to kernel when the counted
 object (e.g. flow) is created with the counters object.

Merge tag 'verbs_flow_counters' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git into for-next

Pull verbs counters series from Leon Romanovsky:

====================
Verbs flow counters support

This series comes to allow user space applications to monitor real time
traffic activity and events of the verbs objects it manages, e.g.: ibv_qp,
ibv_wq, ibv_flow.

The API enables generic counters creation and defines their mapping to
a verbs object; the current mlx5 driver is using this API
for flow counters.

With this API, an application can monitor the entire life cycle of object
activity, defined here as a static counters attachment.  This API also
allows dynamic counters monitoring of measurement points for a partial
period in the verbs object life cycle.

In addition it presents the implementation of the generic counters
interface.

This will be achieved by extending flow creation by adding a new flow
count specification type which allows the user to associate a previously
created flow counters using the generic verbs counters interface to the
created flow, once associated the user could read statistics by using the
read function of the generic counters interface.

The API includes:
1. create and destroy API for new counters objects
2. read the counters values from HW

Note:
Attaching API to allow application to define the measurement points per
objects is a user space only API and this data is passed to kernel when
the counted object (e.g. flow) is created with the counters object.
===================

* tag 'verbs_flow_counters':
  IB/mlx5: Add counters read support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add counters create and destroy support
  IB/uverbs: Add support for flow counters
  IB/core: Add support for flow counters
  IB/core: Support passing uhw for create_flow
  IB/uverbs: Add read counters support
  IB/core: Introduce counters read verb
  IB/uverbs: Add create/destroy counters support
  IB/core: Introduce counters object and its create/destroy
  IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
  net/mlx5: Export flow counter related API
  net/mlx5: Use flow counter pointer as input to the query function
2018-06-04 08:48:11 -06:00
Raed Salem
b6ba4a9aa5 IB/uverbs: Add support for flow counters
The struct ib_uverbs_flow_spec_action_count associates a counters object
with the flow.

After this association, the flow counters can be read via the counters
object.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:56 +03:00
Matan Barak
59082a327d IB/core: Support passing uhw for create_flow
This is required when user-space drivers need to pass extra information
regarding how to handle this flow steering specification.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:55 +03:00
Raed Salem
ebb6796bd3 IB/uverbs: Add read counters support
This patch exposes the read counters verb to user space applications. With
that verb the user can read the hardware counters which are associated
with the counters object.

The application needs to provide sufficient memory to hold the
statistics.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:55 +03:00
Raed Salem
d9a5a6441e IB/uverbs: Add create/destroy counters support
A user space application which uses the counters functionality is expected to
allocate/release the counters resources by calling the create/destroy verbs
and in turn gets a unique handle that can be used to attach the counters to
its counted type.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:54 +03:00
Matan Barak
3efa38125b IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
Previously, the user had to dig inside the attribute to get the uobject.
Add a helper function that correctly extracts it (and does the required
checks) for him/her.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:53 +03:00
Leon Romanovsky
2468b82d69 RDMA/mad: Convert BUG_ONs to error flows
Let's perform checks in-place instead of BUG_ONs.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-06-01 11:16:24 -04:00
Leon Romanovsky
dee92c4bf5 RDMA/mad: Delete inaccessible BUG_ON
There is no need to check existence of mad_queue, because we already did
pointer dereference before call to dequeue_mad().

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-06-01 11:16:23 -04:00
Leon Romanovsky
671a6cc2ba RDMA/cma: Ignore unknown event
There is no need to bring down the whole machine just because an unknown
event was received. It is better to ignore it silently.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-06-01 11:16:23 -04:00
Leon Romanovsky
2f5059a7af RDMA/cm: Abort loop in case of CM dequeue
In case the CM work list is empty, the work pointer will be NULL,
so instead of crashing the kernel it is better to abort processing
of works.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-06-01 11:16:23 -04:00
Wei Hu(Xavier)
a0976f418d RDMA/uverbs: Hoist the common process of disassociate_ucontext into ib core
This patch hoists the common process of the disassociate_ucontext
callback function into ib core code, since this code is common
to every ib_device driver.

Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-30 20:45:03 -04:00
Jason Gunthorpe
5ef8c0c180 RDMA/core: Remove indirection through ib_cache_setup()
This once might have made sense when cache.c was in a different module
from device.c, but today it is just obfuscation. Get rid of the wrappers
and call roce_gid_mgmt_init()/cleanup() directly.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
2018-05-29 15:19:31 -06:00
Parav Pandit
a840c93ca7 IB/core: Fix error code for invalid GID entry
When a GID entry is invalid, EAGAIN is returned. This is an incorrect error
code, since there is nothing that will make this GID entry valid again in
bounded time.

Some user space tools fail incorrectly if EAGAIN is returned here, and
this represents a small ABI change from earlier kernels.

The first patch in the Fixes list makes entries that were previously valid
become invalid, allowing this code to trigger, while the second patch
in the Fixes list introduced the wrong EAGAIN.

Therefore revert the return result to EINVAL which matches the historical
expectations of the ibv_query_gid_type() API of the libibverbs user space
library.

Cc: <stable@vger.kernel.org>
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-29 12:08:37 -06:00
Jason Gunthorpe
0394808d9e Merge branch 'mr_fix' into git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma for-next
Update mlx4 to support user MR creation against read-only memory; previously
it required the memory to be writable.

Based on rdma for-rc due to dependencies.

* mr_fix: (2 commits)
  IB/mlx4: Mark user MR as writable if actual virtual memory is writable
  IB/core: Make testing MR flags for writability a static inline function
2018-05-28 11:44:35 -06:00
Jack Morgenstein
08bb558ac1 IB/core: Make testing MR flags for writability a static inline function
Make the MR writability flags check, which is performed in umem.c,
a static inline function in ib_verbs.h.

This allows the function to be used by low-level infiniband drivers.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-05-28 11:41:39 -06:00
Parav Pandit
724631a9c6 IB/core: Introduce and use rdma_gid_table()
There are several places where a gid table is accessed.
Add a small helper function, rdma_gid_table(), to avoid code
duplication at such places.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-24 09:39:25 -06:00
Parav Pandit
25e62655c7 IB/core: Reduce the places that use zgid
Instead of open coding memcmp() to check whether a given GID is zero or
not, use a helper function to do so, and replace instances of
memcpy(z,&zgid) with memset.
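
A userspace sketch of the two helpers this describes might look like the following. The names struct gid, gid_is_zero() and gid_clear() are hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit GID. */
struct gid {
	uint8_t raw[16];
};

/* True when every byte of the GID is zero; callers no longer need to
 * memcmp() against a shared all-zero constant themselves. */
bool gid_is_zero(const struct gid *gid)
{
	static const struct gid zero = { {0} };	/* private to the helper */

	return memcmp(gid, &zero, sizeof(*gid)) == 0;
}

/* Clear a GID with memset() instead of copying from a zero constant. */
void gid_clear(struct gid *gid)
{
	memset(gid, 0, sizeof(*gid));
}
```

Keeping the all-zero constant private to one helper means fewer call sites depend on a global symbol.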

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-24 09:39:25 -06:00
Leon Romanovsky
7a8690ed6f RDMA/ucm: Mark UCM interface as BROKEN
In commit 357d23c811a7 ("Remove the obsolete libibcm library")
in rdma-core [1], we removed the obsolete library which used the
/dev/infiniband/ucmX interface.

Following multiple syzkaller reports about non-sanitized
user input in the UCMA module, a short audit reveals the same
issues in the UCM module too.

It is better to disable this interface in the kernel
before the syzkaller team invests time and energy to harden
this unused interface.

[1] https://github.com/linux-rdma/rdma-core/pull/279

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-24 09:39:25 -06:00
Parav Pandit
9906224f60 IB/core: Remove duplicate declaration of gid_cache_wq
Remove duplicate declaration of gid_cache_wq.

Fixes: d41861942 ("IB/core: Add generic function to extract IB speed from netdev")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-24 09:39:25 -06:00
Steve Wise
fbdb0a9181 RDMA/CMA: add rdma_iw_cm_id() and rdma_res_to_id() helpers
Add a helper function for iwarp drivers to be able to map an
rdma_cm_id to an iw_cm_id.  This is useful for dumping driver specific
NLDEV/RESTRACK connection state.

Add a helper to return the rdma_cm_id pointer from the rdma_restrack
pointer.  This is needed for rdma drivers to map a res entry back to
the public rdma_cm_id struct.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-22 14:32:30 -04:00
Ariel Levkovich
b04f0f036a IB/uverbs: Introduce a MPLS steering match filter
Add a new MPLS steering match filter that can match against
a single MPLS tag field.

Since the MPLS header can reside in different locations in the packet's
protocol stack as well as be encapsulated with a tunnel protocol, it
is required to know the exact location of the header in the protocol
stack.

Therefore, when including the MPLS protocol spec in the specs list,
it is mandatory to provide the list in an ordered manner, so
that it represents the actual header order in a matching packet.

Drivers that process the spec list and apply the matching rule
should treat the position of the MPLS spec in the spec list as the
actual location of the MPLS label in the packet's protocol stack.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:55 -06:00
Ariel Levkovich
d90e5e5038 IB/uverbs: Introduce a GRE steering match filter
Add a new GRE steering match filter that can match against
key and protocol fields.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:54 -06:00
Parav Pandit
e822ff213f IB/cm: Store and restore ah_attr during CM message processing
During the CM request processing flow, ah_attr is initialized twice:
first based on the wc, and second based on the primary path record.
ah_attr initialization from the path record can fail, which leaves ah_attr
zeroed out.

Therefore, always initialize ah_attr on the stack during the reinitialization
phase. If the ah_attr init is successful, use the new ah_attr by
overwriting the old one. If the ah_attr init fails, continue to use the
last ah_attr.
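
The pattern described here, building the new value in a temporary and committing it only on success, can be sketched in userspace C as below. struct attr, attr_init_from_path() and attr_reinit() are hypothetical names for illustration; the real ah_attr type and its initializers are not reproduced.

```c
#include <string.h>

/* Hypothetical stand-in for the address-handle attributes. */
struct attr {
	int port;
	int sl;
};

/* Hypothetical initializer that may fail; on failure the output has
 * already been zeroed, mimicking the problem described above. */
int attr_init_from_path(struct attr *out, int port, int sl)
{
	memset(out, 0, sizeof(*out));
	if (port <= 0)
		return -1;
	out->port = port;
	out->sl = sl;
	return 0;
}

/* Reinitialize *cur: build the new value in a temporary and overwrite the
 * old one only on success, so a failure leaves *cur untouched. */
int attr_reinit(struct attr *cur, int port, int sl)
{
	struct attr tmp;
	int ret = attr_init_from_path(&tmp, port, sl);

	if (ret)
		return ret;
	*cur = tmp;
	return 0;
}
```

Had attr_init_from_path() been called directly on the live structure, a failed call would have left it zeroed; the temporary keeps the last good value intact.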

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 14:11:46 -06:00
Parav Pandit
0e225dcb76 IB/cm: Store and restore ah_attr during LAP msg processing
During CM LAP processing, ah_attr is reinitialized on receiving a LAP
request; it was most likely first initialized during CM request processing.

ah_attr might get zeroed out if LAP processing fails.
Therefore, attempt to create a new ah_attr for the LAP message.
If the initialization fails, continue with the older ah_attr.
If the initialization succeeds, use the new ah_attr by overwriting
the older one.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 14:11:46 -06:00
Parav Pandit
a5c57d3272 IB/cm: Avoid AV ah_attr overwriting during LAP message handling
The AH attribute of the cm_id can be overwritten if a LAP message is
received on a CM request which is in progress. This bug was introduced by
the commit in the Fixes tag, which avoided sleeping while a spin lock is held.

Therefore, validate the cm_id state first and only then perform the AV
ah_attr initialization.

Given that Alternative path related messages are not supported for
RoCE, init_av_from_response/path for such messages are OK to be called
from blocking context.

Fixes: 33f93e1ebc ("IB/cm: Fix sleeping while spin lock is held")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 14:11:46 -06:00
Lidong Chen
8e907ed488 IB/umem: Use the correct mm during ib_umem_release
User-space may invoke ibv_reg_mr and ibv_dereg_mr in different threads.

If ibv_dereg_mr is called after the thread which invoked ibv_reg_mr has
exited, get_pid_task will return NULL and ib_umem_release will not
decrease mm->pinned_vm.

Instead of using threads to locate the mm, use the overall tgid from the
ib_ucontext struct. This matches the behavior of ODP and
disassociate in handling the mm of the process that called ibv_reg_mr.

Cc: <stable@vger.kernel.org>
Fixes: 87773dd56d ("IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get")
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-15 17:09:10 -06:00
Yuval Shaia
aec05afe64 IB/core: Remove redundant return
"return" statement at the end of void function is redundant, removing
it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-15 16:22:02 -06:00
Steve Wise
e6125a254d RDMA/NLDEV: remove mr iova attribute
Remove mr iova attribute because we don't want to pass up kernel pointers.

Fixes: fccec5b89a ("RDMA/nldev: provide detailed MR information")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-15 16:17:38 -06:00
Doug Ledford
f5e27a203f Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
Several items of conflict have arisen between the RDMA stack's for-rc
branch and upcoming for-next work:

9fd4350ba8 ("IB/rxe: avoid double kfree_skb") directly conflicts with
2e47350789 ("IB/rxe: optimize the function duplicate_request")

Patches already submitted by Intel for the hfi1 driver will fail to
apply cleanly without this merge

Other people on the mailing list have notified that their upcoming
patches also fail to apply cleanly without this merge

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 15:48:48 -04:00
Parav Pandit
be0e8f34b6 IB/core: Reuse gid_table_release_one() in table allocation failure
_gid_table_setup_one() only performs GID table cache memory allocation,
marks entries as invalid (free) and marks the reserved entries.
At this point GID table is empty and no entries are added.

On dual port device if _gid_table_setup_one() fails to allocate the gid
table for 2nd port, there is no need to perform cleanup_gid_table_port()
to delete GID entries, as GID table is empty.
Therefore make use of existing gid_table_release_one() routine which
frees the GID table memory and avoid code duplication.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 12:08:21 -04:00
Parav Pandit
25a1cd3fe5 IB/core: Make gid_table_reserve_default() return void
gid_table_reserve_default() always returns zero. Make it return void and
simplify error checking.

rdma_port is already calculated, use that while calling
gid_table_reserve_default() instead of recalculating it.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 12:08:21 -04:00
Steve Wise
73937e8a03 RDMA/nldev: helper functions to add driver attributes
These help rdma drivers to fill out the driver entries.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:51:27 -04:00
Steve Wise
da5c850782 RDMA/nldev: add driver-specific resource tracking
Each driver can register a "fill entry" function with the restrack core.
This function will be called when filling out a resource, allowing the
driver to add driver-specific details.  The details consist of a
nltable of nested attributes, that are in the form of <key, [print-type],
value> tuples.  Both key and value attributes are mandatory.  The key
nlattr must be a string, and the value nlattr can be one of the driver
attributes that are generic, but typed, allowing the attributes to be
validated.  Currently the driver nlattr types include string, s32,
u32, s64, and u64.  The print-type nlattr allows a driver to specify
an alternative display format for user tools displaying the attribute.
For example, a u32 attribute will default to "%u", but a print-type
attribute can be included for it to be displayed in hex.  This allows
the user tool to print the number in the format desired by the
driver.

More attrs can be defined as they become needed by drivers.
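
As a rough illustration of how a user tool might honor the print-type hint,
consider the sketch below. The enum, the function, and the "sqid" key used
in the usage note are hypothetical and are not taken from the netlink uapi
or from any real tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical print-type values; not the real uapi enum. */
enum print_type { PRINT_UNSPEC, PRINT_HEX };

/*
 * Format a u32 driver attribute as "key value". Without a print-type
 * hint the value is printed with "%u"; with PRINT_HEX it is printed
 * in hex. Returns the snprintf() result.
 */
int format_u32_attr(char *buf, size_t len, const char *key,
		    enum print_type pt, uint32_t val)
{
	assert(buf != NULL && key != NULL);
	if (pt == PRINT_HEX)
		return snprintf(buf, len, "%s 0x%x", key, (unsigned int)val);
	return snprintf(buf, len, "%s %u", key, (unsigned int)val);
}
```

For a key "sqid" with value 255, the default form gives "sqid 255" and the
hex form gives "sqid 0xff".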

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:51:27 -04:00
Steve Wise
25a0ad8515 RDMA/nldev: Add explicit pad attribute
Add a specific RDMA_NLDEV_ATTR_PAD attribute to be used for 64b
attribute padding.  To preserve the ABI, make this attribute equal to
RDMA_NLDEV_ATTR_UNSPEC, which has a value of 0, because that has been
used up until now as the pad attribute.

Change all the previous use of 0 as the pad with this
new enum.
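
The ABI argument can be seen in a small sketch. The enumerator names below
are shortened, hypothetical stand-ins for the RDMA_NLDEV_ATTR_* values:
aliasing the pad to the existing zero value changes neither the wire value
of the pad nor the numbering of the attributes that follow.

```c
#include <assert.h>

/* Hypothetical, shortened names; only the aliasing pattern matters. */
enum nldev_attr_sketch {
	ATTR_UNSPEC = 0,
	ATTR_PAD = ATTR_UNSPEC,	/* same value as the literal 0 used before */
	ATTR_DEV_INDEX,		/* numbering continues from the alias: 1 */
	ATTR_DEV_NAME,		/* 2 */
};

/* Compile-time checks that nothing in the numbering moved. */
static_assert(ATTR_PAD == 0, "pad must keep the old wire value");
static_assert(ATTR_DEV_INDEX == 1, "later attributes must not shift");
```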

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:51:27 -04:00
Parav Pandit
9aa169213d RDMA/cma: Do not query GID during QP state transition to RTR
When commit [1] was added, SGID was queried to derive the SMAC address.
Then, later on during a refactor [2], SMAC was no longer needed. However,
the now useless GID query remained.  Then during additional code changes
later on, the GID query was being done in such a way that it caused iWARP
queries to start breaking.  Remove the useless GID query and resolve the
iWARP breakage at the same time.

This is discussed in [3].

[1] commit dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
[2] commit 5c266b2304 ("IB/cm: Remove the usage of smac and vid of qp_attr and cm_av")
[3] https://www.spinics.net/lists/linux-rdma/msg63951.html

Suggested-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:45:18 -04:00
Parav Pandit
2f6e513657 IB/core: Use CONFIG_SECURITY_INFINIBAND to compile out security code
Make security.c depend on CONFIG_SECURITY_INFINIBAND.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-01 11:16:36 -04:00
Håkon Bugge
db82476f37 IB/core: Make ib_mad_client_id atomic
Currently, the kernel protects access to the agent ID allocator on a per
port basis using a spinlock, so it is impossible for two apps/threads on
the same port to get the same TID, but it is entirely possible for two
threads on different ports to end up with the same TID.

As this can be confusing (regardless of it being legal according to the
IB Spec 1.3, C13-18.1.1, in section 13.4.6.4 - TransactionID usage),
and as the rdma-core user space API for /dev/umad devices implies unique
TIDs even across ports, make the TID an atomic type so that no two
allocations, regardless of port number, will be the same.
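
A userspace sketch of the idea using C11 atomics, assuming a single global
counter; the kernel uses its own atomic type and the names here are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One counter shared by every port, instead of one counter per port. */
static atomic_uint_fast32_t client_id_counter;

/* Return the next agent ID; unique regardless of which port asks. */
uint32_t next_client_id(void)
{
	/* atomic_fetch_add returns the old value, so add 1 for the new one. */
	return (uint32_t)(atomic_fetch_add(&client_id_counter, 1) + 1);
}
```

Because the increment is a single atomic operation, two callers can never
observe the same value, with or without a per-port lock around the call.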

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-30 13:07:28 -04:00
Ariel Levkovich
54e7e48b13 IB/uverbs: Fix kernel crash during MR deregistration flow
This patch fixes a crash that happens due to access to an
uninitialized DM pointer within the MR object.

The change makes sure the DM pointer in the MR object is set to
NULL during a non-DM MR creation to prevent a false indication
that this MR is related to a DM in the dereg flow.

Fixes: be934cca9e ("IB/uverbs: Add device memory registration ioctl support")
Reported-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 14:22:24 -04:00
Ariel Levkovich
5ccbf63f87 IB/uverbs: Prevent reregistration of DM_MR to regular MR
This patch adds a check in the ib_uverbs_rereg_mr flow to make
sure there's no attempt to rereg a device memory MR to regular MR.
In such case the command will fail with -EINVAL status.

Fixes: be934cca9e ("IB/uverbs: Add device memory registration ioctl support")
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 14:22:24 -04:00
Colin Ian King
f96416cea7 RDMA/iwpm: fix memory leak on map_info
When iwpm_hash_bucket is NULL, or when get_mapinfo_hash_bucket returns
NULL, map_info is never added to hash_bucket_head and is therefore
leaked. Fix this by initializing hash_bucket_head to NULL; if it is still
NULL afterwards, we know that map_info was not added to hash_bucket_head
and hence map_info should be freed.

Detected by CoverityScan, CID#1222481 ("Resource Leak")

Fixes: 30dc5e63d6 ("RDMA/core: Add support for iWARP Port Mapper user space service")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 14:20:47 -04:00
Parav Pandit
2918c1a900 RDMA/cma: Fix use after destroy access to net namespace for IPoIB
There are a few issues with validation of netdevice and listen id lookup
for IB (IPoIB) while processing incoming CM request as below.

1. While performing lookup of bind_list in cma_ps_find(), net namespace
of the netdevice can get deleted in cma_exit_net(), resulting in use
after free access of idr and/or net namespace structures.
This lookup occurs from the workqueue context (and not userspace
context where net namespace is always valid).

           CPU0                              CPU1
           ====                              ====

 bind_list = cma_ps_find();
                                     move netdevice to new namespace
                                     delete net namespace
                                        cma_exit_net()
                                           idr_destroy(idr);

 [..]
 cma_find_listener(bind_list, ..);

2. While netdevice is validated for IP address in given net namespace,
netdevice's net namespace and/or ifindex can change in
cma_get_net_dev() and cma_match_net_dev().

Above issues are overcome by using rcu lock along with netdevice
UP/DOWN state as described below.
When a net namespace is getting deleted, netdevice is closed and
shutdown before moving it back to init_net namespace.
change_net_namespace() synchronizes with any existing use of netdevice
before changing the netdev properties such as net or ifindex.
Once netdevice IFF_UP flag is cleared, such fields are not guaranteed
to be valid.
Therefore, rcu lock along with netdevice state check ensures that,
while route lookup and cm_id lookup is in progress, netdevice of
interest won't migrate to any other net namespace.
This ensures that associated net namespace of netdevice won't get
deleted while rcu lock is held for netdevice which is in IFF_UP state.

Fixes: fa20105e09 ("IB/cma: Add support for network namespaces")
Fixes: 4be74b42a6 ("IB/cma: Separate port allocation to network namespaces")
Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 13:57:26 -04:00
Matan Barak
f604db645a IB/uverbs: Fix validating mandatory attributes
Previously, if a method contained mandatory attributes in a namespace
that wasn't given by the user, these attributes weren't validated.
Fix this by iterating over all specification namespaces.
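
A simplified sketch of the check, assuming each namespace is described by a
bitmap of mandatory attributes. The struct and function are hypothetical
and do not mirror the real uverbs ioctl structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-namespace spec: which attribute bits are mandatory. */
struct ns_spec {
	uint64_t mandatory_mask;
};

/*
 * Walk every namespace in the method spec, not only the ones the user
 * supplied. given[i] is the bitmap of attributes the user passed for
 * namespace i (0 when the namespace was not given at all).
 */
bool mandatory_attrs_present(const struct ns_spec *spec,
			     const uint64_t *given, size_t num_ns)
{
	assert(spec != NULL && given != NULL);
	for (size_t i = 0; i < num_ns; i++) {
		if ((spec[i].mandatory_mask & given[i]) !=
		    spec[i].mandatory_mask)
			return false;
	}
	return true;
}
```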

Fixes: fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 13:53:41 -04:00
Parav Pandit
dc5640f294 IB/core: Fix deleting default GIDs when changing mac address
Before [1], when the MAC address of the netdevice is changed, default GID is
supposed to get deleted and added back which affects the node and/or port
GUID in below sequence.

netdevice_event()
-> NETDEV_CHANGEADDR
   default_del_cmd()
      del_netdev_default_ips()
          bond_delete_netdev_default_gids()
              ib_cache_gid_set_default_gid()
                  ib_cache_gid_del()
   add_cmd()
   [..]

However, ib_cache_gid_del() was not getting invoked in non bonding
scenarios because event_ndev and rdma_ndev are same.
Therefore, fix such condition to ignore checking upper device when event
ndev and rdma_dev are same; similar to bond_set_netdev_default_gids().

With this fix ib_cache_gid_del() is invoked correctly; however
ib_cache_gid_del() doesn't find the default GID for deletion because
find_gid() was given default_gid = false with
GID_ATTR_FIND_MASK_DEFAULT set.
But it was getting overwritten by ib_cache_gid_set_default_gid() later
on as part of add_cmd().
Therefore, mac address change used to work for default GID.

With refactor series [1], this incorrect behavior is detected.

Therefore,
when deleting default GID, set default_gid and set MASK flag.
when deleting IP based GID, clear default_gid and set MASK flag.
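
The effect of the mask on the lookup can be sketched as follows. The table
layout, the flag value, and the function are hypothetical simplifications
of the GID cache; only the flag name GID_ATTR_FIND_MASK_DEFAULT comes from
the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag standing in for GID_ATTR_FIND_MASK_DEFAULT. */
#define FIND_MASK_DEFAULT 0x1u

struct gid_entry_sketch {
	int gid;		/* stand-in for the 128-bit GID value */
	bool is_default;
	bool valid;
};

/*
 * Return the index of the first valid entry matching gid, or -1.
 * When FIND_MASK_DEFAULT is set in mask, the entry's default property
 * must also equal default_gid; otherwise that property is ignored.
 */
int find_gid_sketch(const struct gid_entry_sketch *tbl, size_t n, int gid,
		    bool default_gid, unsigned int mask)
{
	assert(tbl != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].valid || tbl[i].gid != gid)
			continue;
		if ((mask & FIND_MASK_DEFAULT) &&
		    tbl[i].is_default != default_gid)
			continue;
		return (int)i;
	}
	return -1;
}
```

With the mask set, a delete request for a default GID only matches the
default entry, and a delete request for an IP based GID never matches a
default entry that happens to carry the same value.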

[1] https://patchwork.kernel.org/patch/10319151/

Fixes: 238fdf48f2 ("IB/core: Add RoCE table bonding support")
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 17:28:18 -04:00
Parav Pandit
22c01ee4b8 IB/core: Fix to avoid deleting IPv6 look alike default GIDs
When an IPv6 link local address is removed, if it matches the default
GID, the default GID(s) get removed, which may not be the desired behavior.
This behavior was introduced by the refactor work in the Fixes tag.

When IPv6 link address is removed, removing its equivalent RoCEv2 GID
which exactly matches with default RoCEv2 GID, is right thing to do.
However achieving it correctly requires lot more changes, likely in
roce_gid_mgmt.c and core/cache.c. This should be done as independent
patch.

Therefore, this patch preserves behavior of not deleting default GIDs.
This is done by providing explicit hint to consider default GID property
using mask and default_gid; similar to add_gid().

Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 17:28:18 -04:00
Parav Pandit
a66ed149b0 IB/core: Don't allow default GID addition at non reserved slots
Default GIDs are marked reserved at the start of the GID table at index
0 and 1 by gid_table_reserve_default().  Currently when default GID is
requested, it can still allocate an empty slot which was not marked as
RESERVED for default GID, which is incorrect.

At least in current code flow of roce_gid_mgmt.c, in theory we can
still request to allocate more than one/two default GIDs depending
on how upper devices are setup.

Therefore, it is better for cache layer to only allow our reserved slots
to be used by default GID allocation requests.
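
A minimal sketch of such an allocation policy, assuming two reserved slots
at index 0 and 1 as described above. The struct and function are
hypothetical, and keeping non-default requests out of the reserved slots is
an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_RESERVED 2	/* slots 0 and 1 are reserved for default GIDs */

struct slot_sketch {
	bool used;
};

/*
 * Default GID requests search only the reserved range; other requests
 * search only the rest of the table. Returns the slot index or -1.
 */
int alloc_gid_slot(struct slot_sketch *tbl, size_t n, bool is_default)
{
	size_t start = is_default ? 0 : NUM_RESERVED;
	size_t end = is_default ? NUM_RESERVED : n;

	assert(tbl != NULL);
	for (size_t i = start; i < end && i < n; i++) {
		if (!tbl[i].used) {
			tbl[i].used = true;
			return (int)i;
		}
	}
	return -1;
}
```

A third default GID request fails instead of spilling into an ordinary
slot.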

Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 17:26:04 -04:00
Roland Dreier
09abfe7b5b RDMA/ucma: Allow resolving address w/o specifying source address
The RDMA CM will select a source device and address by consulting
the routing table if no source address is passed into
rdma_resolve_address().  Userspace will ask for this by passing an
all-zero source address in the RESOLVE_IP command.  Unfortunately
the new check for non-zero address size rejects this with EINVAL,
which breaks valid userspace applications.

Fix this by explicitly allowing a zero address family for the source.
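
The shape of the relaxed check can be sketched like this. The family
constants, the size helper, and the sizes it returns are illustrative
assumptions, not the real ucma code: a zero source family is accepted, any
other source family must be one we know the size of, and the destination
must always be valid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical family values for the sketch. */
enum { FAM_UNSPEC = 0, FAM_INET = 2, FAM_INET6 = 10 };

/* Return the address size for a known family, 0 for anything else. */
size_t addr_size_sketch(int family)
{
	switch (family) {
	case FAM_INET:
		return 16;	/* illustrative size */
	case FAM_INET6:
		return 28;	/* illustrative size */
	default:
		return 0;
	}
}

/* An all-zero source means "pick one from the routing table". */
int check_resolve_args(int src_family, int dst_family)
{
	if (src_family != FAM_UNSPEC && addr_size_sketch(src_family) == 0)
		return -EINVAL;
	if (addr_size_sketch(dst_family) == 0)
		return -EINVAL;
	return 0;
}
```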

Fixes: 2975d5de64 ("RDMA/ucma: Check AF family prior resolving address")
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 11:04:05 -04:00
Jason Gunthorpe
8b77586bd8 RDMA/ucma: Check for a cm_id->device in all user calls that need it
This is done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.

The 11 remaining call sites to ucma_get_ctx() were manually audited.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-19 22:01:11 -04:00
Geert Uytterhoeven
e33514f2e9 IB/uverbs: Add missing braces in anonymous union initializers
With gcc-4.1.2:

    drivers/infiniband/core/uverbs_std_types_flow_action.c:366: error: unknown field ‘ptr’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:367: error: unknown field ‘type’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: missing braces around initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>.<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘min_len’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘len’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:369: error: unknown field ‘flags’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:376: error: unknown field ‘ptr’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:377: error: unknown field ‘type’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: missing braces around initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>.<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:379: error: unknown field ‘len’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:383: error: unknown field ‘ptr’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:384: error: unknown field ‘type’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘min_len’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘len’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
    drivers/infiniband/core/uverbs_std_types_flow_action.c:386: error: unknown field ‘flags’ specified in initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: excess elements in union initializer
    drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)

Add the missing braces to fix this.
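
For illustration, the pattern looks roughly like this. The struct below is
a hypothetical, much reduced stand-in for the uverbs attribute spec: the
anonymous union is initialized positionally inside its own braces, so the
designators inside those braces name union members directly, instead of the
outer initializer reaching through the anonymous member by field name.

```c
#include <assert.h>

/* Hypothetical attribute spec with an anonymous union. */
struct attr_spec_sketch {
	int id;
	union {
		struct {
			int type;
			unsigned int len;
		} ptr;
		struct {
			int obj_type;
			int access;
		} obj;
	};
};

const struct attr_spec_sketch example_spec = {
	.id = 1,
	/* Extra braces: this initializer is for the anonymous union. */
	{ .ptr = { .type = 2, .len = 16 } },
};
```

Writing `.ptr = { ... }` directly at the outer level is the form the old
compiler rejects with "unknown field" errors like those listed above.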

Fixes: 2eb9beaee5 ("IB/uverbs: Add flow_action create and destroy verbs")
Fixes: 7d12f8d5a1 ("IB/uverbs: Add modify ESP flow_action")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-17 20:14:15 -06:00
Jason Gunthorpe
ee6548d1d9 RDMA/rdma_cm: Delete rdma_addr_client
The only thing it does is block module unload while work is posted from
rdma_resolve_ip().

However, this is not the right place to do this. The users of
rdma_resolve_ip() must ensure their own module does not unload until
rdma_resolve_ip() calls the callback, or until rdma_addr_cancel() is
called.

Similarly callers to rdma_addr_find_l2_eth_by_grh() must ensure their
module does not unload while they are calling code.

The only two users are already safe, so there is no need for this.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-17 19:42:50 -06:00
Jason Gunthorpe
44e75052bc RDMA/rdma_cm: Make rdma_addr_cancel into a fence
Currently rdma_addr_cancel does not prevent the callback from being used,
which is surprising and hard to reason about. There does not appear to be a
bug here, as the only user of this API does refcount properly; this is
fixed only to increase clarity.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-17 19:42:50 -06:00
Jason Gunthorpe
e19c0d2378 RDMA/rdma_cm: Remove process_req and timer sorting
Now that the work queue is used directly to launch and track the work
there is no need for the second processing function to do 'all list
entries'. Just schedule all entries onto the main work queue directly.

We can also drop all of the useless list sorting now, as the workqueue
sorts by expiration time automatically.

This change requires switching lock to a spinlock, as netdev notifiers
are called in an atomic context. This is now easy since the lock does
not need to be held across the lookup, which is already single
threaded due to the work queue.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-17 19:42:50 -06:00
Shamir Rabinovitch
ef95a90ae6 RDMA/ucma: ucma_context reference leak in error path
Validating input parameters should be done before getting the cm_id
otherwise it can leak a cm_id reference.
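
The ordering can be sketched as follows, with a hypothetical context struct,
reference counter, and size limit (none of them are the real ucma names):
the input is rejected before any reference is taken, so the early return
has nothing to release.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_OPTLEN_SKETCH 256	/* hypothetical limit */

struct ctx_sketch {
	int refs;
};

int set_option_sketch(struct ctx_sketch *ctx, size_t optlen)
{
	assert(ctx != NULL);

	/* Validate first: no reference is held yet, so nothing can leak. */
	if (optlen > MAX_OPTLEN_SKETCH)
		return -EINVAL;

	ctx->refs++;	/* take the reference only after validation */
	/* ... the option would be applied here ... */
	ctx->refs--;	/* drop it on the way out */
	return 0;
}
```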

Fixes: 6a21dfc0d0 ("RDMA/ucma: Limit possible option size")
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-16 09:49:24 -06:00
Linus Torvalds
19fd08b85b Merge candidates for 4.17 merge window
- Fix RDMA uapi headers to actually compile in userspace and be more
   complete
 
 - Three shared with netdev pull requests from Mellanox:
 
    * 7 patches, mostly to net with 1 IB related one at the back. This
      series addresses an IRQ performance issue (patch 1), cleanups related to
      the fix for the IRQ performance problem (patches 2-6), and then extends
      the fragmented completion queue support that already exists in the net
      side of the driver to the ib side of the driver (patch 7).
 
    * Mostly IB, with 5 patches to net that are needed to support the remaining
      10 patches to the IB subsystem. This series extends the current
      'representor' framework when the mlx5 driver is in switchdev mode from
      being a netdev only construct to being a netdev/IB dev construct. The IB
      dev is limited to raw Eth queue pairs only, but by having an IB dev of
      this type attached to the representor for a switchdev port, it enables
      DPDK to work on the switchdev device.
 
    * All net related, but needed as infrastructure for the rdma driver
 
 - Updates for the hns, i40iw, bnxt_re, cxgb3 and cxgb4 drivers
 
 - SRP performance updates
 
 - IB uverbs write path cleanup patch series from Leon
 
 - Add RDMA_CM support to ib_srpt. This is disabled by default.  Users need to
   set the port for ib_srpt to listen on in configfs in order for it to be
   enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
 
 - TSO and Scatter FCS support in mlx4
 
 - Refactor of modify_qp routine to resolve problems seen while working on new
   code that is forthcoming
 
 - More refactoring and updates of RDMA CM for containers support from Parav
 
 - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
   user API features
 
 - Infrastructure updates for the new IOCTL interface, based on increased usage
 
 - ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
   kernel as was originally intended. See the commit messages for
   extensive details
 
 - Syzkaller bugs and code cleanups motivated by them

Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Doug and I are at a conference next week so if another PR is sent I
  expect it to only be bug fixes. Parav noted yesterday that there are
  some fringe case behavior changes in his work that he would like to
  fix, and I see that Intel has a number of rc looking patches for HFI1
  they posted yesterday.

  Parav is again the biggest contributor by patch count with his ongoing
  work to enable container support in the RDMA stack, followed by Leon
  doing syzkaller inspired cleanups, though most of the actual fixing
  went to RC.

  There is one uncomfortable series here fixing the user ABI to actually
  work as intended in 32 bit mode. There are lots of notes in the commit
  messages, but the basic summary is we don't think there is an actual
  32 bit kernel user of drivers/infiniband for several good reasons.

  However we are seeing people want to use a 32 bit user space with 64
  bit kernel, which didn't completely work today. So in fixing it we
  required a 32 bit rxe user to upgrade their userspace. rxe users are
  still quite rare and we think a 32 bit one is non-existent.

   - Fix RDMA uapi headers to actually compile in userspace and be more
     complete

   - Three shared with netdev pull requests from Mellanox:

      * 7 patches, mostly to net with 1 IB related one at the back.
        This series addresses an IRQ performance issue (patch 1),
        cleanups related to the fix for the IRQ performance problem
        (patches 2-6), and then extends the fragmented completion queue
        support that already exists in the net side of the driver to the
        ib side of the driver (patch 7).

      * Mostly IB, with 5 patches to net that are needed to support the
        remaining 10 patches to the IB subsystem. This series extends
        the current 'representor' framework when the mlx5 driver is in
        switchdev mode from being a netdev only construct to being a
        netdev/IB dev construct. The IB dev is limited to raw Eth queue
        pairs only, but by having an IB dev of this type attached to the
        representor for a switchdev port, it enables DPDK to work on the
        switchdev device.

      * All net related, but needed as infrastructure for the rdma
        driver

   - Updates for the hns, i40iw, bnxt_re, cxgb3 and cxgb4 drivers

   - SRP performance updates

   - IB uverbs write path cleanup patch series from Leon

   - Add RDMA_CM support to ib_srpt. This is disabled by default. Users
     need to set the port for ib_srpt to listen on in configfs in order
     for it to be enabled
     (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)

   - TSO and Scatter FCS support in mlx4

   - Refactor of modify_qp routine to resolve problems seen while
     working on new code that is forthcoming

   - More refactoring and updates of RDMA CM for containers support from
     Parav

   - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
     memory' user API features

   - Infrastructure updates for the new IOCTL interface, based on
     increased usage

   - ABI compatibility bug fixes to fully support 32 bit userspace on 64
     bit kernel as was originally intended. See the commit messages for
     extensive details

   - Syzkaller bugs and code cleanups motivated by them"

* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
  IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
  IB/mlx5: Device memory mr registration support
  net/mlx5: Mkey creation command adjustments
  IB/mlx5: Device memory support in mlx5_ib
  net/mlx5: Query device memory capabilities
  IB/uverbs: Add device memory registration ioctl support
  IB/uverbs: Add alloc/free dm uverbs ioctl support
  IB/uverbs: Add device memory capabilities reporting
  IB/uverbs: Expose device memory capabilities to user
  RDMA/qedr: Fix wmb usage in qedr
  IB/rxe: Removed GID add/del dummy routines
  RDMA/qedr: Zero stack memory before copying to user space
  IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
  IB/mlx5: Add information for querying IPsec capabilities
  IB/mlx5: Add IPsec support for egress and ingress
  {net,IB}/mlx5: Add ipsec helper
  IB/mlx5: Add modify_flow_action_esp verb
  IB/mlx5: Add implementation for create and destroy action_xfrm
  IB/uverbs: Introduce ESP steering match filter
  IB/uverbs: Add modify ESP flow_action
  ...
2018-04-06 17:35:43 -07:00
Ariel Levkovich
be934cca9e IB/uverbs: Add device memory registration ioctl support
Add a new ioctl method for the MR object: REG_DM_MR.

This command can be used by users to register an allocated
device memory buffer as an MR and receive lkey and rkey
to be used within work requests.

It is added as a new method under the MR object and uses a new
ib_device callback, reg_dm_mr.
The command creates a standard ib_mr object which represents the
registered memory.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 11:16:39 -06:00
Ariel Levkovich
bee76d7ab5 IB/uverbs: Add alloc/free dm uverbs ioctl support
This change adds uverbs support for the device memory allocation
and freeing commands.

A new uverbs object of type idr is defined to represent and track
allocations of the new resource type per context.

The API requires the provider driver to implement two new ib_device
callbacks, one for allocation and one for deallocation, which
respectively return and accept the ib_dm object that represents
the allocated memory on the device.

The support is added via the ioctl command infrastructure
only.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 11:16:39 -06:00