mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	
			
				
					
						
					
					c6c901b7d9
				
			
			
		
	
	
		
			1260 Commits
		
	
	
	| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|  Peter Zijlstra | 6e33cad0af | sched: Trivial core scheduling cookie management In order to not have to use pid_struct, create a new, smaller, structure to manage task cookies for core scheduling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org | ||
|  Xiaofeng Cao | a8ca6b1388 | kernel/fork.c: fix typos change 'ancestoral' to 'ancestral' change 'reuseable' to 'reusable' delete 'do' grammatically Link: https://lkml.kernel.org/r/20210317082031.11692-1-caoxiaofeng@yulong.com Signed-off-by: Xiaofeng Cao <caoxiaofeng@yulong.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Rolf Eike Beer | a689539938 | kernel/fork.c: simplify copy_mm() All this can happen without a single goto. Link: https://lkml.kernel.org/r/2072685.XptgVkyDqn@devpool47 Signed-off-by: Rolf Eike Beer <eb@emlix.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Alexey Gladkov | c1ada3dc72 | ucounts: Set ucount_max to the largest positive value the type can hold The ns->ucount_max[] is signed long which is less than the rlimit size. We have to protect ucount_max[] from overflow and only use the largest value that we can hold. On 32bit using "long" instead of "unsigned long" to hold the counts has the downside that RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK are limited to 2GiB instead of 4GiB. I don't think anyone cares but it should be mentioned in case someone does. The RLIMIT_NPROC and RLIMIT_SIGPENDING used atomic_t so their maximum hasn't changed. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/1825a5dfa18bc5a570e79feb05e2bd07fd57e7e3.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Alexey Gladkov | d7c9e99aee | Reimplement RLIMIT_MEMLOCK on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Changelog v11: * Fix issue found by lkp robot. v8: * Fix issues found by lkp-tests project. v7: * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred. v6: * Fix bug in hugetlb_file_setup() detected by trinity. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Alexey Gladkov | d646969055 | Reimplement RLIMIT_SIGPENDING on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Changelog v11: * Revert most of changes to fix performance issues. v10: * Fix memory leak on get_ucounts failure. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Alexey Gladkov | 6e52a9f053 | Reimplement RLIMIT_MSGQUEUE on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Alexey Gladkov | 21d1c5e386 | Reimplement RLIMIT_NPROC on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never fork the service wants to set RLIMIT_NPROC=1. service-A \- program (uid=1000, container1, rlimit_nproc=1) \- program (uid=1000, container2, rlimit_nproc=1) The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails since user X already has one running process. We cannot use existing inc_ucounts / dec_ucounts because they do not allow us to exceed the maximum for the counter. Some rlimits can be overlimited by root or if the user has the appropriate capability. Changelog v11: * Change inc_rlimit_ucounts() which now returns top value of ucounts. * Drop inc_rlimit_ucounts_and_test() because the return code of inc_rlimit_ucounts() can be checked. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Alexey Gladkov | 905ae01c4a | Add a reference to ucounts for each cred For RLIMIT_NPROC and some other rlimits the user_struct that holds the global limit is kept alive for the lifetime of a process by keeping it in struct cred. Adding a pointer to ucounts in the struct cred will allow to track RLIMIT_NPROC not only for user in the system, but for user in the user_namespace. Updating ucounts may require memory allocation which may fail. So, we cannot change cred.ucounts in the commit_creds() because this function cannot fail and it should always return 0. For this reason, we modify cred.ucounts before calling the commit_creds(). Changelog v6: * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This error was caused by the fact that cred_alloc_blank() left the ucounts pointer empty. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Muchun Song | 27faca83a7 | mm: memcontrol: fix kernel stack account For simplification commit  | ||
|  Linus Torvalds | 9d31d23389 | Networking changes for 5.13. Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its headers files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifing ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS change race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    -independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074 a 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BMC4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress ratelimitting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
 Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
 5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
 Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
 q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
 I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
 wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
 7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
 zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
 rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
 pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
 B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
 =vcbA
 -----END PGP SIGNATURE-----
Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core:
   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its headers files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets
   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks
   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)
   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)
   - ipv6: segment routing: add support for IPv4 decapsulation
   - icmp: add support for RFC 8335 extended PROBE messages
   - inet: use bigger hash table for IP ID generation
   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original
   - tcp: reorder tcp_congestion_ops for better cache locality
   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow
   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic
   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO
   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls
   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.
   - allow specifing ifindex when device is moved to another namespace
   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily
   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS change race with other
     re-configuration under traffic
   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing
  Device APIs:
   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs
   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFP which
          define EEPROM in terms of pages (incl. mlx5 support)
   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)
   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)
   - dsa: improve support for sandwiched LAGs with bridge and DSA
   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support
   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver
   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames
   - phy: add support for Clause-45 PHY Loopback
   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)
  New hardware/drivers:
   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.
   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches
   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
   - ath11k: support for QCN9074 a 802.11ax device
   - Bluetooth: Broadcom BCM4330 and BMC4334
   - phy: Marvell 88X2222 transceiver support
   - mdio: add BCM6368 MDIO mux bus controller
   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
   - mana: driver for Microsoft Azure Network Adapter (MANA)
   - Actions Semi Owl Ethernet MAC
   - can: driver for ETAS ES58X CAN/USB interfaces
  Pure driver changes:
   - add XDP support to: enetc, igc, stmmac
   - add AF_XDP support to: stmmac
   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary
   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting
   - ice, iavf: support flow filters, UDP Segmentation Offload
   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic
   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping
   - octeon: support TC hardware offloads - flower matching on ingress
     and egress ratelimitting.
   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment
   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping
   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.
   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)
   - mt7601u: enable TDLS support
   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"
* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ... | ||
|  Linus Torvalds | 625434dafd | for-5.13/io_uring-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0 4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC 3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS 2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1 +gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck 6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi +CB1NokPng== =ZJ8L -----END PGP SIGNATURE----- Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - Support for multi-shot mode for POLL requests - More efficient reference counting. This is shamelessly stolen from the mm side. Even though referencing is mostly single/dual user, the 128 count was retained to keep the code the same. Maybe this should/could be made generic at some point. - Removal of the need to have a manager thread for each ring. The manager threads only job was checking and creating new io-threads as needed, instead we handle this from the queue path. - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this thread is "just" a regular application thread, so no need to restrict use of it anymore. - Cleanup of how internal async poll data lifetime is managed. - Fix for syzbot reported crash on SQPOLL cancelation. - Make buffer registration more like file registrations, which includes flexibility in avoiding full set unregistration and re-registration. - Fix for io-wq affinity setting. - Be a bit more defensive in task->pf_io_worker setup. - Various SQPOLL fixes. - Cleanup of SQPOLL creds handling. - Improvements to in-flight request tracking. - File registration cleanups. - Tons of cleanups and little fixes * tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits) io_uring: maintain drain logic for multishot poll requests io_uring: Check current->io_uring in io_uring_cancel_sqpoll io_uring: fix NULL reg-buffer io_uring: simplify SQPOLL cancellations io_uring: fix work_exit sqpoll cancellations io_uring: Fix uninitialized variable up.resv io_uring: fix invalid error check after malloc io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker kernel: always initialize task->pf_io_worker to NULL io_uring: update sq_thread_idle after ctx deleted io_uring: add full-fledged dynamic buffers support io_uring: implement fixed buffers registration similar to fixed files io_uring: prepare fixed rw for dynanic buffers io_uring: keep table of pointers to ubufs io_uring: add generic rsrc update with tags io_uring: add IORING_REGISTER_RSRC io_uring: enumerate dynamic resources io_uring: add generic path for rsrc update io_uring: preparation for rsrc tagging io_uring: decouple CQE filling from requests ... | ||
|  Linus Torvalds | 16b3d0cf5b | Scheduler updates for this cycle are: - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces
    to a unified debugfs interface.
 
  - Signals: Allow caching one sigqueue object per task, to improve performance & latencies.
 
  - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups.
 
  - Improve energy-aware scheduling
 
  - Improve the PELT metrics for certain workloads
 
  - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous
    regressions
 
  - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This
    is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature,
    or the use of the resched_latency_warn_ms=xx boot parameter.
 
  - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining
    balance_push() vs. hotplug holes/races
 
  - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well
 
  - Fix/improve various load-balancing corner cases vs. capacity margins
 
  - Fix sched topology on systems with NUMA diameter of 3 or above
 
  - Fix PF_KTHREAD vs to_kthread() race
 
  - Minor rseq optimizations
 
  - Misc cleanups, optimizations, fixes and smaller updates
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJInsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i5XxAArh0b+fwXlkVGzTUly7HQjhU7lFbChnmF
 h6ToyNLi6pXoZ14VC/WoRIME+RzK3gmw9cEFaSLVPxbkbekTcyWS78kqmcg1/j2v
 kO/20QhXobiIxVskYfoMmqSavZ5mKhMWBqtFXkCuYfxwGylas0VVdh3AZLJ7N21G
 WEoFh99pVULwWnPHxM2ZQ87Ex9BkGKbsBTswxWpprCfXLqD0N2hHlABpwJP78zRf
 VniWFOcC7lslILCFawb7CqGgAwbgV85nDRS4QCuCKisrkFywvjJrEeu/W+h1NfhF
 d6ves/osNdEAM1DSALoxwEA42An8l8xh8NyJnl8JZV00LW0DM108O5/7pf5Zcryc
 RHV3RxA7skgezBh5uThvo60QzNK+kVMatI4qpQEHxLE52CaDl/fBu1Cgb/VUxnIl
 AEBfyiFbk+skHpuMFKtl30Tx3M+yJKMTzFPd4kYjHYGEDwtAcXcB3dJQW48A79i3
 H3IWcDcXpk5Rjo2UZmaXdt/qlj7mP6U0xdOUq8ZK6JOC4uY9skszVGsfuNN9QQ5u
 2E2YKKVrGFoQydl4C8R6A7axL2VzIJszHFZNipd8E3YOyW7PWRAkr02tOOkBTj8N
 dLMcNM7aPJWqEYiEIjEzGQN20pweJ1dRA29LDuOswKh+7W2bWTQFh6F2Q8Haansc
 RVg5PDzl+Mc=
 =E7mz
 -----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and
   debugfs interfaces to a unified debugfs interface.
 - Signals: Allow caching one sigqueue object per task, to improve
   performance & latencies.
 - Improve newidle_balance() irq-off latencies on systems with a large
   number of CPU cgroups.
 - Improve energy-aware scheduling
 - Improve the PELT metrics for certain workloads
 - Reintroduce select_idle_smt() to improve load-balancing locality -
   but without the previous regressions
 - Add 'scheduler latency debugging': warn after long periods of pending
   need_resched. This is an opt-in feature that requires the enabling of
   the LATENCY_WARN scheduler feature, or the use of the
   resched_latency_warn_ms=xx boot parameter.
 - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
   remaining balance_push() vs. hotplug holes/races
 - PSI fixes, plus allow /proc/pressure/ files to be written by
   CAP_SYS_RESOURCE tasks as well
 - Fix/improve various load-balancing corner cases vs. capacity margins
 - Fix sched topology on systems with NUMA diameter of 3 or above
 - Fix PF_KTHREAD vs to_kthread() race
 - Minor rseq optimizations
 - Misc cleanups, optimizations, fixes and smaller updates
* tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  cpumask/hotplug: Fix cpu_dying() state tracking
  kthread: Fix PF_KTHREAD vs to_kthread() race
  sched/debug: Fix cgroup_path[] serialization
  sched,psi: Handle potential task count underflow bugs more gracefully
  sched: Warn on long periods of pending need_resched
  sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
  sched/debug: Rename the sched_debug parameter to sched_verbose
  sched,fair: Alternative sched_slice()
  sched: Move /proc/sched_debug to debugfs
  sched,debug: Convert sysctl sched_domains to debugfs
  debugfs: Implement debugfs_create_str()
  sched,preempt: Move preempt_dynamic to debug.c
  sched: Move SCHED_DEBUG sysctl to debugfs
  sched: Don't make LATENCYTOP select SCHED_DEBUG
  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
  cpumask: Introduce DYING mask
  cpumask: Make cpu_{online,possible,present,active}() inline
  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
  ... | ||
|  Linus Torvalds | 42dec9a936 | Perf events changes in this cycle were: - Improve Intel uncore PMU support:
 
      - Parse uncore 'discovery tables' - a new hardware capability enumeration method
        introduced on the latest Intel platforms. This table is in a well-defined PCI
        namespace location and is read via MMIO. It is organized in an rbtree.
 
        These uncore tables will allow the discovery of standard counter blocks, but
        fancier counters still need to be enumerated explicitly.
 
      - Add Alder Lake support
 
      - Improve IIO stacks to PMON mapping support on Skylake servers
 
  - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
    and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
    cores.
 
    The CPU-side feature set is entirely symmetrical - but on the PMU side there's
    core type dependent PMU functionality.
 
  - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
    fixing the AUX allocation watermark logic.
 
  - Improve ring buffer allocation on NUMA systems
 
  - Put 'struct perf_event' into their separate kmem_cache pool
 
  - Add support for synchronous signals for select perf events. The immediate motivation
    is to support low-overhead sampling-based race detection for user-space code. The
    feature consists of the following main changes:
 
     - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
       inheritance of events to CLONE_THREAD.
 
     - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
 
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64
       ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
 
    The siginfo support is adequate for breakpoints right now - but the new field can be used
    to introduce support for other types of metadata passed over siginfo as well.
 
  - Misc fixes, cleanups and smaller updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
 QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
 mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
 NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
 5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
 UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
 USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
 I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
 BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
 7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
 nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
 goUMOL/D0lg=
 =NhQ7
 -----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event updates from Ingo Molnar:
 - Improve Intel uncore PMU support:
     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.
       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.
     - Add Alder Lake support
     - Improve IIO stacks to PMON mapping support on Skylake servers
 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.
   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.
 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.
 - Improve ring buffer allocation on NUMA systems
 - Put 'struct perf_event' into their separate kmem_cache pool
 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:
     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.
     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with an u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.
 - Misc fixes, cleanups and smaller updates.
* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ... | ||
|  Stefan Metzmacher | ff24430330 | kernel: always initialize task->pf_io_worker to NULL Otherwise io_wq_worker_{running,sleeping}() may dereference an
invalid pointer (in future). Currently all users of create_io_thread()
are fine and get task->pf_io_worker = NULL implicitly from the
wq_manager, which got it either from the userspace thread
of the sq_thread, which explicitly reset it to NULL.
I think it's safer to always reset it in order to avoid future
problems.
Fixes:  | ||
|  Ingo Molnar | d0d252b8ca | Linux 5.12-rc8 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmB8qHweHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGEXIIAILUbsTJsNsvZIkZ uQ6SY6gnsPFkRiSRjY0YsZLUnqjTuiiHeTz4gzkonddwdnAp/9g6OIHIEBaeTqBh sTUMU/61Fgtrt/IvkA1yJ3rlawqgwdMe2VdimB+EFhufcSKq+5vpd3MVP4IuGx4E J3psoTU4gVltFs5t+1QjvI3XmByN0Qm8FMRXR7iL+zov1QTmGwR3G6Rn4AymG+QT pdruKboyZPfsrFGSVx7wd3HpFyQcrclEX9rKmBNZqets9d9JGWnqnEN4vQKmwO86 4MV29ucdMXH0AMB3kzGdVp0Ji2Ykt5W0K+MUWbFLtcSxnpu1OyBKGsEAMlRbD7ik gm0bMSw= =qHI0 -----END PGP SIGNATURE----- Merge tag 'v5.12-rc8' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Marco Elver | 2b26f0aa00 | perf: Support only inheriting events if cloned with CLONE_THREAD Adds bit perf_event_attr::inherit_thread, to restricting inheriting events only if the child was cloned with CLONE_THREAD. This option supports the case where an event is supposed to be process-wide only (including subthreads), but should not propagate beyond the current process's shared environment. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/YBvj6eJR%2FDY2TsEB@hirez.programming.kicks-ass.net/ | ||
|  Thomas Gleixner | 4bad58ebc8 | signal: Allow tasks to cache one sigqueue struct The idea for this originates from the real time tree to make signal delivery for realtime applications more efficient. In quite some of these application scenarios a control tasks signals workers to start their computations. There is usually only one signal per worker on flight. This works nicely as long as the kmem cache allocations do not hit the slow path and cause latencies. To cure this an optimistic caching was introduced (limited to RT tasks) which allows a task to cache a single sigqueue in a pointer in task_struct instead of handing it back to the kmem cache after consuming a signal. When the next signal is sent to the task then the cached sigqueue is used instead of allocating a new one. This solved the problem for this set of application scenarios nicely. The task cache is not preallocated so the first signal sent to a task goes always to the cache allocator. The cached sigqueue stays around until the task exits and is freed when task::sighand is dropped. After posting this solution for mainline the discussion came up whether this would be useful in general and should not be limited to realtime tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org One concern leading to the original limitation was to avoid a large amount of pointlessly cached sigqueues in alive tasks. The other concern was vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for. The accounting problem is real, but on the other hand slightly academic. After gathering some statistics it turned out that after boot of a regular distro install there are less than 10 sigqueues cached in ~1500 tasks. In case of a 'mass fork and fire signal to child' scenario the extra 80 bytes of memory per task are well in the noise of the overall memory consumption of the fork bomb. If this should be limited then this would need an extra counter in struct user, more atomic instructions and a seperate rlimit. Yet another tunable which is mostly unused. The caching is actually used. After boot and a full kernel compile on a 64CPU machine with make -j128 the number of 'allocations' looks like this: From slab: 23996 From task cache: 52223 I.e. it reduces the number of slab cache operations by ~68%. A typical pattern there is: <...>-58490 __sigqueue_alloc: for 58488 from slab ffff8881132df460 <...>-58488 __sigqueue_free: cache ffff8881132df460 <...>-58488 __sigqueue_alloc: for 1149 from cache ffff8881103dc550 bash-1149 exit_task_sighand: free ffff8881132df460 bash-1149 __sigqueue_free: cache ffff8881103dc550 The interesting sequence is that the exiting task 58488 grabs the sigqueue from bash's task cache to signal exit and bash sticks it back into it's own cache. Lather, rinse and repeat. The caching is probably not noticable for the general use case, but the benefit for latency sensitive applications is clear. While kmem caches are usually just serving from the fast path the slab merging (default) can depending on the usage pattern of the merged slabs cause occasional slow path allocations. The time spared per cached entry is a few micro seconds per signal which is not relevant for e.g. a kernel build, but for signal heavy workloads it's measurable. As there is no real downside of this caching mechanism making it unconditionally available is preferred over more conditional code or new magic tunables. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de | ||
|  Jens Axboe | 66ae0d1e2d | kernel: allow fork with TIF_NOTIFY_SIGNAL pending fork() fails if signal_pending() is true, but there are two conditions that can lead to that: 1) An actual signal is pending. We want fork to fail for that one, like we always have. 2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work. We don't need to make it fail for that case. Allow fork() to proceed if just task_work is pending, by changing the signal_pending() check to task_sigpending(). Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Jakub Kicinski | 8859a44ea0 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org> | ||
|  Linus Torvalds | b44d1ddcf8 | io_uring-5.12-2021-03-27 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBf1KAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjVSD/0f1HdekXnIE6aSRQ7YEV8ux2t5wUeDyP8U
 cdcZ8fBW9PvKZLdODSI4sw8UYV5OYEBcfImFe3nRVHR+RIVQo72UTYvuHqeUYNct
 w3drgF2GEMIxJFZR6zf9LDrQVduPqXvbEJLui6TN+eX/5E99ZlUWMLwkX1k+vDju
 QfaGZjz2736GTn1MPc7jdyZKoK7eCi5xtNFPash5wGck7aYl5TGXnG/8bRYsv2Tw
 eCYKbvv4x0s8OFcYVQMooDfbIMCyyfTwt6YatFHQEtM/RM+M66gndvv3jfkeJQju
 hz0I8qOJ8X5lf0VucncWs5J8b9Whr5YZV+k9461xalBbV9ed2vzIIikP8DpCxtYz
 yKbsdDm0+3hwfuZOz+d7ooEXKsphJ1PnSsEeuNZXtKDXVtphksUbbq4H2NLINcsQ
 m6dwaRPSEA0EymngGY2e+8+CU0euiE4mqoMpw4D9m9Irs+BAaWYGk9xCWr0BGem0
 auZOMqvV2xktdBlGx1BJCLts1sHHxy8IM3u0852R/1AfcKOkXwNVPt62I8e9ceIA
 wc731aWHwJfS25m430xFDPJKJpUZoZgste4qwVym70CmRziuamgYyIfrfRg1ZjsD
 ZBa9Z4hPiT4e0eDqlYjcMpl9FORgYQXVXy5ofd/eZg5xkU8X+i6TVZkaQNkZyqV/
 4ogBZYUolg==
 =mwLC
 -----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 - Use thread info versions of flag testing, as discussed last week.
 - The series enabling PF_IO_WORKER to just take signals, instead of
   needing to special case that they do not in a bunch of places. Ends
   up being pretty trivial to do, and then we can revert all the special
   casing we're currently doing.
 - Kill dead pointer assignment
 - Fix hashed part of async work queue trace
 - Fix sign extension issue for IORING_OP_PROVIDE_BUFFERS
 - Fix a link completion ordering regression in this merge window
 - Cancellation fixes
* tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block:
  io_uring: remove unsued assignment to pointer io
  io_uring: don't cancel extra on files match
  io_uring: don't cancel-track common timeouts
  io_uring: do post-completion chore on t-out cancel
  io_uring: fix timeout cancel return code
  Revert "signal: don't allow STOP on PF_IO_WORKER threads"
  Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"
  Revert "kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals"
  Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"
  kernel: stop masking signals in create_io_thread()
  io_uring: handle signals for IO threads like a normal thread
  kernel: don't call do_exit() for PF_IO_WORKER threads
  io_uring: maintain CQE order of a failed link
  io-wq: fix race around pending work on teardown
  io_uring: do ctx sqd ejection in a clear context
  io_uring: fix provide_buffers sign extension
  io_uring: don't skip file_end_write() on reissue
  io_uring: correct io_queue_async_work() traces
  io_uring: don't use {test,clear}_tsk_thread_flag() for current | ||
|  Jens Axboe | b16b3855d8 | kernel: stop masking signals in create_io_thread() This is racy - move the blocking into when the task is created and we're marking it as PF_IO_WORKER anyway. The IO threads are now prepared to handle signals like SIGSTOP as well, so clear that from the mask to allow proper stopping of IO threads. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  David S. Miller | efd13b71a3 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: David S. Miller <davem@davemloft.net> | ||
|  Linus Torvalds | 0ada2dad8b | io_uring-5.12-2021-03-19 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBVI8cQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuFOD/494N0khk5EpLnoq0+/uyRpnqnTjL3n+iWc
 fviiodL2/eirKWML/WbNUaKOWMs76iBwRqvTFnmCuyVexM9iPq3BXHocNYESYFni
 0EfuL+jzs/LjQLVJgCxyYUyafDtCGZ5ct/3ilfGWSY13ngfYdUVT1p+u9NK94T63
 4SrT6KKqEnpStpA1kjCw+doL17Tx2jrcrnX8gztIm0IarTnJGusiNZboy1IBMcqf
 Lw7CEePn4b9/0wKJa8sDYIFtI8Rvj2Jk86c4DDpGgoPU6I9fGPnp3oMGrxlwectT
 uTguzTlKAvbSu6v+2jqHCcXpkOG3aQJJM+YaNZmWOKwkLdyzLLIDT7SPlNHlacDF
 yBj+Ou3FbKvVUrYldUHlQoLZIAgp7AQO1JBilijNNibXsH0M4Gaw3aGPFmhEFfeJ
 /y+DXEfi2TGC6Yo+Ogub9Rh3gd2kgATu9Qbbnxi5TmYFc6WASBHP3OQEMVpVkD6F
 IZxZDvIKMj3DoYX3Can0vlqiWhmL5o7gyaRTkmxc4A21CR+AHstupDNTHbR23IsY
 dVxWmfrU25VFcIUAUOUgzPayDRn5KevexXjpkC8MVPQUqe/8FgI18eigDWTwlkcG
 0AZUraswv8uT5b0oLj9cawtAU9Dlit7niI6r9I3dtoUAD3JY4+yDp7oZp2TTOV2z
 +rgS+5zjug==
 =aPxz
 -----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Quieter week this time, which was both expected and desired. About
  half of the below is fixes for this release, the other half are just
  fixes in general. In detail:
   - Fix the freezing of IO threads, by making the freezer not send them
     fake signals. Make them freezable by default.
   - Like we did for personalities, move the buffer IDR to xarray. Kills
     some code and avoids a use-after-free on teardown.
   - SQPOLL cleanups and fixes (Pavel)
   - Fix linked timeout race (Pavel)
   - Fix potential completion post use-after-free (Pavel)
   - Cleanup and move internal structures outside of general kernel view
     (Stefan)
   - Use MSG_SIGNAL for send/recv from io_uring (Stefan)"
* tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block:
  io_uring: don't leak creds on SQO attach error
  io_uring: use typesafe pointers in io_uring_task
  io_uring: remove structures from include/linux/io_uring.h
  io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls
  io_uring: fix sqpoll cancellation via task_work
  io_uring: add generic callback_head helpers
  io_uring: fix concurrent parking
  io_uring: halt SQO submission on ctx exit
  io_uring: replace sqd rw_semaphore with mutex
  io_uring: fix complete_post use ctx after free
  io_uring: fix ->flags races by linked timeouts
  io_uring: convert io_buffer_idr to XArray
  io_uring: allow IO worker threads to be frozen
  kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing | ||
|  Linus Torvalds | 50eb842fe5 | Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "28 patches. Subsystems affected by this series: mm (memblock, pagealloc, hugetlb, highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and ia64" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits) zram: fix broken page writeback zram: fix return value on writeback_store mm/memcg: set memcg when splitting page mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls mm/userfaultfd: fix memory corruption due to writeprotect kasan: fix KASAN_STACK dependency for HW_TAGS kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC mm/madvise: replace ptrace attach requirement for process_madvise include/linux/sched/mm.h: use rcu_dereference in in_vfork() kfence: fix reports if constant function prefixes exist kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations kfence: fix printk format for ptrdiff_t linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* MAINTAINERS: exclude uapi directories in API/ABI section binfmt_misc: fix possible deadlock in bm_register_write mm/highmem.c: fix zero_user_segments() with start > end hugetlb: do early cow when page pinned on src mm mm: use is_cow_mapping() across tree where proper ... | ||
|  Fenghua Yu | 82e69a121b | mm/fork: clear PASID for new mm When a new mm is created, its PASID should be cleared, i.e. the PASID is initialized to its init state 0 on both ARM and X86. This patch was part of the series introducing mm->pasid, but got lost along the way [1]. It still makes sense to have it, because each address space has a different PASID. And the IOMMU code in iommu_sva_alloc_pasid() expects the pasid field of a new mm struct to be cleared. [1] https://lore.kernel.org/linux-iommu/YDgh53AcQHT+T3L0@otcwcpicx3.sc.intel.com/ Link: https://lkml.kernel.org/r/20210302103837.2562625-1-jean-philippe@linaro.org Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Jacob Pan <jacob.jun.pan@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Jens Axboe | 16efa4fce3 | io_uring: allow IO worker threads to be frozen With the freezer using the proper signaling to notify us of when it's time to freeze a thread, we can re-enable normal freezer usage for the IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call try_to_freeze() appropriately, and remove the default setting of PF_NOFREEZE from create_io_thread(). Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Jens Axboe | e22bc9b481 | kernel: make IO threads unfreezable by default The io-wq threads were already marked as no-freeze, but the manager was not. On resume, we perpetually have signal_pending() being true, and hence the manager will loop and spin 100% of the time. Just mark the tasks created by create_io_thread() as PF_NOFREEZE by default, and remove any knowledge of it in io-wq and io_uring. Reported-by: Kevin Locke <kevin@kevinlocke.name> Tested-by: Kevin Locke <kevin@kevinlocke.name> Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  David S. Miller | c1acda9807 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-09 The following pull-request contains BPF updates for your *net-next* tree. We've added 90 non-merge commits during the last 17 day(s) which contain a total of 114 files changed, 5158 insertions(+), 1288 deletions(-). The main changes are: 1) Faster bpf_redirect_map(), from Björn. 2) skmsg cleanup, from Cong. 3) Support for floating point types in BTF, from Ilya. 4) Documentation for sys_bpf commands, from Joe. 5) Support for sk_lookup in bpf_prog_test_run, form Lorenz. 6) Enable task local storage for tracing programs, from Song. 7) bpf_for_each_map_elem() helper, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> | ||
|  Jens Axboe | cc440e8738 | kernel: provide create_io_thread() helper Provide a generic helper for setting up an io_uring worker. Returns a task_struct so that the caller can do whatever setup is needed, then call wake_up_new_task() to kick it into gear. Add a kernel_clone_args member, io_thread, which tells copy_process() to mark the task with PF_IO_WORKER. Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Song Liu | a10787e6d5 | bpf: Enable task local storage for tracing programs To access per-task data, BPF programs usually creates a hash table with
pid as the key. This is not ideal because:
 1. The user need to estimate the proper size of the hash table, which may
    be inaccurate;
 2. Big hash tables are slow;
 3. To clean up the data properly during task terminations, the user need
    to write extra logic.
Task local storage overcomes these issues and offers a better option for
these per-task data. Task local storage is only available to BPF_LSM. Now
enable it for tracing programs.
Unlike LSM programs, tracing programs can be called in IRQ contexts.
Helpers that access task local storage are updated to use
raw_spin_lock_irqsave() instead of raw_spin_lock_bh().
Tracing programs can attach to functions on the task free path, e.g.
exit_creds(). To avoid allocating task local storage after
bpf_task_storage_free(). bpf_task_storage_get() is updated to not allocate
new storage when the task is not refcounted (task->usage == 0).
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210225234319.336131-2-songliubraving@fb.com | ||
|  Alexander Guril | 96e1e9846c | Kernel: fork.c: Fix coding style: Do not use {} around single-line statements Fixed two coding style issues in kernel/fork.c
Do not use {} around single-line statements.
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Alexander Guril <alexander.guril02@gmail.com>
Link: https://lore.kernel.org/r/20201226114021.2589-1-alexander.guril02@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Andrey Konovalov | cebd0eb29a | kasan: rename (un)poison_shadow to (un)poison_range This is a preparatory commit for the upcoming addition of a new hardware tag-based (MTE-based) KASAN mode. The new mode won't be using shadow memory. Rename external annotation kasan_unpoison_shadow() to kasan_unpoison_range(), and introduce internal functions (un)poison_range() (without kasan_ prefix). Co-developed-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/fccdcaa13dc6b2211bf363d6c6d499279a54fe3a.1606161801.git.andreyknvl@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | d01e7f10da | Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exec-update-lock update from Eric Biederman: "The key point of this is to transform exec_update_mutex into a rw_semaphore so readers can be separated from writers. This makes it easier to understand what the holders of the lock are doing, and makes it harder to contend or deadlock on the lock. The real deadlock fix wound up in perf_event_open" * 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: exec: Transform exec_update_mutex into a rw_semaphore | ||
|  Linus Torvalds | faf145d6f3 | Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "This set of changes ultimately fixes the interaction of posix file lock and exec. Fundamentally most of the change is just moving where unshare_files is called during exec, and tweaking the users of files_struct so that the count of files_struct is not unnecessarily played with. Along the way fcheck and related helpers were renamed to more accurately reflect what they do. There were also many other small changes that fell out, as this is the first time in a long time much of this code has been touched. Benchmarks haven't turned up any practical issues but Al Viro has observed a possibility for a lot of pounding on task_lock. So I have some changes in progress to convert put_files_struct to always rcu free files_struct. That wasn't ready for the merge window so that will have to wait until next time" * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) exec: Move io_uring_task_cancel after the point of no return coredump: Document coredump code exclusively used by cell spufs file: Remove get_files_struct file: Rename __close_fd_get_file close_fd_get_file file: Replace ksys_close with close_fd file: Rename __close_fd to close_fd and remove the files parameter file: Merge __alloc_fd into alloc_fd file: In f_dupfd read RLIMIT_NOFILE once. file: Merge __fd_install into fd_install proc/fd: In fdinfo seq_show don't use get_files_struct bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu file: Implement task_lookup_next_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu file: Implement task_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Replace fcheck_files with files_lookup_fd_rcu file: Factor files_lookup_fd_locked out of fcheck_files file: Rename __fcheck_files to files_lookup_fd_raw ... | ||
|  Linus Torvalds | d635a69dd4 | Networking updates for 5.11 Core:
 
  - support "prefer busy polling" NAPI operation mode, where we defer softirq
    for some time expecting applications to periodically busy poll
 
  - AF_XDP: improve efficiency by more batching and hindering
            the adjacency cache prefetcher
 
  - af_packet: make packet_fanout.arr size configurable up to 64K
 
  - tcp: optimize TCP zero copy receive in presence of partial or unaligned
         reads making zero copy a performance win for much smaller messages
 
  - XDP: add bulk APIs for returning / freeing frames
 
  - sched: support fragmenting IP packets as they come out of conntrack
 
  - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
 
 BPF:
 
  - BPF switch from crude rlimit-based to memcg-based memory accounting
 
  - BPF type format information for kernel modules and related tracing
    enhancements
 
  - BPF implement task local storage for BPF LSM
 
  - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
 
 Protocols:
 
  - mptcp: improve multiple xmit streams support, memory accounting and
           many smaller improvements
 
  - TLS: support CHACHA20-POLY1305 cipher
 
  - seg6: add support for SRv6 End.DT4/DT6 behavior
 
  - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
 
  - ppp_generic: add ability to bridge channels directly
 
  - bridge: Connectivity Fault Management (CFM) support as is defined in
            IEEE 802.1Q section 12.14.
 
 Drivers:
 
  - mlx5: make use of the new auxiliary bus to organize the driver internals
 
  - mlx5: more accurate port TX timestamping support
 
  - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
      the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging
 
  - rtw88: major bluetooth co-existance improvements
 
  - iwlwifi: support new 6 GHz frequency band
 
  - ath11k: Fast Initial Link Setup (FILS)
 
  - mt7915: dual band concurrent (DBDC) support
 
  - net: ipa: add basic support for IPA v4.5
 
 Refactor:
 
  - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
 
  - phy: add support for shared interrupts; get rid of multiple driver
         APIs and have the drivers write a full IRQ handler, slight growth
 	of driver code should be compensated by the simpler API which
 	also allows shared IRQs
 
  - add common code for handling netdev per-cpu counters
 
  - move TX packet re-allocation from Ethernet switch tag drivers to
    a central place
 
  - improve efficiency and rename nla_strlcpy
 
  - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot
 
 Old code removal:
 
  - wan: delete the DLCI / SDLA drivers
 
  - wimax: move to staging
 
  - wifi: remove old WDS wifi bridging support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
 IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
 qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
 N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
 XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
 s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
 4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
 hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
 C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
 LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
 +htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
 Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
 =GXs1
 -----END PGP SIGNATURE-----
Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core:
   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll
   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher
   - af_packet: make packet_fanout.arr size configurable up to 64K
   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages
   - XDP: add bulk APIs for returning / freeing frames
   - sched: support fragmenting IP packets as they come out of conntrack
   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
  BPF:
   - BPF switch from crude rlimit-based to memcg-based memory accounting
   - BPF type format information for kernel modules and related tracing
     enhancements
   - BPF implement task local storage for BPF LSM
   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage
  Protocols:
   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements
   - TLS: support CHACHA20-POLY1305 cipher
   - seg6: add support for SRv6 End.DT4/DT6 behavior
   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
   - ppp_generic: add ability to bridge channels directly
   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.
  Drivers:
   - mlx5: make use of the new auxiliary bus to organize the driver
     internals
   - mlx5: more accurate port TX timestamping support
   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging
   - rtw88: major bluetooth co-existance improvements
   - iwlwifi: support new 6 GHz frequency band
   - ath11k: Fast Initial Link Setup (FILS)
   - mt7915: dual band concurrent (DBDC) support
   - net: ipa: add basic support for IPA v4.5
  Refactor:
   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior
   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs
   - add common code for handling netdev per-cpu counters
   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place
   - improve efficiency and rename nla_strlcpy
   - number of W=1 warning cleanups as we now catch those in a patchwork
     build bot
  Old code removal:
   - wan: delete the DLCI / SDLA drivers
   - wimax: move to staging
   - wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ... | ||
|  Linus Torvalds | ac73e3dc8a | Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton:
 - a few random little subsystems
 - almost all of the MM patches which are staged ahead of linux-next
   material. I'll trickle to post-linux-next work in as the dependents
   get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
  mm: cleanup kstrto*() usage
  mm: fix fall-through warnings for Clang
  mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
  mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
  mm:backing-dev: use sysfs_emit in macro defining functions
  mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
  mm: use sysfs_emit for struct kobject * uses
  mm: fix kernel-doc markups
  zram: break the strict dependency from lzo
  zram: add stat to gather incompressible pages since zram set up
  zram: support page writeback
  mm/process_vm_access: remove redundant initialization of iov_r
  mm/zsmalloc.c: rework the list_add code in insert_zspage()
  mm/zswap: move to use crypto_acomp API for hardware acceleration
  mm/zswap: fix passing zero to 'PTR_ERR' warning
  mm/zswap: make struct kernel_param_ops definitions const
  userfaultfd/selftests: hint the test runner on required privilege
  userfaultfd/selftests: fix retval check for userfaultfd_open()
  userfaultfd/selftests: always dump something in modes
  userfaultfd: selftests: make __{s,u}64 format specifiers portable
  ... | ||
|  Muchun Song | da3ceeff92 | mm: memcg/slab: rename *_lruvec_slab_state to *_lruvec_kmem_state The *_lruvec_slab_state is also suitable for pages allocated from buddy, not just for the slab objects. But the function name seems to tell us that only slab object is applicable. So we can rename the keyword of slab to kmem. Link: https://lkml.kernel.org/r/20201117085249.24319-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Jason Gunthorpe | 57efa1fe59 | mm/gup: prevent gup_fast from racing with COW during fork Since commit | ||
|  Linus Torvalds | edd7ab7684 | The new preemtible kmap_local() implementation: - Consolidate all kmap_atomic() internals into a generic implementation
     which builds the base for the kmap_local() API and make the
     kmap_atomic() interface wrappers which handle the disabling/enabling of
     preemption and pagefaults.
 
   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.
 
   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a mapping
     is established. It has to disable migration instead to guarantee that
     the virtual address of the mapped slot is the same accross preemption.
 
   - Provide better debug facilities: guard pages and enforced utilization
     of the mapping mechanics on 64bit systems when the architecture allows
     it.
 
   - Provide the new kmap_local() API which can now be used to cleanup the
     kmap_atomic() usage sites all over the place. Most of the usage sites
     do not require the implicit disabling of preemption and pagefaults so
     the penalty on 64bit and 32bit non-highmem systems is removed and quite
     some of the code can be simplified. A wholesale conversion is not
     possible because some usage depends on the implicit side effects and
     some need to be cleaned up because they work around these side effects.
 
     The migrate disable side effect is only effective on highmem systems
     and when enforced debugging is enabled. On 64bit and 32bit non-highmem
     systems the overhead is completely avoided.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
 0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
 4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
 p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
 RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
 +x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
 ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
 Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
 XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
 FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
 HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
 +jlfoJhMbtx5Gg==
 =n71I
 -----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
 "The new preemtible kmap_local() implementation:
   - Consolidate all kmap_atomic() internals into a generic
     implementation which builds the base for the kmap_local() API and
     make the kmap_atomic() interface wrappers which handle the
     disabling/enabling of preemption and pagefaults.
   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.
   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a
     mapping is established. It has to disable migration instead to
     guarantee that the virtual address of the mapped slot is the same
     across preemption.
   - Provide better debug facilities: guard pages and enforced
     utilization of the mapping mechanics on 64bit systems when the
     architecture allows it.
   - Provide the new kmap_local() API which can now be used to cleanup
     the kmap_atomic() usage sites all over the place. Most of the usage
     sites do not require the implicit disabling of preemption and
     pagefaults so the penalty on 64bit and 32bit non-highmem systems is
     removed and quite some of the code can be simplified. A wholesale
     conversion is not possible because some usage depends on the
     implicit side effects and some need to be cleaned up because they
     work around these side effects.
     The migrate disable side effect is only effective on highmem
     systems and when enforced debugging is enabled. On 64bit and 32bit
     non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  ARM: highmem: Fix cache_is_vivt() reference
  x86/crashdump/32: Simplify copy_oldmem_page()
  io-mapping: Provide iomap_local variant
  mm/highmem: Provide kmap_local*
  sched: highmem: Store local kmaps in task struct
  x86: Support kmap_local() forced debugging
  mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
  microblaze/mm/highmem: Add dropped #ifdef back
  xtensa/mm/highmem: Make generic kmap_atomic() work correctly
  mm/highmem: Take kmap_high_get() properly into account
  highmem: High implementation details and document API
  Documentation/io-mapping: Remove outdated blurb
  io-mapping: Cleanup atomic iomap
  mm/highmem: Remove the old kmap_atomic cruft
  highmem: Get rid of kmap_types.h
  xtensa/mm/highmem: Switch to generic kmap atomic
  sparc/mm/highmem: Switch to generic kmap atomic
  powerpc/mm/highmem: Switch to generic kmap atomic
  nds32/mm/highmem: Switch to generic kmap atomic
  ... | ||
|  Linus Torvalds | 76d4acf22b | perf/kprobes updates: - Make kretprobes lockless to avoid the rp->lock performance and potential
    lock ordering issues.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XvuYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoV3XEADA3yp4ApabrdSMK+JpTM053mM3NCCk
 VLZEdh5+ydvPfgTWZcgLDfL4P4MVySDKf40pSVgZOA73uDWhdO4jcMoJgl9Du4Nq
 qfvz6Atj0a8XEgAFNh1IWGGAHydIwKOQZJyjFT5Kh94QNOErF2PJGAMnoMYpdJsj
 E7kgDM+vmWJk0GE+OYTzsAYQ99XhLfUAO9f8WoRirxyNgga6bu0arRYWZSX3Sg/h
 oDUHeizyrrURUBgxJBewCxvCsy4TTfefwZFUBLK5gm3zRJLKDT2O8wiy+KzlRQqA
 kYV3fSx8fYETlSOJWJC8S01MLpxslGdenIdRgNc63C021DtwMGM83FCl0DLnPMeg
 iX5u+0Qg77rnJ8zh0cgSxyP6EgZzrUW8+DjZagge3PAnTXwYRv95pOJahJifDVmF
 mo2RJ2Me+XbqeB4BYoLivvWpXdsWOvtXl3BTA6ZLV+K823lMPYcZO/cXHIUYHhtu
 ExrZ+aw3opt43KT5sNQmPll7d1UsMD4/761L7gysIYK0RthunmlWpAnnfLTbRdPe
 ELKIHcuSCGkGfRs07/oPbbOpMorhel+3alW0B6Vzar0/0nw3fPX/yPIkCh7s941o
 G0UIPquvBGk3u0bZKZZ7QJPjT0ktdQpQs69+J2ARXWvApAGKnkOlPsNSI9TbPE3D
 ZIguKqSyzqJwuA==
 =PDBa
 -----END PGP SIGNATURE-----
Merge tag 'perf-kprobes-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf/kprobes updates from Thomas Gleixner:
 "Make kretprobes lockless to avoid the rp->lock performance and
  potential lock ordering issues"
* tag 'perf-kprobes-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/atomics: Regenerate the atomics-check SHA1's
  kprobes: Replace rp->free_instance with freelist
  freelist: Implement lockless freelist
  asm-generic/atomic: Add try_cmpxchg() fallbacks
  kprobes: Remove kretprobe hash
  llist: Add nonatomic __llist_add() and __llist_dell_all() | ||
|  Linus Torvalds | 1ac0884d54 | A set of updates for entry/exit handling: - More generalization of entry/exit functionality
 
  - The consolidation work to reclaim TIF flags on x86 and also for non-x86
    specific TIF flags which are solely relevant for syscall related work
    and have been moved into their own storage space. The x86 specific part
    had to be merged in to avoid a major conflict.
 
  - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
    delivery mode of task work and results in an impressive performance
    improvement for io_uring. The non-x86 consolidation of this is going to
    come seperate via Jens.
 
  - The selective syscall redirection facility which provides a clean and
    efficient way to support the non-Linux syscalls of WINE by catching them
    at syscall entry and redirecting them to the user space emulation. This
    can be utilized for other purposes as well and has been designed
    carefully to avoid overhead for the regular fastpath. This includes the
    core changes and the x86 support code.
 
  - Simplification of the context tracking entry/exit handling for the users
    of the generic entry code which guarantee the proper ordering and
    protection.
 
  - Preparatory changes to make the generic entry code accomodate S390
    specific requirements which are mostly related to their syscall restart
    mechanism.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XoPoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoe0tD/4jSKHIogVM9kVpiYfwjDGS1NluaBXn
 71ZoASbX9GZebyGandMyF2QP1iJ24ZO0RztBwHEVH6fyomKB2iFNedssCpO9yfWV
 3eFRpOvMpbszY2W2bd0QG3GrqaTttjVfB4ahkGLzqeSbchdob6hZpNDYtBZnujA6
 GSnrrurfJkCGoQny+yJQYdQJXQU+BIX90B2a2Q+jW123Luy/iHXC1f/krZSA1m14
 fC9xYLSUjPphTzh2ZOW+C3DgdjOL5PfAm/6F+DArt4GtLgrEGD7R74aLSFhvetky
 dn5QtG+yAsz1i0cc5Wu/JBcT9tOkY92rPYSyLI9bYQUSQ/bMyuprz6oYKj3dubsu
 ZSsKPdkNFPIniL4fLdCMWZcIXX5xgnrxKjdgXZXW3gtrcxSns8w8uED3Sh7dgE08
 pgIeq67E5g/OB8kJXH1VxdewmeQb9cOmnzzHwNO7TrrGbBKjDTYHNdYOKf1dUTTK
 ZX1UjLfGwxTkMYAbQD1k0JGZ2OLRshzSaH5BW/ZKa3bvJW6yYOq+/YT8B8hbJ8U3
 vThlO75/55IJxS5r5Y3vZd/IHdsYbPuETD+TA8tNYtPqNZasW8nnk4TYctWqzDuO
 /Ka1wvWYid3c6ySznQn4zSyRjr968AfHeZ9YTUMhWufy5waXVmdBMG41u3IKfsVt
 osyzNc4EK19/Mg==
 =hsjV
 -----END PGP SIGNATURE-----
Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:
   - More generalization of entry/exit functionality
   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.
   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come seperate via Jens.
   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.
   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.
   - Preparatory changes to make the generic entry code accomodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"
* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry_Add_enter_from_user_mode_wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ... | ||
|  Eric W. Biederman | f7cfd871ae | exec: Transform exec_update_mutex into a rw_semaphore Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:
   perf_event_open  (exec_update_mutex -> ovl_i_mutex)
   chown            (ovl_i_mutex       -> sb_writes)
   sendfile         (sb_writes         -> p->lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p->lock           -> exec_update_mutex)
While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.
There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.
Cc: Jann Horn <jannh@google.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Fixes:  | ||
|  Eric W. Biederman | 1f702603e7 | exec: Simplify unshare_files Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Roman Gushchin | bcfe06bf26 | mm: memcontrol: Use helpers to read page's memcg data Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com | ||
|  Gabriel Krisman Bertazi | 1446e1df9e | kernel: Implement selective syscall userspace redirection Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com | ||
|  Thomas Gleixner | 5fbda3ecd1 | sched: highmem: Store local kmaps in task struct Instead of storing the map per CPU provide and use per task storage. That prepares for local kmaps which are preemptible. The context switch code is preparatory and not yet in use because kmap_atomic() runs with preemption disabled. Will be made usable in the next step. The context switch logic is safe even when an interrupt happens after clearing or before restoring the kmaps. The kmap index in task struct is not modified so any nesting kmap in an interrupt will use unused indices and on return the counter is the same as before. Also add an assert into the return to user space code. Going back to user space with an active kmap local is a nono. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20201118204007.372935758@linutronix.de | ||
|  Gabriel Krisman Bertazi | 64eb35f701 | ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag On architectures using the generic syscall entry code the architecture independent syscall work is moved to flags in thread_info::syscall_work. This removes architecture dependencies and frees up TIF bits. Define SYSCALL_WORK_SYSCALL_EMU, use it in the generic entry code and convert the code which uses the TIF specific helper functions to use the new *_syscall_work() helpers which either resolve to the new mode for users of the generic entry code or to the TIF based functions for the other architectures. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20201116174206.2639648-8-krisman@collabora.com | ||
|  Gabriel Krisman Bertazi | 64c19ba29b | ptrace: Migrate to use SYSCALL_TRACE flag On architectures using the generic syscall entry code the architecture independent syscall work is moved to flags in thread_info::syscall_work. This removes architecture dependencies and frees up TIF bits. Define SYSCALL_WORK_SYSCALL_TRACE, use it in the generic entry code and convert the code which uses the TIF specific helper functions to use the new *_syscall_work() helpers which either resolve to the new mode for users of the generic entry code or to the TIF based functions for the other architectures. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20201116174206.2639648-7-krisman@collabora.com | ||
|  Gabriel Krisman Bertazi | 23d67a5485 | seccomp: Migrate to use SYSCALL_WORK flag On architectures using the generic syscall entry code the architecture independent syscall work is moved to flags in thread_info::syscall_work. This removes architecture dependencies and frees up TIF bits. Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert the code which uses the TIF specific helper functions to use the new *_syscall_work() helpers which either resolve to the new mode for users of the generic entry code or to the TIF based functions for the other architectures. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com | ||
|  Eddy Wu | b4e00444ca | fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent current->group_leader->exit_signal may change during copy_process() if current->real_parent exits. Move the assignment inside tasklist_lock to avoid the race. Signed-off-by: Eddy Wu <eddy_wu@trendmicro.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Ingo Molnar | 666fab4a3e | Merge branch 'linus' into perf/kprobes Conflicts:
	include/asm-generic/atomic-instrumented.h
	kernel/kprobes.c
Use the upstream atomic-instrumented.h checksum, and pick
the kprobes version of kernel/kprobes.c, which effectively
reverts this upstream workaround:
   | ||
|  Randy Dunlap | 7b7b8a2c95 | kernel/: fix repeated words in comments Fix multiple occurrences of duplicated words in kernel/. Fix one typo/spello on the same line as a duplicate word. Change one instance of "the the" to "that the". Otherwise just drop one of the repeated words. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/98202fa6-8919-ef63-9efe-c0fad5ca7af1@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 73eb7f9a4f | mm: use helper function put_write_access() In commit  | ||
|  Linus Torvalds | 612e7a4c16 | kernel-clone-v5.9 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXz5bNAAKCRCRxhvAZXjc opfjAP9R/J72yxdd2CLGNZ96hyiRX1NgFDOVUhscOvujYJf8ZwD+OoLmKMvAyFW6 hnMhT1n9Q+aq194hyzChOLQaBTejBQ8= =4WCX -----END PGP SIGNATURE----- Merge tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull kernel_clone() updates from Christian Brauner: "During the v5.9 merge window we reworked the process creation codepaths across multiple architectures. After this work we were only left with the _do_fork() helper based on the struct kernel_clone_args calling convention. As was pointed out _do_fork() isn't valid kernelese especially for a helper that isn't just static. This series removes the _do_fork() helper and introduces the new kernel_clone() helper. The process creation cleanup didn't change the name to something more reasonable mainly because _do_fork() was used in quite a few places. So sending this as a separate series seemed the better strategy. I originally intended to send this early in the v5.9 development cycle after the merge window had closed but given that this was touching quite a few places I decided to defer this until the v5.10 merge window" * tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: sched: remove _do_fork() tracing: switch to kernel_clone() kgdbts: switch to kernel_clone() kprobes: switch to kernel_clone() x86: switch to kernel_clone() sparc: switch to kernel_clone() nios2: switch to kernel_clone() m68k: switch to kernel_clone() ia64: switch to kernel_clone() h8300: switch to kernel_clone() fork: introduce kernel_clone() | ||
|  Suren Baghdasaryan | 67197a4f28 | mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.
Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.
Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.
The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.
With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.
[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
Fixes:  | ||
|  Peter Xu | c78f463649 | mm: remove src/dst mm parameter in copy_page_range() Both of the mm pointers are not needed after commit  | ||
|  Miaohe Lin | cf508b5845 | mm: use helper function mapping_allow_writable() Commit  | ||
|  Peter Zijlstra | d741bf41d7 | kprobes: Remove kretprobe hash The kretprobe hash is mostly superfluous, replace it with a per-task variable. This gets rid of the task hash and it's related locking. Note that this may change the kprobes module-exported API for kretprobe handlers. If any out-of-tree kretprobe user uses ri->rp, use get_kretprobe(ri) instead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/159870620431.1229682.16325792502413731312.stgit@devnote2 | ||
|  Jens Axboe | 0f2122045b | io_uring: don't rely on weak ->files references Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Peter Xu | 7a4830c380 | mm/fork: Pass new vma pointer into copy_page_range() This prepares for the future work to trigger early cow on pinned pages during fork(). No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 008cfe4418 | mm: Introduce mm_struct.has_pinned (Commit message majorly collected from Jason Gunthorpe) Reduce the chance of false positive from page_maybe_dma_pinned() by keeping track if the mm_struct has ever been used with pin_user_pages(). This allows cases that might drive up the page ref_count to avoid any penalty from handling dma_pinned pages. Future work is planned, to provide a more sophisticated solution, likely to turn it into a real counter. For now, make it atomic_t but use it as a boolean for simplicity. Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Tobias Klauser | b0daa2c73f | fork: adjust sysctl_max_threads definition to match prototype Commit | ||
|  Christian Brauner | cad6967ac1 | fork: introduce kernel_clone() The old _do_fork() helper doesn't follow naming conventions of in-kernel
helpers for syscalls. The process creation cleanup in [1] didn't change the
name to something more reasonable mainly because _do_fork() was used in quite a
few places. So sending this as a separate series seemed the better strategy.
This commit does two things:
1. renames _do_fork() to kernel_clone() but keeps _do_fork() as a simple static
   inline wrapper around kernel_clone().
2. Changes the return type from long to pid_t. This aligns kernel_thread() and
   kernel_clone(). Also, the return value from kernel_clone that is surfaced in
   fork(), vfork(), clone(), and clone3() is taken from pid_vrn() which returns
   a pid_t too.
Follow-up patches will switch each caller of _do_fork() and each place where it
is referenced over to kernel_clone(). After all these changes are done, we can
remove _do_fork() completely and will only be left with kernel_clone().
[1]:  | ||
|  Linus Torvalds | 97d052ea3f | A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various
     situations caused by the lockdep additions to seqcount to validate that
     the write side critical sections are non-preemptible.
 
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
 
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict per
     CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
     validate that the lock is held.
 
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored and
     write_seqcount_begin() has a lockdep assertion to validate that the
     lock is held.
 
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API is
     unchanged and determines the type at compile time with the help of
     _Generic which is possible now that the minimal GCC version has been
     moved up.
 
     Adding this lockdep coverage unearthed a handful of seqcount bugs which
     have been addressed already independent of this.
 
     While generaly useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if the
     writers are serialized by an associated lock, which leads to the well
     known reader preempts writer livelock. RT prevents this by storing the
     associated lock pointer independent of lockdep in the seqcount and
     changing the reader side to block on the lock when a reader detects
     that a writer is in the write side critical section.
 
  - Conversion of seqcount usage sites to associated types and initializers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
 QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
 KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
 shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
 n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
 F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
 DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
 AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
 81ppn2KlvzUK8g==
 =7Gj+
 -----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:
   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.
     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.
     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.
   - Conversion of seqcount usage sites to associated types and
     initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
  x86/headers: Remove APIC headers from <asm/smp.h>
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ... | ||
|  Andrey Konovalov | 8dcc1d3466 | kasan: don't tag stacks allocated with pagealloc Patch series "kasan: support stack instrumentation for tag-based mode", v2. This patch (of 5): Prepare Software Tag-Based KASAN for stack tagging support. With Tag-Based KASAN when kernel stacks are allocated via pagealloc (which happens when CONFIG_VMAP_STACK is not enabled), they get tagged. KASAN instrumentation doesn't expect the sp register to be tagged, and this leads to false-positive reports. Fix by resetting the tag of kernel stack pointers after allocation. Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Cc: Elena Petrova <lenaptr@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Link: http://lkml.kernel.org/r/cover.1596199677.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/cover.1596544734.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/12d8c678869268dd0884b01271ab592f30792abf.1596544734.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/01c678b877755bcf29009176592402cdf6f2cb15.1596199677.git.andreyknvl@google.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=203497 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Shakeel Butt | 991e767385 | mm: memcontrol: account kernel stack per node Currently the kernel stack is being accounted per-zone. There is no need to do that. In addition due to being per-zone, memcg has to keep a separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of node_stat_item. In addition localize the kernel stack stats updates to account_kernel_stack(). Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 4f30a60aa7 | close-range-v5.9 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcpgAKCRCRxhvAZXjc
 ogPeAQDv1ncqtNroFAC4pJ4tQhH7JSjW0OltiMk/AocY/J2SdQD9GJ15luYJ0/om
 697q/Z68sndRynhdoZlMuf3oYuBlHQw=
 =3ZhE
 -----END PGP SIGNATURE-----
Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() implementation from Christian Brauner:
 "This adds the close_range() syscall. It allows to efficiently close a
  range of file descriptors up to all file descriptors of a calling
  task.
  This is coordinated with the FreeBSD folks which have copied our
  version of this syscall and in the meantime have already merged it in
  April 2019:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
  The syscall originally came up in a discussion around the new mount
  API and making new file descriptor types cloexec by default. During
  this discussion, Al suggested the close_range() syscall.
  First, it helps to close all file descriptors of an exec()ing task.
  This can be done safely via (quoting Al's example from [1] verbatim):
        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);
  The code snippet above is one way of working around the problem that
  file descriptors are not cloexec by default. This is aggravated by the
  fact that we can't just switch them over without massively regressing
  userspace. For a whole class of programs having an in-kernel method of
  closing all file descriptors is very helpful (e.g. demons, service
  managers, programming language standard libraries, container managers
  etc.).
  Second, it allows userspace to avoid implementing closing all file
  descriptors by parsing through /proc/<pid>/fd/* and calling close() on
  each file descriptor and other hacks. From looking at various
  large(ish) userspace code bases this or similar patterns are very
  common in service managers, container runtimes, and programming
  language runtimes/standard libraries such as Python or Rust.
  In addition, the syscall will also work for tasks that do not have
  procfs mounted and on kernels that do not have procfs support compiled
  in. In such situations the only way to make sure that all file
  descriptors are closed is to call close() on each file descriptor up
  to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.
  Based on Linus' suggestion close_range() also comes with a new flag
  CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
  right before exec. This would usually be expressed in the sequence:
        unshare(CLONE_FILES);
        close_range(3, ~0U);
  as pointed out by Linus it might be desirable to have this be a part
  of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
  gets especially handy when we're closing all file descriptors above a
  certain threshold.
  Test-suite as always included"
* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLOSE_RANGE_UNSHARE tests
  close_range: add CLOSE_RANGE_UNSHARE
  tests: add close_range() tests
  arch: wire-up close_range()
  open: add close_range() | ||
|  Linus Torvalds | 9ba27414f2 | fork-v5.9 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXyge/QAKCRCRxhvAZXjc
 oildAQCCWpnTeXm6hrIE3VZ36X5npFtbaEthdBVAUJM7mo0FYwEA8+Wbnubg6jCw
 mztkXCnTfU7tApUdhKtQzcpEws45/Qk=
 =REE/
 -----END PGP SIGNATURE-----
Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fork cleanups from Christian Brauner:
 "This is cleanup series from when we reworked a chunk of the process
  creation paths in the kernel and switched to struct
  {kernel_}clone_args.
  High-level this does two main things:
   - Remove the double export of both do_fork() and _do_fork() where
     do_fork() used the incosistent legacy clone calling convention.
     Now we only export _do_fork() which is based on struct
     kernel_clone_args.
   - Remove the copy_thread_tls()/copy_thread() split making the
     architecture specific HAVE_COYP_THREAD_TLS config option obsolete.
  This switches all remaining architectures to select
  HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
  convention. The current split makes the process creation codepaths
  more convoluted than they need to be. Each architecture has their own
  copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
  has a copy_thread_tls() function.
  The split is not needed anymore nowadays, all architectures support
  CLONE_SETTLS but quite a few of them never bothered to select
  HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
  and use the old calling convention. Removing this split cleans up the
  process creation codepaths and paves the way for implementing clone3()
  on such architectures since it requires the copy_thread_tls() calling
  convention.
  After having made each architectures support copy_thread_tls() this
  series simply renames that function back to copy_thread(). It also
  switches all architectures that call do_fork() directly over to
  _do_fork() and the struct kernel_clone_args calling convention. This
  is a corollary of switching the architectures that did not yet support
  it over to copy_thread_tls() since do_fork() is conditional on not
  supporting copy_thread_tls() (Mostly because it lacks a separate
  argument for tls which is trivial to fix but there's no need for this
  function to exist.).
  The do_fork() removal is in itself already useful as it allows to to
  remove the export of both do_fork() and _do_fork() we currently have
  in favor of only _do_fork(). This has already been discussed back when
  we added clone3(). The legacy clone() calling convention is - as is
  probably well-known - somewhat odd:
    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3
  that is aggravated by the fact that some architectures such as sparc
  follow the CLONE_BACKWARDSx calling convention but don't really select
  the corresponding config option since they call do_fork() directly.
  So do_fork() enforces a somewhat arbitrary calling convention in the
  first place that doesn't really help the individual architectures that
  deviate from it. They can thus simply be switched to _do_fork()
  enforcing a single calling convention. (I really hope that any new
  architectures will __not__ try to implement their own calling
  conventions...)
  Most architectures already have made a similar switch (m68k comes to
  mind).
  Overall this removes more code than it adds even with a good portion
  of added comments. It simplifies a chunk of arch specific assembly
  either by moving the code into C or by simply rewriting the assembly.
  Architectures that have been touched in non-trivial ways have all been
  actually boot and stress tested: sparc and ia64 have been tested with
  Debian 9 images. They are the two architectures which have been
  touched the most. All non-trivial changes to architectures have seen
  acks from the relevant maintainers. nios2 with a custom built
  buildroot image. h8300 I couldn't get something bootable to test on
  but the changes have been fairly automatic and I'm sure we'll hear
  people yell if I broke something there.
  All other architectures that have been touched in trivial ways have
  been compile tested for each single patch of the series via git rebase
  -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
  even though they have just been trivially touched (removal of the
  HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
  basically "core architectures" and since it is trivial to get your
  hands on a useable image"
* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: rename copy_thread_tls() back to copy_thread()
  arch: remove HAVE_COPY_THREAD_TLS
  unicore: switch to copy_thread_tls()
  sh: switch to copy_thread_tls()
  nds32: switch to copy_thread_tls()
  microblaze: switch to copy_thread_tls()
  hexagon: switch to copy_thread_tls()
  c6x: switch to copy_thread_tls()
  alpha: switch to copy_thread_tls()
  fork: remove do_fork()
  h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  sparc: unconditionally enable HAVE_COPY_THREAD_TLS
  sparc: share process creation helpers between sparc and sparc64
  sparc64: enable HAVE_COPY_THREAD_TLS
  fork: fold legacy_clone_args_valid() into _do_fork() | ||
|  Linus Torvalds | 3950e97543 | Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "During the development of v5.7 I ran into bugs and quality of implementation issues related to exec that could not be easily fixed because of the way exec is implemented. So I have been diggin into exec and cleaning up what I can. This cycle I have been looking at different ideas and different implementations to see what is possible to improve exec, and cleaning the way exec interfaces with in kernel users. Only cleaning up the interfaces of exec with rest of the kernel has managed to stabalize and make it through review in time for v5.9-rc1 resulting in 2 sets of changes this cycle. - Implement kernel_execve - Make the user mode driver code a better citizen With kernel_execve the code size got a little larger as the copying of parameters from userspace and copying of parameters from userspace is now separate. The good news is kernel threads no longer need to play games with set_fs to use exec. Which when combined with the rest of Christophs set_fs changes should security bugs with set_fs much more difficult" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) exec: Implement kernel_execve exec: Factor bprm_stack_limits out of prepare_arg_pages exec: Factor bprm_execve out of do_execve_common exec: Move bprm_mm_init into alloc_bprm exec: Move initialization of bprm->filename into alloc_bprm exec: Factor out alloc_bprm exec: Remove unnecessary spaces from binfmts.h umd: Stop using split_argv umd: Remove exit_umh bpfilter: Take advantage of the facilities of struct pid exit: Factor thread_group_exited out of pidfd_poll umd: Track user space drivers with struct pid bpfilter: Move bpfilter_umh back into init data exec: Remove do_execve_file umh: Stop calling do_execve_file umd: Transform fork_usermode_blob into fork_usermode_driver umd: Rename umd_info.cmdline umd_info.driver_name umd: For clarity rename umh_info umd_info umh: Separate the user mode driver and the user mode helper support umh: Remove call_usermodehelper_setup_file. ... | ||
|  Linus Torvalds | 9ecc6ea491 | seccomp updates for v5.9-rc1 - Improved selftest coverage, timeouts, and reporting
 - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
 - Refactor __scm_install_fd() into __receive_fd() and fix buggy callers
 - Introduce "addfd" command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oZcQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJomDD/4x3j7eXREcXDsHOmlgEaHWGx4l
 JldHFQhV5GjmD7gOkPcoZSG7NfG7F6VpwAJg7ZoR3qUkem7K8DFucxqgo1RldCot
 nigleeLX6JeMS0Z+iwjAVZd+5t4xG4J/7GGDHIIMiG5qvwJ0Yf64o1bkjaB2Q/Bv
 tluBg0WF32kFMG/ZwyY/V2QDbbue97CFPflybOh1o2nWbVzmUlFEEum3UUvZsxc8
 smMsattJyuAV7kcEKzKrs8b010NdFZqwdbub5Np9W3XEXGBYMdIPoNsOQGmB9wby
 j2ui0lzboXRG997jM7TCd1l/XZAv8aAwvPplw3FJRybzkOGs9NDyLMoz87yJpR1T
 xp511vnMyMbyKIGdungkt7cIyzaictHwaYzznsmuNdCPEjTaIQJr1ctsa4GEgtqf
 pnkktZ9YbMCcHU0CtZ8GlOVqA9wE+FUm0/u0zgikzJQsB+HcNItiARTTTHRyco7p
 VJCqK8o4Zx4ELV7QNkSH4nhFkVgRopvrvBiPAGro/qwGOofBg8W8wM8O1+V/MDmp
 zSU22v4SncT1Xb7dtmdJqDEeHfDikhaCAb4Je2hsGQWzbdAqwHGlpa7vpk9x3Q5r
 L+XyP+Z+rPHlXYyypJwUvvOQhXOmP0zYxcEHxByqIBfXiwy+3dN4tDDfatWbccwl
 uTlTDM8kmQn6QzSztA==
 =yb55
 -----END PGP SIGNATURE-----
Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
 "There are a bunch of clean ups and selftest improvements along with
  two major updates to the SECCOMP_RET_USER_NOTIF filter return:
  EPOLLHUP support to more easily detect the death of a monitored
  process, and being able to inject fds when intercepting syscalls that
  expect an fd-opening side-effect (needed by both container folks and
  Chrome). The latter continued the refactoring of __scm_install_fd()
  started by Christoph, and in the process found and fixed a handful of
  bugs in various callers.
   - Improved selftest coverage, timeouts, and reporting
   - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
   - Refactor __scm_install_fd() into __receive_fd() and fix buggy
     callers
   - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
     Dhillon)"
* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
  seccomp: Introduce addfd ioctl to seccomp user notifier
  fs: Expand __receive_fd() to accept existing fd
  pidfd: Replace open-coded receive_fd()
  fs: Add receive_fd() wrapper for __receive_fd()
  fs: Move __scm_install_fd() to __receive_fd()
  net/scm: Regularize compat handling of scm_detach_fds()
  pidfd: Add missing sock updates for pidfd_getfd()
  net/compat: Add missing sock updates for SCM_RIGHTS
  selftests/seccomp: Check ENOSYS under tracing
  selftests/seccomp: Refactor to use fixture variants
  selftests/harness: Clean up kern-doc for fixtures
  seccomp: Use -1 marker for end of mode 1 syscall list
  seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
  selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
  selftests/seccomp: Make kcmp() less required
  seccomp: Use pr_fmt
  selftests/seccomp: Improve calibration loop
  selftests/seccomp: use 90s as timeout
  selftests/seccomp: Expand benchmark to per-filter measurements
  ... | ||
|  Linus Torvalds | e4cbce4d13 | The main changes in this cycle were: - Improve uclamp performance by using a static key for the fast path
 
  - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)
 
  - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the values
    become larger. This is now replaced with more precise arithmetics,
    using the new mul_u64_u64_div_u64() helper in math64.h.
 
  - Improve the deadline scheduler, such as making it capacity aware
 
  - Improve frequency-invariant scheduling
 
  - Misc cleanups in energy/power aware scheduling
 
  - Add sched_update_nr_running tracepoint to track changes to nr_running
 
  - Documentation additions and updates
 
  - Misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oJDURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ixLg//bqWzFlfWirvngTgDxDnplwUTyKXmMCcq
 R1IYhlyK2O5FxvhbRmdmW11W3yzyTPvgCs6Q/70negGaPNe2w1OxfxiK9NMKz5eu
 M1LoXas7pL5g7Pr/ZxxHk/8VqJLV4t9MkodiiInmV6lTaznT3sU6a/kpYQjJyFnG
 Tuu9jd6JhdRKmePDJnNmUBoGQ7JiOQDcX4HtkcQ3OA+An3624tmJzbW1yts+uj7J
 ZWo2EY60RfbA9MxQXGPOaR/nAjngWs4Q6tddAh10mftsPq1gR2iFUKju1d31MQt/
 RHLdiqJf+AyUC4popKG7a+7ilCKMBwPociSreTJNPyEUQ1X4AM3vUVk4yjUoiDph
 k2WdsCF8/JRdhXg0NnrpPUqOaAbQj53EeXnitEb92E7WyTZgLOvAtpV//xZo6utp
 2QHerfrQ9SoGQjz/ho78za5vQtV1x25yDhd+X4XV4QEhIy85G9/2JCpC/Kc/TXLf
 OO7A4X69XztKTEJhP60g8ldCPUe4N2vbh1vKY6oAD8AFQVVNZ6n7375/Qa//b0/k
 ++hcYkPc2EK97/aBFdvzDgqb7aUo7Mtn2ibke16sQU4szulaoRuAHQG4jdGKMwbD
 dk2VBoxyxeYFXWHsNneSe87+ha3sd0dSN0ul1EB/SlFrVELMvy634YXnMYGW8ima
 PzyPB0ezpuA=
 =PbO7
 -----END PGP SIGNATURE-----
Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 - Improve uclamp performance by using a static key for the fast path
 - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
   better power efficiency of RT tasks on battery powered devices.
   (The default is to maximize performance & reduce RT latencies.)
 - Improve utime and stime tracking accuracy, which had a fixed boundary
   of error, which created larger and larger relative errors as the
   values become larger. This is now replaced with more precise
   arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.
 - Improve the deadline scheduler, such as making it capacity aware
 - Improve frequency-invariant scheduling
 - Misc cleanups in energy/power aware scheduling
 - Add sched_update_nr_running tracepoint to track changes to nr_running
 - Documentation additions and updates
 - Misc cleanups and smaller fixes
* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
  sched/doc: Document capacity aware scheduling
  sched: Document arch_scale_*_capacity()
  arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Add a new sysctl to control RT default boost value
  sched/uclamp: Fix a deadlock when enabling uclamp static key
  sched: Remove duplicated tick_nohz_full_enabled() check
  sched: Fix a typo in a comment
  sched/uclamp: Remove unnecessary mutex_init()
  arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
  sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
  arch_topology, sched/core: Cleanup thermal pressure definition
  trace/events/sched.h: fix duplicated word
  linux/sched/mm.h: drop duplicated words in comments
  smp: Fix a potential usage of stale nr_cpus
  sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
  sched: nohz: stop passing around unused "ticks" parameter.
  sched: Better document ttwu()
  sched: Add a tracepoint to track rq->nr_running
  ... | ||
|  Ingo Molnar | 63722bbca6 | Merge branch 'kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core Pull v5.9 KCSAN bits from Paul E. McKenney. Perhaps the most important change is that GCC 11 now has all fixes in place to support KCSAN, so GCC support can be enabled again. Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Marco Elver | 0584df9c12 | lockdep: Refactor IRQ trace events fields into struct Refactor the IRQ trace events fields, used for printing information about the IRQ trace events, into a separate struct 'irqtrace_events'. This improves readability by separating the information only used in reporting, as well as enables (simplified) storing/restoring of irqtrace_events snapshots. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Ahmed S. Darwish | b75058614f | sched: tasks: Use sequence counter with associated spinlock A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de | ||
|  Qais Yousef | 13685c4a08 | sched/uclamp: Add a new sysctl to control RT default boost value RT tasks by default run at the highest capacity/performance level. When uclamp is selected this default behavior is retained by enforcing the requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum value. This is also referred to as 'the default boost value of RT tasks'. See commit | ||
|  Christian Brauner | 3a15fb6ed9 | seccomp: release filter after task is fully dead The seccomp filter used to be released in free_task() which is called asynchronously via call_rcu() and assorted mechanisms. Since we need to inform tasks waiting on the seccomp notifier when a filter goes empty we will notify them as soon as a task has been marked fully dead in release_task(). To not split seccomp cleanup into two parts, move filter release out of free_task() and into release_task() after we've unhashed struct task from struct pid, exited signals, and unlinked it from the threadgroups' thread list. We'll put the empty filter notification infrastructure into it in a follow up patch. This also renames put_seccomp_filter() to seccomp_filter_release() which is a more descriptive name of what we're doing here especially once we've added the empty filter notification mechanism in there. We're also NULL-ing the task's filter tree entrypoint which seems cleaner than leaving a dangling pointer in there. Note that this shouldn't need any memory barriers since we're calling this when the task is in release_task() which means it's EXIT_DEAD. So it can't modify its seccomp filters anymore. You can also see this from the point where we're calling seccomp_filter_release(). It's after __exit_signal() and at this point, tsk->sighand will already have been NULLed which is required for thread-sync and filter installation alike. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org> | ||
|  Peter Zijlstra | a21ee6055c | lockdep: Change hardirq{s_enabled,_context} to per-cpu variables Currently all IRQ-tracking state is in task_struct, this means that task_struct needs to be defined before we use it. Especially for lockdep_assert_irq*() this can lead to header-hell. Move the hardirq state into per-cpu variables to avoid the task_struct dependency. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org | ||
|  Eric W. Biederman | 38fd525a4c | exit: Factor thread_group_exited out of pidfd_poll Create an independent helper thread_group_exited which returns true when all threads have passed exit_notify in do_exit. AKA all of the threads are at least zombies and might be dead or completely gone. Create this helper by taking the logic out of pidfd_poll where it is already tested, and adding a READ_ONCE on the read of task->exit_state. I will be changing the user mode driver code to use this same logic to know when a user mode driver needs to be restarted. Place the new helper thread_group_exited in kernel/exit.c and EXPORT it so it can be used by modules. Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Christian Brauner | 714acdbd1c | arch: rename copy_thread_tls() back to copy_thread() Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls() back simply copy_thread(). It's a simpler name, and doesn't imply that only tls is copied here. This finishes an outstanding chunk of internal process creation work since we've added clone3(). Cc: linux-arch@vger.kernel.org Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>A Acked-by: Stafford Horne <shorne@gmail.com> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>A Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Christian Brauner | 140c8180eb | arch: remove HAVE_COPY_THREAD_TLS All architectures support copy_thread_tls() now, so remove the legacy copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone uses the same process creation calling convention based on copy_thread_tls() and struct kernel_clone_args. This will make it easier to maintain the core process creation code under kernel/, simplifies the callpaths and makes the identical for all architectures. Cc: linux-arch@vger.kernel.org Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Christian Brauner | ff2a91127b | fork: remove do_fork() Now that all architectures have been switched to use _do_fork() and the new struct kernel_clone_args calling convention we can remove the legacy do_fork() helper completely. The calling convention used to be brittle and do_fork() didn't buy us anything. The only calling convention accepted should be based on struct kernel_clone_args going forward. It's cleaner and uniform. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Qian Cai | cda099b37d | fork: Annotate a data race in vm_area_dup() struct vm_area_struct could be accessed concurrently as noticed by KCSAN, write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35: vma_interval_tree_insert+0x101/0x150: rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58 (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23 __vma_link_file+0x6e/0xe0 __vma_link_file at mm/mmap.c:629 vma_link+0xa2/0x120 mmap_region+0x753/0xb90 do_mmap+0x45c/0x710 vm_mmap_pgoff+0xc0/0x130 ksys_mmap_pgoff+0x1d1/0x300 __x64_sys_mmap+0x33/0x40 do_syscall_64+0x91/0xc44 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122: vm_area_dup+0x6a/0xe0 vm_area_dup at kernel/fork.c:362 __split_vma+0x72/0x2a0 __split_vma at mm/mmap.c:2661 split_vma+0x5a/0x80 mprotect_fixup+0x368/0x3f0 do_mprotect_pkey+0x263/0x420 __x64_sys_mprotect+0x51/0x70 do_syscall_64+0x91/0xc44 entry_SYSCALL_64_after_hwframe+0x49/0xbe vm_area_dup() blindly copies all fields of original VMA to the new one. This includes coping vm_area_struct::shared.rb which is normally protected by i_mmap_lock. But this is fine because the read value will be overwritten on the following __vma_link_file() under proper protection. Thus, mark it as an intentional data race and insert a few assertions for the fields that should not be modified concurrently. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> | ||
|  Weilong Chen | c17d1a3a8e | fork: annotate data race in copy_process() KCSAN reported data race reading and writing nr_threads and max_threads. The data race is intentional and benign. This is obvious from the comment above it and based on general consensus when discussing this issue. So there's no need for any heavy atomic or *_ONCE() machinery here. In accordance with the newly introduced data_race() annotation consensus, mark the offending line with data_race(). Here it's actually useful not just to silence KCSAN but to also clearly communicate that the race is intentional. This is especially helpful since nr_threads is otherwise protected by tasklist_lock. BUG: KCSAN: data-race in copy_process / copy_process write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1: copy_process+0x2eba/0x3c40 kernel/fork.c:2273 _do_fork+0xfe/0x7a0 kernel/fork.c:2421 __do_sys_clone kernel/fork.c:2576 [inline] __se_sys_clone kernel/fork.c:2557 [inline] __x64_sys_clone+0x130/0x170 kernel/fork.c:2557 do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0: copy_process+0x94d/0x3c40 kernel/fork.c:1954 _do_fork+0xfe/0x7a0 kernel/fork.c:2421 __do_sys_clone kernel/fork.c:2576 [inline] __se_sys_clone kernel/fork.c:2557 [inline] __x64_sys_clone+0x130/0x170 kernel/fork.c:2557 do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-mo deration/thvp7AHs5Ew/aPdYLXfYBQAJ Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Weilong Chen <chenweilong@huawei.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Qian Cai <cai@lca.pw> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Marco Elver <elver@google.com> [christian.brauner@ubuntu.com: rewrite commit message] Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Christian Brauner | 3af8588c77 | fork: fold legacy_clone_args_valid() into _do_fork() This separate helper only existed to guarantee the mutual exclusivity of CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD abuses the parent_tid field to return the pidfd. But we can actually handle this uniformely thus removing the helper. For legacy clone we can detect that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID because they will share the same memory which is invalid and for clone3() setting the separate pidfd and parent_tid fields to the same memory is bogus as well. So fold that helper directly into _do_fork() by detecting this case. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: linux-m68k@lists.linux-m68k.org Cc: x86@kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Christian Brauner | 60997c3d45 | close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.
The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Michel Lespinasse | aaa2cc56c1 | mmap locking API: convert nested write lock sites Add API for nested write locks and convert the few call sites doing that. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-7-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michel Lespinasse | d8ed45c5dc | mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | e31cf2f4ca | mm: don't include asm/pgtable.h if linux/mm.h is already included Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 9ff7258575 | Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc updates from Eric Biederman: "This has four sets of changes: - modernize proc to support multiple private instances - ensure we see the exit of each process tid exactly - remove has_group_leader_pid - use pids not tasks in posix-cpu-timers lookup Alexey updated proc so each mount of proc uses a new superblock. This allows people to actually use mount options with proc with no fear of messing up another mount of proc. Given the kernel's internal mounts of proc for things like uml this was a real problem, and resulted in Android's hidepid mount options being ignored and introducing security issues. The rest of the changes are small cleanups and fixes that came out of my work to allow this change to proc. In essence it is swapping the pids in de_thread during exec which removes a special case the code had to handle. Then updating the code to stop handling that special case" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: proc_pid_ns takes super_block as an argument remove the no longer needed pid_alive() check in __task_pid_nr_ns() posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type posix-cpu-timers: Extend rcu_read_lock removing task_struct references signal: Remove has_group_leader_pid exec: Remove BUG_ON(has_group_leader_pid) posix-cpu-timer: Unify the now redundant code in lookup_task posix-cpu-timer: Tidy up group_leader logic in lookup_task proc: Ensure we see the exit of each process tid exactly once rculist: Add hlists_swap_heads_rcu proc: Use PIDTYPE_TGID in next_tgid Use proc_pid_ns() to get pid_namespace from the proc superblock proc: use named enums for better readability proc: use human-readable values for hidepid docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior proc: add option to mount only a pids subset proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option proc: allow to mount many instances of proc in one pid namespace proc: rename struct proc_fs_info to proc_fs_opts | ||
|  Linus Torvalds | 533b220f7b | arm64 updates for 5.8 - Branch Target Identification (BTI)
 	* Support for ARMv8.5-BTI in both user- and kernel-space. This
 	  allows branch targets to limit the types of branch from which
 	  they can be called and additionally prevents branching to
 	  arbitrary code, although kernel support requires a very recent
 	  toolchain.
 
 	* Function annotation via SYM_FUNC_START() so that assembly
 	  functions are wrapped with the relevant "landing pad"
 	  instructions.
 
 	* BPF and vDSO updates to use the new instructions.
 
 	* Addition of a new HWCAP and exposure of BTI capability to
 	  userspace via ID register emulation, along with ELF loader
 	  support for the BTI feature in .note.gnu.property.
 
 	* Non-critical fixes to CFI unwind annotations in the sigreturn
 	  trampoline.
 
 - Shadow Call Stack (SCS)
 	* Support for Clang's Shadow Call Stack feature, which reserves
 	  platform register x18 to point at a separate stack for each
 	  task that holds only return addresses. This protects function
 	  return control flow from buffer overruns on the main stack.
 
 	* Save/restore of x18 across problematic boundaries (user-mode,
 	  hypervisor, EFI, suspend, etc).
 
 	* Core support for SCS, should other architectures want to use it
 	  too.
 
 	* SCS overflow checking on context-switch as part of the existing
 	  stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
 
 - CPU feature detection
 	* Removed numerous "SANITY CHECK" errors when running on a system
 	  with mismatched AArch32 support at EL1. This is primarily a
 	  concern for KVM, which disabled support for 32-bit guests on
 	  such a system.
 
 	* Addition of new ID registers and fields as the architecture has
 	  been extended.
 
 - Perf and PMU drivers
 	* Minor fixes and cleanups to system PMU drivers.
 
 - Hardware errata
 	* Unify KVM workarounds for VHE and nVHE configurations.
 
 	* Sort vendor errata entries in Kconfig.
 
 - Secure Monitor Call Calling Convention (SMCCC)
 	* Update to the latest specification from Arm (v1.2).
 
 	* Allow PSCI code to query the SMCCC version.
 
 - Software Delegated Exception Interface (SDEI)
 	* Unexport a bunch of unused symbols.
 
 	* Minor fixes to handling of firmware data.
 
 - Pointer authentication
 	* Add support for dumping the kernel PAC mask in vmcoreinfo so
 	  that the stack can be unwound by tools such as kdump.
 
 	* Simplification of key initialisation during CPU bringup.
 
 - BPF backend
 	* Improve immediate generation for logical and add/sub
 	  instructions.
 
 - vDSO
 	- Minor fixes to the linker flags for consistency with other
 	  architectures and support for LLVM's unwinder.
 
 	- Clean up logic to initialise and map the vDSO into userspace.
 
 - ACPI
 	- Work around for an ambiguity in the IORT specification relating
 	  to the "num_ids" field.
 
 	- Support _DMA method for all named components rather than only
 	  PCIe root complexes.
 
 	- Minor other IORT-related fixes.
 
 - Miscellaneous
 	* Initialise debug traps early for KGDB and fix KDB cacheflushing
 	  deadlock.
 
 	* Minor tweaks to early boot state (documentation update, set
 	  TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
 yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
 JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
 HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
 znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
 =w3qi
 -----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.
  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support
  Branch Target Identification (BTI):
   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.
   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.
   - BPF and vDSO updates to use the new instructions.
   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.
   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.
  Shadow Call Stack (SCS):
   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.
   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).
   - Core support for SCS, should other architectures want to use it
     too.
   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
  CPU feature detection:
   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.
   - Addition of new ID registers and fields as the architecture has
     been extended.
  Perf and PMU drivers:
   - Minor fixes and cleanups to system PMU drivers.
  Hardware errata:
   - Unify KVM workarounds for VHE and nVHE configurations.
   - Sort vendor errata entries in Kconfig.
  Secure Monitor Call Calling Convention (SMCCC):
   - Update to the latest specification from Arm (v1.2).
   - Allow PSCI code to query the SMCCC version.
  Software Delegated Exception Interface (SDEI):
   - Unexport a bunch of unused symbols.
   - Minor fixes to handling of firmware data.
  Pointer authentication:
   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.
   - Simplification of key initialisation during CPU bringup.
  BPF backend:
   - Improve immediate generation for logical and add/sub instructions.
  vDSO:
   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.
   - Clean up logic to initialise and map the vDSO into userspace.
  ACPI:
   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.
   - Support _DMA method for all named components rather than only PCIe
     root complexes.
   - Minor other IORT-related fixes.
  Miscellaneous:
   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.
   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
   - Refactoring and cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ... | ||
|  Linus Torvalds | 2227e5b21a | The RCU updates for this cycle were: - RCU-tasks update, including addition of RCU Tasks Trace for
    BPF use and TASKS_RUDE_RCU
  - kfree_rcu() updates.
  - Remove scheduler locking restriction
  - RCU CPU stall warning updates.
  - Torture-test updates.
  - Miscellaneous fixes and other updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7U/r0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hSNxAAirKhPGBoLI9DW1qde4OFhZg+BlIpS+LD
 IE/0eGB8hGwhb1793RGbzIJfSnRQpSOPxWbWc6DJZ4Zpi5/ZbVkiPKsuXpM1xGxs
 kuBCTOhWy1/p3iCZ1JH/JCrCAdWGZkIzEoaV7ipnHtV/+UrRbCWH5PB7R0fYvcbI
 q5bUcWJyEp/bYMxQn8DhAih6SLPHx+F9qaGAqqloLSHstTYG2HkBhBGKnqcd/Jex
 twkLK53poCkeP/c08V1dyagU2IRWj2jGB1NjYh/Ocm+Sn/vru15CVGspjVjqO5FF
 oq07lad357ddMsZmKoM2F5DhXbOh95A+EqF9VDvIzCvfGMUgqYI1oxWF4eycsGhg
 /aYJgYuN23YeEe2DkDzJB67GvBOwl4WgdoFaxKRzOiCSfrhkM8KqM4G9Fz1JIepG
 abRJCF85iGcLslU9DkrShQiDsd/CRPzu/jz6ybK0I2II2pICo6QRf76T7TdOvKnK
 yXwC6OdL7/dwOht20uT6XfnDXMCWI4MutiUrb8/C1DbaihwEaI2denr3YYL+IwrB
 B38CdP6sfKZ5UFxKh0xb+sOzWrw0KA+ThSAXeJhz3tKdxdyB6nkaw3J9lFg8oi20
 XGeAujjtjMZG5cxt2H+wO9kZY0RRau/nTqNtmmRrCobd5yJjHHPHH8trEd0twZ9A
 X5Wjh11lv3E=
 =Yisx
 -----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The RCU updates for this cycle were:
   - RCU-tasks update, including addition of RCU Tasks Trace for BPF use
     and TASKS_RUDE_RCU
   - kfree_rcu() updates.
   - Remove scheduler locking restriction
   - RCU CPU stall warning updates.
   - Torture-test updates.
   - Miscellaneous fixes and other updates"
* tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  rcu: Allow for smp_call_function() running callbacks from idle
  rcu: Provide rcu_irq_exit_check_preempt()
  rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
  rcu: Provide __rcu_is_watching()
  rcu: Provide rcu_irq_exit_preempt()
  rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  rcu/tree: Mark the idle relevant functions noinstr
  x86: Replace ist_enter() with nmi_enter()
  x86/mce: Send #MC singal from task work
  x86/entry: Get rid of ist_begin/end_non_atomic()
  sched,rcu,tracing: Avoid tracing before in_nmi() is correct
  sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
  lockdep: Always inline lockdep_{off,on}()
  hardirq/nmi: Allow nested nmi_enter()
  arm64: Prepare arch_nmi_enter() for recursion
  printk: Disallow instrumenting print_nmi_enter()
  printk: Prepare for nested printk_nmi_enter()
  rcutorture: Convert ULONG_CMP_LT() to time_before()
  torture: Add a --kasan argument
  torture: Save a few lines by using config_override_param initially
  ... | ||
|  Alexey Gladkov | 9d78edeaec | proc: proc_pid_ns takes super_block as an argument syzbot found that touch /proc/testfile causes NULL pointer dereference at tomoyo_get_local_path() because inode of the dentry is NULL. Before | ||
|  Sami Tolvanen | d08b9f0ca6 | scs: Add support for Clang's Shadow Call Stack (SCS) This change adds generic support for Clang's Shadow Call Stack, which uses a shadow stack to protect return addresses from being overwritten by an attacker. Details are available here: https://clang.llvm.org/docs/ShadowCallStack.html Note that security guarantees in the kernel differ from the ones documented for user space. The kernel must store addresses of shadow stacks in memory, which means an attacker capable reading and writing arbitrary memory may be able to locate them and hijack control flow by modifying the stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [will: Numerous cosmetic changes] Signed-off-by: Will Deacon <will@kernel.org> | ||
|  Christian Brauner | 3f2c788a13 | fork: prevent accidental access to clone3 features Jan reported an issue where an interaction between sign-extending clone's
flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes
clone() to consistently fail with EBADF.
The whole story is a little longer. The legacy clone() syscall is odd in a
bunch of ways and here two things interact. First, legacy clone's flag
argument is word-size dependent, i.e. it's an unsigned long whereas most
system calls with flag arguments use int or unsigned int. Second, legacy
clone() ignores unknown and deprecated flags. The two of them taken
together means that users on 64bit systems can pass garbage for the upper
32bit of the clone() syscall since forever and things would just work fine.
Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed
and on v5.7-rc1 where this will fail with EBADF:
int main(int argc, char *argv[])
{
        pid_t pid;
        /* Note that legacy clone() has different argument ordering on
         * different architectures so this won't work everywhere.
         *
         * Only set the upper 32 bits.
         */
        pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                      NULL, NULL, NULL, NULL);
        if (pid < 0)
                exit(EXIT_FAILURE);
        if (pid == 0)
                exit(EXIT_SUCCESS);
        if (wait(NULL) != pid)
                exit(EXIT_FAILURE);
        exit(EXIT_SUCCESS);
}
Since legacy clone() couldn't be extended this was not a problem so far and
nobody really noticed or cared since nothing in the kernel ever bothered to
look at the upper 32 bits.
But once we introduced clone3() and expanded the flag argument in struct
clone_args to 64 bit we opened this can of worms. With the first flag-based
extension to clone3() making use of the upper 32 bits of the flag argument
we've effectively made it possible for the legacy clone() syscall to reach
clone3() only flags. The sign extension scenario is just the odd
corner-case that we needed to figure this out.
The reason we just realized this now and not already when we introduced
CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup
file descriptor has been given. So the sign extension (or the user
accidently passing garbage for the upper 32 bits) caused the
CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it
didn't find a valid cgroup file descriptor.
Let's fix this by always capping the upper 32 bits for all codepaths that
are not aware of clone3() features. This ensures that we can't reach
clone3() only features by accident via legacy clone as with the sign
extension case and also that legacy clone() works exactly like before, i.e.
ignoring any unknown flags.  This solution risks no regressions and is also
pretty clean.
Fixes:  | ||
|  Paul E. McKenney | 276c410448 | rcu-tasks: Split ->trc_reader_need_end This commit splits ->trc_reader_need_end by using the rcu_special union. This change permits readers to check to see if a memory barrier is required without any added overhead in the common case where no such barrier is required. This commit also adds the read-side checking. Later commits will add the machinery to properly set the new ->trc_reader_special.b.need_mb field. This commit also makes rcu_read_unlock_trace_special() tolerate nested read-side critical sections within interrupt and NMI handlers. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> | ||
|  Paul E. McKenney | d5f177d35c | rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks Because RCU does not watch exception early-entry/late-exit, idle-loop, or CPU-hotplug execution, protection of tracing and BPF operations is needlessly complicated. This commit therefore adds a variant of Tasks RCU that: o Has explicit read-side markers to allow finite grace periods in the face of in-kernel loops for PREEMPT=n builds. These markers are rcu_read_lock_trace() and rcu_read_unlock_trace(). o Protects code in the idle loop, exception entry/exit, and CPU-hotplug code paths. In this respect, RCU-tasks trace is similar to SRCU, but with lighter-weight readers. o Avoids expensive read-side instruction, having overhead similar to that of Preemptible RCU. There are of course downsides: o The grace-period code can send IPIs to CPUs, even when those CPUs are in the idle loop or in nohz_full userspace. This is mitigated by later commits. o It is necessary to scan the full tasklist, much as for Tasks RCU. o There is a single callback queue guarded by a single lock, again, much as for Tasks RCU. However, those early use cases that request multiple grace periods in quick succession are expected to do so from a single task, which makes the single lock almost irrelevant. If needed, multiple callback queues can be provided using any number of schemes. Perhaps most important, this variant of RCU does not affect the vanilla flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace readers can operate from idle, offline, and exception entry/exit in no way enables rcu_preempt and rcu_sched readers to do so. The memory ordering was outlined here: https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/ This effort benefited greatly from off-list discussions of BPF requirements with Alexei Starovoitov and Andrii Nakryiko. At least some of the on-list discussions are captured in the Link: tags below. In addition, KCSAN was quite helpful in finding some early bugs. Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/ Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/ Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/ Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andrii Nakryiko <andriin@fb.com> [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ] [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ] [ paulmck: Fix locking issue reported by rcutorture. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> | ||
|  Eugene Syromiatnikov | a966dcfe15 | clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via the offsets of the relevant struct clone_args fields, which makes it rather error-prone, so it probably makes sense to add some compile-time checks for them (including the one that breaks on struct clone_args extension as a reminder to add a relevant size macro and a similar check). Function copy_clone_args_from_user seems to be a good place for such checks. Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Eugene Syromiatnikov | 62173872ca | clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set Passing CLONE_INTO_CGROUP with an under-sized structure (that doesn't properly contain cgroup field) seems like garbage input, especially considering the fact that fd 0 is a valid descriptor. Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200412203123.GA5869@asgard.redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Eugene Syromiatnikov | e82a118f57 | clone3: fix cgroup argument sanity check Checking that cgroup field value of struct clone_args is less than 0
is useless, as it is defined as unsigned 64-bit integer.  Moreover,
it doesn't catch the situations where its higher bits are lost during
the assignment to the cgroup field of the cgroup field of the internal
struct kernel_clone_args (where it is declared as signed 32-bit
integer), so it is still possible to pass garbage there.  A check
against INT_MAX solves both these issues.
Fixes:  | ||
|  Li Xinhai | e39a4b332d | mm: set vm_next and vm_prev to NULL in vm_area_dup() Set ->vm_next and ->vm_prev to NULL to prevent potential misuse from the new duplicated vma. Currently, only in fork path there are misuse for handling anon_vma. No other bugs been revealed with this patch applied. Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Li Xinhai | 93949bb21b | mm: don't prepare anon_vma if vma has VM_WIPEONFORK Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path".
This patchset fixes the misuse of parenet anon_vma, which mainly caused by
child vma's vm_next and vm_prev are left same as its parent after
duplicate vma.  Finally, code reached parent vma's neighbor by referring
pointer of child vma and executed wrong logic.
The first two patches fix relevant issues, and the third patch sets
vm_next and vm_prev to NULL when duplicate vma to prevent potential misuse
in future.
Effects of the first bug is that causes rmap code to check both parent and
child's page table, although a page couldn't be mapped by both parent and
child, because child vma has WIPEONFORK so all pages mapped by child are
'new' and not relevant to parent.
Effects of the second bug is that the relationship of anon_vma of parent
and child are totallyconvoluted.  It would cause 'son', 'grandson', ...,
etc, to share 'parent' anon_vma, which disobey the design rule of reusing
anon_vma (the rule to be followed is that reusing should among vma of same
process, and vma should not gone through fork).
So, both issues should cause unnecessary rmap walking and have unexpected
complexity.
These two issues would not be directly visible, I used debugging code to
check the anon_vma pointers of parent and child when inspecting the
suspicious implementation of issue #2, then find the problem.
This patch (of 3):
In dup_mmap(), anon_vma_prepare() is called for vma has VM_WIPEONFORK, and
parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and
->vm_prev as its parent vma.  That allows anon_vma used by parent been
mistakenly shared by child (find_mergeable_anon_vma() will do this reuse
work).
Besides this issue, call anon_vma_prepare() should be avoided because we
don't copy page for this vma.  Preparing anon_vma will be handled during
fault.
Fixes:  | ||
|  Linus Torvalds | d883600523 | Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Christian extended clone3 so that processes can be spawned into cgroups directly. This is not only neat in terms of semantics but also avoids grabbing the global cgroup_threadgroup_rwsem for migration. - Daniel added !root xattr support to cgroupfs. Userland already uses xattrs on cgroupfs for bookkeeping. This will allow delegated cgroups to support such usages. - Prateek tried to make cpuset hotplug handling synchronous but that led to possible deadlock scenarios. Reverted. - Other minor changes including release_agent_path handling cleanup. * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Document the cpuset_v2_mode mount option Revert "cpuset: Make cpuset hotplug synchronous" cgroupfs: Support user xattrs kernfs: Add option to enable user xattrs kernfs: Add removed_size out param for simple_xattr_set kernfs: kvmalloc xattr value instead of kmalloc cgroup: Restructure release_agent_path handling selftests/cgroup: add tests for cloning into cgroups clone3: allow spawning processes into cgroups cgroup: add cgroup_may_write() helper cgroup: refactor fork helpers cgroup: add cgroup_get_from_file() helper cgroup: unify attach permission checking cpuset: Make cpuset hotplug synchronous cgroup.c: Use built-in RCU list checking kselftest/cgroup: add cgroup destruction test cgroup: Clean up css_set task traversal | ||
|  Linus Torvalds | 6cad420cc6 | Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton:
 "A large amount of MM, plenty more to come.
  Subsystems affected by this patch series:
   - tools
   - kthread
   - kbuild
   - scripts
   - ocfs2
   - vfs
   - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
         sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
         hugetlbfs, hugetlb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
  include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
  mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
  selftests/vm: fix map_hugetlb length used for testing read and write
  mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
  mm/hugetlb.c: clean code by removing unnecessary initialization
  hugetlb_cgroup: add hugetlb_cgroup reservation docs
  hugetlb_cgroup: add hugetlb_cgroup reservation tests
  hugetlb: support file_region coalescing again
  hugetlb_cgroup: support noreserve mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add reservation accounting for private mappings
  mm/hugetlb_cgroup: fix hugetlb_cgroup migration
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add hugetlb_cgroup reservation counter
  hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  mm/memblock.c: remove redundant assignment to variable max_addr
  mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
  mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
  ... | ||
|  Linus Torvalds | d987ca1c6b | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exec/proc updates from Eric Biederman: "This contains two significant pieces of work: the work to sort out proc_flush_task, and the work to solve a deadlock between strace and exec. Fixing proc_flush_task so that it no longer requires a persistent mount makes improvements to proc possible. The removal of the persistent mount solves an old regression that that caused the hidepid mount option to only work on remount not on mount. The regression was found and reported by the Android folks. This further allows Alexey Gladkov's work making proc mount options specific to an individual mount of proc to move forward. The work on exec starts solving a long standing issue with exec that it takes mutexes of blocking userspace applications, which makes exec extremely deadlock prone. For the moment this adds a second mutex with a narrower scope that handles all of the easy cases. Which makes the tricky cases easy to spot. With a little luck the code to solve those deadlocks will be ready by next merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits) signal: Extend exec_id to 64bits pidfd: Use new infrastructure to fix deadlocks in execve perf: Use new infrastructure to fix deadlocks in execve proc: io_accounting: Use new infrastructure to fix deadlocks in execve proc: Use new infrastructure to fix deadlocks in execve kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve kernel: doc: remove outdated comment cred.c mm: docs: Fix a comment in process_vm_rw_core selftests/ptrace: add test cases for dead-locks exec: Fix a deadlock in strace exec: Add exec_update_mutex to replace cred_guard_mutex exec: Move exec_mmap right after de_thread in flush_old_exec exec: Move cleanup of posix timers on exec out of de_thread exec: Factor unshare_sighand out of de_thread and call it separately exec: Only compute current once in flush_old_exec pid: Improve the comment about waiting in zap_pid_ns_processes proc: Remove the now unnecessary internal mount of proc uml: Create a private mount of proc for mconsole uml: Don't consult current to find the proc_mnt in mconsole_proc proc: Use a list of inodes to flush from proc ... | ||
|  Roman Gushchin | f4b00eab50 | mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Roman Gushchin | 8380ce4790 | mm: fork: fix kernel_stack memcg stats for various stack implementations Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg .  In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state().  It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .
Note: This is a special version of the patch created for stable
backports.  It contains code from the following two patches:
  - mm: memcg/slab: introduce mem_cgroup_from_obj()
  - mm: fork: fix kernel_stack memcg stats for various stack implementations
[guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes:  | ||
|  Bernd Edlinger | 3e74fabd39 | exec: Fix a deadlock in strace This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex. This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/ Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Eric W. Biederman | eea9673250 | exec: Add exec_update_mutex to replace cred_guard_mutex The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Madhuparna Bhowmik | 0c282b068e | fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer() Use RCU_INIT_POINTER() instead of rcu_access_pointer() in copy_sighand(). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: edit commit message] Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Christian Brauner | ef2c41cf38 | clone3: allow spawning processes into cgroups This adds support for creating a process in a different cgroup than its parent. Callers can limit and account processes and threads right from the moment they are spawned: - A service manager can directly spawn new services into dedicated cgroups. - A process can be directly created in a frozen cgroup and will be frozen as well. - The initial accounting jitter experienced by process supervisors and daemons is eliminated with this. - Threaded applications or even thread implementations can choose to create a specific cgroup layout where each thread is spawned directly into a dedicated cgroup. This feature is limited to the unified hierarchy. Callers need to pass a directory file descriptor for the target cgroup. The caller can choose to pass an O_PATH file descriptor. All usual migration restrictions apply, i.e. there can be no processes in inner nodes. In general, creating a process directly in a target cgroup adheres to all migration restrictions. One of the biggest advantages of this feature is that CLONE_INTO_GROUP does not need to grab the write side of the cgroup cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads around super expensive. With clone3() this lock is avoided. Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> | ||
|  Christian Brauner | 5a5cf5cb30 | cgroup: refactor fork helpers This refactors the fork helpers so they can be easily modified in the next patches. The patch just moves the cgroup threadgroup rwsem grab and release into the helpers. They don't need to be directly exposed in fork.c. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> | ||
|  Linus Torvalds | 39bed42de2 | hmm related patches for 5.6 This small series revises the names in mmu_notifier to make the code clearer and more readable. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl4wf2EACgkQOG33FX4g mxqrdw//XIexbXQqP4dUKFCFeI7Um6ZqYE6iVCQi6JEetpKxCR8BSrJsq6EP60Mg cVCKolISuudzOccz/liotg9SrwRlcO3mzucd8LJZG0v2FZMzQr0EKjst0RC4/xvK U2RxGvwLQ+XVR/3/l6hXyWyw7u28+F1RsfQMMX3kqR3qlcQachQ3k7oUINDIq2XH JkQcBV+XK0doXEp6VCCVKwuwEN7O5xSm8lAIHDNFZEEPre0iKxwatgWxdXFIWQek tRywwB7bRzFROBlDcoOQ0GDTqScr3bghz6vWU4GGv3avYkystKwy44ha6BzO2xQc ZNIo8AN9UFFhcmF531wklsXTCbxbxJAJAwdyIuQnKq5glw64EFnrjo2sxuL6s56h C1GHADtxDccv+nr2sKP/rFFeq9K3VqHDtjEdBOhReuB0Vp1YfVr17A4R8yAn8A+1 vm3IusoOq+g8qMYxRHEb+76/S//joaxAlFQkU5Gjn/0xsykP99YQSQFBjXmkzWlS IiHLf0HJiCCL8SHe4Wnyhyl1DUIIl38HQULqbFWZ8hK4ELhTd2KEuDxzT8q+v+v7 2M9nBVdRaw1kskGiFv+F7mb6c990CTEZO9B5fHpAjPRxeVkLYc06QfJY+hXbbu4c 6yzIvERRRlAviCmgb7G+3pLyBCKdvlIlCVsVOdxHXSRsl904BnA= =hhT0 -----END PGP SIGNATURE----- Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull mmu_notifier updates from Jason Gunthorpe: "This small series revises the names in mmu_notifier to make the code clearer and more readable" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions | ||
|  Linus Torvalds | e279160f49 | The timekeeping and timers departement provides: - Time namespace support:
 
     If a container migrates from one host to another then it expects that
     clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime these
     clocks can differ significantly on two hosts, in the worst case time
     goes backwards which is a violation of the POSIX requirements.
 
     The time namespace addresses this problem. It allows to set offsets for
     clock MONOTONIC and BOOTTIME once after creation and before tasks are
     associated with the namespace. These offsets are taken into account by
     timers and timekeeping including the VDSO.
 
     Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
     this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric potential
     use cases which were discussed at Plumbers '18.
 
     The overhead for tasks in the root namespace (host time offsets = 0) is
     in the noise and great effort was made to ensure that especially in the
     VDSO. If time namespace is disabled in the kernel configuration the
     code is compiled out.
 
     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
     and kept on for more than a year addressing review comments, finding
     better solutions. A pleasant experience.
 
   - Overhaul of the alarmtimer device dependency handling to ensure that
     the init/suspend/resume ordering is correct.
 
   - A new clocksource/event driver for Microchip PIT64
 
   - Suspend/resume support for the Hyper-V clocksource
 
   - The usual pile of fixes, updates and improvements mostly in the
     driver code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
 m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
 JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
 vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
 FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
 Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
 yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
 pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
 RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
 sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
 k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
 UAI+6opYlZEPAw==
 =x3qT
 -----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timekeeping and timers departement provides:
   - Time namespace support:
     If a container migrates from one host to another then it expects
     that clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime
     these clocks can differ significantly on two hosts, in the worst
     case time goes backwards which is a violation of the POSIX
     requirements.
     The time namespace addresses this problem. It allows to set offsets
     for clock MONOTONIC and BOOTTIME once after creation and before
     tasks are associated with the namespace. These offsets are taken
     into account by timers and timekeeping including the VDSO.
     Offsets for wall clock based clocks (REALTIME/TAI) are not provided
     by this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric
     potential use cases which were discussed at Plumbers '18.
     The overhead for tasks in the root namespace (ie where host time
     offsets = 0) is in the noise and great effort was made to ensure
     that especially in the VDSO. If time namespace is disabled in the
     kernel configuration the code is compiled out.
     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
     feature and kept on for more than a year addressing review
     comments, finding better solutions. A pleasant experience.
   - Overhaul of the alarmtimer device dependency handling to ensure
     that the init/suspend/resume ordering is correct.
   - A new clocksource/event driver for Microchip PIT64
   - Suspend/resume support for the Hyper-V clocksource
   - The usual pile of fixes, updates and improvements mostly in the
     driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
  alarmtimer: Use wakeup source from alarmtimer platform device
  alarmtimer: Make alarmtimer platform device child of RTC device
  alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
  hrtimer: Add missing sparse annotation for __run_timer()
  lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
  MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
  clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
  clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
  clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
  clocksource/drivers/exynos_mct: Rename Exynos to lowercase
  clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
  clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
  clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
  clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
  clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
  clocksource/drivers/bcm2835_timer: Fix memory leak of timer
  clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
  clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
  clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
  ... | ||
|  Jason Gunthorpe | 984cfe4e25 | mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions The name mmu_notifier_mm implies that the thing is a mm_struct pointer, and is difficult to abbreviate. The struct is actually holding the interval tree and hlist containing the notifiers subscribed to a mm. Use 'subscriptions' as the variable name for this struct instead of the really terrible and misleading 'mmn_mm'. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> | ||
|  Andrei Vagin | 769071ac9f | ns: Introduce Time Namespace Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
      System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
      Clock that cannot be set and represents monotonic time since
      some unspecified starting point.
CLOCK_BOOTTIME
      Identical to CLOCK_MONOTONIC, except it also includes any time
      that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
  	changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com | ||
|  Amanieu d'Antras | dd499f7a7e | clone3: ensure copy_thread_tls is implemented copy_thread implementations handle CLONE_SETTLS by reading the TLS value from the registers containing the syscall arguments for clone. This doesn't work with clone3 since the TLS value is passed in clone_args instead. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: <stable@vger.kernel.org> # 5.3.x Link: https://lore.kernel.org/r/20200102172413.654385-8-amanieu@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Linus Torvalds | 043cf46825 | Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Ingo Molnar:
 "The main changes in the timer code in this cycle were:
   - Clockevent updates:
      - timer-of framework cleanups. (Geert Uytterhoeven)
      - Use timer-of for the renesas-ostm and the device name to prevent
        name collision in case of multiple timers. (Geert Uytterhoeven)
      - Check if there is an error after calling of_clk_get in asm9260
        (Chuhong Yuan)
   - ABI fix: Zero out high order bits of nanoseconds on compat
     syscalls. This got broken a year ago, with apparently no side
     effects so far.
     Since the kernel would use random data otherwise I don't think we'd
     have other options but to fix the bug, even if there was a side
     effect to applications (Dmitry Safonov)
   - Optimize ns_to_timespec64() on 32-bit systems: move away from
     div_s64_rem() which can be slow, to div_u64_rem() which is faster
     (Arnd Bergmann)
   - Annotate KCSAN-reported false positive data races in
     hrtimer_is_queued() users by moving timer->state handling over to
     the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
     (Eric Dumazet)
   - Misc cleanups and small fixes"
[ I undid the "ABI fix" and updated the comments instead. The reason
  there were apparently no side effects is that the fix was a no-op.
  The updated comment is to say _why_ it was a no-op.    - Linus ]
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Zero the upper 32-bits in __kernel_timespec on 32-bit
  time: Rename tsk->real_start_time to ->start_boottime
  hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
  time: Fix spelling mistake in comment
  time: Optimize ns_to_timespec64()
  hrtimer: Annotate lockless access to timer->state
  clocksource/drivers/asm9260: Add a check for of_clk_get
  clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
  clocksource/drivers/renesas-ostm: Convert to timer_of
  clocksource/drivers/timer-of: Use unique device name instead of timer
  clocksource/drivers/timer-of: Convert last full_name to %pOF | ||
|  Daniel Axtens | eafb149ed7 | fork: support VMAP_STACK with KASAN_VMALLOC Supporting VMAP_STACK with KASAN_VMALLOC is straightforward: - clear the shadow region of vmapped stacks when swapping them in - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN Link: http://lkml.kernel.org/r/20191031093909.9228-4-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | aa32f11691 | hmm related patches for 5.5 This is another round of bug fixing and cleanup. This time the focus is on
 the driver pattern to use mmu notifiers to monitor a VA range. This code
 is lifted out of many drivers and hmm_mirror directly into the
 mmu_notifier core and written using the best ideas from all the driver
 implementations.
 
 This removes many bugs from the drivers and has a very pleasing
 diffstat. More drivers can still be converted, but that is for another
 cycle.
 
 - A shared branch with RDMA reworking the RDMA ODP implementation
 
 - New mmu_interval_notifier API. This is focused on the use case of
   monitoring a VA and simplifies the process for drivers
 
 - A common seq-count locking scheme built into the mmu_interval_notifier
   API usable by drivers that call get_user_pages() or hmm_range_fault()
   with the VA range
 
 - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen GntDev
   drivers to the new API. This deletes a lot of wonky driver code.
 
 - Two improvements for hmm_range_fault(), from testing done by Ralph
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl3cCjQACgkQOG33FX4g
 mxpp8xAAiR9iOdT28m/tx1GF31XludrMhRZVIiz0vmCIxIiAkWekWEfAEVm9PDnh
 wdrxTJohSs+B65AK3sfToOM3AIuNCuFVWmbbHI5qmOO76vaSvcZa905Z++pNsawO
 Bn8mgRCprYoFHcxWLvTvnA5U0g1S2BSSOwBSZI43CbEnVvHjYAR6MnvRqfGMk+NF
 bf8fTk/x+fl0DCemhynlBLuJkogzoE2Hgl0yPY5bFna4PktOxdpa1yPaQsiqZ7e6
 2s2NtM3pbMBJk0W42q5BU+aPhiqfxFFszasPSLBduXrD2xDsG76HJdHj5VydKmfL
 nelG4BvqJozXTEZWvTEePYhCqaZ41eJZ7Asw8BXtmacVqE5mDlTXo/Zdgbz7yEOR
 mI5MVyjD5rauZJldUOWXbwrPoWVFRvboauehiSgqvxvT9HvlFp9GKObSuu4gubBQ
 mzxs4t48tPhA7bswLmw0/pETSogFuVDfaB7hsyY0gi8EwxMFMpw2qFypm1PEEF+C
 BuUxCSShzvNKrraNe5PWaNNFd3AzIwAOWJHE+poH4bCoXQVr5nA+rq2gnHkdY5vq
 /xrBCyxkf0U05YoFGYembPVCInMehzp9Xjy8V+SueSvCg2/TYwGDCgGfsbe9dNOP
 Bc40JpS7BDn5w9nyLUJmOx7jfruNV6kx1QslA7NDDrB/rzOlsEc=
 =Hj8a
 -----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
 "This is another round of bug fixing and cleanup. This time the focus
  is on the driver pattern to use mmu notifiers to monitor a VA range.
  This code is lifted out of many drivers and hmm_mirror directly into
  the mmu_notifier core and written using the best ideas from all the
  driver implementations.
  This removes many bugs from the drivers and has a very pleasing
  diffstat. More drivers can still be converted, but that is for another
  cycle.
   - A shared branch with RDMA reworking the RDMA ODP implementation
   - New mmu_interval_notifier API. This is focused on the use case of
     monitoring a VA and simplifies the process for drivers
   - A common seq-count locking scheme built into the
     mmu_interval_notifier API usable by drivers that call
     get_user_pages() or hmm_range_fault() with the VA range
   - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
     GntDev drivers to the new API. This deletes a lot of wonky driver
     code.
   - Two improvements for hmm_range_fault(), from testing done by Ralph"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
  mm/hmm: make full use of walk_page_range()
  xen/gntdev: use mmu_interval_notifier_insert
  mm/hmm: remove hmm_mirror and related
  drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
  drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  nouveau: use mmu_interval_notifier instead of hmm_mirror
  nouveau: use mmu_notifier directly for invalidate_range_start
  drm/radeon: use mmu_interval_notifier_insert
  RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
  RDMA/odp: Use mmu_interval_notifier_insert()
  mm/hmm: define the pre-processor related parts of hmm.h even if disabled
  mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
  mm/mmu_notifier: add an interval tree notifier
  mm/mmu_notifier: define the header pre-processor parts even if disabled
  mm/hmm: allow snapshot of the special zero page | ||
|  Linus Torvalds | 168829ad09 | Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:
   - A comprehensive rewrite of the robust/PI futex code's exit handling
     to fix various exit races. (Thomas Gleixner et al)
   - Rework the generic REFCOUNT_FULL implementation using
     atomic_fetch_* operations so that the performance impact of the
     cmpxchg() loops is mitigated for common refcount operations.
     With these performance improvements the generic implementation of
     refcount_t should be good enough for everybody - and this got
     confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
     REFCOUNT_FULL entirely, leaving the generic implementation enabled
     unconditionally. (Will Deacon)
   - Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  lkdtm: Remove references to CONFIG_REFCOUNT_FULL
  locking/refcount: Remove unused 'refcount_error_report()' function
  locking/refcount: Consolidate implementations of refcount_t
  locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
  locking/refcount: Move saturation warnings out of line
  locking/refcount: Improve performance of generic REFCOUNT_FULL code
  locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
  locking/refcount: Remove unused refcount_*_checked() variants
  locking/refcount: Ensure integer operands are treated as signed
  locking/refcount: Define constants for saturation and max refcount values
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  ... | ||
|  Linus Torvalds | 0acefef584 | threads-v5.5 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXdfjBwAKCRCRxhvAZXjc onCBAP47WZ/ie7yjoDWhOI1QB7II3NGSzToakxpgJaWoB+NjTwEA7PGrSYVEbPrf pUhiEaEJ29t+cWUxX3+yDO+k7SA6BAY= =Ra58 -----END PGP SIGNATURE----- Merge tag 'threads-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: - A pidfd's fdinfo file currently contains the field "Pid:\t<pid>" where <pid> is the pid of the process in the pid namespace of the procfs instance the fdinfo file for the pidfd was opened in. The fdinfo file has now gained a new "NSpid:\t<ns-pid1>[\t<ns-pid2>[...]]" field which lists the pids of the process in all child pid namespaces provided the pid namespace of the procfs instance it is looked up under has an ancestoral relationship with the pid namespace of the process. If it does not 0 will be shown and no further pid namespaces will be listed. Tests included. (Christian Kellner) - If the process the pidfd references has already exited, print -1 for the Pid and NSpid fields in the pidfd's fdinfo file. Tests included. (me) - Add CLONE_CLEAR_SIGHAND. This lets callers clear all signal handler that are not SIG_DFL or SIG_IGN at process creation time. This originated as a feature request from glibc to improve performance and elimate races in their posix_spawn() implementation. Tests included. (me) - Add support for choosing a specific pid for a process with clone3(). This is the feature which was part of the thread update for v5.4 but after a discussion at LPC in Lisbon we decided to delay it for one more cycle in order to make the interface more generic. This has now done. It is now possible to choose a specific pid in a whole pid namespaces (sub)hierarchy instead of just one pid namespace. In order to choose a specific pid the caller must have CAP_SYS_ADMIN in all owning user namespaces of the target pid namespaces. Tests included. (Adrian Reber) - Test improvements and extensions. (Andrei Vagin, me) * tag 'threads-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests/clone3: skip if clone3() is ENOSYS selftests/clone3: check that all pids are released on error paths selftests/clone3: report a correct number of fails selftests/clone3: flush stdout and stderr before clone3() and _exit() selftests: add tests for clone3() with *set_tid fork: extend clone3() to support setting a PID selftests: add tests for clone3() tests: test CLONE_CLEAR_SIGHAND clone3: add CLONE_CLEAR_SIGHAND pid: use pid_has_task() in pidfd_open() exit: use pid_has_task() in do_wait() pid: use pid_has_task() in __change_pid() test: verify fdinfo for pidfd of reaped process pidfd: check pid has attached task in fdinfo pidfd: add tests for NSpid info in fdinfo pidfd: add NSpid entries to fdinfo | ||
|  Ingo Molnar | 83bae01182 | Merge branch 'timers/urgent' into timers/core, to pick up fix Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Jason Gunthorpe | 107e899874 | mm/hmm: define the pre-processor related parts of hmm.h even if disabled Only the function calls are stubbed out with static inlines that always fail. This is the standard way to write a header for an optional component and makes it easier for drivers that only optionally need HMM_MIRROR. Link: https://lore.kernel.org/r/20191112202231.3856-5-jgg@ziepe.ca Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> | ||
|  Luc Van Oostenryck | 9e77716a75 | fork: fix pidfd_poll()'s return type pidfd_poll() is defined as returning 'unsigned int' but the
.poll method is declared as returning '__poll_t', a bitwise type.
Fix this by using the proper return type and using the EPOLL
constants instead of the POLL ones, as required for __poll_t.
Fixes:  | ||
|  Thomas Gleixner | 150d71584b | futex: Split futex_mm_release() for exit/exec To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de | ||
|  Thomas Gleixner | 4610ba7ad8 | exit/exec: Seperate mm_release() mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de | ||
|  Thomas Gleixner | ba31c1a485 | futex: Move futex exit handling into futex code The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code. Split it out into a function, move it into futex code and make the various futex exit functions static. Preparatory only and no functional change. Folded build fix from Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de | ||
|  Adrian Reber | 49cb2fc42c | fork: extend clone3() to support setting a PID The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.
Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).
This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.
The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.
To create a process with the following PIDs:
      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1
For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:
  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;
If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:
  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;
The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.
The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.
set_tid_size cannot be larger then the current PID namespace level.
Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Peter Zijlstra | cf25e24db6 | time: Rename tsk->real_start_time to ->start_boottime Since it stores CLOCK_BOOTTIME, not, as the name suggests, CLOCK_REALTIME, let's rename ->real_start_time to ->start_bootime. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Christian Brauner | fa729c4df5 | clone3: validate stack arguments Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.
Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:
 #define __STACK_SIZE (8 * 1024 * 1024)
 pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
 {
         pid_t ret;
         void *stack;
         stack = malloc(__STACK_SIZE);
         if (!stack)
                 return -ENOMEM;
 #ifdef __ia64__
         ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #elif defined(__parisc__) /* stack grows up */
         ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
 #else
         ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #endif
         return ret;
 }
or even crazier variants such as [3].
With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.
/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]:  | ||
|  Christian Brauner | b612e5df45 | clone3: add CLONE_CLEAR_SIGHAND Reset all signal handlers of the child not set to SIG_IGN to SIG_DFL. Mutually exclusive with CLONE_SIGHAND to not disturb other thread's signal handler. In the spirit of closer cooperation between glibc developers and kernel developers (cf. [2]) this patchset came out of a discussion on the glibc mailing list for improving posix_spawn() (cf. [1], [3], [4]). Kernel support for this feature has been explicitly requested by glibc and I see no reason not to help them with this. The child helper process on Linux posix_spawn must ensure that no signal handlers are enabled, so the signal disposition must be either SIG_DFL or SIG_IGN. However, it requires a sigprocmask to obtain the current signal mask and at least _NSIG sigaction calls to reset the signal handlers for each posix_spawn call or complex state tracking that might lead to data corruption in glibc. Adding this flags lets glibc avoid these problems. [1]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00149.html [3]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00158.html [4]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00160.html [2]: https://lwn.net/Articles/799331/ '[...] by asking for better cooperation with the C-library projects in general. They should be copied on patches containing ABI changes, for example. I noted that there are often times where C-library developers wish the kernel community had done things differently; how could those be avoided in the future? Members of the audience suggested that more glibc developers should perhaps join the linux-api list. The other suggestion was to "copy Florian on everything".' Cc: Florian Weimer <fweimer@redhat.com> Cc: libc-alpha@sourceware.org Cc: linux-api@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20191014104538.3096-1-christian.brauner@ubuntu.com | ||
|  Christian Brauner | 3d6d8da48d | pidfd: check pid has attached task in fdinfo Currently, when a task is dead we still print the pid it used to use in the fdinfo files of its pidfds. This doesn't make much sense since the pid may have already been reused. So verify that the task is still alive by introducing the pid_has_task() helper which will be used by other callers in follow-up patches. If the task is not alive anymore, we will print -1. This allows us to differentiate between a task not being present in a given pid namespace - in which case we already print 0 - and a task having been reaped. Note that this uses PIDTYPE_PID for the check. Technically, we could've checked PIDTYPE_TGID since pidfds currently only refer to thread-group leaders but if they won't anymore in the future then this check becomes problematic without it being immediately obvious to non-experts imho. If a thread is created via clone(CLONE_THREAD) than struct pid has a single non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even though the thread-group leader might still be very much alive. So checking PIDTYPE_PID is fine and is easier to maintain should we ever allow pidfds to refer to threads. Cc: Jann Horn <jannh@google.com> Cc: Christian Kellner <christian@kellner.me> Cc: linux-api@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com | ||
|  Christian Kellner | 15d42eb26b | pidfd: add NSpid entries to fdinfo Currently, the fdinfo file contains the Pid field which shows the pid a given pidfd refers to in the pid namespace of the procfs instance. If pid namespaces are configured, also show an NSpid field for easy retrieval of the pid in all descendant pid namespaces. If the pid namespace of the process is not a descendant of the pid namespace of the procfs instance 0 will be shown as its first NSpid entry and no other entries will be shown. Add a block comment to pidfd_show_fdinfo with a detailed explanation of Pid and NSpid fields. Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christian Kellner <christian@kellner.me> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20191014162034.2185-1-ckellner@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Michal Hocko | b0f53dbc4b | kernel/sysctl.c: do not override max_threads provided by userspace Partially revert | ||
|  Linus Torvalds | e524d16e7e | copy-struct-from-user-v5.4-rc2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXZZIgQAKCRCRxhvAZXjc orNOAP98B2nmoxvq8d5Z6PhoyTBC5NIUuJ5h2YMwcX/hAaj5uQEA58NTKtPmOPDR 2ffUFFerGZ2+brlHgACa0ZKdH27TjAA= =QryD -----END PGP SIGNATURE----- Merge tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull copy_struct_from_user() helper from Christian Brauner: "This contains the copy_struct_from_user() helper which got split out from the openat2() patchset. It is a generic interface designed to copy a struct from userspace. The helper will be especially useful for structs versioned by size of which we have quite a few. This allows for backwards compatibility, i.e. an extended struct can be passed to an older kernel, or a legacy struct can be passed to a newer kernel. For the first case (extended struct, older kernel) the new fields in an extended struct can be set to zero and the struct safely passed to an older kernel. The most obvious benefit is that this helper lets us get rid of duplicate code present in at least sched_setattr(), perf_event_open(), and clone3(). More importantly it will also help to ensure that users implementing versioning-by-size end up with the same core semantics. This point is especially crucial since we have at least one case where versioning-by-size is used but with slighly different semantics: sched_setattr(), perf_event_open(), and clone3() all do do similar checks to copy_struct_from_user() while rt_sigprocmask(2) always rejects differently-sized struct arguments. With this pull request we also switch over sched_setattr(), perf_event_open(), and clone3() to use the new helper" * tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: usercopy: Add parentheses around assignment in test_copy_struct_from_user perf_event_open: switch to copy_struct_from_user() sched_setattr: switch to copy_struct_from_user() clone3: switch to copy_struct_from_user() lib: introduce copy_struct_from_user() helper | ||
|  Christian Brauner | 501bd0166e | fork: add kernel-doc for clone3 Add kernel-doc for the clone3() syscall. Link: https://lore.kernel.org/r/20191001114701.24661-2-christian.brauner@ubuntu.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Aleksa Sarai | f14c234b4b | clone3: switch to copy_struct_from_user() Switch clone3() syscall from it's own copying struct clone_args from userspace to the new dedicated copy_struct_from_user() helper. The change is very straightforward, and helps unify the syscall interface for struct-from-userspace syscalls. Additionally, explicitly define CLONE_ARGS_SIZE_VER0 to match the other users of the struct-extension pattern. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: improve commit message] Link: https://lore.kernel.org/r/20191001011055.19283-3-cyphar@cyphar.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> | ||
|  Linus Torvalds | 9c5efe9ae7 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Apply a number of membarrier related fixes and cleanups, which fixes a use-after-free race in the membarrier code - Introduce proper RCU protection for tasks on the runqueue - to get rid of the subtle task_rcu_dereference() interface that was easy to get wrong - Misc fixes, but also an EAS speedup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Avoid redundant EAS calculation sched/core: Remove double update_max_interval() call on CPU startup sched/core: Fix preempt_schedule() interrupt return comment sched/fair: Fix -Wunused-but-set-variable warnings sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() sched/membarrier: Return -ENOMEM to userspace on memory allocation failure sched/membarrier: Skip IPIs when mm->mm_users == 1 selftests, sched/membarrier: Add multi-threaded test sched/membarrier: Fix p->mm->membarrier_state racy load sched/membarrier: Call sync_core only before usermode for same mm sched/membarrier: Remove redundant check sched/membarrier: Fix private expedited registration check tasks, sched/core: RCUify the assignment of rq->curr tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue tasks: Add a count of task RCU users sched/core: Convert vcpu_is_preempted() from macro to an inline function sched/fair: Remove unused cfs_rq_clock_task() function | ||
|  Sai Praneeth Prakhya | 8495f7e673 | fork: improve error message for corrupted page tables When a user process exits, the kernel cleans up the mm_struct of the user process and during cleanup, check_mm() checks the page tables of the user process for corruption (E.g: unexpected page flags set/cleared). For corrupted page tables, the error message printed by check_mm() isn't very clear as it prints the loop index instead of page table type (E.g: Resident file mapping pages vs Resident shared memory pages). The loop index in check_mm() is used to index rss_stat[] which represents individual memory type stats. Hence, instead of printing index, print memory type, thereby improving error message. Without patch: -------------- [ 204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467) [ 204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2 [ 204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5 [ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480 With patch: ----------- [ 69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467) [ 69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2 [ 69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5 [ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480 Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so that it matches the other print statement. Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Eric W. Biederman | 0ff7b2cfba | tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue In the ordinary case today the RCU grace period for a task_struct is triggered when another process wait's for it's zombine and causes the kernel to call release_task(). As the waiting task has to receive a signal and then act upon it before this happens, typically this will occur after the original task as been removed from the runqueue. Unfortunaty in some cases such as self reaping tasks it can be shown that release_task() will be called starting the grace period for task_struct long before the task leaves the runqueue. Therefore use put_task_struct_rcu_user() in finish_task_switch() to guarantee that the there is a RCU lifetime after the task leaves the runqueue. Besides the change in the start of the RCU grace period for the task_struct this change may cause perf_event_delayed_put and trace_sched_process_free. The function perf_event_delayed_put boils down to just a WARN_ON for cases that I assume never show happen. So I don't see any problem with delaying it. The function trace_sched_process_free is a trace point and thus visible to user space. Occassionally userspace has the strangest dependencies so this has a miniscule chance of causing a regression. This change only changes the timing of when the tracepoint is called. The change in timing arguably gives userspace a more accurate picture of what is going on. So I don't expect there to be a regression. In the case where a task self reaps we are pretty much guaranteed that the RCU grace period is delayed. So we should get quite a bit of coverage in of this worst case for the change in a normal threaded workload. So I expect any issues to turn up quickly or not at all. I have lightly tested this change and everything appears to work fine. Inspired-by: Linus Torvalds <torvalds@linux-foundation.org> Inspired-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Eric W. Biederman | 3fbd7ee285 | tasks: Add a count of task RCU users Add a count of the number of RCU users (currently 1) of the task struct so that we can later add the scheduler case and get rid of the very subtle task_rcu_dereference(), and just use rcu_dereference(). As suggested by Oleg have the count overlap rcu_head so that no additional space in task_struct is required. Inspired-by: Linus Torvalds <torvalds@linux-foundation.org> Inspired-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Linus Torvalds | 84da111de0 | hmm related patches for 5.4 This is more cleanup and consolidation of the hmm APIs and the very
 strongly related mmu_notifier interfaces. Many places across the tree
 using these interfaces are touched in the process. Beyond that a cleanup
 to the page walker API and a few memremap related changes round out the
 series:
 
 - General improvement of hmm_range_fault() and related APIs, more
   documentation, bug fixes from testing, API simplification &
   consolidation, and unused API removal
 
 - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
   make them internal kconfig selects
 
 - Hoist a lot of code related to mmu notifier attachment out of drivers by
   using a refcount get/put attachment idiom and remove the convoluted
   mmu_notifier_unregister_no_release() and related APIs.
 
 - General API improvement for the migrate_vma API and revision of its only
   user in nouveau
 
 - Annotate mmu_notifiers with lockdep and sleeping region debugging
 
 Two series unrelated to HMM or mmu_notifiers came along due to
 dependencies:
 
 - Allow pagemap's memremap_pages family of APIs to work without providing
   a struct device
 
 - Make walk_page_range() and related use a constant structure for function
   pointers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
 mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
 Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
 6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
 tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
 nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
 wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
 yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
 EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
 Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
 kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
 olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
 =FRGg
 -----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:
   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal
   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects
   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.
   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau
   - Annotate mmu_notifiers with lockdep and sleeping region debugging
  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:
   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device
   - Make walk_page_range() and related use a constant structure for
     function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ... | ||
|  Linus Torvalds | 7f2444d38f | Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:
   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.
     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.
   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.
   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function
   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.
   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.
   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.
   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.
   - The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ... | ||
|  Linus Torvalds | 76f0f227cf | ia64 for v5.4 - big change here is removal of support for SGI Altix -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJdf64MAAoJEKurIx+X31iBB20P/07o93sBT92SiA2/ety9sLqV BGJmEdw7gyb9WVbUip6s71FIEKZw4foCGkqDiX+lr5Fw2A9tiK7LmFgTLi4LLwg+ syhYZ1y5/mwBI4FLlJudKjQdFZjr/n7DNlz4H67woE2kK+FyRsOKEaFUhuR8+0rC mKJBKtIGnoIOPG06PT1k5qfdpzlreCFoWdIhjO55LfDgZnnDiMaX5h0vcBQ9xgCp xGV0n/f7+qn4pzB4hGvNV209Sdgv2V4t77bHNvyXlJrM5Hqzafo5MzFgEJv+fRqJ 2RnkWVhwctfbid/2ggf2aAsYnMK3GigEaOCsYW2oWJESVUQhxIi3ndF/Jt9fraZv ZouD7G/s64P5lUQuCT9JnKGzJrSgxvkd37049AZ4pFVc2MzLC6o6dyyP8pu5ARe8 T0shFik3+gsml2US/vSUzxvrg1saRQjl9E/AJ0RTZ8oyP4FNnFmkJf38qj3a0L0k ILFYscM5q7WPggoDA/m6F96tLGhdK/sKjDzrADjEh2dIvn4woqoEJSDn+rXuP+Gm UOj1v8mILZCqvOAmc9IkGCkPUlbrmNV/1FYh5+GWudtillEaD82vjSqm+jnVbfXD REvHlR/kxCSj1gg/+nk+NFdZCkW3xETOcTZohhDkR7du2mHjTwBMZ2YRPrqoX4c8 VZA57Mrqm5Uk5601qYRl =L5e+ -----END PGP SIGNATURE----- Merge tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull ia64 updates from Tony Luck: "The big change here is removal of support for SGI Altix" * tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits) genirq: remove the is_affinity_mask_valid hook ia64: remove CONFIG_SWIOTLB ifdefs ia64: remove support for machvecs ia64: move the screen_info setup to common code ia64: move the ROOT_DEV setup to common code ia64: rework iommu probing ia64: remove the unused sn_coherency_id symbol ia64: remove the SGI UV simulator support ia64: remove the zx1 swiotlb machvec ia64: remove CONFIG_ACPI ifdefs ia64: remove CONFIG_PCI ifdefs ia64: remove the hpsim platform ia64: remove now unused machvec indirections ia64: remove support for the SGI SN2 platform drivers: remove the SGI SN2 IOC4 base support drivers: remove the SGI SN2 IOC3 base support qla2xxx: remove SGI SN2 support qla1280: remove SGI SN2 support misc/sgi-xp: remove SGI SN2 support char/mspec: remove SGI SN2 support ... | ||
|  Linus Torvalds | c17112a5c4 | core-process-v5.4 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXXe8mQAKCRCRxhvAZXjc
 ou7oAQCszihkNfpjORSSSOqenMDrxxDW++A7TIOLuq7UyZQl8QD+LM1wvT/xypfJ
 ORD9XX8+Wrv07AQn85fZBEFXGrnengk=
 =o+VL
 -----END PGP SIGNATURE-----
Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd/waitid updates from Christian Brauner:
 "This contains two features and various tests.
  First, it adds support for waiting on process through pidfds by adding
  the P_PIDFD type to the waitid() syscall. This completes the basic
  functionality of the pidfd api (cf. [1]). In the meantime we also have
  a new adition to the userspace projects that make use of the pidfd
  api. The qt project was nice enough to send a mail pointing out that
  they have a pr up to switch to the pidfd api (cf. [2]).
  Second, this tag contains an extension to the waitid() syscall to make
  it possible to wait on the current process group in a race free manner
  (even though the actual problem is very unlikely) by specifing 0
  together with the P_PGID type. This extension traces back to a
  discussion on the glibc development mailing list.
  There are also a range of tests for the features above. Additionally,
  the test-suite which detected the pidfd-polling race we fixed in [3]
  is included in this tag"
[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit  | ||
|  Eugene Syromiatnikov | a0eb9abd8a | fork: block invalid exit signals with clone3() Previously, higher 32 bits of exit_signal fields were lost when copied
to the kernel args structure (that uses int as a type for the respective
field). Moreover, as Oleg has noted, exit_signal is used unchecked, so
it has to be checked for sanity before use; for the legacy syscalls,
applying CSIGNAL mask guarantees that it is at least non-negative;
however, there's no such thing is done in clone3() code path, and that
can break at least thread_group_leader.
This commit adds a check to copy_clone_args_from_user() to verify that
the exit signal is limited by CSIGNAL as with legacy clone() and that
the signal is valid. With this we don't get the legacy clone behavior
were an invalid signal could be handed down and would only be detected
and ignored in do_notify_parent(). Users of clone3() will now get a
proper error when they pass an invalid exit signal. Note, that this is
not user-visible behavior since no kernel with clone3() has been
released yet.
The following program will cause a splat on a non-fixed clone3() version
and will fail correctly on a fixed version:
 #define _GNU_SOURCE
 #include <linux/sched.h>
 #include <linux/types.h>
 #include <sched.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/syscall.h>
 #include <sys/wait.h>
 #include <unistd.h>
 int main(int argc, char *argv[])
 {
        pid_t pid = -1;
        struct clone_args args = {0};
        args.exit_signal = -1;
        pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
        if (pid < 0)
                exit(EXIT_FAILURE);
        if (pid == 0)
                exit(EXIT_SUCCESS);
        wait(NULL);
        exit(EXIT_SUCCESS);
 }
Fixes:  | ||
|  Thomas Gleixner | 244d49e306 | posix-cpu-timers: Move state tracking to struct posix_cputimers Put it where it belongs and clean up the ifdeffery in fork completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de | ||
|  Thomas Gleixner | 3a245c0f11 | posix-cpu-timers: Move expiry cache into struct posix_cputimers The expiry cache belongs into the posix_cputimers container where the other cpu timers information is. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de | ||
|  Thomas Gleixner | 2b69942f90 | posix-cpu-timers: Create a container struct Per task/process data of posix CPU timers is all over the place which makes the code hard to follow and requires ifdeffery. Create a container to hold all this information in one place, so data is consolidated and the ifdeffery can be confined to the posix timer header file and removed from places like fork. As a first step, move the cpu_timers list head array into the new struct and clean up the initializers and simplify fork. The remaining #ifdef in fork will be removed later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de | ||
|  Jason Gunthorpe | daa138a58c | Merge branch 'odp_fixes' into hmm.git From rdma.git Jason Gunthorpe says: ==================== This is a collection of general cleanups for ODP to clarify some of the flows around umem creation and use of the interval tree. ==================== The branch is based on v5.3-rc5 due to dependencies, and is being taken into hmm.git due to dependencies in the next patches. * odp_fixes: RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address RDMA/core: Make invalidate_range a device operation RDMA/odp: Use kvcalloc for the dma_list and page_list RDMA/odp: Check for overflow when computing the umem_odp end RDMA/odp: Provide ib_umem_odp_release() to undo the allocs RDMA/odp: Split creating a umem_odp from ib_umem_get RDMA/odp: Make the three ways to create a umem_odp clear RMDA/odp: Consolidate umem_odp initialization RDMA/odp: Make it clearer when a umem is an implicit ODP umem RDMA/odp: Iterate over the whole rbtree directly RDMA/odp: Use the common interval tree library instead of generic RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> | ||
|  Jason Gunthorpe | c7d8b7824f | hmm: use mmu_notifier_get/put for 'struct hmm' This is a significant simplification, it eliminates all the remaining 'hmm' stuff in mm_struct, eliminates krefing along the critical notifier paths, and takes away all the ugly locking and abuse of page_table_lock. mmu_notifier_get() provides the single struct hmm per struct mm which eliminates mm->hmm. It also directly guarantees that no mmu_notifier op callback is callable while concurrent free is possible, this eliminates all the krefs inside the mmu_notifier callbacks. The remaining krefs in the range code were overly cautious, drivers are already not permitted to free the mirror while a range exists. Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> | ||
|  Christoph Hellwig | 4189ff2348 | kernel: only define task_struct_whitelist conditionally If CONFIG_ARCH_TASK_STRUCT_ALLOCATOR is set task_struct_whitelist is never called, and thus generates a compiler warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/20190812065524.19959-5-hch@lst.de Signed-off-by: Tony Luck <tony.luck@intel.com> | ||
|  Christian Brauner | 3695eae5fe | pidfd: add P_PIDFD to waitid() This adds the P_PIDFD type to waitid(). One of the last remaining bits for the pidfd api is to make it possible to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace that want to use the pidfd api to exclusively manage processes can do so now. One of the things this will unblock in the future is the ability to make it possible to retrieve the exit status via waitid(P_PIDFD) for non-parent processes if handed a _suitable_ pidfd that has this feature set. This is similar to what you can do on FreeBSD with kqueue(). It might even end up being possible to wait on a process as a non-parent if an appropriate property is enabled on the pidfd. With P_PIDFD no scoping of the process identified by the pidfd is possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0), waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics equivalent to wait4(pid), waitid(P_PID). Users that need scoping should rely on pid-based wait*() syscalls for now. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io | ||
|  Jann Horn | 16d51a590a | sched/fair: Don't free p->numa_faults with concurrent readers When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes:  | ||
|  Linus Torvalds | 3c69914b4c | for-linus-20190715 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSxgkQAKCRCRxhvAZXjc
 opJ3AP9TWIWhEC0dzbNmzh0STj/Vyl5KYEpdMbi7HNeAmIEAfQD6A3HI4bVJbN08
 jH44U7DLzHJyHefKlB8jHEKEVYJWqgo=
 =74Bg
 -----END PGP SIGNATURE-----
Merge tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd and clone3 fixes from Christian Brauner:
 "This contains a bugfix for CLONE_PIDFD when used with the legacy clone
  syscall, two fixes to ensure that syscall numbering and clone3
  entrypoint implementations will stay consistent, and an update for the
  maintainers file:
   - The addition of clone3 broke CLONE_PIDFD for legacy clone on all
     architectures that use do_fork() directly instead of calling the
     clone syscall itself. (Fwiw, cleaning do_fork() up is on my todo.)
     The reason this happened was that during conversion of _do_fork()
     to use struct kernel_clone_args we missed that do_fork() is called
     directly by various architectures. This is fixed by making sure
     that the pidfd argument in struct kernel_clone_args is correctly
     initialized with the parent_tidptr argument passed down from
     do_fork(). Additionally, do_fork() missed a check to make
     CLONE_PIDFD and CLONE_PARENT_SETTID mutually exclusive just a
     clone() does. This is now fixed too.
   - When clone3() was introduced we skipped architectures that require
     special handling for fork-like syscalls. Their syscall tables did
     not contain any mention of clone3().
     To make sure that Arnd's work to make syscall numbers on all
     architectures identical (minus alpha) was not for naught we are
     placing a comment in all syscall tables that do not yet implement
     clone3(). The comment makes it clear that 435 is reserved for
     clone3 and should not be used.
   - Also, this contains a patch to make the clone3() syscall definition
     in asm-generic/unist.h conditional on __ARCH_WANT_SYS_CLONE3. This
     lets us catch new architectures that implicitly make use of clone3
     without setting __ARCH_WANT_SYS_CLONE3 which is a good indicator
     that they did not check whether it needs special treatment or not.
   - Finally, this contains a patch to add me as maintainer for pidfd
     stuff so people can start blaming me (more)"
* tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  MAINTAINERS: add new entry for pidfd api
  unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3
  arch: mark syscall number 435 reserved for clone3
  clone: fix CLONE_PIDFD support | ||
|  Linus Torvalds | fec88ab0af | HMM patches for 5.3 Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:
   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.
   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.
   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct
   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly
   - Remove DEVICE_PUBLIC
   - Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ... | ||
|  Dmitry V. Levin | 028b6e8a89 | clone: fix CLONE_PIDFD support The introduction of clone3 syscall accidentally broke CLONE_PIDFD
support in traditional clone syscall on compat x86 and those
architectures that use do_fork to implement clone syscall.
This bug was found by strace test suite.
Link: https://strace.io/logs/strace/2019-07-12
Fixes:  | ||
|  Linus Torvalds | 8f6ccf6159 | clone3-v5.3 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhhgAKCRCRxhvAZXjc
 or7kAP9VzDcQaK/WoDd2ezh2C7Wh5hNy9z/qJVCa6Tb+N+g1UgEAxbhFUg55uGOA
 JNf7fGar5JF5hBMIXR+NqOi1/sb4swg=
 =ELWo
 -----END PGP SIGNATURE-----
Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull clone3 system call from Christian Brauner:
 "This adds the clone3 syscall which is an extensible successor to clone
  after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
  window for clone(). It cleanly supports all of the flags from clone()
  and thus all legacy workloads.
  There are few user visible differences between clone3 and clone.
  First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
  this flag. Second, the CSIGNAL flag is deprecated and will cause
  EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
  argument in struct clone_args thus freeing up even more flags. And
  third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
  clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
  argument.
  The clone3 uapi is designed to be easy to handle on 32- and 64 bit:
    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };
  and a separate kernel struct is used that uses proper kernel typing:
    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };
  The system call comes with a size argument which enables the kernel to
  detect what version of clone_args userspace is passing in. clone3
  validates that any additional bytes a given kernel does not know about
  are set to zero and that the size never exceeds a page.
  A nice feature is that this patchset allowed us to cleanup and
  simplify various core kernel codepaths in kernel/fork.c by making the
  internal _do_fork() function take struct kernel_clone_args even for
  legacy clone().
  This patch also unblocks the time namespace patchset which wants to
  introduce a new CLONE_TIMENS flag.
  Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
  xtensa. These were the architectures that did not require special
  massaging.
  Other architectures treat fork-like system calls individually and
  after some back and forth neither Arnd nor I felt confident that we
  dared to add clone3 unconditionally to all architectures. We agreed to
  leave this up to individual architecture maintainers. This is why
  there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
  which any architecture can set once it has implemented support for
  clone3. The patch also adds a cond_syscall(clone3) for architectures
  such as nios2 or h8300 that generate their syscall table by simply
  including asm-generic/unistd.h. The hope is to get rid of
  __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"
* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: handle arches who do not yet define clone3
  arch: wire-up clone3() syscall
  fork: add clone3 | ||
|  Linus Torvalds | 5450e8a316 | pidfd-updates-v5.3 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhUgAKCRCRxhvAZXjc
 okkiAQC3Hlg/O2JoIb4PqgEvBkpHSdVxyuWagn0ksjACW9ANKQEAl5OadMhvOq16
 UHGhKlpE/M8HflknIffoEGlIAWHrdwU=
 =7kP5
 -----END PGP SIGNATURE-----
Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
 "This adds two main features.
   - First, it adds polling support for pidfds. This allows process
     managers to know when a (non-parent) process dies in a race-free
     way.
     The notification mechanism used follows the same logic that is
     currently used when the parent of a task is notified of a child's
     death. With this patchset it is possible to put pidfds in an
     {e}poll loop and get reliable notifications for process (i.e.
     thread-group) exit.
   - The second feature compliments the first one by making it possible
     to retrieve pollable pidfds for processes that were not created
     using CLONE_PIDFD.
     A lot of processes get created with traditional PID-based calls
     such as fork() or clone() (without CLONE_PIDFD). For these
     processes a caller can currently not create a pollable pidfd. This
     is a problem for Android's low memory killer (LMK) and service
     managers such as systemd.
  Both patchsets are accompanied by selftests.
  It's perhaps worth noting that the work done so far and the work done
  in this branch for pidfd_open() and polling support do already see
  some adoption:
   - Android is in the process of backporting this work to all their LTS
     kernels [1]
   - Service managers make use of pidfd_send_signal but will need to
     wait until we enable waiting on pidfds for full adoption.
   - And projects I maintain make use of both pidfd_send_signal and
     CLONE_PIDFD [2] and will use polling support and pidfd_open() too"
[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22
[2]  | ||
|  Linus Torvalds | dad1c12ed8 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Remove the unused per rq load array and all its infrastructure, by Dietmar Eggemann. - Add utilization clamping support by Patrick Bellasi. This is a refinement of the energy aware scheduling framework with support for boosting of interactive and capping of background workloads: to make sure critical GUI threads get maximum frequency ASAP, and to make sure background processing doesn't unnecessarily move to cpufreq governor to higher frequencies and less energy efficient CPU modes. - Add the bare minimum of tracepoints required for LISA EAS regression testing, by Qais Yousef - which allows automated testing of various power management features, including energy aware scheduling. - Restructure the former tsk_nr_cpus_allowed() facility that the -rt kernel used to modify the scheduler's CPU affinity logic such as migrate_disable() - introduce the task->cpus_ptr value instead of taking the address of &task->cpus_allowed directly - by Sebastian Andrzej Siewior. - Misc optimizations, fixes, cleanups and small enhancements - see the Git log for details. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) sched/uclamp: Add uclamp support to energy_compute() sched/uclamp: Add uclamp_util_with() sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks sched/uclamp: Set default clamps for RT tasks sched/uclamp: Reset uclamp values on RESET_ON_FORK sched/uclamp: Extend sched_setattr() to support utilization clamping sched/core: Allow sched_setattr() to use the current policy sched/uclamp: Add system default clamps sched/uclamp: Enforce last task's UCLAMP_MAX sched/uclamp: Add bucket local max tracking sched/uclamp: Add CPU's clamp buckets refcounting sched/fair: Rename weighted_cpuload() to cpu_runnable_load() sched/debug: Export the newly added tracepoints sched/debug: Add sched_overutilized tracepoint sched/debug: Add new tracepoint to track PELT at se level sched/debug: Add new tracepoints to track PELT at rq level sched/debug: Add a new sched_trace_*() helper functions sched/autogroup: Make autogroup_path() always available sched/wait: Deduplicate code with do-while sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity() ... | ||
|  Linus Torvalds | e192832869 | Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle are: - rwsem scalability improvements, phase #2, by Waiman Long, which are rather impressive: "On a 2-socket 40-core 80-thread Skylake system with 40 reader and writer locking threads, the min/mean/max locking operations done in a 5-second testing window before the patchset were: 40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810 40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255 After the patchset, they became: 40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741 40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098" There's a lot of changes to the locking implementation that makes it similar to qrwlock, including owner handoff for more fair locking. Another microbenchmark shows how across the spectrum the improvements are: "With a locking microbenchmark running on 5.1 based kernel, the total locking rates (in kops/s) on a 2-socket Skylake system with equal numbers of readers and writers (mixed) before and after this patchset were: # of Threads Before Patch After Patch ------------ ------------ ----------- 2 2,618 4,193 4 1,202 3,726 8 802 3,622 16 729 3,359 32 319 2,826 64 102 2,744" The changes are extensive and the patch-set has been through several iterations addressing various locking workloads. There might be more regressions, but unless they are pathological I believe we want to use this new implementation as the baseline going forward. - jump-label optimizations by Daniel Bristot de Oliveira: the primary motivation was to remove IPI disturbance of isolated RT-workload CPUs, which resulted in the implementation of batched jump-label updates. Beyond the improvement of the real-time characteristics kernel, in one test this patchset improved static key update overhead from 57 msecs to just 1.4 msecs - which is a nice speedup as well. - atomic64_t cross-arch type cleanups by Mark Rutland: over the last ~10 years of atomic64_t existence the various types used by the APIs only had to be self-consistent within each architecture - which means they became wildly inconsistent across architectures. Mark puts and end to this by reworking all the atomic64 implementations to use 's64' as the base type for atomic64_t, and to ensure that this type is consistently used for parameters and return values in the API, avoiding further problems in this area. - A large set of small improvements to lockdep by Yuyang Du: type cleanups, output cleanups, function return type and othr cleanups all around the place. - A set of percpu ops cleanups and fixes by Peter Zijlstra. - Misc other changes - please see the Git log for more details" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) locking/lockdep: increase size of counters for lockdep statistics locking/atomics: Use sed(1) instead of non-standard head(1) option locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING x86/jump_label: Make tp_vec_nr static x86/percpu: Optimize raw_cpu_xchg() x86/percpu, sched/fair: Avoid local_clock() x86/percpu, x86/irq: Relax {set,get}_irq_regs() x86/percpu: Relax smp_processor_id() x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() locking/rwsem: Guard against making count negative locking/rwsem: Adaptive disabling of reader optimistic spinning locking/rwsem: Enable time-based spinning on reader-owned rwsem locking/rwsem: Make rwsem->owner an atomic_long_t locking/rwsem: Enable readers spinning on writer locking/rwsem: Clarify usage of owner's nonspinaable bit locking/rwsem: Wake up almost all readers in wait queue locking/rwsem: More optimal RT task handling of null owner locking/rwsem: Always release wait_lock before waking up tasks locking/rwsem: Implement lock handoff to prevent lock starvation locking/rwsem: Make rwsem_spin_on_owner() return owner state ... | ||
|  Linus Torvalds | 927ba67a63 | Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:
  Core:
   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.
     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.
   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.
   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow
   - Consolidate the time accessors
   - Small fixes, improvements and cleanups all over the place
  Drivers:
   - Support for the NXP system counter, TI davinci timer
   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.
   - Overhaul of the Tegra driver
   - Delay timer support for IXP4xx
   - Small fixes, improvements and cleanups as usual"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ... | ||
|  Jason Gunthorpe | 9ec3f4cb35 | Linux 5.2-rc7 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0YK7ceHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWfcH/36ep8GZHY9H1ARV RJJoGoMnwENoq2o4eKhH3iZgUIGPq2uonazequhePwnIsrOdFGT7AeMHWSW7W0o4 wNlNFdUrTe0bvU00m+YtDwNIqgNCnFEoUbqn9H+VhAAWpSydKvhh2mlebTFO50KN hb9+jh59Q8tbxrQdCuNF6yJATdf4hcj1V/ZZMGgF34kx+dFY4wOooSfu/eaIxXIl fBDKN9K4Mmw8HWJvebV+ocOMZ7Zqknt1lbjx69OxpJmgxhb2Ks7heqSZanLTBPBB oZxOlEdNPSyOjBQUlsDC2S8VJ7g5gINZk1JcFjByzE7cIPOQ2UXE72R++wwANngm SR054NQ= =WlA8 -----END PGP SIGNATURE----- Merge tag 'v5.2-rc7' into rdma.git hmm Required for dependencies in the next patches. | ||
|  Christian Brauner | 28dd29c06d | fork: return proper negative error code Make sure to return a proper negative error code from copy_process()
when anon_inode_getfile() fails with CLONE_PIDFD.
Otherwise _do_fork() will not detect an error and get_task_pid() will
operator on a nonsensical pointer:
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74
08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
FS:  00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
  __do_sys_clone kernel/fork.c:2454 [inline]
  __se_sys_clone kernel/fork.c:2448 [inline]
  __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
Fixes:  | ||
|  Andrea Arcangeli | 1bf4580e00 | fork,memcg: alloc_thread_stack_node needs to set tsk->stack Commit | ||
|  Joel Fernandes (Google) | b53b0b9d9a | pidfd: add polling support This patch adds polling support to pidfd. Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, if a PID is reused between when LMK sends a kill signal and checks for existence of the PID, since the wrong PID is now possibly checked for existence. Using the polling support, LMK will be able to get notified when a process exists in race-free and fast way, and allows the LMK to do other things (such as by polling on other fds) while awaiting the process being killed to die. For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch. We have decided to include the waitqueue in struct pid for the following reasons: 1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be option in this case because the task can be reaped and destroyed before the poll returns. 2. By including the struct pid for the waitqueue means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different. Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Jonathan Kowalski <bl0pbl33p@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: kernel-team@android.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Co-developed-by: Daniel Colascione <dancol@google.com> Signed-off-by: Daniel Colascione <dancol@google.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Christian Brauner <christian@brauner.io> | ||
|  Al Viro | 6fd2fe494b | copy_process(): don't use ksys_close() on cleanups anon_inode_getfd() should be used *ONLY* in situations when we are
guaranteed to be past the last failure point (including copying the
descriptor number to userland, at that).  And ksys_close() should
not be used for cleanups at all.
anon_inode_getfile() is there for all nontrivial cases like that.
Just use that...
Fixes:  | ||
|  Dmitry V. Levin | 9014143bab | fork: don't check parent_tidptr with CLONE_PIDFD Give userspace a cheap and reliable way to tell whether CLONE_PIDFD is supported by the kernel or not. The easiest way is to pass an invalid file descriptor value in parent_tidptr, perform the syscall and verify that parent_tidptr has been changed to a valid file descriptor value. CLONE_PIDFD uses parent_tidptr to return pidfds. CLONE_PARENT_SETTID will use parent_tidptr to return the tid of the parent. The two flags cannot be used together. Old kernels that only support CLONE_PARENT_SETTID will not verify the value pointed to by parent_tidptr. This behavior is unchanged even with the introduction of CLONE_PIDFD. However, if CLONE_PIDFD is specified the kernel will currently check the value pointed to by parent_tidptr before placing the pidfd in the memory pointed to. EINVAL will be returned if the value in parent_tidptr is not 0. If CLONE_PIDFD is supported and fd 0 is closed, then the returned pidfd can and likely will be 0 and parent_tidptr will be unchanged. This means userspace must either check CLONE_PIDFD support beforehand or check that fd 0 is not closed when invoking CLONE_PIDFD. The check for pidfd == 0 was introduced during the v5.2 merge window by commit | ||
|  Jason A. Donenfeld | 9285ec4c8b | timekeeping: Use proper clock specifier names in functions This makes boot uniformly boottime and tai uniformly clocktai, to address the remaining oversights. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com | ||
|  Christian Brauner | d68dbb0c9a | arch: handle arches who do not yet define clone3 This cleanly handles arches who do not yet define clone3.
clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
assumption that this would cleanly handle all architectures. It does
not.
Architectures such as nios2 or h8300 simply take the asm-generic syscall
definitions and generate their syscall table from it. Since they don't
define __ARCH_WANT_SYS_CLONE the build would fail complaining about
sys_clone3 missing. The reason this doesn't happen for legacy clone is
that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
be done for architectural reasons.
The build failures for nios2 and h8300 were caught int -next luckily.
The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
add. Additionally, we need a cond_syscall(clone3) for architectures such
as nios2 or h8300 that generate their syscall table in the way I
explained above.
Fixes:  | ||
|  Jason Gunthorpe | c8a53b2db0 | mm/hmm: Hold a mmgrab from hmm to mm So long as a struct hmm pointer exists, so should the struct mm it is linked too. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it once the hmm refcount goes to zero. Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL mm->hmm delete the hmm_hmm_destroy(). Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Philip Yang <Philip.Yang@amd.com> | ||
|  Christian Brauner | 7f192e3cd3 | fork: add clone3 This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]:  | ||
|  Yuyang Du | e196e479a3 | locking/lockdep: Use lockdep_init_task for task initiation consistently Despite that there is a lockdep_init_task() which does nothing, lockdep initiates tasks by assigning lockdep fields and does so inconsistently. Fix this by using lockdep_init_task(). Signed-off-by: Yuyang Du <duyuyang@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bvanassche@acm.org Cc: frederic@kernel.org Cc: ming.lei@redhat.com Cc: will.deacon@arm.com Link: https://lkml.kernel.org/r/20190506081939.74287-8-duyuyang@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Sebastian Andrzej Siewior | 3bd3706251 | sched/core: Provide a pointer to the valid CPU mask In commit:
   | ||
|  Kefeng Wang | 8856ae4df3 | kernel/fork.c: make max_threads symbol static Fix build warning, kernel/fork.c:125:5: warning: symbol 'max_threads' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20190516015118.140561-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Thomas Gleixner | 457c899653 | treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | ||
|  Lin Feng | e02c9b0d65 | kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing The name clear_all_latency_tracing is misleading, in fact which only clear per task's latency_record[], and we do have another function named clear_global_latency_tracing which clear the global latency_record[] buffer. Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com Signed-off-by: Lin Feng <linf@wangsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Andrea Arcangeli | c3f3ce049f | userfaultfd: use RCU to free the task struct when fork fails The task structure is freed while get_mem_cgroup_from_mm() holds
rcu_read_lock() and dereferences mm->owner.
  get_mem_cgroup_from_mm()                failing fork()
  ----                                    ---
  task = mm->owner
                                          mm->owner = NULL;
                                          free(task)
  if (task) *task; /* use after free */
The fix consists in freeing the task with RCU also in the fork failure
case, exactly like it always happens for the regular exit(2) path.  That
is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
(left side above) effective to avoid a use after free when dereferencing
the task structure.
An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed to
succeed.  Such a change would require more changes because it would
create a strict ordering dependency where the uffd methods would need to
be called beyond the last potentially failing branch in order to be
safe.  This solution as opposed only adds the dependency to common code
to set mm->owner to NULL and to free the task struct that was pointed by
mm->owner with RCU, if fork ends up failing.  The userfaultfd methods
can still be called anywhere during the fork runtime and the monitor
will keep discarding orphaned "mm" coming from failed forks in userland.
This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
time.
[aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
  Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
Fixes:  | ||
|  Christian Brauner | c3b7112df8 | fork: do not release lock that wasn't taken Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_begin() first.
During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is
taken.
Here's the relevant splat:
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:566
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:ffff888094117b48 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
  percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
  cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
  copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
  copy_process kernel/fork.c:1772 [inline]
  _do_fork+0x25d/0xfd0 kernel/fork.c:2338
  __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
  __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
  __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
  do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
  do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..
Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
Fixes:  | ||
|  Linus Torvalds | abde77eb5c | Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "This includes Roman's cgroup2 freezer implementation. It's a separate machanism from cgroup1 freezer. Instead of blocking user tasks in arbitrary uninterruptible sleeps, the new implementation extends jobctl stop - frozen tasks are trapped in jobctl stop until thawed and can be killed and ptraced. Lots of thanks to Oleg for sheperding the effort. Other than that, there are a few trivial changes" * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: never call do_group_exit() with task->frozen bit set kernel: cgroup: fix misuse of %x cgroup: get rid of cgroup_freezer_frozen_exit() cgroup: prevent spurious transition into non-frozen state cgroup: Remove unused cgrp variable cgroup: document cgroup v2 freezer interface cgroup: add tracing points for cgroup v2 freezer cgroup: make TRACE_CGROUP_PATH irq-safe kselftests: cgroup: add freezer controller self-tests kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy() cgroup: cgroup v2 freezer cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock cgroup: implement __cgroup_task_count() helper cgroup: rename freezer.c into legacy_freezer.c cgroup: remove extra cgroup_migrate_finish() call | ||
|  Linus Torvalds | eac7078a0f | pidfd patches for v5.2-rc1 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlzReuoACgkQjrBW1T7s sS1uvBAA16pgnhRNxNTrp3LYft6lUWmF4n0baOTVtQNLhPjpwaOxHIrCBugkQCJB QcQ9IQSOvIkaEW0XAQoPBaeLviiKhHOFw1Fv89OtW6xUidSfSV15lcI9f1F2pCm2 4yCL/8XvL6M0NhxiwftJAkWOXeDNLfjFnLwyLxBfgg3EeyqMgUB8raeosEID0ORR gm2/g8DYS2r+KNqM/F4xvMSgabfi2bGk+8BtAaVnftJfstpRNrqKwWnSK3Wspj1l 5gkb8gSsiY6ns3V6RgNHrFlhevFg8V+VjcJt7FR+aUEjOkcoiXas/PhvamMzdsn/ FM1F/A0pM8FSybIUClhnnnxNPc+p8ZN/71YQAPs+Mnh3xvbtKea2lkhC+Xv4OpK3 edutSZWFaiIery82Rk00H3vqiSF1+kRIXSpZSS4mElk4FsVljkyH+nSP7rbmE2MR EQe+kKnZl8QzWrVbnODC+EVvvVpA2bXDvENJmvKqus+t2G0OdV7Iku3F5E3KjF8k S5RRV1zuBF3ugqnjmYrVmJtpEA8mxClmqvg6okru+qW6ngO5oOgVpPLjWn1CXcdj wcuQ6Pe1QwAHS54e9WSWgCHVssLvm9nCdCqypdNaoyGWmbTWntwlrY7Y0JUQnAbB 6/G/DQQiCWY9y8bMZlTEydhIpgcsdROuPYv+oHF5+eQQthsWwHc= =LH11 -----END PGP SIGNATURE----- Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This patchset makes it possible to retrieve pidfds at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. After a thorough review from Oleg CLONE_PIDFD returns pidfds in the parent_tidptr argument. This means we can give back the associated pid and the pidfd at the same time. Access to process metadata information thus becomes rather trivial. As has been agreed, CLONE_PIDFD creates file descriptors based on anonymous inodes similar to the new mount api. They are made unconditional by this patchset as they are now needed by core kernel code (vfs, pidfd) even more than they already were before (timerfd, signalfd, io_uring, epoll etc.). The core patchset is rather small. The bulky looking changelist is caused by David's very simple changes to Kconfig to make anon inodes unconditional. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". To remove worries about missing metadata access this patchset comes with a sample/test program that illustrates how a combination of CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. Further work based on this patchset has been done by Joel. His work makes pidfds pollable. It finished too late for this merge window. I would prefer to have it sitting in linux-next for a while and send it for inclusion during the 5.3 merge window" * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: samples: show race-free pidfd metadata access signal: support CLONE_PIDFD with pidfd_send_signal clone: add CLONE_PIDFD Make anon_inodes unconditional | ||
|  Christian Brauner | b3e5838252 | clone: add CLONE_PIDFD This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left. CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time. To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian@brauner.io> Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> | ||
|  Nadav Amit | 13585fa066 | fork: Provide a function for copying init_mm Provide a function for copying init_mm. This function will be later used for setting a temporary mm. Tested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-6-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Nadav Amit | aad42dd44d | uprobes: Initialize uprobes earlier In order to have a separate address space for text poking, we need to duplicate init_mm early during start_kernel(). This, however, introduces a problem since uprobes functions are called from dup_mmap(), but uprobes is still not initialized in this early stage. Since uprobes initialization is necassary for fork, and since all the dependant initialization has been done when fork is initialized (percpu and vmalloc), move uprobes initialization to fork_init(). It does not seem uprobes introduces any security problem for the poking_mm. Crash and burn if uprobes initialization fails, similarly to other early initializations. Change the init_probes() name to probes_init() to match other early initialization functions name convention. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: ard.biesheuvel@linaro.org Cc: deneen.t.dock@intel.com Cc: kernel-hardening@lists.openwall.com Cc: kristen@linux.intel.com Cc: linux_dti@icloud.com Cc: will.deacon@arm.com Link: https://lkml.kernel.org/r/20190426232303.28381-6-nadav.amit@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Roman Gushchin | 76f969e894 | cgroup: cgroup v2 freezer Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com | ||
|  Linus Torvalds | a50243b1dd | 5.1 Merge Window Pull Request This has been a slightly more active cycle than normal with ongoing core
 changes and quite a lot of collected driver updates.
 
 - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
 
 - A new data transfer mode for HFI1 giving higher performance
 
 - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
   feature
 
 - A chip hang reset recovery system for hns
 
 - Change mm->pinned_vm to an atomic64
 
 - Update bnxt_re to support a new 57500 chip
 
 - A sane netlink 'rdma link add' method for creating rxe devices and fixing
   the various unregistration race conditions in rxe's unregister flow
 
 - Allow lookup up objects by an ID over netlink
 
 - Various reworking of the core to driver interface:
   * Drivers should not assume umem SGLs are in PAGE_SIZE chunks
   * ucontext is accessed via udata not other means
   * Start to make the core code responsible for object memory
     allocation
   * Drivers should convert struct device to struct ib_device
     via a helper
   * Drivers have more tools to avoid use after unregister problems
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
 mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
 IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
 2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
 w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
 ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
 Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
 vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
 9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
 v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
 3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
 9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
 =p12E
 -----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "This has been a slightly more active cycle than normal with ongoing
  core changes and quite a lot of collected driver updates.
   - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
   - A new data transfer mode for HFI1 giving higher performance
   - Significant functional and bug fix update to the mlx5
     On-Demand-Paging MR feature
   - A chip hang reset recovery system for hns
   - Change mm->pinned_vm to an atomic64
   - Update bnxt_re to support a new 57500 chip
   - A sane netlink 'rdma link add' method for creating rxe devices and
     fixing the various unregistration race conditions in rxe's
     unregister flow
   - Allow lookup up objects by an ID over netlink
   - Various reworking of the core to driver interface:
       - drivers should not assume umem SGLs are in PAGE_SIZE chunks
       - ucontext is accessed via udata not other means
       - start to make the core code responsible for object memory
         allocation
       - drivers should convert struct device to struct ib_device via a
         helper
       - drivers have more tools to avoid use after unregister problems"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
  net/mlx5: ODP support for XRC transport is not enabled by default in FW
  IB/hfi1: Close race condition on user context disable and close
  RDMA/umem: Revert broken 'off by one' fix
  RDMA/umem: minor bug fix in error handling path
  RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
  cxgb4: kfree mhp after the debug print
  IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
  IB/rdmavt: Fix loopback send with invalidate ordering
  IB/iser: Fix dma_nents type definition
  IB/mlx5: Set correct write permissions for implicit ODP MR
  bnxt_re: Clean cq for kernel consumers only
  RDMA/uverbs: Don't do double free of allocated PD
  RDMA: Handle ucontext allocations by IB/core
  RDMA/core: Fix a WARN() message
  bnxt_re: fix the regression due to changes in alloc_pbl
  IB/mlx4: Increase the timeout for CM cache
  IB/core: Abort page fault handler silently during owning process exit
  IB/mlx5: Validate correct PD before prefetch MR
  IB/mlx5: Protect against prefetch of invalid MR
  RDMA/uverbs: Store PR pointer before it is overwritten
  ... | ||
|  YueHaibing | fd2081ffce | kernel/fork.c: remove duplicated include Remove duplicated include. Link: http://lkml.kernel.org/r/20181209062952.17736-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Davidlohr Bueso | 70f8a3ca68 | mm: make mm->pinned_vm an atomic64 counter Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (ie: infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay via workqueue (un)accounting for pinned pages when not possible to acquire it. By making the counter atomic we no longer need to hold the mmap_sem and can simply some code around it for pinned_vm users. The counter is 64-bit such that we need not worry about overflows such as rdma user input controlled from userspace. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> | ||
|  Elena Reshetova | f0b89d3958 | sched/core: Convert task_struct.stack_refcount to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable task_struct.stack_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the task_struct.stack_refcount it might make a difference in following places: - try_get_task_stack(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_task_stack(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Elena Reshetova | ec1d281923 | sched/core: Convert task_struct.usage to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable task_struct.usage is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the task_struct.usage it might make a difference in following places: - put_task_struct(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Elena Reshetova | 60d4de3ff7 | sched/core: Convert signal_struct.sigcnt to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable signal_struct.sigcnt is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the signal_struct.sigcnt it might make a difference in following places: - put_signal_struct(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Elena Reshetova | d036bda7d0 | sched/core: Convert sighand_struct.count to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable sighand_struct.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the sighand_struct.count it might make a difference in following places: - __cleanup_sighand: decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Linus Torvalds | a88cc8da02 | Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, page_alloc: do not wake kswapd with zone lock held hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" mm: page_mapped: don't assume compound page is huge or THP mm/memory.c: initialise mmu_notifier_range correctly tools/vm/page_owner: use page_owner_sort in the use example kasan: fix krealloc handling for tag-based mode kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning mm, memcg: fix reclaim deadlock with writeback mm/usercopy.c: no check page span for stack objects slab: alien caches must not be initialized if the allocation of the alien cache failed fork, memcg: fix cached_stacks case zram: idle writeback fixes and cleanup | ||
|  Shakeel Butt | ba4a45746c | fork, memcg: fix cached_stacks case Commit | ||
|  David Herrmann | 7b55851367 | fork: record start_time late This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space. Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar). By recording the start_time late, it much closer reflects the point in time where the process becomes live and can be observed by other processes. Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination. Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Yi Wang | fb5bf31722 | fork: fix some -Wmissing-prototypes warnings We get a warning when building kernel with W=1:
  kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes]
  kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes]
Add the missing declaration in head file to fix this.
Also, remove arch_release_thread_stack() completely because no arch
seems to implement it since  | ||
|  YueHaibing | 0f4991e8fd | kernel/fork.c: mark 'stack_vm_area' with __maybe_unused Fixes gcc '-Wunused-but-set-variable' warning when CONFIG_VMAP_STACK is not set: kernel/fork.c: In function 'dup_task_struct': kernel/fork.c:843:20: warning: variable 'stack_vm_area' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1545965190-2381-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Arun KS | ca79b0c211 | mm: convert totalram_pages and totalhigh_pages variables to atomic totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Arun KS | 3d6357de8a | mm: reference totalram_pages and managed_pages once per function Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Rik van Riel | 5eed6f1dff | fork,memcg: fix crash in free_thread_stack on memcg charge fail Commit | ||
|  Linus Torvalds | 2d6bb6adb7 | New gcc plugin: stackleak - Introduces the stackleak gcc plugin ported from grsecurity by Alexander
   Popov, with x86 and arm64 support.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvQvn4WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpSfD/sErFreuPT1beSw994Lr9Zx4k9v
 ERsuXxWBENaJOJXbOOHMfVEcEeG/1uhPSp7hlw/dpHfh0anATTrcYqm8RNKbfK+k
 o06+JK14OJfpm5Ghq/7OizhdNLCMT8wMU3XZtWfy65VSJGjEFx8Y48vMeQtpWtUK
 ylSzi9JV6j2iUBF9oibtiT53+yqsqAtX80X1G7HRCgv9kxuKMhZr+Q5oGV6+ViyQ
 Azj8mNn06iRnhHKd17WxDJr0GjSibzz4weS/9XgP3t3EcNWJo1EgBlD2KV3tOfP5
 nzmqfqTqrcjxs/tyjdh6vVCSlYucNtyCQGn63qyShQYSg6mZwclR2fY8YSTw6PWw
 GfYWFOWru9z+qyQmwFkQ9bSQS2R+JIT0oBCj9VmtF9XmPCy7K2neJsQclzSPBiCW
 wPgXVQS4IA4684O5CmDOVMwmDpGvhdBNUR6cqSzGLxQOHY1csyXubMNUsqU3g9xk
 Ob4pEy/xrrIw4WpwHcLHSEW5gV1/OLhsT0fGRJJiC947L3cN5s9EZp7FLbIS0zlk
 qzaXUcLmn6AgcfkYwg5cI3RMLaN2V0eDCMVTWZJ1wbrmUV9chAaOnTPTjNqLOTht
 v3b1TTxXG4iCpMmOFf59F8pqgAwbBDlfyNSbySZ/Pq5QH69udz3Z9pIUlYQnSJHk
 u6q++2ReDpJXF81rBw==
 =Ks6B
 -----END PGP SIGNATURE-----
Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull stackleak gcc plugin from Kees Cook:
 "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
  was ported from grsecurity by Alexander Popov. It provides efficient
  stack content poisoning at syscall exit. This creates a defense
  against at least two classes of flaws:
   - Uninitialized stack usage. (We continue to work on improving the
     compiler to do this in other ways: e.g. unconditional zero init was
     proposed to GCC and Clang, and more plugin work has started too).
   - Stack content exposure. By greatly reducing the lifetime of valid
     stack contents, exposures via either direct read bugs or unknown
     cache side-channels become much more difficult to exploit. This
     complements the existing buddy and heap poisoning options, but
     provides the coverage for stacks.
  The x86 hooks are included in this series (which have been reviewed by
  Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
  been merged through the arm64 tree (written by Laura Abbott and
  reviewed by Mark Rutland and Will Deacon).
  With VLAs having been removed this release, there is no need for
  alloca() protection, so it has been removed from the plugin"
* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: Drop unneeded stackleak_check_alloca()
  stackleak: Allow runtime disabling of kernel stack erasing
  doc: self-protection: Add information about STACKLEAK feature
  fs/proc: Show STACKLEAK metrics in the /proc file system
  lkdtm: Add a test for STACKLEAK
  gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
  x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls | ||
|  Johannes Weiner | eb414681d5 | psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:
       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Roman Gushchin | 9b6f7e163c | mm: rework memcg kernel stack accounting If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT.  So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.
The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.
Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.
To make this happen we need more than two subsequent exits without forks
in between on the current cpu, which makes it very unlikely to happen.  As
a result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.
As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.
Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
Fixes:  | ||
|  Nadav Amit | 1ed0cc5a01 | mm: respect arch_dup_mmap() return value Commit | ||
|  Alexander Popov | afaef01c00 | x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls The STACKLEAK feature (initially developed by PaX Team) has the following benefits: 1. Reduces the information that can be revealed through kernel stack leak bugs. The idea of erasing the thread stack at the end of syscalls is similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel crypto, which all comply with FDP_RIP.2 (Full Residual Information Protection) of the Common Criteria standard. 2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712, CVE-2010-2963). That kind of bugs should be killed by improving C compilers in future, which might take a long time. This commit introduces the code filling the used part of the kernel stack with a poison value before returning to userspace. Full STACKLEAK feature also contains the gcc plugin which comes in a separate commit. The STACKLEAK feature is ported from grsecurity/PaX. More information at: https://grsecurity.net/ https://pax.grsecurity.net/ This code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX based on our understanding of the code. Changes or omissions from the original code are ours and don't reflect the original grsecurity/PaX code. Performance impact: Hardware: Intel Core i7-4770, 16 GB RAM Test #1: building the Linux kernel on a single core 0.91% slowdown Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P 4.2% slowdown So the STACKLEAK description in Kconfig includes: "The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary and you are advised to test this feature on your expected workload before deploying it". Signed-off-by: Alexander Popov <alex.popov@linux.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> | ||
|  Linus Torvalds | cd9b44f907 | Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: - the rest of MM - procfs updates - various misc things - more y2038 fixes - get_maintainer updates - lib/ updates - checkpatch updates - various epoll updates - autofs updates - hfsplus - some reiserfs work - fatfs updates - signal.c cleanups - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits) ipc/util.c: update return value of ipc_getref from int to bool ipc/util.c: further variable name cleanups ipc: simplify ipc initialization ipc: get rid of ids->tables_initialized hack lib/rhashtable: guarantee initial hashtable allocation lib/rhashtable: simplify bucket_table_alloc() ipc: drop ipc_lock() ipc/util.c: correct comment in ipc_obtain_object_check ipc: rename ipcctl_pre_down_nolock() ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc: reorganize initialization of kern_ipc_perm.seq ipc: compute kern_ipc_perm.id under the ipc lock init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp adfs: use timespec64 for time conversion kernel/sysctl.c: fix typos in comments drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md fork: don't copy inconsistent signal handler state to child signal: make get_signal() return bool signal: make sigkill_pending() return bool ... | ||
|  Jann Horn | 06e62a46bb | fork: don't copy inconsistent signal handler state to child Before this change, if a multithreaded process forks while one of its threads is changing a signal handler using sigaction(), the memcpy() in copy_sighand() can race with the struct assignment in do_sigaction(). It isn't clear whether this can cause corruption of the userspace signal handler pointer, but it definitely can cause inconsistency between different fields of struct sigaction. Take the appropriate spinlock to avoid this. I have tested that this patch prevents inconsistency between sa_sigaction and sa_flags, which is possible before this patch. Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Dmitry Vyukov | a2e5144538 | kernel/hung_task.c: allow to set checking interval separately from timeout Currently task hung checking interval is equal to timeout, as the result hung is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but this hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall is not detected as task hung and task hung is not detected as silent machine loss. The large variance in task hung detection timeout requires setting silent machine loss timeout to a very large value (e.g. if task hung is 3 mins, then silent loss need to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because usually we crash kernel within a minute, and this can add hours to bug localization process as it needs to do dozens of tests. Allow setting checking interval separately from timeout. This allows to set timeout to, say, 3 minutes, but checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: checking interval is equal to timeout. [akpm@linux-foundation.org: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Andrew Morton | a670468f5e | mm: zero out the vma in vma_init() Rather than in vm_area_alloc(). To ensure that the various oddball stack-based vmas are in a good state. Some of the callers were zeroing them out, others were not. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 0214f46b3a | Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.
  This set of changes is split into several parts:
   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.
   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ... | ||
|  Shakeel Butt | d46eb14b73 | fs: fsnotify: account fsnotify metadata to kmemcg Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 73ba2fb33c | for-4.19/block-20180812 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.
  This pull request contains:
   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)
   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes
   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)
   - Series improving how we handle sense information on the stack
     (Kees)
   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)
   - Zoned device support for null_blk (Matias)
   - AIX partition fixes (Mauricio Faria de Oliveira)
   - DIF checksum code made generic (Max Gurtovoy)
   - Add support for discard in iostats (Michael Callahan / Tejun)
   - Set of updates for BFQ (Paolo)
   - Removal of async write support for bsg (Christoph)
   - Bio page dirtying and clone fixups (Christoph)
   - Set of bcache fix/changes (via Coly)
   - Series improving blk-mq queue setup/teardown speed (Ming)
   - Series improving merging performance on blk-mq (Ming)
   - Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ... | ||
|  Linus Torvalds | 203b4fc903 | Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Thomas Gleixner: - Make lazy TLB mode even lazier to avoid pointless switch_mm() operations, which reduces CPU load by 1-2% for memcache workloads - Small cleanups and improvements all over the place * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove redundant check for kmem_cache_create() arm/asm/tlb.h: Fix build error implicit func declaration x86/mm/tlb: Make clear_asid_other() static x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off() x86/mm/tlb: Always use lazy TLB mode x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Restructure switch_mm_irqs_off() x86/mm/tlb: Leave lazy TLB mode at page table free time mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids x86/mm: Add TLB purge to free pmd/pte page interfaces ioremap: Update pgtable free interfaces with addr x86/mm: Disable ioremap free page handling on x86-PAE | ||
|  Eric W. Biederman | c3ad2c3b02 | signal: Don't restart fork when signals come in. Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new processes
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals.  Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
      is visible to signals.  This prevents signals from comming
      in before shared_pending.signal is set to delayed.signal
      and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes:  | ||
|  Jens Axboe | 05b9ba4b55 | Linux 4.18-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt vJJlibI= =42a5 -----END PGP SIGNATURE----- Merge tag 'v4.18-rc6' into for-4.19/block2 Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk> | ||
|  Eric W. Biederman | 924de3b8c9 | fork: Have new threads join on-going signal group stops There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered.  Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes.  Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
    added to the thread group list.  This should not matter as
    sighand->siglock is held over both the addition of the threads,
    the call to task_join_group_stop and do_signal_stop.  But the change
    is trivial and it is one less thing to worry about when reading
    the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Josef Bacik | 2c323017e3 | blk-cgroup: clear the throttle queue on fork We were hitting a panic in production where we put too many times on the request queue. This is because we'd get the throttle_queue of the parent if we fork()'ed while we needed to be throttled, but we didn't have a reference on it. Instead just clear these flags on fork so the child doesn't pay for the sins of its father. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Kirill A. Shutemov | 027232da7c | mm: introduce vma_init() Not all VMAs allocated with vm_area_alloc(). Some of them allocated on stack or in data segment. The new helper can be use to initialize VMA properly regardless where it was allocated. Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Eric W. Biederman | 7673bf553b | fork: Unconditionally exit if a fatal signal is pending In practice this does not change anything as testing for fatal_signal_pending and exiting for with an error code duplicates the work of the next clause which recalculates pending signals and then exits fork if any are pending. In both cases the pending signal will trigger the slow path when existing to userspace, and the fatal signal will cause do_exit to be called. The advantage of making this a separate test is that it makes it clear processing the fatal signal will terminate the fork, and it allows the rest of the signal logic to be updated without fear that this important case will be lost. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 4ca1d3ee46 | fork: Move and describe why the code examines PIDNS_ADDING Normally this would be something that would be handled by handling signals that are sent to a group of processes but in this case the forking process is not a member of the group being signaled. Thus special code is needed to prevent a race with pid namespaces exiting, and fork adding new processes within them. Move this test up before the signal restart just in case signals are also pending. Fatal conditions should take presedence over restarts. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Linus Torvalds | 490fc05386 | mm: make vm_area_alloc() initialize core fields Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 95faf6992d | mm: make vm_area_dup() actually copy the old vma data .. and re-initialize th eanon_vma_chain head. This removes some boiler-plate from the users, and also makes it clear why it didn't need use the 'zalloc()' version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 3928d4f5ee | mm: use helper functions for allocating and freeing vm_area structs The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:
    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Eric W. Biederman | 6883f81aac | pid: Implement PIDTYPE_TGID Everywhere except in the pid array we distinguish between a tasks pid and a tasks tgid (thread group id). Even in the enumeration we want that distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid we almost have an implementation of PIDTYPE_TGID in struct signal_struct. Add PIDTYPE_TGID as a first class member of the pid_type enumeration and into the pids array. Then remove the __PIDTYPE_TGID special case and the leader_pid in signal_struct. The net size increase is just an extra pointer added to struct pid and an extra pair of pointers of an hlist_node added to task_struct. The effect on code maintenance is the removal of a number of special cases today and the potential to remove many more special cases as PIDTYPE_TGID gets used to it's fullest. The long term potential is allowing zombie thread group leaders to exit, which will remove a lot more special cases in the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 2c4704756c | pids: Move the pgrp and session pid pointers from task_struct to signal_struct To access these fields the code always has to go to group leader so going to signal struct is no loss and is actually a fundamental simplification. This saves a little bit of memory by only allocating the pid pointer array once instead of once for every thread, and even better this removes a few potential races caused by the fact that group_leader can be changed by de_thread, while signal_struct can not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Rik van Riel | c1a2f7f0c0 | mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids The mm_struct always contains a cpumask bitmap, regardless of CONFIG_CPUMASK_OFFSTACK. That means the first step can be to simplify things, and simply have one bitmask at the end of the mm_struct for the mm_cpumask. This does necessitate moving everything else in mm_struct into an anonymous sub-structure, which can be randomized when struct randomization is enabled. The second step is to determine the correct size for the mm_struct slab object from the size of the mm_struct (excluding the CPU bitmap) and the size the cpumask. For init_mm we can simply allocate the maximum size this kernel is compiled for, since we only have one init_mm in the system, anyway. Pointer magic by Mike Galbraith, to evade -Wstringop-overflow getting confused by the dynamically sized array. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Tetsuo Handa | 655c79bb40 | mm: check for SIGKILL inside dup_mmap() loop As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting exit_mmap() due to NULL mmap ***** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 050e9baa9d | Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | d82991a868 | Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull restartable sequence support from Thomas Gleixner: "The restartable sequences syscall (finally): After a lot of back and forth discussion and massive delays caused by the speculative distraction of maintainers, the core set of restartable sequences has finally reached a consensus. It comes with the basic non disputed core implementation along with support for arm, powerpc and x86 and a full set of selftests It was exposed to linux-next earlier this week, so it does not fully comply with the merge window requirements, but there is really no point to drag it out for yet another cycle" * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Provide Makefile, scripts, gitignore rseq/selftests: Provide parametrized tests rseq/selftests: Provide basic percpu ops test rseq/selftests: Provide basic test rseq/selftests: Provide rseq library selftests/lib.mk: Introduce OVERRIDE_TARGETS powerpc: Wire up restartable sequences system call powerpc: Add syscall detection for restartable sequences powerpc: Add support for restartable sequences x86: Wire up restartable sequence system call x86: Add support for restartable sequences arm: Wire up restartable sequences system call arm: Add syscall detection for restartable sequences arm: Add restartable sequences support rseq: Introduce restartable sequences system call uapi/headers: Provide types_32_64.h | ||
|  Yang Shi | 88aa7cc688 | mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too.  It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
 Call Trace:
   schedule+0x36/0x80
   rwsem_down_read_failed+0xf0/0x150
   call_rwsem_down_read_failed+0x18/0x30
   down_read+0x20/0x40
   proc_pid_cmdline_read+0xd9/0x4e0
   __vfs_read+0x37/0x150
   vfs_read+0x96/0x130
   SyS_read+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem.  The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
  Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 8b5c6a3a49 | audit/stable-4.18 PR 20180605 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsXFUEUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQVeRaWujKfIoomg//eRNpc6x9kxTijN670AC2uD0CBTlZ 2z6mHuJaOhG8bTxjZxQfUBoo6/eZJ2YC1yq6ornGFNzw4sfKsR/j86ujJim2HAmo opUhziq3SILGEvjsxfPkREe/wb49jy0AA/WjZqciitB1ig8Hz7xzqi0lpNaEspFh QJFB6XXkojWGFGrRzruAVJnPS+pDWoTQR0qafs3JWKnpeinpOdZnl1hPsysAEHt5 Ag8o4qS/P9xJM0khi7T+jWECmTyT/mtWqEtFcZ0o+JLOgt/EMvNX6DO4ETDiYRD2 mVChga9x5r78bRgNy2U8IlEWWa76WpcQAEODvhzbijX4RxMAmjsmLE+e+udZSnMZ eCITl2f7ExxrL5SwNFC/5h7pAv0RJ+SOC19vcyeV4JDlQNNVjUy/aNKv5baV0aeg EmkeobneMWxqHx52aERz8RF1in5pT8gLOYoYnWfNpcDEmjLrwhuZLX2asIzUEqrS SoPJ8hxIDCxceHOWIIrz5Dqef7x28Dyi46w3QINC8bSy2RnR/H3q40DRegvXOGiS 9WcbbwbhnM4Kau413qKicGCvdqTVYdeyZqo7fVelSciD139Vk7pZotyom4MuU25p fIyGfXa8/8gkl7fZ+HNkZbba0XWNfAZt//zT095qsp3CkhVnoybwe6OwG1xRqErq W7OOQbS7vvN/KGo= =10u6 -----END PGP SIGNATURE----- Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Another reasonable chunk of audit changes for v4.18, thirteen patches in total. The thirteen patches can mostly be broken down into one of four categories: general bug fixes, accessor functions for audit state stored in the task_struct, negative filter matches on executable names, and extending the (relatively) new seccomp logging knobs to the audit subsystem. The main driver for the accessor functions from Richard are the changes we're working on to associate audit events with containers, but I think they have some standalone value too so I figured it would be good to get them in now. The seccomp/audit patches from Tyler apply the seccomp logging improvements from a few releases ago to audit's seccomp logging; starting with this patchset the changes in /proc/sys/kernel/seccomp/actions_logged should apply to both the standard kernel logging and audit. As usual, everything passes the audit-testsuite and it happens to merge cleanly with your tree" [ Heh, except it had trivial merge conflicts with the SELinux tree that also came in from Paul - Linus ] * tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Fix wrong task in comparison of session ID audit: use existing session info function audit: normalize loginuid read access audit: use new audit_context access funciton for seccomp_actions_logged audit: use inline function to set audit context audit: use inline function to get audit context audit: convert sessionid unset to a macro seccomp: Don't special case audited processes when logging seccomp: Audit attempts to modify the actions_logged sysctl seccomp: Configurable separator for the actions_logged string seccomp: Separate read and write code for actions_logged sysctl audit: allow not equal op for audit by executable audit: add syscall information to FEATURE_CHANGE records | ||
|  Mathieu Desnoyers | d7822b1e24 | rseq: Introduce restartable sequences system call Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Watson <davejwatson@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Chris Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Hunter <ahh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Maurer <bmaurer@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com | ||
|  Richard Guy Briggs | c0b0ae8a87 | audit: use inline function to set audit context Recognizing that the audit context is an internal audit value, use an access function to set the audit context pointer for the task rather than reaching directly into the task struct to set it. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in audit.h] Signed-off-by: Paul Moore <paul@paul-moore.com> | ||
|  Kees Cook | e01e80634e | fork: unconditionally clear stack on fork One of the classes of kernel stack content leaks[1] is exposing the contents of prior heap or stack contents when a new process stack is allocated. Normally, those stacks are not zeroed, and the old contents remain in place. In the face of stack content exposure flaws, those contents can leak to userspace. Fixing this will make the kernel no longer vulnerable to these flaws, as the stack will be wiped each time a stack is assigned to a new process. There's not a meaningful change in runtime performance; it almost looks like it provides a benefit. Performing back-to-back kernel builds before: Run times: 157.86 157.09 158.90 160.94 160.80 Mean: 159.12 Std Dev: 1.54 and after: Run times: 159.31 157.34 156.71 158.15 160.81 Mean: 158.46 Std Dev: 1.46 Instead of making this a build or runtime config, Andy Lutomirski recommended this just be enabled by default. [1] A noisy search for many kinds of stack content leaks can be seen here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak I did some more with perf and cycle counts on running 100,000 execs of /bin/true. before: Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841 Mean: 221015379122.60 Std Dev: 4662486552.47 after: Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348 Mean: 217745009865.40 Std Dev: 5935559279.99 It continues to look like it's faster, though the deviation is rather wide, but I'm not sure what I could do that would be less noisy. I'm open to ideas! Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mark Rutland | 3eda69c92d | kernel/fork.c: detect early free of a live mm KASAN splats indicate that in some cases we free a live mm, then continue to access it, with potentially disastrous results. This is likely due to a mismatched mmdrop() somewhere in the kernel, but so far the culprit remains elusive. Let's have __mmdrop() verify that the mm isn't live for the current task, similar to the existing check for init_mm. This way, we can catch this class of issue earlier, and without requiring KASAN. Currently, idle_task_exit() leaves active_mm stale after it switches to init_mm. This isn't harmful, but will trigger the new assertions, so we must adjust idle_task_exit() to update active_mm. Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Dominik Brodowski | 9b32105ec6 | kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare() Using this helper allows us to avoid the in-kernel calls to the sys_unshare() syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_unshare(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> | ||
|  Dominik Brodowski | 2de0db992d | mm: use do_futex() instead of sys_futex() in mm_release() sys_futex() is a wrapper to do_futex() which does not modify any values here: - uaddr, val and val3 are kept the same - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE. Therefore, val2 is always 0. - as utime is set to NULL, *timeout is NULL This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> | ||
|  Andrew Morton | d34bc48f82 | include/linux/sched/mm.h: re-inline mmdrop() As Peter points out, Doing a CALL+RET for just the decrement is a bit silly.
Fixes:  | ||
|  Linus Torvalds | a2e5790d84 | Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ... | ||
|  Marcos Paulo de Souza | 667b60946e | kernel/fork.c: add comment about usage of CLONE_FS flags and namespaces All other places that deals with namespaces have an explanation of why
the restriction is there.
The description added in this commit was based on commit  | ||
|  Marcos Paulo de Souza | 9f5325aa37 | kernel/fork.c: check error and return early Thus reducing one indentation level while maintaining the same rationale. Link: http://lkml.kernel.org/r/20171117002929.5155-1-marcos.souza.org@gmail.com Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Ingo Molnar | 8284507916 | Merge branch 'linus' into sched/urgent, to resolve conflicts Conflicts: arch/arm64/kernel/entry.S arch/x86/Kconfig include/linux/sched/mm.h kernel/fork.c Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Linus Torvalds | 617aebe6a9 | Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Kees Cook <kees@outflux.net> iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4 43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q 9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1 gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc JgOmUnQNJWCTwUUw5AS1 =tzmJ -----END PGP SIGNATURE----- Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardened usercopy whitelisting from Kees Cook: "Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage" * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits) lkdtm: Update usercopy tests for whitelisting usercopy: Restrict non-usercopy caches to size 0 kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl kvm: whitelist struct kvm_vcpu_arch arm: Implement thread_struct whitelist for hardened usercopy arm64: Implement thread_struct whitelist for hardened usercopy x86: Implement thread_struct whitelist for hardened usercopy fork: Provide usercopy whitelisting for task_struct fork: Define usercopy region in thread_stack slab caches fork: Define usercopy region in mm_struct slab caches net: Restrict unwhitelisted proto caches to size 0 sctp: Copy struct sctp_sock.autoclose to userspace using put_user() sctp: Define usercopy region in SCTP proto slab cache caif: Define usercopy region in caif proto slab cache ip: Define usercopy region in IP proto slab cache net: Define usercopy region in struct proto slab cache scsi: Define usercopy region in scsi_sense_cache slab cache cifs: Define usercopy region in cifs_request slab cache vxfs: Define usercopy region in vxfs_inode slab cache ufs: Define usercopy region in ufs_inode_cache slab cache ... | ||
|  Andrew Morton | d70f2a14b7 | include/linux/sched/mm.h: uninline mmdrop_async(), etc mmdrop_async() is only used in fork.c. Move that and its support functions into fork.c, uninline it all. Quite a lot of code gets moved around to avoid forward declarations. Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Kees Cook | 5905429ad8 | fork: Provide usercopy whitelisting for task_struct While the blocked and saved_sigmask fields of task_struct are copied to
userspace (via sigmask_to_save() and setup_rt_frame()), it is always
copied with a static length (i.e. sizeof(sigset_t)).
The only portion of task_struct that is potentially dynamically sized and
may be copied to userspace is in the architecture-specific thread_struct
at the end of task_struct.
cache object allocation:
    kernel/fork.c:
        alloc_task_struct_node(...):
            return kmem_cache_alloc_node(task_struct_cachep, ...);
        dup_task_struct(...):
            ...
            tsk = alloc_task_struct_node(node);
        copy_process(...):
            ...
            dup_task_struct(...)
        _do_fork(...):
            ...
            copy_process(...)
example usage trace:
    arch/x86/kernel/fpu/signal.c:
        __fpu__restore_sig(...):
            ...
            struct task_struct *tsk = current;
            struct fpu *fpu = &tsk->thread.fpu;
            ...
            __copy_from_user(&fpu->state.xsave, ..., state_size);
        fpu__restore_sig(...):
            ...
            return __fpu__restore_sig(...);
    arch/x86/kernel/signal.c:
        restore_sigcontext(...):
            ...
            fpu__restore_sig(...)
This introduces arch_thread_struct_whitelist() to let an architecture
declare specifically where the whitelist should be within thread_struct.
If undefined, the entire thread_struct field is left whitelisted.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: "Mickaël Salaün" <mic@digikod.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com> | ||
|  David Windsor | f9d29946c5 | fork: Define usercopy region in thread_stack slab caches In support of usercopy hardening, this patch defines a region in the
thread_stack slab caches in which userspace copy operations are allowed.
Since the entire thread_stack needs to be available to userspace, the
entire slab contents are whitelisted. Note that the slab-based thread
stack is only present on systems with THREAD_SIZE < PAGE_SIZE and
!CONFIG_VMAP_STACK.
cache object allocation:
    kernel/fork.c:
        alloc_thread_stack_node(...):
            return kmem_cache_alloc_node(thread_stack_cache, ...)
        dup_task_struct(...):
            ...
            stack = alloc_thread_stack_node(...)
            ...
            tsk->stack = stack;
        copy_process(...):
            ...
            dup_task_struct(...)
        _do_fork(...):
            ...
            copy_process(...)
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com> | ||
|  David Windsor | 07dcd7fe89 | fork: Define usercopy region in mm_struct slab caches In support of usercopy hardening, this patch defines a region in the
mm_struct slab caches in which userspace copy operations are allowed.
Only the auxv field is copied to userspace.
cache object allocation:
    kernel/fork.c:
        #define allocate_mm()     (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
        dup_mm():
            ...
            mm = allocate_mm();
        copy_mm(...):
            ...
            dup_mm();
        copy_process(...):
            ...
            copy_mm(...)
        _do_fork(...):
            ...
            copy_process(...)
example usage trace:
    fs/binfmt_elf.c:
        create_elf_tables(...):
            ...
            elf_info = (elf_addr_t *)current->mm->saved_auxv;
            ...
            copy_to_user(..., elf_info, ei_index * sizeof(elf_addr_t))
        load_elf_binary(...):
            ...
            create_elf_tables(...);
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com> | ||
|  Linus Torvalds | caf9a82657 | Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.
   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.
     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.
   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code
   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.
   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.
   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().
   - Make vysycall more robust.
   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.
   - Move the ESPFIX init to a different place to prepare for PTI.
   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.
   - Documentation fixes and clarifications"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ... | ||
|  Thomas Gleixner | c10e83f598 | arch, mm: Allow arch_dup_mmap() to fail In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be allowed to fail. Fix up all instances. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: dan.j.williams@intel.com Cc: hughd@google.com Cc: keescook@google.com Cc: kirill.shutemov@linux.intel.com Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |