2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1521 Commits

Author SHA1 Message Date
Eric Dumazet
f5f293a4e3 sched: account system time properly
Andrew Gallatin reported that IRQ and SOFTIRQ times were
sometime not reported correctly on recent kernels, and even
bisected to commit 457533a7d3
([PATCH] fix scaled & unscaled cputime accounting) as the first
bad commit.

Further analysis pointed that commit
79741dd357 ([PATCH] idle cputime
accounting) was the real cause of the problem.

account_process_tick() was not taking into account timer IRQ
interrupting the idle task servicing a hard or soft irq.

On mostly idle cpu, irqs were thus not accounted and top or
mpstat could tell user/admin that cpu was 100 % idle, 0.00 %
irq, 0.00 % softirq, while it was not.

[ Impact: fix occasionally incorrect CPU statistics in top/mpstat ]

Reported-by: Andrew Gallatin <gallatin@myri.com>
Re-reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: rick.jones2@hp.com
Cc: brice@myri.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <49F84BC1.7080602@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 15:02:28 +02:00
Ingo Molnar
e7fd5d4b3d Merge branch 'linus' into perfcounters/core
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
              the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:47:05 +02:00
David Howells
50fa610a3b sched: Document memory barriers implied by sleep/wake-up primitives
Add a section to the memory barriers document to note the implied
memory barriers of sleep primitives (set_current_state() and wrappers)
and wake-up primitives (wake_up() and co.).

Also extend the in-code comments on the wake_up() functions to note
these implied barriers.

[ Impact: add documentation ]

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090428140138.1192.94723.stgit@warthog.procyon.org.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:15:55 +02:00
Ingo Molnar
416dfdcdb8 Merge commit 'v2.6.30-rc3' into tracing/hw-branch-tracing
Conflicts:
	arch/x86/kernel/ptrace.c

Merge reason: fix the conflict above, and also pick up the CONFIG_BROKEN
              dependency change from upstream so that we can remove it
	      here.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-24 10:11:23 +02:00
Gautham R Shenoy
6e29ec5701 sched: Replace first_cpu() with cpumask_first() in ILB nomination code
Stephen Rothwell reported this build warning:

>  kernel/sched.c: In function 'find_new_ilb':
>  kernel/sched.c:4355: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
>
> Possibly caused by commit f711f6090a
> ("sched: Nominate idle load balancer from a semi-idle package") from
> the sched tree.  Should this call to first_cpu be cpumask_first?

For !(CONFIG_SCHED_MC || CONFIG_SCHED_SMT), find_new_ilb() nominates the
Idle load balancer as the first cpu from the nohz.cpu_mask.

This code uses the older API first_cpu(). Replace it with cpumask_first(),
which is the correct API here.

[ Impact: cleanup, address build warning ]

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090421031049.GA4140@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-21 08:06:05 +02:00
Peter Zijlstra
ff743345bf sched: remove extra call overhead for schedule()
Lai Jiangshan's patch reminded me that I promised Nick to remove
that extra call overhead in schedule().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090313112300.927414207@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-20 20:49:53 +02:00
Ingo Molnar
f1f9b3b179 perfcounters, sched: remove __task_delta_exec()
This function was left orphan by the latest round of sw-counter
cleanups.

[ Impact: remove unused kernel function ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-20 20:38:21 +02:00
Gautham R Shenoy
381512cf3d sched: Avoid printing sched_group::__cpu_power for default case
Commit 46e0bb9c12 ("sched: Print sched_group::__cpu_power
in sched_domain_debug") produces a messy dmesg output while
attempting to print the sched_group::__cpu_power for each
group in the sched_domain hierarchy.

Fix this by avoid printing the __cpu_power for default cases.
(i.e, __cpu_power == SCHED_LOAD_SCALE).

[ Impact: reduce syslog clutter ]

Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Fixed-by: Tony Luck <tony.luck@intel.com>
Cc: a.p.zijlstra@chello.nl
LKML-Reference: <20090414033936.GA534@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 00:46:05 +02:00
Miao Xie
13318a7186 sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
Impact: cleanup

This patch changes cpumask_first(sched_group_cpus()) to group_first_cpu()
for maintainability.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-15 13:34:07 +02:00
Steven Rostedt
ad8d75fff8 tracing/events: move trace point headers into include/trace/events
Impact: clean up

Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 22:05:43 -04:00
Steven Rostedt
a8d154b009 tracing: create automated trace defines
This patch lowers the number of places a developer must modify to add
new tracepoints. The current method to add a new tracepoint
into an existing system is to write the trace point macro in the
trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
DECLARE_TRACE, then they must add the same named item into the C file
with the macro DEFINE_TRACE(name) and then add the trace point.

This change cuts out the needing to add the DEFINE_TRACE(name).
Every file that uses the tracepoint must still include the trace/<type>.h
file, but the one C file must also add a define before the including
of that file.

 #define CREATE_TRACE_POINTS
 #include <trace/mytrace.h>

This will cause the trace/mytrace.h file to also produce the C code
necessary to implement the trace point.

Note, if more than one trace/<type>.h is used to create the C code
it is best to list them all together.

 #define CREATE_TRACE_POINTS
 #include <trace/foo.h>
 #include <trace/bar.h>
 #include <trace/fido.h>

Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
the cleaner solution of the define above the includes over my first
design to have the C code include a "special" header.

This patch converts sched, irq and lockdep and skb to use this new
method.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:28 -04:00
Johannes Weiner
78ddb08feb wait: don't use __wake_up_common()
'777c6c5 wait: prevent exclusive waiter starvation' made
__wake_up_common() global to be used from abort_exclusive_wait().

It was needed to do a wake-up with the waitqueue lock held while
passing down a key to the wake-up function.

Since '4ede816 epoll keyed wakeups: add __wake_up_locked_key() and
__wake_up_sync_key()' there is an appropriate wrapper for this case:
__wake_up_locked_key().

Use it here and make __wake_up_common() private to the scheduler
again.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1239720785-19661-1-git-send-email-hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 17:17:16 +02:00
Gautham R Shenoy
e790fb0ba6 sched: Nominate a power-efficient ilb in select_nohz_balancer()
The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.

Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045535.7645.31641.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:49:19 +02:00
Gautham R Shenoy
f711f6090a sched: Nominate idle load balancer from a semi-idle package.
Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.

For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.

Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.

The algorithm is activated only when sched_mc/smt_power_savings != 0.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045530.7645.12175.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:49:19 +02:00
Lai Jiangshan
132380a06b tracing, sched: mark get_parent_ip() notrace
Impact: remove overly redundant tracing entries

When tracer is "function" or "function_graph", way too much
"get_parent_ip" entries are recorded in ring_buffer.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D458B1.5000703@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 02:11:05 +02:00
Ingo Molnar
5af8c4e0fa Merge commit 'v2.6.30-rc1' into sched/urgent
Merge reason: update to latest upstream to queue up fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 17:26:00 +02:00
Jeremy Fitzhardinge
38f4b8c0da Merge commit 'origin/master' into for-linus/xen/master
* commit 'origin/master': (4825 commits)
  Fix build errors due to CONFIG_BRANCH_TRACER=y
  parport: Use the PCI IRQ if offered
  tty: jsm cleanups
  Adjust path to gpio headers
  KGDB_SERIAL_CONSOLE check for module
  Change KCONFIG name
  tty: Blackin CTS/RTS
  Change hardware flow control from poll to interrupt driven
  Add support for the MAX3100 SPI UART.
  lanana: assign a device name and numbering for MAX3100
  serqt: initial clean up pass for tty side
  tty: Use the generic RS485 ioctl on CRIS
  tty: Correct inline types for tty_driver_kref_get()
  splice: fix deadlock in splicing to file
  nilfs2: support nanosecond timestamp
  nilfs2: introduce secondary super block
  nilfs2: simplify handling of active state of segments
  nilfs2: mark minor flag for checkpoint created by internal operation
  nilfs2: clean up sketch file
  nilfs2: super block operations fix endian bug
  ...

Conflicts:
	arch/x86/include/asm/thread_info.h
	arch/x86/lguest/boot.c
	drivers/xen/manage.c
2009-04-07 13:34:16 -07:00
Markus Metzger
a26b89f05d sched, hw-branch-tracer: add wait_task_context_switch() function to sched.h
Add a function to wait until some other task has been
switched out at least once.

This differs from wait_task_inactive() subtly, in that the
latter will wait until the task has left the CPU.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Cc: markus.t.metzger@gmail.com
Cc: roland@redhat.com
Cc: eranian@googlemail.com
Cc: oleg@redhat.com
Cc: juan.villacis@intel.com
Cc: ak@linux.jf.intel.com
LKML-Reference: <20090403144549.794157000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:36:12 +02:00
Ingo Molnar
6c009ecef8 Merge branch 'linus' into perfcounters/core
Merge reason: need the upstream facility added by:

  7f1e2ca: hrtimer: fix rq->lock inversion (again)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 12:05:25 +02:00
Peter Zijlstra
849691a6cd perf_counter: remove rq->lock usage
Now that all the task runtime clock users are gone, remove the ugly
rq->lock usage from perf counters, which solves the nasty deadlock
seen when a software task clock counter was read from an NMI overflow
context.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.531137582@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:49:01 +02:00
Linus Torvalds
609862be07 Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: add stack dumps to asserts
  hrtimer: fix rq->lock inversion (again)
2009-04-06 13:37:30 -07:00
Peter Zijlstra
4a0deca657 perf_counter: generic context switch event
Impact: cleanup

Use the generic software events for context switches.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.283522645@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:15 +02:00
Ingo Molnar
f541ae326f Merge branch 'linus' into perfcounters/core-v2
Merge reason: we have gathered quite a few conflicts, need to merge upstream

Conflicts:
	arch/powerpc/kernel/Makefile
	arch/x86/ia32/ia32entry.S
	arch/x86/include/asm/hardirq.h
	arch/x86/include/asm/unistd_32.h
	arch/x86/include/asm/unistd_64.h
	arch/x86/kernel/cpu/common.c
	arch/x86/kernel/irq.c
	arch/x86/kernel/syscall_table_32.S
	arch/x86/mm/iomap_32.c
	include/linux/sched.h
	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:02:57 +02:00
Linus Torvalds
714f83d5d9 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
  tracing, net: fix net tree and tracing tree merge interaction
  tracing, powerpc: fix powerpc tree and tracing tree interaction
  ring-buffer: do not remove reader page from list on ring buffer free
  function-graph: allow unregistering twice
  trace: make argument 'mem' of trace_seq_putmem() const
  tracing: add missing 'extern' keywords to trace_output.h
  tracing: provide trace_seq_reserve()
  blktrace: print out BLK_TN_MESSAGE properly
  blktrace: extract duplidate code
  blktrace: fix memory leak when freeing struct blk_io_trace
  blktrace: fix blk_probes_ref chaos
  blktrace: make classic output more classic
  blktrace: fix off-by-one bug
  blktrace: fix the original blktrace
  blktrace: fix a race when creating blk_tree_root in debugfs
  blktrace: fix timestamp in binary output
  tracing, Text Edit Lock: cleanup
  tracing: filter fix for TRACE_EVENT_FORMAT events
  ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
  x86: kretprobe-booster interrupt emulation code fix
  ...

Fix up trivial conflicts in
 arch/parisc/include/asm/ftrace.h
 include/linux/memory.h
 kernel/extable.c
 kernel/module.c
2009-04-05 11:04:19 -07:00
Linus Torvalds
90975ef712 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
  cpumask: remove cpumask allocation from idle_balance, fix
  numa, cpumask: move numa_node_id default implementation to topology.h, fix
  cpumask: remove cpumask allocation from idle_balance
  x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
  x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
  x86: microcode: cleanup
  x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
  cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
  numa, cpumask: move numa_node_id default implementation to topology.h
  cpumask: convert node_to_cpumask_map[] to cpumask_var_t
  cpumask: remove x86 cpumask_t uses.
  cpumask: use cpumask_var_t in uv_flush_tlb_others.
  cpumask: remove cpumask_t assignment from vector_allocation_domain()
  cpumask: make Xen use the new operators.
  cpumask: clean up summit's send_IPI functions
  cpumask: use new cpumask functions throughout x86
  x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
  cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
  cpumask: convert node_to_cpumask_map[] to cpumask_var_t
  x86: unify 32 and 64-bit node_to_cpumask_map
  ...
2009-04-05 10:33:07 -07:00
Linus Torvalds
b1dbb67911 Merge branch 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  s390: remove arch specific smp_send_stop()
  panic: clean up kernel/panic.c
  panic, smp: provide smp_send_stop() wrapper on UP too
  panic: decrease oops_in_progress only after having done the panic
  generic-ipi: eliminate WARN_ON()s during oops/panic
  generic-ipi: cleanups
  generic-ipi: remove CSD_FLAG_WAIT
  generic-ipi: remove kmalloc()
  generic IPI: simplify barriers and locking
2009-04-03 17:33:30 -07:00
Ingo Molnar
8302294f43 Merge branch 'tracing/core-v2' into tracing-for-linus
Conflicts:
	include/linux/slub_def.h
	lib/Kconfig.debug
	mm/slob.c
	mm/slub.c
2009-04-02 00:49:02 +02:00
Davide Libenzi
4ede816ac3 epoll keyed wakeups: add __wake_up_locked_key() and __wake_up_sync_key()
This patchset introduces wakeup hints for some of the most popular (from
epoll POV) devices, so that epoll code can avoid spurious wakeups on its
waiters.

The problem with epoll is that the callback-based wakeups do not, ATM,
carry any information about the events the wakeup is related to.  So the
only choice epoll has (not being able to call f_op->poll() from inside the
callback), is to add the file* to a ready-list and resolve the real events
later on, at epoll_wait() (or its own f_op->poll()) time.  This can cause
spurious wakeups, since the wake_up() itself might be for an event the
caller is not interested into.

The rate of these spurious wakeup can be pretty high in case of many
network sockets being monitored.

By allowing devices to report the events the wakeups refer to (at least
the two major classes - POLLIN/POLLOUT), we are able to spare useless
wakeups by proper handling inside the epoll's poll callback.

Epoll will have in any case to call f_op->poll() on the file* later on,
since the change to be done in order to have the full event set sent via
wakeup, is too invasive for the way our f_op->poll() system works (the
full event set is calculated inside the poll function - there are too many
of them to even start thinking the change - also poll/select would need
change too).

Epoll is changed in a way that both devices which send event hints, and
the ones that don't, are correctly handled.  The former will gain some
efficiency though.

As a general rule for devices, would be to add an event mask by using
key-aware wakeup macros, when making up poll wait queues.  I tested it
(together with the epoll's poll fix patch Andrew has in -mm) and wakeups
for the supported devices are correctly filtered.

Test program available here:

http://www.xmailserver.org/epoll_test.c

This patch:

Nothing revolutionary here.  Just using the available "key" that our
wakeup core already support.  The __wake_up_locked_key() was no brainer,
since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers
around __wake_up_common().

The __wake_up_sync() function had a body, so the choice was between
borrowing the body for __wake_up_sync_key() and calling it from
__wake_up_sync(), or make an inline and calling it from both.  I chose the
former since in most archs it all resolves to "mov $0, REG; jmp ADDR".

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:20 -07:00
Gautham R Shenoy
46e0bb9c12 sched: Print sched_group::__cpu_power in sched_domain_debug
Impact: extend debug info /proc/sched_debug

If the user changes the value of the sched_mc/smt_power_savings sysfs
tunable, it'll trigger a rebuilding of the whole sched_domain tree,
with the SD_POWERSAVINGS_BALANCE flag set at certain levels.

As a result, there would be a change in the __cpu_power of sched_groups
in the sched_domain hierarchy.

Print the __cpu_power values for each sched_group in sched_domain_debug
to help verify this change and correlate it with the change in the
load-balancing behavior.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090330045520.2869.24777.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 17:58:03 +02:00
Bharata B Rao
ef12fefabf cpuacct: add per-cgroup utime/stime statistics
Add per-cgroup cpuacct controller statistics like the system and user
time consumed by the group of tasks.

Changelog:

v7
- Changed the name of the statistic from utime to user and from stime to
  system so that in future we could easily add other statistics like irq,
  softirq, steal times etc easily.

v6
- Fixed a bug in the error path of cpuacct_create() (pointed by Li Zefan).

v5
- In cpuacct_stats_show(), use cputime64_to_clock_t() since we are
  operating on a 64bit variable here.

v4
- Remove comments in cpuacct_update_stats() which explained why rcu_read_lock()
  was needed (as per Peter Zijlstra's review comments).
- Don't say that percpu_counter_read() is broken in Documentation/cpuacct.txt
  as per KAMEZAWA Hiroyuki's review comments.

v3
- Fix a small race in the cpuacct hierarchy walk.

v2
- stime and utime now exported in clock_t units instead of msecs.
- Addressed the code review comments from Balbir and Li Zefan.
- Moved to -tip tree.

v1
- Moved the stime/utime accounting to cpuacct controller.

Earlier versions
- http://lkml.org/lkml/2009/2/25/129

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
LKML-Reference: <20090331043222.GA4093@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 16:49:38 +02:00
Hidetoshi Seto
c5f8d99585 posixtimers, sched: Fix posix clock monotonicity
Impact: Regression fix (against clock_gettime() backwarding bug)

This patch re-introduces a couple of functions, task_sched_runtime
and thread_group_sched_runtime, which was once removed at the
time of 2.6.28-rc1.

These functions protect the sampling of thread/process clock with
rq lock.  This rq lock is required not to update rq->clock during
the sampling.

i.e.
  The clock_gettime() may return
   ((accounted runtime before update) + (delta after update))
  that is less than what it should be.

v2 -> v3:
	- Rename static helper function __task_delta_exec()
	  to do_task_delta_exec() since -tip tree already has
	  a __task_delta_exec() of different version.

v1 -> v2:
	- Revises comments of function and patch description.
	- Add note about accuracy of thread group's runtime.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org	[2.6.28.x][2.6.29.x]
LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 16:44:16 +02:00
Bharata B Rao
a18b83b7ef cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2
Impact: fix cgroups race under rcu-preempt

cpuacct_charge() obtains task's ca and does a hierarchy walk upwards.
This can race with the task's movement between cgroups. This race
can cause an access to freed ca pointer in cpuacct_charge() or access
to invalid cgroups pointer of the task. This will not happen with rcu or
tree rcu as cpuacct_charge() is called with preemption disabled. However if
rcupreempt is used, the race is seen. Thanks to Li Zefan for explaining this.

Fix this race by explicitly protecting ca and the hierarchy walk with
rcu_read_lock().

Changes for v2:

 - Update patch descrition (as per Li Zefan's review comments).

 - Remove comments in cpuacct_charge() which explained why rcu_read_lock()
   was needed (as per Peter Zijlstra's review comments).

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 18:27:59 +02:00
Peter Zijlstra
7f1e2ca9f0 hrtimer: fix rq->lock inversion (again)
It appears I inadvertly introduced rq->lock recursion to the
hrtimer_start() path when I delegated running already expired
timers to softirq context.

This patch fixes it by introducing a __hrtimer_start_range_ns()
method that will not use raise_softirq_irqoff() but
__raise_softirq_irqoff() which avoids the wakeup.

It then also changes schedule() to check for pending softirqs and
do the wakeup then, I'm not quite sure I like this last bit, nor
am I convinced its really needed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
LKML-Reference: <20090313112301.096138802@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 14:52:52 +02:00
Rusty Russell
558f6ab910 Merge branch 'cpumask-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
Conflicts:

	arch/x86/include/asm/topology.h
	drivers/oprofile/buffer_sync.c
(Both cases: changed in Linus' tree, removed in Ingo's).
2009-03-31 13:33:50 +10:30
Linus Torvalds
c4e1aa67ed Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits)
  lockdep: fix deadlock in lockdep_trace_alloc
  lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
  lockdep: annotate reclaim context (__GFP_NOFS), fix
  lockdep: build fix for !PROVE_LOCKING
  lockstat: warn about disabled lock debugging
  lockdep: use stringify.h
  lockdep: simplify check_prev_add_irq()
  lockdep: get_user_chars() redo
  lockdep: simplify get_user_chars()
  lockdep: add comments to mark_lock_irq()
  lockdep: remove macro usage from mark_held_locks()
  lockdep: fully reduce mark_lock_irq()
  lockdep: merge the !_READ mark_lock_irq() helpers
  lockdep: merge the _READ mark_lock_irq() helpers
  lockdep: simplify mark_lock_irq() helpers #3
  lockdep: further simplify mark_lock_irq() helpers
  lockdep: simplify the mark_lock_irq() helpers
  lockdep: split up mark_lock_irq()
  lockdep: generate usage strings
  lockdep: generate the state bit definitions
  ...
2009-03-30 17:17:35 -07:00
Ingo Molnar
65fb0d23fc Merge branch 'linus' into cpumask-for-linus
Conflicts:
	arch/x86/kernel/cpu/common.c
2009-03-30 23:53:32 +02:00
Jeremy Fitzhardinge
224101ed69 x86/paravirt: finish change from lazy cpu to context switch start/end
Impact: fix lazy context switch API

Pass the previous and next tasks into the context switch start
end calls, so that the called functions can properly access the
task state (esp in end_context_switch, in which the next task
is not yet completely current).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2009-03-29 23:36:01 -07:00
Jeremy Fitzhardinge
7fd7d83d49 x86/pvops: replace arch_enter_lazy_cpu_mode with arch_start_context_switch
Impact: simplification, prepare for later changes

Make lazy cpu mode more specific to context switching, so that
it makes sense to do more context-switch specific things in
the callbacks.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2009-03-29 23:35:59 -07:00
Randy Dunlap
d5ac537e5f sched: fix errors in struct & function comments
Fix kernel-doc errors in sched.c:  the structs don't have
kernel-doc notation and the short function description needs to
be one line only.

  Error(kernel/sched.c:3197): cannot understand prototype: 'struct sd_lb_stats '
  Error(kernel/sched.c:3228): cannot understand prototype: 'struct sg_lb_stats '
  Error(kernel/sched.c:3375): duplicate section name 'Description'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-29 08:12:39 -07:00
Ingo Molnar
6e15cf0486 Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2
Conflicts:
	arch/parisc/kernel/irq.c
	arch/x86/include/asm/fixmap_64.h
	arch/x86/include/asm/setup.h
	kernel/irq/handle.c

Semantic merge:
        arch/x86/include/asm/fixmap.h

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-27 17:28:43 +01:00
Gautham R Shenoy
b7bb4c9bb0 sched: Add comments to find_busiest_group() function
Impact: cleanup

Add /** style comments around find_busiest_group(). Also add a few
explanatory comments.

This concludes the find_busiest_group() cleanup. The function is
now down to 72 lines from the original 313 lines.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091427.13992.18933.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 13:28:30 +01:00
Gautham R Shenoy
c071df1852 sched: Refactor the power savings balance code
Impact: cleanup

Create seperate helper functions to initialize the
power-savings-balance related variables, to update them and
to check if we have a scope for performing power-savings balance.

Add no-op inline functions for the !(CONFIG_SCHED_MC || CONFIG_SCHED_SMT)
case.

This will eliminate all the #ifdef jungle in find_busiest_group() and the
other helper functions.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091422.13992.73616.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:48 +01:00
Gautham R Shenoy
a021dc0337 sched: Optimize the !power_savings_balance during fbg()
Impact: cleanup, micro-optimization

We don't need to perform power_savings balance if either the
cpu is NOT_IDLE or if the sched_domain doesn't contain the
SD_POWERSAVINGS_BALANCE flag set.

Currently, we check for these conditions multiple number of
times, even though these variables don't change over the scope
of find_busiest_group().

Check once, and store the value in the already exiting
"power_savings_balance" variable.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091417.13992.2657.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:48 +01:00
Gautham R Shenoy
dbc523a3b8 sched: Create a helper function to calculate imbalance
Move all the imbalance calculation out of find_busiest_group()
through this helper function.

With this change, the structure of find_busiest_group() will be
as follows:

- update_sched_domain_statistics.

- check if imbalance exits.

- update imbalance and return busiest.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091411.13992.43293.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:47 +01:00
Gautham R Shenoy
2e6f44aeda sched: Create helper to calculate small_imbalance in fbg()
Impact: cleanup

We have two places in find_busiest_group() where we need to calculate
the minor imbalance before returning the busiest group. Encapsulate
this functionality into a seperate helper function.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091406.13992.54316.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:47 +01:00
Gautham R Shenoy
37abe198b1 sched: Create a helper function to calculate sched_domain stats for fbg()
Impact: cleanup

Create a helper function named update_sd_lb_stats() to update the
various sched_domain related statistics in find_busiest_group().

With this we would have moved all the statistics computation out of
find_busiest_group().

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091401.13992.88737.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:46 +01:00
Gautham R Shenoy
222d656dea sched: Define structure to store the sched_domain statistics for fbg()
Impact: cleanup

Currently we use a lot of local variables in find_busiest_group()
to capture the various statistics related to the sched_domain.
Group them together into a single data structure.

This will help us to offload the job of updating the sched_domain
statistics to a helper function.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091356.13992.25970.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:46 +01:00
Gautham R Shenoy
1f8c553d0f sched: Create a helper function to calculate sched_group stats for fbg()
Impact: cleanup

Create a helper function named update_sg_lb_stats() which
can be invoked to calculate the individual group's statistics
in find_busiest_group().

This reduces the lenght of find_busiest_group() considerably.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Aked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091351.13992.43461.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:45 +01:00
Gautham R Shenoy
381be78fdc sched: Define structure to store the sched_group statistics for fbg()
Impact: cleanup

Currently a whole bunch of variables are used to store the
various statistics pertaining to the groups we iterate over
in find_busiest_group().

Group them together in a single data structure and add
appropriate comments.

This will be useful later on when we create helper functions
to calculate the sched_group statistics.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091345.13992.20099.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:45 +01:00
Gautham R Shenoy
6dfdb06290 sched: Fix indentations in find_busiest_group() using gotos
Impact: cleanup

Some indentations in find_busiest_group() can minimized by using
early exits with the help of gotos. This improves readability in
a couple of cases.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091340.13992.45062.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:44 +01:00
Gautham R Shenoy
67bb6c036d sched: Simple helper functions for find_busiest_group()
Impact: cleanup

Currently the load idx calculation code is in find_busiest_group().
Move that to a static inline helper function.

Similary, to find the first cpu of a sched_group we use
cpumask_first(sched_group_cpus(group))

Use a helper to that. It improves readability in some cases.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091335.13992.55424.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:44 +01:00
Luis Henriques
67aa0f767a sched: remove unused fields from struct rq
Impact: cleanup, new schedstat ABI

Since they are used on in statistics and are always set to zero, the
following fields from struct rq have been removed: yld_exp_empty,
yld_act_empty and yld_both_empty.

Both Sched Debug and SCHEDSTAT_VERSION versions has also been
incremented since ABIs have been changed.

The schedtop tool has been updated to properly handle new version of
schedstat:

   http://rt.wiki.kernel.org/index.php/Schedtop_utility

Signed-off-by: Luis Henriques <henrix@sapo.pt>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20090324221002.GA10061@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 23:16:51 +01:00
Rusty Russell
8c083f081d cpumask: remove cpumask allocation from idle_balance, fix
Impact: fix boot crash

Fix typo in the size calculation.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <alpine.DEB.2.00.0903181729360.31583@gandalf.stny.rr.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 13:49:44 +01:00
Rusty Russell
df7c8e845e cpumask: remove cpumask allocation from idle_balance
Impact: fix circular locking

Steven reports a circular locking from alloc_cpumask_var doing
a wakeup. We get rid of this using the tried-and-true technique
of using a per-cpu cpumask_var_t rather than doing an alloc
every time.

Simpler and more robust than a rare, implicit allocation within
an atomic codepath.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <alpine.DEB.2.00.0903181729360.31583@gandalf.stny.rr.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 08:15:15 +01:00
Luis Henriques
708dc51253 sched: small optimisation of can_migrate_task()
There were 3 invocations of task_hot() in can_migrate_task().

Replace these 3 invocations by only one invocation, cached in
a local variable.

Signed-off-by: Luis Henriques <henrix@sapo.pt>
LKML-Reference: <20090316195902.GA6197@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 12:04:41 +01:00
Luis Henriques
80dd99b368 sched: fix typos in documentation
Fixed typos in function documentation.

Signed-off-by: Luis Henriques <henrix@sapo.pt>
LKML-Reference: <20090316195809.GA6073@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 12:04:40 +01:00
Ingo Molnar
cd80a8142e Merge branch 'x86/core' into core/ipi 2009-03-13 11:05:58 +01:00
Rusty Russell
c69fc56de1 cpumask: use topology_core_cpumask/topology_thread_cpumask instead of cpu_core_map/cpu_sibling_map
Impact: cleanup

This is presumably what those definitions are for, and while all archs
define cpu_core_map/cpu_sibling map, that's changing (eg. x86 wants to
change it to a pointer).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-13 14:49:46 +10:30
Ingo Molnar
25d500067d Merge branch 'linus' into core/ipi 2009-03-13 02:14:25 +01:00
Mike Galbraith
df1c99d416 sched: add avg_overlap decay
Impact: more precise avg_overlap metric - better load-balancing

avg_overlap is used to measure the runtime overlap of the waker and
wakee.

However, when a process changes behaviour, eg a pipe becomes
un-congested and we don't need to go to sleep after a wakeup
for a while, the avg_overlap value grows stale.

When running we use the avg runtime between preemption as a
measure for avg_overlap since the amount of runtime can be
correlated to cache footprint.

The longer we run, the less likely we'll be wanting to be
migrated to another CPU.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236709131.25234.576.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-11 11:31:50 +01:00
Peter Zijlstra
57310a98a3 sched: optimize ttwu vs group scheduling
Impact: micro-optimization

We can avoid the sched domain walk on try_to_wake_up() when we know
there are no groups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236603381.8389.455.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 16:43:36 +01:00
Ingo Molnar
f0ef039851 Merge branch 'x86/core' into tracing/textedit
Conflicts:
	arch/x86/Kconfig
	block/blktrace.c
	kernel/irq/handle.c

Semantic conflict:
	kernel/trace/blktrace.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-06 16:45:01 +01:00
Lai Jiangshan
5ed0cec0ac sched: TIF_NEED_RESCHED -> need_reshed() cleanup
Impact: cleanup

Use test_tsk_need_resched(), set_tsk_need_resched(), need_resched()
instead of using TIF_NEED_RESCHED.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <49B10BA4.9070209@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-06 12:48:55 +01:00
Ingo Molnar
7fc07d8410 Merge branch 'sched/core' into sched/cleanups 2009-03-06 11:47:52 +01:00
Ingo Molnar
a1413c89ae Merge branch 'x86/urgent' into x86/core
Conflicts:
	arch/x86/include/asm/fixmap_64.h
Semantic merge:
	arch/x86/include/asm/fixmap.h

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 21:48:50 +01:00
Frederic Weisbecker
8a0be9ef82 sched: don't rebalance if attached on NULL domain
Impact: fix function graph trace hang / drop pointless softirq on UP

While debugging a function graph trace hang on an old PII, I saw
that it consumed most of its time on the timer interrupt. And
the domain rebalancing softirq was the most concerned.

The timer interrupt calls trigger_load_balance() which will
decide if it is worth to schedule a rebalancing softirq.

In case of builtin UP kernel, no problem arises because there is
no domain question.

In case of builtin SMP kernel running on an SMP box, still no
problem, the softirq will be raised each time we reach the
next_balance time.

In case of builtin SMP kernel running on a UP box (most distros
provide default SMP kernels, whatever the box you have), then
the CPU is attached to the NULL sched domain. So a kind of
unexpected behaviour happen:

trigger_load_balance() -> raises the rebalancing softirq later
on softirq: run_rebalance_domains() -> rebalance_domains() where
the for_each_domain(cpu, sd) is not taken because of the NULL
domain we are attached at. Which means rq->next_balance is never
updated. So on the next timer tick, we will enter
trigger_load_balance() which will always reschedule() the
rebalacing softirq:

if (time_after_eq(jiffies, rq->next_balance))
	raise_softirq(SCHED_SOFTIRQ);

So for each tick, we process this pointless softirq.

This patch fixes it by checking if we are attached to the null
domain before raising the softirq, another possible fix would be
to set the maximal possible JIFFIES value to rq->next_balance if
we are attached to the NULL domain.

v2: build fix on UP

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 14:04:44 +01:00
Ingo Molnar
28b1bd1cbc Merge branch 'core/locking' into tracing/ftrace 2009-03-04 18:49:19 +01:00
Ingo Molnar
8163d88c79 Merge commit 'v2.6.29-rc7' into perfcounters/core
Conflicts:
	arch/x86/mm/iomap_32.c
2009-03-04 11:42:31 +01:00
Ingo Molnar
a1be621dfa Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc7' into tracing/core 2009-03-04 11:14:47 +01:00
Wang Chen
b67802ea80 sched: kill unused parameter of pick_next_task()
Impact: micro-optimization

Parameter "prev" is not used really.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-02 12:02:53 +01:00
Ingo Molnar
5512b3ece0 Merge branches 'sched/clock', 'sched/urgent' and 'linus' into sched/core 2009-03-02 12:02:36 +01:00
Dhaval Giani
54e9912428 sched: don't allow setuid to succeed if the user does not have rt bandwidth
Impact: fix hung task with certain (non-default) rt-limit settings

Corey Hickey reported that on using setuid to change the uid of a
rt process, the process would be unkillable and not be running.
This is because there was no rt runtime for that user group. Add
in a check to see if a user can attach an rt task to its task group.
On failure, return EINVAL, which is also returned in
CONFIG_CGROUP_SCHED.

Reported-by: Corey Hickey <bugfood-ml@fatooh.org>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-27 11:11:53 +01:00
Hiroshi Shimamoto
cac64d00c2 sched_rt: don't start timer when rt bandwidth disabled
Impact: fix incorrect condition check

No need to start rt bandwidth timer when rt bandwidth is disabled.
If this timer starts, it may stop at sched_rt_period_timer() on the first time.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 14:18:55 +01:00
Li Zefan
c40c6f85a7 cpuacct: add a branch prediction
cpuacct_charge() is in fast-path, and checking of !cpuacct_susys.active
always returns false after cpuacct has been initialized at system boot.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 13:22:38 +01:00
Ingo Molnar
8e818179eb Merge branch 'x86/core' into perfcounters/core
Conflicts:
	arch/x86/kernel/apic/apic.c
	arch/x86/kernel/irqinit_32.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 13:02:23 +01:00
Peter Zijlstra
6e2756376c generic-ipi: remove CSD_FLAG_WAIT
Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework
the code so that we can use CSD_FLAG_LOCK for both purposes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 14:13:44 +01:00
Ingo Molnar
0edcf8d692 Merge branch 'tj-percpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu
Conflicts:
	arch/x86/include/asm/pgtable.h
2009-02-24 21:52:45 +01:00
Ingo Molnar
fc6fc7f1b1 Merge branch 'linus' into x86/apic
Conflicts:
	arch/x86/mach-default/setup.c

Semantic conflict resolution:
	arch/x86/kernel/setup.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-22 20:05:19 +01:00
Rusty Russell
b36128c830 alloc_percpu: change percpu_ptr to per_cpu_ptr
Impact: cleanup

There are two allocated per-cpu accessor macros with almost identical
spelling.  The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.

tj: kill percpu_ptr() and update UP too

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:08 +09:00
Ingo Molnar
72c26c9a26 Merge branch 'linus' into tracing/blktrace
Conflicts:
	block/blktrace.c

Semantic merge:
	kernel/trace/blktrace.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 09:00:35 +01:00
Américo Wang
2b8f836fb1 sched: use TASK_NICE for task_struct
#define TASK_NICE(p)            PRIO_TO_NICE((p)->static_prio)

So it's better to use TASK_NICE here.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-16 13:45:52 +01:00
Henrik Austad
a0a522ce3d sched: idle_at_tick is only used when CONFIG_SMP is set
Impact: struct rq size optimization

The idle_at_tick in struct rq is only used in SMP settings
and it does not make sense to have this in the rq in an UP setup.

Signed-off-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 21:16:10 +01:00
Ingo Molnar
5274f8354d Merge branch 'sched/urgent'; commit 'v2.6.29-rc5' into sched/core 2009-02-15 21:15:16 +01:00
Ingo Molnar
1c511f740f Merge branches 'tracing/ftrace', 'tracing/ring-buffer', 'tracing/sysprof', 'tracing/urgent' and 'linus' into tracing/core 2009-02-13 10:25:18 +01:00
Ingo Molnar
f8a6b2b9ce Merge branch 'linus' into x86/apic
Conflicts:
	arch/x86/kernel/acpi/boot.c
	arch/x86/mm/fault.c
2009-02-13 09:44:22 +01:00
Ingo Molnar
e9c4ffb11f Merge branch 'linus' into perfcounters/core
Conflicts:
	arch/x86/kernel/acpi/boot.c
2009-02-13 09:34:07 +01:00
Ingo Molnar
a0490fa35d sched: cpu hotplug fix
rq_attach_root() does a kfree() with the runqueue lock held.

That's not a very wise move, fix it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12 11:57:36 +01:00
Linus Torvalds
6c6f1f0f4d Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: revert recent sync wakeup changes
2009-02-11 08:25:06 -08:00
Linus Torvalds
94dba89533 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: fix TIMER_ABSTIME for process wide cpu timers
  timers: split process wide cpu clocks/timers, fix
  x86: clean up hpet timer reinit
  timers: split process wide cpu clocks/timers, remove spurious warning
  timers: split process wide cpu clocks/timers
  signal: re-add dead task accumulation stats.
  x86: fix hpet timer reinit for x86_64
  sched: fix nohz load balancer on cpu offline
2009-02-11 08:24:32 -08:00
Peter Zijlstra
fc631c82e1 sched: revert recent sync wakeup changes
Intel reported a 10% regression (mysql+sysbench) on a 16-way machine
with these patches:

  1596e29: sched: symmetric sync vs avg_overlap
  d942fb6: sched: fix sync wakeups

Revert them.

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Bisected-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 14:43:35 +01:00
Ingo Molnar
f437e8b53e Merge commit 'v2.6.29-rc4' into sched/core 2009-02-11 10:17:42 +01:00
Ingo Molnar
95fd4845ed Merge commit 'v2.6.29-rc4' into perfcounters/core
Conflicts:
	arch/x86/kernel/setup_percpu.c
	arch/x86/mm/fault.c
	drivers/acpi/processor_idle.c
	kernel/irq/handle.c
2009-02-11 09:22:04 +01:00
Ingo Molnar
249d51b53a Merge commit 'v2.6.29-rc4' into core/percpu
Conflicts:
	arch/x86/mach-voyager/voyager_smp.c
	arch/x86/mm/fault.c
2009-02-09 14:58:11 +01:00
Paul Mackerras
23a185ca8a perf_counters: make software counters work as per-cpu counters
Impact: kernel crash fix

Yanmin Zhang reported that using a PERF_COUNT_TASK_CLOCK software
counter as a per-cpu counter would reliably crash the system, because
it calls __task_delta_exec with a null pointer.  The page fault,
context switch and cpu migration counters also won't function
correctly as per-cpu counters since they reference the current task.

This fixes the problem by redirecting the task_clock counter to the
cpu_clock counter when used as a per-cpu counter, and by implementing
per-cpu page fault, context switch and cpu migration counters.

Along the way, this:

- Initializes counter->ctx earlier, in perf_counter_alloc, so that
  sw_perf_counter_init can use it
- Adds code to kernel/sched.c to count task migrations into each
  cpu, in rq->nr_migrations_in
- Exports the per-cpu context switch and task migration counts
  via new functions added to kernel/sched.c
- Makes sure that if sw_perf_counter_init fails, we don't try to
  initialize the counter as a hardware counter.  Since the user has
  passed a negative, non-raw event type, they clearly don't intend
  for it to be interpreted as a hardware event.

Reported-by: "Zhang Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 12:47:16 +01:00
Ingo Molnar
4ad476e11f Merge commit 'v2.6.29-rc4' into tracing/core 2009-02-09 10:32:48 +01:00
Ingo Molnar
140573d33b Merge branches 'sched/rt' and 'sched/urgent' into sched/core 2009-02-08 20:12:46 +01:00
Ingo Molnar
673f820591 Merge branch 'linus' into core/locking
Conflicts:
	fs/btrfs/locking.c
2009-02-07 18:31:54 +01:00
Ingo Molnar
9d45cf9e36 Merge branch 'x86/urgent' into x86/apic
Conflicts:
	arch/x86/mach-default/setup.c

Semantic merge:
	arch/x86/kernel/irqinit_32.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 22:30:01 +01:00
Johannes Weiner
777c6c5f1f wait: prevent exclusive waiter starvation
With exclusive waiters, every process woken up through the wait queue must
ensure that the next waiter down the line is woken when it has finished.

Interruptible waiters don't do that when aborting due to a signal.  And if
an aborting waiter is concurrently woken up through the waitqueue, noone
will ever wake up the next waiter.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(): the first contender on the queue was aborting when
the actual lock holder woke it up concurrently.  The aborted contender
didn't acquire the lock and therefor never did an unlock followed by
waking up the next waiter.

Add abort_exclusive_wait() which removes the process' wait descriptor from
the waitqueue, iff still queued, or wakes up the next waiter otherwise.
It does so under the waitqueue lock.  Racing with a wake up means the
aborting process is either already woken (removed from the queue) and will
wake up the next waiter, or it will remove itself from the queue and the
concurrent wake up will apply to the next waiter after it.

Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
__wait_on_bit_lock() when they were interrupted by other means than a wake
up through the queue.

[akpm@linux-foundation.org: coding-style fixes]
Reported-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Mentored-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chuck Lever <cel@citi.umich.edu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>		["after some testing"]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:48 -08:00
Suresh Siddha
483b4ee60e sched: fix nohz load balancer on cpu offline
Christian Borntraeger reports:

> After a logical cpu offline, even on a complete idle system, there
> is one cpu with full ticks. It turns out that nohz.cpu_mask has the
> the offlined cpu still set.
>
> In select_nohz_load_balancer() we check if the system is completely
> idle to turn of load balancing. We compare cpu_online_map with
> nohz.cpu_mask.  Since cpu_online_map is updated on cpu unplug,
> but nohz.cpu_mask is not, the check fails and the scheduler believes
> that we need an "idle load balancer" even on a fully idle system.
> Since the ilb cpu does not deactivate the timer tick this breaks NOHZ.

Fix the select_nohz_load_balancer() to not set the nohz.cpu_mask
while a cpu is going offline.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04 22:31:19 +01:00
Ingo Molnar
dc573f9b20 Merge branches 'tracing/ftrace', 'tracing/kmemtrace' and 'linus' into tracing/core 2009-02-03 06:25:38 +01:00
Peter Zijlstra
1596e29773 sched: symmetric sync vs avg_overlap
Reinstate the weakening of the sync hint if set. This yields a more
symmetric usage of avg_overlap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:49 +01:00
Peter Zijlstra
d942fb6c7d sched: fix sync wakeups
Pawel Dziekonski reported that the openssl benchmark and his
quantum chemistry application both show slowdowns due to the
scheduler under-parallelizing execution.

The reason are pipe wakeups still doing 'sync' wakeups which
overrides the normal buddy wakeup logic - even if waker and
wakee are loosely coupled.

Fix an inversion of logic in the buddy wakeup code.

Reported-by: Pawel Dziekonski <dzieko@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:06 +01:00
Steven Rostedt
7e49fcce1b trace, lockdep: manual preempt count adding for local_bh_disable
Impact: fix to preempt trace triggering lockdep check_flag failure

In local_bh_disable, the use of add_preempt_count causes the
preempt tracer to start recording the time preemption is off.
But because it already modified the preempt_count to show
softirqs disabled, and before it called the lockdep code to
handle this, it causes a state that lockdep can not handle.

The preempt tracer will reset the ring buffer on start of a trace,
and the ring buffer reset code does a spin_lock_irqsave. This
calls into lockdep and lockdep will fail when it detects the
invalid state of having softirqs disabled but the internal
current->softirqs_enabled is still set.

The fix is to manually add the SOFTIRQ_OFFSET to preempt count
and call the preempt tracer code outside the lockdep critical
area.

Thanks to Peter Zijlstra for suggesting this solution.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 11:10:57 +01:00
Ingo Molnar
bfe2a3c3b5 Merge branch 'core/percpu' into perfcounters/core
Conflicts:
	arch/x86/include/asm/hardirq_32.h
	arch/x86/include/asm/hardirq_64.h

Semantic merge:
	arch/x86/include/asm/hardirq.h
	[ added apic_perf_irqs field. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 10:20:15 +01:00
Ingo Molnar
77835492ed Merge commit 'v2.6.29-rc2' into perfcounters/core
Conflicts:
	include/linux/syscalls.h
2009-01-21 16:37:27 +01:00
Ingo Molnar
198030782c Merge branch 'x86/mm' into core/percpu
Conflicts:
	arch/x86/mm/fault.c
2009-01-21 10:39:51 +01:00
Ingo Molnar
b2b062b816 Merge branch 'core/percpu' into stackprotector
Conflicts:
	arch/x86/include/asm/pda.h
	arch/x86/include/asm/system.h

Also, moved include/asm-x86/stackprotector.h to arch/x86/include/asm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 18:37:14 +01:00
Ingo Molnar
af37501c79 Merge branch 'core/percpu' into perfcounters/core
Conflicts:
	arch/x86/include/asm/pda.h

We merge tip/core/percpu into tip/perfcounters/core because of a
semantic and contextual conflict: the former eliminates the PDA,
while the latter extends it with apic_perf_irqs field.

Resolve the conflict by moving the new field to the irq_cpustat
structure on 64-bit too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 18:15:49 +01:00
Linus Torvalds
7cb36b6ccd Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: sched_slice() fixlet
  sched: fix update_min_vruntime
  sched: SCHED_OTHER vs SCHED_IDLE isolation
  sched: SCHED_IDLE weight change
  sched: fix bandwidth validation for UID grouping
  Revert "sched: improve preempt debugging"
2009-01-15 16:55:00 -08:00
Peter Zijlstra
cce7ade803 sched: SCHED_IDLE weight change
Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable
vruntime numbers.

time advanced in 100ms:

 weight=2

 64765.988352
 67012.881408
 88501.412352

 weight=3

 35496.181411
 34130.971298
 35497.411573

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:07:28 +01:00
Peter Zijlstra
98a4826b99 sched: fix bandwidth validation for UID grouping
Impact: make rt-limit tunables work again

Mark Glines reported:

> I've got an issue on x86-64 where I can't configure the system to allow
> RT tasks for a non-root user.
>
> In 2.6.26.5, I was able to do the following to set things up nicely:
> echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
> echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
>
> Seems like every value I try to echo into the /sys files returns EINVAL.

For UID grouping we initialize the root group with infinite bandwidth
which by default is actually more than the global limit, therefore the
bandwidth check always fails.

Because the root group is a phantom group (for UID grouping) we cannot
runtime adjust it, therefore we let it reflect the global bandwidth
settings.

Reported-by: Mark Glines <mark@glines.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:07:27 +01:00
Peter Zijlstra
831451ac4e sched: introduce avg_wakeup
Introduce a new avg_wakeup statistic.

avg_wakeup is a measure of how frequently a task wakes up other tasks, it
represents the average time between wakeups, with a limit of avg_runtime
for when it doesn't wake up anybody.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 12:00:08 +01:00
Peter Zijlstra
0d66bf6d35 mutex: implement adaptive spinning
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:               296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:               85870

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 18:09:02 +01:00
Peter Zijlstra
41719b0309 mutex: preemption fixes
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.

Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicity disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 18:09:00 +01:00
Gregory Haskins
398a153b16 sched: fix build error in kernel/sched_rt.c when RT_GROUP_SCHED && !SMP
Ingo found a build error in the scheduler when RT_GROUP_SCHED was
enabled, but SMP was not.  This patch rearranges the code such
that it is a little more streamlined and compiles under all permutations
of SMP, UP and RT_GROUP_SCHED.  It was boot tested on my 4-way x86_64
and it still passes preempt-test.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2009-01-14 09:10:04 -05:00
Heiko Carstens
17da2bd90a [CVE-2009-0029] System call wrappers part 08
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:21 +01:00
Heiko Carstens
754fe8d297 [CVE-2009-0029] System call wrappers part 07
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:20 +01:00
Heiko Carstens
5add95d4f7 [CVE-2009-0029] System call wrappers part 06
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:20 +01:00
Ingo Molnar
01e3eb8227 Revert "sched: improve preempt debugging"
This reverts commit 7317d7b87e.

This has been reported (and bisected) by Alexey Zaytsev and
Kamalesh Babulal to produce annoying warnings during bootup
on both x86 and powerpc.

kernel_locked() is not a valid test in IRQ context (we update the
BKL's ->lock_depth and the preempt count separately and non-atomicalyy),
so we cannot put it into the generic preempt debugging checks which
can run in IRQ contexts too.

Reported-and-bisected-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Reported-and-bisected-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-12 13:00:50 +01:00
Steven Noonan
fd2ab30b65 kernel/sched.c: add missing forward declaration for 'double_rq_lock'
Impact: build fix on certain configs

Added 'double_rq_lock' forward declaration, allowing double_rq_lock
to be used in _double_lock_balance().

Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-11 13:06:07 +01:00
Ingo Molnar
0a6d4e1dc9 Merge branch 'sched/latest' of git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks into sched/rt 2009-01-11 04:58:49 +01:00
Ingo Molnar
506c10f26c Merge commit 'v2.6.29-rc1' into perfcounters/core
Conflicts:
	include/linux/kernel_stat.h
2009-01-11 02:42:53 +01:00
Rusty Russell
62ea9ceb17 cpumask: fix CONFIG_NUMA=y sched.c
Impact: fix panic on ia64 with NR_CPUS=1024

struct sched_domain is now a dangling structure; where we really want
static ones, we need to use static_sched_domain.

(As the FIXME in this file says, cpumask_var_t would be better, but
this code is hairy enough without trying to add initialization code to
the right places).

Reported-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-11 01:04:16 +01:00
Peter Zijlstra
da8d5089da sched: fix possible recursive rq->lock
Vaidyanathan Srinivasan reported:

 > =============================================
 > [ INFO: possible recursive locking detected ]
 > 2.6.28-autotest-tip-sv #1
 > ---------------------------------------------
 > klogd/5062 is trying to acquire lock:
 >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
 >
 > but task is already holding lock:
 >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31

With sched_mc at 2. (it is default-off)

Strictly speaking we'll not deadlock, because ttwu will not be able to
place the migration task on our rq, but since the code can deal with
both rqs getting unlocked, this seems the easiest way out.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-07 16:10:54 +01:00
Li Zefan
db2f59c8c9 sched: fix section mismatch
init_rootdomain() calls alloc_bootmem_cpumask_var() at system boot,
so does cpupri_init().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-06 11:07:15 +01:00
Li Zefan
0c910d2895 sched: fix double kfree in failure path
It's not the responsibility of init_rootdomain() to free root_domain
allocated by alloc_rootdomain().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-06 11:07:04 +01:00
Li Zefan
c70f22d203 sched: clean up arch_reinit_sched_domains()
- Make arch_reinit_sched_domains() static. It was exported to be used in
  s390, but now rebuild_sched_domains() is used instead.

- Make it return void.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-05 13:54:16 +01:00
Li Zefan
39aac64812 sched: mark sched_create_sysfs_power_savings_entries() as __init
Impact: cleanup

The only caller is cpu_dev_init() which is marked as __init.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-05 13:54:16 +01:00
Linus Torvalds
7d3b56ba37 Merge branch 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
  x86: setup_per_cpu_areas() cleanup
  cpumask: fix compile error when CONFIG_NR_CPUS is not defined
  cpumask: use alloc_cpumask_var_node where appropriate
  cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
  x86: use cpumask_var_t in acpi/boot.c
  x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
  sched: put back some stack hog changes that were undone in kernel/sched.c
  x86: enable cpus display of kernel_max and offlined cpus
  ia64: cpumask fix for is_affinity_mask_valid()
  cpumask: convert RCU implementations, fix
  xtensa: define __fls
  mn10300: define __fls
  m32r: define __fls
  h8300: define __fls
  frv: define __fls
  cris: define __fls
  cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
  cpumask: zero extra bits in alloc_cpumask_var_node
  cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
  cpumask: convert mm/
  ...
2009-01-03 12:04:39 -08:00
Linus Torvalds
61420f59a5 Merge branch 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [PATCH] fast vdso implementation for CLOCK_THREAD_CPUTIME_ID
  [PATCH] improve idle cputime accounting
  [PATCH] improve precision of idle time detection.
  [PATCH] improve precision of process accounting.
  [PATCH] idle cputime accounting
  [PATCH] fix scaled & unscaled cputime accounting
2009-01-03 11:56:24 -08:00
Mike Travis
6ca09dfc9f sched: put back some stack hog changes that were undone in kernel/sched.c
Impact: prevents panic from stack overflow on numa-capable machines.

Some of the "removal of stack hogs" changes in kernel/sched.c by using
node_to_cpumask_ptr were undone by the early cpumask API updates, and
causes a panic due to stack overflow.  This patch undoes those changes
by using cpumask_of_node() which returns a 'const struct cpumask *'.

In addition, cpu_coregoup_map is replaced with cpu_coregroup_mask further
reducing stack usage.  (Both of these updates removed 9 FIXME's!)

Also:
   Pick up some remaining changes from the old 'cpumask_t' functions to
   the new 'struct cpumask *' functions.

   Optimize memory traffic by allocating each percpu local_cpu_mask on the
   same node as the referring cpu.

Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-03 19:00:09 +01:00
Mike Travis
7eb1955336 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask into merge-rr-cpumask
Conflicts:
	arch/x86/kernel/io_apic.c
	kernel/rcuclassic.c
	kernel/sched.c
	kernel/time/tick-sched.c

Signed-off-by: Mike Travis <travis@sgi.com>
[ mingo@elte.hu: backmerged typo fix for io_apic.c ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-03 18:53:31 +01:00
Linus Torvalds
b840d79631 Merge branch 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
  x86: export vector_used_by_percpu_irq
  x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
  sched: nominate preferred wakeup cpu, fix
  x86: fix lguest used_vectors breakage, -v2
  x86: fix warning in arch/x86/kernel/io_apic.c
  sched: fix warning in kernel/sched.c
  sched: move test_sd_parent() to an SMP section of sched.h
  sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
  sched: activate active load balancing in new idle cpus
  sched: bias task wakeups to preferred semi-idle packages
  sched: nominate preferred wakeup cpu
  sched: favour lower logical cpu number for sched_mc balance
  sched: framework for sched_mc/smt_power_savings=N
  sched: convert BALANCE_FOR_xx_POWER to inline functions
  x86: use possible_cpus=NUM to extend the possible cpus allowed
  x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
  x86: update io_apic.c to the new cpumask code
  x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
  x86: xen: use smp_call_function_many()
  x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
  ...

Fixed up trivial conflict in kernel/time/tick-sched.c manually
2009-01-02 11:44:09 -08:00
Martin Schwidefsky
79741dd357 [PATCH] idle cputime accounting
The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.

In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.

This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-12-31 15:11:46 +01:00
Martin Schwidefsky
457533a7d3 [PATCH] fix scaled & unscaled cputime accounting
The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time.
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Michael Neuling <mikey@neuling.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-12-31 15:11:46 +01:00
Rusty Russell
2ca1a61583 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	arch/x86/kernel/io_apic.c
2008-12-31 23:05:57 +10:30
Ingo Molnar
a9de18eb76 Merge branch 'linus' into stackprotector
Conflicts:
	arch/x86/include/asm/pda.h
	kernel/fork.c
2008-12-31 08:31:57 +01:00
Linus Torvalds
bb758e9637 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimers: fix warning in kernel/hrtimer.c
  x86: make sure we really have an hpet mapping before using it
  x86: enable HPET on Fujitsu u9200
  linux/timex.h: cleanup for userspace
  posix-timers: simplify de_thread()->exit_itimers() path
  posix-timers: check ->it_signal instead of ->it_pid to validate the timer
  posix-timers: use "struct pid*" instead of "struct task_struct*"
  nohz: suppress needless timer reprogramming
  clocksource, acpi_pm.c: put acpi_pm_read_slow() under CONFIG_PCI
  nohz: no softirq pending warnings for offline cpus
  hrtimer: removing all ur callback modes, fix
  hrtimer: removing all ur callback modes, fix hotplug
  hrtimer: removing all ur callback modes
  x86: correct link to HPET timer specification
  rtc-cmos: export second NVRAM bank

Fixed up conflicts in sound/drivers/pcsp/pcsp.c and sound/core/hrtimer.c
manually.
2008-12-30 16:16:21 -08:00
Linus Torvalds
5f34fe1cfc Merge branch 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
  stacktrace: provide save_stack_trace_tsk() weak alias
  rcu: provide RCU options on non-preempt architectures too
  printk: fix discarding message when recursion_bug
  futex: clean up futex_(un)lock_pi fault handling
  "Tree RCU": scalable classic RCU implementation
  futex: rename field in futex_q to clarify single waiter semantics
  x86/swiotlb: add default swiotlb_arch_range_needs_mapping
  x86/swiotlb: add default phys<->bus conversion
  x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
  x86: add swiotlb allocation functions
  swiotlb: consolidate swiotlb info message printing
  swiotlb: support bouncing of HighMem pages
  swiotlb: factor out copy to/from device
  swiotlb: add arch hook to force mapping
  swiotlb: allow architectures to override phys<->bus<->phys conversions
  swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
  rcu: fix rcutorture behavior during reboot
  resources: skip sanity check of busy resources
  swiotlb: move some definitions to header
  swiotlb: allow architectures to override swiotlb pool allocation
  ...

Fix up trivial conflicts in
  arch/x86/kernel/Makefile
  arch/x86/mm/init_32.c
  include/linux/hardirq.h
as per Ingo's suggestions.
2008-12-30 16:10:19 -08:00
Rusty Russell
33edcf133b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2008-12-30 08:02:35 +10:30
Gregory Haskins
917b627d4d sched: create "pushable_tasks" list to limit pushing to one attempt
The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis).  When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted.  This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.

When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state.  Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over.  The system will stay
in this state until the system falls back below the <= 1 queued RT
task per RQ.

However, the current implementation suffers from a limitation in the
push logic.  Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if it was previously
unpushable (due to affinity, etc).  Whats more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue.  This causes two problems:

1) We can have the same tasks analyzed over and over again during each
   push, which extends out the fast path in the scheduler for no
   gain.  Consider a RQ that has dozens of tasks that are bound to a
   core.  Each one of those tasks will be encountered and skipped
   for each push operation while they are queued.

2) There may be lower-priority tasks under the unpushable task that
   could have been successfully pushed, but will never be considered
   until either the unpushable task is cleared, or a pull operation
   succeeds.  The net result is a potential latency source for mid
   priority tasks.

This patch aims to rectify these two conditions by introducing a new
priority sorted list: "pushable_tasks".  A task is added to the list
each time a task is activated or preempted.  It is removed from the
list any time it is deactivated, made current, or fails to push.

This works because a task only needs to be attempted to push once.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper.  This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable tasks.  Now every task will have a push attempted (when
appropriate).

This reduces latency both by shorting the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.

[ rostedt: added a couple more BUG_ONs ]

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
2008-12-29 09:39:53 -05:00
Gregory Haskins
967fc04671 sched: add sched_class->needs_post_schedule() member
We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquistion.  This is currently not a problem
as we only look at rq->rt.overloaded.  However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule to
a bare minimum set.  Therefore, we introduce a new member-func called
needs_post_schedule() which tests for the post_schedule condtion without
actually performing the work.  Therefore it is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).

We will use this later in the series

[ rostedt: removed paranoid BUG_ON ]

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2008-12-29 09:39:52 -05:00
Gregory Haskins
8f45e2b516 sched: make double-lock-balance fair
double_lock balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs.  This can result in higher latencies on
higher cpu-ids.

This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again.  Since callsites to
double_lock_balance already consider it a potential preemption/reschedule
point, they have the proper logic to recheck for atomicity violations.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2008-12-29 09:39:51 -05:00
Gregory Haskins
7e96fa5875 sched: pull only one task during NEWIDLE balancing to limit critical section
git-id c4acb2c066 attempted to limit
newidle critical section length by stopping after at least one task
was moved.  Further investigation has shown that there are other
paths nested further inside the algorithm which still remain that allow
long latencies to occur with newidle balancing.  This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Nick Piggin <npiggin@suse.de>
2008-12-29 09:39:50 -05:00
Gregory Haskins
e864c499d9 sched: track the next-highest priority on each runqueue
We will use this later in the series to reduce the amount of rq-lock
contention during a pull operation

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2008-12-29 09:39:49 -05:00
Ingo Molnar
e1df957670 Merge branch 'linus' into perfcounters/core
Conflicts:
	fs/exec.c
	include/linux/init_task.h

Simple context conflicts.
2008-12-29 09:45:15 +01:00
Linus Torvalds
a39b863342 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  sched: fix warning in fs/proc/base.c
  schedstat: consolidate per-task cpu runtime stats
  sched: use RCU variant of list traversal in for_each_leaf_rt_rq()
  sched, cpuacct: export percpu cpuacct cgroup stats
  sched, cpuacct: refactoring cpuusage_read / cpuusage_write
  sched: optimize update_curr()
  sched: fix wakeup preemption clock
  sched: add missing arch_update_cpu_topology() call
  sched: let arch_update_cpu_topology indicate if topology changed
  sched: idle_balance() does not call load_balance_newidle()
  sched: fix sd_parent_degenerate on non-numa smp machine
  sched: add uid information to sched_debug for CONFIG_USER_SCHED
  sched: move double_unlock_balance() higher
  sched: update comment for move_task_off_dead_cpu
  sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
  sched/rt: removed unneeded defintion
  sched: add hierarchical accounting to cpu accounting controller
  sched: include group statistics in /proc/sched_debug
  sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER
  sched: clean up SCHED_CPUMASK_ALLOC
  ...
2008-12-28 12:27:58 -08:00
Linus Torvalds
b0f4b285d7 Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
  sched, trace: update trace_sched_wakeup()
  tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
  Revert "x86: disable X86_PTRACE_BTS"
  ring-buffer: prevent false positive warning
  ring-buffer: fix dangling commit race
  ftrace: enable format arguments checking
  x86, bts: memory accounting
  x86, bts: add fork and exit handling
  ftrace: introduce tracing_reset_online_cpus() helper
  tracing: fix warnings in kernel/trace/trace_sched_switch.c
  tracing: fix warning in kernel/trace/trace.c
  tracing/ring-buffer: remove unused ring_buffer size
  trace: fix task state printout
  ftrace: add not to regex on filtering functions
  trace: better use of stack_trace_enabled for boot up code
  trace: add a way to enable or disable the stack tracer
  x86: entry_64 - introduce FTRACE_ frame macro v2
  tracing/ftrace: add the printk-msg-only option
  tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
  x86, bts: correctly report invalid bts records
  ...

Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
2008-12-28 12:21:10 -08:00
Rusty Russell
be4d638c15 cpumask: Replace cpu_coregroup_map with cpu_coregroup_mask
cpu_coregroup_map returned a cpumask_t: it's going away.

(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
2008-12-26 22:23:43 +10:30
Ingo Molnar
32e8d18683 Merge branches 'timers/clocksource', 'timers/hpet', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/rtc' into timers/core 2008-12-25 18:02:25 +01:00
Ingo Molnar
0b271ef452 Merge commit 'v2.6.28' into core/core 2008-12-25 13:51:46 +01:00
Peter Zijlstra
468a15bb4c sched, trace: update trace_sched_wakeup()
Impact: extend the wakeup tracepoint with the info whether the wakeup was real

Add the information needed to distinguish 'real' wakeups from 'false'
wakeups.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-25 13:10:21 +01:00
James Morris
cbacc2c7f0 Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
Vaidyanathan Srinivasan
36dffab679 sched: nominate preferred wakeup cpu, fix
Andrew Morton reported:

> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
>
> This warning is correct - the code is buggy.

In sched.c load_balance_newidle, there's real potential use of
uninitialised variable - fix it.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-23 22:37:29 +01:00
Ingo Molnar
aa9c4c0f96 perfcounters: fix task clock counter
Impact: fix per task clock counter precision

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-23 12:45:14 +01:00
Ingo Molnar
30cd324e97 Merge branches 'tracing/ftrace', 'tracing/ring-buffer' and 'tracing/urgent' into tracing/core
Conflicts:
	include/linux/ftrace.h
2008-12-19 09:42:40 +01:00
Ingo Molnar
9924da434a sched: fix warning in kernel/sched.c
Impact: fix cpumask conversion bug

this warning:

  kernel/sched.c: In function ‘find_busiest_group’:
  kernel/sched.c:3429: warning: passing argument 1 of ‘__first_cpu’ from incompatible pointer type

shows that we forgot to convert a new patch to the new cpumask APIs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 09:21:57 +01:00
Vaidyanathan Srinivasan
ad273b32e4 sched: activate active load balancing in new idle cpus
Impact: tweak task balancing to save power more agressively

Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.

This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short running
daemon jobs in a almost idle system

The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 09:21:54 +01:00
Vaidyanathan Srinivasan
7a09b1a27b sched: nominate preferred wakeup cpu
Impact: extend load-balancing code (no change in behavior yet)

When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This info can be stored in appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical.  Hence this information is stored in root_domain
struct which is one copy per partitioned sched domain.
The root_domain can be accessed from each cpu's runqueue
and there is one copy per partitioned sched domain.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 09:21:50 +01:00
Vaidyanathan Srinivasan
d5679bd119 sched: favour lower logical cpu number for sched_mc balance
Impact: change load-balancing direction to match that of irqbalanced

Just in case two groups have identical load, prefer to move load to lower
logical cpu number rather than the present logic of moving to higher logical
number.

find_busiest_group() tries to look for a group_leader that has spare capacity
to take more tasks and freeup an appropriate least loaded group.  Just in case
there is a tie and the load is equal, then the group with higher logical number
is favoured.  This conflicts with user space irqbalance daemon that will move
interrupts to lower logical number if the system utilisation is very low.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 09:21:48 +01:00
Gautham R Shenoy
afb8a9b70b sched: framework for sched_mc/smt_power_savings=N
Impact: extend range of /sys/devices/system/cpu/sched_mc_power_savings

Currently the sched_mc/smt_power_savings variable is a boolean,
which either enables or disables topology based power savings.
This patch extends the behaviour of the variable from boolean to
multivalued, such that based on the value, we decide how
aggressively do we want to perform powersavings balance at
appropriate sched domain based on topology.

Variable levels of power saving tunable would benefit end user to
match the required level of power savings vs performance
trade-off depending on the system configuration and workloads.

This version makes the sched_mc_power_savings global variable to
take more values (0,1,2).  Later versions can have a single
tunable called sched_power_savings instead of
sched_{mc,smt}_power_savings.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 09:21:46 +01:00
Ingo Molnar
c71dd42db2 tracing: fix warnings in kernel/trace/trace_sched_switch.c
these warnings:

  kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_register’:
  kernel/trace/trace_sched_switch.c:96: warning: passing argument 1 of ‘register_trace_sched_wakeup_new’ from incompatible pointer type
  kernel/trace/trace_sched_switch.c:112: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type
  kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_unregister’:
  kernel/trace/trace_sched_switch.c:121: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type

Trigger because sched_wakeup_new tracepoints need the same trace
signature as sched_wakeup - which was changed recently.

Fix it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-19 01:05:38 +01:00
Ken Chen
9c2c48020e schedstat: consolidate per-task cpu runtime stats
Impact: simplify code

When we turn on CONFIG_SCHEDSTATS, per-task cpu runtime is accumulated
twice. Once in task->se.sum_exec_runtime and once in sched_info.cpu_time.
These two stats are exactly the same.

Given that task->se.sum_exec_runtime is always accumulated by the core
scheduler, sched_info can reuse that data instead of duplicate the accounting.

Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-18 13:54:01 +01:00
Ken Chen
e9515c3c9f sched, cpuacct: export percpu cpuacct cgroup stats
This patch export per-cpu CPU cycle usage for a given cpuacct cgroup.
There is a need for a user space monitor daemon to track group CPU
usage on per-cpu base.  It is also useful for monitoring CFS load
balancer behavior by tracking per CPU group usage.

Signed-off-by: Ken Chen <kenchen@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 12:15:01 +01:00
Ken Chen
720f54988e sched, cpuacct: refactoring cpuusage_read / cpuusage_write
Impact: micro-optimize the code on 64-bit architectures

In the thread regarding to 'export percpu cpuacct cgroup stats'
http://lkml.org/lkml/2008/12/7/13

akpm pointed out that current cpuacct code is inefficient.  This patch
refactoring the following:

* make cpu_rq locking only on 32-bit
* change iterator to each_present_cpu instead of each_possible_cpu to
  make it hotplug friendly.

It's a bit of code churn, but I was rewarded with 160 byte code size saving
on x86-64 arch and zero code size change on i386.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 12:15:00 +01:00
Mike Galbraith
03e89e4574 sched: fix wakeup preemption clock
Impact: sharpen the wakeup-granularity to always be against current scheduler time

It was possible to do the preemption check against an old time stamp.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 09:45:38 +01:00
Ingo Molnar
6c594c21fc perfcounters: add task migrations counter
Impact: add new feature, new sw counter

Add a counter that counts the number of cross-CPU migrations a
task is suffering.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:31:26 +01:00
Rusty Russell
968ea6d80e Merge ../linux-2.6-x86
Conflicts:

	arch/x86/kernel/io_apic.c
	kernel/sched.c
	kernel/sched_stats.h
2008-12-13 21:55:51 +10:30
Rusty Russell
29c0177e6a cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers.
Impact: change calling convention of existing cpumask APIs

Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.

These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: paulus@samba.org
Cc: mingo@redhat.com
Cc: tony.luck@intel.com
Cc: ralf@linux-mips.org
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: cl@linux-foundation.org
Cc: srostedt@redhat.com
2008-12-13 21:20:25 +10:30
Ingo Molnar
45ab6b0c76 Merge branch 'sched/core' into cpus4096
Conflicts:
	include/linux/ftrace.h
	kernel/sched.c
2008-12-12 13:48:57 +01:00
Heiko Carstens
d65bd5ecb2 sched: add missing arch_update_cpu_topology() call
arch_reinit_sched_domains() used to call arch_update_cpu_topology()
via arch_init_sched_domains(). This call got lost with
e761b77252 ("cpu hotplug, sched: Introduce
cpu_active_map and redo sched domain managment (take 2)".

So we might end up with outdated and missing cpus in the cpu core
maps (architecture used to call arch_reinit_sched_domains if cpu
topology changed).

This adds a call to arch_update_cpu_topology in partition_sched_domains
which gets called whenever scheduling domains get updated. Which is
what is supposed to happen when cpu topology changes.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 13:47:22 +01:00
Heiko Carstens
ee79d1bdb6 sched: let arch_update_cpu_topology indicate if topology changed
Change arch_update_cpu_topology so it returns 1 if the cpu topology changed
and 0 if it didn't change. This will be useful for the next patch which adds
a call to this function in partition_sched_domains.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 13:47:21 +01:00
Ingo Molnar
81444a7995 Merge branch 'tracing/fastboot' into cpus4096 2008-12-12 12:43:05 +01:00
Peter Zijlstra
cbc34ed1ac sched: fix tracepoints in scheduler
The trace point only caught one of many places where a task changes cpu,
put it in the right place to we get all of them.

Change the signature while we're at it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 12:08:26 +01:00
Ingo Molnar
92bf73e90a Merge branch 'x86/irq' into perfcounters/core
( with manual semantic merge of arch/x86/kernel/cpu/perf_counter.c )
2008-12-12 12:00:14 +01:00
Ingo Molnar
c1dfdc7597 Merge commit 'v2.6.28-rc8' into sched/core 2008-12-12 10:29:35 +01:00
Brian King
9a2bd244e1 sched: CPU remove deadlock fix
Impact: fix possible deadlock in CPU hot-remove path

This patch fixes a possible deadlock scenario in the CPU remove path.
migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
with the lock held. Then one of the tasks on the migration queue ends up
calling tg_shares_up which then also tries to acquire the same rq->lock.

[c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
[c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
[c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
[c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
[c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
[c000000058eab640] c00000000007ada4 .complete+0x54/0x80
[c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
[c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
[c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
[c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
[c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
[c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
[c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
[c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
[c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
[c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-09 19:27:03 +01:00
Vaidyanathan Srinivasan
efbe027e95 sched: idle_balance() does not call load_balance_newidle()
Impact: fix SD_BALANCE_NEWIDLEand broaden its use

load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE is
set at higher level domain (3-CPU) and not in low level domain (2-MC).

pulled_task is initialised to -1 and checked for non-zero which is
always true if the lowest level sched_domain does not have
SD_BALANCE_NEWIDLE flag set.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-08 16:48:00 +01:00
Thomas Gleixner
0793a61d4d performance counters: core code
Implement the core kernel bits of Performance Counters subsystem.

The Linux Performance Counter subsystem provides an abstraction of
performance counter hardware capabilities. It provides per task and per
CPU counters, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

 int
 perf_counter_open(u32 hw_event_type,
                   u32 hw_event_period,
                   u32 record_type,
                   pid_t pid,
                   int cpu);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

See more details in Documentation/perf-counters.txt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-08 15:47:03 +01:00
Ken Chen
5436499e60 sched: fix sd_parent_degenerate on non-numa smp machine
Impact: optimize the sched domains tree some more

The addition of SD_SERIALIZE flag added to SD_NODE_INIT prevented top level
dummy numa sched_domain to be properly degenerated on non-numa smp machine.
The reason is that in sd_parent_degenerate(), it found that the child and
parent does not have comon sched_domain flags due to SD_SERIALIZE.  However,
for non-numa smp box, the top level is a dummy with a single sched_group.

Filter out SD_SERIALIZE if it is on non-numa machine to properly degenerate
top level node sched_domain.  this will cut back some of the sd domain walk
in the load balancer code.

Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-08 13:52:44 +01:00
Ingo Molnar
4d117c5c6b Merge branch 'sched/urgent' into sched/core 2008-12-08 13:52:00 +01:00
Ingo Molnar
b8307db247 Merge commit 'v2.6.28-rc7' into tracing/core 2008-12-04 09:07:19 +01:00
James Morris
ec98ce480a Merge branch 'master' into next
Conflicts:
	fs/nfsd/nfs4recover.c

Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().

Signed-off-by: James Morris <jmorris@namei.org>
2008-12-04 17:16:36 +11:00
Ingo Molnar
a64d31baed Merge branch 'linus' into cpus4096
Conflicts:
	kernel/trace/ring_buffer.c
2008-12-02 20:09:50 +01:00
Arun R Bharadwaj
6c415b9234 sched: add uid information to sched_debug for CONFIG_USER_SCHED
Impact: extend information in /proc/sched_debug

This patch adds uid information in sched_debug for CONFIG_USER_SCHED

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-01 20:39:50 +01:00
Ingo Molnar
af6d596fd6 sched: prevent divide by zero error in cpu_avg_load_per_task, update
Regarding the bug addressed in:

  4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

Linus points out that the fix is not complete:

> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.

So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-29 20:45:15 +01:00
Alexey Dobriyan
70574a996f sched: move double_unlock_balance() higher
Move double_lock_balance()/double_unlock_balance() higher to fix the following
with gcc-3.4.6:

   CC      kernel/sched.o
 In file included from kernel/sched.c:1605:
 kernel/sched_rt.c: In function `find_lock_lowest_rq':
 kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
 kernel/sched_rt.c:1077: sorry, unimplemented: called from here
 make[2]: *** [kernel/sched.o] Error 1

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-28 20:11:15 +01:00
Ingo Molnar
f1860c34b3 Merge branch 'sched/urgent' into sched/core 2008-11-28 20:11:05 +01:00
Steven Rostedt
4cd4262034 sched: prevent divide by zero error in cpu_avg_load_per_task
Impact: fix divide by zero crash in scheduler rebalance irq

While testing the branch profiler, I hit this crash:

divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
 <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
 [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
 [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
 [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
 [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
 [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
 [<ffffffff8021047a>] do_softirq+0x94/0x1cd
 [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
 [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
 [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20

The code for cpu_avg_load_per_task has:

	if (rq->nr_running)
		rq->avg_load_per_task = rq->load.weight / rq->nr_running;

The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.

The branch profiler simply made the race window bigger.

This patch saves off the rq->nr_running to a local variable and uses that
for both the condition and the division.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-27 10:29:52 +01:00
Rusty Russell
1224e376f2 sched: avoid stack var in move_task_off_dead_cpu, fix
Impact: locking fix

We can't call cpuset_cpus_allowed_locked() with the rq lock held.
However, the rq lock merely protects us from (1) cpu_online_mask changing
and (2) someone else changing p->cpus_allowed.

The first can't happen because we're being called from a cpu hotplug
notifier.  The second doesn't really matter: we are forcing the task off
a CPU it was affine to, so we're not doing very well anyway.

So we remove the rq lock from this path, and all is good.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-26 07:58:29 +01:00
Rusty Russell
bf4d83f664 sched: convert nohz struct to cpumask_var_t, fix
Impact: build fix

Fix the !CONFIG_SMP case.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-26 07:58:27 +01:00
Frederic Weisbecker
fb52607afc tracing/function-return-tracer: change the name into function-graph-tracer
Impact: cleanup

This patch changes the name of the "return function tracer" into
function-graph-tracer which is a more suitable name for a tracing
which makes one able to retrieve the ordered call stack during
the code flow.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-26 01:59:45 +01:00
Peter Zijlstra
ca109491f6 hrtimer: removing all ur callback modes
Impact: cleanup, move all hrtimer processing into hardirq context

This is an attempt at removing some of the hrtimer complexity by
reducing the number of callback modes to 1.

This means that all hrtimer callback functions will be ran from HARD-irq
context.

I went through all the 30 odd hrtimer callback functions in the kernel
and saw only one that I'm not quite sure of, which is the one in
net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.

Furthermore, the hrtimer core now calls callbacks directly with IRQs
disabled in case you try to enqueue an expired timer. If this timer is a
periodic timer (which should use hrtimer_forward() to advance its time)
then it might be possible to end up in an inf. recursive loop due to the
fact that hrtimer_forward() doesn't round up to the next timer
granularity, and therefore keeps on calling the callback - obviously
this needs a fix.

Aside from that, this seems to compile and actually boot on my dual core
test box - although I'm sure there are some bugs in, me not hitting any
makes me certain :-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-25 15:45:46 +01:00
Rusty Russell
96f874e264 sched: convert remaining old-style cpumask operators
Impact: Trivial API conversion

  NR_CPUS -> nr_cpu_ids
  cpumask_t -> struct cpumask
  sizeof(cpumask_t) -> cpumask_size()
  cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)

  cpu_set() -> cpumask_set_cpu()
  first_cpu() -> cpumask_first()
  cpumask_of_cpu() -> cpumask_of()
  cpus_* -> cpumask_*

There are some FIXMEs where we all archs to complete infrastructure
(patches have been sent):

  cpu_coregroup_map -> cpu_coregroup_mask
  node_to_cpumask* -> cpumask_of_node

There is also one FIXME where we pass an array of cpumasks to
partition_sched_domains(): this implies knowing the definition of
'struct cpumask' and the size of a cpumask.  This will be fixed in a
future patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:42 +01:00
Rusty Russell
0e3900e6d3 sched: convert local_cpu_mask to cpumask_var_t.
Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:35 +01:00
Rusty Russell
68e74568fb sched: convert struct cpupri_vec cpumask_var_t.
Impact: stack usage reduction, (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.

The fact cpupro_init is called both before and after the slab is
available makes for an ugly parameter unfortunately.

We also use cpumask_any_and to get rid of a temporary in cpupri_find.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:22 +01:00
Rusty Russell
4212823fb4 sched: convert falback_doms to cpumask_var_t.
Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:15 +01:00
Rusty Russell
dcc30a35f7 sched: convert cpu_isolated_map to cpumask_var_t.
Impact: stack usage reduction, (future) size reduction, cleanup

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

We can also use cpulist_parse() instead of doing it manually in
isolated_cpu_setup.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:10 +01:00
Rusty Russell
d5dd3db1dc sched: convert sched_domain_debug to cpumask_var_t.
Impact: stack usage reduction

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
stack space.  cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.

In this case, we always alloced, but we don't need to any more.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:52:03 +01:00
Rusty Russell
5a16f3d30c sched: convert struct (sys_)sched_setaffinity() to cpumask_var_t.
Impact: stack usage reduction

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space on the stack.  cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.

Note the removal of the initializer of new_mask: since the first thing
we did was "cpus_and(new_mask, new_mask, cpus_allowed)" I just changed
that to "cpumask_and(new_mask, in_mask, cpus_allowed);".

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:59 +01:00
Rusty Russell
e76bd8d985 sched: avoid stack var in move_task_off_dead_cpu
Impact: stack usage reduction

With some care, we can avoid needing a temporary cpumask (we can't
really allocate here, since we can't fail).

This version calls cpuset_cpus_allowed_locked() with the task_rq_lock
held.  I'm fairly sure this works, but there might be a deadlock
hiding.

And of course, we can't get rid of the last cpumask on stack until we
can use cpumask_of_node instead of node_to_cpumask.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:52 +01:00
Rusty Russell
f17c860760 sched: convert sys_sched_getaffinity() to cpumask_var_t.
Impact: stack usage reduction

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack.  cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.

Some jiggling here to make sure we always exit at the bottom (so we hit
the free_cpumask_var there).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:45 +01:00
Rusty Russell
a0e902452d sched: convert rebalance_domains() to cpumask_var_t.
Impact: stack usage reduction

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack.  cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:37 +01:00
Rusty Russell
4d2732c63e sched: convert idle_balance() to cpumask_var_t.
Impact: stack usage reduction

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack.  cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:30 +01:00
Rusty Russell
7d1e6a9b95 sched: convert nohz struct to cpumask_var_t.
Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:24 +01:00
Rusty Russell
c6c4927b22 sched: convert struct root_domain to cpumask_var_t.
Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

def_root_domain is static, and so its masks are initialized with
alloc_bootmem_cpumask_var.  After that, alloc_cpumask_var is used.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:18 +01:00
Rusty Russell
6a7b3dc344 sched: convert nohz_cpu_mask to cpumask_var_t.
Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:51:10 +01:00
Rusty Russell
6c99e9ad47 sched: convert struct sched_group/sched_domain cpumask_ts to variable bitmaps
Impact: (future) size reduction for large NR_CPUS.

We move the 'cpumask' member of sched_group to the end, so when we
kmalloc it we can do a minimal allocation: saves space for small
nr_cpu_ids but big CONFIG_NR_CPUS.  Similar trick for 'span' in
sched_domain.

This isn't quite as good as converting to a cpumask_var_t, as some
sched_groups are actually static, but it's safer: we don't have to
figure out where to call alloc_cpumask_var/free_cpumask_var.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:50:57 +01:00
Rusty Russell
758b2cdc6f sched: wrap sched_group and sched_domain cpumask accesses.
Impact: trivial wrap of member accesses

This eases the transition in the next patch.

We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
for_each_cpu_and, and sched_balance_self() due to getting weight before
setting sd to NULL.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:50:45 +01:00
Rusty Russell
1e5ce4f4a7 sched: remove any_online_cpu()
Impact: use new API

any_online_cpu() is a good name, but it takes a cpumask_t, not a
pointer.

There are several places where any_online_cpu() doesn't really want a
mask arg at all.  Replace all callers with cpumask_any() and
cpumask_any_and().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:50:17 +01:00
Rusty Russell
3404c8d97c sched: get rid of boutique sched.c allocations, use cpumask_var_t.
Impact: use new general API

Using lots of allocs rather than one big alloc is less efficient, but
who cares for this setup function?

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:50:04 +01:00
Rusty Russell
abcd083a1a sched: convert sched.c from for_each_cpu_mask to for_each_cpu.
Impact: trivial API conversion

This is a simple conversion, but note that for_each_cpu() terminates
with i >= nr_cpu_ids, not i == NR_CPUS like for_each_cpu_mask() did.

I don't convert all of them: sd->span changes in a later patch, so
change those iterators there rather than here.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:49:47 +01:00
Mike Travis
ea6f18ed5a sched: reduce stack size requirements in kernel/sched.c
Impact: cleanup

  * use node_to_cpumask_ptr in place of node_to_cpumask to reduce stack
    requirements in sched.c

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-24 17:49:27 +01:00
Ingo Molnar
943f3d0300 Merge branches 'sched/core', 'core/core' and 'tracing/core' into cpus4096 2008-11-24 17:46:57 +01:00
Ingo Molnar
b19b3c74c7 Merge branches 'core/debug', 'core/futexes', 'core/locking', 'core/rcu', 'core/signal', 'core/urgent' and 'core/xen' into core/core 2008-11-24 17:44:55 +01:00
Ingo Molnar
64b7482de2 Merge branch 'sched/rt' into sched/core 2008-11-24 17:37:12 +01:00
Ingo Molnar
82f60f0bc8 tracing/function-return-tracer: clean up task start/exit callbacks
Impact: cleanup

Eliminate #ifdefs in core code by using empty inline functions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-23 09:19:35 +01:00
Frederic Weisbecker
f201ae2356 tracing/function-return-tracer: store return stack into task_struct and allocate it dynamically
Impact: use deeper function tracing depth safely

Some tests showed that function return tracing needed a more deeper depth
of function calls. But it could be unsafe to store these return addresses
to the stack.

So these arrays will now be allocated dynamically into task_struct of current
only when the tracer is activated.

Typical scheme when tracer is activated:
- allocate a return stack for each task in global list.
- fork: allocate the return stack for the newly created task
- exit: free return stack of current
- idle init: same as fork

I chose a default depth of 50. I don't have overruns anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-23 09:17:26 +01:00
Vegard Nossum
957ad0166e sched: update comment for move_task_off_dead_cpu
Impact: cleanup

This commit:

commit f7b4cddcc5
Author: Oleg Nesterov <oleg@tv-sign.ru>
Date:   Tue Oct 16 23:30:56 2007 -0700

    do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(ta

    Currently move_task_off_dead_cpu() is called under
    write_lock_irq(tasklist).  This means it can't use task_lock() which is
    needed to improve migrating to take task's ->cpuset into account.

    Change the code to call move_task_off_dead_cpu() with irqs enabled, and
    change migrate_live_tasks() to use read_lock(tasklist).

...forgot to update the comment in front of move_task_off_dead_cpu.

Reference: http://lkml.org/lkml/2008/6/23/135

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-21 08:57:28 +01:00
Ken Chen
ec4e0e2fe0 sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
Impact: make load-balancing more consistent

In the update_shares() path leading to tg_shares_up(), the calculation of
per-cpu cfs_rq shares is rather erratic even under moderate task wake up
rate.  The problem is that the per-cpu tg->cfs_rq load weight used in the
sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
are collected at different time.  Under moderate system load, we've seen
quite a bit of variation on the cfs_rq->shares and ultimately wildly
affects sched_entity's load weight.

This patch caches the result of initial per-cpu load weight when doing the
sum calculation, and then pass it down to update_group_shares_cpu() for
redistributing per-cpu cfs_rq shares.  This allows consistent total cfs_rq
shares across all CPUs. It also simplifies the rounding and zero load
weight check.

Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-19 18:39:37 +01:00
Ingo Molnar
9676e73a9e Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/core
Conflicts:
	kernel/trace/ftrace.c

[ We conflicted here because we backported a few fixes to
  tracing/urgent - which has different internal APIs. ]
2008-11-19 10:04:25 +01:00
Ingo Molnar
3ac3ba0b39 Merge branch 'linus' into sched/core
Conflicts:
	kernel/Makefile
2008-11-19 09:44:37 +01:00
Li Zefan
700018e0a7 cpuset: fix regression when failed to generate sched domains
Impact: properly rebuild sched-domains on kmalloc() failure

When cpuset failed to generate sched domains due to kmalloc()
failure, the scheduler should fallback to the single partition
'fallback_doms' and rebuild sched domains, but now it only
destroys but not rebuilds sched domains.

The regression was introduced by:

| commit dfb512ec48
| Author: Max Krasnyansky <maxk@qualcomm.com>
| Date:   Fri Aug 29 13:11:41 2008 -0700
|
|    sched: arch_reinit_sched_domains() must destroy domains to force rebuild

After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-18 08:44:51 +01:00
Oleg Nesterov
74fcd524e8 account_steal_time: kill the unneeded account_group_system_time()
Impact: remove unnecessary accounting call

I don't actually understand account_steal_time() and I failed to find the
commit which added account_group_system_time(), but this looks bogus.
In any case rq->idle must be single-threaded, so it can't have ->totals.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-17 16:55:53 +01:00
Mathieu Desnoyers
7e066fb870 tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()
Impact: API *CHANGE*. Must update all tracepoint users.

Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
structure in a single spot for all the kernel. It helps reducing memory
consumption, especially when declaring a lot of tracepoints, e.g. for
kmalloc tracing.

*API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
to do it. The name previously used was misleading.

Updates scheduler instrumentation to follow this API change.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-16 09:01:36 +01:00
James Morris
2b82892565 Merge branch 'master' into next
Conflicts:
	security/keys/internal.h
	security/keys/process_keys.c
	security/keys/request_key.c

Fixed conflicts above by using the non 'tsk' versions.

Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 11:29:12 +11:00
David Howells
c69e8d9c01 CRED: Use RCU to access another task's creds and to release a task's own creds
Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:19 +11:00
David Howells
b6dff3ec5e CRED: Separate task security context from task_struct
Separate the task security context from task_struct.  At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:16 +11:00
David Howells
76aac0e9a1 CRED: Wrap task credential accesses in the core kernel
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id().  In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-audit@redhat.com
Cc: containers@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:12 +11:00
Ingo Molnar
5cbd54ef47 sched: fix init_idle()'s use of sched_clock()
Maciej Rutecki reported:

> I have this bug during suspend to disk:
>
> [  188.592151] Enabling non-boot CPUs ...
> [  188.592151] SMP alternatives: switching to SMP code
> [  188.666058] BUG: using smp_processor_id() in preemptible
> [00000000]
> code: suspend_to_disk/2934
> [  188.666064] caller is native_sched_clock+0x2b/0x80

Which, as noted by Linus, was caused by me, via:

  7cbaef9c "sched: optimize sched_clock() a bit"

Move the rq locking a bit earlier in the initialization sequence,
that will make the sched_clock() call in init_idle() non-preemptible.

Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-12 20:05:50 +01:00
Ingo Molnar
708b8eae0f Merge branch 'linus' into core/locking 2008-11-12 12:39:21 +01:00
Balbir Singh
a2d477778e sched: fix stale value in average load per task
Impact: fix load balancer load average calculation accuracy

cpu_avg_load_per_task() returns a stale value when nr_running is 0.
It returns an older stale (caculated when nr_running was non zero) value.

This patch returns and sets rq->avg_load_per_task to zero when nr_running
is 0.

Compile and boot tested on a x86_64 box.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-12 12:33:50 +01:00
Bharata B Rao
934352f214 sched: add hierarchical accounting to cpu accounting controller
Impact: improve CPU time accounting of tasks under the cpu accounting controller

Add hierarchical accounting to cpu accounting controller and include
cpuacct documentation.

Currently, while charging the task's cputime to its accounting group,
the accounting group hierarchy isn't updated. This patch charges the cputime
of a task to its accounting group and all its parent accounting groups.

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-11 12:13:28 +01:00
Oleg Nesterov
ad474caca3 fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock
Impact: fix hang/crash on ia64 under high load

This is ugly, but the simplest patch by far.

Unlike other similar routines, account_group_exec_runtime() could be
called "implicitly" from within scheduler after exit_notify(). This
means we can race with the parent doing release_task(), we can't just
check ->signal != NULL.

Change __exit_signal() to do spin_unlock_wait(&task_rq(tsk)->lock)
before __cleanup_signal() to make sure ->signal can't be freed under
task_rq(tsk)->lock. Note that task_rq_unlock_wait() doesn't care
about the case when tsk changes cpu/rq under us, this should be OK.

Thanks to Ingo who nacked my previous buggy patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Doug Chapman <doug.chapman@hp.com>
2008-11-11 08:01:43 +01:00
Peter Zijlstra
5ac5c4d604 sched: clean up debug info
Impact: clean up and fix debug info printout

While looking over the sched_debug code I noticed that we printed the rq
schedstats for every cfs_rq, ammend this.

Also change nr_spead_over into an int, and fix a little buglet in
min_vruntime printing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-10 10:51:51 +01:00
Li Zefan
6d21cd6251 sched: clean up SCHED_CPUMASK_ALLOC
Impact: cleanup

The #if/#endif is ugly. Change SCHED_CPUMASK_ALLOC and
SCHED_CPUMASK_FREE to static inline functions.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-07 10:30:35 +01:00
Ingo Molnar
258594a138 Merge branch 'sched/urgent' into sched/core 2008-11-07 10:29:58 +01:00
Li Zefan
ca3273f964 sched: fix memory leak in a failure path
Impact: fix rare memory leak in the sched-domains manual reconfiguration code

In the failure path, rd is not attached to a sched domain,
so it causes a leak.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-07 08:29:58 +01:00
Li Zefan
f29c9b1ccb sched: fix a bug in sched domain degenerate
Impact: re-add incorrectly eliminated sched domain layers

(1) on i386 with SCHED_SMT and SCHED_MC enabled
	# mount -t cgroup -o cpuset xxx /mnt
	# echo 0 > /mnt/cpuset.sched_load_balance
	# mkdir /mnt/0
	# echo 0 > /mnt/0/cpuset.cpus
	# dmesg
	CPU0 attaching sched-domain:
	 domain 0: span 0 level CPU
	  groups: 0

(2) on i386 with SCHED_MC enabled but SCHED_SMT disabled
	# same with (1)
	# dmesg
	CPU0 attaching NULL sched-domain.

The bug is that some sched domains may be skipped unintentionally when
degenerating (optimizing) sched domains.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-07 08:29:57 +01:00
Sripathi Kodi
cf7f8690e8 sched, lockdep: inline double_unlock_balance()
We have a test case which measures the variation in the amount of time
needed to perform a fixed amount of work on the preempt_rt kernel. We
started seeing deterioration in it's performance recently. The test
should never take more than 10 microseconds, but we started 5-10%
failure rate.

Using elimination method, we traced the problem to commit
1b12bbc747 (lockdep: re-annotate
scheduler runqueues).

When LOCKDEP is disabled, this patch only adds an additional function
call to double_unlock_balance(). Hence I inlined double_unlock_balance()
and the problem went away. Here is a patch to make this change.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06 22:12:09 +01:00
Peter Zijlstra
4793241be4 sched: backward looking buddy
Impact: improve/change/fix wakeup-buddy scheduling

Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.

This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.

This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.

The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.

This will basically allow a task to continue after it got preempted.

In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 10:30:14 +01:00
Li Zefan
faa2f98f85 sched: add sanity check in partition_sched_domains()
Impact: cleanup, add debug check

It's wrong to make dattr_new = NULL if doms_new == NULL, it introduces
memory leak if dattr_new != NULL. Fortunately dattr_new is always NULL
in this case. So remove the code and add a sanity check.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-04 10:25:13 +01:00
Li Zefan
a17e226092 sched: remove redundant call to unregister_sched_domain_sysctl()
Impact: cleanup

The sysctl has been unregistered by partition_sched_domains().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-04 10:23:42 +01:00
Li Zefan
eefd796a8e sched debug: remove sd_level_to_string()
Impact: cleanup

Just use the newly introduced sd->name.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-04 10:21:49 +01:00
Li Zefan
34f3a814ee sched: switch sched_features to seqfile
Impact: cleanup

So handling of sched_features read is simplified.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-30 11:38:35 +01:00
Li Zefan
eab172294d sched: cleanup for alloc_rt/fair_sched_group()
Impact: cleanup

Remove checking parent == NULL. It won't be NULLL, because we dynamically
create sub task_group only, and sub task_group always has its parent.
(root task_group is statically defined)

Also replace kmalloc_node(GFP_ZERO) with kzalloc_node().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-29 11:53:26 +01:00
Ingo Molnar
d1a76187a5 Merge commit 'v2.6.28-rc2' into core/locking
Conflicts:
	arch/um/include/asm/system.h
2008-10-28 16:54:49 +01:00
Peter Zijlstra
3f3a490480 sched: virtual time buddy preemption
Since we moved wakeup preemption back to virtual time, it makes sense to move
the buddy stuff back as well. The purpose of the buddy scheduling is to allow
a quickly scheduling pair of tasks to run away from the group as far as a
regular busy task would be allowed under wakeup preemption.

This has the advantage that the pair can ping-pong for a while, enjoying
cache-hotness. Without buddy scheduling other tasks would interleave destroying
the cache.

Also, it saves a word in cfs_rq.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-24 12:51:03 +02:00
Peter Zijlstra
01c8c57d66 sched: fix a find_busiest_group buglet
In one of the group load balancer patches:

	commit 408ed066b1
	Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
	Date:   Fri Jun 27 13:41:28 2008 +0200
	Subject: sched: hierarchical load vs find_busiest_group

The following change:

-               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+               if (max_load - this_load + 2*busiest_load_per_task >=
                                        busiest_load_per_task * imbn) {

made the condition always true, because imbn is [1,2].
Therefore, remove the 2*, and give the it a fair chance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-24 12:50:59 +02:00
Linus Torvalds
88ed86fee6 Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
  proc: remove fs/proc/proc_misc.c
  proc: move /proc/vmcore creation to fs/proc/vmcore.c
  proc: move pagecount stuff to fs/proc/page.c
  proc: move all /proc/kcore stuff to fs/proc/kcore.c
  proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
  proc: move /proc/modules boilerplate to kernel/module.c
  proc: move /proc/diskstats boilerplate to block/genhd.c
  proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmstat boilerplate to mm/vmstat.c
  proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
  proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmallocinfo to mm/vmalloc.c
  proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
  proc: move /proc/slab_allocators boilerplate to mm/slab.c
  proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
  proc: move /proc/stat to fs/proc/stat.c
  proc: move rest of /proc/partitions code to block/genhd.c
  proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
  proc: move /proc/devices code to fs/proc/devices.c
  proc: move rest of /proc/locks to fs/locks.c
  ...
2008-10-23 12:04:37 -07:00
Linus Torvalds
1f6d6e8ebe Merge branch 'v28-range-hrtimers-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-range-hrtimers-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (37 commits)
  hrtimers: add missing docbook comments to struct hrtimer
  hrtimers: simplify hrtimer_peek_ahead_timers()
  hrtimers: fix docbook comments
  DECLARE_PER_CPU needs linux/percpu.h
  hrtimers: fix typo
  rangetimers: fix the bug reported by Ingo for real
  rangetimer: fix BUG_ON reported by Ingo
  rangetimer: fix x86 build failure for the !HRTIMERS case
  select: fix alpha OSF wrapper
  select: fix alpha OSF wrapper
  hrtimer: peek at the timer queue just before going idle
  hrtimer: make the futex() system call use the per process slack value
  hrtimer: make the nanosleep() syscall use the per process slack
  hrtimer: fix signed/unsigned bug in slack estimator
  hrtimer: show the timer ranges in /proc/timer_list
  hrtimer: incorporate feedback from Peter Zijlstra
  hrtimer: add a hrtimer_start_range() function
  hrtimer: another build fix
  hrtimer: fix build bug found by Ingo
  hrtimer: make select() and poll() use the hrtimer range feature
  ...
2008-10-23 10:53:02 -07:00
Linus Torvalds
133e887f90 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: disable the hrtick for now
  sched: revert back to per-rq vruntime
  sched: fair scheduler should not resched rt tasks
  sched: optimize group load balancer
  sched: minor fast-path overhead reduction
  sched: fix the wrong mask_len, cleanup
  sched: kill unused scheduler decl.
  sched: fix the wrong mask_len
  sched: only update rq->clock while holding rq->lock
2008-10-23 09:37:16 -07:00
Alexey Dobriyan
b5aadf7f14 proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:06:12 +04:00
Thomas Gleixner
268a3dcfea Merge branch 'timers/range-hrtimers' into v28-range-hrtimers-for-linus-v2
Conflicts:

	kernel/time/tick-sched.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-22 09:48:06 +02:00
Linus Torvalds
92b29b86fe Merge branch 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits)
  tracing/fastboot: improve help text
  tracing/stacktrace: improve help text
  tracing/fastboot: fix initcalls disposition in bootgraph.pl
  tracing/fastboot: fix bootgraph.pl initcall name regexp
  tracing/fastboot: fix issues and improve output of bootgraph.pl
  tracepoints: synchronize unregister static inline
  tracepoints: tracepoint_synchronize_unregister()
  ftrace: make ftrace_test_p6nop disassembler-friendly
  markers: fix synchronize marker unregister static inline
  tracing/fastboot: add better resolution to initcall debug/tracing
  trace: add build-time check to avoid overrunning hex buffer
  ftrace: fix hex output mode of ftrace
  tracing/fastboot: fix initcalls disposition in bootgraph.pl
  tracing/fastboot: fix printk format typo in boot tracer
  ftrace: return an error when setting a nonexistent tracer
  ftrace: make some tracers reentrant
  ring-buffer: make reentrant
  ring-buffer: move page indexes into page headers
  tracing/fastboot: only trace non-module initcalls
  ftrace: move pc counter in irqtrace
  ...

Manually fix conflicts:
 - init/main.c: initcall tracing
 - kernel/module.c: verbose level vs tracepoints
 - scripts/bootgraph.pl: fallout from cherry-picking commits.
2008-10-20 13:35:07 -07:00
Peter Zijlstra
ffda12a17a sched: optimize group load balancer
I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
in the sched_domain. This hurts.

We need the rq-locks whenever we change the weight of the per-cpu group sched
entities. To allevate this a little, only change the weight when the new
weight is at least shares_thresh away from the old value.

This avoids the rq-lock for the top level entries, since those will never
be re-weighted, and fuzzes the lower level entries a little to gain performance
in semi-stable situations.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 14:05:02 +02:00
Thomas Gleixner
c465a76af6 Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus 2008-10-20 13:14:06 +02:00
Arjan van de Ven
651dab4264 Merge commit 'linus/master' into merge-linus
Conflicts:

	arch/x86/kvm/i8254.c
2008-10-17 09:20:26 -07:00
Peter Zijlstra
8cd162ce23 sched: only update rq->clock while holding rq->lock
Vatsa noticed rq->clock going funny and tracked it down to an update_rq_clock()
outside a rq->lock section.

This is a problem because things like double_rq_lock() update the rq->clock
value for both rqs. Therefore disabling interrupts isn't strong enough.

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-15 20:43:27 +02:00
Ingo Molnar
b2aaf8f74c Merge branch 'linus' into stackprotector
Conflicts:
	arch/x86/kernel/Makefile
	include/asm-x86/pda.h
2008-10-15 13:46:29 +02:00
Mathieu Desnoyers
0a16b60758 tracing, sched: LTTng instrumentation - scheduler
Instrument the scheduler activity (sched_switch, migration, wakeups,
wait for a task, signal delivery) and process/thread
creation/destruction (fork, exit, kthread stop). Actually, kthread
creation is not instrumented in this patch because it is architecture
dependent. It allows to connect tracers such as ftrace which detects
scheduling latencies, good/bad scheduler decisions. Tools like LTTng can
export this scheduler information along with instrumentation of the rest
of the kernel activity to perform post-mortem analysis on the scheduler
activity.

About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in code
scheduler code) was added. See the "Tracepoints" patch header for
performance result detail.

Changelog :

- Change instrumentation location and parameter to match ftrace
  instrumentation, previously done with kernel markers.

[ mingo@elte.hu: conflict resolutions ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-14 10:30:52 +02:00
Ingo Molnar
a5d8c3483a sched debug: add name to sched_domain sysctl entries
add /proc/sys/kernel/sched_domain/cpu0/domain0/name, to make
it easier to see which specific scheduler domain remained at
that entry.

Since we process the scheduler domain tree and
simplify it, it's not always immediately clear during debugging
which domain came from where.

depends on CONFIG_SCHED_DEBUG=y.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-09 17:13:06 +02:00
Li Zefan
34b3ede235 sched: remove redundant code in cpu_cgroup_create()
css will be initialized by cgroup core.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-06 08:13:34 +02:00
Ingo Molnar
2c10c22af0 Merge branch 'linus' into sched/devel 2008-10-06 08:13:18 +02:00
Nick Piggin
7317d7b87e sched: improve preempt debugging
This patch helped me out with a problem I recently had....

Basically, when the kernel lock is held, then preempt_count underflow does not
get detected until it is released which may be a long time (and arbitrarily,
eg at different points it may be rescheduled). If the bkl is released at
schedule, the resulting output is actually fairly cryptic...

With any other lock that elevates preempt_count, it is illegal to schedule
under it (which would get found pretty quickly). bkl allows scheduling with
preempt_count elevated, which makes underflows hard to debug.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-30 12:56:25 +02:00
Ingo Molnar
1508487e7f timers: fix itimer/many thread hang, fix
fix bogus rq dereference: v3 removed the locking but also removed the rq
initialization.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-30 08:28:17 +02:00
Thomas Gleixner
ccc7dadf73 hrtimer: prevent migration of per CPU hrtimers
Impact: per CPU hrtimers can be migrated from a dead CPU

The hrtimer code has no knowledge about per CPU timers, but we need to
prevent the migration of such timers and warn when such a timer is
active at migration time.

Explicitely mark the timers as per CPU and use a more understandable
mode descriptor for the interrupts safe unlocked callback mode, which
is used by hrtimer_sleeper and the scheduler code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-09-29 17:09:14 +02:00
Frank Mayhar
7086efe1c1 timers: fix itimer/many thread hang, v3
- fix UP lockup
- another set of UP/SMP cleanups and simplifications

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-27 20:04:45 +02:00
Peter Zijlstra
4653f803e6 sched: more sanity checks on the bandwidth settings
While playing around with it, I noticed we missed some sanity checks.
Also add some comments while we're there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-23 16:23:16 +02:00
Ingo Molnar
63e5c39859 Merge branches 'sched/urgent' and 'sched/rt' into sched/devel 2008-09-23 16:23:05 +02:00
Frank Mayhar
bb34d92f64 timers: fix itimer/many thread hang, v2
This is the second resubmission of the posix timer rework patch, posted
a few days ago.

This includes the changes from the previous resubmittion, which addressed
Oleg Nesterov's comments, removing the RCU stuff from the patch and
un-inlining the thread_group_cputime() function for SMP.

In addition, per Ingo Molnar it simplifies the UP code, consolidating much
of it with the SMP version and depending on lower-level SMP/UP handling to
take care of the differences.

It also cleans up some UP compile errors, moves the scheduler stats-related
macros into kernel/sched_stats.h, cleans up a merge error in
kernel/fork.c and has a few other minor fixes and cleanups as suggested
by Oleg and Ingo. Thanks for the review, guys.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-23 13:38:44 +02:00
Andrew Morton
006c75f146 sched: clarify ifdef tangle
- Add some comments to try to make the ifdef puzzle a bit clearer

- Explicitly inline one of the three init_hrtick() implementations.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-23 11:02:31 +02:00
Rakib Mullick
fa74820317 sched: fix init_hrtick() section mismatch warning
LD      kernel/built-in.o
WARNING: kernel/built-in.o(.text+0x326): Section mismatch in reference
from the function init_hrtick() to the variable
.cpuinit.data:hotplug_hrtick_nb.8
The function init_hrtick() references
the variable __cpuinitdata hotplug_hrtick_nb.8.
This is often because init_hrtick lacks a __cpuinitdata
annotation or the annotation of hotplug_hrtick_nb.8 is wrong.

Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-23 11:02:13 +02:00
Peter Zijlstra
15afe09bf4 sched: wakeup preempt when small overlap
Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.

The difference seems to come from different preemption agressiveness,
which affects the cache footprint of the workload and its effective
cache trashing.

Aggresively preempt a task if its avg overlap is very small, this should
avoid the task going to sleep and find it still running when we schedule
back to it - saving a wakeup.

Reported-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-22 16:28:32 +02:00
Frank Mayhar
f06febc96b timers: fix itimer/many thread hang
Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads.  It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine.  (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.)  To do this, at each tick we now update fields in
signal_struct as well as task_struct.  The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

struct task_cputime {
	cputime_t utime;
	cputime_t stime;
	unsigned long long sum_exec_runtime;
};

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

struct thread_group_cputime {
	struct task_cputime totals;
};

struct thread_group_cputime {
	struct task_cputime *totals;
};

We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers).  The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends.  In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention).  For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu().  The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().

We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel.  The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields.  The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures.  The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated.  The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU.  Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.

Non-SMP operation is trivial and will not be mentioned further.

The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().

All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.

Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away.  All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline.  When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.

Performance

The fix appears not to add significant overhead to existing operations.  It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below).  Overall it's a wash except in those
two cases.

I've since done somewhat more involved testing on a dual-core Opteron system.

Case 1: With no itimer running, for a test with 100,000 threads, the fixed
	kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
	all of which was spent in the system.  There were twice as many
	voluntary context switches with the fix as without it.

Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
	an unmodified kernel can handle), the fixed kernel ran the test in
	eight percent of the time (5.8 seconds as opposed to 70 seconds) and
	had better tick accuracy (.012 seconds per tick as opposed to .023
	seconds per tick).

Case 3: A 4000-thread test with an initial timer tick of .01 second and an
	interval of 10,000 seconds (i.e. a timer that ticks only once) had
	very nearly the same performance in both cases:  6.3 seconds elapsed
	for the fixed kernel versus 5.5 seconds for the unfixed kernel.

With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds).  The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.

Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
	running), the modified kernel was very slightly favored in that while
	it killed the process in 19.997 seconds of CPU time (5.002 seconds of
	wall time), only .003 seconds of that was system time, the rest was
	user time.  The unmodified kernel killed the process in 20.001 seconds
	of CPU (5.014 seconds of wall time) of which .016 seconds was system
	time.  Really, though, the results were too close to call.  The results
	were essentially the same with no itimer running.

Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
	(where the hard limit would never be reached) and an itimer running,
	the modified kernel exhibited worse tick accuracy than the unmodified
	kernel: .050 seconds/tick versus .028 seconds/tick.  Otherwise,
	performance was almost indistinguishable.  With no itimer running this
	test exhibited virtually identical behavior and times in both cases.

In times past I did some limited performance testing.  those results are below.

On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s.  On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds.  Performance with eight, four and one
thread were comparable.  Interestingly, the timer ticks with the fix seemed
more accurate:  The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick.  Both cases were configured for an interval of
0.01 seconds.  Again, the other tests were comparable.  Each thread in this
test computed the primes up to 25,000,000.

I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix.  In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable).  System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s).  It received 147651 ticks for 0.015 seconds per tick, still quite
accurate.  There is obviously no comparable test without the fix.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-14 16:25:35 +02:00
Ingo Molnar
09b22a2f67 Merge commit 'v2.6.27-rc6' into sched/devel 2008-09-11 13:37:28 +02:00
Hiroshi Shimamoto
ec5d498991 sched: fix deadlock in setting scheduler parameter to zero
Andrei Gusev wrote:

> I played witch scheduler settings. After doing something like:
> echo -n 1000000 >sched_rt_period_us
>
> command is locked. I found in kernel.log:
>
> Sep 11 00:39:34 zaratustra
> Sep 11 00:39:34 zaratustra Pid: 4495, comm: bash Tainted: G        W
> (2.6.26.3 #12)
> Sep 11 00:39:34 zaratustra EIP: 0060:[<c0213fc7>] EFLAGS: 00210246 CPU: 0
> Sep 11 00:39:34 zaratustra EIP is at div64_u64+0x57/0x80
> Sep 11 00:39:34 zaratustra EAX: 0000389f EBX: 00000000 ECX: 00000000
> EDX: 00000000
> Sep 11 00:39:34 zaratustra ESI: d9800000 EDI: d9800000 EBP: 0000389f
> ESP: ea7a6edc
> Sep 11 00:39:34 zaratustra DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
> Sep 11 00:39:34 zaratustra Process bash (pid: 4495, ti=ea7a6000
> task=ea744000 task.ti=ea7a6000)
> Sep 11 00:39:34 zaratustra Stack: 00000000 000003e8 d9800000 0000389f
> c0119042 00000000 00000000 00000001
> Sep 11 00:39:34 zaratustra 00000000 00000000 ea7a6f54 00010000 00000000
> c04d2e80 00000001 000e7ef0
> Sep 11 00:39:34 zaratustra c01191a3 00000000 00000000 ea7a6fa0 00000001
> ffffffff c04d2e80 ea5b2480
> Sep 11 00:39:34 zaratustra Call Trace:
> Sep 11 00:39:34 zaratustra [<c0119042>] __rt_schedulable+0x52/0x130
> Sep 11 00:39:34 zaratustra [<c01191a3>] sched_rt_handler+0x83/0x120
> Sep 11 00:39:34 zaratustra [<c01a76a6>] proc_sys_call_handler+0xb6/0xd0
> Sep 11 00:39:34 zaratustra [<c01a76c0>] proc_sys_write+0x0/0x20
> Sep 11 00:39:34 zaratustra [<c01a76d9>] proc_sys_write+0x19/0x20
> Sep 11 00:39:34 zaratustra [<c016cc68>] vfs_write+0xa8/0x140
> Sep 11 00:39:34 zaratustra [<c016cdd1>] sys_write+0x41/0x80
> Sep 11 00:39:34 zaratustra [<c0103051>] sysenter_past_esp+0x6a/0x91
> Sep 11 00:39:34 zaratustra =======================
> Sep 11 00:39:34 zaratustra Code: c8 41 0f ad f3 d3 ee f6 c1 20 0f 45 de
> 31 f6 0f ad ef d3 ed f6 c1 20 0f 45 fd 0f 45 ee 31 c9 39 eb 89 fe 89 ea
> 77 08 89 e8 31 d2 <f7> f3 89 c1 89 f0 8b 7c 24 08 f7 f3 8b 74 24 04 89
> ca 8b 1c 24
> Sep 11 00:39:34 zaratustra EIP: [<c0213fc7>] div64_u64+0x57/0x80 SS:ESP
> 0068:ea7a6edc
> Sep 11 00:39:34 zaratustra ---[ end trace 4eaa2a86a8e2da22 ]---

fix the boundary condition.

sysctl_sched_rt_period=0 makes exception at to_ratio().

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-11 09:39:18 +02:00
Ingo Molnar
291c54ff76 Merge branch 'sched/cpuset' into sched/urgent 2008-09-06 21:03:16 +02:00
Max Krasnyansky
dfb512ec48 sched: arch_reinit_sched_domains() must destroy domains to force rebuild
What I realized recently is that calling rebuild_sched_domains() in
arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
and will not actually rebuild anything if new domain masks match the old ones.

What this means is that doing
     echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
on a system with cpusets enabled will not take affect untill something changes
in the cpuset setup (ie new sets created or deleted).

This patch fixes restore correct behaviour where domains must be rebuilt in
order to enable MC powersaving flags.

Test on quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
Also tested on dual-core Core2 laptop. Lockdep is happy and things are working
as expected.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-06 19:22:15 +02:00
Ingo Molnar
7f79d852ed Merge branch 'linus' into sched/devel 2008-09-06 16:51:57 +02:00
Krzysztof Helt
c8bfff6dd4 sched: compilation fix with gcc 3.4.6
I found that 2.6.27-rc5-mm1 does not compile with gcc 3.4.6.
The error is:
  CC      kernel/sched.o
kernel/sched.c: In function `start_rt_bandwidth':
kernel/sched.c:208: sorry, unimplemented: inlining failed in call to 'rt_bandwidth_enabled': function body not available
kernel/sched.c:214: sorry, unimplemented: called from here
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2

It seems that the gcc 3.4.6 requires full inline definition before first usage.
The patch below fixes the compilation problem.

Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> (if needed>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-06 15:17:09 +02:00
Arjan van de Ven
cc584b213f hrtimer: convert kernel/* to the new hrtimer apis
In order to be able to do range hrtimers we need to use accessor functions
to the "expire" member of the hrtimer struct.
This patch converts kernel/* to these accessors.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2008-09-05 21:35:13 -07:00
Balbir Singh
49048622ea sched: fix process time monotonicity
Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bd. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem

TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.

Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-05 18:14:35 +02:00
Ingo Molnar
7940ca3605 sched: extract walk_tg_tree(), fix
fix:

 kernel/sched.c: In function '__rt_schedulable':
 kernel/sched.c:8771: error: implicit declaration of function 'walk_tg_tree'
 kernel/sched.c:8771: error: 'tg_nop' undeclared (first use in this function)
 kernel/sched.c:8771: error: (Each undeclared identifier is reported only once
 kernel/sched.c:8771: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-28 12:08:00 +02:00
Ingo Molnar
aef745fca0 sched: clean up __might_sleep()
add KERN_ to the printout and clean up the flow a bit.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-28 11:36:03 +02:00
Joe Korty
29cbef4869 make might_sleep() display the oopsing process
Expand might_sleep's printk to indicate the oopsing process.

Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-28 11:36:02 +02:00
Kevin Diggs
65eb3dc609 sched: add kernel doc for the completion, fix kernel-doc-nano-HOWTO.txt
This patch adds kernel doc for the completion feature.

An error in the split-man.pl PERL snippet in kernel-doc-nano-HOWTO.txt is
also fixed.

Signed-off-by: Kevin Diggs <kevdig@hypersurf.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-26 10:26:54 +02:00
Ingo Molnar
3cf430b063 Merge branch 'linus' into sched/devel 2008-08-26 10:25:59 +02:00
Oleg Nesterov
93dcf55f82 wait_task_inactive: "improve" the returned value for ->nvcsw == 0
wait_task_inactive() returns 1 when p->nvcsw == 0 || p->nvcsw == 1.  This
means that two subsequent calls can return the same number while the task
was scheduled in between.

Change the code to return "nvcsw | LONG_MIN" instead of "nvcsw ?: 1", now
the overlap always needs LONG_MAX schedules.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-22 15:17:31 +02:00
Oleg Nesterov
f31e11d87a wait_task_inactive(): don't consider task->nivcsw
If wait_task_inactive() returns success the task was deactivated.  In that
case schedule() always increments ->nvcsw which alone can be used as a
"generation counter".

If the next call returns the same number, we can be sure that the task was
unscheduled.  Otherwise, because we know that .on_rq == 0 again, ->nvcsw
should have been changed in between.

Q: perhaps it is better to do "ncsw = (p->nvcsw << 1) | 1" ?  This
decreases the possibility of "was it unscheduled" false positive when
->nvcsw == 0.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-22 15:17:29 +02:00
Oleg Nesterov
94d3d8247d sched: do_wait_for_common: use signal_pending_state()
Change do_wait_for_common() to use signal_pending_state() instead of open
coding.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-22 15:17:28 +02:00
Peter Zijlstra
9a7e0b180d sched: rt-bandwidth fixes
The last patch allows sysctl_sched_rt_runtime to disable bandwidth accounting
for the group scheduler - however it doesn't deal with sched_setscheduler(),
which will keep tasks out of groups that have no assigned runtime.

If we relax this, we get into the situation where RT tasks can get into a group
when we disable bandwidth control, and then starve them by enabling it again.

Rework the schedulability code to check for this condition and fail to turn
on bandwidth control with -EBUSY when this situation is found.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-19 13:10:12 +02:00
Peter Zijlstra
eb755805f2 sched: extract walk_tg_tree()
Extract walk_tg_tree() and make it a little more generic so we can use it
in the schedulablity test.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-19 13:10:11 +02:00
Peter Zijlstra
0b148fa048 sched: rt-bandwidth group disable fixes
More extensive disable of bandwidth control. It allows sysctl_sched_rt_runtime
to disable full group bandwidth control.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-19 13:10:10 +02:00
Linus Torvalds
c100548d46 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: scale sysctl_sched_shares_ratelimit with nr_cpus
  sched: fix rt-bandwidth hotplug race
  sched: fix the race between walk_tg_tree and sched_create_group
2008-08-16 17:15:32 -07:00
Peter Zijlstra
55cd53404c sched: scale sysctl_sched_shares_ratelimit with nr_cpus
David reported that his Niagra spend a little too much time in
tg_shares_up(), which considering he has a large cpu count makes sense.

So scale the ratelimit value with the number of cpus like we do for
other controls as well.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-15 18:25:07 +02:00
Dave Chinner
be4de35263 completions: uninline try_wait_for_completion and completion_done
m68k fails to build with these functions inlined in completion.h.  Move
them out of line into sched.c and export them to avoid this problem.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15 08:35:44 -07:00
Zhang, Yanmin
09f2724a78 sched: fix the race between walk_tg_tree and sched_create_group
With 2.6.27-rc3, I hit a kernel panic when running volanoMark on my
new x86_64 machine. I also hit it with other 2.6.27-rc kernels.
See below log.

Basically, function walk_tg_tree and sched_create_group have a race
between accessing and initiating tg->children. Below patch fixes it
by moving tg->children initiation to the front of linking tg->siblings
to parent->children.

{----------------panic log------------}

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f
PGD 1be1c4067 PUD 1bdd8d067 PMD 0
Oops: 0000 [1] SMP
CPU 11
Modules linked in: igb
Pid: 22979, comm: java Not tainted 2.6.27-rc3 #1
RIP: 0010:[<ffffffff802292ab>]  [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f
RSP: 0018:ffff8801bfbbbd18  EFLAGS: 00010083
RAX: 0000000000000000 RBX: ffff8800be0dce40 RCX: ffffffffffffffc0
RDX: ffff880102c43740 RSI: 0000000000000000 RDI: ffff8800be0dce40
RBP: ffff8801bfbbbd48 R08: ffff8800ba437bc8 R09: 0000000000001f40
R10: ffff8801be812100 R11: ffffffff805fdf44 R12: ffff880102c43740
R13: 0000000000000000 R14: ffffffff8022cf0f R15: ffffffff8022749f
FS:  00000000568ac950(0063) GS:ffff8801bfa26d00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000001bd848000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 22979, threadinfo ffff8801b145a000, task ffff8801bf18e450)
Stack:  0000000000000001 ffff8800ba5c8d60 0000000000000001 0000000000000001
 ffff8800bad1ccb8 0000000000000000 ffff8801bfbbbd98 ffffffff8022ed37
 0000000000000001 0000000000000286 ffff8801bd5ee180 ffff8800ba437bc8
Call Trace:
 <IRQ>  [<ffffffff8022ed37>] try_to_wake_up+0x71/0x24c
 [<ffffffff80247177>] autoremove_wake_function+0x9/0x2e
 [<ffffffff80228039>] ? __wake_up_common+0x46/0x76
 [<ffffffff802296d5>] __wake_up+0x38/0x4f
 [<ffffffff806169cc>] tcp_v4_rcv+0x380/0x62e

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14 10:58:48 +02:00
Linus Torvalds
1ea2950884 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacks
  sched: fix mysql+oltp regression
  sched_clock: delay using sched_clock()
  sched clock: couple local and remote clocks
  sched clock: simplify __update_sched_clock()
  sched: eliminate scd->prev_raw
  sched clock: clean up sched_clock_cpu()
  sched clock: revert various sched_clock() changes
  sched: move sched_clock before first use
  sched: test runtime rather than period in global_rt_runtime()
  sched: fix SCHED_HRTICK dependency
  sched: fix warning in hrtick_start_fair()
2008-08-11 16:46:31 -07:00
Ingo Molnar
23a0ee908c Merge branch 'core/locking' into core/urgent 2008-08-12 00:11:49 +02:00
Ingo Molnar
251a169c69 Merge branch 'linus' into sched/urgent 2008-08-11 13:40:56 +02:00
Peter Zijlstra
1b12bbc747 lockdep: re-annotate scheduler runqueues
Instead of using a per-rq lock class, use the regular nesting operations.

However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).

So what can happen is:

 spin_lock(rq->lock);	// this rq subclass 0

 double_lock_balance(rq, other_rq);
   // release rq
   // acquire other_rq->lock subclass 0
   // acquire rq->lock subclass 1

 spin_unlock(other_rq->lock);

leaving you with rq->lock in subclass 1

So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.

Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11 09:30:22 +02:00
Jeremy Fitzhardinge
725aad24c3 __sched_setscheduler: don't do any policy checks when not "user"
The "user" parameter to __sched_setscheduler indicates whether the
change is being done on behalf of a user process or not.  If not, we
shouldn't apply any permissions checks, so don't call
security_task_setscheduler().

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-04 17:16:20 -07:00
Peter Zijlstra
5e710e37bd lockdep: change scheduler annotation
While thinking about David's graph walk lockdep patch it _finally_
dawned on me that there is no reason we have a lock class per cpu ...

Sorry for being dense :-/

The below changes the annotation from a lock class per cpu, to a single
nested lock, as the scheduler never holds more that 2 rq locks at a time
anyway.

If there was code requiring holding all rq locks this would not work and
the original annotation would be the only option, but that not being the
case, this is a much lighter one.

Compiles and boots on a 2-way x86_64.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-01 10:46:48 +02:00
Andi Kleen
f718cd4add sched: make scheduler sysfs attributes sysdev class devices
They are really class devices, but were incorrectly declared.  This
leads to crashes with the recent changes that makes non normal sysdevs
use a different prototype.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-30 09:41:47 -07:00
roel kluin
e26873bb10 sched: test runtime rather than period in global_rt_runtime()
Test runtime rather than period

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-28 15:57:24 +02:00
Roland McGrath
85ba2d862e tracehook: wait_task_inactive
This extends wait_task_inactive() with a new argument so it can be used in
a "soft" mode where it will check for the task changing state unexpectedly
and back off.  There is no change to existing callers.  This lays the
groundwork to allow robust, noninvasive tracing that can try to sample a
blocked thread but back off safely if it wakes up.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:09 -07:00
Eduard - Gabriel Munteanu
7babe8db99 Full conversion to early_initcall() interface, remove old interface
A previous patch added the early_initcall(), to allow a cleaner hooking of
pre-SMP initcalls.  Now we remove the older interface, converting all
existing users to the new one.

[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: build fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:04 -07:00
Jonathan Lim
49b5cf3472 accounting: account for user time when updating memory integrals
Adapt acct_update_integrals() to include user time when calculating the time
difference.  The units of acct_rss_mem1 and acct_vm_mem1 are also changed from
pages-jiffies to pages-usecs to avoid calling jiffies_to_usecs() in
xacct_add_tsk() which might overflow.

Signed-off-by: Jonathan Lim <jlim@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:46 -07:00
Linus Torvalds
7f9dce3837 Merge branch 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: hrtick_enabled() should use cpu_active()
  sched, x86: clean up hrtick implementation
  sched: fix build error, provide partition_sched_domains() unconditionally
  sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
  cpu hotplug: Make cpu_active_map synchronization dependency clear
  cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
  sched: rework of "prioritize non-migratable tasks over migratable ones"
  sched: reduce stack size in isolated_cpu_setup()
  Revert parts of "ftrace: do not trace scheduler functions"

Fixed up conflicts in include/asm-x86/thread_info.h (due to the
TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
introduction).
2008-07-23 19:36:53 -07:00
Linus Torvalds
26dcce0fab Merge branch 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
  cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
  NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
  NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
  cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
  cpumask: Provide a generic set of CPUMASK_ALLOC macros
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
  cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
  cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
  cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
  Revert "cpumask: introduce new APIs"
  cpumask: make for_each_cpu_mask a bit smaller
  net: Pass reference to cpumask variable in net/sunrpc/svc.c
  ...

Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
2008-07-23 18:37:44 -07:00
Andi Kleen
4a0b2b4dbe sysdev: Pass the attribute to the low level sysdev show/store function
This allow to dynamically generate attributes and share show/store
functions between attributes. Right now most attributes are generated
by special macros and lots of duplicated code. With the attribute
passed it's instead possible to attach some data to the attribute
and then use that in shared low level functions to do different things.

I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.

I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.

Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:55:02 -07:00
Ingo Molnar
ba42059fbd sched: hrtick_enabled() should use cpu_active()
Peter pointed out that hrtick_enabled() should use cpu_active().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-20 11:02:06 +02:00
Ingo Molnar
d986434a7d Merge branch 'sched/urgent' into sched/devel 2008-07-20 11:01:29 +02:00
Peter Zijlstra
31656519e1 sched, x86: clean up hrtick implementation
random uvesafb failures were reported against Gentoo:

  http://bugs.gentoo.org/show_bug.cgi?id=222799

and Mihai Moldovan bisected it back to:

> 8f4d37ec07 is first bad commit
> commit 8f4d37ec07
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Fri Jan 25 21:08:29 2008 +0100
>
>    sched: high-res preemption tick

Linus suspected it to be hrtick + vm86 interaction and observed:

> Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
> _incorrect_ per se, but they are definitely bad.
>
> Why?
>
> Using random _TIF_WORK_MASK flags is really impolite for doing
> "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
> special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
> vm86 mode unnecessarily.
>
> See the "work_notifysig_v86" label, and how it does that
> "save_v86_state()" thing etc etc.

Right, I never liked having to fiddle with those TIF flags. Initially I
needed it because the hrtimer base lock could not nest in the rq lock.
That however is fixed these days.

Currently the only reason left to fiddle with the TIF flags is remote
wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
about using the new and improved IPI function call stuff to implement
hrtimer_start_on().

However that does require that smp_call_function_single(.wait=0) works
from interrupt context - /me looks at the latest series from Jens - Yes
that does seem to be supported, good.

Here's a stab at cleaning this stuff up ...

Mihai reported test success as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mihai Moldovan <ionic@ionic.de>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-20 10:37:28 +02:00
Max Krasnyansky
e761b77252 cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.

It allows us to simplify domain management code and avoid unecessary
domain rebuilds during cpu hotplug event handling.

Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. Although I did simplfy and unify domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using exact same code paths as in
cpusets case and hence the test below cover cpusets too.
Cpuset changes to make rebuild_sched_domains() callable from various
contexts are in the separate patch (right next after this one).

This not only boots but also easily handles
	while true; do make clean; make -j 8; done
and
	while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing).

Suprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
this on right now in gnome-terminal and things are moving just fine.

Also this is running with most of the debug features enabled (lockdep,
mutex, etc) no BUG_ONs or lockdep complaints so far.

I believe I addressed all of the Dmitry's comments for original Linus'
version. I changed both fair and rt balancer to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).

Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: dmitry.adamushko@gmail.com
Cc: pj@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 13:22:25 +02:00
Mike Travis
13b40c1e40 sched: reduce stack size in isolated_cpu_setup()
* Remove 16k stack requirements in isolated_cpu_setup when NR_CPUS=4096.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 11:55:42 +02:00
Ingo Molnar
82638844d9 Merge branch 'linus' into cpus4096
Conflicts:

	arch/x86/xen/smp.c
	kernel/sched_rt.c
	net/iucv/iucv.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 00:29:07 +02:00
Linus Torvalds
666484f025 Merge branch 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softirq: remove irqs_disabled warning from local_bh_enable
  softirq: remove initialization of static per-cpu variable
  Remove argument from open_softirq which is always NULL
2008-07-14 15:28:42 -07:00
Linus Torvalds
948769a5ba Merge branch 'sched/new-API-sched_setscheduler' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched/new-API-sched_setscheduler' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: add new API sched_setscheduler_nocheck: add a flag to control access checks
2008-07-14 14:50:49 -07:00
Ingo Molnar
5806b81ac1 Merge branch 'auto-ftrace-next' into tracing/for-linus
Conflicts:

	arch/x86/kernel/entry_32.S
	arch/x86/kernel/process_32.c
	arch/x86/kernel/process_64.c
	arch/x86/lib/Makefile
	include/asm-x86/irqflags.h
	kernel/Makefile
	kernel/sched.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-14 16:11:52 +02:00
Ingo Molnar
d14c8a680c Merge branch 'sched/for-linus' into tracing/for-linus 2008-07-14 16:11:02 +02:00
Ingo Molnar
54ef76f37b Merge branch 'linus' into sched/devel 2008-07-13 08:50:13 +02:00
Ingo Molnar
ae94b8075a Merge branch 'linus' into x86/core
Conflicts:

	arch/x86/mm/ioremap.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-12 07:29:02 +02:00
Linus Torvalds
b1e387348a sched: fix cpu hotplug, cleanup
Clean up __migrate_task(): to just have separate "done" and "fail"
cases, instead of that "out" case with random error behavior.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-10 20:39:58 +02:00
Ingo Molnar
bac0c9103b Merge branch 'tracing/ftrace' into auto-ftrace-next 2008-07-10 11:43:00 +02:00
Dmitry Adamushko
dc7fab8b3b sched: fix cpu hotplug
I think we may have a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu() when the later one
may end up looping endlessly.

Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is
called so we may get a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push
a task out of a dead CPU causing the later one to loop endlessly.

Heiko Carstens observed:

| That's exactly what explains a dump I got yesterday. Thanks for fixing! :)

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: miaox@cn.fujitsu.com
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-10 09:35:34 +02:00
Mike Travis
076ac2af86 sched, numa: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c
* Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c,
    where appropriate.  This saves some allocated space as well as many
    wasted cycles going through node entries that are non-existent.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 11:31:30 +02:00
Ingo Molnar
032f82786f Merge commit 'v2.6.26-rc9' into sched/devel 2008-07-07 08:01:26 +02:00
Ingo Molnar
68083e05d7 Merge commit 'v2.6.26-rc9' into cpus4096 2008-07-06 14:23:39 +02:00
Ankita Garg
46ac22bab4 sched: fix accounting in task delay accounting & migration
On Thu, Jun 19, 2008 at 12:27:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 10:50 +0530, Ankita Garg wrote:
>
> > Thanks Peter for the explanation...
> >
> > I agree with the above and that is the reason why I did not see weird
> > values with cpu_time. But, run_delay still would suffer skews as the end
> > points for delta could be taken on different cpus due to migration (more
> > so on RT kernel due to the push-pull operations). With the below patch,
> > I could not reproduce the issue I had seen earlier. After every dequeue,
> > we take the delta and start wait measurements from zero when moved to a
> > different rq.
>
> OK, so task delay delay accounting is broken because it doesn't take
> migration into account.
>
> What you've done is make it symmetric wrt enqueue, and account it like
>
>   cpu0      cpu1
>
> enqueue
>  <wait-d1>
> dequeue
>             enqueue
>              <wait-d2>
>             run
>
> Where you add both d1 and d2 to the run_delay,.. right?
>

Thanks for reviewing the patch. The above is exactly what I have done.

> This seems like a good fix, however it looks like the patch will break
> compilation in !CONFIG_SCHEDSTATS && !CONFIG_TASK_DELAY_ACCT, of it
> failing to provide a stub for sched_info_dequeue() in that case.

Fixed. Pl. find the new patch below.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: rostedt@goodmis.org
Cc: suresh.b.siddha@intel.com
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dhaval@linux.vnet.ibm.com
Cc: vatsa@linux.vnet.ibm.com
Cc: David Bahi <DBahi@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-04 12:50:23 +02:00
Gregory Haskins
2087a1ad82 sched: add avg-overlap support to RT tasks
We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
for cfs to cfs interactions.  There is no reason why an rt to cfs
interaction cannot share in establishing a relationhip in a similar
manner.

Because PREEMPT_RT runs many kernel threads as FIFO priority, we often
times have heavy interaction between RT threads waking CFS applications.
This patch offers a substantial boost (50-60%+) in perfomance under those
circumstances.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: npiggin@suse.de
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-04 12:50:22 +02:00
Gregory Haskins
c4acb2c066 sched: terminate newidle balancing once at least one task has moved over
Inspired by Peter Zijlstra.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: npiggin@suse.de
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-04 12:50:21 +02:00
Raistlin
619b048803 sched: fix divide error when trying to configure rt_period to zero
Here it is another little Oops we found while configuring invalid values
via cgroups:

echo 0 > /dev/cgroups/0/cpu.rt_period_us
or
echo 4294967296 > /dev/cgroups/0/cpu.rt_period_us

[  205.509825] divide error: 0000 [#1]
[  205.510151] Modules linked in:
[  205.510151]
[  205.510151] Pid: 2339, comm: bash Not tainted (2.6.26-rc8 #33)
[  205.510151] EIP: 0060:[<c030c6ef>] EFLAGS: 00000293 CPU: 0
[  205.510151] EIP is at div64_u64+0x5f/0x70
[  205.510151] EAX: 0000389f EBX: 00000000 ECX: 00000000 EDX: 00000000
[  205.510151] ESI: d9800000 EDI: 00000000 EBP: c6cede60 ESP: c6cede50
[  205.510151]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  205.510151] Process bash (pid: 2339, ti=c6cec000 task=c79be370 task.ti=c6cec000)
[  205.510151] Stack: d9800000 0000389f c05971a0 d9800000 c6cedeb4 c0214dbd 00000000 00000000
[  205.510151]        c6cede88 c0242bd8 c05377c0 c7a41b40 00000000 00000000 00000000 c05971a0
[  205.510151]        c780ed20 c7508494 c7a41b40 00000000 00000002 c6cedebc c05971a0 ffffffea
[  205.510151] Call Trace:
[  205.510151]  [<c0214dbd>] ? __rt_schedulable+0x1cd/0x240
[  205.510151]  [<c0242bd8>] ? cgroup_file_open+0x18/0xe0
[  205.510151]  [<c0214fe4>] ? tg_set_bandwidth+0xa4/0xf0
[  205.510151]  [<c0215066>] ? sched_group_set_rt_period+0x36/0x50
[  205.510151]  [<c021508e>] ? cpu_rt_period_write_uint+0xe/0x10
[  205.510151]  [<c0242dc5>] ? cgroup_file_write+0x125/0x160
[  205.510151]  [<c0232c15>] ? hrtimer_interrupt+0x155/0x190
[  205.510151]  [<c02f047f>] ? security_file_permission+0xf/0x20
[  205.510151]  [<c0277ad8>] ? rw_verify_area+0x48/0xc0
[  205.510151]  [<c0283744>] ? dupfd+0x104/0x130
[  205.510151]  [<c027838c>] ? vfs_write+0x9c/0x160
[  205.510151]  [<c0242ca0>] ? cgroup_file_write+0x0/0x160
[  205.510151]  [<c027850d>] ? sys_write+0x3d/0x70
[  205.510151]  [<c0203019>] ? sysenter_past_esp+0x6a/0x91
[  205.510151]  =======================
[  205.510151] Code: 0f 45 de 31 f6 0f ad d0 d3 ea f6 c1 20 0f 45 c2 0f 45 d6 89 45 f0 89 55 f4 8b 55 f4 31 c9 8b 45 f0 39 d3 89 c6 77 08 89 d0 31 d2 <f7> f3 89 c1 83 c4 08 89 f0 f7 f3 89 ca 5b 5e 5d c3 55 89 e5 56
[  205.510151] EIP: [<c030c6ef>] div64_u64+0x5f/0x70 SS:ESP 0068:c6cede50

The attached patch solves the issue for me.

I'm checking as soon as possible for the period not being zero since, if
it is, going ahead is useless. This way we also save a mutex_lock() and
a read_lock() wrt doing it inside tg_set_bandwidth() or
__rt_schedulable().

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-01 08:23:24 +02:00
Vegard Nossum
30432094a7 sched: fix warning
This patch fixes the following warning:

kernel/sched.c:1667: warning: 'cfs_rq_set_shares' defined but not used

This seems the correct way to fix this; cfs_rq_set_shares() is only used
in a single place, which is also inside #ifdef CONFIG_FAIR_GROUP_SCHED.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-30 08:37:32 +02:00
Ingo Molnar
34e83e850f sched: build fix
fix:

kernel/sched.c: In function ‘sched_group_set_shares':
kernel/sched.c:8635: error: implicit declaration of function ‘cfs_rq_set_shares'

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-30 08:37:13 +02:00
Dmitry Adamushko
79c537998d sched: fix cpu hotplug
the CPU hotplug problems (crashes under high-volume unplug+replug
tests) seem to be related to migrate_dead_tasks().

Firstly I added traces to see all tasks being migrated with
migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
pops up (the one with "se == NULL" in the loop of
pick_next_task_fair()) shortly after the traces indicate that some has
been migrated with migrate_dead_tasks()). btw., I can reproduce it
much faster now with just a plain cpu down/up loop.

[disclaimer] Well, unless I'm really missing something important in
this late hour [/desclaimer] pick_next_task() is not something
appropriate for migrate_dead_tasks() :-)

the following change seems to eliminate the problem on my setup
(although, I kept it running only for a few minutes to get a few
messages indicating migrate_dead_tasks() does move tasks and the
system is still ok)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-29 08:50:21 +02:00
Peter Zijlstra
f1d239f732 sched: incremental effective_load()
Increase the accuracy of the effective_load values.

Not only consider the current increment (as per the attempted wakeup), but
also consider the delta between when we last adjusted the shares and the
current situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:47 +02:00
Peter Zijlstra
83378269a5 sched: correct wakeup weight calculations
rw_i = {2, 4, 1, 0}
s_i = {2/7, 4/7, 1/7, 0}

wakeup on cpu0, weight=1

rw'_i = {3, 4, 1, 0}
s'_i = {3/8, 4/8, 1/8, 0}

s_0 = S * rw_0 / \Sum rw_j ->
  \Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)

s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) =
       1 * (2+1) / (7+1) = 3/8 (correct

so we find that adding 1 to cpu0 gains 5/56 in weight
if say the other cpu were, cpu1, we'd also have to calculate its 4/56 loss

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:46 +02:00
Peter Zijlstra
2398f2c6d3 sched: update shares on wakeup
We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:45 +02:00
Peter Zijlstra
cd80917e4f sched: fix shares boost logic
In case the domain is empty, pretend there is a single task on each cpu, so
that together with the boost logic we end up giving 1/n shares to each
cpu.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:44 +02:00
Peter Zijlstra
93b75217df sched: disable source/target_load bias
The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:44 +02:00
Peter Zijlstra
051c67640e sched: remove prio preference from balance decisions
Priority looses much of its meaning in a hierarchical context. So don't
use it in balance decisions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:42 +02:00
Peter Zijlstra
408ed066b1 sched: hierarchical load vs find_busiest_group
find_busiest_group() has some assumptions about task weight being in the
NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this
by replacing it with the average task weight, which will adapt the situation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:40 +02:00
Peter Zijlstra
a8a51d5e59 sched: persistent average load per task
Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of
cpu_avg_load_per_task() - this is useful because of the hierarchical group
model in which task weight can be much smaller.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:39 +02:00
Peter Zijlstra
039a1c41b3 sched: fix sched_balance_self() smp group balancing
Finding the least idle cpu is more accurate when done with updated shares.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:38 +02:00
Peter Zijlstra
3e5459b4be sched: fix newidle smp group balancing
Re-compute the shares on newidle - so we can make a decision based on
recent data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:37 +02:00
Peter Zijlstra
c8cba857b4 sched: simplify the group load balancer
While thinking about the previous patch - I realized that using per domain
aggregate load values in load_balance_fair() is wrong. We should use the
load value for that CPU.

By not needing per domain hierarchical load values we don't need to store
per domain aggregate shares, which greatly simplifies all the math.

It basically falls apart in two separate computations:
 - per domain update of the shares
 - per CPU update of the hierarchical load

Also get rid of the move_group_shares() stuff - just re-compute the shares
again after a successful load balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:36 +02:00
Peter Zijlstra
a25b5aca87 sched: no need to aggregate task_weight
We only need to know the task_weight of the busiest rq - nothing to do
if there are no tasks there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:35 +02:00
Peter Zijlstra
d3f40dbab9 sched: dont micro manage share losses
We used to try and contain the loss of 'shares' by playing arithmetic
games. Replace that by noticing that at the top sched_domain we'll
always have the full weight in shares to distribute.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:34 +02:00
Peter Zijlstra
4d8d595dfa sched: update aggregate when holding the RQs
It was observed that in __update_group_shares_cpu()

  rq_weight > aggregate()->rq_weight

This is caused by forks/wakeups in between the initial aggregate pass and
locking of the RQs for load balance. To avoid this situation partially re-do
the aggregation once we have the RQs locked (which avoids new tasks from
appearing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:33 +02:00
Peter Zijlstra
b6a86c746f sched: fix sched_domain aggregation
Keeping the aggregate on the first cpu of the sched domain has two problems:
 - it could collide between different sched domains on different cpus
 - it could slow things down because of the remote accesses

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:32 +02:00
Peter Zijlstra
103638d95b sched: fix wakeup granularity and buddy granularity
Uncouple buddy selection from wakeup granularity.

The initial idea was that buddies could run ahead as far as a normal task
can - do this by measuring a pair 'slice' just as we do for a normal task.

This means we can drop the wakeup_granularity back to 5ms.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:31 +02:00
Peter Zijlstra
76a2a6ee8a sched: sched_clock_cpu() based cpu_clock()
with sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy
difference) use this to provide cpu_clock().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:30 +02:00
Peter Zijlstra
c09595f63b sched: revert revert of: fair-group: SMP-nice for group scheduling
Try again..

Initial commit: 18d95a2832
Revert: 6363ca57c7

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:29 +02:00
Peter Zijlstra
a7be37ac8e sched: revert the revert of: weight calculations
Try again..

initial commit: 8f1bc385cf
revert: f9305d4a09

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:27 +02:00
Ingo Molnar
97e6722b8d Merge branch 'linus' into tracing/ftrace 2008-06-25 12:27:56 +02:00
Ingo Molnar
5ce001b0e5 Merge branch 'linus' into stackprotector 2008-06-25 12:27:29 +02:00
Ingo Molnar
773dc8eaca Merge branch 'linus' into sched/new-API-sched_setscheduler 2008-06-25 12:27:05 +02:00
Ingo Molnar
f57aec5a87 Merge branch 'linus' into sched/devel
Conflicts:

	kernel/sched_rt.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25 12:26:59 +02:00
Ingo Molnar
ace7f1b796 Merge branch 'linus' into core/softirq 2008-06-25 12:23:59 +02:00
Rusty Russell
961ccddd59 sched: add new API sched_setscheduler_nocheck: add a flag to control access checks
Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.

Two cases could have failed, so are changed to sched_setscheduler_nocheck:
  kernel/softirq.c:cpu_callback()
	- CPU hotplug callback
  kernel/stop_machine.c:__stop_machine_run()
	- Called from various places, including modprobe()

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:57:56 +02:00
Ingo Molnar
1de8644cc7 Merge branch 'linus' into sched/devel 2008-06-23 11:30:23 +02:00
Ingo Molnar
f34bfb1bee Merge branch 'linus' into tracing/ftrace 2008-06-23 11:11:42 +02:00
Ingo Molnar
198bb971e2 Merge branch 'linus' into sched/urgent 2008-06-23 11:00:26 +02:00
Ingo Molnar
a60b33cf59 Merge branch 'linus' into core/softirq 2008-06-23 10:52:59 +02:00
Linus Torvalds
1f1e2ce8a5 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
  rcupreempt: remove export of rcu_batches_completed_bh
  cpuset: limit the input of cpuset.sched_relax_domain_level
2008-06-20 12:37:13 -07:00
Oleg Nesterov
ea71a54670 sched: refactor wait_for_completion_timeout()
Simplify the code and fix the boundary condition of
wait_for_completion_timeout(,0).

We can kill the first __remove_wait_queue() as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-06-20 17:15:49 +02:00
Roland Dreier
bb10ed0994 sched: fix wait_for_completion_timeout() spurious failure under heavy load
It seems that the current implementaton of wait_for_completion_timeout()
has a small problem under very high load for the common pattern:

	if (!wait_for_completion_timeout(&done, timeout))
		/* handle failure */

because the implementation very roughly does (lots of code deleted to
show the basic flow):

	static inline long __sched
	do_wait_for_common(struct completion *x, long timeout, int state)
	{
		if (x->done)
			return timeout;

		do {
			timeout = schedule_timeout(timeout);

			if (!timeout)
				return timeout;

		} while (!x->done);

		return timeout;
	}

so if the system is very busy and x->done is not set when
do_wait_for_common() is entered, it is possible that the first call to
schedule_timeout() returns 0 because the task doing wait_for_completion
doesn't get rescheduled for a long time, even if it is woken up early
enough.

In this case, wait_for_completion_timeout() returns 0 without even
checking x->done again, and the code above falls into its failure case
purely for scheduler reasons, even if the hardware event or whatever was
being waited for happened early enough.

It would make sense to add an extra test to do_wait_for() in the timeout
case and return 1 if x->done is actually set.

A quick audit (not exhaustive) of wait_for_completion_timeout() callers
seems to indicate that no one actually cares about the return value in
the success case -- they just test for 0 (timed out) versus non-zero
(wait succeeded).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20 13:19:32 +02:00
Peter Zijlstra
10b612f440 sched: rt: fix the bandwidth contraint computations
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Daniel K." <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20 10:26:01 +02:00
Li Zefan
30e0e17819 cpuset: limit the input of cpuset.sched_relax_domain_level
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
for inputs outside this range.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19 09:45:36 +02:00
Max Krasnyansky
f18f982abf sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19 09:14:51 +02:00
Ingo Molnar
1cdad71537 Merge branch 'sched' into sched-devel
Conflicts:

	kernel/sched_rt.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:09:15 +02:00
Peter Zijlstra
7ea56616ba sched: rt-group: fix hierarchy
Don't re-set the entity's runqueue to the wrong rq after we've set it
to the right one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:56 +02:00
Dario Faggioli
49307fd6f7 sched: NULL pointer dereference while setting sched_rt_period_us
When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:

 echo 10000 > /proc/sys/kernel/sched_rt_period_us

We get this:

 BUG: unable to handle kernel NULL pointer dereference at 0000008c
 [  947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160
 [  947.683123] *pde = 00000000=20
 [  947.683782] Oops: 0000 [#1]
 [  947.684307] Modules linked in:
 [  947.684308]
 [  947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
 [  947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0
 [  947.684308] EIP is at __rt_schedulable+0x12/0x160
 [  947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
 [  947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
 [  947.684308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
 [  947.684308] Process bash (pid: 2359, tiÆcc8000 taskÇa54f00=20 task.tiÆcc8000)
 [  947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
 [  947.684308]        c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
 [  947.684308]        c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
 [  947.684308] Call Trace:
 [  947.684308]  [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50
 [  947.684308]  [<c0216d40>] ? sched_rt_handler+0x80/0x110
 [  947.684308]  [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0
 [  947.684308]  [<c02af2fa>] ? proc_sys_write+0x1a/0x20
 [  947.684308]  [<c0273c36>] ? vfs_write+0x96/0x160
 [  947.684308]  [<c02af2e0>] ? proc_sys_write+0x0/0x20
 [  947.684308]  [<c027423d>] ? sys_write+0x3d/0x70
 [  947.684308]  [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91
 [  947.684308]  =======================
 [  947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75
 f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4
 89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8
 8b
 [  947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP  0068:c6cc9ed0

We think the following patch solves the issue.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:54 +02:00
Dmitry Adamushko
20b6331bfe sched: rework of "prioritize non-migratable tasks over migratable ones"
regarding this commit: 45c01e8249

I think we can do it simpler. Please take a look at the patch below.

Instead of having 2 separate arrays (which is + ~800 bytes on x86_32 and
twice so on x86_64), let's add "exclusive" (the ones that are bound to
this CPU) tasks to the head of the queue and "shared" ones -- to the
end.

In case of a few newly woken up "exclusive" tasks, they are 'stacked'
(not queued as now), meaning that a task {i+1} is being placed in front
of the previously woken up task {i}. But I don't think that this
behavior may cause any realistic problems.

There are a couple of changes on top of this one.

(1) in check_preempt_curr_rt()

I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt)
!= &rq->curr->rt" check.

enqueue_task_rt(p) and check_preempt_curr_rt() are always called one
after another with rq->lock being held so the following check
"p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should
be enough (well, just its left part) to guarantee that 'p' has been
queued in front of the 'curr'.

(2) in set_cpus_allowed_rt()

I don't thinks there is a need for requeue_task_rt() here.

Perhaps, the only case when 'requeue' (+ reschedule) might be useful is
as follows:

i) weight == 1 && cpu_isset(task_cpu(p), *new_mask)

i.e. a task is being bound to this CPU);

ii) 'p' != rq->curr

but here, 'p' has already been on this CPU for a while and was not
migrated. i.e. it's possible that 'rq->curr' would not have high chances
to be migrated right at this particular moment (although, has chance in
a bit longer term), should we allow it to be preempted.

Anyway, I think we should not perhaps make it more complex trying to
address some rare corner cases. For instance, that's why a single queue
approach would be preferable. Unless I'm missing something obvious, this
approach gives us similar functionality at lower cost.

Verified only compilation-wise.

(Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18 12:41:18 +02:00
Rabin Vincent
95e904c7da sched: fix defined-but-unused warning
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used

Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-17 10:36:58 +02:00
Ingo Molnar
9583f3d9c0 Merge branch 'linus' into core/softirq 2008-06-16 11:24:17 +02:00
Ingo Molnar
e765ee90da Merge branch 'linus' into tracing/ftrace 2008-06-16 11:15:58 +02:00
Ingo Molnar
f9e8e07e07 Merge branch 'linus' into sched-devel 2008-06-16 11:15:21 +02:00
Lai Jiangshan
7a232e0350 sched: 64-bit: fix arithmetics overflow
(overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight)

A weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so it will overflow when there are
too many entities.

Although, overflow occurs very rarely, but it break fairness when
it occurs. 64-bits systems have more memory than 32-bit systems
and 64-bit systems can create more process usually, so overflow may
occur more frequently.

This patch guarantees fairness when overflow happens on 64-bit systems.
Thanks to the optimization of compiler, it changes nothing on 32-bit.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 14:29:54 +02:00
Lai Jiangshan
2e084786f6 sched: fair group: fix overflow(was: fix divide by zero)
I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64)
(use 2^32, 2^33, ...., 2^63 as shares value)

# mkdir /dev/cpuctl
# mount -t cgroup -o cpu cpuctl /dev/cpuctl
# cd /dev/cpuctl
# mkdir sub
# echo 0x8000000000000000 > sub/cpu.shares
# echo $$ > sub/tasks
oops here! divide by zero.

This is because do_div() expects the 2th parameter to be 32 bits,
but unsigned long is 64 bits in x86_64.

Peter Zijstra pointed it out that the sane thing to do is limit the
shares value to something smaller instead of using an even more
expensive divide.

Also, I found another bug about "the shares value is too large":

pid1 and pid2 are set affinity to cpu#0
pid1 is attached to cg1 and pid2 is attached to cg2

if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
then pid2 got 100% usage of cpu, and pid1 0%

if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
then pid2 got 0% usage of cpu, and pid1 100%

And a weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so the shares value should be limited
to a smaller value.

I think that (1UL << 18) is a good limited value:

1) it's not too large, we can create a lot of group before overflow
2) it's several times the weight value for nice=-19 (not too small)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 14:23:55 +02:00
Ingo Molnar
040ec23d07 sched: sched_clock() lockdep fix
Sitsofe Wheeler bisected the following commit to cause a lockdep to
warn about itself and turn itself off:

> commit c6531cce6e
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Mon May 12 21:21:14 2008 +0200
>
>     sched: do not trace sched_clock

do not use raw irq flags in cpu_clock() as it causes lockdep to lose
track of the true state of the IRQ flag.

Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-06-10 14:52:14 +02:00
Paul Mundt
e9886ca3a9 sched: kill off dead cfs_rq_set_shares()
Building with CONFIG_FAIR_GROUP_SCHED=y on UP results in an unused
cfs_rq_set_shares() reference. As nothing is using this dummy function
in the first place, just kill it off.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 14:51:04 +02:00
David Rientjes
9985b0bab3 sched: prevent bound kthreads from changing cpus_allowed
Kthreads that have called kthread_bind() are bound to specific cpus, so
other tasks should not be able to change their cpus_allowed from under
them.  Otherwise, it is possible to move kthreads, such as the migration
or software watchdog threads, so they are not allowed access to the cpu
they work on.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 12:26:16 +02:00
Peter Zijlstra
7def2be1dc sched: fix hotplug cpus on ia64
Cliff Wickman wrote:

> I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
> and get a very predictable hotplug cpu problem.
> billberry1:/tmp/cpw # ./dis
> disabled cpu 17
> enabled cpu 17
> billberry1:/tmp/cpw # ./dis
> disabled cpu 17
> enabled cpu 17
> billberry1:/tmp/cpw # ./dis
>
> The script that disables the cpu always hangs (unkillable)
> on the 3rd attempt.
>
> And a bit further:
> The kstopmachine thread always sits on the run queue (real time) for about
> 30 minutes before running.

this fix solves some (but not all) issues between CPU hotplug and
RT bandwidth throttling.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 12:17:28 +02:00
Oleg Nesterov
16882c1e96 sched: fix TASK_WAKEKILL vs SIGKILL race
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
this allows us to do

	current->state = TASK_INTERRUPTIBLE;
	schedule();

without fear to sleep with pending signal.

However, the code like

	current->state = TASK_KILLABLE;
	schedule();

is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
no effect).

Introduce the new helper, signal_pending_state(), and change schedule() to
use it. Hopefully it will have more users, that is why the task's state is
passed separately.

Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
This is needed to preserve the current behaviour (ptrace_notify). I hope
this check will be removed soon, but this (afaics good) change needs the
separate discussion.

The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
basically the same that schedule() does now. However, this patch of course
bloats schedule().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:37:25 +02:00
Thomas Gleixner
e958b36004 sched: move weighted_cpuload into #ifdef CONFIG_SMP section
weighted_cpuload is only used on SMP. move it into the CONFIG_SMP
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:02 +02:00
Max Krasnyansky
68f4f1ec08 sched: Move cpu masks from kernel/sched.c into kernel/cpu.c
kernel/cpu.c seems a more logical place for those maps since they do not really
have much to do with the scheduler these days.

kernel/cpu.c is now built for the UP kernel too, but it does not affect the size
the kernel sections.

$ size vmlinux

before
   text       data        bss        dec        hex    filename
3313797     307060     310352    3931209     3bfc49    vmlinux

after
   text       data        bss        dec        hex    filename
3313797     307060     310352    3931209     3bfc49    vmlinux

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:01 +02:00
Max Krasnyansky
5c8e1ed1d2 sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:00 +02:00
Gregory Haskins
1f11eb6a8b sched: fix cpupri hotplug support
The RT folks over at RedHat found an issue w.r.t. hotplug support which
was traced to problems with the cpupri infrastructure in the scheduler:

https://bugzilla.redhat.com/show_bug.cgi?id=449676

This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
kernels mentioned above.

It turned out that the issue was that offline cpus could get inadvertently
registered with cpupri so that they were erroneously selected during
migration decisions.  The end result would be an OOPS as the offline cpu
had tasks routed to it.

This patch generalizes the old join/leave domain interface into an
online/offline interface, and adjusts the root-domain/hotplug code to
utilize it.

I was able to easily reproduce the issue prior to this patch, and am no
longer able to reproduce it after this patch.  I can offline cpus
indefinately and everything seems to be in working order.

Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
me in the right direction.  Also thank you to Peter for reviewing the
early iterations of this patch.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:42 +02:00
Gautham R Shenoy
099f98c8a1 sched: print the sd->level in sched_domain_debug code
While printing out the visual representation of the sched-domains, print
the level (MC, SMT, CPU, NODE, ... ) of each of the sched_domains.

Credit: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06 15:19:41 +02:00
Dhaval Giani
6d6bc0ad86 sched: add comments for ifdefs in sched.c
make sched.c easier to read.

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:38 +02:00
Arjan van de Ven
e21f5b153b sched: print module list in the "scheduling while atomic" warning
For the normal WARN_ON() etc we added a print-the-modules-list already,
which is very useful to figure out candidates for certain types of bugs.

This patch adds the same print to the "scheduling while atomic" BUG warning,
for the same reason: when we get here it's very useful to see which modules
are loaded, to narrow down the candidate code list.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:37 +02:00
Rabin Vincent
81d41d7ece sched: fix defined-but-unused warning
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used

Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:34 +02:00
Thomas Gleixner
f7dcd80bbc namespacecheck: fixes in kernel/sched.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:32 +02:00
Dmitry Adamushko
d07355f5de sched: check for SD_SERIALIZE atomically in rebalance_domains()
Nothing really serious here, mainly just a matter of nit-picking :-/

From: Dmitry Adamushko <dmitry.adamushko@gmail.com>
For CONFIG_SCHED_DEBUG && CONFIG_SYSCT configs, sd->flags can be altered
while being manipulated in rebalance_domains(). Let's do an atomic check.
We rely here on the atomicity of read/write accesses for aligned words.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:30 +02:00
Gregory Haskins
6e0534f278 sched: use a 2-d bitmap for searching lowest-pri CPU
The current code use a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:28 +02:00
Mike Galbraith
f333fdc909 sched: make !hrtick faster
it is safe to ignore timers and flags when the feature is disabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:27 +02:00
Gregory Haskins
45c01e8249 sched: prioritize non-migratable tasks over migratable ones
Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
that could allow suboptimal balancing if a non-migratable task gets
queued behind a running migratable one.  It is discussed in this thread:

http://lkml.org/lkml/2008/4/22/296

This issue has been further exacerbated by a recent checkin to
sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).

>From a pure priority standpoint, the run-queue is doing the "right"
thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
wakes up at equal or lower priority (affined only to cpu1) later, it
*should* wait for T0 to finish.  However, in reality that is likely
suboptimal from a system perspective if there are other cores that
could allow T0 and T1 to run concurrently.  Since T1 can not migrate,
the only choice for higher concurrency is to try to move T0.  This is
not something we addessed in the recent rt-balancing re-work.

This patch tries to enhance the balancing algorithm by accomodating this
scenario.  It accomplishes this by incorporating the migratability of a
task into its priority calculation.  Within a numerical tsk->prio, a
non-migratable task is logically higher than a migratable one.  We
maintain this by introducing a new per-priority queue (xqueue, or
exclusive-queue) for holding non-migratable tasks.  The scheduler will
draw from the xqueue over the standard shared-queue (squeue) when
available.

There are several details for utilizing this properly.

1) During task-wake-up, we not only need to check if the priority
   preempts the current task, but we also need to check for this
   non-migratable condition.  Therefore, if a non-migratable task wakes
   up and sees an equal priority migratable task already running, it
   will attempt to preempt it *if* there is a likelyhood that the
   current task will find an immediate home.

2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
   requeuing will result in the non-migratable task being queued to the
   end of the shared queue.  This is an attempt to prevent the system
   from being completely unfair to migratable tasks during things like
   SCHED_RR timeslicing.

I am sure this patch introduces potentially "odd" behavior if you
concoct a scenario where a bunch of non-migratable threads could starve
migratable ones given the right pattern.  I am not yet convinced that
this is a problem since we are talking about tasks of equal RT priority
anyway, and there never is much in the way of guarantees against
starvation under that scenario anyway. (e.g. you could come up with a
similar scenario with a specific timing environment verses an affinity
environment).  I can be convinced otherwise, but for now I think this is
"ok".

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:25 +02:00
Ingo Molnar
6363ca57c7 revert ("sched: fair-group: SMP-nice for group scheduling")
Yanmin Zhang reported:

Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.

With bisect, I located the following patch:

| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair-group: SMP-nice for group scheduling

Revert it so that we get v2.6.25 behavior.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:28:57 +02:00
Ingo Molnar
4285f594f8 sched: cleanup
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:15 +02:00
Roel Kluin
3f33a7ce95 sched: unite unlikely pairs in rt_policy() and schedule_debug()
Removes obfuscation and may improve assembly.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:14 +02:00
Ingo Molnar
f9305d4a09 revert ("sched: fair: weight calculations")
Yanmin Zhang reported:

Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many
regressions with 2.6.26-rc1:

 1) 8-core stoakley: 28%;
 2) 16-core tigerton: 20%;
 3) Itanium Montvale: 50%.

Bisect located this patch:

| 8f1bc385cf is first bad commit
| commit 8f1bc385cf
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair: weight calculations

Revert it to the 2.6.25 state.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:24:01 +02:00
Eric Sandeen
7c9f8861e6 stackprotector: use canary at end of stack to indicate overruns at oops time
(Updated with a common max-stack-used checker that knows about
the canary, as suggested by Joe Perches)

Use a canary at the end of the stack to clearly indicate
at oops time whether the stack has ever overflowed.

This is a very simple implementation with a couple of
drawbacks:

1) a thread may legitimately use exactly up to the last
   word on the stack

 -- but the chances of doing this and then oopsing later seem slim

2) it's possible that the stack usage isn't dense enough
   that the canary location could get skipped over

 -- but the worst that happens is that we don't flag the overrun
 -- though this happens fairly often in my testing :(

With the code in place, an intentionally-bloated stack oops might
do:

BUG: unable to handle kernel paging request at ffff8103f84cc680
IP: [<ffffffff810253df>] update_curr+0x9a/0xa8
PGD 8063 PUD 0
Thread overran stack or stack corrupted
Oops: 0000 [1] SMP
CPU 0
...

... unless the stack overrun is so bad that it corrupts some other
thread.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-26 16:15:33 +02:00
Ingo Molnar
c6531cce6e sched: do not trace sched_clock
The tracer uses sched_clock, so do not trace it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-25 16:02:23 +02:00
Carlos R. Mafra
962cf36c5b Remove argument from open_softirq which is always NULL
As git-grep shows, open_softirq() is always called with the last argument
being NULL

block/blk-core.c:       open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
kernel/hrtimer.c:       open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
kernel/rcuclassic.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/rcupreempt.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
kernel/softirq.c:       open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
kernel/softirq.c:       open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);

This observation has already been made by Matthew Wilcox in June 2002
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)

"I notice that none of the current softirq routines use the data element
passed to them."

and the situation hasn't changed since them. So it appears we can safely
remove that extra argument to save 128 (54) bytes of kernel data (text).

Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-25 07:43:15 +02:00
Mathieu Desnoyers
5b82a1b08a Port ftrace to markers
Porting ftrace to the marker infrastructure.

Don't need to chain to the wakeup tracer from the sched tracer, because markers
support multiple probes connected.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:29:25 +02:00
Ingo Molnar
4d9493c90f ftrace: remove add-hoc code
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:13:32 +02:00
Peter Zijlstra
5429db2d26 ftrace: fix wakeup callback
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:09:02 +02:00
Ingo Molnar
36fc25a9f4 ftrace: sched tree fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:08:28 +02:00
Ingo Molnar
017730c112 ftrace: fix wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:05:02 +02:00
Ingo Molnar
24cd5d111e ftrace: trace curr/next tasks
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:51 +02:00
Ingo Molnar
4e65551905 ftrace: sched tracer, trace full rbtree
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:44 +02:00
Ingo Molnar
8ac0fca4cc ftrace: sched tracer fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:28 +02:00
Steven Rostedt
6cd8a4bb2f ftrace: trace preempt off critical timings
Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers

Now instead of just tracing irqs off, preemption off can be selected
to be recorded.

When this is selected, it shares the same files as irqs off timings.
One can either trace preemption off, irqs off, or one or the other off.

By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
of preempt off only is performed. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simpling echoing in the appropriate trace name into
/debugfs/tracing/current_tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:32:54 +02:00
Steven Rostedt
7c731e0a49 ftrace: make the task state char-string visible to all
The tracer wants to be able to convert the state number
into a user visible character. This patch pulls that conversion
string out the scheduler into the header. This way if it were to
ever change, other parts of the kernel will know.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:31:05 +02:00
Ingo Molnar
bd3bff9e20 sched: add latency tracer callbacks to the scheduler
add 3 lightweight callbacks to the tracer backend.

zero impact if tracing is turned off.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:30:55 +02:00
Mike Travis
363ab6f142 core: use performance variant for_each_cpu_mask_nr
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate

Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 18:35:12 +02:00
Mike Travis
a953e4597a sched: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c
* Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c,
    where appropriate.  This saves some allocated space as well as many
    wasted cycles going through node entries that are non-existent.

For inclusion into sched-devel/latest tree.

Based on:
	git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    +   sched-devel/latest  .../mingo/linux-2.6-sched-devel.git

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-23 18:22:17 +02:00
Mirco Tischler
0c70814c31 cgroups: fix compile warning
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.

Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Linus Torvalds
c3921ab715 Add new 'cond_resched_bkl()' helper function
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.

Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").

But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.

Also make fs/locks.c use it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11 16:04:48 -07:00
Linus Torvalds
8e3e076c5a BKL: revert back to the old spinlock implementation
The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair.  The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.

The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec261 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.

This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.

As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect.  We do want to get rid of the BKL asap, but that has been the
plan for several years.

These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:

  "tty release is probably a few months away from getting cured - I'm
   afraid it will almost certainly be the very last user of the BKL in
   tty to get fixed as it depends on everything else being sanely locked."

so while we're not there yet, we do have a plan of action.

Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10 20:58:02 -07:00
Peter Zijlstra
3e51f33fcc sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
this replaces the rq->clock stuff (and possibly cpu_clock()).

 - architectures that have an 'imperfect' hardware clock can set
   CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

 - the 'jiffie' window might be superfulous when we update tick_gtod
   before the __update_sched_clock() call in sched_clock_tick()

 - cpu_clock() might be implemented as:

     sched_clock_cpu(smp_processor_id())

   if the accuracy proves good enough - how far can TSC drift in a
   single jiffie when considering the filtering and idle hooks?

[ mingo@elte.hu: various fixes and cleanups ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Ingo Molnar
dfbf4a1bc3 sched: fix cpu clock
David Miller pointed it out that nothing in cpu_clock() sets
prev_cpu_time. This caused __sync_cpu_clock() to be called
all the time - against the intention of this code.

The result was that in practice we hit a global spinlock every
time cpu_clock() is called - which - even though cpu_clock()
is used for tracing and debugging, is suboptimal.

While at it, also:

- move the irq disabling to the outest layer,
  this should make cpu_clock() warp-free when called with irqs
  enabled.

- use long long instead of cycles_t - for platforms where cycles_t
  is 32-bit.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Miao Xie
cb4ad1ffc7 sched: fair-group: fix a Div0 error of the fair group scheduler
When I echoed 0 into the "cpu.shares" file, a Div0 error occured.

We found it is caused by the following calling.

   sched_group_set_shares(tg, shares)
       set_se_shares(tg->se[i], shares/nr_cpu_ids)
           __set_se_shares(se, shares)
               div64_64((1ULL<<32), shares)

When the echoed value was less than the number of processores, the result of the
sentence "shares/nr_cpu_ids" was 0, and then the system called div64() to divide
the result, the Div0 error occured.

It is unnecessary that the shares value is divided by nr_cpu_ids, I think.
Because in the function  __update_group_shares_cpu() and init_tg_cfs_entry(),
the shares value isn't divided by nr_cpu_ids when setting shares of the sched
entity.

This patch fixes this bug. And echoing ULONG_MAX value into cpu.shares also
causes Div0 error, so we set a macro MAX_SHARES to limit the max value of
shares.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Heiko Carstens
712555ee4f sched: fix missing locking in sched_domains code
Concurrent calls to detach_destroy_domains and arch_init_sched_domains
were prevented by the old scheduler subsystem cpu hotplug mutex. When
this got converted to get_online_cpus() the locking got broken.
Unlike before now several processes can concurrently enter the critical
sections that were protected by the old lock.

So use the already present doms_cur_mutex to protect these sections again.

Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Ingo Molnar
690229a091 sched: make clock sync tunable by architecture code
make time_sync_thresh tunable to architecture code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
David Simner
673a90a1e0 sched: fix sched_info_switch not being called according to documentation
http://bugzilla.kernel.org/show_bug.cgi?id=10545

sched_stats.h says that __sched_info_switch is "called when prev !=
next" in the comment.  sched.c should therefore do that.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Peter Zijlstra
b328ca182f sched: fix hrtick_start_fair and CPU-Hotplug
Gautham R Shenoy reported:

 > While running the usual CPU-Hotplug stress tests on linux-2.6.25,
 > I noticed the following in the console logs.
 >
 > This is a wee bit difficult to reproduce. In the past 10 runs I hit this
 > only once.
 >
 > ------------[ cut here ]------------
 >
 > WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65()
 >
 > Just wondering if we are doing a good job at handling the cancellation
 > of any per-cpu scheduler timers during CPU-Hotplug.

This looks like its indeed not cancelled at all and migrates the it to
another cpu. Fix it via a proper hotplug notifier mechanism.

Reported-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:18 +02:00
Harvey Harrison
983ed7a66b sched: add statics, don't return void expressions
Noticed by sparse:
kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static?
kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static?
kernel/sched_fair.c:845:3: warning: returning void-valued expression
kernel/sched.c:4386:3: warning: returning void-valued expression

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:17 +02:00
Andrew Morton
d478c2cfaa sched: add debug checks to idle functions
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: "Justin Mattock" <justinmattock@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:17 +02:00
Peter Zijlstra
e05510d01a sched: optimize calc_delta_mine()
Joel noticed that the !lw->inv_weight contition isn't unlikely anymore so
remove the unlikely annotation. Also, remove the two div64_u64() inv_weight
calculations, which makes them rely on the calc_delta_mine() path as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05 23:56:17 +02:00
Roman Zippel
6f6d6a1a6a rename div64_64 to div64_u64
Rename div64_64 to div64_u64 to make it consistent with the other divide
functions, so it clearly includes the type of the divide.  Move its definition
to math64.h as currently no architecture overrides the generic implementation.
 They can still override it of course, but the duplicated declarations are
avoided.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:03:58 -07:00
Paul Menage
06ecb27cfb CGroups _s64 files: use read_s64/write_s64 in CFS cgroup for rt_runtime file
This removes some filesystem boilerplate from the CFS cgroup subsystem.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:09 -07:00
Paul Menage
f4c753b7ea CGroup API files: rename read/write_uint methods to read_write_u64
Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.

[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:07 -07:00
Linus Torvalds
6e18933f2b Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=n
  [IA64] fix bootmem regression on Altix
2008-04-25 12:50:00 -07:00
Linus Torvalds
0b79dada97 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes:
  sched: fix share (re)distribution
  softlockup: fix NOHZ wakeup
  seqlock: livelock fix
2008-04-25 12:47:56 -07:00
David Miller
5a9d3225a0 sched: use alloc_bootmem() instead of alloc_bootmem_low()
There is no guarantee that there is physical ram below 4GB, and in
fact many boxes don't have exactly that.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-25 09:53:06 +02:00
Peter Zijlstra
3f5087a2ba sched: fix share (re)distribution
fix __aggregate_redistribute_shares() related lockup reported by
David S. Miller.

The problem this code tries to solve is 'accurately' calculating the 'fair'
share of the group weight for each cpu. The current code falls back to a global
group rebalance in case the sched_domain's span it looks at has no shares, but
does have tasks.

The reason it gets stuck here, is because its inherently racy - if someone
steals the last task after we compute the agg->rq_weight, but before we
rebalance, we'll never get out of the loop.

We could of course go fix that, but while looking at this issue I found that
this 'fallback' wasn't nearly as rare as I'd hoped it to be. In fact its quite
common - and given it walks the whole machine, thats very bad.

The new approach is simple (why didn't I think of it before?), we set the
aggregate shares to the full task group weight, and each larger sched domain
that encounters an aggregate shares larger than the weight, clips it (it
already re-distributes anyway).

This nicely converges to the desired global picture where the sum of all
shares equals the task group weight.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-25 00:25:08 +02:00
Mike Travis
03970f065d [PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=n
Regression caused by 434d53b00d

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-04-24 14:40:29 -07:00
Russ Anderson
472613b961 [IA64] fix bootmem regression on Altix
A recent change prevents SGI Altix from booting.
This patch fixes the problem.

The regresson was introduced in commit 434d53b00d

Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-04-24 14:21:21 -07:00
Randy Dunlap
73486722b7 kernel-doc: fix sched.c missing parameter
Add missing kernel-doc in kernel/sched.c:

Warning(linux-2.6.25-git3//kernel/sched.c:7044): No description found for parameter 'span'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-22 13:48:02 -07:00
Ingo Molnar
c24b7c5244 sched: features fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
f00b45c145 sched: /debug/sched_features
provide a text based interface to the scheduler features; this saves the
'user' from setting bits using decimal arithmetic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Ingo Molnar
06379aba52 sched: add SCHED_FEAT_DEADLINE
unused at the moment.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
8f1bc385cf sched: fair: weight calculations
In order to level the hierarchy, we need to calculate load based on the
root view. That is, each task's load is in the same unit.

             A
            / \
           B   1
          / \
         2   3

To compute 1's load we do:

	   weight(1)
	--------------
	 rq_weight(A)

To compute 2's load we do:

	  weight(2)      weight(B)
	------------ * -----------
	rq_weight(B)   rw_weight(A)

This yields load fractions in comparable units.

The consequence is that it changes virtual time. We used to have:

                time_{i}
  vtime_{i} = ------------
               weight_{i}

  vtime = \Sum vtime_{i} = time / rq_weight.

But with the new way of load calculation we get that vtime equals time.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
4a55bd5e97 sched: fair-group: de-couple load-balancing from the rb-trees
De-couple load-balancing from the rb-trees, so that I can change their
organization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
18d95a2832 sched: fair-group: SMP-nice for group scheduling
Implement SMP nice support for the full group hierarchy.

On each load-balance action, compile a sched_domain wide view of the full
task_group tree. We compute the domain wide view when walking down the
hierarchy, and readjust the weights when walking back up.

After collecting and readjusting the domain wide view, we try to balance the
tasks within the task_groups. The current approach is a naively balance each
task group until we've moved the targeted amount of load.

Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP
paper.

XXX: there will be some numerical issues due to the limited nature of
     SCHED_LOAD_SCALE wrt to representing a task_groups influence on the
     total weight. When the tree is deep enough, or the task weight small
     enough, we'll run out of bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Abhishek Chandra <chandra@cs.umn.edu>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Hidetoshi Seto
1d3504fcf5 sched, cpuset: customize sched domains, core
[rebased for sched-devel/latest]

 - Add a new cpuset file, having levels:
     sched_relax_domain_level

 - Modify partition_sched_domains() and build_sched_domains()
   to take attributes parameter passed from cpuset.

 - Fill newidle_idx for node domains which currently unused but
   might be required if sched_relax_domain_level become higher.

 - We can change the default level by boot option 'relax_domain_level='.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
b40b2e8eb5 sched: rt: multi level group constraints
multi level rt constraints

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
f473aa5e02 sched: task_group hierarchy
Add the full parent<->child relation thing into task_groups as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
eff766a65c sched: fix the task_group hierarchy for UID grouping
UID grouping doesn't actually have a task_group representing the root of
the task_group tree. Add one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Dhaval Giani
ec7dc8ac73 sched: allow the group scheduler to have multiple levels
This patch makes the group scheduler multi hierarchy aware.

[a.p.zijlstra@chello.nl: rt-parts and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Dhaval Giani
354d60c2ff sched: mix tasks and groups
This patch allows tasks and groups to exist in the same cfs_rq. With this
change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
fairness model where M tasks and N groups exist at the cfs_rq level.

[a.p.zijlstra@chello.nl: rt bits and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Ingo Molnar
ea736ed5d3 sched: fix checks
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Peter Zijlstra
112f53f5d7 sched: old sleeper bonus
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
cd8ba7cd9b sched: add new set_cpus_allowed_ptr function
Add a new function that accepts a pointer to the "newly allowed cpus"
cpumask argument.

int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)

The current set_cpus_allowed() function is modified to use the above
but this does not result in an ABI change.  And with some compiler
optimization help, it may not introduce any additional overhead.

Additionally, to enforce the read only nature of the new_mask arg, the
"const" property is migrated to sub-functions called by set_cpus_allowed.
This silences compiler warnings.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
e0982e90cd init: move setup of nr_cpu_ids to as early as possible
Move the setting of nr_cpu_ids from sched_init() to start_kernel()
so that it's available as early as possible.

Note that an arch has the option of setting it even earlier if need be,
but it should not result in a different value than the setup_nr_cpu_ids()
function.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
4bdbaad33d sched: remove another cpumask_t variable from stack
* Remove another cpumask_t variable from stack that was missed in the
      last kernel_sched_c updates.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
7c16ec585c cpumask: reduce stack usage in SD_x_INIT initializers
* Remove empty cpumask_t (and all non-zero/non-null) variables
    in SD_*_INIT macros.  Use memset(0) to clear.  Also, don't
    inline the initializer functions to save on stack space in
    build_sched_domains().

  * Merge change to include/linux/topology.h that uses the new
    node_to_cpumask_ptr function in the nr_cpus_node macro into
    this patch.

Depends on:
	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
	[sched-devel]: sched: add new set_cpus_allowed_ptr function

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
c5f59f0833 nodemask: use new node_to_cpumask_ptr function
* Use new node_to_cpumask_ptr.  This creates a pointer to the
    cpumask for a given node.  This definition is in mm patch:

	asm-generic-add-node_to_cpumask_ptr-macro.patch

  * Use new set_cpus_allowed_ptr function.

Depends on:
	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
	[sched-devel]: sched: add new set_cpus_allowed_ptr function
	[x86/latest]: x86: add cpus_scnprintf function

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
b53e921ba1 generic: reduce stack pressure in sched_affinity
* Modify sched_affinity functions to pass cpumask_t variables by reference
    instead of by value.

  * Use new set_cpus_allowed_ptr function.

Depends on:
	[sched-devel]: sched: add new set_cpus_allowed_ptr function

Cc: Paul Jackson <pj@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
f9a86fcbbb cpuset: modify cpuset_set_cpus_allowed to use cpumask pointer
* Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value.

  * Use new set_cpus_allowed_ptr function.

  * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

Depends on:
	[sched-devel]: sched: add new set_cpus_allowed_ptr function

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:58 +02:00
Mike Travis
434d53b00d sched: remove fixed NR_CPUS sized arrays in kernel_sched_c
* Change fixed size arrays to per_cpu variables or dynamically allocated
   arrays in sched_init() and sched_init_smp().

     (1) static struct sched_entity *init_sched_entity_p[NR_CPUS];
     (1) static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
     (1) static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
     (1) static struct rt_rq *init_rt_rq_p[NR_CPUS];
	 static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];

     (1) - these arrays are allocated via alloc_bootmem_low()

 * Change sched_domain_debug_one() to use cpulist_scnprintf instead of
   cpumask_scnprintf.  This reduces the output buffer required and improves
   readability when large NR_CPU count machines arrive.

 * In sched_create_group() we allocate new arrays based on nr_cpu_ids.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:58 +02:00
Dhaval Giani
0297b80339 sched: allow cpuacct stats to be reset
Currently the schedstats implementation does not allow the statistics
to be reset. This patch aims to allow that.

  echo 0 > cpuacct.usage

resets the usage. Any other value is not allowed and returns -EINVAL.

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:58 +02:00
Dhaval Giani
32cd756a80 sched: cleanup cpuacct variable names
Change the variable names to the common convention for the cpuacct
subsystem.

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:58 +02:00
Peter Zijlstra
ac086bc229 sched: rt-group: smp balancing
Currently the rt group scheduling does a per cpu runtime limit, however
the rt load balancer makes no guarantees about an equal spread of real-
time tasks, just that at any one time, the highest priority tasks run.

Solve this by making the runtime limit a global property by borrowing
excessive runtime from the other cpus once the local limit runs out.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:58 +02:00
Peter Zijlstra
d0b27fa778 sched: rt-group: synchonised bandwidth period
Various SMP balancing algorithms require that the bandwidth period
run in sync.

Possible improvements are moving the rt_bandwidth thing into root_domain
and keeping a span per rt_bandwidth which marks throttled cpus.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
50df5d6aea sched: remove sysctl_sched_batch_wakeup_granularity
it's unused.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
02e2b83bd2 sched: reenable sync wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
d25ce4cd49 sched: cache hot buddy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
1fc8afa4c8 sched: feat affine wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
b85d066726 sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off
turn off sync wakeups by default. They are not needed anymore - the
buddy logic should be smart enough to keep the system from
overscheduling.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Guillaume Chazarain
15934a3732 sched: fix rq->clock overflows detection with CONFIG_NO_HZ
When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
We check that the number of skipped ticks matches the clock jump seen in
__update_rq_clock().

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Reynes Philippe
30914a58af sched: sched.c needs tick.h
kernel/sched.c:506: erreur: implicit declaration of function tick_get_tick_sched
kernel/sched.c:506: erreur: invalid type argument of ->
kernel/sched.c:506: erreur: NOHZ_MODE_INACTIVE undeclared (first use in this function)
kernel/sched.c:506: erreur: (Each undeclared identifier is reported only once
kernel/sched.c:506: erreur: for each function it appears in.)

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar
27ec440779 sched: make cpu_clock() globally synchronous
Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.

Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Thomas Gleixner
06d8308c61 NOHZ: reevaluate idle sleep length after add_timer_on()
add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.

To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.

Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.

Call this function from add_timer_on().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org

--
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/timer.c        |   10 +++++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)
2008-03-26 08:28:55 +01:00
Heiko Carstens
22e52b072d sched: add arch_update_cpu_topology hook.
Will be called each time the scheduling domains are rebuild.
Needed for architectures that don't have a static cpu topology.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:48 +01:00
Heiko Carstens
9aefd0abd8 sched: add exported arch_reinit_sched_domains() to header file.
Needed so it can be called from outside of sched.c.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Roel Kluin
23e3c3cd2e sched: remove double unlikely from schedule()
Combine two unlikely's

Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Peter Zijlstra
2070ee01d3 sched: cleanup old and rarely used 'debug' features.
TREE_AVG and APPROX_AVG are initial task placement policies that have been
disabled for a long while.. time to remove them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Ingo Molnar
f540a6080a sched: wakeup-buddy tasks are cache-hot
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently assymetric)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Ingo Molnar
4ae7d5cefd sched: improve affine wakeups
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.

Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.

( Also slightly move the preempt_check within try_to_wake_up() - this has
  no effect on functionality but allows 'early wakeups' (for still-on-rq
  tasks) to be correctly accounted as well.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Peter Zijlstra
aa2ac25229 sched: fix overload performance: buddy wakeups
Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks trashing the cache.

Reduce cache trashing by keeping dependent tasks together by running
newly woken tasks first. However, by not running the leftmost task first
we could starve tasks because the wakee can gain unlimited runtime.

Therefore we only run the wakee if its within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but does alternate server/client task groups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15 03:02:50 +01:00
Ingo Molnar
27d1172660 sched: fix calc_delta_mine()
lw->weight can be 0 for a short time during bootup.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:50 +01:00
Ingo Molnar
e89996ae3f sched: fix update_load_add()/sub()
Clear the cached inverse value when updating load. This is needed for
calc_delta_mine() to work correctly when using the rq load.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:49 +01:00
Hiroshi Shimamoto
0e1f34833b sched: fix race in schedule()
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.

There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.

Here is a possible scenario:

   CPU0                               CPU1
    |                              schedule()
    |                              ->deactivate_task()
    |                              ->idle_balance()
    |                              -->load_balance_newidle()
rt_mutex_setprio()                     |
    |                              --->double_lock_balance()
    *get lock                          *rel lock
    * on_rq=0, ruuning=1               |
    * sched_class is changed           |
    *rel lock                          *get lock
    :                                  |
                                       :
                                   ->put_prev_task_rt()
                                   ->pick_next_task_fair()
                                       => panic

The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15 03:02:49 +01:00
Gregory Haskins
08f503b0c0 keep rd->online and cpu_online_map in sync
It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map.  This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 14:02:58 +01:00
Gregory Haskins
1f94ef598e Revert "cpu hotplug: adjust root-domain->online span in response to hotplug event"
This reverts commit 393d94d98b.

Lets fix this right.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 14:02:58 +01:00
Gregory Haskins
393d94d98b cpu hotplug: adjust root-domain->online span in response to hotplug event
We currently set the root-domain online span automatically when the
domain is added to the cpu if the cpu is already a member of
cpu_online_map.

This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning.  The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.

This solves the hung reboot regression reported by Andrew Morton and
others.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-09 10:05:14 -07:00
Dhaval Giani
521f1a2489 sched: don't allow rt_runtime_us to be zero for groups having rt tasks
This patch checks if we can set the rt_runtime_us to 0. If there is a
realtime task in the group, we don't want to set the rt_runtime_us as 0
or bad things will happen. (that task wont get any CPU time despite
being TASK_RUNNNG)

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Peter Zijlstra
2692a2406b sched: rt-group: fixup schedulability constraints calculation
it was only possible to configure the rt-group scheduling parameters
beyond the default value in a very small range.

that's because div64_64() has a different calling convention than
do_div() :/

fix a few untidies while we are here; sysctl_sched_rt_period may overflow
due to that multiplication, so cast to u64 first. Also that RUNTIME_INF
juggling makes little sense although its an effective NOP.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Miao Xie
1868f958eb sched: fix the wrong time slice value for SCHED_FIFO tasks
Function sys_sched_rr_get_interval returns wrong time slice value for
SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Pavel Roskin
150d8bede7 sched: export task_nice
The API is trivial, and so is the implementation.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Peter Zijlstra
810b38179e sched: retain vruntime
Kei Tokunaga reported an interactivity problem when moving tasks
between control groups.

Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.

Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:42:59 +01:00
Peter Zijlstra
62fb185130 sched: revert load_balance_monitor() changes
The following commits cause a number of regressions:

  commit 58e2d4ca58
  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Date:   Fri Jan 25 21:08:00 2008 +0100
  sched: group scheduling, change how cpu load is calculated

  commit 6b2d770026
  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Date:   Fri Jan 25 21:08:00 2008 +0100
  sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups

Namely:
 - very frequent wakeups on SMP, reported by PowerTop users.
 - cacheline trashing on (large) SMP
 - some latencies larger than 500ms

While there is a mergeable patch to fix the latter, the former issues
are not fixable in a manner suitable for .25 (we're at -rc3 now).

Hence we revert them and try again in v2.6.26.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-04 17:54:06 +01:00
Harvey Harrison
67ca7bde2e sched: fix signedness warnings in sched.c
Unsigned long values are always assigned to switch_count,
make it unsigned long.

kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3897:15:    expected long *switch_count
kernel/sched.c:3897:15:    got unsigned long *<noident>
kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3921:16:    expected long *switch_count
kernel/sched.c:3921:16:    got unsigned long *<noident>

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:17 +01:00
Ingo Molnar
6892b75e60 sched: make early bootup sched_clock() use safer
do not call sched_clock() too early. Not only might rq->idle
not be set up - but pure per-cpu data might not be accessible
either.

this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y.

Tested-by: Tony Luck <tony.luck@gmail.com>
Acked-by: Tony Luck <tony.luck@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:16 +01:00
Linus Torvalds
04e2f1741d Add memory barrier semantics to wake_up() & co
Oleg Nesterov and others have pointed out that on some architectures,
the traditional sequence of

	set_current_state(TASK_INTERRUPTIBLE);
	if (CONDITION)
		return;
	schedule();

is racy wrt another CPU doing

	CONDITION = 1;
	wake_up_process(p);

because while set_current_state() has a memory barrier separating
setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION
variable, there is no such memory barrier on the wakeup side.

Now, wake_up_process() does actually take a spinlock before it reads and
sets the task state on the waking side, and on x86 (and many other
architectures) that spinlock is in fact equivalent to a memory barrier,
but that is not generally guaranteed.  The write that sets CONDITION
could move into the critical region protected by the runqueue spinlock.

However, adding a smp_wmb() to before the spinlock should now order the
writing of CONDITION wrt the lock itself, which in turn is ordered wrt
the accesses within the spinlock (which includes the reading of the old
state).

This should thus close the race (which probably has never been seen in
practice, but since smp_wmb() is a no-op on x86, it's not like this will
make anything worse either on the most common architecture where the
spinlock already gave the required protection).

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 18:05:03 -08:00
Srinivasa Ds
4362758279 kprobes: refuse kprobe insertion on add/sub_preempt_counter()
Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these
functions inturn call add/sub_preempt_count().  So we need to refuse user from
inserting probe in to these functions.

This patch disallows user from probing add/sub_preempt_count().

Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Peter Zijlstra
b68aa2300c sched: rt-group: refure unrunnable tasks
Refuse to accept or create RT tasks in groups that can't run them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra
bccbe08a60 sched: rt-group: clean up the ifdeffery
Clean up some of the excessive ifdeffery introduces in the last patch.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra
052f1dc7eb sched: rt-group: make rt groups scheduling configurable
Make the rt group scheduler compile time configurable.
Keep it experimental for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra
9f0c1e560c sched: rt-group: interface
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
This avoids picking a granularity for the ratio.

Extend the /sys/kernel/uids/<uid>/ interface to allow setting
the group's rt_runtime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra
23b0fdfc92 sched: rt-group: deal with PI
Steven mentioned the fun case where a lock holding task will be throttled.

Simple fix: allow groups that have boosted tasks to run anyway.

If a runnable task in a throttled group gets boosted the dequeue/enqueue
done by rt_mutex_setprio() is enough to unthrottle the group.

This is ofcourse not quite correct. Two possible ways forward are:
  - second prio array for boosted tasks
  - boost to a prio ceiling (this would also work for deadline scheduling)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra
4cf5d77a6e sched: fix incorrect irq lock usage in normalize_rt_tasks()
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called
from hardirq context through sysrq-n

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra
8ed3699682 sched: fair-group: separate tg->shares from task_group_lock
On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote:
> BUG: sleeping function called from invalid context
> at /home/den/src/linux-netns26/kernel/mutex.c:209
> in_atomic():1, irqs_disabled():0
> no locks held by swapper/0.
> Pid: 0, comm: swapper Not tainted 2.6.24 #304
>
> Call Trace:
>  <IRQ>  [<ffffffff80252d1e>] ? __debug_show_held_locks+0x15/0x27
>  [<ffffffff8022c2a8>] __might_sleep+0xc0/0xdf
>  [<ffffffff8049f1df>] mutex_lock_nested+0x28/0x2a9
>  [<ffffffff80231294>] sched_destroy_group+0x18/0xea
>  [<ffffffff8023e835>] sched_destroy_user+0xd/0xf
>  [<ffffffff8023e8c1>] free_uid+0x8a/0xab
>  [<ffffffff80233e24>] __put_task_struct+0x3f/0xd3
>  [<ffffffff80236708>] delayed_put_task_struct+0x23/0x25
>  [<ffffffff8026fda7>] __rcu_process_callbacks+0x8d/0x215
>  [<ffffffff8026ff52>] rcu_process_callbacks+0x23/0x44
>  [<ffffffff8023a2ae>] __do_softirq+0x79/0xf8
>  [<ffffffff8020f8c3>] ? profile_pc+0x2a/0x67
>  [<ffffffff8020d38c>] call_softirq+0x1c/0x30
>  [<ffffffff8020f689>] do_softirq+0x61/0x9c
>  [<ffffffff8023a233>] irq_exit+0x51/0x53
>  [<ffffffff8021bd1a>] smp_apic_timer_interrupt+0x77/0xad
>  [<ffffffff8020ce3b>] apic_timer_interrupt+0x6b/0x70
>  <EOI>  [<ffffffff8020b0dd>] ? default_idle+0x43/0x76
>  [<ffffffff8020b0db>] ? default_idle+0x41/0x76
>  [<ffffffff8020b09a>] ? default_idle+0x0/0x76
>  [<ffffffff8020b186>] ? cpu_idle+0x76/0x98

separate the tg->shares protection from the task_group lock.

Reported-by: Denis V. Lunev <den@openvz.org>
Tested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Harvey Harrison
7ad5b3a505 kernel: remove fastcall in kernel/*
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:31 -08:00
Linus Torvalds
75659ca0c1 Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
  Remove commented-out code copied from NFS
  NFS: Switch from intr mount option to TASK_KILLABLE
  Add wait_for_completion_killable
  Add wait_event_killable
  Add schedule_timeout_killable
  Use mutex_lock_killable in vfs_readdir
  Add mutex_lock_killable
  Use lock_page_killable
  Add lock_page_killable
  Add fatal_signal_pending
  Add TASK_WAKEKILL
  exit: Use task_is_*
  signal: Use task_is_*
  sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
  ptrace: Use task_is_*
  power: Use task_is_*
  wait: Use TASK_NORMAL
  proc/base.c: Use task_is_*
  proc/array.c: Use TASK_REPORT
  perfmon: Use task_is_*
  ...

Fixed up conflicts in NFS/sunrpc manually..
2008-02-01 11:45:47 +11:00
Gerald Stralko
5aff0531ee sched: remove unused params
This removes the extra struct task_struct *p parameter in inc_nr_running
and dec_nr_running functions.

Signed-off by: Jerry Stralko <gerb.stralko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-31 22:45:23 +01:00
Nick Piggin
95c354fe9f spinlock: lockbreak cleanup
The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.

Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.

Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:20 +01:00
Nick Piggin
5fb5e6de55 sched: print backtrace of running tasks too
The attached patch is something really simple that can sometimes help
in getting more info out of a hung system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Guillaume Chazarain
cc203d2422 sched: monitor clock underflows in /proc/sched_debug
We monitor clock overflows, let's also monitor clock underflows.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Guillaume Chazarain
782daeee3d sched: fix rq->clock warps on frequency changes
sched: fix rq->clock warps on frequency changes

Fix 2bacec8c31
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq->clock for warps, so call it after adjusting rq->clock.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Ingo Molnar
6478d8800b sched: remove the !PREEMPT_BKL code
remove the !PREEMPT_BKL code.

this removes 160 lines of legacy code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Peter Zijlstra
48d5e25821 sched: rt throttling vs no_hz
We need to teach no_hz about the rt throttling because its tick driven.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:31 +01:00
Peter Zijlstra
6f505b1642 sched: rt group scheduling
Extend group scheduling to also cover the realtime classes. It uses the time
limiting introduced by the previous patch to allow multiple realtime groups.

The hard time limit is required to keep behaviour deterministic.

The algorithms used make the realtime scheduler O(tg), linear scaling wrt the
number of task groups. This is the worst case behaviour I can't seem to get out
of, the avg. case of the algorithms can be improved, I focused on correctness
and worst case.

[ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:30 +01:00
Peter Zijlstra
fa85ae2418 sched: rt time limit
Very simple time limit on the realtime scheduling classes.
Allow the rq's realtime class to consume sched_rt_ratio of every
sched_rt_period slice. If the class exceeds this quota the fair class
will preempt the realtime class.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:29 +01:00
Peter Zijlstra
8f4d37ec07 sched: high-res preemption tick
Use HR-timers (when available) to deliver an accurate preemption tick.

The regular scheduler tick that runs at 1/HZ can be too coarse when nice
level are used. The fairness system will still keep the cpu utilisation 'fair'
by then delaying the task that got an excessive amount of CPU time but try to
minimize this by delivering preemption points spot-on.

The average frequency of this extra interrupt is sched_latency / nr_latency.
Which need not be higher than 1/HZ, its just that the distribution within the
sched_latency period is important.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:29 +01:00
Herbert Xu
02b67cc3ba sched: do not do cond_resched() when CONFIG_PREEMPT
Why do we even have cond_resched when real preemption
is on? It seems to be a waste of space and time.

remove cond_resched with CONFIG_PREEMPT on.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:28 +01:00
Ingo Molnar
03319ec8b0 sched: documentation, whitespace fixes
whitespace fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:28 +01:00
Peter Zijlstra
fa717060f1 sched: sched_rt_entity
Move the task_struct members specific to rt scheduling together.
A future optimization could be to put sched_entity and sched_rt_entity
into a union.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:27 +01:00
Gregory Haskins
dc938520d2 sched: dynamically update the root-domain span/online maps
The baseline code statically builds the span maps when the domain is formed.
Previous attempts at dynamically updating the maps caused a suspend-to-ram
regression, which should now be fixed.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:26 +01:00
Steven Rostedt
cb46984504 sched: RT-balance, add new methods to sched_class
Dmitry Adamushko found that the current implementation of the RT
balancing code left out changes to the sched_setscheduler and
rt_mutex_setprio.

This patch addresses this issue by adding methods to the schedule classes
to handle being switched out of (switched_from) and being switched into
(switched_to) a sched_class. Also a method for changing of priorities
is also added (prio_changed).

This patch also removes some duplicate logic between rt_mutex_setprio and
sched_setscheduler.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:22 +01:00
Steven Rostedt
9a897c5a67 sched: RT-balance, replace hooks with pre/post schedule and wakeup methods
To make the main sched.c code more agnostic to the schedule classes.
Instead of having specific hooks in the schedule code for the RT class
balancing. They are replaced with a pre_schedule, post_schedule
and task_wake_up methods. These methods may be used by any of the classes
but currently, only the sched_rt class implements them.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:22 +01:00
Dmitry Adamushko
5d2f5a616d sched: get rid of 'new_cpu' in try_to_wake_up()
Clean-up try_to_wake_up().

Get rid of the 'new_cpu' variable in try_to_wake_up() [ that's, one
#ifdef section less ].  Also remove a few redundant blank lines.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:21 +01:00
Ingo Molnar
b913176917 sched: add credits for RT balancing improvements
add credits for RT balancing improvements.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:19 +01:00
Ingo Molnar
0eab914657 sched: style cleanup, #2
style cleanup of various changes that were done recently.

no code changed:

      text    data     bss     dec     hex filename
     26399    2578      48   29025    7161 sched.o.before
     26399    2578      48   29025    7161 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:19 +01:00
Ingo Molnar
d7876a08db sched: remove unused JIFFIES_TO_NS() macro
remove unused JIFFIES_TO_NS() macro.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:19 +01:00
Gregory Haskins
637f50851b sched: only balance our RT tasks within our domain
We move the rt-overload data as the first global to per-domain
reclassification.  This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system.  Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down.  Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain.  However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution.  If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:18 +01:00
Gregory Haskins
57d885fea0 sched: add sched-domain roots
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables.  Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset.  However, we currently still maintain some
policy/state as global variables which transcend all cpusets.  Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one.  This logic doesn't have any clients
yet but it will later in the series.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:18 +01:00
Steven Rostedt
0d1311a536 sched: RT-balance on new task
rt-balance when creating new tasks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:14 +01:00
Gregory Haskins
a22d7fc187 sched: wake-balance fixes
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:12 +01:00
Gregory Haskins
e7693a362e sched: de-SCHED_OTHER-ize the RT path
The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations.  The problem is that
these calculations are only relevant to SCHED_OTHER tasks where load is king.
For RT tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation as a function
of CFS task's class.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:09 +01:00
Gregory Haskins
73fe6aae84 sched: add RT-balance cpu-weight
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbounded task queue
   up behind the running task, we will fail to ever relocate the unbounded
   task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
   overloaded tasks over.  It is therefore optimial to strive to avoid the
   overhead all together if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We will then utilize this information to
skip non-migratable tasks and to eliminate uncessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:07 +01:00
Steven Rostedt
4642dafdf9 sched: push RT tasks from overloaded CPUs
This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) added to the run queue.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:07 +01:00
Steven Rostedt
f65eda4f78 sched: pull RT tasks from overloaded runqueues
This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull RT is initiated, all overloaded runqueues are examined for
a RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds a RT task that is of higher
prio than the highest prio task on the target runqueue is found it is pulled
to the target runqueue.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:07 +01:00
Steven Rostedt
e8fa136262 sched: add RT task pushing
This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority.  The way this works is that the highest
RT task that is not running is looked at and we examine the runqueues on
the CPUS for that tasks affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUS.

TODO: The algorithm may stop when there are still RT tasks that can be
 migrated. Specifically, if the highest non running RT task CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued that has an affinity with a CPU that is
 running a lower priority task that it could be migrated to.  This
 patch set does not address this issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:05 +01:00
Steven Rostedt
764a9d6fe4 sched: track highest prio task queued
This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:04 +01:00
Steven Rostedt
63489e45e2 sched: count # of queued RT tasks
This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:03 +01:00
Ingo Molnar
82a1fcb902 softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
this patch extends the soft-lockup detector to automatically
detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
printed the following way:

 ------------------>
 INFO: task prctl:3042 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
 prctl         D fd5e3793     0  3042   2997
        f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
        f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
        f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
 Call Trace:
  [<c04883a5>] schedule_timeout+0x6d/0x8b
  [<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17
  [<c0133a76>] msleep+0x10/0x16
  [<c0138974>] sys_prctl+0x30/0x1e2
  [<c0104c52>] sysenter_past_esp+0x5f/0xa5
  =======================
 2 locks held by prctl/3042:
 #0:  (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a
 #1:  (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9
 <------------------

the current default timeout is 120 seconds. Such messages are printed
up to 10 times per bootup. If the system has crashed already then the
messages are not printed.

if lockdep is enabled then all held locks are printed as well.

this feature is a natural extension to the softlockup-detector (kernel
locked up without scheduling) and to the NMI watchdog (kernel locked up
with IRQs disabled).

[ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ]
[ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2008-01-25 21:08:02 +01:00
Gautham R Shenoy
95402b3829 cpu-hotplug: replace per-subsystem mutexes with get_online_cpus()
This patch converts the known per-subsystem mutexes to get_online_cpus
put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and
CPU_LOCK_RELEASE hotplug notification events.

Signed-off-by: Gautham  R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:02 +01:00
Gautham R Shenoy
86ef5c9a8e cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()
Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
get_online_cpus and put_online_cpus instead as it highlights the
refcount semantics in these operations.

The new API guarantees protection against the cpu-hotplug operation, but
it doesn't guarantee serialized access to any of the local data
structures. Hence the changes needs to be reviewed.

In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:02 +01:00
Srivatsa Vaddagiri
6b2d770026 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
The current load balancing scheme isn't good enough for precise
group fairness.

For example: on a 8-cpu system, I created 3 groups as under:

	a = 8 tasks (cpu.shares = 1024)
	b = 4 tasks (cpu.shares = 1024)
	c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
	group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
        percentage. Col3 data is derived as:
		Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
		Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

	a = +24.47%
	b = -9.13%
	c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

	a = +0.67%
	b = -0.83%
	c = +0.17%

which is far better IMO. Most of other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with current scheuler?
=========================================================

Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

	CPU0	CPU1
---------------------------
	A (10)  B(5)
		C(5)
---------------------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

	A = 50%
	B = 25%
	C = 25%

which is not the desired result.

What's changing in the patch?
=============================

	- How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is
	  defined (see below)
	- API Change
		- Two tunables introduced in sysfs (under SCHED_DEBUG) to
		  control the frequency at which the load balance monitor
		  thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. Its now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
	W(K,i)  = Weight of group K on cpu i
	T(K,i)  = Task load present in group K's cfs_rq on cpu i
	T(K)    = Total task load of group K across various cpus
	S(K) 	= Shares allocated to group K
	NRCPUS	= Number of online cpus in the scheduler domain to
	 	  which group K is assigned.

Then,
	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate
at which it runs.

Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>

- don't start the load_balance_monitor when there is only a single cpu.
- rename the kthread because its currently longer than TASK_COMM_LEN

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri
a183561567 sched: introduce a mutex and corresponding API to serialize access to doms_curarray
doms_cur[] array represents various scheduling domains which are
mutually exclusive. Currently cpusets code can modify this array (by
calling partition_sched_domains()) as a result of user modifying
sched_load_balance flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read
the doms_cur[] array w/o worrying abt concurrent modifications to the
array.

The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk thr' doms_cur[] array while rebalancing
shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri
58e2d4ca58 sched: group scheduling, change how cpu load is calculated
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now
a summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch has a minor impact on code size, but should have
no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri
ec2c507fe8 sched: group scheduling, minor fixes
Minor bug fixes for the group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing
  cfs_rq debugging stats for various task groups.

- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
  walking task group list)

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:07:59 +01:00
Srivatsa Vaddagiri
93f992ccc0 sched: group scheduling code cleanup
Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:07:59 +01:00