2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1336 Commits

Author SHA1 Message Date
Hugh Dickins
d936cf9b39 badpage: ratelimit print_bad_pte and bad_page
print_bad_pte() and bad_page() might each need ratelimiting - especially
for their dump_stacks, almost never of interest, yet not quite
dispensible.  Correlating corruption across neighbouring entries can be
very helpful, so allow a burst of 60 reports before keeping quiet for the
remainder of that minute (or allow a steady drip of one report per
second).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:07 -08:00
Hugh Dickins
22b31eec63 badpage: vm_normal_page use print_bad_pte
print_bad_pte() is so far being called only when zap_pte_range() finds
negative page_mapcount, or there's a fault on a pte_file where it does not
belong.  That's weak coverage when we suspect pagetable corruption.

Originally, it was called when vm_normal_page() found an invalid pfn: but
pfn_valid is expensive on some architectures and configurations, so 2.6.24
put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
2.6.26 replaced it by a VM_BUG_ON (likewise).

Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a
__read_mostly note of the highest_memmap_pfn, vm_normal_page() then check
pfn against that.  We could call this pfn_plausible() or pfn_sane(), but I
doubt we'll need it elsewhere: of course it's not reliable, but gives much
stronger pagetable validation on many boxes.

Also use print_bad_pte() when the pte_special bit is found outside a
VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:07 -08:00
Hugh Dickins
3dc147414c badpage: replace page_remove_rmap Eeek and BUG
Now that bad pages are kept out of circulation, there is no need for the
infamous page_remove_rmap() BUG() - once that page is freed, its negative
mapcount will issue a "Bad page state" message and the page won't be
freed.  Removing the BUG() allows more info, on subsequent pages, to be
gathered.

We do have more info about the page at this point than bad_page() can know
- notably, what the pmd is, which might pinpoint something like low 64kB
corruption - but page_remove_rmap() isn't given the address to find that.

In practice, there is only one call to page_remove_rmap() which has ever
reported anything, that from zap_pte_range() (usually on exit, sometimes
on munmap).  It has all the info, so remove page_remove_rmap()'s "Eeek"
message and leave it all to zap_pte_range().

mm/memory.c already has a hardly used print_bad_pte() function, showing
some of the appropriate info: extend it to show what we want for the rmap
case: pte info, page info (when there is a page) and vma info to compare.
zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
use if it works that out for itself.

Some of this info is also shown in bad_page()'s "Bad page state" message.
Keep them separate, but adjust them to match each other as far as
possible.  Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
there too.

print_bad_pte() show current->comm unconditionally (though it should get
repeated in the usually irrelevant stack trace): sorry, I misled Nick
Piggin to make it conditional on vm_mm == current->mm, but current->mm is
already NULL in the exit case.  Usually current->comm is good, though
exceptionally it may not be that of the mm (when "swapoff" for example).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:07 -08:00
Hugh Dickins
8cc3b39221 badpage: keep any bad page out of circulation
Until now the bad_page() checkers have special-cased PageReserved, keeping
those pages out of circulation thereafter.  Now extend the special case to
all: we want to keep ANY page with bad state out of circulation - the
"free" page may well be in use by something.

Leave the bad state of those pages untouched, for examination by
debuggers; except for PageBuddy - leaving that set would risk bringing the
page back.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:07 -08:00
Hugh Dickins
79f4b7bf39 badpage: simplify page_alloc flag check+clear
Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
a page: check the same flags as before when freeing, clear ALL the flags
(unless PageReserved) when freeing, check ALL flags off when allocating.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:07 -08:00
KOSAKI Motohiro
efab818641 mm: make setup_per_zone_inactive_ratio() static
Sparse output following warning.

mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?

cleanup here.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:05 -08:00
Hugh Dickins
b962716b45 mm: optimize get_scan_ratio for no swap
Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the gcc
optimizer gives us that, when nr_swap_pages is #defined as 0L.  Move usual
declaration to swapfile.c: it never belonged in page_alloc.c.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:04 -08:00
Julia Lawall
58a01a4572 mm/page_alloc.c: eliminate NULL test and memset after alloc_bootmem
As noted by Akinobu Mita in patch b1fceac2b9,
alloc_bootmem and related functions never return NULL and always return a
zeroed region of memory.  Thus a NULL test or memset after calls to these
functions is unnecessary.

This was fixed using the following semantic patch.
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@@
expression E;
statement S;
@@

E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
(
- BUG_ON (E == NULL);
|
- if (E == NULL) S
)

@@
expression E,E1;
@@

E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
- memset(E,0,E1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:02 -08:00
Yinghai Lu
5594c8c813 mm: print out memmap number only if it is not zero
Don't print the size of the zone's memmap array if it does not have one.

Impact: cleanup

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:00 -08:00
David Rientjes
e33c3b5e17 cpusets: update mems allowed in page allocator
If all allowable memory is unreclaimable, it is possible to loop forever
in the page allocator for ~__GFP_NORETRY allocations.

During this time, it is also possible for a task's cpuset to expand its
set of allowable nodes so that it now includes free memory.  The cached
copy of this set, current->mems_allowed, is stale, however, since there
has not been a subsequent call to cpuset_update_task_memory_state().

The cached copy of the set of allowable nodes is now updated in the page
allocator's slow path so the additional memory is available to
get_page_from_freelist().

[akpm@linux-foundation.org: add comment]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12 17:17:16 -08:00
Andy Whitcroft
18229df5b6 hugetlb: pull gigantic page initialisation out of the default path
As we can determine exactly when a gigantic page is in use we can optimise
the common regular page cases by pulling out gigantic page initialisation
into its own function.  As gigantic pages are never released to buddy we
do not need a destructor.  This effectivly reverts the previous change to
the main buddy allocator.  It also adds a paranoid check to ensure we
never release gigantic pages from hugetlbfs to the main buddy.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06 15:41:18 -08:00
KAMEZAWA Hiroyuki
52d4b9ac0b memcg: allocate all page_cgroup at boot
Allocate all page_cgroup at boot and remove page_cgroup poitner from
struct page.  This patch adds an interface as

 struct page_cgroup *lookup_page_cgroup(struct page*)

All FLATMEM/DISCONTIGMEM/SPARSEMEM  and MEMORY_HOTPLUG is supported.

Remove page_cgroup pointer reduces the amount of memory by
 - 4 bytes per PAGE_SIZE.
 - 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)

On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.

By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
  - we're not necessary to be afraid of kmalloc faiulre.
    (this can happen because of gfp_mask type.)
  - we can avoid calling kmalloc/kfree.
  - we can avoid allocating tons of small objects which can be fragmented.
  - we can know what amount of memory will be used for this extra-lru handling.

I added printk message as

	"allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

maybe enough informative for users.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Gerald Schaefer
1125b4e394 setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lock
This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
There seems to be no need for the lru_lock anymore, but there is a need for
zone->lock instead, because that function may call move_freepages() via
setup_zone_migrate_reserve().

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Yinghai Lu
d903ef9f38 mm: print out meminit for memmap
Improve debuggability of memory setup problems.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Lee Schermerhorn
985737cf2e mlock: count attempts to free mlocked page
Allow free of mlock()ed pages.  This shouldn't happen, but during
developement, it occasionally did.

This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Nick Piggin
b291f00039 mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>

splitlru: introduce __get_user_pages():

  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:30 -07:00
Lee Schermerhorn
7b854121eb Unevictable LRU Page Statistics
Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Rik van Riel
556adecba1 vmscan: second chance replacement for anonymous pages
We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages.  At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by not
taking the referenced state into account when deactivating an anonymous
page.  After all, every anonymous page starts out referenced, so why
check?

If an anonymous page gets referenced again before it reaches the end of
the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).

Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.

[kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
[kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Rik van Riel
4f98a2fee8 vmscan: split LRU lists into anon & file sets
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon").  The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists.  The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Rik van Riel
b2e185384f define page_file_cache() function
Define page_file_cache() function to answer the question:
	is page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted to make
available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will be needed by
subsequent "split LRU" and "noreclaim" patches.

Unfortunately this needs to use a page flag, since the PG_swapbacked state
needs to be preserved all the way to the point where the page is last
removed from the LRU.  Trying to derive the status from other info in the
page resulted in wrong VM statistics in earlier split VM patchsets.

The total number of page flags in use on a 32 bit machine after this patch
is 19.

[akpm@linux-foundation.org: fix up out-of-order merge fallout]
[hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Christoph Lameter
b69408e88b vmscan: Use an indexed array for LRU variables
Currently we are defining explicit variables for the inactive and active
list.  An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on the
reclaim code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Andrew Morton
db99100d2e mm/page_alloc.c:free_area_init_nodes() fix inappropriate use of enum
Local variable `i' is a) misleadingly-named for an `enum zone_type' and b)
used for indexing zones as well as nodes as well as node_maps.

Make it an `int'.

Reported-by: Frans Pop <elendil@planet.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:29 -07:00
Andy Whitcroft
6babc32c41 mm: handle initialising compound pages at orders greater than MAX_ORDER
When we initialise a compound page we initialise the page flags and head
page pointer for all base pages spanned by that page.  When we initialise
a gigantic page (a page of order greater than or equal to MAX_ORDER) we
have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
assume that all elements of the mem_map in this page are contigious in
memory.  However this is only guarenteed out to MAX_ORDER_NR_PAGES pages,
and with SPARSEMEM enabled they will not be contigious.  This leads us to
walk off the end of the first section and scribble on everything which
follows, BAD.

When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next
section of the mem_map.  As gigantic pages can only be maximally aligned
we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from
the start of the page.

This is a bug fix for the gigantic page support in hugetlbfs.

Credit to Mel Gorman for spotting the issue.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02 15:53:13 -07:00
Marcin Slusarz
527655835e mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data
WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
The variable contig_page_data references
the variable __initdata bootmem_node_data
If the reference is valid then annotate the
variable with __init* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:37 -07:00
Adam Litke
344c790e38 mm: make setup_zone_migrate_reserve() aware of overlapping nodes
I have gotten to the root cause of the hugetlb badness I reported back on
August 15th.  My system has the following memory topology (note the
overlapping node):

            Node 0 Memory: 0x8000000-0x44000000
            Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000.  When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone.  Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:37 -07:00
Jan Beulich
74768ed833 page allocator: use no-panic variant of alloc_bootmem() in alloc_large_system_hash()
..  since a failed allocation is being (initially) handled gracefully, and
panic()-ed upon failure explicitly in the function if retries with smaller
sizes failed.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12 16:07:27 -07:00
Yinghai Lu
1d1958f050 mm: remove find_max_pfn_with_active_regions
It has no user now

Also print out info about adding/removing active regions.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-30 09:41:44 -07:00
Rusty Russell
9b1a4d3837 stop_machine: Wean existing callers off stop_machine_run()
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-28 12:16:31 +10:00
Yasunori Goto
af370fb8cb memory hotplug: small fixes to bootmem freeing for memory hotremove
- Change some naming
  * Magic -> types
  * MIX_INFO -> MIX_SECTION_INFO
  * Change definition of bootmem type from direct hex value

- __free_pages_bootmem() becomes __meminit.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:21 -07:00
Adrian Bunk
b69a7288ea mm/page_alloc.c: cleanups
This patch contains the following cleanups:
- make the following needlessly global variables static:
  - required_kernelcore
  - zone_movable_pfn[]
- make the following needlessly global functions static:
  - move_freepages()
  - move_freepages_block()
  - setup_pageset()
  - find_usable_zone_for_movable()
  - adjust_zone_range_for_zone_movable()
  - __absent_pages_in_range()
  - find_min_pfn_for_node()
  - find_zone_movable_pfns_for_nodes()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:20 -07:00
Timur Tabi
2be0ffe2b2 mm: add alloc_pages_exact() and free_pages_exact()
alloc_pages_exact() is similar to alloc_pages(), except that it allocates
the minimum number of pages to fulfill the request.  This is useful if you
want to allocate a very large buffer that is slightly larger than an even
power-of-two number of pages.  In that case, alloc_pages() will waste a
lot of memory.

I have a video driver that wants to allocate a 5MB buffer.  alloc_pages()
wiill waste 3MB of physically-contiguous memory.

Signed-off-by: Timur Tabi <timur@freescale.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:20 -07:00
Andi Kleen
01ad1c0827 mm: export prep_compound_page to mm
hugetlb will need to get compound pages from bootmem to handle the case of
them being greater than or equal to MAX_ORDER.  Export the constructor
function needed for this.

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Johannes Weiner
9109fb7b35 mm: drop unneeded pgdat argument from free_area_init_node()
free_area_init_node() gets passed in the node id as well as the node
descriptor.  This is redundant as the function can trivially get the node
descriptor itself by means of NODE_DATA() and the node's id.

I checked all the users and NODE_DATA() seems to be usable everywhere
from where this function is called.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:16 -07:00
Andy Whitcroft
3c82d0ce2c buddy: clarify comments describing buddy merge
In __free_one_page(), the comment "Move the buddy up one level" appears
attached to the break and by implication when the break is taken we are
moving it up one level:

	if (!page_is_buddy(page, buddy, order))
		break;          /* Move the buddy up one level. */

In reality the inverse is true, we break out when we can no longer merge
this page with its buddy.  Looking back into pre-history (into the full
git history) it appears that these two lines accidentally got joined as
part of another change.

Move the comment down where it belongs below the if and clarify its
language.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:15 -07:00
KOSAKI Motohiro
e4048e5dc4 page allocator: inline some __alloc_pages() wrappers
Two zonelist patch series rewrote __page_alloc() largely.  Now, it is just
a wrapper function.  Inlining them will save a function call.

[akpm@linux-foundation.org: export __alloc_pages_internal]
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:14 -07:00
Johannes Weiner
b61bfa3c46 mm: move bootmem descriptors definition to a single place
There are a lot of places that define either a single bootmem descriptor or an
array of them.  Use only one central array with MAX_NUMNODES items instead.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:14 -07:00
Mel Gorman
68ad8df42e mm: print out the zonelists on request for manual verification
This patch prints out the zonelists during boot for manual verification by the
user if the mminit_loglevel is MMINIT_VERIFY or higher.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:14 -07:00
Mel Gorman
2dbb51c49f mm: make defensive checks around PFN values registered for memory usage
There are a number of different views to how much memory is currently active.
There is the arch-independent zone-sizing view, the bootmem allocator and
memory models view.

Architectures register this information at different times and is not
necessarily in sync particularly with respect to some SPARSEMEM limitations.

This patch introduces mminit_validate_memmodel_limits() which is able to
validate and correct PFN ranges with respect to the memory model.  It is only
SPARSEMEM that currently validates itself.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:13 -07:00
Mel Gorman
708614e618 mm: verify the page links and memory model
Print out information on how the page flags are being used if mminit_loglevel
is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the
flags regardless of loglevel.

When the page flags are updated with section, node and zone information, a
check are made to ensure the values can be retrieved correctly.  Finally we
confirm that pfn_to_page and page_to_pfn are the correct inverse functions.

[akpm@linux-foundation.org: fix printk warnings]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:13 -07:00
Mel Gorman
6b74ab97bc mm: add a basic debugging framework for memory initialisation
Boot initialisation is very complex, with significant numbers of
architecture-specific routines, hooks and code ordering.  While significant
amounts of the initialisation is architecture-independent, it trusts the data
received from the architecture layer.  This is a mistake, and has resulted in
a number of difficult-to-diagnose bugs.

This patchset adds some validation and tracing to memory initialisation.  It
also introduces a few basic defensive measures.  The validation code can be
explicitly disabled for embedded systems.

This patch:

Add additional debugging and verification code for memory initialisation.

Once enabled, the verification checks are always run and when required
additional debugging information may be outputted via a mminit_loglevel=
command-line parameter.

The verification code is placed in a new file mm/mm_init.c.  Ideally other mm
initialisation code will be moved here over time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:13 -07:00
Ingo Molnar
1a781a777b Merge branch 'generic-ipi' into generic-ipi-for-linus
Conflicts:

	arch/powerpc/Kconfig
	arch/s390/kernel/time.c
	arch/x86/kernel/apic_32.c
	arch/x86/kernel/cpu/perfctr-watchdog.c
	arch/x86/kernel/i8259_64.c
	arch/x86/kernel/ldt.c
	arch/x86/kernel/nmi_64.c
	arch/x86/kernel/smpboot.c
	arch/x86/xen/smp.c
	include/asm-x86/hw_irq_32.h
	include/asm-x86/hw_irq_64.h
	include/asm-x86/mach-default/irq_vectors.h
	include/asm-x86/mach-voyager/irq_vectors.h
	include/asm-x86/smp.h
	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-15 21:55:59 +02:00
Paul Jackson
5dab8ec139 mm, generic, x86 boot: more tweaks to hex prints of some pfn addresses
Fix some problems with (and applies on top of) a previous patch:
  x86 boot: show pfn addresses in hex not decimal in some kernel info printks

Primarily change "0x%8lx" format, which displays with a right aligned
space filled hex number (spaces between the "0x" prefix and the number),
into "%0#10lx" format, which zero fills instead of space fills, and
which uses the printf flag '#' to request the "0x" prefix instead of
hard coding it.

Also replace some other "0x%lx" formats with "%#lx", making use of the
'#' printf flag again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 13:10:40 +02:00
Paul Jackson
2bc0d2615a x86 boot: more consistently use type int for node ids
Everywhere I look, node id's are of type 'int', except in this one
case, which has 'unsigned long'.  Change this one to 'int' as well.
There is nothing special about the way this variable 'nid' is used in
this routine to justify using an unusual type here.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 12:51:38 +02:00
Paul Jackson
e2fc252e0c x86 boot: show pfn addresses in hex not decimal in some kernel info printks
Page frame numbers (the portion of physical addresses above the low
order page offsets) are displayed in several kernel debug and info
prints in decimal, not hex.  Decimal addresse are unreadable.  Use hex.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 12:51:37 +02:00
Yinghai Lu
d52d53b8a5 RFC x86: try to remove arch_get_ram_range
want to remove arch_get_ram_range, and use early_node_map instead.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 12:48:27 +02:00
Yinghai Lu
b5bc6c0e55 x86, mm: use add_highpages_with_active_regions() for high pages init v2
use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage

also remove page_is_reserved_early(), it is not needed anymore.

v2: fix the build of other platforms

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 10:37:25 +02:00
Yinghai Lu
cc1050bafe x86: replace shrink_active_range() with remove_active_range()
in case we have kva before ramdisk on a node, we still need to use
those ranges.

v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
    those area could be treat as free high pages.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 10:36:29 +02:00
Ingo Molnar
896395c290 Merge branch 'linus' into tmp.x86.mpparse.new 2008-07-08 10:32:56 +02:00
Mel Gorman
494de90098 Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
The non-NUMA case of build_zonelist_cache() would initialize the
zlcache_ptr for both node_zonelists[] to NULL.

Which is problematic, since non-NUMA only has a single node_zonelists[]
entry, and trying to zero the non-existent second one just overwrote the
nr_zones field instead.

As kswapd uses this value to determine what reclaim work is necessary,
the result is that kswapd never reclaims.  This causes processes to
stall frequently in low-memory situations as they always direct reclaim.
This patch initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Dan Williams <dan.j.williams@intel.com>
[ Simplified patch a bit ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-03 09:22:59 -07:00
Jens Axboe
15c8b6c1aa on_each_cpu(): kill unused 'retry' parameter
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:38 +02:00
Yinghai Lu
cc1a9d86ce mm, x86: shrink_active_range() should check all
Now we are using register_e820_active_regions() instead of
add_active_range() directly. So end_pfn could be different between the
value in early_node_map to node_end_pfn.

So we need to make shrink_active_range() smarter.

shrink_active_range() is a generic MM function in mm/page_alloc.c but
it is only used on 32-bit x86. Should we move it back to some file in
arch/x86?

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:31:44 +02:00
Russ Anderson
dfa7e20cc0 mm: Minor clean-up of page flags in mm/page_alloc.c
Minor source code cleanup of page flags in mm/page_alloc.c.
Move the definition of the groups of bits to page-flags.h.

The purpose of this clean up is that the next patch will
conditionally add a page flag to the groups.  Doing that
in a header file is cleaner than adding #ifdefs to the
C code.

Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-09 10:22:24 -07:00
Yinghai Lu
e8c27ac919 x86, numa, 32-bit: print out debug info on all kvas
also fix the print out of node_remap_end_vaddr

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-03 13:26:26 +02:00
Heiko Carstens
cd94b9dbfa memory hotplug: fix early allocation handling
Trying to add memory via add_memory() from within an initcall function
results in

bootmem alloc of 163840 bytes failed!
Kernel panic - not syncing: Out of memory

This is caused by zone_wait_table_init() which uses system_state to decide
if it should use the bootmem allocator or not.

When initcalls are handled the system_state is still SYSTEM_BOOTING but
the bootmem allocator doesn't work anymore.  So the allocation will fail.

To fix this use slab_is_available() instead as indicator like we do it
everywhere else.

[akpm@linux-foundation.org: coding-style fix]
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:12 -07:00
Andy Whitcroft
7eb54824b7 zonelists: handle a node zonelist with no applicable entries
When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
panics when trying node local allocations:

 BUG: unable to handle kernel NULL pointer dereference at 0000034c
 IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e
 *pdpt = 00000000013a7001 *pde = 0000000000000000
 Oops: 0000 [#1] SMP
 Modules linked in:

 Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
 EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0
 EIP is at get_page_from_freelist+0x4a/0x18e
 EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
 ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
 Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
 Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
        f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
        00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
 Call Trace:
  [<c10426e4>] __alloc_pages_internal+0x99/0x378
  [<c10429ca>] __alloc_pages+0x7/0x9
  [<c105e0e8>] kmem_getpages+0x66/0xef
  [<c105ec55>] cache_grow+0x8f/0x123
  [<c105f117>] ____cache_alloc_node+0xb9/0xe4
  [<c105f427>] kmem_cache_alloc_node+0x92/0xd2
  [<c122118c>] setup_cpu_cache+0xaf/0x177
  [<c105e6ca>] kmem_cache_create+0x2c8/0x353
  [<c13853af>] kmem_cache_init+0x1ce/0x3ad
  [<c13755c5>] start_kernel+0x178/0x1ee

This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
page.  In this system there is only ZONE_DMA and ZONE_NORMAL memory on
node 0, all other nodes are mapped above 4GB physical.  Here is a dump
of the zonelists from this system:

    zonelists pgdat=c1400000
     0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
     1: c14006c0:2 c1400360:1 c1400000:0
    zonelists pgdat=f7c00000
     0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
     1: f7c006c0:2
    zonelists pgdat=f7e00000
     0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
     1: f7e006c0:2

When performing a node local allocation we call get_page_from_freelist()
looking for a page.  It in turn calls first_zones_zonelist() which returns
a preferred_zone.  Where there are no applicable zones this will be NULL.
However we use this unconditionally, leading to this panic.

Where there are no applicable zones there is no possibility of a successful
allocation, so simply fail the allocation.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:11 -07:00
Johannes Weiner
f723215419 mm: don't drop a partial page in a zone's memory map size
In a zone's present pages number, account for all pages occupied by the
memory map, including a partial.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:07 -07:00
Heiko Carstens
76cdd58e55 memory_hotplug: always initialize pageblock bitmap
Trying to online a new memory section that was added via memory hotplug
sometimes results in crashes when the new pages are added via __free_page.
 Reason for that is that the pageblock bitmap isn't initialized and hence
contains random stuff.  That means that get_pageblock_migratetype()
returns also random stuff and therefore

	list_add(&page->lru,
		&zone->free_area[order].free_list[migratetype]);

in __free_one_page() tries to do a list_add to something that isn't even
necessarily a list.

This happens since 86051ca5ea ("mm: fix
usemap initialization") which makes sure that the pageblock bitmap gets
only initialized for pages present in a zone.  Unfortunately for hot-added
memory the zones "grow" after the memmap and the pageblock memmap have
been initialized.  Which means that the new pages have an unitialized
bitmap.  To solve this the calls to grow_zone_span() and grow_pgdat_span()
are moved to __add_zone() just before the initialization happens.

The patch also moves the two functions since __add_zone() is the only
caller and I didn't want to add a forward declaration.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:15 -07:00
Thomas Gleixner
3ac7fe5a4a infrastructure to debug (dynamic) objects
We can see an ever repeating problem pattern with objects of any kind in the
kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where
we have no chance to decode the root cause anymore.  One problem spot are
kernel timers, where the detection of the problem often happens in interrupt
context and usually causes the machine to panic.

While working on a timer related bug report I had to hack specialized code
into the timer subsystem to get a reasonable hint for the root cause.  This
debug hack was fine for temporary use, but far from a mergeable solution due
to the intrusiveness into the timer code.

The code further lacked the ability to detect and report the root cause
instantly and keep the system operational.

Keeping the system operational is important to get hold of the debug
information without special debugging aids like serial consoles and special
knowledge of the bug reporter.

The problems described above are not restricted to timers, but timers tend to
expose it usually in a full system crash.  Other objects are less explosive,
but the symptoms caused by such mistakes can be even harder to debug.

Instead of creating specialized debugging code for the timer subsystem a
generic infrastructure is created which allows developers to verify their code
and provides an easy to enable debug facility for users in case of trouble.

The debugobjects core code keeps track of operations on static and dynamic
objects by inserting them into a hashed list and sanity checking them on
object operations and provides additional checks whenever kernel memory is
freed.

The tracked object operations are:
- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed and the
subsystem specific code can provide a fixup function which allows to prevent
the damage of the operation.  When the sanity check triggers a warning message
and a stack trace is printed.

The list of operations can be extended if the need arises.  For now it's
limited to the requirements of the first user (timers).

The core code enqueues the objects into hash buckets.  The hash index is
generated from the address of the object to simplify the lookup for the check
on kfree/vfree.  Each bucket has it's own spinlock to avoid contention on a
global lock.

The debug code can be compiled in without being active.  The runtime overhead
is minimal and could be optimized by asm alternatives.  A kernel command line
option enables the debugging code.

Thanks to Ingo Molnar for review, suggestions and cleanup patches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:53 -07:00
Nishanth Aravamudan
a41f24ea9f page allocator: smarter retry of costly-order allocations
Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim does no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may continue to reclaim small orders of pages all over, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be requred. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).

This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
allocations, we will try up to some point (at least 1<<order reclaimed
pages), rather than forever (which is the case for allocations <=
PAGE_ALLOC_COSTLY_ORDER).

This change improves the /proc/sys/vm/nr_hugepages interface with a
follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
than administrators repeatedly echo'ing a particular value into the
sysctl, and forcing reclaim into action manually, this change allows for
the sysctl to attempt a reasonable effort itself. Similarly, dynamic
pool growth should be more successful under load, as lumpy reclaim can
try to free up pages, rather than failing right away.

Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to every succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:05:58 -07:00
Nishanth Aravamudan
ab857d0938 mm: fix misleading __GFP_REPEAT related comments
The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
core VM have somewhat differing comments as to their actual semantics.
Annoyingly, the flags definition has inline and header comments, which might
be interpreted as not being equivalent.  Just add references to the header
comments in the inline ones so they don't go out of sync in the future.  In
their use in __alloc_pages() clarify that the current implementation treats
low-order allocations and __GFP_REPEAT allocations as distinct cases.

To clarify, the flags' semantics are:

__GFP_NORETRY means try no harder than one run through __alloc_pages

__GFP_REPEAT means __GFP_NOFAIL

__GFP_NOFAIL means repeat forever

order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:05:58 -07:00
KAMEZAWA Hiroyuki
86051ca5ea mm: fix usemap initialization
usemap must be initialized only when pfn is within zone.  If not, it corrupts
memory.

And this patch also reduces the number of calls to set_pageblock_migratetype()
from
	(pfn & (pageblock_nr_pages -1)
to
	!(pfn & (pageblock_nr_pages-1)
it should be called once per pageblock.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:05:58 -07:00
Pavel Machek
2309f9e6fe mm/page_alloc.c: remove hand-coded get_order()
Remove hand-coded get_order() from page_alloc.c.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:26 -07:00
Yasunori Goto
0c0a4a517a memory hotplug: free memmaps allocated by bootmem
This patch is to free memmaps which is allocated by bootmem.

Freeing usemap is not necessary.  The pages of usemap may be necessary for
other sections.

If removing section is last section on the node, its section is the final user
of usemap page.  (usemaps are allocated on its section by previous patch.) But
it shouldn't be freed too, because the section must be logical offline state
which all pages are isolated against page allocater.  If it is freed, page
alloctor may use it which will be removed physically soon.  It will be
disaster.  So, this patch keeps it as it is.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:26 -07:00
Christoph Lameter
0a128b2b1a pageflags: eliminate PG_xxx aliases
Remove aliases of PG_xxx.  We can easily drop those now and alias by
specifying the PG_xxx flag in the macro that generates the functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:22 -07:00
S.Caglar Onur
f05111f501 mm/page_alloc.c: fix indentation
zlc_setup(): handle jiffies wraparound
(10ed273f50) changes tab with spaces

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
Mel Gorman
19770b3260 mm: filter based on a nodemask as well as a gfp_mask
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy.  As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering.  This eliminates the need for
MPOL_BIND to create a custom zonelist.

A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist.  I.e., pages will be allocated from the closest allowed node with
available memory.

[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman
dd1a239f6f mm: have zonelist contains structs with both a zone pointer and zone_idx
Filtering zonelists requires very frequent use of zone_idx().  This is costly
as it involves a lookup of another structure and a substraction operation.  As
the zone_idx is often required, it should be quickly accessible.  The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary.  Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
54a6eb5c47 mm: use two zonelist that are filtered by GFP mask
Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations.  Based on the zones
allowed by a gfp mask, one of these zonelists is selected.  All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists.  The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
18ea7e710d mm: remember what the preferred zone is for zone_statistics
On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign.  It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.

This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it.  This simplifies the reading of later patches in
this set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
0e88460da6 mm: introduce node_zonelist() for accessing the zonelist for a GFP mask
Introduce a node_zonelist() helper function.  It is used to lookup the
appropriate zonelist given a node and a GFP mask.  The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
necessary, it can be merged with the next patch in this set without problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
dac1d27bc8 mm: use zonelists instead of zones when direct reclaiming pages
The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags.  The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE.  This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up
the appropriate zonelist for a GFP mask which simplifies patches later in the
set.

The third patch defines/remembers the "preferred zone" for numa statistics, as
it is no longer always the first zone in a zonelist.

The forth patch replaces multiple zonelists with two zonelists that are
filtered.  The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask.
Two zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration.  In real
workloads the gain/loss will depend on how much the userspace portion of the
benchmark benefits from having more cache available due to reduced referencing
of zonelists.

These are the range of performance losses/gains when running against
2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
ppc64 both NUMA and non-NUMA.
			     loss   to  gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation.  Similarly, direct reclaim of pages
iterates over an array of zones.  For consistency, this patch converts direct
reclaim to use a zonelist.  No functionality is changed by this patch.  This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mike Travis
c5f59f0833 nodemask: use new node_to_cpumask_ptr function
* Use new node_to_cpumask_ptr.  This creates a pointer to the
    cpumask for a given node.  This definition is in mm patch:

	asm-generic-add-node_to_cpumask_ptr-macro.patch

  * Use new set_cpus_allowed_ptr function.

Depends on:
	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
	[sched-devel]: sched: add new set_cpus_allowed_ptr function
	[x86/latest]: x86: add cpus_scnprintf function

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Hugh Dickins
9442ec9df4 memcg: bad page if page_cgroup when free
Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
were set here, it'd likely cause corruption when the page is reused.

Don't use page_assign_page_cgroup to clear it: that should be private to
memcontrol.c, and always called with the lock taken; and memmap_init_zone
doesn't need it either - like page->mapping and other pointers throughout the
kernel, Linux assumes pointers in zeroed structures are NULL pointers.

Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
KOSAKI Motohiro
10ed273f50 zlc_setup(): handle jiffies wraparound
jiffies subtraction may cause an overflow problem.  It should be using
time_after().

[akpm@linux-foundation.org: include jiffies.h]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:10 -08:00
Alexander van Heukelum
b5a0e01132 Solve section mismatch for free_area_init_core.
WARNING: vmlinux.o(.meminit.text+0x649):
Section mismatch in reference from the
function free_area_init_core() to the function .init.text:setup_usemap()
The function __meminit free_area_init_core() references
a function __init setup_usemap().
If free_area_init_core is only used by setup_usemap then
annotate free_area_init_core with a matching annotation.

The warning is covers this stack of functions in mm/page_alloc.c:

alloc_bootmem_node must be marked __init.
alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
(usemap_size is only used by setup_usemap, if !SPARSEMEM.)
setup_usemap is only used by free_area_init_core.
free_area_init_core is only used by free_area_init_node.

free_area_init_node is used by:
arch/alpha/mm/numa.c: __init paging_init()
arch/arm/mm/init.c: __init bootmem_init_node()
arch/avr32/mm/init.c: __init paging_init()
arch/cris/arch-v10/mm/init.c: __init paging_init()
arch/cris/arch-v32/mm/init.c: __init paging_init()
arch/m32r/mm/discontig.c: __init zone_sizes_init()
arch/m32r/mm/init.c: __init zone_sizes_init()
arch/m68k/mm/motorola.c: __init paging_init()
arch/m68k/mm/sun3mmu.c: __init paging_init()
arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
arch/parisc/mm/init.c: __init paging_init()
arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
arch/sparc64/mm/init.c: __init paging_init()
mm/page_alloc.c: __init free_area_init_nodes()
mm/page_alloc.c: __init free_area_init()
and
mm/memory_hotplug.c: hotadd_new_pgdat()

hotadd_new_pgdat can not be an __init function, but:

It is compiled for MEMORY_HOTPLUG configurations only
MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
X86_64_ACPI_NUMA depends on X86_64
ARCH_FLATMEM_ENABLE depends on X86_32
ARCH_DISCONTIGMEM_ENABLE depends on X86_32
So X86_64_ACPI_NUMA implies SPARSEMEM, right?

So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
them __meminit for SPARSEMEM configurations.  This is ok, because then the
calls to alloc_bootmem_node are also avoided.

Compile-tested on:
silly minimal config
defconfig x86_32
defconfig x86_64
defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Harvey Harrison
edde08f2a8 misc: removal of final callers using fastcall
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:31 -08:00
Balbir Singh
8a9f3ccd24 Memory controller: memory accounting
Add the accounting hooks.  The accounting is carried out for RSS and Page
Cache (unmapped) pages.  There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time.  Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache().  Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier.  A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup.  Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: <Valdis.Kletnieks@vt.edu>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Larry Woodman
e6f3602d2c Include count of pagecache pages in show_mem() output
The show_mem() output does not include the total number of pagecache
pages.  This would be helpful when analyzing the debug information in
the /var/log/messages file after OOM kills occur.

This patch includes the total pagecache pages in that output.

Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:19 -08:00
Harvey Harrison
920c7a5d0c mm: remove fastcall from mm/
fastcall is always defined to be empty, remove it

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
Andi Kleen
1e548deb5d page allocator: remove unused arguments in zone_init_free_lists()
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
Christoph Lameter
3dfa5721f1 Page allocator: get rid of the list of cold pages
We have repeatedly discussed if the cold pages still have a point. There is
one way to join the two lists: Use a single list and put the cold pages at the
end and the hot pages at the beginning. That way a single list can serve for
both types of allocations.

The discussion of the RFC for this and Mel's measurements indicate that
there may not be too much of a point left to having separate lists for
hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Martin Bligh <mbligh@mbligh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
Christoph Lameter
9f8f217253 Page allocator: clean up pcp draining functions
- Add comments explaing how drain_pages() works.

- Eliminate useless functions

- Rename drain_all_local_pages to drain_all_pages(). It does drain
  all pages not only those of the local processor.

- Eliminate useless interrupt off / on sequences. drain_pages()
  disables interrupts on its own. The execution thread is
  pinned to processor by the caller. So there is no need to
  disable interrupts.

- Put drain_all_pages() declaration in gfp.h and remove the
  declarations from suspend.h and from mm/memory_hotplug.c

- Make software suspend call drain_all_pages(). The draining
  of processor local pages is may not the right approach if
  software suspend wants to support SMP. If they call drain_all_pages
  then we can make drain_pages() static.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:17 -08:00
Sam Ravnborg
1d6f4e60e7 mm: fix section mismatch warning in page_alloc.c
With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw
following warning:
WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback')

The culprit was zone_batchsize() which were annotated __devinit but used
from process_zones() which is annotated __cpuinit.  zone_batchsize() are
used from another function annotated __meminit so the only valid option is
to drop the annotation of zone_batchsize() so we know it is always valid to
use it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-17 15:38:58 -08:00
Thomas Bogendoerfer
467bc461d2 Fix crash with FLAT_MEMORY and ARCH_PFN_OFFSET != 0
When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in
memmap_init_zone().  This bug got introduced by commit
c713216dee

Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-08 16:10:36 -08:00
Mel Gorman
81eabcbe0b mm: fix page allocation for larger I/O segments
In some cases the IO subsystem is able to merge requests if the pages are
adjacent in physical memory.  This was achieved in the allocator by having
expand() return pages in physically contiguous order in situations were a
large buddy was split.  However, list-based anti-fragmentation changed the
order pages were returned in to avoid searching in buffered_rmqueue() for a
page of the appropriate migrate type.

This patch restores behaviour of rmqueue_bulk() preserving the physical
order of pages returned by the allocator without incurring increased search
costs for anti-fragmentation.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mark Lord <mlord@pobox.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
Mel Gorman
ba72cb8cb0 Fix boot problem with iSeries lacking hugepage support
Ordinarily the size of a pageblock is determined at compile-time based on the
hugepage size. On PPC64, the hugepage size is determined at runtime based on
what is supported by the machine. With legacy machines such as iSeries that
do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
being set to -PAGE_SHIFT and a crash results shortly afterwards.

This patch adds a function to select a sensible value for pageblock order by
default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
is used.

This is a fix for 2.6.24.

Credit goes to Stephen Rothwell for identifying the bug and testing candidate
patches.  Additional credit goes to Andy Whitcroft for spotting a problem
with respects to IA-64 before releasing. Additional credit to David Gibson
for testing with the libhugetlbfs test suite.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:51 -08:00
Hugh Dickins
09f345da75 prep_zero_page: remove bogus BUG_ON
2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait"
assertion.  Presumably added under the misconception that clear_highpage
uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-28 11:04:28 -08:00
Linus Torvalds
44048d700b Revert "Bias the placement of kernel pages at lower PFNs"
This reverts commit 5adc5be7cd.

Alexey Dobriyan reports that it causes huge slowdowns under some loads,
in his case a "mkfs.ext2" on a 30G partition.  With the placement bias,
the mkfs took over four minutes, with it reverted it's back to about ten
seconds for Alexey.

Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-12 14:14:44 -08:00
Simon Arlott
183ff22bb6 spelling fixes: mm/
Spelling fixes in mm/.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-20 01:27:18 +02:00
David Rientjes
ff0ceb9deb oom: serialize out of memory calls
A final allocation attempt with a very high watermark needs to be attempted
before invoking out_of_memory().  OOM killer serialization needs to occur
before this final attempt, otherwise tasks attempting to OOM-lock all zones in
its zonelist may spin and acquire the lock unnecessarily after the OOM
condition has already been alleviated.

If the final allocation does succeed, the zonelist is simply OOM-unlocked and
__alloc_pages() returns the page.  Otherwise, the OOM killer is invoked.

If the task cannot acquire OOM-locks on all zones in its zonelist, it is put
to sleep and the allocation is retried when it gets rescheduled.  One of its
zones is already marked as being in the OOM killer so it'll hopefully be
getting some free memory soon, at least enough to satisfy a high watermark
allocation attempt.  This prevents needlessly killing a task when the OOM
condition would have already been alleviated if it had simply been given
enough time.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
David Rientjes
e815af95f9 oom: change all_unreclaimable zone member to flags
Convert the int all_unreclaimable member of struct zone to unsigned long
flags.  This can now be used to specify several different zone flags such as
all_unreclaimable and reclaim_in_progress, which can now be removed and
converted to a per-zone flag.

Flags are set and cleared as follows:

	zone_set_flag(struct zone *zone, zone_flags_t flag)
	zone_clear_flag(struct zone *zone, zone_flags_t flag)

Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
which have the same semantics as the old zone->all_unreclaimable and
zone->reclaim_in_progress, respectively.  Also converts all current users that
set or clear either flag to use the new interface.

Helper functions are defined to test the flags:

	int zone_is_all_unreclaimable(const struct zone *zone)
	int zone_is_reclaim_locked(const struct zone *zone)

All flag operators are of the atomic variety because there are currently
readers that are implemented that do not take zone->lock.

[akpm@linux-foundation.org: add needed include]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
David Rientjes
5a3135c2e7 oom: move prototypes to appropriate header file
Move the OOM killer's extern function prototypes to include/linux/oom.h and
include it where necessary.

[clg@fr.ibm.com: build fix]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
KAMEZAWA Hiroyuki
0c0e619589 memory unplug: page offline
Logic.
 - set all pages in  [start,end)  as isolated migration-type.
   by this, all free pages in the range will be not-for-use.
 - Migrate all LRU pages in the range.
 - Test all pages in the range's refcnt is zero or not.

Todo:
 - allocate migration destination page from better area.
 - confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
 (I don't like this kind of page but..
 - Find out pages which cannot be migrated.
 - more running tests.
 - Use reclaim for unplugging other memory type area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki
a5d76b54a3 memory unplug: page isolation
Implement generic chunk-of-pages isolation method by using page grouping ops.

This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
 - MIGRATE_TYPES increases.
 - bitmap for migratetype is enlarged.

pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
By this, you can isolated *freed* pages from users. How-to-free pages is not
a purpose of this patch. You may use reclaim and migrate codes to free pages.

If start_isolate_page_range(start,end) is called,
 - migratetype of the range turns to be MIGRATE_ISOLATE  if
   its type is MIGRATE_MOVABLE. (*) this check can be updated if other
   memory reclaiming works make progress.
 - MIGRATE_ISOLATE is not on migratetype fallback list.
 - All free pages and will-be-freed pages are isolated.
To check all pages in the range are isolated or not,  use test_pages_isolated(),
To cancel isolation, use undo_isolate_page_range().

Changes V6 -> V7
 - removed unnecessary #ifdef

There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
Mel Gorman
48f13bf3e7 Breakout page_order() to internal.h to avoid special knowledge of the buddy allocator
The statistics patch later needs to know what order a free page is on the free
lists.  Rather than having special knowledge of page_private() when
PageBuddy() is set, this patch places out page_order() in internal.h and adds
a VM_BUG_ON to catch using it on non-PageBuddy pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Adrian Bunk
484f51f820 mm/page_alloc.c: make code static
This patch makes needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Mel Gorman
467c996c1e Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo
This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
 The information is collected only on request so there is no runtime overhead.
 The statistics are in three parts:

The first part prints information on the size of blocks that pages are
being grouped on and looks like

Page block order: 10
Pages per block:  1024

The second part is a more detailed version of /proc/buddyinfo and looks like

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4

The third part looks like

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
Node 0, zone      DMA            0            1            2            1
Node 0, zone   Normal            3           17           94            4

To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
/proc/pagetypeinfo to reduce code duplication.  It seems specific to what
vmstat.c requires but could be broken out as a general utility function in
mmzone.c if there were other other potential users.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
d9c2340052 Do not depend on MAX_ORDER when grouping pages by mobility
Currently mobility grouping works at the MAX_ORDER_NR_PAGES level.  This makes
sense for the majority of users where this is also the huge page size.
However, on platforms like ia64 where the huge page size is runtime
configurable it is desirable to group at a lower order.  On x86_64 and
occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.

This patch groups pages together based on the value of HUGETLB_PAGE_ORDER.  It
uses a compile-time constant if possible and a variable where the huge page
size is runtime configurable.

It is assumed that grouping should be done at the lowest sensible order and
that the user would not want to override this.  If this is not true,
page_block order could be forced to a variable initialised via a boot-time
kernel parameter.

One potential issue with this patch is that IA64 now parses hugepagesz with
early_param() instead of __setup().  __setup() is called after the memory
allocator has been initialised and the pageblock bitmaps already setup.  In
tests on one IA64 there did not seem to be any problem with using
early_param() and in fact may be more correct as it guarantees the parameter
is handled before the parsing of hugepages=.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
d100313fd6 Fix calculation in move_freepages_block for counting pages
move_freepages_block() returns the number of blocks moved.  This value is used
to determine if a block of pages should be stolen for the exclusive use of a
migrate type or not.  However, the value returned is being used correctly.
This patch fixes the calculation to return the number of base pages that have
been moved.

This should be considered a fix to the patch
move-free-pages-between-lists-on-steal.patch

Credit to Andy Whitcroft for spotting the problem.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
64c5e135bf don't group high order atomic allocations
Grouping high-order atomic allocations together was intended to allow
bursty users of atomic allocations to work such as e1000 in situations
where their preallocated buffers were depleted.  This did not work in at
least one case with a wireless network adapter needing order-1 allocations
frequently.  To resolve that, the free pages used for min_free_kbytes were
moved to separate contiguous blocks with the patch
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.

It is felt that keeping the free pages in the same contiguous blocks should
be sufficient for bursty short-lived high-order atomic allocations to
succeed, maybe even with the e1000.  Even if there is a failure, increasing
the value of min_free_kbytes will free pages as contiguous bloks in
contrast to the standard buddy allocator which makes no attempt to keep the
minimum number of free pages contiguous.

This patch backs out grouping high order atomic allocations together to
determine if it is really needed or not.  If a new report comes in about
high-order atomic allocations failing, the feature can be reintroduced to
determine if it fixes the problem or not.  As a side-effect, this patch
reduces by 1 the number of bits required to track the mobility type of
pages within a MAX_ORDER_NR_PAGES block.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
ac0e5b7a6b remove PAGE_GROUP_BY_MOBILITY
Grouping pages by mobility can be disabled at compile-time. This was
considered undesirable by a number of people. However, in the current stack of
patches, it is not a simple case of just dropping the configurable patch as it
would cause merge conflicts.  This patch backs out the configuration option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
56fd56b868 Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks
The standard buddy allocator always favours the smallest block of pages.
The effect of this is that the pages free to satisfy min_free_kbytes tends
to be preserved since boot time at the same location of memory ffor a very
long time and as a contiguous block.  When an administrator sets the
reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
that remain free.  This allows the occasional high atomic allocation to
succeed up until the point the blocks are split.  In practice, it is
difficult to split these blocks but when they do split, the benefit of
having min_free_kbytes for contiguous blocks disappears.  Additionally,
increasing min_free_kbytes once the system has been running for some time
has no guarantee of creating contiguous blocks.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
blocks when there are no free pages of the appropriate type available.  A
side-effect of this is that all blocks in memory tends to be used up and
the contiguous free blocks from boot time are not preserved like in the
vanilla allocator.  This can cause a problem if a new caller is unwilling
to reclaim or does not reclaim for long enough.

A failure scenario was found for a wireless network device allocating
order-1 atomic allocations but the allocations were not intense or frequent
enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
This was reproduced on a desktop by booting with mem=256mb, forcing the
driver to allocate at order-1, running a bittorrent client (downloading a
debian ISO) and building a kernel with -j2.

This patch addresses the problem on the desktop machine booted with
mem=256mb.  It works by setting aside a reserve of MAX_ORDER_NR_PAGES
blocks, the number of which depends on the value of min_free_kbytes.  These
blocks are only fallen back to when there is no other free pages.  Then the
smallest possible page is used just like the normal buddy allocator instead
of the largest possible page to preserve contiguous pages The pages in free
lists in the reserve blocks are never taken for another migrate type.  The
results is that even if min_free_kbytes is set to a low value, contiguous
blocks will be preserved in the MIGRATE_RESERVE blocks.

This works better than the vanilla allocator because if min_free_kbytes is
increased, a new reserve block will be chosen based on the location of
reclaimable pages and the block will free up as contiguous pages.  In the
vanilla allocator, no effort is made to target a block of pages to free as
contiguous pages and min_free_kbytes pages are scattered randomly.

This effect has been observed on the test machine.  min_free_kbytes was set
initially low but it was kept as a contiguous free block within
MIGRATE_RESERVE.  min_free_kbytes was then set to a higher value and over a
period of time, the free blocks were within the reserve and coalescing.
How long it takes to free up depends on how quickly LRU is rotating.
Amusingly, this means that more activity will free the blocks faster.

This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
effective than grouping contiguous free pages together.  It all depends on
whether the number of active atomic high allocations exceeds
min_free_kbytes or not.  If the number of active allocations exceeds
min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
should be set higher.  Once there are no more reports of allocation
failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC and
see if the reports stay missing.

Credit to Mariusz Kozlowski for discovering the problem, describing the
failure scenario and testing patches and scenarios.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
46dafbca2b Be more agressive about stealing when MIGRATE_RECLAIMABLE allocations fallback
MIGRATE_RECLAIMABLE allocations tend to be very bursty in nature like when
updatedb starts.  It is likely this will occur in situations where MAX_ORDER
blocks of pages are not free.  This means that updatedb can scatter
MIGRATE_RECLAIMABLE pages throughout the address space.  This patch is more
agressive about stealing blocks of pages for MIGRATE_RECLAIMABLE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
5adc5be7cd Bias the placement of kernel pages at lower PFNs
This patch chooses blocks with lower PFNs when placing kernel allocations.
This is particularly important during fallback in low memory situations to
stop unmovable pages being placed throughout the entire address space.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
9ef9acb05a Do not group pages by mobility type on low memory systems
Grouping pages by mobility can only successfully operate when there are more
MAX_ORDER_NR_PAGES areas than mobility types.  When there are insufficient
areas, fallbacks cannot be avoided.  This has noticeable performance impacts
on machines with small amounts of memory in comparison to MAX_ORDER_NR_PAGES.
For example, on IA64 with a configuration including huge pages spans 1GiB with
MAX_ORDER_NR_PAGES so would need at least 4GiB of RAM before grouping pages by
mobility would be useful.  In comparison, an x86 would need 16MB.

This patch checks the size of vm_total_pages in build_all_zonelists(). If
there are not enough areas,  mobility is effectivly disabled by considering
all allocations as the same type (UNMOVABLE).  This is achived via a
__read_mostly flag.

With this patch, performance is comparable to disabling grouping pages
by mobility at compile-time on a test machine with insufficient memory.
With this patch, it is reasonable to get rid of grouping pages by mobility
a compile-time option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
e010487dbe Group high-order atomic allocations
In rare cases, the kernel needs to allocate a high-order block of pages
without sleeping.  For example, this is the case with e1000 cards configured
to use jumbo frames.  Migrating or reclaiming pages in this situation is not
an option.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_HIGHATOMIC type are exactly what they sound
like.  Care is taken that pages of other migrate types do not use the same
blocks as high-order atomic allocations.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
e12ba74d8f Group short-lived and reclaimable kernel allocations
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
c361be55b3 Move free pages between lists on steal
When a fallback occurs, there will be free pages for one allocation type
stored on the list for another.  When a large steal occurs, this patch will
move all the free pages within one list to the other.

[y-goto@jp.fujitsu.com: fix BUG_ON check at move_freepages()]
[apw@shadowen.org: Move to using pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andy Whitcroft <andyw@uk.ibm.com>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
e2c55dc87f Drain per-cpu lists when high-order allocations fail
Per-cpu pages can accidentally cause fragmentation because they are free, but
pinned pages in an otherwise contiguous block.  When this patch is applied,
the per-cpu caches are drained after the direct-reclaim is entered if the
requested order is greater than 0.  It simply reuses the code used by suspend
and hotplug.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
b92a6edd4b Add a configure option to group pages by mobility
The grouping mechanism has some memory overhead and a more complex allocation
path.  This patch allows the strategy to be disabled for small memory systems
or if it is known the workload is suffering because of the strategy.  It also
acts to show where the page groupings strategy interacts with the standard
buddy allocator.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
535131e692 Choose pages from the per-cpu list based on migration type
The freelists for each migrate type can slowly become polluted due to the
per-cpu list.  Consider what happens when the following happens

1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page is in the middle of a migratable region. This
patch prevents this leak occuring by storing the MIGRATE_ type of the page in
page->private. On allocate, a page will only be returned of the desired type,
else more pages will be allocated. This may temporarily allow a per-cpu list
to go over the pcp->high limit but it'll be corrected on the next free. Care
is taken to preserve the hotness of pages recently freed.

The additional code is not measurably slower for the workloads we've tested.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
b2a0ac8875 Split the free lists for movable and unmovable allocations
This patch adds the core of the fragmentation reduction strategy.  It works by
grouping pages together based on their ability to migrate or be reclaimed.
Basically, it works by breaking the list in zone->free_area list into
MIGRATE_TYPES number of lists.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
835c134ec4 Add a bitmap that is used to track flags affecting a block of pages
Here is the latest revision of the anti-fragmentation patches.  Of particular
note in this version is special treatment of high-order atomic allocations.
Care is taken to group them together and avoid grouping pages of other types
near them.  Artifical tests imply that it works.  I'm trying to get the
hardware together that would allow setting up of a "real" test.  If anyone
already has a setup and test that can trigger the atomic-allocation problem,
I'd appreciate a test of these patches and a report.  The second major change
is that these patches will apply cleanly with patches that implement
anti-fragmentation through zones.

kernbench shows effectively no performance difference varying between -0.2%
and +2% on a variety of test machines.  Success rates for huge page allocation
are dramatically increased.  For example, on a ppc64 machine, the vanilla
kernel was only able to allocate 1% of memory as a hugepage and this was due
to a single hugepage reserved as min_free_kbytes.  With these patches applied,
17% was allocatable as superpages.  With reclaim-related fixes from Andy
Whitcroft, it was 40% and further reclaim-related improvements should increase
this further.

Changelog Since V28
o Group high-order atomic allocations together
o It is no longer required to set min_free_kbytes to 10% of memory. A value
  of 16384 in most cases will be sufficient
o Now applied with zone-based anti-fragmentation
o Fix incorrect VM_BUG_ON within buffered_rmqueue()
o Reorder the stack so later patches do not back out work from earlier patches
o Fix bug were journal pages were being treated as movable
o Bias placement of non-movable pages to lower PFNs
o More agressive clustering of reclaimable pages in reactions to workloads
  like updatedb that flood the size of inode caches

Changelog Since V27

o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
  the mistaken impression that it was the 100% solution for high order
  allocations. Instead, it greatly increases the chances high-order
  allocations will succeed and lays the foundation for defragmentation and
  memory hot-remove to work properly
o Redefine page groupings based on ability to migrate or reclaim instead of
  basing on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
  searched for a page of the appropriate type
o Added more explanation commentary
o Fix up bug in pageblock code where bitmap was used before being initalised

Changelog Since V26
o Fix double init of lists in setup_pageset

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together.  When pages are migrated (or reclaimed under
memory pressure), large contiguous pages will be freed.

This patch works by categorising allocations by their ability to migrate;

Movable - The pages may be moved with the page migration mechanism. These are
	generally userspace pages.

Reclaimable - These are allocations for some kernel caches that are
	reclaimable or allocations that are known to be very short-lived.

Unmovable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type

HighAtomic - These are high-order allocations belonging to callers that
	cannot sleep or perform any IO. In practice, this is restricted to
	jumbo frame allocation for network receive. It is assumed that the
	allocations are short-lived

Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
there is one for each type of reclaimability.  Once a 2^MAX_ORDER block of
pages is split for a type of allocation, it is added to the free-lists for
that type, in effect reserving it.  Hence, over time, pages of the different
types can be clustered together.

When the preferred freelists are expired, the largest possible block is taken
from an alternative list.  Buddies that are split from that large block are
placed on the preferred allocation-type freelists to mitigate fragmentation.

This implementation gives best-effort for low fragmentation in all zones.
Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
(MAX_ORDER-1)) pages in most cases.  This would be 16384 on x86 and x86_64 for
example.

Our tests show that about 60-70% of physical memory can be allocated on a
desktop after a few days uptime.  In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of the
test.  To compare, a standard kernel was getting < 1% of memory as large pages
on a desktop and about 8-12% of memory as large pages at the end of stress
tests.

Following this email are 12 patches that implement thie page grouping feature.
 The first patch introduces a mechanism for storing flags related to a whole
block of pages.  Then allocations are split between movable and all other
allocations.  Following that are patches to deal with per-cpu pages and make
the mechanism configurable.  The next patch moves free pages between lists
when partially allocated blocks are used for pages of another migrate type.
The second last patch groups reclaimable kernel allocations such as inode
caches together.  The final patch related to groupings keeps high-order atomic
allocations.

The last two patches are more concerned with control of fragmentation.  The
second last patch biases placement of non-movable allocations towards the
start of memory.  This is with a view of supporting memory hot-remove of DIMMs
with higher PFNs in the future.  The biasing could be enforced a lot heavier
but it would cost.  The last patch agressively clusters reclaimable pages like
inode caches together.

The fragmentation reduction strategy needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole a MAX_ORDER block of
pages.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
allocated during initialisation.  SPARSEMEM statically allocates the bitmap in
a struct mem_section so that bitmaps do not have to be resized during memory
hotadd.  This wastes a small amount of memory per unused section (usually
sizeof(unsigned long)) but the complexity of dynamically allocating the memory
is quite high.

Additional credit to Andy Whitcroft who reviewed up an earlier implementation
of the mechanism an suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Lee Schermerhorn
37b07e4163 memoryless nodes: fixup uses of node_online_map in generic code
Here's a cut at fixing up uses of the online node map in generic code.

mm/shmem.c:shmem_parse_mpol()

	Ensure nodelist is subset of nodes with memory.
	Use node_states[N_HIGH_MEMORY] as default for missing
	nodelist for interleave policy.

mm/shmem.c:shmem_fill_super()

	initialize policy_nodes to node_states[N_HIGH_MEMORY]

mm/page-writeback.c:highmem_dirtyable_memory()

	sum over nodes with memory

mm/page_alloc.c:zlc_setup()

	allowednodes - use nodes with memory.

mm/page_alloc.c:default_zonelist_order()

	average over nodes with memory.

mm/page_alloc.c:find_next_best_node()

	skip nodes w/o memory.
	N_HIGH_MEMORY state mask may not be initialized at this time,
	unless we want to depend on early_calculate_totalpages() [see
	below].  Will ZONE_MOVABLE ever be configurable?

mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

	spread kernelcore over nodes with memory.

	This required calling early_calculate_totalpages()
	unconditionally, and populating N_HIGH_MEMORY node
	state therein from nodes in the early_node_map[].
	If we can depend on this, we can eliminate the
	population of N_HIGH_MEMORY mask from __build_all_zonelists()
	and use the N_HIGH_MEMORY mask in find_next_best_node().

mm/mempolicy.c:mpol_check_policy()

	Ensure nodes specified for policy are subset of
	nodes with memory.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
523b945855 Memoryless nodes: Fix GFP_THISNODE behavior
GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a nodelist.  That only works if the node has memory.  A
memoryless node will have its first node on another pgdat (node).

GFP_THISNODE currently will return simply memory on the first pgdat.  Thus it
is returning memory on other nodes.  GFP_THISNODE should fail if there is no
local memory on a node.

Add a new set of zonelists for each node that only contain the nodes that
belong to the zones itself so that no fallback is possible.

Then modify gfp_type to pickup the right zone based on the presence of
__GFP_THISNODE.

Drop the existing GFP_THISNODE checks from the page_allocators hot path.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
633c0666b5 Memoryless nodes: drop one memoryless node boot warning
get_pfn_range_for_nid() is called multiple times for each node at boot time.
Each time, it will warn about nodes with no memory, resulting in boot messages
like:

        Node 0 active with no memory
        Node 0 active with no memory
        Node 0 active with no memory
        Node 0 active with no memory
        Node 0 active with no memory
        Node 0 active with no memory
        On node 0 totalpages: 0
        Node 0 active with no memory
        Node 0 active with no memory
          DMA zone: 0 pages used for memmap
        Node 0 active with no memory
        Node 0 active with no memory
          Normal zone: 0 pages used for memmap
        Node 0 active with no memory
        Node 0 active with no memory
          Movable zone: 0 pages used for memmap

and so on for each memoryless node.

We already have the "On node N totalpages: ..." and other related messages, so
drop the "Node N active with no memory" warnings.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
37c0708dbe Memoryless nodes: Add N_CPU node state
We need the check for a node with cpu in zone reclaim.  Zone reclaim will not
allow remote zone reclaim if a node has a cpu.

[Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
7ea1530ab3 Memoryless nodes: introduce mask of nodes with memory
It is necessary to know if nodes have memory since we have recently begun to
add support for memoryless nodes.  For that purpose we introduce a two new
node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
type of mmemory.  If a node has memory then it has at least one zone defined
in its pgdat structure that is located in the pgdat itself.

A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
ZONE_HIGHMEM.  This means it is possible to allocate memory that is not
subject to kmap.

N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to insure
that we do the right thing when we encounter a memoryless node.

[akpm@linux-foundation.org: build fix]
[Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
[y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
1380891071 Memoryless nodes: Generic management of nodemasks for various purposes
Why do we need to support memoryless nodes?

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> For fujitsu, problem is called "empty" node.
>
> When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
> creates nodes, which includes no memory, no cpu.
>
> I tried to remove empty-node in past, but that was denied.
> It was because we can hot-add cpu to the empty node.
> (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
>
>
> For HP, (Lee can comment on this later), they have memory-less-node.
> As far as I hear, HP's machine can have following configration.
>
> (example)
> Node0: CPU0   memory AAA MB
> Node1: CPU1   memory AAA MB
> Node2: CPU2   memory AAA MB
> Node3: CPU3   memory AAA MB
> Node4: Memory XXX GB
>
> AAA is very small value (below 16MB)  and will be omitted by ia64 bootstrap.
> After boot, only Node 4 has valid memory (but have no cpu.)
>
> Maybe this is memory-interleave by firmware config.

Christoph Lameter <clameter@sgi.com> wrote:

> Future SGI platforms (actually also current one can have but nothing like
> that is deployed to my knowledge) have nodes with only cpus. Current SGI
> platforms have nodes with just I/O that we so far cannot manage in the
> core. So the arch code maps them to the nearest memory node.

Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> For the HP platforms, we can configure each cell with from 0% to 100%
> "cell local memory".  When we configure with <100% CLM, the "missing
> percentages" are interleaved by hardware on a cache-line granularity to
> improve bandwidth at the expense of latency for numa-challenged
> applications [and OSes, but not our problem ;-)].  When we boot Linux on
> such a config, all of the real nodes have no memory--it all resides in a
> single interleaved pseudo-node.
>
> When we boot Linux on a 100% CLM configuration [== NUMA], we still have
> the interleaved pseudo-node.  It contains a few hundred MB stolen from
> the real nodes to contain the DMA zone.  [Interleaved memory resides at
> phys addr 0].  The memoryless-nodes patches, along with the zoneorder
> patches, support this config as well.
>
> Also, when we boot a NUMA config with the "mem=" command line,
> specifying less memory than actually exists, Linux takes the excluded
> memory "off the top" rather than distributing it across the nodes.  This
> can result in memoryless nodes, as well.
>

This patch:

Preparation for memoryless node patches.

Provide a generic way to keep nodemasks describing various characteristics of
NUMA nodes.

Remove the node_online_map and the node_possible map and realize the same
functionality using two nodes stats: N_POSSIBLE and N_ONLINE.

[Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Jesper Juhl
8691f3a72f mm: no need to cast vmalloc() return value in zone_wait_table_init()
vmalloc() returns a void pointer, so there's no need to cast its
return value in mm/page_alloc.c::zone_wait_table_init().

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Andrew Morton
6419168813 process_zones(): fix recovery code
Don't try to free memory which we didn't allocate.

Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:22 -07:00
Mel Gorman
b377fd3982 Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE
The NUMA layer only supports NUMA policies for the highest zone.  When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE.  The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.

The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64.  No abnormal results were seen in
kernbench, tbench, dbench or hackbench.  It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.

akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on.  That solution
would only use a single zonelist per node and filter on the fly.  That may
help performance and also help to make memory policies work better."

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Mel Gorman
a8bbf72ab9 Do not trigger OOM-killer for high-order allocation failures
out_of_memory() may be called when an allocation is failing and the direct
reclaim is not making any progress.  This does not take into account the
requested order of the allocation.  If the request if for an order larger
than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation
because the kernel makes no guarantees about those allocations succeeding.

This false OOM situation can occur if a user is trying to grow the hugepage
pool in a script like;

#!/bin/bash
REQUIRED=$1
echo 1 > /proc/sys/vm/hugepages_treat_as_movable
echo $REQUIRED > /proc/sys/vm/nr_hugepages
ACTUAL=`cat /proc/sys/vm/nr_hugepages`
while [ $REQUIRED -ne $ACTUAL ]; do
	echo Huge page pool at $ACTUAL growing to $REQUIRED
	echo $REQUIRED > /proc/sys/vm/nr_hugepages
	ACTUAL=`cat /proc/sys/vm/nr_hugepages`
	sleep 1
done

This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM
easily on 2.6.23-rc1. This patch will fail an allocation for an order above
PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-31 15:39:36 -07:00
Rafael J. Wysocki
296699de6b Introduce CONFIG_SUSPEND for suspend-to-Ram and standby
Introduce CONFIG_SUSPEND representing the ability to enter system sleep
states, such as the ACPI S3 state, and allow the user to choose SUSPEND
and HIBERNATION independently of each other.

Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
been chosen and the kernel is intended for SMP systems.

Also, introduce CONFIG_PM_SLEEP which is automatically selected if
CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
code needed for both suspend and hibernation.

The top-level power management headers and the ACPI code related to
suspend and hibernation are modified to use the new definitions (the
changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
the number of ifdefs).

There are many other files in which CONFIG_PM can be replaced with
CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
the future.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-29 16:45:38 -07:00
Mel Gorman
b5445f956e Allow nodes to exist that only contain ZONE_MOVABLE
With the introduction of kernelcore=, a configurable zone is created on
request.  In some cases, this value will be small enough that some nodes
contain only ZONE_MOVABLE.  On some NUMA configurations when this occurs,
arch-independent zone-sizing will get the size of the memory holes within
the node incorrect.  The value of present_pages goes negative and the boot
fails.

This patch fixes the bug in the calculation of the size of the hole.  The
test case is to boot test a NUMA machine with a low value of kernelcore=
before and after the patch is applied.  While this bug exists in early
kernel it cannot be triggered in practice.

This patch has been boot-tested on a variety machines with and without
kernelcore= set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26 11:35:19 -07:00
Paul Mundt
e228929bc2 mm: fix memory hotplug oops from ZONE_MOVABLE changes.
zone_movable_pfn is presently marked as __initdata and referenced from
adjust_zone_range_for_zone_movable(), which in turn is referenced by
zone_spanned_pages_in_node().  Both of these are __meminit annotated.  When
memory hotplug is enabled, this will oops on a hot-add, due to
zone_movable_pfn having been freed.

__meminitdata annotation gives the desired behaviour.

This will only impact platforms that enable both memory hotplug
and ARCH_POPULATES_NODE_MAP.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 08:44:19 -07:00
Fengguang Wu
fe3cba17c4 mm: share PG_readahead and PG_reclaim
Share the same page flag bit for PG_readahead and PG_reclaim.

One is used only on file reads, another is only for emergency writes.  One
is used mostly for fresh/young pages, another is for old pages.

Combinations of possible interactions are:

a) clear PG_reclaim => implicit clear of PG_readahead
	it will delay an asynchronous readahead into a synchronous one
	it actually does _good_ for readahead:
		the pages will be reclaimed soon, it's readahead thrashing!
		in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
	one(and only one) page will not be reclaimed in time
	it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
	will confuse readahead and make it restart the size rampup process
	it's a trivial problem, and can mostly be avoided by checking
	PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
	PG_readahead will never be set on already cached pages.
	PG_reclaim will always be cleared on dirtying a page.
	so not a problem.

In summary,
	a)   we get better behavior
	b,d) possible interactions can be avoided
	c)   racy condition exists that might affect readahead, but the chance
	     is _really_ low, and the hurt on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu
d77c2d7cc5 readahead: introduce PG_readahead
Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
it runs out of cached pages!

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Meelap Shah
c2f1a551de knfsd: nfsd4: vary maximum delegation limit based on RAM size
Our original NFSv4 delegation policy was to give out a read delegation on any
open when it was possible to.

Since the lifetime of a delegation isn't limited to that of an open, a client
may quite reasonably hang on to a delegation as long as it has the inode
cached.  This becomes an obvious problem the first time a client's inode cache
approaches the size of the server's total memory.

Our first quick solution was to add a hard-coded limit.  This patch makes a
mild incremental improvement by varying that limit according to the server's
total memory size, allowing at most 4 delegations per megabyte of RAM.

My quick back-of-the-envelope calculation finds that in the worst case (where
every delegation is for a different inode), a delegation could take about
1.5K, which would make the worst case usage about 6% of memory.  The new limit
works out to be about the same as the old on a 1-gig server.

[akpm@linux-foundation.org: Don't needlessly bloat vmlinux]
[akpm@linux-foundation.org: Make it right for highmem machines]
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:07 -07:00
Andy Whitcroft
5ad333eb66 Lumpy Reclaim V4
When we are out of memory of a suitable size we enter reclaim.  The current
reclaim algorithm targets pages in LRU order, which is great for fairness at
order-0 but highly unsuitable if you desire pages at higher orders.  To get
pages of higher order we must shoot down a very high proportion of memory;
>95% in a lot of cases.

This patch set adds a lumpy reclaim algorithm to the allocator.  It targets
groups of pages at the specified order anchored at the end of the active and
inactive lists.  This encourages groups of pages at the requested orders to
move from active to inactive, and active to free lists.  This behaviour is
only triggered out of direct reclaim when higher order pages have been
requested.

This patch set is particularly effective when utilised with an
anti-fragmentation scheme which groups pages of similar reclaimability
together.

This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
the foundation.  Credit to Mel Gorman for sanitity checking.

Mel said:

  The patches have an application with hugepage pool resizing.

  When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can
  be resized with greater reliability.  Testing on a desktop machine with 2GB
  of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own
  was very slow as the success rate was quite low.  Without lumpy-reclaim,
  each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
  With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

[akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
[bunk@stusta.de: static declarations for internal functions]
[a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Mel Gorman
7e63efef85 Add a movablecore= parameter for sizing ZONE_MOVABLE
This patch adds a new parameter for sizing ZONE_MOVABLE called
movablecore=.  While kernelcore= is used to specify the minimum amount of
memory that must be available for all allocation types, movablecore= is
used to specify the minimum amount of memory that is used for migratable
allocations.  The amount of memory used for migratable allocations
determines how large the huge page pool could be dynamically resized to at
runtime for example.

How movablecore is actually handled is that the total number of pages in
the system is calculated and a value is set for kernelcore that is

kernelcore == totalpages - movablecore

Both kernelcore= and movablecore= can be safely specified at the same time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Mel Gorman
ed7ed36517 handle kernelcore=: generic
This patch adds the kernelcore= parameter for x86.

Once all patches are applied, a new command-line parameter exist and a new
sysctl.  This patch adds the necessary documentation.

From: Yasunori Goto <y-goto@jp.fujitsu.com>

  When "kernelcore" boot option is specified, kernel can't boot up on ia64
  because of an infinite loop.  In addition, the parsing code can be handled
  in an architecture-independent manner.

  This patch uses common code to handle the kernelcore= parameter.  It is
  only available to architectures that support arch-independent zone-sizing
  (i.e.  define CONFIG_ARCH_POPULATES_NODE_MAP).  Other architectures will
  ignore the boot parameter.

[bunk@stusta.de: make cmdline_parse_kernelcore() static]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Mel Gorman
2a1e274acf Create the ZONE_MOVABLE zone
The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
that is only usable by allocations that specify both __GFP_HIGHMEM and
__GFP_MOVABLE.  This has the effect of keeping all non-movable pages within a
single memory partition while allowing movable allocations to be satisfied
from either partition.  The patches may be applied with the list-based
anti-fragmentation patches that groups pages together based on mobility.

The size of the zone is determined by a kernelcore= parameter specified at
boot-time.  This specifies how much memory is usable by non-movable
allocations and the remainder is used for ZONE_MOVABLE.  Any range of pages
within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

When selecting a zone to take pages from for ZONE_MOVABLE, there are two
things to consider.  First, only memory from the highest populated zone is
used for ZONE_MOVABLE.  On the x86, this is probably going to be ZONE_HIGHMEM
but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64.  Second,
the amount of memory usable by the kernel will be spread evenly throughout
NUMA nodes where possible.  If the nodes are not of equal size, the amount of
memory usable by the kernel on some nodes may be greater than others.

By default, the zone is not as useful for hugetlb allocations because they are
pinned and non-migratable (currently at least).  A sysctl is provided that
allows huge pages to be allocated from that zone.  This means that the huge
page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
the system assuming that pages are not mlocked.  Despite huge pages being
non-movable, we do not introduce additional external fragmentation of note as
huge pages are always the largest contiguous block we care about.

Credit goes to Andy Whitcroft for catching a large variety of problems during
review of the patches.

This patch creates an additional zone, ZONE_MOVABLE.  This zone is only usable
by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE.  Hot-added
memory continues to be placed in their existing destination as there is no
mechanism to redirect them to a specific zone.

[y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Akinobu Mita
54114994f4 fault-injection: add min-order parameter to fail_page_alloc
Limiting smaller allocation failures by fault injection helps to find real
possible bugs.  Because higher order allocations are likely to fail and
zero-order allocations are not likely to fail.

This patch adds min-order parameter to fail_page_alloc.  It specifies the
minimum page allocation order to be injected failures.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:45 -07:00
Dan Aloni
b49ad484c5 mm/page_alloc.c: lower printk severity
Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Paul Mundt
6ea6e6887d mm: more __meminit annotations
Currently zone_spanned_pages_in_node() and zone_absent_pages_in_node() are
non-static for ARCH_POPULATES_NODE_MAP and static otherwise.  However, only
the non-static versions are __meminit annotated, despite only being called
from __meminit functions in either case.

zone_init_free_lists() is currently non-static and not __meminit annotated
either, despite only being called once in the entire tree by
init_currently_empty_zone(), which too is __meminit.  So make it static and
properly annotated.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Jan Beulich
98011f569e mm: fix improper .init-type section references
.. which modpost started warning about.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Eric Dumazet
1037b83bd0 MM: alloc_large_system_hash() can free some memory for non power-of-two bucketsize
alloc_large_system_hash() is called at boot time to allocate space for
several large hash tables.

Lately, TCP hash table was changed and its bucketsize is not a power-of-two
anymore.

On most setups, alloc_large_system_hash() allocates one big page (order >
0) with __get_free_pages(GFP_ATOMIC, order).  This single high_order page
has a power-of-two size, bigger than the needed size.

We can free all pages that wont be used by the hash table.

On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory.

TCP established hash table entries: 32768 (order: 6, 393216 bytes)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
KAMEZAWA Hiroyuki
f0c0b2b808 change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
 * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
Paul Mundt
d09c6b8094 mm: Fix memory/cpu hotplug section mismatch and oops.
When building with memory hotplug enabled and cpu hotplug disabled, we
end up with the following section mismatch:

WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
.init.text: (between 'free_area_init_node' and '__build_all_zonelists')

This happens as a result of:

        -> free_area_init_node()
          -> free_area_init_core()
            -> zone_pcp_init() <-- all __meminit up to this point
              -> zone_batchsize() <-- marked as __cpuinit                     fo

This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
CONFIG_MEMORY_HOTPLUG=y unsets __meminit.

Changing zone_batchsize() to __devinit fixes this.

__devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
another section identifier completely. Without this, memory hot-add
of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
not also enabled.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

--

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2007-06-15 16:18:08 -07:00
Roman Zippel
12d810c1b8 m68k: discontinuous memory support
Fix support for discontinuous memory

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-31 07:58:14 -07:00
Miklos Szeredi
418508c132 fix unused setup_nr_node_ids
mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used

This is now the only (!) compiler warning I get in my UML build :)

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Sam Ravnborg
577a32f620 mm: fix section mismatch warnings
modpost had two cases hardcoded for mm/
Shift over to __init_refok and kill the
hardcoded function names in modpost.

This has the drawback that the functions
will always be kept no matter configuration.
With previous code the function were placed in
init section if configuration allowed it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2007-05-19 09:11:58 +02:00
Stephen Rothwell
6f076f5dd9 early_pfn_to_nid needs to be __meminit
Since it is referenced by memmap_init_zone (which is __meminit) via the
early_pfn_in_nid macro when CONFIG_NODES_SPAN_OTHER_NODES is set (which
basically means PowerPC 64).

This removes a section mismatch warning in those circumstances.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:52 -07:00
Christoph Lameter
4037d45220 Move remote node draining out of slab allocators
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes.  This requires SLUB to have
a whole subsystem in order to be compatible with SLAB.  Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there.  If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero.  Then we will drain one batch from the pageset.  The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty.  So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues.  The number of pcp is determined by the
number of processors and nodes in a system.  A system with 4 processors and 2
nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki
8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Yasunori Goto
72280ede31 Add white list into modpost.c for memory hotplug code and ia64's machvec section
This patch is add white list into modpost.c for some functions and
ia64's section to fix section mismatchs.

  sparse_index_alloc() and zone_wait_table_init() calls bootmem allocator
  at boot time, and kmalloc/vmalloc at hotplug time. If config
  memory hotplug is on, there are references of bootmem allocater(init text)
  from them (normal text). This is cause of section mismatch.

  Bootmem is called by many functions and it must be
  used only at boot time. I think __init of them should keep for
  section mismatch check. So, I would like to register sparse_index_alloc()
  and zone_wait_table_init() into white list.

  In addition, ia64's .machvec section is function table of some platform
  dependent code. It is mixture of .init.text and normal text. These
  reference of __init functions are valid too.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Yasunori Goto
a3142c8e1d Fix section mismatch of memory hotplug related code.
This is to fix many section mismatches of code related to memory hotplug.
I checked compile with memory hotplug on/off on ia64 and x86-64 box.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Rafael J. Wysocki
7be9823491 swsusp: use inline functions for changing page flags
Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc.  with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:58 -07:00
Christoph Lameter
6d7779538f mm: optimize compound_head() by avoiding a shared page flag
The patch adds PageTail(page) and PageHead(page) to check if a page is the
head or the tail of a compound page.  This is done by masking the two bits
describing the state of a compound page and then comparing them.  So one
comparision and a branch instead of two bit checks and two branches.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter
d85f33855c Make page->private usable in compound pages
If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages.  Right now we have to be careful f.e.
 if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field.  This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct.  I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads.  This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
be reclaimed.  Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Mel Gorman
3b1d92c565 Do not disable interrupts when reading min_free_kbytes
The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
read or write.  This function iterates through every zone and calls
spin_lock_irqsave() on the zone LRU lock.  When reading min_free_kbytes,
this is a total waste of time that disables interrupts on the local
processor.  It might even be noticable machines with large numbers of zones
if a process started constantly reading min_free_kbytes.

This patch only calls setup_per_zone_pages_min() only on write. Tested on
an x86 laptop and it did the right thing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Andy Whitcroft
14e0729841 add pfn_valid_within helper for sub-MAX_ORDER hole detection
Generally we work under the assumption that memory the mem_map array is
contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie.  that if we
have validated any page within this MAX_ORDER_NR_PAGES block we need not check
any other.  This is not true when CONFIG_HOLES_IN_ZONE is set and we must
check each and every reference we make from a pfn.

Add a pfn_valid_within() helper which should be used when scanning pages
within a MAX_ORDER_NR_PAGES block when we have already checked the validility
of the block normally with pfn_valid().  This can then be optimised away when
we do not have holes within a MAX_ORDER_NR_PAGES block of pages.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Christoph Lameter
476f35348e Safer nr_node_ids and nr_node_ids determination and initial values
The nr_cpu_ids value is currently only calculated in smp_init.  However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init.  So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.

Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take.  If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated.  If it is set to the maximum possible then we only waste
some memory for early boot users.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Nick Piggin
5409bae07a [PATCH] Rename PG_checked to PG_owner_priv_1
Rename PG_checked to PG_owner_priv_1 to reflect its availablilty as a
private flag for use by the owner/allocator of the page.  In the case of
pagecache pages (which might be considered to be owned by the mm),
filesystems may use the flag.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Christoph Lameter
8ef8286689 [PATCH] slab: reduce size of alien cache to cover only possible nodes
The alien cache is a per cpu per node array allocated for every slab on the
system.  Currently we size this array for all nodes that the kernel does
support.  For IA64 this is 1024 nodes.  So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.

This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.

The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Christoph Lameter
74c7aa8b85 [PATCH] Replace highest_possible_node_id() with nr_node_ids
highest_possible_node_id() is currently used to calculate the last possible
node idso that the network subsystem can figure out how to size per node
arrays.

I think having the ability to determine the maximum amount of nodes in a
system at runtime is useful but then we should name this entry
correspondingly, it should return the number of node_ids, and the the value
needs to be setup only once on bootup.  The node_possible_map does not
change after bootup.

This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id().  nr_node_ids is calculated on bootup when the
page allocators pagesets are initialized.

[deweerdt@free.fr: fix oops]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Christoph Lameter
4b51d66989 [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM
Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
  for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
  0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
6267276f3f [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone
This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones.  The patches allow to go to a minimum of 2 zones.  This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.

ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
why we would not want to have ZONE_DMA

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
   The necessity of DMA zones may drastically be reduced
   in the future. This patchset allows a compilation of
   a kernel without that overhead.

3. Devices that require ISA DMA get rare these days. All
   my systems do not have any need for ISA DMA.

4. The presence of an additional zone unecessarily complicates
   VM operations because it must be scanned and balancing
   logic must operate on its.

5. With only ZONE_NORMAL one can reach the situation where
   we have only one zone. This will allow the unrolling of many
   loops in the VM and allows the optimization of varous
   code paths in the VM.

6. Having only a single zone in a NUMA system results in a
   1-1 correspondence between nodes and zones. Various additional
   optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone.  If you look at
what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo?  Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them.  An effort has been made to minize the number of #ifdefs
and make this as compact as possible.  The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     4435
        min      1448
        low      1810
        high     2172
        active   241786
        inactive 210170
        scanned  0 (a: 0 i: 0)
        spanned  524224
        present  524224
    nr_anon_pages 61680
    nr_mapped    14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty     39
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 20
    cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
  vm stats threshold: 20
  all_unreclaimable: 0
  prev_priority:     12
  temp_priority:     12
  start_pfn:         0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone.  If
ZONE_DMA is optional then other zones may be first.  So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat.  However, we still need a zonetable to index all
the zones for each node if this is a NUMA system.  Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
65e458d43d [PATCH] Drop get_zone_counts()
Values are available via ZVC sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
9195481d2f [PATCH] Drop nr_free_pages_pgdat()
Function is unnecessary now.  We can use the summing features of the ZVCs to
get the values we need.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
9617729941 [PATCH] Drop free_pages()
nr_free_pages is now a simple access to a global variable.  Make it a macro
instead of a function.

The nr_free_pages now requires vmstat.h to be included.  There is one
occurrence in power management where we need to add the include.  Directly
refrer to global_page_state() there to clarify why the #include was added.

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
d23ad42324 [PATCH] Use ZVC for free_pages
This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter
c878538598 [PATCH] Use ZVC for inactive and active counts
The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.

However, not all pages in the system may be dirtied.  Thus the ratio is always
too low and can never reach 100%.  The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore.  In that case we may get into
a situation in which f.e.  the background writeback ratio of 40% cannot be
reached anymore which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty.  These are the pages on the active and
the inactive list plus free pages.

The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up.  This patchset makes these counters ZVC counters.  This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive
  page counts

- Cleanup of the functions that display information via the proc filesystem.

This patch:

The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations.  More ZVC functionality is used for sums etc in the
following patches.

[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Mel Gorman
a6af2bc3d5 [PATCH] Avoid excessive sorting of early_node_map[]
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call.  This is an excessive amount of sorting and
that can be avoided.  This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found.  The
map is then only sorted once when required.  Successfully boot tested on a
number of machines.

[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Andrew Morton
a25700a53f [PATCH] mm: show bounce pages in oom killer output
Also split that long line up - people like to send us wordwrapped oom-kill
traces.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:47 -08:00
Linus Torvalds
6fd6b17c6d Revert "[PATCH] mm: micro optimise zone_watermark_ok"
This reverts commit e80ee884ae.

Pawel Sikora had a boot-time oops due to it - because the sign change
invalidates the following comparisons, since 'free_pages' can be
negative.

The micro-optimization just isn't worth it.

Bisected-by: Pawel Sikora <pluto@agmk.net>
Acked-by: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-31 16:46:40 -08:00
Dave Hansen
a2f3aa0257 [PATCH] Fix sparsemem on Cell
Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime.  It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.

It has been compile tested on ppc64, ia64, s390, i386 and x86_64.

Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:20 -08:00
Christoph Lameter
f2e12bb272 [PATCH] Check for populated zone in __drain_pages
Both process_zones() and drain_node_pages() check for populated zones
before touching pagesets.  However, __drain_pages does not do so,

This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.

Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets.  Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.

If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.

If the cpu is then brought up again then we call process_zones() which will
ignore the unpopulated zone.  So the pageset pointers will still be NULL.

If the cpu is then again brought down then __drain_pages will attempt to
drain pages by following the NULL pageset pointer for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:29 -08:00
Paul Mundt
9ab37b8f21 [PATCH] Sanely size hash tables when using large base pages
At the moment the inode/dentry cache hash tables (common by way of
alloc_large_system_hash()) are incorrectly sized by their respective
detection logic when we attempt to use large base pages on systems with
little memory.

This results in odd behaviour when using a 64kB PAGE_SIZE, such as:

Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)

The mount cache hash table is seemingly the only one that gets this right
by directly taking PAGE_SIZE in to account.

The following patch attempts to catch the bogus values and round it up to
at least 0-order.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:23 -08:00
Paul Jackson
02a0e53d82 [PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

  cpuset_zone_allowed_hardwall()
  cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it.  Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines.  It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Don Mullis
6b1b60f41e [PATCH] fault-injection: defaults likely to please a new user
Assign defaults most likely to please a new user:
 1) generate some logging output
    (verbose=2)
 2) avoid injecting failures likely to lock up UI
    (ignore_gfp_wait=1, ignore_gfp_highmem=1)

Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Akinobu Mita
933e312e73 [PATCH] fault-injection capability for alloc_pages()
This patch provides fault-injection capability for alloc_pages()

Boot option:

fail_page_alloc=<interval>,<probability>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where memory can be
		   allocated safely in pages.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/fail_page_alloc/interval
/debug/fail_page_alloc/probability
/debug/fail_page_alloc/specifies
/debug/fail_page_alloc/times
/debug/fail_page_alloc/ignore-gfp-highmem
/debug/fail_page_alloc/ignore-gfp-wait

Example:

	fail_page_alloc=10,100,0,-1

The page allocation (alloc_pages(), ...) fails once per 10 times.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
David Howells
f0d1b0b30d [PATCH] LOG2: Implement a general integer log2 facility in the kernel
This facility provides three entry points:

	ilog2()		Log base 2 of unsigned long
	ilog2_u32()	Log base 2 of u32
	ilog2_u64()	Log base 2 of u64

These facilities can either be used inside functions on dynamic data:

	int do_something(long q)
	{
		...;
		y = ilog2(x)
		...;
	}

Or can be used to statically initialise global variables with constant values:

	unsigned n = ilog2(27);

When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant.  They treat negative numbers as
unsigned.

When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.

[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Helge Deller
15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Ingo Molnar
0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Andrew Morton
0490366432 [PATCH] remove HASH_HIGHMEM
It has no users and it's doubtful that we'll need it again.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:37 -08:00
Andy Whitcroft
33f2ef89f8 [PATCH] mm: make compound page destructor handling explicit
Currently we we use the lru head link of the second page of a compound page
to hold its destructor.  This was ok when it was purely an internal
implmentation detail.  However, hugetlbfs overrides this destructor
violating the layering.  Abstract this out as explicit calls, also
introduce a type for the callback function allowing them to be type
checked.  For each callback we pre-declare the function, causing a type
error on definition rather than on use elsewhere.

[akpm@osdl.org: cleanups]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter
952f3b51be [PATCH] GFP_THISNODE must not trigger global reclaim
The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node.  If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
then we will currently do some reclaim before returning NULL.  The caller
may want memory from other nodes before reclaim should be triggered.  (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Andy Whitcroft
ce421c799b [PATCH] mm: cleanup indentation on switch for CPU operations
These patches introduced new switch statements which are indented contrary
to the concensus in mm/*.c.  Fix them up to match that concensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c995
    commit df9ecaba3f

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Andy Whitcroft
25ba77c141 [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int
NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long.  This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.

In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:

1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
   against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
   uses bit_set which in generic code takes an int.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Christoph Lameter
bc4ba393c0 [PATCH] drain_node_page(): Drain pages in batch units
drain_node_pages() currently drains the complete pageset of all pages.  If
there are a large number of pages in the queues then we may hold off
interrupts for too long.

Duplicate the method used in free_hot_cold_page.  Only drain pcp->batch
pages at one time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Kirill Korotaev
b43a57bb4d [PATCH] OOM can panic due to processes stuck in __alloc_pages()
OOM can panic due to the processes stuck in __alloc_pages() doing infinite
rebalance loop while no memory can be reclaimed.  OOM killer tries to kill
some processes, but unfortunetaly, rebalance label was moved by someone
below the TIF_MEMDIE check, so buddy allocator doesn't see that process is
OOM-killed and it can simply fail the allocation :/

Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
memory allocation tricks triggered OOM panic.

Signed-off-by: Denis Lunev <den@sw.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Nick Piggin
cc10250907 [PATCH] mm: add arch_alloc_page
Add an arch_alloc_page to match arch_free_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Paul Jackson
9276b1bc96 [PATCH] memory page_alloc zonelist caching speedup
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second.  And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago.  Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last weeks patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results.  After all that
    discussion with David on why I didn't need them, I went and got
    some ;).  I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably.  At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real').  Good.  See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct.  That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay
   adding any work for this new mechanism until we've looked at the
   first zone in zonelist.  I figured the odds of the first zone
   having the memory we needed were high enough that we should just
   look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.  The theory is that this is just performance hint data,
   and the memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
   (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
   all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cachline's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.

 - I pay no attention to the various watermarks and such in this performance
   hint.  A node could be marked full for one watermark, and then skipped
   over when searching for a page using a different watermark.  I think
   that's actually quite ok, as it will tend to slightly increase the
   spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.)  System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added.  The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.

      number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
 GBs    N  	------------	   --------------    ----------------	systime
 mem threads   sys user  real	  sys  user  real     sys  user  real	 better
  12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
  12	19	99   22     8	   99	 22	8	0     0     0	   0%
  12	37     111   25     6	  112	 25	6	1     0     0	  -0%
  12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
  38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
  38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
  38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
  38	55     501   77    23	  511	 80    24      10     3     1	  -1%
  64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
  64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
  64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
  64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
  90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
  90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
  90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
  90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%

Notes:
 1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

 2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass.  The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass.  For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

 3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

 5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest.  I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory along ways away.  Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- lets hope not.

 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect.  Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallizable workload.

 7) The intermediate thread count passes, when asking for alot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

Conclusions:
 * This zonelist cache patch probably makes little difference one way or the
   other for most workloads on real numa hardware, if those workloads avoid
   heavy off node allocations.
 * For memory intensive workloads requiring substantial off-node allocations
   on real numa hardware, this patch improves both kernel and elapsed timings
   up to ten per-cent.
 * For fake numa systems, I'm optimistic, but will have to leave that up to
   Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Christoph Lameter
89689ae7f9 [PATCH] Get rid of zone_table[]
The zone table is mostly not needed.  If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.

In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field.  In all of
the above cases there will be no need at all for the zone table.

The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags.  In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
 This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.

For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table.  The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson
0798e5193c [PATCH] memory page alloc minor cleanups
- s/freeliest/freelist/ spelling fix

- Check for NULL *z zone seems useless - even if it could happen, so
  what?  Perhaps we should have a check later on if we are faced with an
  allocation request that is not allowed to fail - shouldn't that be a
  serious kernel error, passing an empty zonelist with a mandate to not
  fail?

- Initializing 'z' to zonelist->zones can wait until after the first
  get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
  loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.

- Remove superfluous braces around a break

- Fix a couple errant spaces

- Adjust indentation on the cpuset_zone_allowed() check, to match the
  lines just before it -- seems easier to read in this case.

- Add another set of braces to the zone_watermark_ok logic

From: Paul Jackson <pj@sgi.com>

  Backout one item from a previous "memory page_alloc minor cleanups" patch.
   Until and unless we are certain that no one can ever pass an empty zonelist
  to __alloc_pages(), this check for an empty zonelist (or some BUG
  equivalent) is essential.  The code in get_page_from_freelist() blow ups if
  passed an empty zonelist.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Mel Gorman
1abbfb412b [PATCH] x86_64: fix bad page state in process 'swapper'
find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
depend on a sorted early_node_map[].  However, sort_node_map() is being
called after fin_min_pfn_with_active_regions() in
free_area_init_nodes().

In most cases, this is ok, but on at least one x86_64, the SRAT table
caused the E820 ranges to be registered out of order.  This gave the
wrong values for the min PFN range resulting in some pages not being
initialised.

This patch sorts the early_node_map in find_min_pfn_for_node().  It has
been boot tested on x86, x86_64, ppc64 and ia64.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-23 09:30:38 -08:00
nkalmala
941c7105dc [PATCH] mm: un-needed add-store operation wastes a few bytes
Un-needed add-store operation wastes a few bytes.
8 bytes wasted with -O2, on a ppc.

Signed-off-by: nkalmala <nkalmala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:56 -08:00
Mel Gorman
0c6cb97463 [PATCH] Calculation fix for memory holes beyong the end of physical memory
absent_pages_in_range() made the assumption that users of the
arch-independent zone-sizing API would not care about holes beyound the end
of physical memory.  This was not the case and was "fixed" in a patch
called "Account for holes that are outside the range of physical memory".
However, when given a range that started before a hole in "real" memory and
ended beyond the end of memory, it would get the result wrong.  The bug is
in mainline but a patch is below.

It has been tested successfully on a number of machines and architectures.
Additional credit to Keith Mannthey for discovering the problem, helping
identify the correct fix and confirming it Worked For Him.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Martin Bligh
3bb1a852ab [PATCH] vmscan: Fix temp_priority race
The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.

The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently.  In balance_pgdat, we keep a separate
priority record per zone in a local array.  In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.

Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low.  They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.

From: Andrew Morton <akpm@osdl.org>

__zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list.  Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.

Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.

This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
possible to remove that now, and to just start out at DEF_PRIORITY?

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:50 -07:00
Andy Whitcroft
7516795739 [PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc
Reintroduce NODES_SPAN_OTHER_NODES for powerpc

Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
    This reverts commit f62859bb68.
Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
    This reverts commit a94b3ab7ea.

Also update the comments to indicate that this is still required
and where its used.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:06 -07:00
Andrew Morton
6220ec7844 [PATCH] highest_possible_node_id() linkage fix
Qooting Adrian:

- net/sunrpc/svc.c uses highest_possible_node_id()

- include/linux/nodemask.h says highest_possible_node_id() is
  out-of-line #if MAX_NUMNODES > 1

- the out-of-line highest_possible_node_id() is in lib/cpumask.c

- lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
  CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y

-> highest_possible_node_id() is used in net/sunrpc/svc.c
   CONFIG_NODES_SHIFT defined and > 0

-> include/linux/numa.h: MAX_NUMNODES > 1

-> compile error

The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
depends on NUMA (but m32r isn't the only affected architecture).

So move the function into page_alloc.c

Cc: Adrian Bunk <bunk@stusta.de>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:43 -07:00
Andrew Morton
3fcfab16c5 [PATCH] separate bdi congestion functions from queue congestion functions
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Nick Piggin
9858db504c [PATCH] mm: locks_freed fix
Move the lock debug checks below the page reserved checks.  Also, having
debug_check_no_locks_freed in kernel_map_pages is wrong.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Nick Piggin
dafb13673c [PATCH] mm: arch_free_page fix
After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Mel Gorman
b888132b0f [PATCH] mm: remove memmap_zone_idx()
memmap_zone_idx() is not used anymore.  It was required by an earlier
version of
account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
any more.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:14 -07:00
Randy Dunlap
88ca3b94e8 [PATCH] page_alloc: fix kernel-doc and func. declaration
Fix kernel-doc and function declaration (missing "void") in
mm/page_alloc.c.

Add mm/page_alloc.c to kernel-api.tmpl in DocBook.

mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Nick Piggin
e80ee884ae [PATCH] mm: micro optimise zone_watermark_ok
Having min be a signed quantity means gcc can't turn high latency divides
into shifts.  There happen to be two such divides for GFP_ATOMIC (ie.
networking, ie.  important) allocations, one of which depends on the other.
 Fixing this makes code smaller as a bonus.

Shame on somebody (probably me).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Randy Dunlap
423b41d773 [PATCH] mm/page_alloc: use NULL instead of 0 for ptr
Use NULL instead of 0 for pointer value, eliminate sparse warnings.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter
66a550308b [PATCH] Do not allocate pagesets for unpopulated zones.
We do not need to allocate pagesets for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter
d5f541ed6e [PATCH] Add node to zone for the NUMA case
Add the node in order to optimize zone_to_nid.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter
08e0f6a970 [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA
The NUMA_BUILD constant is always available and will be set to 1 on
NUMA_BUILDs.  That way checks valid only under CONFIG_NUMA can easily be done
without #ifdef CONFIG_NUMA

F.e.

if (NUMA_BUILD && <numa_condition>) {
...
}

[akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
 causing ifdef explosion in core kernel, so let's see if this is a comfortable
 way in whcih to control that]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Jes Sorensen
c72419138f [PATCH] Condense output of show_free_areas()
On larger systems, the amount of output dumped on the console when you do
SysRq-M is beyond insane.  This patch is trying to reduce it somewhat as
even with the smaller NUMA systems that have hit the desktop this seems to
be a fair thing to do.

The philosophy I have taken is as follows:
 1) If a zone is empty, don't tell, we don't need yet another line
    telling us so. The information is available since one can look up
    the fact how many zones were initialized in the first place.
 2) Put as much information on a line is possible, if it can be done
    in one line, rahter than two, then do it in one. I tried to format
    the temperature stuff for easy reading.

Change show_free_areas() to not print lines for empty zones.  If no zone
output is printed, the zone is empty.  This reduces the number of lines
dumped to the console in sysrq on a large system by several thousand lines.

Change the zone temperature printouts to use one line per CPU instead of
two lines (one hot, one cold).  On a 1024 CPU, 1024 node system, this
reduces the console output by over a million lines of output.

While this is a bigger problem on large NUMA systems, it is also applicable
to smaller desktop sized and mid range NUMA systems.

Old format:

Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 42, batch 7 used:24
cpu 0 cold: high 14, batch 3 used:1
cpu 1 hot: high 42, batch 7 used:34
cpu 1 cold: high 14, batch 3 used:0
cpu 2 hot: high 42, batch 7 used:0
cpu 2 cold: high 14, batch 3 used:0
cpu 3 hot: high 42, batch 7 used:0
cpu 3 cold: high 14, batch 3 used:0
cpu 4 hot: high 42, batch 7 used:0
cpu 4 cold: high 14, batch 3 used:0
cpu 5 hot: high 42, batch 7 used:0
cpu 5 cold: high 14, batch 3 used:0
cpu 6 hot: high 42, batch 7 used:0
cpu 6 cold: high 14, batch 3 used:0
cpu 7 hot: high 42, batch 7 used:0
cpu 7 cold: high 14, batch 3 used:0
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Node 1 DMA per-cpu:
[snip]
Free pages:     5410688kB (0kB HighMem)
Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
....

New format:

Mem-info:
Node 0 DMA per-cpu:
CPU    0: Hot: hi:   42, btch:   7 usd:  41   Cold: hi:   14, btch:   3 usd:   2
CPU    1: Hot: hi:   42, btch:   7 usd:  40   Cold: hi:   14, btch:   3 usd:   1
CPU    2: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    3: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    4: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    5: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    6: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    7: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
Node 1 DMA per-cpu:
[snip]
Free pages:     5411088kB (0kB HighMem)
Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Mel Gorman
fb01439c5b [PATCH] Allow an arch to expand node boundaries
Arch-independent zone-sizing determines the size of a node
(pgdat->node_spanned_pages) based on the physical memory that was
registered by the architecture.  However, when
CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
spanned_pages will be much larger and that mem_map will be allocated that
is used lated on memory hot-add.

This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
to call push_node_boundaries() which will set the node beginning and end to
at *least* the requested boundary.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Mel Gorman
9c7cd6877c [PATCH] Account for holes that are outside the range of physical memory
absent_pages_in_range() made the assumption that users of the API would not
care about holes beyound the end of physical memory.  This was not the
case.  This patch will account for ranges outside of physical memory as
holes correctly.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Mel Gorman
0e0b864e06 [PATCH] Account for memmap and optionally the kernel image as holes
The x86_64 code accounted for memmap and some portions of the the DMA zone as
holes.  This was because those areas would never be reclaimed and accounting
for them as memory affects min watermarks.  This patch will account for the
memmap as a memory hole.  Architectures may optionally use set_dma_reserve()
if they wish to account for a portion of memory in ZONE_DMA as a hole.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Mel Gorman
c713216dee [PATCH] Introduce mechanism for registering active regions of memory
At a basic level, architectures define structures to record where active
ranges of page frames are located.  Once located, the code to calculate zone
sizes and holes in each architecture is very similar.  Some of this zone and
hole sizing code is difficult to read for no good reason.  This set of patches
eliminates the similar-looking architecture-specific code.

The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range().  When all areas have
been discovered, free_area_init_nodes() is called to initialise the pgdat and
zones.  The zone sizes and holes are then calculated in an architecture
independent manner.

Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
	It adjusts the watermarks slightly

Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine.  These were on patches against 2.6.17-rc1 and release 3 of these
patches but there have been no ia64-changes since release 3.

There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts the kernel image and the starting mem_maps as memory holes
but the architecture-independent code accounts the memory as present.

The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy.  There should be a
greater reduction when other architectures use the same mechanisms for zone
and hole sizing but I lack the hardware to test on.

Additional credit;
	Dave Hansen for the initial suggestion and comments on early patches
	Andy Whitcroft for reviewing early versions and catching numerous
		errors
	Tony Luck for testing and debugging on IA64
	Bob Picco for fixing bugs related to pfn registration, reviewing a
		number of patch revisions, providing a number of suggestions
		on future direction and testing heavily
	Jack Steiner and Robin Holt for testing on IA64 and clarifying
		issues related to memory holes
	Yasunori for testing on IA64
	Andi Kleen for reviewing and feeding back about x86_64
	Christian Kujau for providing valuable information related to ACPI
		problems on x86_64 and testing potential fixes

This patch:

Define the structure to represent an active range of page frames within a node
in an architecture independent manner.  Architectures are expected to register
active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
free_area_init_nodes() passing the PFNs of the end of each zone.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Rafael J. Wysocki
f623f0db8e [PATCH] swsusp: Fix mark_free_pages
Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
PageNosaveFree for PageNosave pages.  This allows us to get rid of an ugly
hack in kernel/power/snapshot.c#copy_data_pages().

Additionally, the page-copying loop in copy_data_pages() is moved to an
inline function.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Christoph Lameter
89fa30242f [PATCH] NUMA: Add zone_to_nid function
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM.  Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter
0ff38490c8 [PATCH] zone_reclaim: dynamic slab reclaim
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs.  We have had a case where a machine with
46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
dealing with pagecache pages.  However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim.  Zone reclaim
occurs if there is a danger of an off node allocation.  At that point we

1. Shrink the per node page cache if the number of pagecache
   pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
   (patch depends on earlier one that implements that counter)
   are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific.  So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit.  I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
972d1a7b14 [PATCH] ZVC: Support NR_SLAB_RECLAIMABLE / NR_SLAB_UNRECLAIMABLE
Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE.  The
intend seems to be to check for slab pages that could be freed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
8417bba4b1 [PATCH] Replace min_unmapped_ratio by min_unmapped_pages in struct zone
*_pages is a better description of the role of the variable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
39bbcb8f88 [PATCH] mm: do not check unpopulated zones for draining and counter updates
If a zone is unpopulated then we do not need to check for pages that are to
be drained and also not for vm counters that may need to be updated.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
006d22d9bb [PATCH] Optimize free_one_page
Free one_page currently adds the page to a fake list and calls
free_page_bulk.  Fee_page_bulk takes it off again and then calles
__free_one_page.

Make free_one_page go directly to __free_one_page.  Saves list on / off and
a temporary list in free_one_page for higher ordered pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
1192d52641 [PATCH] Cleanup: Add zone pointer to get_page_from_freelist
There are frequent references to *z in get_page_from_freelist.

Add an explicit zone variable that can be used in all these places.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
9b819d204c [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
flag is essential if a kernel component requires memory to be located on a
certain node.  It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
19655d3487 [PATCH] linearly index zone->node_zonelists[]
I wonder why we need this bitmask indexing into zone->node_zonelists[]?

We always start with the highest zone and then include all lower zones
if we build zonelists.

Are there really cases where we need allocation from ZONE_DMA or
ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
of highest_zone() makes that already impossible.

If we go linear on the index then gfp_zone() == highest_zone() and a lot
of definitions fall by the wayside.

We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Christoph Lameter
2f6726e54a [PATCH] Apply type enum zone_type
After we have done this we can now do some typing cleanup.

The memory policy layer keeps a policy_zone that specifies
the zone that gets memory policies applied. This variable
can now be of type enum zone_type.

The check_highest_zone function and the build_zonelists funnctionm must
then also take a enum zone_type parameter.

Plus there are a number of loops over zones that also should use
zone_type.

We run into some troubles at some points with functions that need a
zone_type variable to become -1. Fix that up.

[pj@sgi.com: fix set_mempolicy() crash]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Christoph Lameter
4e4785bcf0 [PATCH] mempolicies: fix policy_zone check
There is a check in zonelist_policy that compares pieces of the bitmap
obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
zonelist_policy().

The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
The policy_zone is a zone number with the possible values of ZONE_DMA,
ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
of values.

For some reason seemed to work before the zone reduction patchset (It
definitely works on SGI boxes since we just have one zone and the check
cannot fail).

With the zone reduction patchset this check definitely fails on systems
with two zones if the system actually has memory in both zones.

This is because ZONE_NORMAL is selected using no __GFP flag at
all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
is set. __GFP_DMA is 0x01.  So gfp_zone(gfpmask) == 1.

policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
populated.

For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
allocations!

Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
to DMA allocations!

What we realy want in that place is to establish the highest allowable
zone for a given gfp_mask. If the highest zone is higher or equal to the
policy_zone then memory policies need to be applied. We have such
a highest_zone() function in page_alloc.c.

So move the highest_zone() function from mm/page_alloc.c into
include/linux/gfp.h.  On the way we simplify the function and use the new
zone_type that was also introduced with the zone reduction patchset plus we
also specify the right type for the gfp flags parameter.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Christoph Lameter
e53ef38d05 [PATCH] reduce MAX_NR_ZONES: make ZONE_HIGHMEM optional
Make ZONE_HIGHMEM optional

- ifdef out code and definitions related to CONFIG_HIGHMEM

- __GFP_HIGHMEM falls back to normal allocations if there is no
  ZONE_HIGHMEM

- GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
  zone.

[jdike@addtoit.com: build fix]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Christoph Lameter
fb0e7942bd [PATCH] reduce MAX_NR_ZONES: make ZONE_DMA32 optional
Make ZONE_DMA32 optional

- Add #ifdefs around ZONE_DMA32 specific code and definitions.

- Add CONFIG_ZONE_DMA32 config option and use that for x86_64
  that alone needs this zone.

- Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
  for ia64 and fix up the way per node ZVCs are calculated.

- Fall back to prior GFP_ZONEMASK of 0x03 if there is no
  DMA32 zone.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Christoph Lameter
2f1b624868 [PATCH] reduce MAX_NR_ZONES: use enum to define zones, reformat and comment
Use enum for zones and reformat zones dependent information

Add comments explaning the use of zones and add a zones_t type for zone
numbers.

Line up information that will be #ifdefd by the following patches.

[akpm@osdl.org: comment cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Christoph Lameter
98d2b0ebda [PATCH] reduce MAX_NR_ZONES: page allocator ZONE_HIGHMEM cleanup
page allocator ZONE_HIGHMEM fixups

1. We do not need to do an #ifdef in si_meminfo since both counters
   in use are zero if !CONFIG_HIGHMEM.

2. Add #ifdef in si_meminfo_node instead to avoid referencing zone
   information for ZONE_HIGHMEM if we do not have HIGHMEM
   (may not be there after the following patches).

3. Replace the use of ZONE_HIGHMEM with MAX_NR_ZONES in build_zonelists_node

4. build_zonelists_node: Remove BUG_ON for ZONE_HIGHMEM. Zone will
   be optional soon and thus BUG_ON cannot be triggered anymore.

5. init_free_area_core: Replace a use of ZONE_HIGHMEM with NR_MAX_ZONES.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Christoph Lameter
c1f60a5a41 [PATCH] reduce MAX_NR_ZONES: move HIGHMEM counters into highmem.c/.h
Move totalhigh_pages and nr_free_highpages() into highmem.c/.h

Move the totalhigh_pages definition into highmem.c/.h.  Move the
nr_free_highpages function into highmem.c

[yoichi_yuasa@tripeaks.co.jp: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Christoph Lameter
182e8e2373 [PATCH] reduce MAX_NR_ZONES: make display of highmem counters conditional on CONFIG_HIGHMEM
Do not display HIGHMEM memory sizes if CONFIG_HIGHMEM is not set.

Make HIGHMEM dependent texts and make display of highmem counters optional

Some texts are depending on CONFIG_HIGHMEM.

Remove those strings and remove the display of highmem counter values if
CONFIG_HIGHMEM is not set.

[akpm@osdl.org: remove some ifdefs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Nick Piggin
725d704eca [PATCH] mm: VM_BUG_ON
Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM.  Use this
in the lightweight, inline refcounting functions; PageLRU and PageActive
checks in vmscan, because they're pretty well confined to vmscan.  And in
page allocate/free fastpaths which can be the hottest parts of the kernel
for kbuilds.

Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
side-effects, and should not be used outside core mm code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:44 -07:00
David Rientjes
f3ef9ead31 [PATCH] do not free non slab allocated per_cpu_pageset
Stops panic associated with attempting to free a non slab-allocated
per_cpu_pageset.

Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-25 17:38:36 -07:00
David S. Miller
f034b5d4ef [XFRM]: Dynamic xfrm_state hash table sizing.
The grow algorithm is simple, we grow if:

1) we see a hash chain collision at insert, and
2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
3) the number of xfrm_state objects is > the current hash mask

All of this needs some tweaking.

Remove __initdata from "hashdist" so we can use it safely at run time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:08:41 -07:00
Christoph Lameter
9614634fe6 [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O
It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
pages.  File I/O will use these pages for the next read/write requests and the
unmapped pages increase.  After the zone has filled up again zone reclaim will
remove it again after only 32 pages.  This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass.  However.  it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:26:59 -07:00
Linus Torvalds
22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Christoph Lameter
f8891e5e1f [PATCH] Light weight event counters
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat.  They have no
essential function for the VM.

We use a simple increment of per cpu variables.  In order to avoid the most
severe races we disable preempt.  Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter.  However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.

In the non preempt case this results in a simple increment for each
counter.  For many architectures this will be reduced by the compiler to a
single instruction.  This single instruction is atomic for i386 and x86_64.
 And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.

The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.

The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).

Benefits:
- VM event counter operations usually reduce to a single inline instruction
  on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
  Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.

References:

RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter
ca889e6c45 [PATCH] Use Zoned VM Counters for NUMA statistics
The numa statistics are really event counters.  But they are per node and
so we have had special treatment for these counters through additional
fields on the pcp structure.  We can now use the per zone nature of the
zoned VM counters to realize these.

This will shrink the size of the pcp structure on NUMA systems.  We will
have some room to add additional per zone counters that will all still fit
in the same cacheline.

 Bits	Prior pcp size	  	Size after patch	We can add
 ------------------------------------------------------------------
 64	128 bytes (16 words)	80 bytes (10 words)	48
 32	 76 bytes (19 words)	56 bytes (14 words)	8 (64 byte cacheline)
							72 (128 byte)

Remove the special statistics for numa and replace them with zoned vm
counters.  This has the side effect that global sums of these events now
show up in /proc/vmstat.

Also take the opportunity to move the zone_statistics() function from
page_alloc.c into vmstat.c.

Discussions:
V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter
fd39fc8561 [PATCH] zoned vm counters: conversion of nr_unstable to per zone counter
Conversion of nr_unstable to a per zone counter

We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.

This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again.  We are only left with event type counters
in page state.

[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter
ce866b34ae [PATCH] zoned vm counters: conversion of nr_writeback to per zone counter
Conversion of nr_writeback to per zone counter.

This removes the last page_state counter from arch/i386/mm/pgtable.c so we
drop the page_state from there.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter
b1e7a8fd85 [PATCH] zoned vm counters: conversion of nr_dirty to per zone counter
This makes nr_dirty a per zone counter.  Looping over all processors is
avoided during writeback state determination.

The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones.  Someone more familiar with
NFS should probably review what I have done.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter
df849a1529 [PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter
Conversion of nr_page_table_pages to a per zone counter

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter
9a865ffa34 [PATCH] zoned vm counters: conversion of nr_slab to per zone counter
- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
  the slab. This may become useful to balance slab allocations over
  various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter
347ce434d5 [PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter
Currently a single atomic variable is used to establish the size of the page
cache in the whole machine.  The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.

Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Christoph Lameter
65ba55f500 [PATCH] zoned vm counters: convert nr_mapped to per zone counter
nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.

We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).

We replace the use of nr_mapped in various kernel locations.  This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Christoph Lameter
2244b95a7b [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation
Per zone counter infrastructure

The counters that we currently have for the VM are split per processor.  The
processor however has not much to do with the zone these pages belong to.  We
cannot tell f.e.  how many ZONE_DMA pages are dirty.

So we are blind to potentially inbalances in the usage of memory in various
zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
particular node.  If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system.  For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.

Another example is zone reclaim.  We do not know how many unmapped pages exist
per zone.  So we just have to try to reclaim.  If it is not working then we
pause and try again later.  It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone.  This patchset allows the determination
of the number of unmapped pages per zone.  We can remove the zone reclaim
interval with the counters introduced here.

Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.

The counter framework here implements differential counters for each processor
in struct zone.  The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.

Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array.  VM functions can
access the counts by simply indexing a global or zone specific array.

The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.

Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state.  The second group of
functions can be called if it is known that interrupts are disabled.

Special optimized increment and decrement functions are provided.  These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.

We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state().  In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Christoph Lameter
f6ac2354d7 [PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h
NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
event counters do not need to be.

Zone based VM statistics are necessary to be able to determine what the state
of memory in one zone is.  In a NUMA system this can be helpful for local
reclaim and other memory optimizations that may be able to shift VM load in
order to get more balanced memory use.

It is also useful to know how the computing load affects the memory
allocations on various zones.  This patchset allows the retrieval of that data
from userspace.

The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.

Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
certain thresholds then the counters are accumulated in an array of
atomic_long in the zone and in a global array that sums up all zone values.
The small 8 bit counters are next to the per cpu page pointers and so they
will be in high in the cpu cache when pages are allocated and freed.

Access to VM counter information for a zone and for the whole machine is then
possible by simply indexing an array (Thanks to Nick Piggin for pointing out
that approach).  The access to the total number of pages of various types does
no longer require the summing up of all per cpu counters.

Benefits of this patchset right now:

- Ability for UP and SMP configuration to determine how memory
  is balanced between the DMA, NORMAL and HIGHMEM zones.

- loops over all processors are avoided in writeback and
  reclaim paths. We can avoid caching the writeback information
  because the needed information is directly accessible.

- Special handling for nr_pagecache removed.

- zone_reclaim_interval vanishes since VM stats can now determine
  when it is worth to do local reclaim.

- Fast inline per node page state determination.

- Accurate counters in /sys/devices/system/node/node*/meminfo. Current
  counters are counting simply which processor allocated a page somewhere
  and guestimate based on that. So the counters were not useful to show
  the actual distribution of page use on a specific zone.

- The swap_prefetch patch requires per node statistics in order to
  figure out when processors of a node can prefetch. This patch provides
  some of the needed numbers.

- Detailed VM counters available in more /proc and /sys status files.

References to earlier discussions:
V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

Performance tests with AIM7 did not show any regressions.  Seems to be a tad
faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
fixes for s390/arm/uml arch code.

This patch:

Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

Create vmstat.c/vmstat.h by separating the counter code and the proc
functions.

Move the vm_stat_text array before zoneinfo_show.

[akpm@osdl.org: s390 build fix]
[akpm@osdl.org: HOTPLUG_CPU build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Ingo Molnar
f9b8404cf8 [PATCH] pi-futex: introduce debug_check_no_locks_freed()
Add debug_check_no_locks_freed(), as a central inline to add
bad-lock-free-debugging functionality to.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Chandra Seetharaman
74b85f3790 [PATCH] cpu hotplug: make cpu_notifier related notifier blocks __cpuinit only
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
9c7b216d23 [PATCH] cpu hotplug: revert init patch submitted for 2.6.17
In 2.6.17, there was a problem with cpu_notifiers and XFS.  I provided a
band-aid solution to solve that problem.  In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18.  Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Andreas Mohr
d6e05edc59 spelling fixes
acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-26 18:35:02 +02:00
Chris Wright
43b0bc00fd [PATCH] cpuset: remove extra cpuset_zone_allowed check in __alloc_pages
This is redundant with check in wakeup_kswapd.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:08 -07:00
Kirill Korotaev
8f9de51a4a [PATCH] printk() should not be called under zone->lock
This patch fixes printk() under zone->lock in show_free_areas().  It can be
unsafe to call printk() under this lock, since caller can try to
allocate/free some memory and selfdeadlock on this lock.  I found
allocations/freeing mem both in netconsole and serial console.

This issue was faced in reallity when meminfo was periodically printed for
debug purposes and netconsole was used.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:52 -07:00
Andrew Morton
bd1e22b8e0 [PATCH] initialise total_memory() earlier
Initialise total_memory earlier in boot.  Because if for some reason we run
page reclaim early in boot, we don't want total_memory to be zero when we use
it as a divisor.

And rename total_memory to vm_total_pages to avoid naming clashes with
architectures.

Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:52 -07:00
Andy Whitcroft
67de648211 [PATCH] squash duplicate page_to_pfn and pfn_to_page
We have architectures where the size of page_to_pfn and pfn_to_page are
significant enough to overall image size that they wish to push them out of
line.  However, in the process we have grown a second copy of the
implementation of each of these routines for each memory model.  Share the
implmentation exposing it either inline or out-of-line as required.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:47 -07:00
Yasunori Goto
6811378e7d [PATCH] wait_table and zonelist initializing for memory hotadd: update zonelists
In current code, zonelist is considered to be build once, no modification.
But MemoryHotplug can add new zone/pgdat.  It must be updated.

This patch modifies build_all_zonelists().  By this, build_all_zonelist() can
reconfig pgdat's zonelists.

To update them safety, this patch use stop_machine_run().  Other cpus don't
touch among updating them by using it.

In old version (V2 of node hotadd), kernel updated them after zone
initialization.  But present_page of its new zone is still 0, because
online_page() is not called yet at this time.  Build_zonelists() checks
present_pages to find present zone.  It was too early.  So, I changed it after
online_pages().

Signed-off-by: Yasunori Goto     <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:46 -07:00
Yasunori Goto
cca448fe92 [PATCH] wait_table and zonelist initializing for memory hotadd: wait_table initialization
Wait_table is initialized according to zone size at boot time.  But, we cannot
know the maixmum zone size when memory hotplug is enabled.  It can be
changed....  And resizing of wait_table is hard.

So kernel allocate and initialzie wait_table as its maximum size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:46 -07:00
Yasunori Goto
718127cc31 [PATCH] wait_table and zonelist initializing for memory hotadd: add return code for init_current_empty_zone
When add_zone() is called against empty zone (not populated zone), we have to
initialize the zone which didn't initialize at boot time.  But,
init_currently_empty_zone() may fail due to allocation of wait table.  So,
this patch is to catch its error code.

Changes against wait_table is in the next patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:46 -07:00
Yasunori Goto
86356ab147 [PATCH] wait_table and zonelist initializing for memory hotadd: change to meminit for build_zonelist
Change definitions of some functions and data from __init to __meminit.

These functions and data can be used after bootup by this patch to be used for
hot-add codes.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:46 -07:00
Yasunori Goto
02b694dea4 [PATCH] wait_table and zonelist initializing for memory hotadd: change name of wait_table_size()
This is just to rename from wait_table_size() to wait_table_hash_nr_entries().

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:46 -07:00
Andy Whitcroft
cb2b95e1c6 [PATCH] zone handle unaligned zone boundaries
The buddy allocator has a requirement that boundaries between contigious
zones occur aligned with the the MAX_ORDER ranges.  Where they do not we
will incorrectly merge pages cross zone boundaries.  This can lead to pages
from the wrong zone being handed out.

Originally the buddy allocator would check that buddies were in the same
zone by referencing the zone start and end page frame numbers.  This was
removed as it became very expensive and the buddy allocator already made
the assumption that zones boundaries were aligned.

It is clear that not all configurations and architectures are honouring
this alignment requirement.  Therefore it seems safest to reintroduce
support for non-aligned zone boundaries.  This patch introduces a new check
when considering a page a buddy it compares the zone_table index for the
two pages and refuses to merge the pages where they do not match.  The
zone_table index is unique for each node/zone combination when
FLATMEM/DISCONTIGMEM is enabled and for each section/zone combination when
SPARSEMEM is enabled (a SPARSEMEM section is at least a MAX_ORDER size).

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
Bob Picco
e984bb43f7 [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary
Andy added code to buddy allocator which does not require the zone's
endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
__page_find_buddy could compute a buddy not in node_mem_map for partial
MAX_ORDER regions at zone's endpoints.  page_is_buddy will detect that
these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
allocator and not part of zone).  Of course the negative here is we could
waste a little memory but the positive is eliminating all the old checks
for zone boundary conditions.

SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
either because the holes and endpoints are handled differently.  This
leaves checking alloc_remap and other arches which privately allocate for
node_mem_map.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:22 -07:00
Paul Jackson
bdd804f478 [PATCH] Cpuset: might sleep checking zones allowed fix
Fix a couple of infrequently encountered 'sleeping function called from
invalid context' in the cpuset hooks in __alloc_pages.  Could sleep while
interrupts disabled.

The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
__alloc_pages() to determine if a zone is allowed in the current tasks
cpuset.  This routine can sleep, for certain GFP_KERNEL allocations, if the
zone is on a memory node not allowed in the current cpuset, but might be
allowed in a parent cpuset.

But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

The rule was intended to be:
  Don't call cpuset_zone_allowed() if you can't sleep, unless you
  pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  the code that might scan up ancestor cpusets and sleep.

This rule was being violated in a couple of places, due to a bogus change
made (by myself, pj) to __alloc_pages() as part of the November 2005 effort
to cleanup its logic, and also due to a later fix to constrain which swap
daemons were awoken.

The bogus change can be seen at:
  http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
  [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

This was first noticed on a tight memory system, in code that was disabling
interrupts and doing allocation requests with __GFP_WAIT not set, which
resulted in __might_sleep() writing complaints to the log "Debug: sleeping
function called ...", when the code in cpuset_zone_allowed() tried to take
the callback_sem cpuset semaphore.

We haven't seen a system hang on this 'might_sleep' yet, but we are at
decent risk of seeing it fairly soon, especially since the additional
cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
March 2006.

Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
to Nick Piggin who warned me of this back in Nov 2005, before I was ready
to listen.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
Andrew Morton
ac924c6034 [PATCH] setup_per_zone_pages_min() overflow fix
As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this
function can experience overflows on 32-bit machines, causing our response to
changed values of min_free_kbytes to go whacky.

Fixing it efficiently is all too hard, so fix it with 64-bit math instead.

Cc: Ake Sandgren <ake.sandgren@hpc2n.umu.se>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Chandra Seetharaman
83d722f7e1 [PATCH] Remove __devinit and __cpuinit from notifier_call definitions
Few of the notifier_chain_register() callers use __init in the definition
of notifier_call.  It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).

This patch fixes all such usages to _not_ have the notifier_call __init
section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:30:03 -07:00
Andrew Morton
6aa3001b23 [PATCH] page_alloc.c: buddy handling cleanup
Fix up some whitespace damage.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:50 -07:00
Hideo AOKI
cb45b0e966 [PATCH] overcommit: add calculate_totalreserve_pages()
These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
__vm_enough_memory().

- why the kernel needed patching

  When the kernel can't allocate anonymous pages in practice, currnet
  OVERCOMMIT_GUESS could return success. This implementation might be
  the cause of oom kill in memory pressure situation.

  If the Linux runs with page reservation features like
  /proc/sys/vm/lowmem_reserve_ratio and without swap region, I think
  the oom kill occurs easily.

- the overall design approach in the patch

  When the OVERCOMMET_GUESS algorithm calculates number of free pages,
  the reserved free pages are regarded as non-free pages.

  This change helps to avoid the pitfall that the number of free pages
  become less than the number which the kernel tries to keep free.

- testing results

  I tested the patches using my test kernel module.

  If the patches aren't applied to the kernel, __vm_enough_memory()
  returns success in the situation but autual page allocation is
  failed.

  On the other hand, if the patches are applied to the kernel, memory
  allocation failure is avoided since __vm_enough_memory() returns
  failure in the situation.

  I checked that on i386 SMP 16GB memory machine. I haven't tested on
  nommu environment currently.

This patch adds totalreserve_pages for __vm_enough_memory().

Calculate_totalreserve_pages() checks maximum lowmem_reserve pages and
pages_high in each zone. Finally, the function stores the sum of each
zone to totalreserve_pages.

The totalreserve_pages is calculated when the VM is initilized.
And the variable is updated when /proc/sys/vm/lowmem_reserve_raito
or /proc/sys/vm/min_free_kbytes are changed.

Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:32 -07:00
Nick Piggin
676165a8af [PATCH] Fix buddy list race that could lead to page lru list corruptions
Rohit found an obscure bug causing buddy list corruption.

page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
to determine whether or not a free page's buddy is itself free and in the
buddy lists.

Each of the conjuncts may be true at different times due to unrelated
conditions, so the non-atomic page_is_buddy test may find each conjunct to
be true even if they were not both true at the same time (ie. the page was
not on the buddy lists).

Signed-off-by: Martin Bligh <mbligh@google.com>
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-10 10:16:37 -07:00
KAMEZAWA Hiroyuki
ae0f15fb91 [PATCH] for_each_online_pgdat: remove pgdat_list
By using for_each_online_pgdat(), pgdat_list is not necessary now.  This patch
removes it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:48 -08:00
KAMEZAWA Hiroyuki
ec936fc563 [PATCH] for_each_online_pgdat: renaming for_each_pgdat
Replace for_each_pgdat() with for_each_online_pgdat().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:48 -08:00
KAMEZAWA Hiroyuki
a0140c1d85 [PATCH] remove zone_mem_map
This patch removes zone_mem_map.

pfn_to_page uses pgdat, page_to_pfn uses zone.  page_to_pfn can use pgdat
instead of zone, which is only one user of zone_mem_map.  By modifing it,
we can remove zone_mem_map.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:47 -08:00
KAMEZAWA Hiroyuki
a117e66ed4 [PATCH] unify pfn_to_page: generic functions
There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM.
Each arch has its own page_to_pfn(), pfn_to_page() for each models.
But most of them can use the same arithmetic.

This patch adds asm-generic/memory_model.h, which includes generic
page_to_pfn(), pfn_to_page() definitions for each memory model.

When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
used instead of macro. This is enabled by some archs and  reduces
text size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:44 -08:00
John Hawkes
6e692ed37a [PATCH] fix alloc_large_system_hash() roundup
The "rounded up to nearest power of 2 in size" algorithm in
alloc_large_system_hash is not correct.  As coded, it takes an otherwise
acceptable power-of-2 value and doubles it.  For example, we see the error
if we boot with thash_entries=2097152 which produces a hash table with
4194304 entries.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:58 -08:00
Anton Blanchard
f5335c0f1b [PATCH] quieten zone_pcp_init
In zone_pcp_init we print out all zones even if they are empty:

On node 0 totalpages: 245760
  DMA zone: 245760 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0

To conserve dmesg space why not print only the non zero zones.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:50 -08:00
Christoph Lameter
0b1303fcf2 [PATCH] cpusets: only wakeup kswapd for zones in the current cpuset
If we get under some memory pressure in a cpuset (we only scan zones that
are in the cpuset for memory) then kswapd is woken up for all zones.  This
patch only wakes up kswapd in zones that are part of the current cpuset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Christoph Lameter
879336c393 [PATCH] drain_node_pages: interrupt latency reduction / optimization
1. Only disable interrupts if there is actually something to free

2. Only dirty the pcp cacheline if we actually freed something.

3. Disable interrupts for each single pcp and not for cleaning
  all the pcps in all zones of a node.

drain_node_pages is called every 2 seconds from cache_reap. This
fix should avoid most disabling of interrupts.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:06 -08:00
Andrew Morton
6626c5d53b [PATCH] mm: prep_zero_page() in irq is a bug
prep_zero_page() uses KM_USER0 and hence may not be used from IRQ context, at
least for highmem pages.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:02 -08:00
Nick Piggin
17cf44064a [PATCH] mm: cleanup prep_ stuff
Move the prep_ stuff into prep_new_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:02 -08:00
Nick Piggin
7835e98b2e [PATCH] remove set_page_count() outside mm/
set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().

This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:02 -08:00
Nick Piggin
84097518d1 [PATCH] mm: nommu use compound pages
Now that compound page handling is properly fixed in the VM, move nommu
over to using compound pages rather than rolling their own refcounting.

nommu vm page refcounting is broken anyway, but there is no need to have
divergent code in the core VM now, nor when it gets fixed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: David Howells <dhowells@redhat.com>

(Needs testing, please).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:01 -08:00
Nick Piggin
545b1ea9bf [PATCH] mm: cleanup bootmem
The bootmem code added to page_alloc.c duplicated some page freeing code
that it really doesn't need to because it is not so performance critical.

While we're here, make prefetching work properly by actually prefetching
the page we're about to use before prefetching ahead to the next one (ie.
get the most important transaction started first).  Also prefetch just a
single page ahead rather than leaving a gap of 16.

Jack Steiner reported no problems with SGI's ia64 simulator.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:58 -08:00
Nick Piggin
8dfcc9ba27 [PATCH] mm: split highorder pages
Have an explicit mm call to split higher order pages into individual pages.
 Should help to avoid bugs and be more explicit about the code's intention.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:57 -08:00
Nick Piggin
5e9dace8d3 [PATCH] mm: page_alloc less atomics
More atomic operation removal from page allocator

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:57 -08:00
Andrew Morton
b40607fc02 [PATCH] __get_page_state() cpumask cleanup and fix
__get_page_state() has an open-coded for_each_cpu_mask() loop in it.

Tidy that up, then notice that the code was buggy:

	while (cpu < NR_CPUS) {
		unsigned long *in, *out, off;

		if (!cpu_isset(cpu, *cpumask))
			continue;

an obvious infinite loop.  I guess we just never call it with a holey cpu
mask.

Even after my cpumask size-reduction work, this patch increases code size :(

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:55 -08:00
Christoph Lameter
8fce4d8e3b [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages.
The cache reaper currently tries to free all alien caches and all remote
per cpu pages in each pass of cache_reap.  For a machines with large number
of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
Interrupts are disabled while reclaiming creating unacceptable delays.

This patch changes that behavior by adding a per cpu reap_node variable.
Instead of attempting to free all caches, we free only one alien cache and
the per cpu pages from one remote node.  That reduces the time spend in
cache_reap.  However, doing so will lengthen the time it takes to
completely drain all remote per cpu pagesets and all alien caches.  The
time needed will grow with the number of nodes in the system.  All caches
are drained when they overflow their respective capacity.  So the drawback
here is only that a bit of memory may be wasted for awhile longer.

Details:

1. Rename drain_remote_pages to drain_node_pages to allow the specification
   of the node to drain of pcp pages.

2. Add additional functions init_reap_node, next_reap_node for NUMA
   that manage a per cpu reap_node counter.

3. Add a reap_alien function that reaps only from the current reap_node.

For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
cause some HPC benchmarks to slow down significantly.  F.e.  NAS parallel
slows down dramatically.  NAS parallel has a 12-16 seconds runtime w/o rotor
compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs with
the additional interrupt holdoff reductions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-09 19:47:38 -08:00
Christoph Lameter
9b0f8b040a [PATCH] Terminate process that fails on a constrained allocation
Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints).  If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely going to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior.  The process may be killed but then the
sysadmin or developer can investigate the situation.  The solution is
similar to what we do when running out of hugepages.

This patch adds a check before we kill processes.  At that point
performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes.  If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should kill the
current process.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:09 -08:00
Linus Torvalds
4cf808eb44 [PATCH] Handle holes in node mask in node fallback list setup
Change the find_next_best_node algorithm to correctly skip
over holes in the node online mask. Previously it would not handle
missing nodes correctly and cause crashes at boot.

[Written by Linus, tested by AK]

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 13:27:06 -08:00
Hugh Dickins
d98c7a0984 [PATCH] compound page: default destructor
Somehow I imagined that calling a NULL destructor would free a compound page
rather than oopsing.  No, we must supply a default destructor, __free_pages_ok
using the order noted by prep_compound_page.  hugetlb can still replace this
as before with its own free_huge_page pointer.

The case that needs this is not common: rarely does put_compound_page's
put_page_testzero bring the count down to 0.  But if get_user_pages is applied
to some part of a compound page, without immediate release (e.g.  AIO or
Infiniband), then it's possible for its put_page to come after the containing
vma has been unmapped and the driver done its free_pages.

That's just the kind of case compound pages are supposed to be guarding
against (but Nick points out, nor did PageReserved handle this right).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Hugh Dickins
41d78ba550 [PATCH] compound page: use page[1].lru
If a compound page has its own put_page_testzero destructor (the only current
example is free_huge_page), that is noted in page[1].mapping of the compound
page.  But that's rather a poor place to keep it: functions which call
set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
__ib_umem_release) ought to be checking first, otherwise set_page_dirty is
liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page
needs a destructor, so we can no longer rely on hugetlb pages going their own
special way, to avoid further problems of page->mapping reuse.  For example,
not many people know that: on 50% of i386 -Os builds, the first tail page of a
compound page purports to be PageAnon (when its destructor has an odd
address), which surprises page_add_file_rmap.

Keep the compound page destructor in page[1].lru.next instead.  And to free up
the common pairing of mapping and index, also move compound page order from
index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now
have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
caller to check that they're not compund pages before doing the dirty).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Eric Dumazet
88a2a4ac6b [PATCH] percpu data: only iterate over possible CPUs
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h.  powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Ashok Raj
6292d9aaf3 [PATCH] __cpuinit functions wrongly marked __meminit
__meminit has overzelously been modified and crept its way into marking
cpuup callbacks as __meminit.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:09 -08:00
Christoph Lameter
9eeff2395e [PATCH] Zone reclaim: Reclaim logic
Some bits for zone reclaim exists in 2.6.15 but they are not usable.  This
patch fixes them up, removes unused code and makes zone reclaim usable.

Zone reclaim allows the reclaiming of pages from a zone if the number of
free pages falls below the watermarks even if other zones still have enough
pages available.  Zone reclaim is of particular importance for NUMA
machines.  It can be more beneficial to reclaim a page than taking the
performance penalties that come with allocating a page on a remote zone.

Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch.  By default
RECLAIM_DISTANCE is 20.  20 is the distance to another node in the same
component (enclosure or motherboard) on IA64.  The meaning of the NUMA
distance information seems to vary by arch.

If zone reclaim is not successful then no further reclaim attempts will
occur for a certain time period (ZONE_RECLAIM_INTERVAL).

This patch was discussed before. See

http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Matt Tolentino
c09b42404d [PATCH] x86_64: add __meminit for memory hotplug
Add __meminit to the __init lineup to ensure functions default
to __init when memory hotplug is not enabled.  Replace __devinit
with __meminit on functions that were changed when the memory
hotplug code was introduced.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 23:18:35 -08:00
Greg Ungerer
cbe8dd4af2 [PATCH] memmap_init_zone(): remove uneccesary page++
Remove unecessary page++ from memmap_init_zone loop.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Paul Jackson
4eac915d02 [PATCH] mm: gfp_atomic comments
Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use
emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:09 -08:00
Hugh Dickins
7365f3d169 [PATCH] Restore KERN_EMERG to each line printed by bad_page
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:08 -08:00
David Woodhouse
a4fc7ab1d0 [PATCH] fix/simplify mutex debugging code
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
arguments instead, since all its callers were just calculating the 'to'
address for themselves anyway... (and sometimes doing so badly).

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 08:14:16 -08:00
Ingo Molnar
de5097c2e7 [PATCH] mutex subsystem, more debugging code
more mutex debugging: check for held locks during memory freeing,
task exit, enable sysrq printouts, etc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:21 -08:00
Paul Jackson
3e0d98b9f1 [PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.

This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure.  It's up to the batch
manager or other user code to decide what to do about it and take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Nick Piggin
48db57f8ff [PATCH] mm: free_pages opt
Try to streamline free_pages_bulk by ensuring callers don't pass in a
'count' that exceeds the list size.

Some cleanups:
Rename __free_pages_bulk to __free_one_page.
Put the page list manipulation from __free_pages_ok into free_one_page.
Make __free_pages_ok static.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Nick Piggin
23316bc86f [PATCH] mm: cleanup zone_pcp
Use zone_pcp everywhere even though NUMA code "knows" the internal details
of the zone.  Stop other people trying to copy, and it looks nicer.

Also, only print the pagesets of online cpus in zoneinfo.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Rohit Seth
8ad4b1fb82 [PATCH] Make high and batch sizes of per_cpu_pagelists configurable
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists.  This patch makes these two
variables configurable through /proc interface.

A new tunable /proc/sys/vm/percpu_pagelist_fraction is added.  This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list.  The min value for this is 8.  It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.

The batch value of each per cpu pagelist is also updated as a result.  It
is set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Christoph Lameter
bec6b0c89b [PATCH] slab: remove nested #ifdef CONFIG_NUMA
For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
CONFIG_NUMA in the page allocator.  Remove innermost #ifdef CONFIG_NUMA

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Andrew Morton
84c2008af0 [PATCH] revert "mm: page_state fixes"
Hugh says:

page_alloc_cpu_notify() specifically contains code to

 		/* Add dead cpu's page_states to our own. */

which handles this more efficiently.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:38 -08:00
Nick Piggin
a74609fafa [PATCH] mm: page_state opt
Optimise page_state manipulations by introducing interrupt unsafe accessors
to page_state fields.  Callers must provide their own locking (either
disable interrupts or not update from interrupt context).

Switch over the hot callsites that can easily be moved under interrupts off
sections.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:29 -08:00
Christoph Lameter
070f80326a [PATCH] build_zonelists_node(): rename args
Give j and r meaningful names.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00
Christoph Lameter
02a68a5ebc [PATCH] Fix zone policy determination
The use k in the inner loop means that the highest zone nr is always used
if any zone of a node is populated.  This means that the policy zone is not
correctly determined on arches that do no use HIGHMEM like ia64.

Change the loop to decrement k which also simplifies the BUG_ON.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00
Christoph Lameter
4be38e351c [PATCH] mm: move determination of policy_zone into page allocator
Currently the function to build a zonelist for a BIND policy has the side
effect to set the policy_zone.  This seems to be a bit strange.  policy
zone seems to not be initialized elsewhere and therefore 0.  Do we police
ZONE_DMA if no bind policy has been used yet?

This patch moves the determination of the zone to apply policies to into
the page allocator.  We determine the zone while building the zonelist for
nodes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00
Christoph Lameter
1a93205bdf [PATCH] mm: simplify build_zonelists_node by removing the case statement.
Simplify build_zonelists_node by removing the case statement.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00
Con Kolivas
f3fe65122d [PATCH] mm: add populated_zone() helper
There are numerous places we check whether a zone is populated or not.

Provide a helper function to check for populated zones and convert all
checks for zone->present_pages.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00
Nick Piggin
224abf92b2 [PATCH] mm: bad_page optimisation
Cut down size slightly by not passing bad_page the function name (it should be
able to be determined by dump_stack()).  And cut down the number of printks in
bad_page.

Also, cut down some branching in the destroy_compound_page path.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:26 -08:00
Nick Piggin
9328b8faae [PATCH] mm: dma32 zone statistics
Add dma32 to zone statistics.  Also attempt to arrange struct page_state a
bit better (visually).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:26 -08:00
David Howells
a226f6c899 [PATCH] FRV: Clean up bootmem allocator's page freeing algorithm
The attached patch cleans up the way the bootmem allocator frees pages.

A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is
called from mm/bootmem.c to turn pages over to the main allocator.  All the
bits of code to initialise pages (clearing PG_reserved and setting the page
count) are moved to here.  The checks on page validity are removed, on the
assumption that the struct page arrays will have been prepared correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:26 -08:00
Nick Piggin
085cc7d5de [PATCH] mm: page_alloc cleanups
Small cleanups that does not change generated code with the gcc's I've tested
with.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
a86b1f5316 [PATCH] mm: page_state fixes
read_page_state and __get_page_state only traverse online CPUs, which will
cause results to fluctuate when CPUs are plugged in or out.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
2d92c5c915 [PATCH] mm: remove pcp low
struct per_cpu_pages.low is useless.  Remove it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
13e7444b0e [PATCH] mm: remove bad_range
bad_range is supposed to be a temporary check.  It would be a pity to throw it
out.  Make it depend on CONFIG_DEBUG_VM instead.

CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the
page allocator.  Add that to page_is_buddy instead.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
92be2e33b1 [PATCH] mm: microopt conditions
Micro optimise some conditionals where we don't need lazy evaluation.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
77a8a78834 [PATCH] mm: set_page_refs opt
Inline set_page_refs.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Nick Piggin
c54ad30c78 [PATCH] mm: pagealloc opt
Slightly optimise some page allocation and freeing functions by taking
advantage of knowing whether or not interrupts are disabled.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:25 -08:00
Mike Kravetz
a94b3ab7ea [PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES
The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM
could handle pSeries numa layouts.  However, support for DISCONTIGMEM has
been replaced by SPARSEMEM on powerpc.  As a result, this config option and
supporting code is no longer needed.

I have already sent a patch to Paul that removes the option from powerpc
specific code.  This removes the arch independent piece.  Doesn't really
matter which is applied first.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:24 -08:00
Paul Jackson
47f3a867f6 [PATCH] mm: fix __alloc_pages cpuset ALLOC_* flags
Two changes to the setting of the ALLOC_CPUSET flag in
mm/page_alloc.c:__alloc_pages()

- A bug fix - the "ignoring mins" case should not be honoring ALLOC_CPUSET.
  This case of all cases, since it is handling a request that will free up
  more memory than is asked for (exiting tasks, e.g.) should be allowed to
  escape cpuset constraints when memory is tight.

- A logic change to make it simpler.  Honor cpusets even on GFP_ATOMIC
  (!wait) requests.  With this, cpuset confinement applies to all requests
  except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can
  remove the ALLOC_CPUSET flag entirely.  Since I don't know any real reason
  this logic has to be either way, I am choosing the path of the simplest
  code.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:21 -08:00
Al Viro
78d9955bb0 [PATCH] missing prototype (mm/page_alloc.c)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-15 10:04:30 -08:00
Nick Piggin
0ceaacc978 [PATCH] Fix up per-cpu page batch sizes
The code to clamp batch sizes to 2^n - 1 went missing and an extra
check got added, which must have been a hunk of the "higer order pcp
batch refills" work sneaking in.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-03 20:46:40 -08:00
Nick Piggin
3148890bfa [PATCH] mm: __alloc_pages cleanup fix
I believe this patch is required to fix breakage in the asynch reclaim
watermark logic introduced by this patch:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee

Just some background of the watermark logic in case it isn't clear...
Basically what we have is this:

 ---  pages_high
   |
   | (a)
   |
 ---  pages_low
   |
   | (b)
   |
 ---  pages_min
   |
   | (c)
   |
 ---  0

Now when pages_low is reached, we want to kick asynch reclaim, which gives us
an interval of "b" before we must start synch reclaim, and gives kswapd an
interval of "a" before it need go back to sleep.

When pages_min is reached, normal allocators must enter synch reclaim, but
PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie.  atomic allocations, recursive
allocations, etc.) get access to varying amounts of the reserve "c".

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:24 -08:00
Hugh Dickins
689bcebfda [PATCH] unpaged: PG_reserved bad_page
It used to be the case that PG_reserved pages were silently never freed, but
in 2.6.15-rc1 they may be freed with a "Bad page state" message.  We should
work through such cases as they appear, fixing the code; but for now it's
safer to issue the message without freeing the page, leaving PG_reserved set.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
664beed019 [PATCH] unpaged: unifdefed PageCompound
It looks like snd_xxx is not the only nopage to be using PageReserved as a way
of holding a high-order page together: which no longer works, but is masked by
our failure to free from VM_RESERVED areas.  We cannot fix that bug without
first substituting another way to hold the high-order page together, while
farming out the 0-order pages from within it.

That's just what PageCompound is designed for, but it's been kept under
CONFIG_HUGETLB_PAGE.  Remove the #ifdefs: which saves some space (out- of-line
put_page), doesn't slow down what most needs to be fast (already using
hugetlb), and unifies the way we handle high-order pages.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Jens Axboe
6b1de9161e [PATCH] VM: fix zone list restart in page allocatate
We must reassign z before looping through the zones kicking kswapd,
since it will be NULL if we hit an OOM condition and jump back to the
beginning again. 'z' is initially assigned before the restart: label. So
move the restart label up a little.

Signed-off-by: Jens Axboe <axboe@suse.de>
2005-11-17 12:43:01 -08:00
Linus Torvalds
4060994c3e Merge x86-64 update from Andi 2005-11-14 19:56:02 -08:00
Andi Kleen
07808b74e7 [PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_t
Has been introduced for x86-64 at some point to save memory
in struct page, but has been obsolete for some time. Just
remove it.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:14 -08:00
Andi Kleen
b0d4169321 [PATCH] x86_64: When cpu_up fails clean up page allocator properly
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:13 -08:00
Andi Kleen
a2f1b42490 [PATCH] x86_64: Add 4GB DMA32 zone
Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.

As a bit of historical background: when the x86-64 port
was originally designed we had some discussion if we should
use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or
both. Both was ruled out at this point because it was in early
2.4 when VM is still quite shakey and had bad troubles even
dealing with one DMA zone.  We settled on the 16MB DMA zone mainly
because we worried about older soundcards and the floppy.

But this has always caused problems since then because
device drivers had trouble getting enough DMA able memory. These days
the VM works much better and the wide use of NUMA has proven
it can deal with many zones successfully.

So this patch adds both zones.

This helps drivers who need a lot of memory below 4GB because
their hardware is not accessing more (graphic drivers - proprietary
and free ones, video frame buffer drivers, sound drivers etc.).
Previously they could only use IOMMU+16MB GFP_DMA, which
was not enough memory.

Another common problem is that hardware who has full memory
addressing for >4GB misses it for some control structures in memory
(like transmit rings or other metadata).  They tended to allocate memory
in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent,
but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory
(even on AMD systems the IOMMU tends to be quite small) especially if you have
many devices.  With the new zone pci_alloc_consistent can just put
this stuff into memory below 4GB which works better.

One argument was still if the zone should be 4GB or 2GB. The main
motivation for 2GB would be an unnamed not so unpopular hardware
raid controller (mostly found in older machines from a particular four letter
company) who has a strange 2GB restriction in firmware. But
that one works ok with swiotlb/IOMMU anyways, so it doesn't really
need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
it seems to be the most common restriction.

The new zone is so far added only for x86-64.

For other architectures who don't set up this
new zone nothing changes. Architectures can set a compatibility
define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32
as GFP_DMA. Otherwise it's a nop because on 32bit architectures
it's normally not needed because GFP_NORMAL (=0) is DMA able
enough.

One problem is still that GFP_DMA means different things on different
architectures. e.g. some drivers used to have #ifdef ia64  use GFP_DMA
(trusting it to be 4GB) #elif __x86_64__ (use other hacks like
the swiotlb because 16MB is not enough) ... . This was quite
ugly and is now obsolete.

These should be now converted to use GFP_DMA32 unconditionally. I haven't done
this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent
which will use GFP_DMA32 transparently.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:13 -08:00
Nick Piggin
669ed17521 [PATCH] mm: highmem watermarks
The pages_high - pages_low and pages_low - pages_min deltas are the asynch
reclaim watermarks.  As such, the should be in the same ratios as any other
zone for highmem zones.  It is the pages_min - 0 delta which is the
PF_MEMALLOC reserve, and this is the region that isn't very useful for
highmem.

This patch ensures highmem systems have similar characteristics as non highmem
ones with the same amount of memory, and also that highmem zones get similar
reclaim pressures to other zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Rohit Seth
7fb1d9fca5 [PATCH] mm: __alloc_pages cleanup
Clean up of __alloc_pages.

Restoration of previous behaviour, plus further cleanups by introducing an
'alloc_flags', removing the last of should_reclaim_zone.

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Kirill Korotaev
885036d32f [PATCH] mm: __GFP_NOFAIL fix
In __alloc_pages():

if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
         /* go through the zonelist yet again, ignoring mins */
         for (i = 0; zones[i] != NULL; i++) {
                 struct zone *z = zones[i];

                 page = buffered_rmqueue(z, order, gfp_mask);
                 if (page) {
                         zone_statistics(zonelist, z);
                         goto got_pg;
                 }
         }
         goto nopage;                <<<< HERE!!! FAIL...
}

kswapd (which has PF_MEMALLOC flag) can fail to allocate memory even when
it allocates it with __GFP_NOFAIL flag.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Dave Jones
6b482c6779 [PATCH] Don't print per-cpu vm stats for offline cpus.
I just hit a page allocation error on a kernel configured to support
64 CPUs.  It spewed 60 completely useless unnecessary lines of info.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-10 13:25:53 -08:00
Adrian Bunk
f8b8db77b0 [PATCH] unexport nr_swap_pages
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
John Hawkes
2f96996de0 [PATCH] mm: wider use of for_each_*cpu()
In 'mm' change the explicit use of a for-loop using NR_CPUS into the
general for_each_cpu() constructs.  This widens the scope of potential
future optimizations of the general constructs, as well as takes advantage
of the existing optimizations of first_cpu() and next_cpu(), which is
advantageous when the true CPU count is much smaller than NR_CPUS.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:45 -07:00
Dave Hansen
3947be1969 [PATCH] memory hotplug: sysfs and add/remove functions
This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.

Individual architecture patches will follow.

For now, disable memory hotplug when swsusp is enabled.  There's a lot of
churn there right now.  We'll fix it up properly once it calms down.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
bdc8cb9845 [PATCH] memory hotplug locking: zone span seqlock
See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
208d54e551 [PATCH] memory hotplug locking: node_size_lock
pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.

Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem().  However, they are all included for completeness.  This
should also make any future consolidation of all of the implementations a
little more straightforward.

This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated.  This is the place there pfn_valid() is made false
for a memory area that's being removed.  The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
c6a57e19e4 [PATCH] memory hotplug prep: fixup bad_range()
When doing memory hotplug operations, the size of existing zones can obviously
change.  This means that zone->zone_{start_pfn,spanned_pages} can change.

There are currently no locks that protect these structure members.  However,
they are rarely accessed at runtime.  Outside of swsusp, the only place that I
can find is bad_range().

So, split bad_range() up into two pieces: one that needs to be locked and
anther that doesn't.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
ed8ece2ec8 [PATCH] memory hotplug prep: break out zone initialization
If a zone is empty at boot-time and then hot-added to later, it needs to run
the same init code that would have been run on it at boot.

This patch breaks out zone table and per-cpu-pages functions for use by the
hotplug code.  You can almost see all of the free_area_init_core() function on
one page now.  :)

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Hugh Dickins
4c21e2f244 [PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Nick Piggin
b5810039a5 [PATCH] core remove PageReserved
Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.

PageReserved special casing is removed from get_page and put_page.

All setting and clearing of PageReserved is retained, and it is now flagged
in the page_alloc checks to help ensure we don't introduce any refcount
based freeing of Reserved pages.

MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
deprecated.  We never completely handled it correctly anyway, and is be
reintroduced in future if required (Hugh has a proof of concept).

Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
be trivially removed.

Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not.  This still
needs to be addressed (a generic page_is_ram() should work).

A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
thus mapcounted and count towards shared rss).  These writes to the struct
page could cause excessive cacheline bouncing on big systems.  There are a
number of ways this could be addressed if it is an issue.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Refcount bug fix for filemap_xip.c

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:39 -07:00
Seth, Rohit
e46a5e28c2 [PATCH] mm: set per-cpu-pages lower threshold to zero
Set the low water mark for hot pages in pcp to zero.

(akpm: for the life of me I cannot remember why we created pcp->low.  Neither
can Martin and the changelog is silent.  Maybe it was just a brainfart, but I
have this feeling that there was a reason.  If not, we should remove the
fields completely.  We'll see.)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:35 -07:00
Seth, Rohit
ba56e91c94 [PATCH] mm: page_alloc: increase size of per-cpu-pages
Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.

Over 100+ runs for a workload, the difference in mean is about 2%.  The best
results for both are almost same.  Though the max variation in results with
1/2MB is only 2.2%, whereas with 1/4MB it is 12%.

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:35 -07:00
Al Viro
260b23674f [PATCH] gfp_t: the rest
zone handling, mapping->flags handling

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28 08:16:51 -07:00
Al Viro
af4ca457ea [PATCH] gfp_t: infrastructure
Beginning of gfp_t annotations:

 - -Wbitwise added to CHECKFLAGS
 - old __bitwise renamed to __bitwise__
 - __bitwise defined to either __bitwise__ or nothing, depending on
   __CHECK_ENDIAN__ being defined
 - gfp_t switched from __nocast to __bitwise__
 - force cast to gfp_t added to __GFP_... constants
 - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts
   the result to int

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28 08:16:46 -07:00
Magnus Damm
1c6fe94659 [PATCH] NUMA: broken per cpu pageset counters
The NUMA counters in struct per_cpu_pageset (linux/mmzone.h) are never
cleared today.  This works ok for CPU 0 on NUMA machines because
boot_pageset[] is already zero, but for other CPU:s this results in
uninitialized counters.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26 10:39:43 -07:00
Al Viro
dd0fc66fb3 [PATCH] gfp flags annotations - part 1
- added typedef unsigned int __nocast gfp_t;

 - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
   the same warnings as far as sparse is concerned, doesn't change
   generated code (from gcc point of view we replaced unsigned int with
   typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 15:00:57 -07:00
Randy Dunlap
9f1583339a [PATCH] use add_taint() for setting tainted bit flags
Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13 08:22:29 -07:00
Renaud Lienhart
207f36eec9 [PATCH] remove invalid comment in mm/page_alloc.c
free_pages_bulk() doesn't free the entire list if count == 0.

Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:31 -07:00
Paul Jackson
9bf2229f88 [PATCH] cpusets: formalize intermediate GFP_KERNEL containment
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution.  With this patch, there are now the following four layers of
memory placement available:

 1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
 3) The current tasks cpuset (GFP_USER allocations constrained to here), and
 4) Specific node placement, using mbind and set_mempolicy.

These nest - each layer is a subset (same or within) of the previous.

Layer (2) above is new, with this patch.  The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed.  The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.

GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.

The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset.  Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.

This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.

Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:40 -07:00
Ravikiran G Thirumalai
6c231b7bab [PATCH] Additions to .data.read_mostly section
Mark variables which are usually accessed for reads with __readmostly.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:33 -07:00
Christoph Lameter
c3d8c14145 [PATCH] More __read_mostly variables
Move some more frequently read variables that showed up during some of our
performance tests as sometimes ending up in hot cachelines to the
read_mostly section.

Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
variable like the other uses of __read_mostly.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:18 -07:00
Martin Hicks
c07e02db76 [PATCH] VM: add page_state info to per-node meminfo
Add page_state info to the per-node meminfo file in sysfs.  This is mostly
just for informational purposes.

The lack of this information was brought up recently during a discussion
regarding pagecache clearing, and I put this patch together to test out one
of the suggestions.

It seems like interesting info to have, so I'm submitting the patch.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:49 -07:00
Martin Hicks
53e9a6159f [PATCH] VM: zone reclaim atomic ops cleanup
Christoph Lameter and Marcelo Tosatti asked to get rid of the
atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Nick Piggin
242e546862 [PATCH] mm: remove atomic
This bitop does not need to be atomic because it is performed when there will
be no references to the page (ie.  the page is being freed).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Martin J. Bligh
e310fd4325 [PATCH] Fix NUMA node sizing in nr_free_zone_pages
We are iterating over all nodes in nr_free_zone_pages().  Because the
fallback zonelists contain all nodes in the system, and we walk all the
zonelists, we're counting memory multiple times (once for each node).  This
caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes
all the dirty ratio calculations, etc incorrect.

There's still a further bug to fix from e820 holes causing overestimation
as well, but this fix is separate, and good as is, and fixes one class of
problems.  Problem found by Badari, and tested by Ram Pai - thanks!

Signed-off-by:  Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by:  Matt Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-30 10:14:46 -07:00
Andy Whitcroft
12b1c5f382 [PATCH] Remove bogus warning in page_alloc.c
Originally __free_pages_bulk used the relative page number within a zone to
define its buddies.  This meant that to maintain the "maximally aligned"
requirements (that an allocation of size N will be aligned at least to N
physically) zones had to also be aligned to 1<<MAX_ORDER pages.  When
__free_pages_bulk was updated to use the relative page frame numbers of the
free'd pages to pair buddies this released the alignment constraint on the
'left' edge of the zone.  This allows _either_ edge of the zone to contain
partial MAX_ORDER sized buddies.  These simply never will have matching
buddies and thus will never make it to the 'top' of the pyramid.

The patch below removes a now redundant check ensuring that the mem_map was
aligned to MAX_ORDER.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:54 -07:00
Marcelo Tosatti
37b173a4d0 [PATCH] remove completly bogus comment inside __alloc_pages() try_to_free_pages handling
Remove completly bogus comment from did_some_progress != 0 handling (that
same comment is a few lines below on did_some_progress = 0 case, where it
belongs).

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:35 -07:00
Marcelo Tosatti
79b9ce311e [PATCH] print order information when OOM killing
Dump the current allocation order when OOM killing.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:35 -07:00
Bob Picco
3212c6be25 [PATCH] fix WANT_PAGE_VIRTUAL in memmap_init
I spotted this issue while in memmap_init last week.  I can't say the
change has any test coverage by me.  start_pfn was formerly used in main
"for" loop.  The fix is replace start_pfn with pfn.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:11:42 -07:00
Andy Whitcroft
29751f6991 [PATCH] sparsemem hotplug base
Make sparse's initalization be accessible at runtime.  This allows sparse
mappings to be created after boot in a hotplug situation.

This patch is separated from the previous one just to give an indication how
much of the sparse infrastructure is *just* for hotplug memory.

The section_mem_map doesn't really store a pointer.  It stores something that
is convenient to do some math against to get a pointer.  It isn't valid to
just do *section_mem_map, so I don't think it should be stored as a pointer.

There are a couple of things I'd like to store about a section.  First of all,
the fact that it is !NULL does not mean that it is present.  There could be
such a combination where section_mem_map *is* NULL, but the math gets you
properly to a real mem_map.  So, I don't think that check is safe.

Since we're storing 32-bit-aligned structures, we have a few bits in the
bottom of the pointer to play with.  Use one bit to encode whether there's
really a mem_map there, and the other one to tell whether there's a valid
section there.  We need to distinguish between the two because sometimes
there's a gap between when a section is discovered to be present and when we
can get the mem_map for it.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:05 -07:00
Andy Whitcroft
641c767389 [PATCH] sparsemem swiss cheese numa layouts
The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem.  It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn.  It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.

Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts.  The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:

	Node 0: +++++     +++++
	Node 1:      +++++

Previous behavior before Mike's patch was to assign nodes like this:

	Node 0: 00000     XXXXX
	Node 1:      11111

Where the 'X' areas were simply thrown away.  The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0: 000000000000000 Node 1: 11111

This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory.  memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages.  However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:

	struct page *start = pfn_to_page(start_pfn);
	// effectively start = &node->node_mem_map[0]
	for (page = start; page < (start + size); page++) {
		init_page_here();...
		page++;
	}

Slow, and wasteful, but generally harmless.

But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):

	for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
		page = pfn_to_page(pfn);
	}

And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0.  This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:05 -07:00
Andy Whitcroft
d41dee369b [PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.

A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA.  When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.

Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous.  It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.

Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
to be chopped up.

In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags.  Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations.  However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions.  Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.

One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags.  It might provide
speed increases on certain platforms and will be stored there if there is
room.  But, if out of room, an alternate (theoretically slower) mechanism is
used.

This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:04 -07:00
Dave Hansen
93b7504e3e [PATCH] Introduce new Kconfig option for NUMA or DISCONTIG
There is some confusion that arose when working on SPARSEMEM patch between
what is needed for DISCONTIG vs. NUMA.

Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
All of the current NUMA implementations require an implementation of
DISCONTIG.  Because of this, quite a lot of code which is really needed for
NUMA is actually under DISCONTIG #ifdefs.  For SPARSEMEM, we changed some
of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
case.

Introducing this new NEED_MULTIPLE_NODES config option allows code that is
needed for both NUMA or DISCONTIG to be separated out from code that is
specific to DISCONTIG.

One great advantage of this approach is that it doesn't require every
architecture to be converted over.  All of the current implementations
should "just work", only the ones implementing SPARSEMEM will have to be
fixed up.

The change to free_area_init() makes it work inside, or out of the new
config option.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:03 -07:00
Dave Hansen
348f8b6c48 [PATCH] sparsemem base: reorganize page->flags bit operations
Generify the value fields in the page_flags.  The aim is to allow the location
and size of these fields to be varied.  Additionally we want to move away from
fixed allocations per field whilst still enforcing the overall bit utilisation
limits.  We rely on the compiler to spot and optimise the accessor functions.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:01 -07:00
Dave Hansen
6f167ec721 [PATCH] sparsemem base: simple NUMA remap space allocator
Introduce a simple allocator for the NUMA remap space.  This space is very
scarce, used for structures which are best allocated node local.

This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.

Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
  to allow us to allocate the pgdat's using this mechanism; which we do not yet
  do.  Could have alloc_remap_node() and alloc_remap_nid() for this purpose.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:01 -07:00
Christoph Lameter
b7c84c6ada [PATCH] boot_pageset must not be freed.
The boot_pageset needs to be preserved for hotplugging and for off line
processors and nodes. Otherwise pointers will point into memory that has
now a different use. /proc/zoneinfo is currently showing strange results
if processors / nodes are not present.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22 20:42:32 -07:00
Denis Vlasenko
c0d62219a4 [PATCH] Kill stray newline
OOM killer prints a stray newline.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Hugh Dickins
334795eca4 [PATCH] bad_page: clear reclaim and slab
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:19 -07:00
Christoph Lameter
2caaad41e4 [PATCH] Reduce size of huge boot per_cpu_pageset
Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset.  Use one specially
configured pageset per cpu for all zones and nodes during bootup.

- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
  in free_hot_cold_page (otherwise we would have to hold a page
  in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
  for the boot pageset. A boot pageset is not necessary for cpu hotplug.

Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:18 -07:00
Christoph Lameter
4ae7c03943 [PATCH] Periodically drain non local pagesets
The pageset array can potentially acquire a huge amount of memory on large
NUMA systems.  F.e.  on a system with 512 processors and 256 nodes there
will be 256*512 pagesets.  If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets.  Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.

The slab allocator flushes its per cpu caches every 2 seconds.  The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:18 -07:00
Janet Morgan
578c2fd6a7 [PATCH] add OOM debug
This patch provides more debug info when the system is OOM.  It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.

Thanks to Dave Jones for coming up with the idea.

Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Benjamin LaHaise
c2f29ea111 [PATCH] __read_page_state(): pass unsigned long instead of unsigned
By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Benjamin LaHaise
83e5d8f725 [PATCH] __mod_page_state(): pass unsigned long instead of unsigned
By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Darren Hart
1ad539b2bd [PATCH] vm: try_to_free_pages unused argument
try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0.  The following patch removes the argument and updates all the
calls to try_to_free_pages.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Christoph Lameter
e7c8d5c995 [PATCH] node local per-cpu-pages
This patch modifies the way pagesets in struct zone are managed.

Each zone has a per-cpu array of pagesets.  So any particular CPU has some
memory in each zone structure which belongs to itself.  Even if that CPU is
not local to that zone.

So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone.  This means that the operations to manage pages on remote zone can be
done with information available locally.

We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.

AIM7 benchmark on a 32 CPU SGI Altix

w/o patches:
Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      484.68  100       484.6769     12.01      1.97   Fri Mar 25 11:01:42 2005
  100    27140.46   89       271.4046     21.44    148.71   Fri Mar 25 11:02:04 2005
  200    30792.02   82       153.9601     37.80    296.72   Fri Mar 25 11:02:42 2005
  300    32209.27   81       107.3642     54.21    451.34   Fri Mar 25 11:03:37 2005
  400    34962.83   78        87.4071     66.59    588.97   Fri Mar 25 11:04:44 2005
  500    31676.92   75        63.3538     91.87    742.71   Fri Mar 25 11:06:16 2005
  600    36032.69   73        60.0545     96.91    885.44   Fri Mar 25 11:07:54 2005
  700    35540.43   77        50.7720    114.63   1024.28   Fri Mar 25 11:09:49 2005
  800    33906.70   74        42.3834    137.32   1181.65   Fri Mar 25 11:12:06 2005
  900    34120.67   73        37.9119    153.51   1325.26   Fri Mar 25 11:14:41 2005
 1000    34802.37   74        34.8024    167.23   1465.26   Fri Mar 25 11:17:28 2005

with slab API changes and pageset patch:

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      485.00  100       485.0000     12.00      1.96   Fri Mar 25 11:46:18 2005
  100    28000.96   89       280.0096     20.79    150.45   Fri Mar 25 11:46:39 2005
  200    32285.80   79       161.4290     36.05    293.37   Fri Mar 25 11:47:16 2005
  300    40424.15   84       134.7472     43.19    438.42   Fri Mar 25 11:47:59 2005
  400    39155.01   79        97.8875     59.46    590.05   Fri Mar 25 11:48:59 2005
  500    37881.25   82        75.7625     76.82    730.19   Fri Mar 25 11:50:16 2005
  600    39083.14   78        65.1386     89.35    872.79   Fri Mar 25 11:51:46 2005
  700    38627.83   77        55.1826    105.47   1022.46   Fri Mar 25 11:53:32 2005
  800    39631.94   78        49.5399    117.48   1169.94   Fri Mar 25 11:55:30 2005
  900    36903.70   79        41.0041    141.94   1310.78   Fri Mar 25 11:57:53 2005
 1000    36201.23   77        36.2012    160.77   1458.31   Fri Mar 25 12:00:34 2005

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:16 -07:00
Martin Hicks
1e7e5a9048 [PATCH] VM: rate limit early reclaim
When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory.  This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Martin Hicks
0c35bbadc5 [PATCH] VM: add __GFP_NORECLAIM
When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.

This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts.  The example that is implemented here is for page
cache pages.  We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.

Signed-off-by:  Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Martin Hicks
753ee72896 [PATCH] VM: early zone reclaim
This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Nikita Danilov
295ab93497 [PATCH] mm: add /proc/zoneinfo
Add /proc/zoneinfo file to display information about memory zones.  Useful
to analyze VM behaviour.

Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Dean Nelson
7223a93a53 [IA64] Export node_online_map and node_possible_map
Export node_online_map and node_possible_map so that kernel modules can use
the nodemask macros, like, for_each_node() and for_each_online_node().

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-05-03 12:09:32 -07:00
Pavel Pisa
4dc3b16ba1 [PATCH] DocBook: changes and extensions to the kernel documentation
I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again.  The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels.  I
have tried to proceed with that task.  I have done that more times from 2.6.0
time and it gets boring to do same changes again and again.  Linux kernel
compiles after changes for i386 and ARM targets.  I have added references to
some more files into kernel-api book, I have added some section names as well.
 So please, check that changes do not break something and that categories are
not too much skewed.

I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention.  Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors.  Changed <pid> to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.

You can see result of the modified documentation build at
  http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

Some more sources are ready to be included into kernel-doc generated
documentation.  Sources has been added into kernel-api for now.  Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:25 -07:00
KAMEZAWA Hiroyuki
edfbe2b003 [PATCH] count bounce buffer pages in vmstat
This is a patch for counting the number of pages for bounce buffers.  It's
shown in /proc/vmstat.

Currently, the number of bounce pages are not counted anywhere.  So, if
there are many bounce pages, it seems that there are leaked pages.  And
it's difficult for a user to imagine the usage of bounce pages.  So, it's
meaningful to show # of bouce pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:58:37 -07:00
Nick Piggin
b84a35be02 [PATCH] mempool: NOMEMALLOC and NORETRY
Mempools have 2 problems.

The first is that mempool_alloc can possibly get stuck in __alloc_pages
when they should opt to fail, and take an element from their reserved pool.

The second is that it will happily eat emergency PF_MEMALLOC reserves
instead of going to their reserved pools.

Fix the first by passing __GFP_NORETRY in the allocation calls in
mempool_alloc.  Fix the second by introducing a __GFP_MEMPOOL flag which
directs the page allocator not to allocate from the reserve pool.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:58:36 -07:00
Nick Piggin
8e30f272a9 [PATCH] mm: pcp use non powers of 2 for batch size
Jack Steiner reported this to have fixed his problem (bad colouring):
"The patches fix both problems that I found - bad
 coloring & excessive pages in pagesets."

In most workloads this is not likely to be such a pronounced problem,
however it should help corner cases.  And avoiding powers of 2 in these
types of memory operations is always a good idea.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:58:36 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00