4.12-rc ppc64 4k-page needs costly allocations

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
I find that swapping loads on ppc64 on G5 with 4k pages are failing:

SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
  cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
  pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
  node 0: slabs: 209, objs: 209, free: 8
gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
Call Trace:
[c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
[c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
[c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
[c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
[c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
[c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
[c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
[c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
[c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
[c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
[c00000000090be30] [c00000000000a118] system_call+0x38/0xfc

I did try booting with slub_debug=O as the message suggested, but that
made no difference: it still hoped for but failed on order:4 allocations.

I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
it seemed to be a hard requirement for something, but I didn't find what.

I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
the expected order:3, which then results in OOM-killing rather than direct
allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
makes no real difference to the outcome: swapping loads still abort early.

Relying on order:3 or order:4 allocations is just too optimistic: ppc64
with 4k pages would do better not to expect to support a 128TB userspace.

I tried the obvious partial revert below, but it's not good enough:
the system did not boot beyond

Starting init: /sbin/init exists but couldn't execute it (error -7)
Starting init: /bin/sh exists but couldn't execute it (error -7)
Kernel panic - not syncing: No working init found. ...

--- 4.12-rc2/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ linux/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE (sizeof(pte_t) << H_PTE_INDEX_SIZE)
--- 4.12-rc2/arch/powerpc/include/asm/processor.h
+++ linux/arch/powerpc/include/asm/processor.h
@@ -110,7 +110,7 @@ void release_thread(struct task_struct *
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Michael Ellerman-2
Hugh Dickins <[hidden email]> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Ouch, sorry.

I boot test a G5 with 4K pages, but I don't stress test it much so I
didn't notice this.

I think making 128TB depend on 64K pages makes sense, Aneesh is going to
try and do a patch for that.

cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
In reply to this post by Hugh Dickins-2
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
In reply to this post by Michael Ellerman-2
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should be
order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:

> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
>
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail.  I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented.  What more output are you looking for?

>
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
>
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

>
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
>
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too?  If so, then we ought to dig into it further.

>
> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.

>
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
>
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Mathieu Malaterre
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins <[hidden email]> wrote:

> [ Merging two mails into one response ]
>
> On Wed, 31 May 2017, Christoph Lameter wrote:
>> On Tue, 30 May 2017, Hugh Dickins wrote:
>> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>>
>> > I did try booting with slub_debug=O as the message suggested, but that
>> > made no difference: it still hoped for but failed on order:4 allocations.
>>
>> I am curious as to what is going on there. Do you have the output from
>> these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?
>
>>
>> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
>> > it seemed to be a hard requirement for something, but I didn't find what.
>>
>> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
>> be able to enable it at runtime.
>
> Yes, I thought so.
>
>>
>> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
>> > the expected order:3, which then results in OOM-killing rather than direct
>> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
>> > makes no real difference to the outcome: swapping loads still abort early.
>>
>> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>>
>> Ahh. Ok debugging increased the object size to an order 4. This should be
>> order 3 without debugging.
>
> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.
>
>>
>> Why are the slab allocators used to create slab caches for large object
>> sizes?
>
> There may be more optimal ways to allocate, but I expect that when
> the ppc guys are writing the code to handle both 4k and 64k page sizes,
> kmem caches offer the best span of possibility without complication.
>
>>
>> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
>> > with 4k pages would do better not to expect to support a 128TB userspace.
>>
>> I thought you had these huge 64k page sizes?
>
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:

https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Aneesh Kumar K.V-3
In reply to this post by Hugh Dickins-2
Hugh Dickins <[hidden email]> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...
>

Can you try this patch.

commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
Author: Aneesh Kumar K.V <[hidden email]>
Date:   Thu Jun 1 08:06:40 2017 +0530

    powerpc/mm/4k: Limit 4k page size to 64TB
   
    Supporting 512TB requires us to do a order 3 allocation for level 1 page
    table(pgd). Limit 4k to 64TB for now.
   
    Signed-off-by: Aneesh Kumar K.V <[hidden email]>

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b671ca..0c4e470571ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE (sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f291ab0..5de3271026f1 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
-#define TASK_SIZE_USER64 TASK_SIZE_512TB
+#define TASK_SIZE_USER64 TASK_SIZE_512TB
+#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_128TB
 #else
-#define TASK_SIZE_USER64 TASK_SIZE_64TB
+#define TASK_SIZE_USER64 TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_64TB
 #endif
 
 /*
@@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
-#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
+#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
 
 #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
  TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
@@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
  * with 128TB and conditionally enable upto 512TB
  */
 #ifdef CONFIG_PPC_BOOK3S_64
-#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
- TASK_SIZE_USER32 : TASK_SIZE_128TB)
+#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
+ TASK_SIZE_USER32 : DEFAULT_MAP_WINDOW_USER64)
 #else
 #define DEFAULT_MAP_WINDOW TASK_SIZE
 #endif
@@ -153,7 +155,7 @@ void release_thread(struct task_struct *);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /* Limit stack to 128TB */
-#define STACK_TOP_USER64 TASK_SIZE_128TB
+#define STACK_TOP_USER64 DEFAULT_MAP_WINDOW_USER64
 #else
 #define STACK_TOP_USER64 TASK_SIZE_USER64
 #endif
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 8389ff5ac002..77062461c469 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -921,7 +921,7 @@ void __init setup_arch(char **cmdline_p)
 
 #ifdef CONFIG_PPC_MM_SLICES
 #ifdef CONFIG_PPC64
- init_mm.context.addr_limit = TASK_SIZE_128TB;
+ init_mm.context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
 #else
 #error "context.addr_limit not initialized."
 #endif
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2ae78ef..a3edf813d455 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -99,7 +99,7 @@ static int hash__init_new_context(struct mm_struct *mm)
  * mm->context.addr_limit. Default to max task size so that we copy the
  * default values to paca which will help us to handle slb miss early.
  */
- mm->context.addr_limit = TASK_SIZE_128TB;
+ mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
 
  /*
  * The old code would re-promote on fork, we don't do that when using
 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
In reply to this post by Hugh Dickins-2


> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?

The output for the failing allocations when you disabling debugging. For
that I would think that you need remove(!) the slub_debug statement on the kernel
command line. You can verify that debug is off by inspecting the values in
/sys/kernel/slab/<yourcache>/<debug option>

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.

No it does no longer. I dont think slub_debug=O does disable debugging
(frankly I am not sure what it does). Please do not specify any debug options.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
In reply to this post by Aneesh Kumar K.V-3
On Thu, 1 Jun 2017, Aneesh Kumar K.V wrote:

> Hugh Dickins <[hidden email]> writes:
>
> > Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> > I find that swapping loads on ppc64 on G5 with 4k pages are failing:
> >
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
> >   node 0: slabs: 209, objs: 209, free: 8
> > gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> > CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> > Call Trace:
> > [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> > [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> > [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> > [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> > [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> > [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> > [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> > [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> > [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> > [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> > [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
> >
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
> >
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
> >
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
> >
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
> >
> > I tried the obvious partial revert below, but it's not good enough:
> > the system did not boot beyond
> >
> > Starting init: /sbin/init exists but couldn't execute it (error -7)
> > Starting init: /bin/sh exists but couldn't execute it (error -7)
> > Kernel panic - not syncing: No working init found. ...
> >
>
> Can you try this patch.

Thanks!  By the time I got to try it, you'd sent another later in the
day.  Fractionally different, and I didn't spend any time working out
whether the difference was significant or cosmetic, I just tried that
second one instead.  No problems with it so far, hasn't been running
long, but long enough to say that it definitely fixes the problems
I was getting - thank you.

Hugh

>
> commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
> Author: Aneesh Kumar K.V <[hidden email]>
> Date:   Thu Jun 1 08:06:40 2017 +0530
>
>     powerpc/mm/4k: Limit 4k page size to 64TB
>    
>     Supporting 512TB requires us to do a order 3 allocation for level 1 page
>     table(pgd). Limit 4k to 64TB for now.
>    
>     Signed-off-by: Aneesh Kumar K.V <[hidden email]>
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b671ca..0c4e470571ca 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -8,7 +8,7 @@
>  #define H_PTE_INDEX_SIZE  9
>  #define H_PMD_INDEX_SIZE  7
>  #define H_PUD_INDEX_SIZE  9
> -#define H_PGD_INDEX_SIZE  12
> +#define H_PGD_INDEX_SIZE  9
>  
>  #ifndef __ASSEMBLY__
>  #define H_PTE_TABLE_SIZE (sizeof(pte_t) << H_PTE_INDEX_SIZE)
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index a2123f291ab0..5de3271026f1 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
>  #define TASK_SIZE_128TB (0x0000800000000000UL)
>  #define TASK_SIZE_512TB (0x0002000000000000UL)
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
>  /*
>   * Max value currently used:
>   */
> -#define TASK_SIZE_USER64 TASK_SIZE_512TB
> +#define TASK_SIZE_USER64 TASK_SIZE_512TB
> +#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_128TB
>  #else
> -#define TASK_SIZE_USER64 TASK_SIZE_64TB
> +#define TASK_SIZE_USER64 TASK_SIZE_64TB
> +#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_64TB
>  #endif
>  
>  /*
> @@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
>   * space during mmap's.
>   */
>  #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
> -#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
> +#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
>  
>  #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
>   TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
> @@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
>   * with 128TB and conditionally enable upto 512TB
>   */
>  #ifdef CONFIG_PPC_BOOK3S_64
> -#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
> - TASK_SIZE_USER32 : TASK_SIZE_128TB)
> +#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
> + TASK_SIZE_USER32 : DEFAULT_MAP_WINDOW_USER64)
>  #else
>  #define DEFAULT_MAP_WINDOW TASK_SIZE
>  #endif
> @@ -153,7 +155,7 @@ void release_thread(struct task_struct *);
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
>  /* Limit stack to 128TB */
> -#define STACK_TOP_USER64 TASK_SIZE_128TB
> +#define STACK_TOP_USER64 DEFAULT_MAP_WINDOW_USER64
>  #else
>  #define STACK_TOP_USER64 TASK_SIZE_USER64
>  #endif
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index 8389ff5ac002..77062461c469 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -921,7 +921,7 @@ void __init setup_arch(char **cmdline_p)
>  
>  #ifdef CONFIG_PPC_MM_SLICES
>  #ifdef CONFIG_PPC64
> - init_mm.context.addr_limit = TASK_SIZE_128TB;
> + init_mm.context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
>  #else
>  #error "context.addr_limit not initialized."
>  #endif
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index c6dca2ae78ef..a3edf813d455 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -99,7 +99,7 @@ static int hash__init_new_context(struct mm_struct *mm)
>   * mm->context.addr_limit. Default to max task size so that we copy the
>   * default values to paca which will help us to handle slb miss early.
>   */
> - mm->context.addr_limit = TASK_SIZE_128TB;
> + mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
>  
>   /*
>   * The old code would re-promote on fork, we don't do that when using
>  
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
In reply to this post by Christoph Lameter-9
On Thu, 1 Jun 2017, Christoph Lameter wrote:

>
> > > I am curious as to what is going on there. Do you have the output from
> > > these failed allocations?
> >
> > I thought the relevant output was in my mail.  I did skip the Mem-Info
> > dump, since that just seemed noise in this case: we know memory can get
> > fragmented.  What more output are you looking for?
>
> The output for the failing allocations when you disabling debugging. For
> that I would think that you need remove(!) the slub_debug statement on the kernel
> command line. You can verify that debug is off by inspecting the values in
> /sys/kernel/slab/<yourcache>/<debug option>

The output was with debugging disabled.  Except when I tried adding that
slub_debug=O on the kernel command line, as the warning suggested, I did
not have any slub_debug statement on the command line; and did not have
CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are

CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLABINFO=y
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_SLUB_STATS=y

>
> > But it was still order 4 when booted with slub_debug=O, which surprised me.
> > And that surprises you too?  If so, then we ought to dig into it further.
>
> No it does no longer. I dont think slub_debug=O does disable debugging
> (frankly I am not sure what it does). Please do not specify any debug options.

But I think you are now surprised, when I say no slub_debug options
were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
(before I tried the new kernel with Aneesh's fix patch)
in case they tell you anything...

pgtable-2^12/aliases:0
pgtable-2^12/align:32768
grep: pgtable-2^12/alloc_calls: Function not implemented
pgtable-2^12/alloc_fastpath:5847 C0=1587 C1=1449 C2=1392 C3=1419
pgtable-2^12/alloc_from_partial:12637 C0=3292 C1=3020 C2=3051 C3=3274
pgtable-2^12/alloc_node_mismatch:0
pgtable-2^12/alloc_refill:41038 C0=10600 C1=10025 C2=10191 C3=10222
pgtable-2^12/alloc_slab:517 C0=148 C1=110 C2=105 C3=154
pgtable-2^12/alloc_slowpath:54203 C0=14041 C1=13157 C2=13349 C3=13656
pgtable-2^12/cache_dma:0
pgtable-2^12/cmpxchg_double_cpu_fail:0
pgtable-2^12/cmpxchg_double_fail:0
pgtable-2^12/cpu_partial:2
pgtable-2^12/cpu_partial_alloc:25894 C0=6719 C1=6334 C2=6288 C3=6553
pgtable-2^12/cpu_partial_drain:8441 C0=2035 C1=2211 C2=2268 C3=1927
pgtable-2^12/cpu_partial_free:38987 C0=9642 C1=10042 C2=10132 C3=9171
pgtable-2^12/cpu_partial_node:12237 C0=3183 C1=2928 C2=2961 C3=3165
pgtable-2^12/cpu_slabs:11
pgtable-2^12/cpuslab_flush:17 C0=5 C2=4 C3=8
pgtable-2^12/ctor:pgd_ctor+0x0/0x18
pgtable-2^12/deactivate_bypass:39027 C0=10153 C1=9463 C2=9439 C3=9972
pgtable-2^12/deactivate_empty:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/deactivate_full:16 C0=5 C2=3 C3=8
pgtable-2^12/deactivate_remote_frees:0
pgtable-2^12/deactivate_to_head:1 C2=1
pgtable-2^12/deactivate_to_tail:0
pgtable-2^12/destroy_by_rcu:0
pgtable-2^12/free_add_partial:24877 C0=6007 C1=6515 C2=6681 C3=5674
grep: pgtable-2^12/free_calls: Function not implemented
pgtable-2^12/free_fastpath:5849 C0=1587 C1=1449 C2=1394 C3=1419
pgtable-2^12/free_frozen:15145 C0=3989 C1=3701 C2=3683 C3=3772
pgtable-2^12/free_remove_partial:0
pgtable-2^12/free_slab:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/free_slowpath:54132 C0=13631 C1=13743 C2=13815 C3=12943
pgtable-2^12/hwcache_align:0
pgtable-2^12/min_partial:8
pgtable-2^12/object_size:32768
pgtable-2^12/objects:67
pgtable-2^12/objects_partial:0
pgtable-2^12/objs_per_slab:1
pgtable-2^12/order:4
pgtable-2^12/order_fallback:13 C0=2 C1=1 C2=5 C3=5
pgtable-2^12/partial:4
pgtable-2^12/poison:0
pgtable-2^12/reclaim_account:0
pgtable-2^12/red_zone:0
pgtable-2^12/reserved:0
pgtable-2^12/sanity_checks:0
pgtable-2^12/slab_size:65536
pgtable-2^12/slabs:71
pgtable-2^12/slabs_cpu_partial:7(7) C0=1(1) C1=3(3) C2=1(1) C3=2(2)
pgtable-2^12/store_user:0
pgtable-2^12/total_objects:71
pgtable-2^12/trace:0

Hugh
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

Thats fine.

> But I think you are now surprised, when I say no slub_debug options
> were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok so debugging was off but the slab cache has a ctor callback which
mandates that the free pointer cannot use the free object space when
the object is not in use. Thus the size of the object must be increased to
accomodate the freepointer.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
On Thu, 1 Jun 2017, Christoph Lameter wrote:
>
> Ok so debugging was off but the slab cache has a ctor callback which
> mandates that the free pointer cannot use the free object space when
> the object is not in use. Thus the size of the object must be increased to
> accomodate the freepointer.

Thanks a lot for working that out.  Makes sense, fully understood now,
nothing to worry about (though makes one wonder whether it's efficient
to use ctors on high-alignment caches; or whether an internal "zero-me"
ctor would be useful).

Hugh
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Michael Ellerman-2
Hugh Dickins <[hidden email]> writes:

> On Thu, 1 Jun 2017, Christoph Lameter wrote:
>>
>> Ok so debugging was off but the slab cache has a ctor callback which
>> mandates that the free pointer cannot use the free object space when
>> the object is not in use. Thus the size of the object must be increased to
>> accomodate the freepointer.
>
> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Or should we just be using kmem_cache_zalloc() when we allocate from
those slabs?

Given all the ctor's do is memset to 0.

cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Hugh Dickins-2
On Fri, 2 Jun 2017, Michael Ellerman wrote:

> Hugh Dickins <[hidden email]> writes:
> > On Thu, 1 Jun 2017, Christoph Lameter wrote:
> >>
> >> Ok so debugging was off but the slab cache has a ctor callback which
> >> mandates that the free pointer cannot use the free object space when
> >> the object is not in use. Thus the size of the object must be increased to
> >> accomodate the freepointer.
> >
> > Thanks a lot for working that out.  Makes sense, fully understood now,
> > nothing to worry about (though makes one wonder whether it's efficient
> > to use ctors on high-alignment caches; or whether an internal "zero-me"
> > ctor would be useful).
>
> Or should we just be using kmem_cache_zalloc() when we allocate from
> those slabs?
>
> Given all the ctor's do is memset to 0.

I'm not sure.  From a memory-utilization point of view, with SLUB,
using kmem_cache_zalloc() there would certainly be better.

But you may be forgetting that the constructor is applied only when a
new slab of objects is allocated, not each time an object is allocated
from that slab (and the user of those objects agrees to free objects
back to the cache in a reusable state: zeroed in this case).

So from a cpu-utilization point of view, it's better to use the ctor:
it's saving you lots of redundant memsets.

SLUB versus SLAB, cpu versus memory?  Since someone has taken the
trouble to write it with ctors in the past, I didn't feel on firm
enough ground to recommend such a change.  But it may be obvious
to someone else that your suggestion would be better (or worse).

Hugh
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
In reply to this post by Hugh Dickins-2
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it. And here is another example of using slab
allocations for page frames. Use the page allocator for this? The page
allocator is there for allocating page frames. The slab allocator main
purpose is to allocate small objects....


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Christoph Lameter-9
In reply to this post by Hugh Dickins-2
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory?  Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change.  But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm how about using alloc_pages() for pageframes?

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Michael Ellerman-2
In reply to this post by Hugh Dickins-2
Hugh Dickins <[hidden email]> writes:

> On Fri, 2 Jun 2017, Michael Ellerman wrote:
>> Hugh Dickins <[hidden email]> writes:
>> > On Thu, 1 Jun 2017, Christoph Lameter wrote:
>> >>
>> >> Ok so debugging was off but the slab cache has a ctor callback which
>> >> mandates that the free pointer cannot use the free object space when
>> >> the object is not in use. Thus the size of the object must be increased to
>> >> accomodate the freepointer.
>> >
>> > Thanks a lot for working that out.  Makes sense, fully understood now,
>> > nothing to worry about (though makes one wonder whether it's efficient
>> > to use ctors on high-alignment caches; or whether an internal "zero-me"
>> > ctor would be useful).
>>
>> Or should we just be using kmem_cache_zalloc() when we allocate from
>> those slabs?
>>
>> Given all the ctor's do is memset to 0.
>
> I'm not sure.  From a memory-utilization point of view, with SLUB,
> using kmem_cache_zalloc() there would certainly be better.
>
> But you may be forgetting that the constructor is applied only when a
> new slab of objects is allocated, not each time an object is allocated
> from that slab (and the user of those objects agrees to free objects
> back to the cache in a reusable state: zeroed in this case).

Ah yes, I was "forgetting" that :) - ie. didn't know it.

> So from a cpu-utilization point of view, it's better to use the ctor:
> it's saving you lots of redundant memsets.

OK. Presumably we guarantee (somewhere) that the page tables are zeroed
before we free them, which is a natural result of tearing down all
mappings?

But then I see other arches (x86, arm64 at least), which don't use a
constructor, and use __GPF_ZERO (via PGALLOC_GFP) at allocation time.

eg. arm64:

        pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
                                      SLAB_PANIC, NULL);
        ...
        return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);


So that's a bit puzzling.

cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 4.12-rc ppc64 4k-page needs costly allocations

Michael Ellerman-2
In reply to this post by Christoph Lameter-9
Christoph Lameter <[hidden email]> writes:

> On Thu, 1 Jun 2017, Hugh Dickins wrote:
>
>> Thanks a lot for working that out.  Makes sense, fully understood now,
>> nothing to worry about (though makes one wonder whether it's efficient
>> to use ctors on high-alignment caches; or whether an internal "zero-me"
>> ctor would be useful).
>
> Use kzalloc to zero it.

But that's changing a per slab creation memset into a per object
allocation memset, isn't it?

> And here is another example of using slab allocations for page frames.
> Use the page allocator for this? The page allocator is there for
> allocating page frames. The slab allocator main purpose is to allocate
> small objects....

Well usually they are small (< PAGE_SIZE), because we have 64K pages.

But we could rework the code to use the page allocator on 4K configs.

cheers
Loading...