[PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Michael Ellerman-2
In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
switched to the generic implementation of cpu_to_node(), which uses a percpu
variable to hold the NUMA node for each CPU.

Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
of our percpu areas, leading to a chicken and egg problem. In practice what
happens is when we are setting up the percpu areas, cpu_to_node() reports that
all CPUs are on node 0, so we allocate all percpu areas on node 0.

This is visible in the dmesg output, as all pcpu allocs are in group 0:

  pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
  pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
  pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
  pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
  pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
  pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47

To fix it we need an early_cpu_to_node() which can run prior to percpu being
set up. We already have the numa_cpu_lookup_table we can use, so just plumb it
in. With the patch, dmesg output shows two groups, 0 and 1:

  pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
  pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
  pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
  pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
  pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
  pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47

We can also check the data_offset in the paca of various CPUs, with the fix we
see:

  CPU 0:  data_offset = 0x0ffe8b0000
  CPU 24: data_offset = 0x1ffe5b0000

And we can see from dmesg that CPU 24 has an allocation on node 1:

  node   0: [mem 0x0000000000000000-0x0000000fffffffff]
  node   1: [mem 0x0000001000000000-0x0000001fffffffff]

Cc: [hidden email] # v3.16+
Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
Signed-off-by: Michael Ellerman <[hidden email]>
---
 arch/powerpc/include/asm/topology.h | 14 ++++++++++++++
 arch/powerpc/kernel/setup_64.c      | 19 ++++++++++++++-----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 8b3b46b7b0f2..8f3b2ec09b9e 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -44,8 +44,22 @@ extern void __init dump_numa_cpu_topology(void);
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct device *dev, int nid);
 
+static inline int early_cpu_to_node(int cpu)
+{
+ int nid;
+
+ nid = numa_cpu_lookup_table[cpu];
+
+ /*
+ * Some functions, eg. node_distance() don't cope with -1, so instead
+ * fall back to node 0 if nid is unset (it should be, except bugs).
+ */
+ return (nid < 0) ? 0 : nid;
+}
 #else
 
+static inline int early_cpu_to_node(int cpu) { return 0; }
+
 static inline void dump_numa_cpu_topology(void) {}
 
 static inline int sysfs_add_device_to_node(struct device *dev, int nid)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index f35ff9dea4fb..a1aa503946d3 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -661,7 +661,7 @@ void __init emergency_stack_init(void)
 
 static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
 {
- return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
+ return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size, align,
     __pa(MAX_DMA_ADDRESS));
 }
 
@@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
- if (cpu_to_node(from) == cpu_to_node(to))
- return LOCAL_DISTANCE;
- else
- return REMOTE_DISTANCE;
+#ifndef CONFIG_NUMA
+ return LOCAL_DISTANCE;
+#else
+ int from_nid, to_nid;
+
+ from_nid = early_cpu_to_node(from);
+ to_nid   = early_cpu_to_node(to);
+
+ if (from_nid == -1 || to_nid == -1)
+ return LOCAL_DISTANCE; /* Or assume remote? */
+
+ return node_distance(from_nid, to_nid);
+#endif
 }
 
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
--
2.7.4


Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Nicholas Piggin-2
On Fri,  2 Jun 2017 15:14:47 +1000
Michael Ellerman <[hidden email]> wrote:

> In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
> switched to the generic implementation of cpu_to_node(), which uses a percpu
> variable to hold the NUMA node for each CPU.
>
> Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
> of our percpu areas, leading to a chicken and egg problem. In practice what
> happens is when we are setting up the percpu areas, cpu_to_node() reports that
> all CPUs are on node 0, so we allocate all percpu areas on node 0.
>
> This is visible in the dmesg output, as all pcpu allocs are in group 0:
>
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
>   pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
>   pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
>
> To fix it we need an early_cpu_to_node() which can run prior to percpu being
> set up. We already have the numa_cpu_lookup_table we can use, so just plumb it
> in. With the patch, dmesg output shows two groups, 0 and 1:
>
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
>   pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
>   pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
>
> We can also check the data_offset in the paca of various CPUs, with the fix we
> see:
>
>   CPU 0:  data_offset = 0x0ffe8b0000
>   CPU 24: data_offset = 0x1ffe5b0000
>
> And we can see from dmesg that CPU 24 has an allocation on node 1:
>
>   node   0: [mem 0x0000000000000000-0x0000000fffffffff]
>   node   1: [mem 0x0000001000000000-0x0000001fffffffff]

Nice bug :)

I wonder what happens if you put a WARN if cpu_to_node is used
before it is set up?


> @@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
>  
>  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
>  {
> - if (cpu_to_node(from) == cpu_to_node(to))
> - return LOCAL_DISTANCE;
> - else
> - return REMOTE_DISTANCE;
> +#ifndef CONFIG_NUMA
> + return LOCAL_DISTANCE;
> +#else
> + int from_nid, to_nid;
> +
> + from_nid = early_cpu_to_node(from);
> + to_nid   = early_cpu_to_node(to);
> +
> + if (from_nid == -1 || to_nid == -1)
> + return LOCAL_DISTANCE; /* Or assume remote? */
> +
> + return node_distance(from_nid, to_nid);

If you made node_distance() return LOCAL_DISTANCE for !NUMA, this
should fall out and not require the ifdef?

Looks good to me though.

Thanks,
Nick

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Michael Ellerman-2
Nicholas Piggin <[hidden email]> writes:

> On Fri,  2 Jun 2017 15:14:47 +1000
> Michael Ellerman <[hidden email]> wrote:
>
> Nice bug :)

Yeah what a shocker.

> I wonder what happens if you put a WARN if cpu_to_node is used
> before it is set up?

Yeah we definitely need some way to debug these things.

>> @@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
>>  
>>  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
>>  {
>> - if (cpu_to_node(from) == cpu_to_node(to))
>> - return LOCAL_DISTANCE;
>> - else
>> - return REMOTE_DISTANCE;
>> +#ifndef CONFIG_NUMA
>> + return LOCAL_DISTANCE;
>> +#else
>> + int from_nid, to_nid;
>> +
>> + from_nid = early_cpu_to_node(from);
>> + to_nid   = early_cpu_to_node(to);
>> +
>> + if (from_nid == -1 || to_nid == -1)
>> + return LOCAL_DISTANCE; /* Or assume remote? */
>> +
>> + return node_distance(from_nid, to_nid);
>
> If you made node_distance() return LOCAL_DISTANCE for !NUMA, this
> should fall out and not require the ifdef?

Maybe yeah. This is designed to be minimal for backporting though.

cheers

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Nicholas Piggin-2
On Fri, 02 Jun 2017 19:54:32 +1000
Michael Ellerman <[hidden email]> wrote:

> >> @@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
> >>  
> >>  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
> >>  {
> >> - if (cpu_to_node(from) == cpu_to_node(to))
> >> - return LOCAL_DISTANCE;
> >> - else
> >> - return REMOTE_DISTANCE;
> >> +#ifndef CONFIG_NUMA
> >> + return LOCAL_DISTANCE;
> >> +#else
> >> + int from_nid, to_nid;
> >> +
> >> + from_nid = early_cpu_to_node(from);
> >> + to_nid   = early_cpu_to_node(to);
> >> +
> >> + if (from_nid == -1 || to_nid == -1)
> >> + return LOCAL_DISTANCE; /* Or assume remote? */
> >> +
> >> + return node_distance(from_nid, to_nid);  
> >
> > If you made node_distance() return LOCAL_DISTANCE for !NUMA, this
> > should fall out and not require the ifdef?
>
> Maybe yeah. This is designed to be minimal for backporting though.

Okay, fair enough. Is it ever expected to get back -1 from this? I think
all we need is local vs. not-local to direct whether to pack CPUs into
the same chunks or not, so I wonder if the equality test is simpler.

Probably not a big deal though. Patch looks good as is.

Thanks,
Nick

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Michael Ellerman-2
Nicholas Piggin <[hidden email]> writes:

> On Fri, 02 Jun 2017 19:54:32 +1000
> Michael Ellerman <[hidden email]> wrote:
>
>> >
>> > If you made node_distance() return LOCAL_DISTANCE for !NUMA, this
>> > should fall out and not require the ifdef?
>>
>> Maybe yeah. This is designed to be minimal for backporting though.
>
> Okay fair enough. Is it expected to get back -1 from this ever? I think
> all we need is local vs not local to direct whether to pack CPUs into
> the same chunks or not so I wonder if the equality test is simpler.

Yeah you're right, I'll do a v2.

I think I was worried about the fact that our node_distance() does
complicated stuff for multi-level NUMA, but as you say that doesn't
actually matter for the per-cpu calculation, so it's probably better to
keep it simple.

cheers

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Balbir Singh
In reply to this post by Michael Ellerman-2
On Fri, Jun 2, 2017 at 3:14 PM, Michael Ellerman <[hidden email]> wrote:

> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 8b3b46b7b0f2..8f3b2ec09b9e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -44,8 +44,22 @@ extern void __init dump_numa_cpu_topology(void);
>  extern int sysfs_add_device_to_node(struct device *dev, int nid);
>  extern void sysfs_remove_device_from_node(struct device *dev, int nid);
>
> +static inline int early_cpu_to_node(int cpu)
> +{
> +       int nid;
> +
> +       nid = numa_cpu_lookup_table[cpu];
> +
> +       /*
> +        * Some functions, eg. node_distance() don't cope with -1, so instead
> +        * fall back to node 0 if nid is unset (it should be, except bugs).
> +        */
> +       return (nid < 0) ? 0 : nid;
> +}
>  #else

Not sure if it's entirely related, but I had tried to do

https://patchwork.ozlabs.org/patch/683556/

to set up the mapping earlier, but we would still have missed the pcpu_fc_alloc.

Balbir

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

Michael Ellerman-2
Balbir Singh <[hidden email]> writes:

> On Fri, Jun 2, 2017 at 3:14 PM, Michael Ellerman <[hidden email]> wrote:
>> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
>> index 8b3b46b7b0f2..8f3b2ec09b9e 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -44,8 +44,22 @@ extern void __init dump_numa_cpu_topology(void);
>>  extern int sysfs_add_device_to_node(struct device *dev, int nid);
>>  extern void sysfs_remove_device_from_node(struct device *dev, int nid);
>>
>> +static inline int early_cpu_to_node(int cpu)
>> +{
>> +       int nid;
>> +
>> +       nid = numa_cpu_lookup_table[cpu];
>> +
>> +       /*
>> +        * Some functions, eg. node_distance() don't cope with -1, so instead
>> +        * fall back to node 0 if nid is unset (it should be, except bugs).
>> +        */
>> +       return (nid < 0) ? 0 : nid;
>> +}
>>  #else
>
> Not sure if it's entirely related, but I had tried to do
>
> https://patchwork.ozlabs.org/patch/683556/
>
> to set up the mapping earlier, but we would still have missed the pcpu_fc_alloc.

It's related. But doesn't fix this bug.

I didn't merge it because I assumed it wouldn't build with
CONFIG_NUMA=n, but it seems it does, so I'll grab it.

cheers