[Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann

Removing or adding memory via the PowerPC hotplug interface shows
anomalies in the association between memory and nodes.  The code
was updated to initialize more possible nodes to make them available
to subsequent DLPAR hotplug-memory operations, even if they are not
needed at boot time.

Signed-off-by: Michael Bringmann <[hidden email]>
---
 arch/powerpc/mm/numa.c |   44 ++++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 15c2dd5..3d58c1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -870,7 +870,7 @@ void __init dump_numa_cpu_topology(void)
 }
 
 /* Initialize NODE_DATA for a node on the local memory */
-static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
+static void setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
  u64 spanned_pages = end_pfn - start_pfn;
  const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
@@ -878,23 +878,41 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
  void *nd;
  int tnid;
 
- nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
- nd = __va(nd_pa);
+ if (!node_data[nid]) {
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd = __va(nd_pa);
 
- /* report and initialize */
- pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
- nd_pa, nd_pa + nd_size - 1);
- tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
- if (tnid != nid)
- pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+ node_data[nid] = nd;
+ memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ NODE_DATA(nid)->node_id = nid;
+
+ /* report and initialize */
+ pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
+ nd_pa, nd_pa + nd_size - 1);
+ tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+ if (tnid != nid)
+ pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+ } else {
+ nd_pa = (u64) node_data[nid];
+ nd = __va(nd_pa);
+ }
 
- node_data[nid] = nd;
- memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
- NODE_DATA(nid)->node_id = nid;
  NODE_DATA(nid)->node_start_pfn = start_pfn;
  NODE_DATA(nid)->node_spanned_pages = spanned_pages;
 }
 
+static void setup_nodes(void)
+{
+ int i, l = 32 /* MAX_NUMNODES */;
+
+ for (i = 0; i < l; i++) {
+ if (!node_possible(i)) {
+ setup_node_data(i, 0, 0);
+ node_set(i, node_possible_map);
+ }
+ }
+}
+
 void __init initmem_init(void)
 {
  int nid, cpu;
@@ -914,6 +932,8 @@ void __init initmem_init(void)
  */
  nodes_and(node_possible_map, node_possible_map, node_online_map);
 
+ setup_nodes();
+
  for_each_online_node(nid) {
  unsigned long start_pfn, end_pfn;
 


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Reza Arbab
On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:

>+static void setup_nodes(void)
>+{
>+ int i, l = 32 /* MAX_NUMNODES */;
>+
>+ for (i = 0; i < l; i++) {
>+ if (!node_possible(i)) {
>+ setup_node_data(i, 0, 0);
>+ node_set(i, node_possible_map);
>+ }
>+ }
>+}

This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset
node_possible_map to only node_online_map").

Balbir, you have a patchset which reverts it. Do you think that will be
getting merged?

http://lkml.kernel.org/r/1479253501-26261-1-git-send-email-bsingharora@...
(see patch 3/3)

--
Reza Arbab


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann


On 05/23/2017 10:52 AM, Reza Arbab wrote:

> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>> +static void setup_nodes(void)
>> +{
>> +    int i, l = 32 /* MAX_NUMNODES */;
>> +
>> +    for (i = 0; i < l; i++) {
>> +        if (!node_possible(i)) {
>> +            setup_node_data(i, 0, 0);
>> +            node_set(i, node_possible_map);
>> +        }
>> +    }
>> +}
>
> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").

They may be related, but that commit is not a replacement.  The above patch ensures that
enough nodes are initialized at startup to allow for memory hot-add into a
node that was not used at boot (see the 'setup_node_data' function in 'numa.c'), and it
records that the node was initialized.

I didn't see where any part of commit 3af229f2071f would touch the 'node_possible_map'
which is needed by 'numa.c' and 'workqueue.c'.  The nodemask created and updated by
'mem_cgroup_may_update_nodemask()' does not appear to be the same mask.

>
> Balbir, you have a patchset which reverts it. Do you think that will be getting merged?
>
> http://lkml.kernel.org/r/1479253501-26261-1-git-send-email-bsingharora@...
> (see patch 3/3)
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Reza Arbab
On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:

>On 05/23/2017 10:52 AM, Reza Arbab wrote:
>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>> +static void setup_nodes(void)
>>> +{
>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>> +
>>> +    for (i = 0; i < l; i++) {
>>> +        if (!node_possible(i)) {
>>> +            setup_node_data(i, 0, 0);
>>> +            node_set(i, node_possible_map);
>>> +        }
>>> +    }
>>> +}
>>
>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>
>They may be related, but that commit is not a replacement.  The above patch ensures that
>there are enough of the nodes initialized at startup to allow for memory hot-add into a
>node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>recording that the node was initialized.

Is it really necessary to preinitialize these empty nodes using
setup_node_data()? When you do memory hotadd into a node that was not
used at boot, the node data already gets set up by

add_memory
  add_memory_resource
    hotadd_new_pgdat
      arch_alloc_nodedata <-- allocs the pg_data_t
      ...
      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.

Removing setup_node_data() from that loop leaves only the call to
node_set(). If 3af229f2071f (which reduces node_possible_map) was
reverted, you wouldn't need to do that either.

>I didn't see where any part of commit 3af229f2071f would touch the 'node_possible_map'
>which is needed by 'numa.c' and 'workqueue.c'.  The nodemask created and updated by
>'mem_cgroup_may_update_nodemask()' does not appear to be the same mask.

Are you sure you're looking at 3af229f2071f? It only adds one line of
code; the reduction of node_possible_map.
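
A minimal userspace sketch of the one-line reduction being discussed, for
illustration only. It assumes a 32-bit mask stands in for the kernel's
nodemask_t and that a hot-add target must have its bit in the possible map;
every model_* name is hypothetical and not a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap for this sketch; the real MAX_NUMNODES is config-dependent. */
#define MODEL_MAX_NODES 32

/* One bit per node id; a stand-in for the kernel's nodemask_t. */
typedef uint32_t model_nodemask_t;

static inline model_nodemask_t model_node_bit(int nid)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    return (model_nodemask_t)1u << nid;
}

static inline bool model_node_isset(int nid, model_nodemask_t mask)
{
    return (mask & model_node_bit(nid)) != 0;
}

/*
 * Stand-in for
 *   nodes_and(node_possible_map, node_possible_map, node_online_map);
 * After this, only nodes that were online at boot remain "possible".
 */
static inline model_nodemask_t model_reduce_possible(model_nodemask_t possible,
                                                     model_nodemask_t online)
{
    return possible & online;
}

/* Assumption of this model: hot-add may only target a possible node. */
static inline bool model_can_hot_add(int nid, model_nodemask_t possible)
{
    return model_node_isset(nid, possible);
}
```

In this model, a node that firmware reported as possible but that had no
memory at boot stops being a valid hot-add target once the reduction has run,
which is the behaviour the patch above tries to work around.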

--
Reza Arbab


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann


On 05/23/2017 04:49 PM, Reza Arbab wrote:

> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>> +static void setup_nodes(void)
>>>> +{
>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>> +
>>>> +    for (i = 0; i < l; i++) {
>>>> +        if (!node_possible(i)) {
>>>> +            setup_node_data(i, 0, 0);
>>>> +            node_set(i, node_possible_map);
>>>> +        }
>>>> +    }
>>>> +}
>>>
>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>
>> They may be related, but that commit is not a replacement.  The above patch ensures that
>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>> recording that the node was initialized.
>
> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>
> add_memory
>  add_memory_resource
>    hotadd_new_pgdat
>      arch_alloc_nodedata <-- allocs the pg_data_t
>      ...
>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.

I see that code now, but for some reason it did not work when I hot-added
memory.

>
> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.
>
>> I didn't see where any part of commit 3af229f2071f would touch the 'node_possible_map'
>> which is needed by 'numa.c' and 'workqueue.c'.  The nodemask created and updated by
>> 'mem_cgroup_may_update_nodemask()' does not appear to be the same mask.
>
> Are you sure you're looking at 3af229f2071f? It only adds one line of code; the reduction of node_possible_map.
>

The 3rd file in the patch set removes,

- nodes_and(node_possible_map, node_possible_map, node_online_map);

I need to add bits to 'node_possible_map' -- bits which may not be used
for the memory at boot, but which would be used when memory is hot-added
later.  I haven't found anything outside of the boot code that adds bits
to the 'possible' mask.

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann
In reply to this post by Reza Arbab


On 05/23/2017 04:49 PM, Reza Arbab wrote:

> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>> +static void setup_nodes(void)
>>>> +{
>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>> +
>>>> +    for (i = 0; i < l; i++) {
>>>> +        if (!node_possible(i)) {
>>>> +            setup_node_data(i, 0, 0);
>>>> +            node_set(i, node_possible_map);
>>>> +        }
>>>> +    }
>>>> +}
>>>
>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>
>> They may be related, but that commit is not a replacement.  The above patch ensures that
>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>> recording that the node was initialized.
>
> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>
> add_memory
>  add_memory_resource
>    hotadd_new_pgdat
>      arch_alloc_nodedata <-- allocs the pg_data_t
>      ...
>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>
> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.

With or without 3af229f2071f, we would still need to add something, somewhere to add new
bits to the 'node_possible_map'.  That is not being done.

>
>> I didn't see where any part of commit 3af229f2071f would touch the 'node_possible_map'
>> which is needed by 'numa.c' and 'workqueue.c'.  The nodemask created and updated by
>> 'mem_cgroup_may_update_nodemask()' does not appear to be the same mask.
>
> Are you sure you're looking at 3af229f2071f? It only adds one line of code; the reduction of node_possible_map.
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Ellerman-2
Michael Bringmann <[hidden email]> writes:

> On 05/23/2017 04:49 PM, Reza Arbab wrote:
>> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>>> +static void setup_nodes(void)
>>>>> +{
>>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>>> +
>>>>> +    for (i = 0; i < l; i++) {
>>>>> +        if (!node_possible(i)) {
>>>>> +            setup_node_data(i, 0, 0);
>>>>> +            node_set(i, node_possible_map);
>>>>> +        }
>>>>> +    }
>>>>> +}
>>>>
>>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>>
>>> They may be related, but that commit is not a replacement.  The above patch ensures that
>>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>>> recording that the node was initialized.
>>
>> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>>
>> add_memory
>>  add_memory_resource
>>    hotadd_new_pgdat
>>      arch_alloc_nodedata <-- allocs the pg_data_t
>>      ...
>>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>>
>> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.
>
> With or without 3af229f2071f, we would still need to add something, somewhere to add new
> bits to the 'node_possible_map'.  That is not being done.

You mustn't add bits to the possible map after boot.

That's its purpose, to tell you what nodes could ever *possibly* exist.

cheers
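
A rough userspace sketch of why a late change to the possible map is unsafe.
It assumes, for illustration, that boot code allocates per-node state only for
the nodes that are in the possible map at that time; the model_* names and the
slot table are hypothetical and do not correspond to a specific kernel
structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_NODES 32  /* hypothetical cap for this sketch */

struct model_per_node {
    int nid;
};

struct model_boot_state {
    uint32_t possible;                            /* one bit per node id */
    struct model_per_node *slot[MODEL_MAX_NODES]; /* sized once, at boot */
};

/* Release every slot that boot allocated. */
static inline void model_boot_free(struct model_boot_state *st)
{
    assert(st != NULL);
    for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
        free(st->slot[nid]);
        st->slot[nid] = NULL;
    }
}

/* Boot: one slot per possible node. Returns the slot count, or -1 on failure. */
static inline int model_boot_alloc(struct model_boot_state *st, uint32_t possible)
{
    int count = 0;

    assert(st != NULL);
    st->possible = possible;
    for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
        st->slot[nid] = NULL;

    for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
        if (!(possible & (UINT32_C(1) << nid)))
            continue;
        st->slot[nid] = calloc(1, sizeof(*st->slot[nid]));
        if (!st->slot[nid]) {
            model_boot_free(st);
            return -1;
        }
        st->slot[nid]->nid = nid;
        count++;
    }
    return count;
}

/* What the thread warns against: setting a possible bit after boot. */
static inline void model_late_set_possible(struct model_boot_state *st, int nid)
{
    assert(st != NULL && nid >= 0 && nid < MODEL_MAX_NODES);
    st->possible |= UINT32_C(1) << nid;
}

/* A node is usable only if it is possible AND boot gave it a slot. */
static inline bool model_node_ready(const struct model_boot_state *st, int nid)
{
    assert(st != NULL && nid >= 0 && nid < MODEL_MAX_NODES);
    return (st->possible & (UINT32_C(1) << nid)) && st->slot[nid] != NULL;
}
```

Under that assumption, a bit set by model_late_set_possible() names a node
that no boot-time consumer ever prepared for, so the map and the per-node
state disagree.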

Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Reza Arbab
In reply to this post by Michael Bringmann
On Tue, May 23, 2017 at 05:44:23PM -0500, Michael Bringmann wrote:

>On 05/23/2017 04:49 PM, Reza Arbab wrote:
>> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>>> +static void setup_nodes(void)
>>>>> +{
>>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>>> +
>>>>> +    for (i = 0; i < l; i++) {
>>>>> +        if (!node_possible(i)) {
>>>>> +            setup_node_data(i, 0, 0);
>>>>> +            node_set(i, node_possible_map);
>>>>> +        }
>>>>> +    }
>>>>> +}
>>>>
>>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>>
>>> They may be related, but that commit is not a replacement.  The above patch ensures that
>>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>>> recording that the node was initialized.
>>
>> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>>
>> add_memory
>>  add_memory_resource
>>    hotadd_new_pgdat
>>      arch_alloc_nodedata <-- allocs the pg_data_t
>>      ...
>>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>>
>> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.
>
>With or without 3af229f2071f, we would still need to add something, somewhere to add new
>bits to the 'node_possible_map'.  That is not being done.

Without 3af229f2071f, those bits would already BE set in
node_possible_map. You wouldn't have to do anything.

--
Reza Arbab


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann
I will get a log based on the latest 4.12 kernel to show what happens
one way or the other, with this patch removed.

On 05/24/2017 09:36 AM, Reza Arbab wrote:

> On Tue, May 23, 2017 at 05:44:23PM -0500, Michael Bringmann wrote:
>> On 05/23/2017 04:49 PM, Reza Arbab wrote:
>>> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>>>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>>>> +static void setup_nodes(void)
>>>>>> +{
>>>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>>>> +
>>>>>> +    for (i = 0; i < l; i++) {
>>>>>> +        if (!node_possible(i)) {
>>>>>> +            setup_node_data(i, 0, 0);
>>>>>> +            node_set(i, node_possible_map);
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>
>>>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>>>
>>>> They may be related, but that commit is not a replacement.  The above patch ensures that
>>>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>>>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>>>> recording that the node was initialized.
>>>
>>> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>>>
>>> add_memory
>>>  add_memory_resource
>>>    hotadd_new_pgdat
>>>      arch_alloc_nodedata <-- allocs the pg_data_t
>>>      ...
>>>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>>>
>>> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.
>>
>> With or without 3af229f2071f, we would still need to add something, somewhere to add new
>> bits to the 'node_possible_map'.  That is not being done.
>
> Without 3af229f2071f, those bits would already BE set in node_possible_map. You wouldn't have to do anything.
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann
In reply to this post by Michael Ellerman-2


On 05/24/2017 06:19 AM, Michael Ellerman wrote:

> Michael Bringmann <[hidden email]> writes:
>
>> On 05/23/2017 04:49 PM, Reza Arbab wrote:
>>> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>>>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>>>> +static void setup_nodes(void)
>>>>>> +{
>>>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>>>> +
>>>>>> +    for (i = 0; i < l; i++) {
>>>>>> +        if (!node_possible(i)) {
>>>>>> +            setup_node_data(i, 0, 0);
>>>>>> +            node_set(i, node_possible_map);
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>
>>>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset node_possible_map to only node_online_map").
>>>>
>>>> They may be related, but that commit is not a replacement.  The above patch ensures that
>>>> there are enough of the nodes initialized at startup to allow for memory hot-add into a
>>>> node that was not used at boot.  (See 'setup_node_data' function in 'numa.c'.)  That and
>>>> recording that the node was initialized.
>>>
>>> Is it really necessary to preinitialize these empty nodes using setup_node_data()? When you do memory hotadd into a node that was not used at boot, the node data already gets set up by
>>>
>>> add_memory
>>>  add_memory_resource
>>>    hotadd_new_pgdat
>>>      arch_alloc_nodedata <-- allocs the pg_data_t
>>>      ...
>>>      free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>>>
>>> Removing setup_node_data() from that loop leaves only the call to node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted, you wouldn't need to do that either.
>>
>> With or without 3af229f2071f, we would still need to add something, somewhere to add new
>> bits to the 'node_possible_map'.  That is not being done.
>
> You mustn't add bits to the possible map after boot.
>
> That's its purpose, to tell you what nodes could ever *possibly* exist.

The problem that I have been encountering is that the 'possible map' did *not*
show all of the possible nodes.  Rather, it showed only the nodes that were
assigned memory at boot-up.  If more memory were hot-added to the kernel, it
could be assigned into one of the nodes that were skipped at boot.  However,
nothing was updating the 'node_possible_map' correctly in the kernel memory
code.

Reza pointed out a code change in commit 3af229f2071f that has not made it into
the 4.12 checkout, i.e., removing the instruction that reduces the node_possible_map.
This may well be a suitable replacement for the code that I have here, and I
will test it next.

>
> cheers
>
>
Later.

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Ellerman-2
Michael Bringmann <[hidden email]> writes:

> On 05/24/2017 06:19 AM, Michael Ellerman wrote:
>> Michael Bringmann <[hidden email]> writes:
>>>
>>> With or without 3af229f2071f, we would still need to add something, somewhere to add new
>>> bits to the 'node_possible_map'.  That is not being done.
>>
>> You mustn't add bits to the possible map after boot.
>>
>> That's its purpose, to tell you what nodes could ever *possibly* exist.
>
> The problem that I have been encountering is that the 'possible map' did *not*
> show all of the possible nodes.

OK so how did that happen?

The commit message for 3af229f2071f says:

    In practice, we never see a system with 256 NUMA nodes, and in fact, we
    do not support node hotplug on power in the first place, so the nodes
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    that are online when we come up are the nodes that will be present for
    the lifetime of this kernel.

Is that no longer true?

cheers

Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann


On 05/25/2017 01:19 AM, Michael Ellerman wrote:

> Michael Bringmann <[hidden email]> writes:
>
>> On 05/24/2017 06:19 AM, Michael Ellerman wrote:
>>> Michael Bringmann <[hidden email]> writes:
>>>>
>>>> With or without 3af229f2071f, we would still need to add something, somewhere to add new
>>>> bits to the 'node_possible_map'.  That is not being done.
>>>
>>> You mustn't add bits to the possible map after boot.
>>>
>>> That's its purpose, to tell you what nodes could ever *possibly* exist.
>>
>> The problem that I have been encountering is that the 'possible map' did *not*
>> show all of the possible nodes.
>
> OK so how did that happen?
>
> The commit message for 3af229f2071f says:
>
>     In practice, we never see a system with 256 NUMA nodes, and in fact, we
>     do not support node hotplug on power in the first place, so the nodes
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     that are online when we come up are the nodes that will be present for
>     the lifetime of this kernel.
>
> Is that no longer true?

Take a look at the last part of commit 3af229f2071f for file numa.c.  It undoes
a piece of code that restricts the 'node possible map', created earlier, to the
set of online nodes.  That piece of code has not made it into the mainline, at
least not into 4.12.  I am testing to verify whether it is sufficient for my
configuration now.

>
> cheers
>
Regards.

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Reza Arbab
In reply to this post by Michael Ellerman-2
On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>The commit message for 3af229f2071f says:
>
>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>    do not support node hotplug on power in the first place, so the nodes
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    that are online when we come up are the nodes that will be present for
>    the lifetime of this kernel.
>
>Is that no longer true?

I don't know what the reasoning behind that statement was at the time,
but as far as I can tell, the only thing missing for node hotplug now is
Balbir's patchset [1]. He fixes the resource issue which motivated
3af229f2071f and reverts it.

With that set, I can instantiate a new numa node just by doing
add_memory(nid, ...) where nid doesn't currently exist.

[1] https://lkml.kernel.org/r/1479253501-26261-1-git-send-email-bsingharora@...
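
A simplified userspace model of the hot-add path described in this thread
(add_memory -> hotadd_new_pgdat -> arch_alloc_nodedata / free_area_init_node),
offered as a sketch only. The struct layout, the possible-map check, and all
model_* names are assumptions made for illustration; the real kernel code does
considerably more.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_NODES 32  /* hypothetical cap for this sketch */

/* Tiny stand-in for pg_data_t with the fields named in the patch. */
struct model_pgdat {
    int node_id;
    uint64_t node_start_pfn;
    uint64_t node_spanned_pages;
};

struct model_system {
    uint32_t possible;  /* nodes that may ever exist */
    uint32_t online;    /* nodes that currently exist */
    struct model_pgdat *node_data[MODEL_MAX_NODES];
};

/* Returns 0 on success, -1 on error. */
static inline int model_add_memory(struct model_system *sys, int nid,
                                   uint64_t start_pfn, uint64_t nr_pages)
{
    assert(sys != NULL);
    if (nid < 0 || nid >= MODEL_MAX_NODES)
        return -1;

    /* Assumption of this model: the target node must be possible. */
    if (!(sys->possible & (UINT32_C(1) << nid)))
        return -1;

    if (!sys->node_data[nid]) {
        /* Plays the role of arch_alloc_nodedata(): allocate on demand. */
        struct model_pgdat *pgdat = calloc(1, sizeof(*pgdat));
        if (!pgdat)
            return -1;
        /* Plays the role of free_area_init_node(): set node_id, etc. */
        pgdat->node_id = nid;
        pgdat->node_start_pfn = start_pfn;
        sys->node_data[nid] = pgdat;
    }

    /* Simplification: treat every added range as extending the span. */
    sys->node_data[nid]->node_spanned_pages += nr_pages;
    sys->online |= UINT32_C(1) << nid;
    return 0;
}
```

In this model no boot-time preinitialization of empty nodes is needed: the
node data is created the first time memory is added, provided the node's bit
was left in the possible map.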

--
Reza Arbab


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann


On 05/25/2017 10:10 AM, Reza Arbab wrote:

> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>> The commit message for 3af229f2071f says:
>>
>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>    do not support node hotplug on power in the first place, so the nodes
>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    that are online when we come up are the nodes that will be present for
>>    the lifetime of this kernel.
>>
>> Is that no longer true?
>
> I don't know what the reasoning behind that statement was at the time, but as far as I can tell, the only thing missing for node hotplug now is Balbir's patchset [1]. He fixes the resource issue which motivated 3af229f2071f and reverts it.
>
> With that set, I can instantiate a new numa node just by doing add_memory(nid, ...) where nid doesn't currently exist.
>
> [1] https://lkml.kernel.org/r/1479253501-26261-1-git-send-email-bsingharora@...
>

Yes, the change to 'numa.c' looks to be sufficient for my needs as well.

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Balbir Singh
In reply to this post by Reza Arbab
On Thu, 25 May 2017 10:10:11 -0500
Reza Arbab <[hidden email]> wrote:

> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
> >The commit message for 3af229f2071f says:
> >
> >    In practice, we never see a system with 256 NUMA nodes, and in fact, we
> >    do not support node hotplug on power in the first place, so the nodes
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >    that are online when we come up are the nodes that will be present for
> >    the lifetime of this kernel.
> >
> >Is that no longer true?  
>
> I don't know what the reasoning behind that statement was at the time,
> but as far as I can tell, the only thing missing for node hotplug now is
> Balbir's patchset [1]. He fixes the resource issue which motivated
> 3af229f2071f and reverts it.
>
> With that set, I can instantiate a new numa node just by doing
> add_memory(nid, ...) where nid doesn't currently exist.
>
> [1] https://lkml.kernel.org/r/1479253501-26261-1-git-send-email-bsingharora@...
>

I guess I should try and revive that patchset. One of the suggestions back
then was to limit the maximum possible nodes in firmware, but I'm double-checking
to see if we can do that in a well-defined manner.

Balbir Singh

Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Ellerman-2
In reply to this post by Reza Arbab
Reza Arbab <[hidden email]> writes:

> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>The commit message for 3af229f2071f says:
>>
>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>    do not support node hotplug on power in the first place, so the nodes
>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    that are online when we come up are the nodes that will be present for
>>    the lifetime of this kernel.
>>
>>Is that no longer true?
>
> I don't know what the reasoning behind that statement was at the time,
> but as far as I can tell, the only thing missing for node hotplug now is
> Balbir's patchset [1]. He fixes the resource issue which motivated
> 3af229f2071f and reverts it.
>
> With that set, I can instantiate a new numa node just by doing
> add_memory(nid, ...) where nid doesn't currently exist.

But does that actually happen on any real system?

cheers

Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann
I am running into this problem on PowerPC systems where Balbir's patch set
was targeted.  So, yes, I do need to be able to add/enable a new numa node
during system execution in cases where more resources (memory, virtual
processors) are added to the system dynamically.

On 05/25/2017 10:46 PM, Michael Ellerman wrote:

> Reza Arbab <[hidden email]> writes:
>
>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>> The commit message for 3af229f2071f says:
>>>
>>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>    do not support node hotplug on power in the first place, so the nodes
>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>    that are online when we come up are the nodes that will be present for
>>>    the lifetime of this kernel.
>>>
>>> Is that no longer true?
>>
>> I don't know what the reasoning behind that statement was at the time,
>> but as far as I can tell, the only thing missing for node hotplug now is
>> Balbir's patchset [1]. He fixes the resource issue which motivated
>> 3af229f2071f and reverts it.
>>
>> With that set, I can instantiate a new numa node just by doing
>> add_memory(nid, ...) where nid doesn't currently exist.
>
> But does that actually happen on any real system?
>
> cheers
>
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Reza Arbab
In reply to this post by Michael Ellerman-2
On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:

>Reza Arbab <[hidden email]> writes:
>
>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>The commit message for 3af229f2071f says:
>>>
>>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>    do not support node hotplug on power in the first place, so the nodes
>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>    that are online when we come up are the nodes that will be present for
>>>    the lifetime of this kernel.
>>>
>>>Is that no longer true?
>>
>> I don't know what the reasoning behind that statement was at the time,
>> but as far as I can tell, the only thing missing for node hotplug now is
>> Balbir's patchset [1]. He fixes the resource issue which motivated
>> 3af229f2071f and reverts it.
>>
>> With that set, I can instantiate a new numa node just by doing
>> add_memory(nid, ...) where nid doesn't currently exist.
>
>But does that actually happen on any real system?

I don't know if anything currently tries to do this. My interest in
having this working is so that in the future, our coherent gpu memory
could be added as a distinct node by the device driver.

--
Reza Arbab


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Ellerman-2
Reza Arbab <[hidden email]> writes:

> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>Reza Arbab <[hidden email]> writes:
>>
>>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>>The commit message for 3af229f2071f says:
>>>>
>>>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>>    do not support node hotplug on power in the first place, so the nodes
>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>    that are online when we come up are the nodes that will be present for
>>>>    the lifetime of this kernel.
>>>>
>>>>Is that no longer true?
>>>
>>> I don't know what the reasoning behind that statement was at the time,
>>> but as far as I can tell, the only thing missing for node hotplug now is
>>> Balbir's patchset [1]. He fixes the resource issue which motivated
>>> 3af229f2071f and reverts it.
>>>
>>> With that set, I can instantiate a new numa node just by doing
>>> add_memory(nid, ...) where nid doesn't currently exist.
>>
>>But does that actually happen on any real system?
>
> I don't know if anything currently tries to do this. My interest in
> having this working is so that in the future, our coherent gpu memory
> could be added as a distinct node by the device driver.

Sure. If/when that happens, we would hopefully still have some way to
limit the size of the possible map.

That would ideally be a firmware property that tells us the maximum
number of GPUs that might be hot-added, or we punt and cap it at some
"sane" maximum number.

But until that happens it's silly to say we can have up to 256 nodes
when in practice most of our systems have 8 or less.

So I'm still waiting for an explanation from Michael B on how he's
seeing this bug in practice.

cheers

Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

Michael Bringmann


On 05/29/2017 12:32 AM, Michael Ellerman wrote:

> Reza Arbab <[hidden email]> writes:
>
>> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>> Reza Arbab <[hidden email]> writes:
>>>
>>>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>>> The commit message for 3af229f2071f says:
>>>>>
>>>>>    In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>>>    do not support node hotplug on power in the first place, so the nodes
>>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>    that are online when we come up are the nodes that will be present for
>>>>>    the lifetime of this kernel.
>>>>>
>>>>> Is that no longer true?
>>>>
>>>> I don't know what the reasoning behind that statement was at the time,
>>>> but as far as I can tell, the only thing missing for node hotplug now is
>>>> Balbir's patchset [1]. He fixes the resource issue which motivated
>>>> 3af229f2071f and reverts it.
>>>>
>>>> With that set, I can instantiate a new numa node just by doing
>>>> add_memory(nid, ...) where nid doesn't currently exist.
>>>
>>> But does that actually happen on any real system?
>>
>> I don't know if anything currently tries to do this. My interest in
>> having this working is so that in the future, our coherent gpu memory
>> could be added as a distinct node by the device driver.
>
> Sure. If/when that happens, we would hopefully still have some way to
> limit the size of the possible map.
>
> That would ideally be a firmware property that tells us the maximum
> number of GPUs that might be hot-added, or we punt and cap it at some
> "sane" maximum number.
>
> But until that happens it's silly to say we can have up to 256 nodes
> when in practice most of our systems have 8 or less.
>
> So I'm still waiting for an explanation from Michael B on how he's
> seeing this bug in practice.

I already answered this in an earlier message.  I will give an example.

* Let there be a configuration with nodes (0, 4-5, 8) that boots with 1 VP
  and 10G of memory in a shared processor configuration.
* At boot time, 4 nodes are put into the possible map by the PowerPC boot
  code.
* Subsequently, the NUMA code executes and puts the 10G memory into nodes
  4 & 5.  No memory goes into Node 0.  So we now have 2 nodes in the
  node_online_map.
* The VP and its threads get assigned to Node 4.
* Then when 'initmem_init()' in 'arch/powerpc/mm/numa.c' executes the
  statement,
     nodes_and(node_possible_map, node_possible_map, node_online_map);
  the content of the node_possible_map is reduced to nodes 4-5.
* Later on we hot-add 90G of memory to the system.  It tries to put the
  memory into nodes 0, 4-5, 8 based on the memory association map.  We
  should see memory put into all 4 nodes.  However, since we have reduced
  the 'node_possible_map' to only nodes 4 & 5, we can now only put memory
  into 2 of the configured nodes.

# We want to be able to put memory into all 4 nodes via hot-add operations,
  not only the nodes that 'survive' boot time initialization.  We could
  make a number of changes to ensure that all of the nodes in the initial
  configuration provided by the pHyp can be used, but this one appears to
  be the simplest, only using resources requested by the pHyp at boot --
  even if those resources are not used immediately.
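
To make the sequence above concrete, here is a minimal user-space sketch in
C.  It is not kernel code: the 'sketch_' names and the 32-bit mask type are
hypothetical stand-ins for nodemask_t and the node maps, and the node
numbers are simply the ones from the example (configured nodes 0, 4-5, 8;
boot memory in nodes 4 & 5).  It only models the effect of ANDing the
possible map with the online map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per node id. */
typedef uint32_t sketch_nodemask_t;

#define SKETCH_MAX_NODES 32u

/* Return a mask with only bit 'nid' set. */
sketch_nodemask_t sketch_node_bit(unsigned int nid)
{
	assert(nid < SKETCH_MAX_NODES);
	return (sketch_nodemask_t)1u << nid;
}

/* Test whether node 'nid' is present in 'map'. */
bool sketch_node_isset(sketch_nodemask_t map, unsigned int nid)
{
	return (map & sketch_node_bit(nid)) != 0;
}

/*
 * Models the effect of
 *   nodes_and(node_possible_map, node_possible_map, node_online_map);
 * i.e. every node that is not online at this point stops being possible.
 */
sketch_nodemask_t sketch_trim_possible(sketch_nodemask_t possible,
				       sketch_nodemask_t online)
{
	return possible & online;
}

/* Count how many of the hot-add target nodes are still in 'possible'. */
unsigned int sketch_count_usable(sketch_nodemask_t possible,
				 const unsigned int *targets, size_t n)
{
	unsigned int usable = 0;

	for (size_t i = 0; i < n; i++)
		if (sketch_node_isset(possible, targets[i]))
			usable++;
	return usable;
}

/* Replay the example: configured nodes 0, 4-5, 8; boot memory in 4 & 5. */
unsigned int sketch_replay_example(void)
{
	const unsigned int configured[] = { 0, 4, 5, 8 };
	const size_t n = sizeof(configured) / sizeof(configured[0]);
	sketch_nodemask_t possible = 0;
	sketch_nodemask_t online;

	/* Boot code puts all four configured nodes into the possible map. */
	for (size_t i = 0; i < n; i++)
		possible |= sketch_node_bit(configured[i]);

	/* NUMA setup places the boot memory in nodes 4 and 5 only. */
	online = sketch_node_bit(4) | sketch_node_bit(5);

	/* The trimming step drops nodes 0 and 8 from the possible map. */
	possible = sketch_trim_possible(possible, online);

	/* A later hot-add aimed at 0, 4-5, 8 can only use what is left. */
	return sketch_count_usable(possible, configured, n);
}
```

Calling sketch_replay_example() returns 2: of the four configured nodes,
only nodes 4 and 5 are still possible when the hot-add arrives.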

>
> cheers
>

Regards,
Michael

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
[hidden email]
