[PATCH 1/6] powerpc/mm: Move pgdir setting into a helper

[PATCH 1/6] powerpc/mm: Move pgdir setting into a helper

Benjamin Herrenschmidt
Makes switch_mm_irqs_off() a bit more readable

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
---
 arch/powerpc/include/asm/mmu_context.h | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 94d93f256b94..603b0e5cdec0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -77,6 +77,26 @@ extern void switch_cop(struct mm_struct *next);
 extern int use_cop(unsigned long acop, struct mm_struct *mm);
 extern void drop_cop(unsigned long acop, struct mm_struct *mm);
 
+#if defined(CONFIG_PPC32)
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm)
+{
+ /* 32-bit keeps track of the current PGDIR in the thread struct */
+ tsk->thread.pgdir = mm->pgd;
+}
+#elif defined(CONFIG_PPC_BOOK3E_64)
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm)
+{
+ /* 64-bit Book3E keeps track of current PGD in the PACA */
+ get_paca()->pgd = mm->pgd;
+}
+#else
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm) { }
+#endif
+
+
 /*
  * switch_mm is the entry point called from the architecture independent
  * code in kernel/sched/core.c
@@ -93,15 +113,9 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
  new_on_cpu = true;
  }
 
- /* 32-bit keeps track of the current PGDIR in the thread struct */
-#ifdef CONFIG_PPC32
- tsk->thread.pgdir = next->pgd;
-#endif /* CONFIG_PPC32 */
+ /* Some subarchs need to track the PGD elsewhere */
+ switch_mm_pgdir(tsk, next);
 
- /* 64-bit Book3E keeps track of current PGD in the PACA */
-#ifdef CONFIG_PPC_BOOK3E_64
- get_paca()->pgd = next->pgd;
-#endif
  /* Nothing else to do if we aren't actually switching */
  if (prev == next)
  return;
--
2.13.3


[PATCH 2/6] powerpc/mm: Avoid double irq save/restore in activate_mm

Benjamin Herrenschmidt
activate_mm() calls switch_mm(), which already does its own irq
save/restore these days.

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
---
 arch/powerpc/include/asm/mmu_context.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 603b0e5cdec0..ed9a36ee3107 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -158,11 +158,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  */
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
- unsigned long flags;
-
- local_irq_save(flags);
  switch_mm(prev, next, current);
- local_irq_restore(flags);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
--
2.13.3


[PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

Benjamin Herrenschmidt
In reply to this post by Benjamin Herrenschmidt
There is no guarantee that the various isync's involved with
the context switch will order the update of the CPU mask with
the first TLB entry for the new context being loaded by the HW.

Be safe here and add a memory barrier to order any subsequent
load/store which may bring entries into the TLB.

The corresponding barrier on the other side already exists as
pte updates use pte_xchg() which uses __cmpxchg_u64 which has
a sync after the atomic operation.

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
---
 arch/powerpc/include/asm/mmu_context.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index ed9a36ee3107..ff1aeb2cd19f 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -110,6 +110,7 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
  /* Mark this context has been used on the new CPU */
  if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
  cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
+ smp_mb();
  new_on_cpu = true;
  }
 
--
2.13.3
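
For intuition, the ordering argument in this changelog is the classic
store-buffering pattern. Below is a small, self-contained userspace C11
sketch of it; the names (switch_mm_side, invalidate_side, cpu_in_mask,
pte_valid), the pthread setup and the seq_cst fences are illustrative
stand-ins for the kernel primitives, not the actual powerpc code. One side
publishes the cpumask bit and may then pull translations into the TLB; the
other side invalidates a PTE and then reads the mask to choose between a
local and a global flush. The smp_mb() added here plays the role of the
first fence and pairs with the sync at the end of __cmpxchg_u64().

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int cpu_in_mask;  /* stand-in for this CPU's bit in mm_cpumask() */
static atomic_int pte_valid;    /* stand-in for a PTE being invalidated elsewhere */
static int saw_old_pte, saw_empty_mask;

static void *switch_mm_side(void *arg)
{
    (void)arg;
    /* cpumask_set_cpu(): publish that this CPU now uses the mm */
    atomic_store_explicit(&cpu_in_mask, 1, memory_order_relaxed);
    /* the smp_mb() added by this patch (hwsync on powerpc) */
    atomic_thread_fence(memory_order_seq_cst);
    /* subsequent accesses may bring translations into the TLB */
    saw_old_pte = atomic_load_explicit(&pte_valid, memory_order_relaxed);
    return NULL;
}

static void *invalidate_side(void *arg)
{
    (void)arg;
    /* pte_xchg() clears the PTE ... */
    atomic_store_explicit(&pte_valid, 0, memory_order_relaxed);
    /* ... and __cmpxchg_u64() ends with a sync, modelled as a full fence */
    atomic_thread_fence(memory_order_seq_cst);
    /* the flush path then reads the mask to pick local vs. global flush */
    saw_empty_mask = !atomic_load_explicit(&cpu_in_mask, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;

        atomic_store(&cpu_in_mask, 0);
        atomic_store(&pte_valid, 1);
        pthread_create(&a, NULL, switch_mm_side, NULL);
        pthread_create(&b, NULL, invalidate_side, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /*
         * The forbidden outcome: the switching CPU still saw the old
         * PTE (and could cache a stale translation) while the flushing
         * CPU did not yet see it in the mask (and so skipped the
         * broadcast).  With full barriers on both sides this cannot
         * happen.
         */
        assert(!(saw_old_pte && saw_empty_mask));
    }
    printf("no stale-TLB window observed\n");
    return 0;
}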


[PATCH 4/6] powerpc/mm: Use mm_is_thread_local() instead of open-coding

Benjamin Herrenschmidt
In reply to this post by Benjamin Herrenschmidt
We open-code testing for the mm being local to the current CPU
in a few places. Use our existing helper instead.

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
---
 arch/powerpc/mm/hash_utils_64.c  | 6 ++----
 arch/powerpc/mm/hugetlbpage.c    | 3 +--
 arch/powerpc/mm/pgtable-hash64.c | 4 +---
 arch/powerpc/mm/tlb_hash64.c     | 7 ++-----
 4 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 7a20669c19e7..943d9fe2b9a3 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1228,7 +1228,6 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
  unsigned long vsid;
  pte_t *ptep;
  unsigned hugeshift;
- const struct cpumask *tmp;
  int rc, user_region = 0;
  int psize, ssize;
 
@@ -1280,8 +1279,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
  }
 
  /* Check CPU locality */
- tmp = cpumask_of(smp_processor_id());
- if (user_region && cpumask_equal(mm_cpumask(mm), tmp))
+ if (user_region && mm_is_thread_local(mm))
  flags |= HPTE_LOCAL_UPDATE;
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -1543,7 +1541,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 #endif /* CONFIG_PPC_64K_PAGES */
 
  /* Is that local to this CPU ? */
- if (cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
+ if (mm_is_thread_local(mm))
  update_flags |= HPTE_LOCAL_UPDATE;
 
  /* Hash it in */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index e1bf5ca397fe..122844c0734a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -407,8 +407,7 @@ static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
  batchp = &get_cpu_var(hugepd_freelist_cur);
 
  if (atomic_read(&tlb->mm->mm_users) < 2 ||
-    cpumask_equal(mm_cpumask(tlb->mm),
-  cpumask_of(smp_processor_id()))) {
+    mm_is_thread_local(tlb->mm)) {
  kmem_cache_free(hugepte_cache, hugepte);
  put_cpu_var(hugepd_freelist_cur);
  return;
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 443a2c66a304..d89c637cb600 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -329,7 +329,6 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
  unsigned int psize;
  unsigned long vsid;
  unsigned long flags = 0;
- const struct cpumask *tmp;
 
  /* get the base page size,vsid and segment size */
 #ifdef CONFIG_DEBUG_VM
@@ -350,8 +349,7 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
  ssize = mmu_kernel_ssize;
  }
 
- tmp = cpumask_of(smp_processor_id());
- if (cpumask_equal(mm_cpumask(mm), tmp))
+ if (mm_is_thread_local(mm))
  flags |= HPTE_LOCAL_UPDATE;
 
  return flush_hash_hugepage(vsid, addr, pmdp, psize, ssize, flags);
diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index b5b0fb97b9c0..4dde13d7452d 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -138,13 +138,10 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
  */
 void __flush_tlb_pending(struct ppc64_tlb_batch *batch)
 {
- const struct cpumask *tmp;
- int i, local = 0;
+ int i, local;
 
  i = batch->index;
- tmp = cpumask_of(smp_processor_id());
- if (cpumask_equal(mm_cpumask(batch->mm), tmp))
- local = 1;
+ local = mm_is_thread_local(batch->mm);
  if (i == 1)
  flush_hash_page(batch->vpn[0], batch->pte[0],
  batch->psize, batch->ssize, local);
--
2.13.3


[PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Benjamin Herrenschmidt
In reply to this post by Benjamin Herrenschmidt
Instead of comparing the whole CPU mask every time, let's
keep a counter of how many bits are set in the mask. Thus
testing for a local mm only requires testing if that counter
is 1 and the current CPU bit is set in the mask.

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 +++
 arch/powerpc/include/asm/mmu_context.h   |  9 +++++++++
 arch/powerpc/include/asm/tlb.h           | 11 ++++++++++-
 arch/powerpc/mm/mmu_context_book3s64.c   |  2 ++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 1a220cdff923..c3b00e8ff791 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -83,6 +83,9 @@ typedef struct {
  mm_context_id_t id;
  u16 user_psize; /* page size index */
 
+ /* Number of bits in the mm_cpumask */
+ atomic_t active_cpus;
+
  /* NPU NMMU context */
  struct npu_context *npu_context;
 
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index ff1aeb2cd19f..cf8f50cd4030 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -96,6 +96,14 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
    struct mm_struct *mm) { }
 #endif
 
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void inc_mm_active_cpus(struct mm_struct *mm)
+{
+ atomic_inc(&mm->context.active_cpus);
+}
+#else
+static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
+#endif
 
 /*
  * switch_mm is the entry point called from the architecture independent
@@ -110,6 +118,7 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
  /* Mark this context has been used on the new CPU */
  if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
  cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
+ inc_mm_active_cpus(next);
  smp_mb();
  new_on_cpu = true;
  }
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 609557569f65..a7eabff27a0f 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -69,13 +69,22 @@ static inline int mm_is_core_local(struct mm_struct *mm)
       topology_sibling_cpumask(smp_processor_id()));
 }
 
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline int mm_is_thread_local(struct mm_struct *mm)
+{
+ if (atomic_read(&mm->context.active_cpus) > 1)
+ return false;
+ return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
+}
+#else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
  return cpumask_equal(mm_cpumask(mm),
       cpumask_of(smp_processor_id()));
 }
+#endif /* !CONFIG_PPC_BOOK3S_64 */
 
-#else
+#else /* CONFIG_SMP */
 static inline int mm_is_core_local(struct mm_struct *mm)
 {
  return 1;
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 8159f5219137..de17d3e714aa 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -174,6 +174,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 #ifdef CONFIG_SPAPR_TCE_IOMMU
  mm_iommu_init(mm);
 #endif
+ atomic_set(&mm->context.active_cpus, 0);
+
  return 0;
 }
 
--
2.13.3
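
For a sense of the cost being avoided (a rough illustration assuming a
distro-style NR_CPUS of 2048 and 64-bit longs, not a measurement):
mm_cpumask() then spans 2048 / 64 = 32 words, so cpumask_equal() can touch
up to 32 words on every check, while the new test is a single atomic_read()
of active_cpus plus one cpumask_test_cpu() bit test.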


[PATCH 6/6] powerpc/mm: Make switch_mm_irqs_off() out of line

Benjamin Herrenschmidt
In reply to this post by Benjamin Herrenschmidt
It's too big to be inline, and there is no reason to keep it
that way.

Signed-off-by: Benjamin Herrenschmidt <[hidden email]>

# Conflicts:
# arch/powerpc/include/asm/mmu_context.h
---
 arch/powerpc/include/asm/mmu_context.h | 73 ++----------------------------
 arch/powerpc/mm/Makefile               |  2 +-
 arch/powerpc/mm/mmu_context.c          | 81 ++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 71 deletions(-)
 create mode 100644 arch/powerpc/mm/mmu_context.c

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index cf8f50cd4030..ed949b2675cb 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -77,76 +77,9 @@ extern void switch_cop(struct mm_struct *next);
 extern int use_cop(unsigned long acop, struct mm_struct *mm);
 extern void drop_cop(unsigned long acop, struct mm_struct *mm);
 
-#if defined(CONFIG_PPC32)
-static inline void switch_mm_pgdir(struct task_struct *tsk,
-   struct mm_struct *mm)
-{
- /* 32-bit keeps track of the current PGDIR in the thread struct */
- tsk->thread.pgdir = mm->pgd;
-}
-#elif defined(CONFIG_PPC_BOOK3E_64)
-static inline void switch_mm_pgdir(struct task_struct *tsk,
-   struct mm_struct *mm)
-{
- /* 64-bit Book3E keeps track of current PGD in the PACA */
- get_paca()->pgd = mm->pgd;
-}
-#else
-static inline void switch_mm_pgdir(struct task_struct *tsk,
-   struct mm_struct *mm) { }
-#endif
-
-#ifdef CONFIG_PPC_BOOK3S_64
-static inline void inc_mm_active_cpus(struct mm_struct *mm)
-{
- atomic_inc(&mm->context.active_cpus);
-}
-#else
-static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
-#endif
-
-/*
- * switch_mm is the entry point called from the architecture independent
- * code in kernel/sched/core.c
- */
-static inline void switch_mm_irqs_off(struct mm_struct *prev,
-      struct mm_struct *next,
-      struct task_struct *tsk)
-{
- bool new_on_cpu = false;
-
- /* Mark this context has been used on the new CPU */
- if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
- cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
- inc_mm_active_cpus(next);
- smp_mb();
- new_on_cpu = true;
- }
-
- /* Some subarchs need to track the PGD elsewhere */
- switch_mm_pgdir(tsk, next);
-
- /* Nothing else to do if we aren't actually switching */
- if (prev == next)
- return;
-
- /* We must stop all altivec streams before changing the HW
- * context
- */
-#ifdef CONFIG_ALTIVEC
- if (cpu_has_feature(CPU_FTR_ALTIVEC))
- asm volatile ("dssall");
-#endif /* CONFIG_ALTIVEC */
-
- if (new_on_cpu)
- radix_kvm_prefetch_workaround(next);
-
- /*
- * The actual HW switching method differs between the various
- * sub architectures. Out of line for now
- */
- switch_mmu_context(prev, next, tsk);
-}
+extern void switch_mm_irqs_off(struct mm_struct *prev,
+       struct mm_struct *next,
+       struct task_struct *tsk);
 
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
      struct task_struct *tsk)
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index fa6a6ba34355..fb844d2f266e 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -8,7 +8,7 @@ ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC)
 
 obj-y := fault.o mem.o pgtable.o mmap.o \
    init_$(BITS).o pgtable_$(BITS).o \
-   init-common.o
+   init-common.o mmu_context.o
 obj-$(CONFIG_PPC_MMU_NOHASH) += mmu_context_nohash.o tlb_nohash.o \
    tlb_nohash_low.o
 obj-$(CONFIG_PPC_BOOK3E) += tlb_low_$(BITS)e.o
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
new file mode 100644
index 000000000000..5f7b59c7ace5
--- /dev/null
+++ b/arch/powerpc/mm/mmu_context.c
@@ -0,0 +1,81 @@
+/*
+ *  Common implementation of switch_mm_irqs_off
+ *
+ *  Copyright IBM Corp. 2017
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/mm.h>
+#include <linux/cpu.h>
+
+#include <asm/mmu_context.h>
+
+#if defined(CONFIG_PPC32)
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm)
+{
+ /* 32-bit keeps track of the current PGDIR in the thread struct */
+ tsk->thread.pgdir = mm->pgd;
+}
+#elif defined(CONFIG_PPC_BOOK3E_64)
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm)
+{
+ /* 64-bit Book3E keeps track of current PGD in the PACA */
+ get_paca()->pgd = mm->pgd;
+}
+#else
+static inline void switch_mm_pgdir(struct task_struct *tsk,
+   struct mm_struct *mm) { }
+#endif
+
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void inc_mm_active_cpus(struct mm_struct *mm)
+{
+ atomic_inc(&mm->context.active_cpus);
+}
+#else
+static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
+#endif
+
+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ bool new_on_cpu = false;
+
+ /* Mark this context has been used on the new CPU */
+ if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
+ cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
+ inc_mm_active_cpus(next);
+ smp_mb();
+ new_on_cpu = true;
+ }
+
+ /* Some subarchs need to track the PGD elsewhere */
+ switch_mm_pgdir(tsk, next);
+
+ /* Nothing else to do if we aren't actually switching */
+ if (prev == next)
+ return;
+
+ /* We must stop all altivec streams before changing the HW
+ * context
+ */
+ if (cpu_has_feature(CPU_FTR_ALTIVEC))
+ asm volatile ("dssall");
+
+ if (new_on_cpu)
+ radix_kvm_prefetch_workaround(next);
+
+ /*
+ * The actual HW switching method differs between the various
+ * sub architectures. Out of line for now
+ */
+ switch_mmu_context(prev, next, tsk);
+}
+
--
2.13.3


Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

Nicholas Piggin-2
In reply to this post by Benjamin Herrenschmidt
On Mon, 24 Jul 2017 14:28:00 +1000
Benjamin Herrenschmidt <[hidden email]> wrote:

> There is no guarantee that the various isync's involved with
> the context switch will order the update of the CPU mask with
> the first TLB entry for the new context being loaded by the HW.
>
> Be safe here and add a memory barrier to order any subsequent
> load/store which may bring entries into the TLB.
>
> The corresponding barrier on the other side already exists as
> pte updates use pte_xchg() which uses __cmpxchg_u64 which has
> a sync after the atomic operation.
>
> Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
> ---
>  arch/powerpc/include/asm/mmu_context.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index ed9a36ee3107..ff1aeb2cd19f 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -110,6 +110,7 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
>   /* Mark this context has been used on the new CPU */
>   if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
>   cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> + smp_mb();
>   new_on_cpu = true;
>   }
>  

I think this is the right thing to do, but it should be commented.
Is hwsync the right barrier? (i.e., it will order the page table walk)

Thanks,
Nick

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Nicholas Piggin-2
In reply to this post by Benjamin Herrenschmidt
On Mon, 24 Jul 2017 14:28:02 +1000
Benjamin Herrenschmidt <[hidden email]> wrote:

> Instead of comparing the whole CPU mask every time, let's
> keep a counter of how many bits are set in the mask. Thus
> testing for a local mm only requires testing if that counter
> is 1 and the current CPU bit is set in the mask.
>
> Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
> ---
>  arch/powerpc/include/asm/book3s/64/mmu.h |  3 +++
>  arch/powerpc/include/asm/mmu_context.h   |  9 +++++++++
>  arch/powerpc/include/asm/tlb.h           | 11 ++++++++++-
>  arch/powerpc/mm/mmu_context_book3s64.c   |  2 ++
>  4 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 1a220cdff923..c3b00e8ff791 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -83,6 +83,9 @@ typedef struct {
>   mm_context_id_t id;
>   u16 user_psize; /* page size index */
>  
> + /* Number of bits in the mm_cpumask */
> + atomic_t active_cpus;
> +
>   /* NPU NMMU context */
>   struct npu_context *npu_context;
>  
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index ff1aeb2cd19f..cf8f50cd4030 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -96,6 +96,14 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
>     struct mm_struct *mm) { }
>  #endif
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> +{
> + atomic_inc(&mm->context.active_cpus);
> +}
> +#else
> +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> +#endif

This is a bit awkward. Can we just move the entire function to test
cpumask and set / increment into helper functions and define them
together with mm_is_thread_local, so it's all in one place?

The extra atomic does not need to be defined when it's not used either.

Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
If it's <= then it should be similar load and compare, no?

Looks like a good optimisation though.

Thanks,
Nick
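
A sketch of what the NR_CPUS-based variant suggested here might look like
(hypothetical, assembled from the helpers introduced by the patch; it was
not posted as part of the series):

#if NR_CPUS > BITS_PER_LONG
/* Wide masks: comparing the whole cpumask is several loads, use the counter */
static inline int mm_is_thread_local(struct mm_struct *mm)
{
	if (atomic_read(&mm->context.active_cpus) > 1)
		return false;
	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
}
#else
/* Narrow masks: cpumask_equal() is already a single load and compare */
static inline int mm_is_thread_local(struct mm_struct *mm)
{
	return cpumask_equal(mm_cpumask(mm),
			     cpumask_of(smp_processor_id()));
}
#endif

As Michael points out further down the thread, the catch is that
active_cpus would then have to live in every subarch's mm_context_t, not
just the 64-bit Book3S one.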

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Michael Ellerman-2
Nicholas Piggin <[hidden email]> writes:

> On Mon, 24 Jul 2017 14:28:02 +1000
> Benjamin Herrenschmidt <[hidden email]> wrote:
>
>> Instead of comparing the whole CPU mask every time, let's
>> keep a counter of how many bits are set in the mask. Thus
>> testing for a local mm only requires testing if that counter
>> is 1 and the current CPU bit is set in the mask.
...
>
> Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
> If it's <= then it should be similar load and compare, no?

Do we make a machine with that few CPUs? ;)

I don't think it's worth special casing, all the distros run with much
much larger NR_CPUs than that.

cheers

Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

Benjamin Herrenschmidt
In reply to this post by Nicholas Piggin-2
On Mon, 2017-07-24 at 21:20 +1000, Nicholas Piggin wrote:
> I think this is the right thing to do, but it should be commented.
> Is hwsync the right barrier? (i.e., it will order the page table walk)

This is an open question, I've asked the architects and HW guys and
waiting for an answer.

That said, are we really trying to order the page table walk or are
we trying to order any (speculative or not) load/store that may trigger
a page table update?

Cheers,
Ben.


Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Benjamin Herrenschmidt
In reply to this post by Nicholas Piggin-2
On Mon, 2017-07-24 at 21:25 +1000, Nicholas Piggin wrote:

> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> > +{
> > +     atomic_inc(&mm->context.active_cpus);
> > +}
> > +#else
> > +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> > +#endif
>
> This is a bit awkward. Can we just move the entire function to test
> cpumask and set / increment into helper functions and define them
> together with mm_is_thread_local, so it's all in one place?

I thought about it but then we have 2 variants, unless I start moving
the active_cpus into mm_context_t on all the 32-bit subarchs too, etc..

It gets messy either way.

> The extra atomic does not need to be defined when it's not used either.
>
> Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
> If it's <= then it should be similar load and compare, no?

Right, we could.

> Looks like a good optimisation though.

Thx. It's a pre-req for further optimizations such as flushing the PID
when a single threaded process moves, so we don't have to constantly
scan the mask.

Cheers,
Ben.


Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Nicholas Piggin-2
In reply to this post by Michael Ellerman-2
On Mon, 24 Jul 2017 23:46:44 +1000
Michael Ellerman <[hidden email]> wrote:

> Nicholas Piggin <[hidden email]> writes:
>
> > On Mon, 24 Jul 2017 14:28:02 +1000
> > Benjamin Herrenschmidt <[hidden email]> wrote:
> >  
> >> Instead of comparing the whole CPU mask every time, let's
> >> keep a counter of how many bits are set in the mask. Thus
> >> testing for a local mm only requires testing if that counter
> >> is 1 and the current CPU bit is set in the mask.  
> ...
> >
> > Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
> > If it's <= then it should be similar load and compare, no?  
>
> Do we make a machine with that few CPUs? ;)
>
> I don't think it's worth special casing, all the distros run with much
> much larger NR_CPUs than that.

Not further special-casing, but just casing it based on NR_CPUS
rather than BOOK3S.

Thanks,
Nick

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Nicholas Piggin-2
In reply to this post by Benjamin Herrenschmidt
On Tue, 25 Jul 2017 06:58:46 +1000
Benjamin Herrenschmidt <[hidden email]> wrote:

> On Mon, 2017-07-24 at 21:25 +1000, Nicholas Piggin wrote:
> > > +#ifdef CONFIG_PPC_BOOK3S_64
> > > +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> > > +{
> > > +     atomic_inc(&mm->context.active_cpus);
> > > +}
> > > +#else
> > > +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> > > +#endif  
> >
> > This is a bit awkward. Can we just move the entire function to test
> > cpumask and set / increment into helper functions and define them
> > together with mm_is_thread_local, so it's all in one place?  
>
> I thought about it but then we have 2 variants, unless I start moving
> the active_cpus into mm_context_t on all the 32-bit subarchs too, etc..

The two variants are just cleaner versions of the two variants you
already introduced.

static inline bool mm_activate_cpu(struct mm_struct *mm)
{
    if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
        cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
#ifdef CONFIG_PPC_BOOK3S_64
        atomic_inc(&mm->context.active_cpus);
#endif
        smp_mb();
        return true;
    }
    return false;
}

I think it would be nicer to put something like that with
mm_is_thread_local etc definitions so you can see how it all works
in one place.

> It gets messy either way.
>
> > The extra atomic does not need to be defined when it's not used either.
> >
> > Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
> > If it's <= then it should be similar load and compare, no?  
>
> Right, we could.
>
> > Looks like a good optimisation though.  
>
> Thx. It's a pre-req for further optimizations such as flushing the PID
> when a single threaded process moves, so we don't have to constantly
> scan the mask.

Yep, will be very interesting to see how much global tlbies can be
reduced.

Thanks,
Nick

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Benjamin Herrenschmidt
On Tue, 2017-07-25 at 10:44 +1000, Nicholas Piggin wrote:

> The two variants are just cleaner versions of the two variants you
> already introduced.
>
> static inline bool mm_activate_cpu(struct mm_struct *mm)
> {
>     if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
>         cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> #ifdef CONFIG_PPC_BOOK3S_64
>         atomic_inc(&mm->context.active_cpus);
> #endif
>         smp_mb();
>         return true;
>     }
>     return false;
> }

Well the above is what I originally wrote, which Michael encouraged me
to turn into a helper ;-) I was removing ifdef's from switch_mm in
this series...

> I think it would be nicer to put something like that with
> mm_is_thread_local etc definitions so you can see how it all works
> in one place.
>
> > It gets messy either way.
> >
> > > The extra atomic does not need to be defined when it's not used either.
> > >
> > > Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
> > > If it's <= then it should be similar load and compare, no?  
> >
> > Right, we could.
> >
> > > Looks like a good optimisation though.  
> >
> > Thx. It's a pre-req for further optimizations such as flushing the PID
> > when a single threaded process moves, so we don't have to constantly
> > scan the mask.
>
> Yep, will be very interesting to see how much global tlbies can be
> reduced.
>
> Thanks,
> Nick

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Nicholas Piggin-2
On Tue, 25 Jul 2017 11:03:45 +1000
Benjamin Herrenschmidt <[hidden email]> wrote:

> On Tue, 2017-07-25 at 10:44 +1000, Nicholas Piggin wrote:
> > The two variants are just cleaner versions of the two variants you
> > already introduced.
> >
> > static inline bool mm_activate_cpu(struct mm_struct *mm)
> > {
> >     if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
> >         cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > #ifdef CONFIG_PPC_BOOK3S_64
> >         atomic_inc(&mm->context.active_cpus);
> > #endif
> >         smp_mb();
> >         return true;
> >     }
> >     return false;
> > }  
>
> Well the above is what I originally wrote, which Michael encouraged me
> to turn into a helper ;-) I was removing ifdef's from switch_mm in
> this series...

Well I won't harp on about it if you guys prefer the increment helper.
Just the comment would be good. The rest of the series seems okay to
me.

Thanks,
Nick

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Michael Ellerman-2
In reply to this post by Nicholas Piggin-2
Nicholas Piggin <[hidden email]> writes:

> On Mon, 24 Jul 2017 23:46:44 +1000
> Michael Ellerman <[hidden email]> wrote:
>
>> Nicholas Piggin <[hidden email]> writes:
>>
>> > On Mon, 24 Jul 2017 14:28:02 +1000
>> > Benjamin Herrenschmidt <[hidden email]> wrote:
>> >  
>> >> Instead of comparing the whole CPU mask every time, let's
>> >> keep a counter of how many bits are set in the mask. Thus
>> >> testing for a local mm only requires testing if that counter
>> >> is 1 and the current CPU bit is set in the mask.  
>> ...
>> >
>> > Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
>> > If it's <= then it should be similar load and compare, no?  
>>
>> Do we make a machine with that few CPUs? ;)
>>
>> I don't think it's worth special casing, all the distros run with much
>> much larger NR_CPUs than that.
>
> Not further special-casing, but just casing it based on NR_CPUS
> rather than BOOK3S.

The problem is the mm_context_t is defined based on BookE vs BookS etc.
not based on NR_CPUS.

So we'd have to add the atomic_t to all mm_context_t's, but #ifdef'ed
based on NR_CPUS.

But then some platforms don't support SMP, so it's a waste there. The
existing cpumask check compiles to ~= nothing on UP.

cheers

Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Frederic Barrat-2
In reply to this post by Benjamin Herrenschmidt


On 24/07/2017 at 06:28, Benjamin Herrenschmidt wrote:

> Instead of comparing the whole CPU mask every time, let's
> keep a counter of how many bits are set in the mask. Thus
> testing for a local mm only requires testing if that counter
> is 1 and the current CPU bit is set in the mask.
>
> Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
> ---
>   arch/powerpc/include/asm/book3s/64/mmu.h |  3 +++
>   arch/powerpc/include/asm/mmu_context.h   |  9 +++++++++
>   arch/powerpc/include/asm/tlb.h           | 11 ++++++++++-
>   arch/powerpc/mm/mmu_context_book3s64.c   |  2 ++
>   4 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 1a220cdff923..c3b00e8ff791 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -83,6 +83,9 @@ typedef struct {
>   mm_context_id_t id;
>   u16 user_psize; /* page size index */
>
> + /* Number of bits in the mm_cpumask */
> + atomic_t active_cpus;
> +
>   /* NPU NMMU context */
>   struct npu_context *npu_context;
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index ff1aeb2cd19f..cf8f50cd4030 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -96,6 +96,14 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
>     struct mm_struct *mm) { }
>   #endif
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> +{
> + atomic_inc(&mm->context.active_cpus);
> +}
> +#else
> +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> +#endif
>
>   /*
>    * switch_mm is the entry point called from the architecture independent
> @@ -110,6 +118,7 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
>   /* Mark this context has been used on the new CPU */
>   if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
>   cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> + inc_mm_active_cpus(next);
>   smp_mb();
>   new_on_cpu = true;
>   }
> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index 609557569f65..a7eabff27a0f 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -69,13 +69,22 @@ static inline int mm_is_core_local(struct mm_struct *mm)
>        topology_sibling_cpumask(smp_processor_id()));
>   }
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> +static inline int mm_is_thread_local(struct mm_struct *mm)
> +{
> + if (atomic_read(&mm->context.active_cpus) > 1)
> + return false;
> + return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
> +}
> +#else /* CONFIG_PPC_BOOK3S_64 */


While working on something related (marking a memory context as needing
a global TLBI if used behind an NPU or PSL):
http://patchwork.ozlabs.org/patch/796775/

Michael raised the point that the store for the pte update cannot be
reordered with the load which decides the scope of the TLBI, and
convinced me that a memory barrier was required.

Couldn't we have the same problem here, where the atomic read is
reordered with the store of the invalid PTE?

Thanks,

   Fred


>   static inline int mm_is_thread_local(struct mm_struct *mm)
>   {
>   return cpumask_equal(mm_cpumask(mm),
>        cpumask_of(smp_processor_id()));
>   }
> +#endif /* !CONFIG_PPC_BOOK3S_64 */
>
> -#else
> +#else /* CONFIG_SMP */
>   static inline int mm_is_core_local(struct mm_struct *mm)
>   {
>   return 1;
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index 8159f5219137..de17d3e714aa 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -174,6 +174,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>   #ifdef CONFIG_SPAPR_TCE_IOMMU
>   mm_iommu_init(mm);
>   #endif
> + atomic_set(&mm->context.active_cpus, 0);
> +
>   return 0;
>   }
>


Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

Benjamin Herrenschmidt
On Fri, 2017-08-04 at 14:06 +0200, Frederic Barrat wrote:

> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +static inline int mm_is_thread_local(struct mm_struct *mm)
> > +{
> > +     if (atomic_read(&mm->context.active_cpus) > 1)
> > +             return false;
> > +     return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
> > +}
> > +#else /* CONFIG_PPC_BOOK3S_64 */
>
>
> While working on something related (mark memory context as needing
> global TLBI if used behind a NPU or PSL):
> http://patchwork.ozlabs.org/patch/796775/
>
> Michael raised the point that the store for the pte update cannot be
> reordered with the load which decides the scope of the TLBI, and had
> convinced me that a memory barrier was required.
>
> Couldn't we have the same problem here, where the atomic read is
> reordered with the store of the invalid PTE?

The store of the invalid PTE is done with a pte_update which contains a
sync as far as I can tell.

Cheers,
Ben.


Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

Nicholas Piggin-2
In reply to this post by Nicholas Piggin-2
On Mon, 24 Jul 2017 21:20:07 +1000
Nicholas Piggin <[hidden email]> wrote:

> On Mon, 24 Jul 2017 14:28:00 +1000
> Benjamin Herrenschmidt <[hidden email]> wrote:
>
> > There is no guarantee that the various isync's involved with
> > the context switch will order the update of the CPU mask with
> > the first TLB entry for the new context being loaded by the HW.
> >
> > Be safe here and add a memory barrier to order any subsequent
> > load/store which may bring entries into the TLB.
> >
> > The corresponding barrier on the other side already exists as
> > pte updates use pte_xchg() which uses __cmpxchg_u64 which has
> > a sync after the atomic operation.
> >
> > Signed-off-by: Benjamin Herrenschmidt <[hidden email]>
> > ---
> >  arch/powerpc/include/asm/mmu_context.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index ed9a36ee3107..ff1aeb2cd19f 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -110,6 +110,7 @@ static inline void switch_mm_irqs_off(struct mm_struct *prev,
> >   /* Mark this context has been used on the new CPU */
> >   if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
> >   cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > + smp_mb();
> >   new_on_cpu = true;
> >   }
> >    
>
> I think this is the right thing to do, but it should be commented.
> Is hwsync the right barrier? (i.e., it will order the page table walk)

After some offline discussion, I think we have an agreement that
this is the right barrier, as it orders with the subsequent load
of next->context.id that the mtpid depends on (or slbmte for HPT).

So we should have a comment here to that effect, including the
pte_xchg details from your changelog. Please add a comment (or at
least a reference back to here) at pte_xchg too.

Other than that your series seems good to me if you repost it you
can add

Reviewed-by: Nicholas Piggin <[hidden email]>

This one out of the series is the bugfix so it should go to stable
as well, right?

Thanks,
Nick
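
For illustration, the requested comment might read something like this (a
sketch of possible wording only, not what was eventually merged):

	/* Mark this context has been used on the new CPU */
	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
		cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
		/*
		 * Order the above store with the load of next->context.id
		 * that the mtpid (or slbmte for HPT) depends on, so no TLB
		 * entry for the new context can be loaded before other CPUs
		 * can see this CPU in mm_cpumask(next).  Pairs with the sync
		 * after the atomic op in pte_xchg() / __cmpxchg_u64() on the
		 * PTE invalidation side.
		 */
		smp_mb();
		new_on_cpu = true;
	}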

Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

Benjamin Herrenschmidt
On Fri, 2017-08-11 at 21:06 +1000, Nicholas Piggin wrote:
> Other than that your series seems good to me if you repost it you
> can add
>
> Reviewed-by: Nicholas Piggin <[hidden email]>
>
> This one out of the series is the bugfix so it should go to stable
> as well, right?

Yup.

Ben.
