[RFC v2 00/12] powerpc: Memory Protection Keys

classic Classic list List threaded Threaded
56 messages Options
123
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 00/12] powerpc: Memory Protection Keys

Ram Pai
Memory protection keys enable applications to protect its
address space from inadvertent access or corruption from
itself.

The overall idea:

 A process allocates a   key  and associates it with
 a  address  range  within    its   address   space.
 The process  than  can  dynamically  set read/write
 permissions on  the   key   without  involving  the
 kernel. Any  code that  violates   the  permissions
 off the address space; as defined by its associated
 key, will receive a segmentation fault.

This patch series enables the feature on PPC64.
It is enabled on HPTE 64K-page platform.

ISA3.0 section 5.7.13 describes the detailed specifications.


Testing:
        This patch series has passed all the protection key
        tests available in  the selftests directory.
        The tests are updated to work on both x86 and powerpc.


version v2:
        (1) documentation and selftest added
  (2) fixed a bug in 4k hpte backed 64k pte where page
            invalidation was not done correctly, and
            initialization of second-part-of-the-pte was not
            done correctly if the pte was not yet Hashed
            with a hpte.  Reported by Aneesh.
        (3) Fixed ABI breakage caused in siginfo structure.
                Reported by Anshuman.
       
        Outstanding known issue:
          Calls to sys_swapcontext with a made-up context will end
          up with a crap AMR if done by code who didn't know about
          that register. -- Reported by Ben.

version v1: Initial version

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
           Michael Ellermen


Ram Pai (12):
  Free up four 64K PTE bits in 4K backed hpte pages.
  Free up four 64K PTE bits in 64K backed hpte pages.
  Implement sys_pkey_alloc and sys_pkey_free system call.
  store and restore the pkey state across context switches.
  Implementation for sys_mprotect_pkey() system call.
  Program HPTE key protection bits.
  Macro the mask used for checking DSI exception
  Handle exceptions caused by violation of pkey protection.
  Deliver SEGV signal on pkey violation.
  Read AMR only if pkey-violation caused the exception.
  Documentation updates.
  Updated protection key selftest

 Documentation/vm/protection-keys.txt          |  110 ++
 Documentation/x86/protection-keys.txt         |   85 --
 arch/powerpc/Kconfig                          |   15 +
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   20 +
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   48 +-
 arch/powerpc/include/asm/book3s/64/hash.h     |   15 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   10 +
 arch/powerpc/include/asm/book3s/64/mmu.h      |   10 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   84 +-
 arch/powerpc/include/asm/mman.h               |   29 +-
 arch/powerpc/include/asm/mmu_context.h        |   12 +
 arch/powerpc/include/asm/paca.h               |    1 +
 arch/powerpc/include/asm/pkeys.h              |  159 +++
 arch/powerpc/include/asm/processor.h          |    5 +
 arch/powerpc/include/asm/reg.h                |   10 +-
 arch/powerpc/include/asm/systbl.h             |    3 +
 arch/powerpc/include/asm/unistd.h             |    6 +-
 arch/powerpc/include/uapi/asm/ptrace.h        |    3 +-
 arch/powerpc/include/uapi/asm/unistd.h        |    3 +
 arch/powerpc/kernel/asm-offsets.c             |    5 +
 arch/powerpc/kernel/exceptions-64s.S          |   18 +-
 arch/powerpc/kernel/process.c                 |   18 +
 arch/powerpc/kernel/signal_32.c               |   14 +
 arch/powerpc/kernel/signal_64.c               |   14 +
 arch/powerpc/kernel/traps.c                   |   49 +
 arch/powerpc/mm/Makefile                      |    1 +
 arch/powerpc/mm/dump_linuxpagetables.c        |    3 +-
 arch/powerpc/mm/fault.c                       |   25 +-
 arch/powerpc/mm/hash64_4k.c                   |   14 +-
 arch/powerpc/mm/hash64_64k.c                  |   93 +-
 arch/powerpc/mm/hash_utils_64.c               |   35 +-
 arch/powerpc/mm/hugetlbpage-hash64.c          |   16 +-
 arch/powerpc/mm/mmu_context_book3s64.c        |    5 +
 arch/powerpc/mm/pkeys.c                       |  267 +++++
 include/linux/mm.h                            |   32 +-
 include/uapi/asm-generic/mman-common.h        |    2 +-
 tools/testing/selftests/vm/Makefile           |    1 +
 tools/testing/selftests/vm/pkey-helpers.h     |  365 +++++++
 tools/testing/selftests/vm/protection_keys.c  | 1451 +++++++++++++++++++++++++
 tools/testing/selftests/x86/Makefile          |    2 +-
 tools/testing/selftests/x86/pkey-helpers.h    |  219 ----
 tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------
 42 files changed, 2828 insertions(+), 1844 deletions(-)
 create mode 100644 Documentation/vm/protection-keys.txt
 delete mode 100644 Documentation/x86/protection-keys.txt
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c
 create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
 create mode 100644 tools/testing/selftests/vm/protection_keys.c
 delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
 delete mode 100644 tools/testing/selftests/x86/protection_keys.c

--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

Ram Pai
Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
in the 4K backed hpte pages. These bits continue to be used
for 64K backed hpte pages in this patch, but will be freed
up in the next patch.

The patch does the following change to the 64K PTE format

H_PAGE_BUSY moves from bit 3 to bit 9
H_PAGE_F_SECOND which occupied bit 4 moves to the second part
        of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
        second part of the pte.

the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
is  initialized  to  0xF  indicating  an invalid  slot.  If  a hpte
gets cached in a 0xF  slot(i.e  7th  slot  of  secondary),  it   is
released immediately. In  other  words, even  though   0xF   is   a
valid slot we discard  and consider it as an invalid
slot;i.e hpte_soft_invalid(). This  gives  us  an opportunity to not
depend on a bit in the primary PTE in order to determine the
validity of a slot.

When  we  release  a    hpte   in the 0xF   slot we also   release a
legitimate primary   slot  and    unmap    that  entry. This  is  to
ensure  that we do get a   legimate   non-0xF  slot the next time we
retry for a slot.

Though treating 0xF slot as invalid reduces the number of available
slots  and  may  have an effect  on the performance, the probabilty
of hitting a 0xF is extermely low.

Compared  to the current scheme, the above described scheme reduces
the number of false hash table updates  significantly  and  has the
added  advantage  of  releasing  four  valuable  PTE bits for other
purpose.

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellermen and myself.

4K PTE format remain unchanged currently.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 20 +++++++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
 arch/powerpc/include/asm/book3s/64/hash.h     | 15 +++--
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
 arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
 arch/powerpc/mm/hash64_4k.c                   | 14 ++---
 arch/powerpc/mm/hash64_64k.c                  | 81 ++++++++++++---------------
 arch/powerpc/mm/hash_utils_64.c               | 30 +++++++---
 8 files changed, 122 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..5ef1d81 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,18 @@
 #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+
+/*
+ * Only supported by 4k linux page size
+ */
+#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT     56
+
+#define H_PAGE_BUSY _RPAGE_RSV1     /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43    /* PTE has associated HPTE */
+
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
  H_PAGE_F_SECOND | H_PAGE_F_GIX)
@@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
 }
 #endif
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ return (slot << H_PAGE_F_GIX_SHIFT) &
+ (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..0eb3c89 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,23 +10,25 @@
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
  */
-#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+
+#define H_PAGE_BUSY _RPAGE_RPN42     /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43    /* PTE has associated HPTE */
+
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
- H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
+
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
  return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
 }
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+ rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+ *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
+ return 0x0UL;
+}
+
 #define __rpte_to_pte(r) ((r).pte)
 extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..e7cf03a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,8 @@
  *
  */
 #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIFT 56
-#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
-#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
+#define INIT_HIDX (~0x0UL)
 
 #ifdef CONFIG_PPC_64K_PAGES
 #include <asm/book3s/64/hash-64k.h>
@@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
  return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
 }
 
+static inline bool hpte_soft_invalid(unsigned long slot)
+{
+ return ((slot & 0xfUL) == 0xfUL);
+}
+
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 /* This low level function performs the actual PTE insertion
  * Setting the PTE depends on the MMU type and other factors. It's
  * an horrible mess that I'm not going to try to clean up now but
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..cfb8169 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
 extern int __hash_page_64K(unsigned long ea, unsigned long access,
    unsigned long vsid, pte_t *ptep, unsigned long trap,
    unsigned long flags, int ssize);
+extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot);
+extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 struct mm_struct;
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
 extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..b832ed3 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
  .val = H_PAGE_4K_PFN,
  .set = "4K_pfn",
  }, {
-#endif
+#else
  .mask = H_PAGE_F_GIX,
  .val = H_PAGE_F_GIX,
  .set = "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
  .val = H_PAGE_F_SECOND,
  .set = "f_second",
  }, {
+#endif /* CONFIG_PPC_64K_PAGES */
 #endif
  .mask = _PAGE_SPECIAL,
  .val = _PAGE_SPECIAL,
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..c673829 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
    pte_t *ptep, unsigned long trap, unsigned long flags,
    int ssize, int subpg_prot)
 {
+ real_pte_t rpte;
  unsigned long hpte_group;
  unsigned long rflags, pa;
  unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  * need to add in 0x1 if it's a read-only user page
  */
  rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);
 
  if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
     !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  /*
  * There MIGHT be an HPTE for this pte
  */
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+ unsigned long gslot = get_hidx_gslot(vpn, shift,
+ ssize, rpte, 0);
 
- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
        MMU_PAGE_4K, ssize, flags) == -1)
  old_pte &= ~_PAGE_HPTEFLAGS;
  }
@@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  return -1;
  }
  new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
  }
  *ptep = __pte(new_pte & ~H_PAGE_BUSY);
  return 0;
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..3702a3c 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -15,34 +15,13 @@
 #include <linux/mm.h>
 #include <asm/machdep.h>
 #include <asm/mmu.h>
+
 /*
  * index from 0 - 15
  */
 bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
 {
- unsigned long g_idx;
- unsigned long ptev = pte_val(rpte.pte);
-
- g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
- index = index >> 2;
- if (g_idx & (0x1 << index))
- return true;
- else
- return false;
-}
-/*
- * index from 0 - 15
- */
-static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
-{
- unsigned long g_idx;
-
- if (!(ptev & H_PAGE_COMBO))
- return ptev;
- index = index >> 2;
- g_idx = 0x1 << index;
-
- return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
+ return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
 }
 
 int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
@@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
    int ssize, int subpg_prot)
 {
  real_pte_t rpte;
- unsigned long *hidxp;
  unsigned long hpte_group;
  unsigned int subpg_index;
- unsigned long rflags, pa, hidx;
+ unsigned long rflags, pa;
  unsigned long old_pte, new_pte, subpg_pte;
  unsigned long vpn, hash, slot;
  unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
@@ -116,28 +94,23 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  * On hash insert failure we use old pte value and we don't
  * want slot information there if we have a insert failure.
  */
- old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
- new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
+ old_pte &= ~(H_PAGE_HASHPTE);
+ new_pte &= ~(H_PAGE_HASHPTE);
  goto htab_insert_hpte;
  }
  /*
  * Check for sub page valid and update
  */
  if (__rpte_sub_valid(rpte, subpg_index)) {
- int ret;
 
- hash = hpt_hash(vpn, shift, ssize);
- hidx = __rpte_to_hidx(rpte, subpg_index);
- if (hidx & _PTEIDX_SECONDARY)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += hidx & _PTEIDX_GROUP_IX;
+ unsigned long gslot = get_hidx_gslot(vpn, shift,
+ ssize, rpte, subpg_index);
 
- ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
+ int ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
  MMU_PAGE_4K, MMU_PAGE_4K,
  ssize, flags);
  /*
- *if we failed because typically the HPTE wasn't really here
+ * if we failed because typically the HPTE wasn't really here
  * we try an insertion.
  */
  if (ret == -1)
@@ -148,6 +121,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  }
 
 htab_insert_hpte:
+
+ /*
+ * initialize all hidx entries to a invalid value,
+ * the first time the PTE is about to allocate
+ * a 4K hpte
+ */
+ if (!(old_pte & H_PAGE_COMBO))
+ rpte.hidx = INIT_HIDX;
+
  /*
  * handle H_PAGE_4K_PFN case
  */
@@ -177,10 +159,20 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  rflags, HPTE_V_SECONDARY,
  MMU_PAGE_4K, MMU_PAGE_4K,
  ssize);
- if (slot == -1) {
- if (mftb() & 0x1)
+
+ if (unlikely(hpte_soft_invalid(slot))) {
+ slot = slot & _PTEIDX_GROUP_IX;
+ mmu_hash_ops.hpte_invalidate(hpte_group+slot, vpn,
+ MMU_PAGE_4K, MMU_PAGE_4K,
+ ssize, flags);
+ }
+
+ if (unlikely(slot == -1 || hpte_soft_invalid(slot))) {
+
+ if (hpte_soft_invalid(slot) || (mftb() & 0x1))
  hpte_group = ((hash & htab_hash_mask) *
       HPTES_PER_GROUP) & ~0x7UL;
+
  mmu_hash_ops.hpte_remove(hpte_group);
  /*
  * FIXME!! Should be try the group from which we removed ?
@@ -204,11 +196,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
  * Since we have H_PAGE_BUSY set on ptep, we can be sure
  * nobody is undating hidx.
  */
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx &= ~(0xfUL << (subpg_index << 2));
- *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
- new_pte = mark_subptegroup_valid(new_pte, subpg_index);
- new_pte |=  H_PAGE_HASHPTE;
+ new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+ new_pte |= H_PAGE_HASHPTE;
+
  /*
  * check __real_pte for details on matching smp_rmb()
  */
@@ -322,9 +312,10 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
    MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
  return -1;
  }
- new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
+
  new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
  }
  *ptep = __pte(new_pte & ~H_PAGE_BUSY);
  return 0;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index f2095ce..c0f4b46 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
 
 void __init hash__early_init_mmu(void)
 {
+#ifndef CONFIG_PPC_64K_PAGES
  /*
- * We have code in __hash_page_64K() and elsewhere, which assumes it can
+ * We have code in __hash_page_4K() and elsewhere, which assumes it can
  * do the following:
  *   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
  *
@@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
  * with a BUILD_BUG_ON().
  */
  BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
+#endif /* CONFIG_PPC_64K_PAGES */
 
  htab_init_page_sizes();
 
@@ -1589,29 +1591,39 @@ static inline void tm_flush_hash_page(int local)
 }
 #endif
 
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+ unsigned long hash, slot, hidx;
+
+ hash = hpt_hash(vpn, shift, ssize);
+ hidx = __rpte_to_hidx(rpte, subpg_index);
+ if (hidx & _PTEIDX_SECONDARY)
+ hash = ~hash;
+ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+ slot += hidx & _PTEIDX_GROUP_IX;
+ return slot;
+}
+
+
 /* WARNING: This is called from hash_low_64.S, if you change this prototype,
  *          do not forget to update the assembly call site !
  */
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
      unsigned long flags)
 {
- unsigned long hash, index, shift, hidx, slot;
+ unsigned long hash, index, shift, hidx, gslot;
  int local = flags & HPTE_LOCAL_UPDATE;
 
  DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
  pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
- hash = hpt_hash(vpn, shift, ssize);
- hidx = __rpte_to_hidx(pte, index);
- if (hidx & _PTEIDX_SECONDARY)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += hidx & _PTEIDX_GROUP_IX;
+ gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
  DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
  /*
  * We use same base page size and actual psize, because we don't
  * use these functions for hugepage
  */
- mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
+ mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
      ssize, local);
  } pte_iterate_hashed_end();
 
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

Ram Pai
In reply to this post by Ram Pai
Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
in the 64K backed hpte pages. This along with the earlier
patch will entirely free up the four bits from 64K PTE.

This patch does the following change to 64K PTE that is
backed by 64K hpte.

H_PAGE_F_SECOND which occupied bit 4 moves to the second part
        of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
        second part of the pte.

since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
to bit 7. Trying to minimize gaps so that contiguous bits
can be allocated if needed in the future.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 26 ++++++++------------------
 arch/powerpc/mm/hash64_64k.c                  | 16 +++++++---------
 arch/powerpc/mm/hugetlbpage-hash64.c          | 16 ++++++----------
 3 files changed, 21 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0eb3c89..2fa5c60 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,12 +12,8 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_F_GIX_SHIFT 56
 
-
-#define H_PAGE_BUSY _RPAGE_RPN42     /* software: PTE & hash are busy */
+#define H_PAGE_BUSY _RPAGE_RPN44     /* software: PTE & hash are busy */
 #define H_PAGE_HASHPTE _RPAGE_RPN43    /* PTE has associated HPTE */
 
 /*
@@ -56,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
  unsigned long *hidxp;
 
  rpte.pte = pte;
- rpte.hidx = 0;
- if (pte_val(pte) & H_PAGE_COMBO) {
- /*
- * Make sure we order the hidx load against the H_PAGE_COMBO
- * check. The store side ordering is done in __hash_page_4K
- */
- smp_rmb();
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx = *hidxp;
- }
+ /*
+ * The store side ordering is done in __hash_page_4K
+ */
+ smp_rmb();
+ hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+ rpte.hidx = *hidxp;
  return rpte;
 }
 
 static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
 {
- if ((pte_val(rpte.pte) & H_PAGE_COMBO))
- return (rpte.hidx >> (index<<2)) & 0xf;
- return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+ return ((rpte.hidx >> (index<<2)) & 0xfUL);
 }
 
 static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 3702a3c..1c25ec2 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -211,6 +211,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
     unsigned long vsid, pte_t *ptep, unsigned long trap,
     unsigned long flags, int ssize)
 {
+ real_pte_t rpte;
  unsigned long hpte_group;
  unsigned long rflags, pa;
  unsigned long old_pte, new_pte;
@@ -247,6 +248,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
  } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
  rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);
 
  if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
     !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -254,16 +256,13 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 
  vpn  = hpt_vpn(ea, vsid, ssize);
  if (unlikely(old_pte & H_PAGE_HASHPTE)) {
+ unsigned long gslot;
+
  /*
  * There MIGHT be an HPTE for this pte
  */
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
+ gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_64K,
        MMU_PAGE_64K, ssize,
        flags) == -1)
  old_pte &= ~_PAGE_HPTEFLAGS;
@@ -313,8 +312,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
  return -1;
  }
 
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ set_hidx_slot(ptep, rpte, 0, slot);
  new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
  }
  *ptep = __pte(new_pte & ~H_PAGE_BUSY);
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index a84bb44..239ca86 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
      pte_t *ptep, unsigned long trap, unsigned long flags,
      int ssize, unsigned int shift, unsigned int mmu_psize)
 {
+ real_pte_t rpte;
  unsigned long vpn;
  unsigned long old_pte, new_pte;
  unsigned long rflags, pa, sz;
@@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
  } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
  rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);
 
  sz = ((1UL) << shift);
  if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
  /* Check if pte already has an hpte (case 2) */
  if (unlikely(old_pte & H_PAGE_HASHPTE)) {
  /* There MIGHT be an HPTE for this pte */
- unsigned long hash, slot;
+ unsigned long gslot;
 
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
+ gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
        mmu_psize, ssize, flags) == -1)
  old_pte &= ~_PAGE_HPTEFLAGS;
  }
@@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
  return -1;
  }
 
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
  }
 
  /*
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 03/12] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

Ram Pai
In reply to this post by Ram Pai
Sys_pkey_alloc() allocates and returns available pkey
Sys_pkey_free()  frees up the pkey.

Total 32 keys are supported on powerpc. However pkey 0,1 and 31
are reserved. So effectively we have 29 pkeys.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/Kconfig                         |  15 ++++
 arch/powerpc/include/asm/book3s/64/mmu.h     |  10 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  62 ++++++++++++++
 arch/powerpc/include/asm/pkeys.h             | 124 +++++++++++++++++++++++++++
 arch/powerpc/include/asm/systbl.h            |   2 +
 arch/powerpc/include/asm/unistd.h            |   4 +-
 arch/powerpc/include/uapi/asm/unistd.h       |   2 +
 arch/powerpc/mm/Makefile                     |   1 +
 arch/powerpc/mm/mmu_context_book3s64.c       |   5 ++
 arch/powerpc/mm/pkeys.c                      |  88 +++++++++++++++++++
 include/linux/mm.h                           |  31 ++++---
 include/uapi/asm-generic/mman-common.h       |   2 +-
 12 files changed, 331 insertions(+), 15 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP
 
   If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+ prompt "PowerPC Memory Protection Keys"
+ def_bool y
+ # Note: only available in 64-bit mode
+ depends on PPC64 && PPC_64K_PAGES
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select ARCH_HAS_PKEYS
+ ---help---
+  Memory Protection Keys provides a mechanism for enforcing
+  page-based protections, but without requiring modification of the
+  page tables when an application changes protection domains.
+
+  For details, see Documentation/powerpc/protection-keys.txt
+
+  If unsure, say y.
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
  struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ /*
+ * Each bit represents one protection key.
+ * bit set   -> key allocated
+ * bit unset -> key available for allocation
+ */
+ u32 pkey_allocation_map;
+ s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
  pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include <asm/reg.h>
+static inline u64 read_amr(void)
+{
+ return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+ mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+ return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+ mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+ return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+ mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_amr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_uamor(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_iamr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
        unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 0000000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey()  32
+
+#define AMR_AD_BIT 0x1UL
+#define AMR_WD_BIT 0x2UL
+#define IAMR_EX_BIT 0x1UL
+#define AMR_BITS_PER_PKEY 2
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | \
+ VM_PKEY_BIT1 | \
+ VM_PKEY_BIT2 | \
+ VM_PKEY_BIT3 | \
+ VM_PKEY_BIT4)
+
+/*
+ * Bits are in BE format.
+ * NOTE: key 31, 1, 0 are not used.
+ * key 0 is used by default. It give read/write/execute permission.
+ * key 31 is reserved by the hypervisor.
+ * key 1 is recommended to be not used.
+ * PowerISA(3.0) page 1015, programming note.
+ */
+#define PKEY_INITIAL_ALLOCAION  0xc0000001
+
+#define pkeybit_mask(pkey) (0x1 << (arch_max_pkey() - pkey - 1))
+
+#define mm_pkey_allocation_map(mm) (mm->context.pkey_allocation_map)
+
+#define mm_set_pkey_allocated(mm, pkey) { \
+ mm_pkey_allocation_map(mm) |= pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_free(mm, pkey) { \
+ mm_pkey_allocation_map(mm) &= ~pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_is_allocated(mm, pkey) \
+ (mm_pkey_allocation_map(mm) & pkeybit_mask(pkey))
+
+#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
+ pkeybit_mask(pkey))
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+ /* a reserved key is never considered as 'explicitly allocated' */
+ return (!mm_set_pkey_is_reserved(mm, pkey) &&
+ mm_set_pkey_is_allocated(mm, pkey));
+}
+
+/*
+ * Returns a positive, 5-bit key on success, or -1 on failure.
+ */
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+ /*
+ * Note: this is the one and only place we make sure
+ * that the pkey is valid as far as the hardware is
+ * concerned.  The rest of the kernel trusts that
+ * only good, valid pkeys come out of here.
+ */
+ u32 all_pkeys_mask = (u32)(~(0x0));
+ int ret;
+
+ /*
+ * Are we out of pkeys?  We must handle this specially
+ * because ffz() behavior is undefined if there are no
+ * zeros.
+ */
+ if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+ return -1;
+
+ ret = arch_max_pkey() -
+ ffz((u32)mm_pkey_allocation_map(mm))
+ - 1;
+ mm_set_pkey_allocated(mm, ret);
+ return ret;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+ if (!mm_pkey_is_allocated(mm, pkey))
+ return -EINVAL;
+
+ mm_set_pkey_free(mm, pkey);
+
+ return 0;
+}
+
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+ return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey)
+{
+ return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val);
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
+
+static inline pkey_mm_init(struct mm_struct *mm)
+{
+ mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
+ /* -1 means unallocated or invalid */
+ mm->context.execute_only_pkey = -1;
+}
+
+#endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 1c94708..22dd776 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -388,3 +388,5 @@
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(pkey_alloc)
+SYSCALL(pkey_free)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 9ba11db..e0273bc 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,13 +12,11 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 384
+#define NR_syscalls 386
 
 #define __NR__exit __NR_exit
 
 #define __IGNORE_pkey_mprotect
-#define __IGNORE_pkey_alloc
-#define __IGNORE_pkey_free
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index b85f142..7993a07 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -394,5 +394,7 @@
 #define __NR_pwritev2 381
 #define __NR_kexec_file_load 382
 #define __NR_statx 383
+#define __NR_pkey_alloc 384
+#define __NR_pkey_free 385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 7414034..8cc2ff1 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -45,3 +45,4 @@ obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
 obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_iommu.o
 obj-$(CONFIG_PPC_PTDUMP) += dump_linuxpagetables.o
 obj-$(CONFIG_PPC_HTDUMP) += dump_hashpagetable.o
+obj-$(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) += pkeys.o
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2a..2da9931 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -16,6 +16,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <linux/spinlock.h>
 #include <linux/idr.h>
 #include <linux/export.h>
@@ -120,6 +121,10 @@ static int hash__init_new_context(struct mm_struct *mm)
 
  subpage_prot_init_new_context(mm);
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ pkey_mm_init(mm);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  return index;
 }
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
new file mode 100644
index 0000000..b97366e
--- /dev/null
+++ b/arch/powerpc/mm/pkeys.c
@@ -0,0 +1,88 @@
+/*
+ * PowerPC Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/pkeys.h>                /* PKEY_*                       */
+#include <uapi/asm-generic/mman-common.h>
+
+
+/*
+ * set the access right in AMR IAMR and UAMOR register
+ * for @pkey to that specified in @init_val.
+ */
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ u64 old_amr, old_uamor, old_iamr;
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+ u64 new_amr_bits = 0x0ul;
+ u64 new_iamr_bits = 0x0ul;
+ u64 new_uamor_bits = 0x3ul;
+
+ /* Set the bits we need in AMR:  */
+ if (init_val & PKEY_DISABLE_ACCESS)
+ new_amr_bits |= AMR_AD_BIT;
+ if (init_val & PKEY_DISABLE_WRITE)
+ new_amr_bits |= AMR_WD_BIT;
+
+ /*
+ * By default execute is disabled.
+ * To enable execute, PKEY_ENABLE_EXECUTE
+ * needs to be specified.
+ */
+ if ((init_val & PKEY_DISABLE_EXECUTE))
+ new_iamr_bits |= IAMR_EX_BIT;
+
+ /* Shift the bits in to the correct place in AMR for pkey: */
+ new_amr_bits <<= pkey_shift;
+ new_iamr_bits <<= pkey_shift;
+ new_uamor_bits <<= pkey_shift;
+
+ /* Get old AMR and mask off any old bits in place: */
+ old_amr = read_amr();
+ old_amr &= ~((u64)(AMR_AD_BIT|AMR_WD_BIT) << pkey_shift);
+
+ old_iamr = read_iamr();
+ old_iamr &= ~(0x3ul << pkey_shift);
+
+ old_uamor = read_uamor();
+ old_uamor &= ~(0x3ul << pkey_shift);
+
+ /* Write old part along with new part: */
+ write_amr(old_amr | new_amr_bits);
+ write_iamr(old_iamr | new_iamr_bits);
+ write_uamor(old_uamor | new_uamor_bits);
+
+ return 0;
+}
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+ return -1;
+}
+
+/*
+ * This should only be called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
+ int pkey)
+{
+ /*
+ * Is this an mprotect_pkey() call?  If so, never
+ * override the value that came from the user.
+ */
+ if (pkey != -1)
+ return pkey;
+
+ return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..34ddac7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
 # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
-# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
-# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
-# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
-# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
-#endif
+#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
+ || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #elif defined(CONFIG_PPC)
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#define VM_PKEY_BIT4 VM_HIGH_ARCH_4  /* intel does not use this bit */
+ /* but reserved for future expansion */
 # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
 # define VM_GROWSUP VM_ARCH_1
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0..b13ecc6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -76,5 +76,5 @@
 #define PKEY_DISABLE_WRITE 0x2
 #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
  PKEY_DISABLE_WRITE)
-
+#define PKEY_DISABLE_EXECUTE 0x4
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 04/12] powerpc: store and restore the pkey state across context switches.

Ram Pai
In reply to this post by Ram Pai
Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/processor.h |  5 +++++
 arch/powerpc/kernel/process.c        | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
  struct thread_vr_state ckvr_state; /* Checkpointed VR state */
  unsigned long ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+ unsigned long iamr;
+ unsigned long uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
  void* kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
  t->tar = mfspr(SPRN_TAR);
  }
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ t->amr = mfspr(SPRN_AMR);
+ t->iamr = mfspr(SPRN_IAMR);
+ t->uamor = mfspr(SPRN_UAMOR);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct *old_thread,
  mtspr(SPRN_TAR, new_thread->tar);
  }
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (old_thread->amr != new_thread->amr)
+ mtspr(SPRN_AMR, new_thread->amr);
+ if (old_thread->iamr != new_thread->iamr)
+ mtspr(SPRN_IAMR, new_thread->iamr);
+ if (old_thread->uamor != new_thread->uamor)
+ mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
 }
 
 struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
  current->thread.tm_texasr = 0;
  current->thread.tm_tfiar = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ current->thread.amr   = 0x0ul;
+ current->thread.iamr  = 0x0ul;
+ current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 05/12] powerpc: Implementation for sys_mprotect_pkey() system call.

Ram Pai
In reply to this post by Ram Pai
This system call, associates the pkey with PTE of all
pages corresponding to the given address range.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++++++-
 arch/powerpc/include/asm/mman.h              | 29 +++++----
 arch/powerpc/include/asm/pkeys.h             | 21 ++++++-
 arch/powerpc/include/asm/systbl.h            |  1 +
 arch/powerpc/include/asm/unistd.h            |  4 +-
 arch/powerpc/include/uapi/asm/unistd.h       |  1 +
 arch/powerpc/mm/pkeys.c                      | 93 +++++++++++++++++++++++++++-
 include/linux/mm.h                           |  1 +
 8 files changed, 154 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
 #define _RPAGE_RSV2 0x0800000000000000UL
 #define _RPAGE_RSV3 0x0400000000000000UL
 #define _RPAGE_RSV4 0x0200000000000000UL
+#define _RPAGE_RSV5 0x00040UL
 
 #define _PAGE_PTE 0x4000000000000000UL /* distinguishes PTEs from pointers */
 #define _PAGE_PRESENT 0x8000000000000000UL /* pte contains a translation */
@@ -56,6 +57,20 @@
 /* Max physical address bit as per radix table */
 #define _RPAGE_PA_MAX 57
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5
+#else /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0 0
+#define H_PAGE_PKEY_BIT1 0
+#define H_PAGE_PKEY_BIT2 0
+#define H_PAGE_PKEY_BIT3 0
+#define H_PAGE_PKEY_BIT4 0
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * Max physical address bit we will use for now.
  *
@@ -122,7 +137,12 @@
 #define PAGE_PROT_BITS  (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
  H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
  _PAGE_READ | _PAGE_WRITE |  _PAGE_DIRTY | _PAGE_EXEC | \
- _PAGE_SOFT_DIRTY)
+ _PAGE_SOFT_DIRTY | \
+ H_PAGE_PKEY_BIT0 | \
+ H_PAGE_PKEY_BIT1 | \
+ H_PAGE_PKEY_BIT2 | \
+ H_PAGE_PKEY_BIT3 | \
+ H_PAGE_PKEY_BIT4)
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..14cc1aa 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,24 +13,31 @@
 
 #include <asm/cputable.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <asm/cpu_has_feature.h>
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
 /*
  * This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
- unsigned long pkey)
-{
- return (prot & PROT_SAO) ? VM_SAO : 0;
-}
-#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+#define arch_calc_vm_prot_bits(prot, key) (             \
+ ((prot) & PROT_SAO ? VM_SAO : 0) | \
+ pkey_to_vmflag_bits(key))
+#define arch_vm_get_page_prot(vm_flags) __pgprot(       \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0) | \
+ vmflag_to_page_pkey_bits(vm_flags))
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+#define arch_calc_vm_prot_bits(prot, key) ( \
+ ((prot) & PROT_SAO ? VM_SAO : 0))
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0))
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
-static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
-{
- return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
-}
-#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
 static inline bool arch_validate_prot(unsigned long prot)
 {
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
  VM_PKEY_BIT3 | \
  VM_PKEY_BIT4)
 
+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+ ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
+ ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
+ ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) | \
+ ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags)  \
+ (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)|     \
+ ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) |     \
+ ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) |     \
+ ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) |     \
+ ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
@@ -42,6 +55,12 @@
 #define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
  pkeybit_mask(pkey))
 
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+ return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
  /* a reserved key is never considered as 'explicitly allocated' */
@@ -114,7 +133,7 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
  return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
-static inline pkey_mm_init(struct mm_struct *mm)
+static inline void pkey_mm_init(struct mm_struct *mm)
 {
  mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
  /* -1 means unallocated or invalid */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
 SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 386
+#define NR_syscalls 387
 
 #define __NR__exit __NR_exit
 
-#define __IGNORE_pkey_mprotect
-
 #ifndef __ASSEMBLY__
 
 #include <linux/types.h>
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_statx 383
 #define __NR_pkey_alloc 384
 #define __NR_pkey_free 385
+#define __NR_pkey_mprotect 386
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index b97366e..11a32b3 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,6 +15,17 @@
 #include <linux/pkeys.h>                /* PKEY_*                       */
 #include <uapi/asm-generic/mman-common.h>
 
+#define pkeyshift(pkey) ((arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY)
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+ int pkey_shift = pkeyshift(pkey);
+
+ if (!(read_uamor() & (0x3UL << pkey_shift)))
+ return true;
+
+ return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
+}
 
 /*
  * set the access right in AMR IAMR and UAMOR register
@@ -68,7 +79,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
- return -1;
+ bool need_to_set_mm_pkey = false;
+ int execute_only_pkey = mm->context.execute_only_pkey;
+ int ret;
+
+ /* Do we need to assign a pkey for mm's execute-only maps? */
+ if (execute_only_pkey == -1) {
+ /* Go allocate one to use, which might fail */
+ execute_only_pkey = mm_pkey_alloc(mm);
+ if (execute_only_pkey < 0)
+ return -1;
+ need_to_set_mm_pkey = true;
+ }
+
+ /*
+ * We do not want to go through the relatively costly
+ * dance to set AMR if we do not need to.  Check it
+ * first and assume that if the execute-only pkey is
+ * readwrite-disabled than we do not have to set it
+ * ourselves.
+ */
+ if (!need_to_set_mm_pkey &&
+    !pkey_allows_readwrite(execute_only_pkey))
+ return execute_only_pkey;
+
+ /*
+ * Set up AMR so that it denies access for everything
+ * other than execution.
+ */
+ ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+ (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+ /*
+ * If the AMR-set operation failed somehow, just return
+ * 0 and effectively disable execute-only support.
+ */
+ if (ret) {
+ mm_set_pkey_free(mm, execute_only_pkey);
+ return -1;
+ }
+
+ /* We got one, store it and use it from here on out */
+ if (need_to_set_mm_pkey)
+ mm->context.execute_only_pkey = execute_only_pkey;
+ return execute_only_pkey;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+ /* Do this check first since the vm_flags should be hot */
+ if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+ return false;
+ if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
+ return false;
+
+ return true;
 }
 
 /*
@@ -84,5 +148,30 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
  if (pkey != -1)
  return pkey;
 
- return 0;
+ /*
+ * Look for a protection-key-drive execute-only mapping
+ * which is now being given permissions that are not
+ * execute-only.  Move it back to the default pkey.
+ */
+ if (vma_is_pkey_exec_only(vma) &&
+    (prot & (PROT_READ|PROT_WRITE))) {
+ return 0;
+ }
+ /*
+ * The mapping is execute-only.  Go try to get the
+ * execute-only protection key.  If we fail to do that,
+ * fall through as if we do not have execute-only
+ * support.
+ */
+ if (prot == PROT_EXEC) {
+ pkey = execute_only_pkey(vma->vm_mm);
+ if (pkey > 0)
+ return pkey;
+ }
+ /*
+ * This is a vanilla, non-pkey mprotect (or we failed to
+ * setup execute-only), inherit the pkey from the VMA we
+ * are working on.
+ */
+ return vma_pkey(vma);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34ddac7..5399031 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_PKEY_BIT3 VM_HIGH_ARCH_3
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #elif defined(CONFIG_PPC)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
 #define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
 #define VM_PKEY_BIT1 VM_HIGH_ARCH_1
 #define VM_PKEY_BIT2 VM_HIGH_ARCH_2
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 06/12] powerpc: Program HPTE key protection bits.

Ram Pai
In reply to this post by Ram Pai
Map the PTE protection key bits to the HPTE key protection bits,
while creatiing HPTE  entries.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
 arch/powerpc/include/asm/pkeys.h              | 7 +++++++
 arch/powerpc/mm/hash_utils_64.c               | 5 +++++
 3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cfb8169..3d7872c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000000000000000)
 #define HPTE_R_TS ASM_CONST(0x4000000000000000)
 #define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
+#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000000000000000)
+#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000000000000000)
 #define HPTE_R_RPN_SHIFT 12
 #define HPTE_R_RPN ASM_CONST(0x0ffffffffffff000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fffffffffff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C ASM_CONST(0x0000000000000080)
 #define HPTE_R_R ASM_CONST(0x0000000000000100)
 #define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
+#define HPTE_R_KEY_BIT2 ASM_CONST(0x0000000000000800)
+#define HPTE_R_KEY_BIT3 ASM_CONST(0x0000000000000400)
+#define HPTE_R_KEY_BIT4 ASM_CONST(0x0000000000000200)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000000000000000)
 #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ffffff000000)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..9b6820d 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
  ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) |     \
  ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
 
+#define calc_pte_to_hpte_pkey_bits(pteflags) \
+ (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c0f4b46..7d974cd 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include <linux/memblock.h>
 #include <linux/context_tracking.h>
 #include <linux/libfdt.h>
+#include <linux/pkeys.h>
 
 #include <asm/debugfs.h>
 #include <asm/processor.h>
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
  */
  rflags |= HPTE_R_M;
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ rflags |= calc_pte_to_hpte_pkey_bits(pteflags);
+#endif
+
  return rflags;
 }
 
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

Ram Pai
In reply to this post by Ram Pai
Replace the magic number used to check for DSI exception
with a meaningful value.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/reg.h       | 9 ++++++++-
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..2dcb8a1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,23 @@
 #define SPRN_DAR 0x013 /* Data Address Register */
 #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
 #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
+#define   DSISR_BIT32 0x80000000 /* not defined */
 #define   DSISR_NOHPTE 0x40000000 /* no translation found */
+#define   DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
+#define   DSISR_BIT35 0x10000000 /* not defined */
 #define   DSISR_PROTFAULT 0x08000000 /* protection fault */
 #define   DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
 #define   DSISR_ISSTORE 0x02000000 /* access was a store */
 #define   DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
-#define   DSISR_NOSEGMENT 0x00200000 /* SLB miss */
 #define   DSISR_KEYFAULT 0x00200000 /* Key fault */
+#define   DSISR_BIT43 0x00100000 /* not defined */
 #define   DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
 #define   DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT      0x00020000      /* Fault on page directory */
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS |       \
+ DSISR_BIT43)
 #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..3fd0528 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
  .balign IFETCH_ALIGN_BYTES
 do_hash_page:
 #ifdef CONFIG_PPC_STD_MMU_64
- andis. r0,r4,0xa410 /* weird error? */
+ andis. r0,r4,DSISR_PAGE_FAULT_MASK@h
  bne- handle_page_fault /* if not, try to insert a HPTE */
  andis.  r0,r4,DSISR_DABRMATCH@h
  bne-    handle_dabr_fault
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

Ram Pai
In reply to this post by Ram Pai
Handle Data and Instruction exceptions caused by memory
protection-key.

Signed-off-by: Ram Pai <[hidden email]>
(cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)

Conflicts:
        arch/powerpc/include/asm/reg.h
        arch/powerpc/kernel/exceptions-64s.S
---
 arch/powerpc/include/asm/mmu_context.h | 12 +++++
 arch/powerpc/include/asm/pkeys.h       |  9 ++++
 arch/powerpc/include/asm/reg.h         |  7 +--
 arch/powerpc/mm/fault.c                | 21 +++++++-
 arch/powerpc/mm/pkeys.c                | 90 ++++++++++++++++++++++++++++++++++
 5 files changed, 134 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ /* by default, allow everything */
+ return true;
+}
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
  bool write, bool execute, bool foreign)
 {
  /* by default, allow everything */
  return true;
 }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 9b6820d..405e7db 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
  VM_PKEY_BIT3 | \
  VM_PKEY_BIT4)
 
+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+ return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
 #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
  ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
  ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 2dcb8a1..a11977f 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -285,9 +285,10 @@
 #define   DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
 #define   DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT      0x00020000      /* Fault on page directory */
-#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
- DSISR_PAGEATTR_CONFLT | \
- DSISR_BADACCESS |       \
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS | \
+ DSISR_KEYFAULT | \
  DSISR_BIT43)
 #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..c31624f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
  * bits we are interested in.  But there are some bits which
  * indicate errors in DSISR but can validly be set in SRR1.
  */
- if (trap == 0x400)
+ if (trap == 0x400) {
  error_code &= 0x48200000;
- else
+ flags |= FAULT_FLAG_INSTRUCTION;
+ } else
  is_write = error_code & DSISR_ISSTORE;
 #else
  is_write = error_code & ESR_DST;
@@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
  }
 #endif
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (error_code & DSISR_KEYFAULT) {
+ code = SEGV_PKUERR;
+ goto bad_area_nosemaphore;
+ }
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  /* We restore the interrupt state now */
  if (!arch_irq_disabled_regs(regs))
  local_irq_enable();
@@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
  WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 #endif /* CONFIG_PPC_STD_MMU */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ 0)) {
+ code = SEGV_PKUERR;
+ goto bad_area;
+ }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  /*
  * If for any reason at all we couldn't handle the fault,
  * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..439241a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
  return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
 }
 
+static inline bool pkey_allows_read(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_AD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_write(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_WD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_execute(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+}
+
+
 /*
  * set the access right in AMR IAMR and UAMOR register
  * for @pkey to that specified in @init_val.
@@ -175,3 +206,62 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
  */
  return vma_pkey(vma);
 }
+
+bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ int pkey = pte_flags_to_pkey(pte_val(pte));
+
+ if (!pkey_allows_read(pkey))
+ return false;
+ if (write && !pkey_allows_write(pkey))
+ return false;
+ return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to AMR/IAMR for other
+ * processes or any way to tell *which * AMR/IAMR in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+ if (!current->mm)
+ return true;
+ /*
+ * if the VMA is from another process, then AMR/IAMR has no
+ * relevance and should not be enforced.
+ */
+ if (current->mm != vma->vm_mm)
+ return true;
+
+ return false;
+}
+
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign)
+{
+ int pkey;
+ /* allow access if the VMA is not one from this process */
+ if (foreign || vma_is_foreign(vma))
+ return true;
+
+ pkey = vma_pkey(vma);
+
+ if (!pkey)
+ return true;
+
+ if (execute)
+ return pkey_allows_execute(pkey);
+
+ if (!pkey_allows_read(pkey))
+ return false;
+
+ if (write)
+ return pkey_allows_write(pkey);
+
+ return true;
+}
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

Ram Pai
In reply to this post by Ram Pai
The value of the AMR register at the time of exception
is made available in gp_regs[PT_AMR] of the siginfo.

This field can be used to reprogram the permission bits of
any valid pkey.

Similarly the value of the pkey, whose protection got violated,
is made available at si_pkey field of the siginfo structure.

Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/include/asm/paca.h        |  1 +
 arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
 arch/powerpc/kernel/asm-offsets.c      |  5 ++++
 arch/powerpc/kernel/exceptions-64s.S   |  8 ++++++
 arch/powerpc/kernel/signal_32.c        | 14 ++++++++++
 arch/powerpc/kernel/signal_64.c        | 14 ++++++++++
 arch/powerpc/kernel/traps.c            | 49 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/fault.c                |  4 +++
 8 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..a41afd3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -92,6 +92,7 @@ struct paca_struct {
  struct dtl_entry *dispatch_log_end;
 #endif /* CONFIG_PPC_STD_MMU_64 */
  u64 dscr_default; /* per-CPU default DSCR */
+ u64 paca_amr; /* value of amr at exception */
 
 #ifdef CONFIG_PPC_STD_MMU_64
  /*
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..7ec2428 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -108,8 +108,9 @@ struct pt_regs {
 #define PT_DAR 41
 #define PT_DSISR 42
 #define PT_RESULT 43
-#define PT_DSCR 44
 #define PT_REGS_COUNT 44
+#define PT_DSCR 44
+#define PT_AMR 45
 
 #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..17f5d8a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
  OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
  OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
  OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ OFFSET(PACA_AMR, paca_struct, paca_amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
  OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
  OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3fd0528..8db9ef8 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
  ld r12,_MSR(r1)
  ld r3,PACA_EXGEN+EX_DAR(r13)
  lwz r4,PACA_EXGEN+EX_DSISR(r13)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
  li r5,0x300
  std r3,_DAR(r1)
  std r4,_DSISR(r1)
@@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
  ld r12,_MSR(r1)
  ld r3,_NIP(r1)
  andis. r4,r12,0x5820
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
  li r5,0x400
  std r3,_DAR(r1)
  std r4,_DSISR(r1)
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..059766a 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
    (unsigned long) &frame->tramp[2]);
  }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
+ return 1;
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  return 0;
 }
 
@@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
  long err;
  unsigned int save_r2 = 0;
  unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
  int i;
 #endif
@@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
  return 1;
 #endif /* CONFIG_SPE */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
+ if (!err && amr != get_paca()->paca_amr)
+ write_amr(amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  return 0;
 }
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115..35df2e4 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
  if (set != NULL)
  err |=  __put_user(set->sig[0], &sc->oldmask);
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  return err;
 }
 
@@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
  unsigned long save_r13 = 0;
  unsigned long msr;
  struct pt_regs *regs = tsk->thread.regs;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
  int i;
 #endif
@@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
  tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
  }
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
+ if (!err && amr != get_paca()->paca_amr)
+ write_amr(amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  return err;
 }
 
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d4e545d..cc4bde8b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -20,6 +20,7 @@
 #include <linux/sched/debug.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <linux/stddef.h>
 #include <linux/unistd.h>
 #include <linux/ptrace.h>
@@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
  info->si_addr = (void __user *)regs->nip;
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ /* Fault not from Protection Keys: nothing to do */
+ if (si_code != SEGV_PKUERR)
+ return;
+
+ down_read(&current->mm->mmap_sem);
+ /*
+ * we could be racing with pkey_mprotect().
+ * If pkey_mprotect() wins the key value could
+ * get modified...xxx
+ */
+ vma = find_vma(current->mm, addr);
+ up_read(&current->mm->mmap_sem);
+
+ /*
+ * force_sig_info_fault() is called from a number of
+ * contexts, some of which have a VMA and some of which
+ * do not.  The Pkey-fault handing happens after we have a
+ * valid VMA, so we should never reach this without a
+ * valid VMA.
+ */
+ if (!vma) {
+ WARN_ONCE(1, "Pkey fault with no VMA passed in");
+ info->si_pkey = 0;
+ return;
+ }
+
+ /*
+ * We could report the incorrect key because of the reason
+ * explained above.
+ *
+ * si_pkey should be thought off as a strong hint, but not
+ * an absolutely guarantee because of the race explained
+ * above.
+ */
+ info->si_pkey = vma_pkey(vma);
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
 {
  siginfo_t info;
@@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
  info.si_signo = signr;
  info.si_code = code;
  info.si_addr = (void __user *) addr;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ fill_sig_info_pkey(code, &info, addr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
  force_sig_info(signr, &info, current);
 }
 
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c31624f..dd448d2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
  if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
  flags & FAULT_FLAG_INSTRUCTION,
  0)) {
+
+ /* our caller may not have saved the amr. Lets save it */
+ get_paca()->paca_amr = read_amr();
+
  code = SEGV_PKUERR;
  goto bad_area;
  }
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

Ram Pai
In reply to this post by Ram Pai
Signed-off-by: Ram Pai <[hidden email]>
---
 arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 8db9ef8..a4de1b4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
  ld r12,_MSR(r1)
  ld r3,PACA_EXGEN+EX_DAR(r13)
  lwz r4,PACA_EXGEN+EX_DSISR(r13)
+ std r3,_DAR(r1)
+ std r4,_DSISR(r1)
 #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+ beq+ 1f
  mfspr r5,SPRN_AMR
  std r5,PACA_AMR(r13)
 #endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
- li r5,0x300
- std r3,_DAR(r1)
- std r4,_DSISR(r1)
+1: li r5,0x300
 BEGIN_MMU_FTR_SECTION
  b do_hash_page /* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -565,13 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common)
  ld r12,_MSR(r1)
  ld r3,_NIP(r1)
  andis. r4,r12,0x5820
+ std r3,_DAR(r1)
+ std r4,_DSISR(r1)
 #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+ beq+ 1f
  mfspr r5,SPRN_AMR
  std r5,PACA_AMR(r13)
 #endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
- li r5,0x400
- std r3,_DAR(r1)
- std r4,_DSISR(r1)
+1: li r5,0x400
 BEGIN_MMU_FTR_SECTION
  b do_hash_page /* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 11/12]Documentation: Documentation updates.

Ram Pai
In reply to this post by Ram Pai
The Documentaton file is moved from x86 into the generic area,
since this feature is now supported by more than one archs.

Signed-off-by: Ram Pai <[hidden email]>
---
 Documentation/vm/protection-keys.txt  | 110 ++++++++++++++++++++++++++++++++++
 Documentation/x86/protection-keys.txt |  85 --------------------------
 2 files changed, 110 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/vm/protection-keys.txt
 delete mode 100644 Documentation/x86/protection-keys.txt

diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
new file mode 100644
index 0000000..b49e6bb
--- /dev/null
+++ b/Documentation/vm/protection-keys.txt
@@ -0,0 +1,110 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+found in new generation of intel CPUs on PowerPC CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.
+
+
+On Intel:
+
+It works by dedicating 4 previously ignored bits in each page table
+entry to a "protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+
+On PowerPC:
+
+It works by dedicating 5 page table entry to a "protection key",
+giving 32 possible keys.
+
+There is a user-accessible register (AMR) with two separate bits
+(Access Disable and Write Disable) for each key.  Being a CPU
+register, AMR is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+NOTE: Disabling read permission does not disable
+write and vice-versa.
+
+The feature is available on 64-bit HPTE mode only.
+
+'mtspr 0xd, mem' reads the AMR register
+'mfspr mem, 0xd' writes into the AMR register.
+
+Permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=========================== Syscalls ===========================
+
+There are 3 system calls which directly interact with pkeys:
+
+ int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+ int pkey_free(int pkey);
+ int pkey_mprotect(unsigned long start, size_t len,
+  unsigned long prot, int pkey);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application calls the WRPKRU instruction
+directly in order to change access permissions to memory covered
+with a key.  In this example WRPKRU is wrapped by a C function
+called pkey_set().
+
+ int real_prot = PROT_READ|PROT_WRITE;
+ pkey = pkey_alloc(0, PKEY_DENY_WRITE);
+ ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+ ... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+ pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
+ *ptr = foo; // assign something
+ pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+ munmap(ptr, PAGE_SIZE);
+ pkey_free(pkey);
+
+(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+ An example implementation can be found in
+ tools/testing/selftests/x86/protection_keys.c)
+
+=========================== Behavior ===========================
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+ mprotect(ptr, size, PROT_NONE);
+ something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+ pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
+ pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+ something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+ *ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+ read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt
deleted file mode 100644
index b643045..0000000
--- a/Documentation/x86/protection-keys.txt
+++ /dev/null
@@ -1,85 +0,0 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
-
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
-
-=========================== Syscalls ===========================
-
-There are 3 system calls which directly interact with pkeys:
-
- int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
- int pkey_free(int pkey);
- int pkey_mprotect(unsigned long start, size_t len,
-  unsigned long prot, int pkey);
-
-Before a pkey can be used, it must first be allocated with
-pkey_alloc().  An application calls the WRPKRU instruction
-directly in order to change access permissions to memory covered
-with a key.  In this example WRPKRU is wrapped by a C function
-called pkey_set().
-
- int real_prot = PROT_READ|PROT_WRITE;
- pkey = pkey_alloc(0, PKEY_DENY_WRITE);
- ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
- ... application runs here
-
-Now, if the application needs to update the data at 'ptr', it can
-gain access, do the update, then remove its write access:
-
- pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
- *ptr = foo; // assign something
- pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
-
-Now when it frees the memory, it will also free the pkey since it
-is no longer in use:
-
- munmap(ptr, PAGE_SIZE);
- pkey_free(pkey);
-
-(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
- An example implementation can be found in
- tools/testing/selftests/x86/protection_keys.c)
-
-=========================== Behavior ===========================
-
-The kernel attempts to make protection keys consistent with the
-behavior of a plain mprotect().  For instance if you do this:
-
- mprotect(ptr, size, PROT_NONE);
- something(ptr);
-
-you can expect the same effects with protection keys when doing this:
-
- pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
- pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
- something(ptr);
-
-That should be true whether something() is a direct access to 'ptr'
-like:
-
- *ptr = foo;
-
-or when the kernel does the access on the application's behalf like
-with a read():
-
- read(fd, ptr, 1);
-
-The kernel will send a SIGSEGV in both cases, but si_code will be set
-to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
-the plain mprotect() permissions are violated.
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC v2 12/12]selftest: Updated protection key selftest

Ram Pai
In reply to this post by Ram Pai
Added test support for PowerPC implementation off protection keys.

Signed-off-by: Ram Pai <[hidden email]>
---
 tools/testing/selftests/vm/Makefile           |    1 +
 tools/testing/selftests/vm/pkey-helpers.h     |  365 +++++++
 tools/testing/selftests/vm/protection_keys.c  | 1451 +++++++++++++++++++++++++
 tools/testing/selftests/x86/Makefile          |    2 +-
 tools/testing/selftests/x86/pkey-helpers.h    |  219 ----
 tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------
 6 files changed, 1818 insertions(+), 1615 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
 create mode 100644 tools/testing/selftests/vm/protection_keys.c
 delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
 delete mode 100644 tools/testing/selftests/x86/protection_keys.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index cbb29e4..1d32f78 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += protection_keys
 
 TEST_PROGS := run_vmtests
 
diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h
new file mode 100644
index 0000000..5fec0a2
--- /dev/null
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -0,0 +1,365 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include <string.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+
+/* Define some kernel-like types */
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__ /* arch */
+
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc 381
+#define SYS_pkey_free 382
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x14
+
+#define NR_PKEYS 16
+#define NR_RESERVED_PKEYS 1
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x1
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<21)
+
+#define INIT_PRKU 0x0UL
+
+#elif __powerpc64__ /* arch */
+
+#define SYS_mprotect_key 386
+#define SYS_pkey_alloc 384
+#define SYS_pkey_free 385
+#define si_pkey_offset 0x20
+#define REG_IP_IDX PT_NIP
+#define REG_TRAPNO PT_TRAP
+#define REG_AMR 45
+#define gregs gp_regs
+#define fpregs fp_regs
+
+#define NR_PKEYS 32
+#define NR_RESERVED_PKEYS 3
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x3  /* disable read and write */
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<24)
+
+#define INIT_PRKU 0x3UL
+#else /* arch */
+
+ NOT SUPPORTED
+
+#endif /* arch */
+
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+
+
+static inline u32 pkey_to_shift(int pkey)
+{
+#ifdef __i386__
+ return pkey * PKRU_BITS_PER_PKEY;
+#elif __powerpc64__
+ return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY;
+#endif
+}
+
+
+extern int dprint_in_signal;
+extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+static inline void sigsafe_printf(const char *format, ...)
+{
+ va_list ap;
+
+ va_start(ap, format);
+ if (!dprint_in_signal) {
+ vprintf(format, ap);
+ } else {
+ int len = vsnprintf(dprint_in_signal_buffer,
+    DPRINT_IN_SIGNAL_BUF_SIZE,
+    format, ap);
+ /*
+ * len is amount that would have been printed,
+ * but actual write is truncated at BUF_SIZE.
+ */
+ if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
+ len = DPRINT_IN_SIGNAL_BUF_SIZE;
+ write(1, dprint_in_signal_buffer, len);
+ }
+ va_end(ap);
+}
+#define dprintf_level(level, args...) do { \
+ if (level <= DEBUG_LEVEL) \
+ sigsafe_printf(args); \
+ fflush(NULL); \
+} while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern u64 shadow_pkey_reg;
+
+static inline u64 __rdpkey_reg(void)
+{
+#ifdef __i386__
+ unsigned int eax, edx;
+ unsigned int ecx = 0;
+ unsigned int pkey_reg;
+
+ asm volatile(".byte 0x0f,0x01,0xee\n\t"
+     : "=a" (eax), "=d" (edx)
+     : "c" (ecx));
+#elif __powerpc64__
+ u64 eax;
+ u64 pkey_reg;
+
+ asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax)));
+#endif
+ pkey_reg = (u64)eax;
+ return pkey_reg;
+}
+
+static inline u64 _rdpkey_reg(int line)
+{
+ u64 pkey_reg = __rdpkey_reg();
+
+ dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n",
+ line, pkey_reg, shadow_pkey_reg);
+ assert(pkey_reg == shadow_pkey_reg);
+
+ return pkey_reg;
+}
+
+#define rdpkey_reg() _rdpkey_reg(__LINE__)
+
+static inline void __wrpkey_reg(u64 pkey_reg)
+{
+#ifdef __i386__
+ unsigned int eax = pkey_reg;
+ unsigned int ecx = 0;
+ unsigned int edx = 0;
+
+ dprintf4("%s() changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ asm volatile(".byte 0x0f,0x01,0xef\n\t"
+     : : "a" (eax), "c" (ecx), "d" (edx));
+ dprintf4("%s() PKRUP after changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+#else
+ u64 eax = pkey_reg;
+
+ dprintf4("%s() changing %llx to %llx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory");
+ dprintf4("%s() PKRUP after changing %llx to %llx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+#endif
+ assert(pkey_reg == __rdpkey_reg());
+}
+
+static inline void wrpkey_reg(u64 pkey_reg)
+{
+ dprintf4("%s() changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ /* will do the shadow check for us: */
+ rdpkey_reg();
+ __wrpkey_reg(pkey_reg);
+ shadow_pkey_reg = pkey_reg;
+ dprintf4("%s(%lx) pkey_reg: %lx\n",
+ __func__, pkey_reg, __rdpkey_reg());
+}
+
+/*
+ * These are technically racy. since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+ u64 pkey_reg = rdpkey_reg();
+ int bit = pkey * 2;
+
+ if (do_allow)
+ pkey_reg &= (1<<bit);
+ else
+ pkey_reg |= (1<<bit);
+
+ dprintf4("pkey_reg now: %lx\n", rdpkey_reg());
+ wrpkey_reg(pkey_reg);
+}
+
+static inline void __pkey_write_allow(int pkey, int do_allow_write)
+{
+ u64 pkey_reg = rdpkey_reg();
+ int bit = pkey * 2 + 1;
+
+ if (do_allow_write)
+ pkey_reg &= (1<<bit);
+ else
+ pkey_reg |= (1<<bit);
+
+ wrpkey_reg(pkey_reg);
+ dprintf4("pkey_reg now: %lx\n", rdpkey_reg());
+}
+
+#define MB (1<<20)
+
+#ifdef __i386__
+
+#define PAGE_SIZE 4096
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+ unsigned int *ecx, unsigned int *edx)
+{
+ /* ecx is often an input as well as an output. */
+ asm volatile(
+ "cpuid;"
+ : "=a" (*eax),
+  "=b" (*ebx),
+  "=c" (*ecx),
+  "=d" (*edx)
+ : "0" (*eax), "2" (*ecx));
+}
+
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
+
+static inline int cpu_has_pkey(void)
+{
+ unsigned int eax;
+ unsigned int ebx;
+ unsigned int ecx;
+ unsigned int edx;
+
+ eax = 0x7;
+ ecx = 0x0;
+ __cpuid(&eax, &ebx, &ecx, &edx);
+
+ if (!(ecx & X86_FEATURE_PKU)) {
+ dprintf2("cpu does not have PKU\n");
+ return 0;
+ }
+ if (!(ecx & X86_FEATURE_OSPKE)) {
+ dprintf2("cpu does not have OSPKE\n");
+ return 0;
+ }
+ return 1;
+}
+
+#define XSTATE_PKRU_BIT (9)
+#define XSTATE_PKRU 0x200
+int pkru_xstate_offset(void)
+{
+ unsigned int eax;
+ unsigned int ebx;
+ unsigned int ecx;
+ unsigned int edx;
+ int xstate_offset;
+ int xstate_size;
+ unsigned long XSTATE_CPUID = 0xd;
+ int leaf;
+
+ /* assume that XSTATE_PKRU is set in XCR0 */
+ leaf = XSTATE_PKRU_BIT;
+ {
+ eax = XSTATE_CPUID;
+ ecx = leaf;
+ __cpuid(&eax, &ebx, &ecx, &edx);
+
+ if (leaf == XSTATE_PKRU_BIT) {
+ xstate_offset = ebx;
+ xstate_size = eax;
+ }
+ }
+
+ if (xstate_size == 0) {
+ printf("could not find size/offset of PKRU in xsave state\n");
+ return 0;
+ }
+
+ return xstate_offset;
+}
+
+/* 8-bytes of instruction * 512 bytes = 1 page */
+#define __page_o_noops() asm(".rept 512 ; nopl 0x7eeeeeee(%eax) ; .endr")
+
+#elif __powerpc64__
+
+#define PAGE_SIZE (0x1UL << 16)
+static inline int cpu_has_pkey(void)
+{
+ return 1;
+}
+
+/* 4-bytes of instruction * 16384bytes = 1 page */
+#define __page_o_noops() asm(".rept 16384 ; nop; .endr")
+
+#endif
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1))
+#define ALIGN_DOWN(x, align_to) ((x) & ~((align_to)-1))
+#define ALIGN_PTR_UP(p, ptr_align_to) \
+ ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to))
+#define ALIGN_PTR_DOWN(p, ptr_align_to) \
+ ((typeof(p))ALIGN_DOWN((unsigned long)(p), ptr_align_to))
+#define __stringify_1(x...)     #x
+#define __stringify(x...)       __stringify_1(x)
+
+#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
+
+extern void abort_hooks(void);
+#define pkey_assert(condition) do { \
+ if (!(condition)) { \
+ dprintf0("assert() at %s::%d test_nr: %d iteration: %d\n", \
+ __FILE__, __LINE__, \
+ test_nr, iteration_nr); \
+ dprintf0("errno at assert: %d", errno); \
+ abort_hooks(); \
+ assert(condition); \
+ } \
+} while (0)
+#define raw_assert(cond) assert(cond)
+
+
+static inline int open_hugepage_file(int flag)
+{
+ int fd;
+#ifdef __i386__
+ fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages",
+ O_RDONLY);
+#elif __powerpc64__
+ fd = open("/sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages",
+ O_RDONLY);
+#else
+ NOT SUPPORTED
+#endif
+ return fd;
+}
+
+static inline int get_start_key(void)
+{
+#ifdef __i386__
+ return 1;
+#elif __powerpc64__
+ return 0;
+#else
+ NOT SUPPORTED
+#endif
+}
+
+#endif /* _PKEYS_HELPER_H */
diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
new file mode 100644
index 0000000..26c5e5a
--- /dev/null
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -0,0 +1,1451 @@
+/*
+ * Tests Memory Protection Keys (see Documentation/vm/protection-keys.txt)
+ *
+ * There are examples in here of:
+ *  * how to set protection keys on memory
+ *  * how to set/clear bits in PKRU (the rights register)
+ *  * how to handle SEGV_PKRU signals and extract pkey-relevant
+ *    information from the siginfo
+ *
+ * Things to add:
+ * make sure KSM and KSM COW breaking works
+ * prefault pages in at malloc, or not
+ * protect MPX bounds tables with protection keys?
+ * make sure VMA splitting/merging is working correctly
+ * OOMs can destroy mm->mmap (see exit_mmap()),
+ * so make sure it is immune to pkeys
+ * look for pkey "leaks" where it is still set on a VMA
+ * but "freed" back to the kernel
+ * do a plain mprotect() to a mprotect_pkey() area and make
+ * sure the pkey sticks
+ *
+ * Compile like this:
+ * gcc      -o protection_keys    -O2 -g -std=gnu99
+ * -pthread -Wall protection_keys.c -lrt -ldl -lm
+ * gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99
+ * -pthread -Wall protection_keys.c -lrt -ldl -lm
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <time.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+#include <setjmp.h>
+
+#include "pkey-helpers.h"
+
+int iteration_nr = 1;
+int test_nr;
+u64 shadow_pkey_reg;
+
+int dprint_in_signal;
+char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+
+void cat_into_file(char *str, char *file)
+{
+ int fd = open(file, O_RDWR);
+ int ret;
+
+ dprintf2("%s(): writing '%s' to '%s'\n", __func__, str, file);
+ /*
+ * these need to be raw because they are called under
+ * pkey_assert()
+ */
+ raw_assert(fd >= 0);
+ ret = write(fd, str, strlen(str));
+ if (ret != strlen(str)) {
+ perror("write to file failed");
+ fprintf(stderr, "filename: '%s' str: '%s'\n", file, str);
+ raw_assert(0);
+ }
+ close(fd);
+}
+
+#if CONTROL_TRACING > 0
+static int warned_tracing;
+int tracing_root_ok(void)
+{
+ if (geteuid() != 0) {
+ if (!warned_tracing)
+ fprintf(stderr, "WARNING: not run as root, "
+ "can not do tracing control\n");
+ warned_tracing = 1;
+ return 0;
+ }
+ return 1;
+}
+#endif
+
+void tracing_on(void)
+{
+#if CONTROL_TRACING > 0
+#define TRACEDIR "/sys/kernel/debug/tracing"
+ char pidstr[32];
+
+ if (!tracing_root_ok())
+ return;
+
+ sprintf(pidstr, "%d", getpid());
+ cat_into_file("0", TRACEDIR "/tracing_on");
+ cat_into_file("\n", TRACEDIR "/trace");
+ if (1) {
+ cat_into_file("function_graph", TRACEDIR "/current_tracer");
+ cat_into_file("1", TRACEDIR "/options/funcgraph-proc");
+ } else {
+ cat_into_file("nop", TRACEDIR "/current_tracer");
+ }
+ cat_into_file(pidstr, TRACEDIR "/set_ftrace_pid");
+ cat_into_file("1", TRACEDIR "/tracing_on");
+ dprintf1("enabled tracing\n");
+#endif
+}
+
+void tracing_off(void)
+{
+#if CONTROL_TRACING > 0
+ if (!tracing_root_ok())
+ return;
+ cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void abort_hooks(void)
+{
+ fprintf(stderr, "running %s()...\n", __func__);
+ tracing_off();
+#ifdef SLEEP_ON_ABORT
+ sleep(SLEEP_ON_ABORT);
+#endif
+}
+
+
+/*
+ * This attempts to have roughly a page of instructions followed by a few
+ * instructions that do a write, and another page of instructions.  That
+ * way, we are pretty sure that the write is in the second page of
+ * instructions and has at least a page of padding behind it.
+ *
+ * *That* lets us be sure to madvise() away the write instruction, which
+ * will then fault, which makes sure that the fault code handles
+ * execute-only memory properly.
+ */
+__attribute__((__aligned__(PAGE_SIZE)))
+void lots_o_noops_around_write(int *write_to_me)
+{
+ dprintf3("running %s()\n", __func__);
+ __page_o_noops();
+ /* Assume this happens in the second page of instructions: */
+ *write_to_me = __LINE__;
+ /* pad out by another page: */
+ __page_o_noops();
+ dprintf3("%s() done\n", __func__);
+}
+
+void dump_mem(void *dumpme, int len_bytes)
+{
+ char *c = (void *)dumpme;
+ int i;
+
+ for (i = 0; i < len_bytes; i += sizeof(u64)) {
+ u64 *ptr = (u64 *)(c + i);
+
+ dprintf1("dump[%03d][@%p]: %016jx\n", i, ptr, *ptr);
+ }
+}
+
+#define __SI_FAULT      (3 << 16)
+#define SEGV_BNDERR     (__SI_FAULT|3)  /* failed address bound checks */
+#define SEGV_PKUERR     (__SI_FAULT|4)
+
+static char *si_code_str(int si_code)
+{
+ if (si_code & SEGV_MAPERR)
+ return "SEGV_MAPERR";
+ if (si_code & SEGV_ACCERR)
+ return "SEGV_ACCERR";
+ if (si_code & SEGV_BNDERR)
+ return "SEGV_BNDERR";
+ if (si_code & SEGV_PKUERR)
+ return "SEGV_PKUERR";
+ return "UNKNOWN";
+}
+
+int pkey_faults;
+int last_si_pkey = -1;
+
+u64 reset_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return ~(bits << shift);
+}
+
+u64 left_shift_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return (bits << shift);
+}
+
+u64 right_shift_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return (bits >> shift);
+}
+
+void signal_handler(int signum, siginfo_t *si, void *vucontext)
+{
+ ucontext_t *uctxt = vucontext;
+ int trapno;
+ unsigned long ip;
+ char *fpregs;
+ u64 *pkey_reg_ptr;
+ u64 si_pkey;
+ u32 *si_pkey_ptr;
+
+ dprint_in_signal = 1;
+ dprintf1(">>>>===============SIGSEGV============================\n");
+ dprintf1("%s()::%d, pkey_reg: 0x%lx shadow: %lx\n", __func__, __LINE__,
+ __rdpkey_reg(), shadow_pkey_reg);
+
+ trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
+ ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
+ fpregs = (char *) uctxt->uc_mcontext.fpregs;
+
+ dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__,
+ trapno, ip, si_code_str(si->si_code), si->si_code);
+#ifdef __i386__
+ /*
+ * 32-bit has some extra padding so that userspace can tell whether
+ * the XSTATE header is present in addition to the "legacy" FPU
+ * state.  We just assume that it is here.
+ */
+ fpregs += 0x70;
+ pkey_reg_ptr = (void *)(&fpregs[pkru_xstate_offset()]);
+ /*
+ * If we got a PKRU fault, we *HAVE* to have at least one bit set in
+ * here.
+ */
+ dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
+ if (DEBUG_LEVEL > 4)
+ dump_mem(pkey_reg_ptr - 128, 256);
+#elif __powerpc64__
+ pkey_reg_ptr = &uctxt->uc_mcontext.gregs[REG_AMR];
+#endif
+
+
+ dprintf1("siginfo: %p\n", si);
+ dprintf1(" fpregs: %p\n", fpregs);
+ pkey_assert(*pkey_reg_ptr);
+
+ si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
+ dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
+ dump_mem(si_pkey_ptr - 8, 24);
+ si_pkey = *si_pkey_ptr;
+ pkey_assert(si_pkey < NR_PKEYS);
+ last_si_pkey = si_pkey;
+
+ if ((si->si_code == SEGV_MAPERR) ||
+    (si->si_code == SEGV_ACCERR) ||
+    (si->si_code == SEGV_BNDERR)) {
+ printf("non-PK si_code, exiting...\n");
+ exit(4);
+ }
+
+ dprintf1("signal pkey_reg : %08x\n", *pkey_reg_ptr);
+ /*
+ * need __rdpkey_reg() version so we do not do
+ * shadow_pkey_reg checking
+ */
+ dprintf1("signal pkey_reg from  pkey_reg: %08x\n", __rdpkey_reg());
+ dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
+ *(u64 *)pkey_reg_ptr &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
+ shadow_pkey_reg &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
+ dprintf1("WARNING: set PRKU=0 to allow faulting instruction "
+ "to continue\n");
+ pkey_faults++;
+ dprintf1("<<<<==================================================\n");
+}
+
+int wait_all_children(void)
+{
+ int status;
+
+ return waitpid(-1, &status, 0);
+}
+
+void sig_chld(int x)
+{
+ dprint_in_signal = 1;
+ dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
+ dprint_in_signal = 0;
+}
+
+void setup_sigsegv_handler(void)
+{
+ int r, rs;
+ struct sigaction newact;
+ struct sigaction oldact;
+
+ /* #PF is mapped to sigsegv */
+ int signum  = SIGSEGV;
+
+ newact.sa_handler = 0;
+ newact.sa_sigaction = signal_handler;
+
+ /*sigset_t - signals to block while in the handler */
+ /* get the old signal mask. */
+ rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
+ pkey_assert(rs == 0);
+
+ /* call sa_sigaction, not sa_handler*/
+ newact.sa_flags = SA_SIGINFO;
+
+ newact.sa_restorer = 0;  /* void(*)(), obsolete */
+ r = sigaction(signum, &newact, &oldact);
+ r = sigaction(SIGALRM, &newact, &oldact);
+ pkey_assert(r == 0);
+}
+
+void setup_handlers(void)
+{
+ signal(SIGCHLD, &sig_chld);
+ setup_sigsegv_handler();
+}
+
+pid_t fork_lazy_child(void)
+{
+ pid_t forkret;
+
+ forkret = fork();
+ pkey_assert(forkret >= 0);
+ dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
+
+ if (!forkret) {
+ /* in the child */
+ while (1) {
+ dprintf1("child sleeping...\n");
+ sleep(30);
+ }
+ }
+ return forkret;
+}
+
+void davecmp(void *_a, void *_b, int len)
+{
+ int i;
+ unsigned long *a = _a;
+ unsigned long *b = _b;
+
+ for (i = 0; i < len / sizeof(*a); i++) {
+ if (a[i] == b[i])
+ continue;
+
+ dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
+ }
+}
+
+void dumpit(char *f)
+{
+ int fd = open(f, O_RDONLY);
+ char buf[100];
+ int nr_read;
+
+ dprintf2("maps fd: %d\n", fd);
+ do {
+ nr_read = read(fd, &buf[0], sizeof(buf));
+ write(1, buf, nr_read);
+ } while (nr_read > 0);
+ close(fd);
+}
+
+u64 pkey_get(int pkey, unsigned long flags)
+{
+ u64 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
+ u64 pkey_reg = __rdpkey_reg();
+ u64 shifted_pkey_reg;
+ u64 masked_pkey_reg;
+
+ dprintf1("%s(pkey=%d, flags=%lx) = %x / %d\n",
+ __func__, pkey, flags, 0, 0);
+ dprintf2("%s() raw pkey_reg: %lx\n", __func__, pkey_reg);
+
+ shifted_pkey_reg = right_shift_bits(pkey, pkey_reg);
+ dprintf2("%s() shifted_pkey_reg: %lx\n", __func__, shifted_pkey_reg);
+ masked_pkey_reg = shifted_pkey_reg & mask;
+ dprintf2("%s() masked  pkey_reg: %lx\n", __func__, masked_pkey_reg);
+ /*
+ * shift down the relevant bits to the lowest two, then
+ * mask off all the other high bits.
+ */
+ return masked_pkey_reg;
+}
+
+int pkey_set(int pkey, unsigned long rights, unsigned long flags)
+{
+ u64 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
+ u64 old_pkey_reg = __rdpkey_reg();
+ u64 new_pkey_reg;
+
+ /* make sure that 'rights' only contains the bits we expect: */
+ assert(!(rights & ~mask));
+
+ /* copy old pkey_reg */
+ new_pkey_reg = old_pkey_reg;
+ /* mask out bits from pkey in old value: */
+ new_pkey_reg &= reset_bits(pkey, mask);
+ /* OR in new bits for pkey: */
+ new_pkey_reg |= left_shift_bits(pkey, rights);
+
+ __wrpkey_reg(new_pkey_reg);
+
+ dprintf3("%s(pkey=%d, rights=%lx, flags=%lx) = %x "
+ "pkey_reg now: %x old_pkey_reg: %x\n",
+ __func__, pkey, rights, flags,
+ 0, __rdpkey_reg(), old_pkey_reg);
+ return 0;
+}
+
+void pkey_disable_set(int pkey, int flags)
+{
+ unsigned long syscall_flags = 0;
+ int ret;
+ u64 pkey_rights;
+ u64 orig_pkey_reg = rdpkey_reg();
+
+ dprintf1("START->%s(%d, 0x%x)\n", __func__,
+ pkey, flags);
+ pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+ pkey_assert(pkey_rights >= 0);
+
+ /* process flags only if they have some new bits enabled */
+ if (flags && !(pkey_rights & flags)) {
+ pkey_rights |= flags;
+
+ ret = pkey_set(pkey, pkey_rights, syscall_flags);
+ assert(!ret);
+ /*pkey_reg and flags have the same format */
+ shadow_pkey_reg |= left_shift_bits(pkey, flags);
+ dprintf1("%s(%d) shadow: 0x%x\n",
+ __func__, pkey, shadow_pkey_reg);
+
+ pkey_assert(ret >= 0);
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+
+ dprintf1("%s(%d) pkey_reg: 0x%lx\n",
+ __func__, pkey, rdpkey_reg());
+ if (flags)
+ pkey_assert(rdpkey_reg() > orig_pkey_reg);
+ }
+ dprintf1("END<---%s(%d, 0x%x)\n", __func__,
+ pkey, flags);
+}
+
+void pkey_disable_clear(int pkey, int flags)
+{
+ unsigned long syscall_flags = 0;
+ int ret;
+ u64 pkey_rights = pkey_get(pkey, syscall_flags);
+ u64 orig_pkey_reg = rdpkey_reg();
+
+ pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+ pkey_assert(pkey_rights >= 0);
+
+ pkey_rights &= ~flags;
+
+ ret = pkey_set(pkey, pkey_rights, 0);
+ /* pkey_reg and flags have the same format */
+ shadow_pkey_reg &= reset_bits(pkey, flags);
+ pkey_assert(ret >= 0);
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+
+ dprintf1("%s(%d) pkey_reg: 0x%x\n",
+ __func__, pkey, rdpkey_reg());
+ if (flags)
+ assert(rdpkey_reg() > orig_pkey_reg);
+}
+
+void pkey_write_allow(int pkey)
+{
+ pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_write_deny(int pkey)
+{
+ pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_access_allow(int pkey)
+{
+ pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
+}
+void pkey_access_deny(int pkey)
+{
+ pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
+}
+
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+ unsigned long pkey)
+{
+ int sret;
+
+ dprintf2("%s(0x%p, %zx, prot=%lx, pkey=%lx)\n", __func__,
+ ptr, size, orig_prot, pkey);
+
+ errno = 0;
+ sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+ if (errno) {
+ dprintf2("SYS_mprotect_key sret: %d\n", sret);
+ dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
+ dprintf2("SYS_mprotect_key failed, errno: %d\n", errno);
+ if (DEBUG_LEVEL >= 2)
+ perror("SYS_mprotect_pkey");
+ }
+ return sret;
+}
+
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+ int ret = syscall(SYS_pkey_alloc, flags, init_val);
+
+ dprintf1("%s(flags=%lx, init_val=%lx) syscall ret: %d errno: %d\n",
+ __func__, flags, init_val, ret, errno);
+ return ret;
+}
+
+void pkey_setup_shadow(void)
+{
+ shadow_pkey_reg = __rdpkey_reg();
+}
+
+void pkey_reset_shadow(u32 key)
+{
+ shadow_pkey_reg &= reset_bits(key, 0x3);
+}
+
+void pkey_set_shadow(u32 key, u64 init_val)
+{
+ shadow_pkey_reg |=  left_shift_bits(key, init_val);
+}
+
+int alloc_pkey(void)
+{
+ int ret;
+ u64 init_val = 0x0;
+
+ dprintf1("%s()::%d, pkey_reg: 0x%x shadow: %x\n",
+ __func__, __LINE__, __rdpkey_reg(),
+ shadow_pkey_reg);
+ ret = sys_pkey_alloc(0, init_val);
+ /*
+ * pkey_alloc() sets PKRU, so we need to reflect it in
+ * shadow_pkey_reg:
+ */
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ if (ret) {
+ /* clear both the bits: */
+ pkey_reset_shadow(ret);
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow:"
+ " 0x%x\n",
+ __func__, __LINE__, ret,
+ __rdpkey_reg(), shadow_pkey_reg);
+ /*
+ * move the new state in from init_val
+ * (remember, we cheated and init_val == pkey_reg format)
+ */
+ pkey_set_shadow(ret, init_val);
+ }
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ dprintf1("%s()::%d errno: %d\n", __func__, __LINE__, errno);
+ /* for shadow checking: */
+ rdpkey_reg();
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ return ret;
+}
+
+int sys_pkey_free(unsigned long pkey)
+{
+ int ret = syscall(SYS_pkey_free, pkey);
+
+ dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
+ return ret;
+}
+
+/*
+ * I had a bug where pkey bits could be set by mprotect() but
+ * not cleared.  This ensures we get lots of random bit sets
+ * and clears on the vma and pte pkey bits.
+ */
+int alloc_random_pkey(void)
+{
+ int max_nr_pkey_allocs;
+ int ret;
+ int i;
+ int alloced_pkeys[NR_PKEYS];
+ int nr_alloced = 0;
+ int random_index;
+
+ memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
+ srand((unsigned int)time(NULL));
+
+ /* allocate every possible key and make a note of which ones we got */
+ max_nr_pkey_allocs = NR_PKEYS;
+ for (i = 0; i < max_nr_pkey_allocs; i++) {
+ int new_pkey = alloc_pkey();
+
+ if (new_pkey < 0)
+ break;
+ alloced_pkeys[nr_alloced++] = new_pkey;
+ }
+
+ pkey_assert(nr_alloced > 0);
+ /* select a random one out of the allocated ones */
+ random_index = rand() % nr_alloced;
+ ret = alloced_pkeys[random_index];
+ /* now zero it out so we don't free it next */
+ alloced_pkeys[random_index] = 0;
+
+ /* go through the allocated ones that we did not want and free them */
+ for (i = 0; i < nr_alloced; i++) {
+ int free_ret;
+
+ if (!alloced_pkeys[i])
+ continue;
+ free_ret = sys_pkey_free(alloced_pkeys[i]);
+ pkey_assert(!free_ret);
+ }
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n", __func__,
+ __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
+ return ret;
+}
+
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+ unsigned long pkey)
+{
+ int nr_iterations = random() % 100;
+ int ret;
+
+ while (0) {
+ int rpkey = alloc_random_pkey();
+
+ ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+
+ dprintf1("sys_mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) "
+ "ret: %d\n",
+ ptr, size, orig_prot, pkey, ret);
+ if (nr_iterations-- < 0)
+ break;
+
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ sys_pkey_free(rpkey);
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ }
+ pkey_assert(pkey < NR_PKEYS);
+
+ ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+ dprintf1("mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
+ ptr, size, orig_prot, pkey, ret);
+ pkey_assert(!ret);
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n", __func__,
+ __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
+ return ret;
+}
+
+struct pkey_malloc_record {
+ void *ptr;
+ long size;
+};
+struct pkey_malloc_record *pkey_malloc_records;
+long nr_pkey_malloc_records;
+void record_pkey_malloc(void *ptr, long size)
+{
+ long i;
+ struct pkey_malloc_record *rec = NULL;
+
+ for (i = 0; i < nr_pkey_malloc_records; i++) {
+ rec = &pkey_malloc_records[i];
+ /* find a free record */
+ if (rec)
+ break;
+ }
+ if (!rec) {
+ /* every record is full */
+ size_t old_nr_records = nr_pkey_malloc_records;
+ size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
+ size_t new_size = new_nr_records *
+ sizeof(struct pkey_malloc_record);
+
+ dprintf2("new_nr_records: %zd\n", new_nr_records);
+ dprintf2("new_size: %zd\n", new_size);
+ pkey_malloc_records = realloc(pkey_malloc_records, new_size);
+ pkey_assert(pkey_malloc_records != NULL);
+ rec = &pkey_malloc_records[nr_pkey_malloc_records];
+ /*
+ * realloc() does not initialize memory, so zero it from
+ * the first new record all the way to the end.
+ */
+ for (i = 0; i < new_nr_records - old_nr_records; i++)
+ memset(rec + i, 0, sizeof(*rec));
+ }
+ dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
+ (int)(rec - pkey_malloc_records), rec, ptr, size);
+ rec->ptr = ptr;
+ rec->size = size;
+ nr_pkey_malloc_records++;
+}
+
+void free_pkey_malloc(void *ptr)
+{
+ long i;
+ int ret;
+
+ dprintf3("%s(%p)\n", __func__, ptr);
+ for (i = 0; i < nr_pkey_malloc_records; i++) {
+ struct pkey_malloc_record *rec = &pkey_malloc_records[i];
+
+ dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
+ ptr, i, rec, rec->ptr, rec->size);
+ if ((ptr <  rec->ptr) ||
+    (ptr >= rec->ptr + rec->size))
+ continue;
+
+ dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
+ ptr, i, rec, rec->ptr, rec->size);
+ nr_pkey_malloc_records--;
+ ret = munmap(rec->ptr, rec->size);
+ dprintf3("munmap ret: %d\n", ret);
+ pkey_assert(!ret);
+ dprintf3("clearing rec->ptr, rec: %p\n", rec);
+ rec->ptr = NULL;
+ dprintf3("done clearing rec->ptr, rec: %p\n", rec);
+ return;
+ }
+ pkey_assert(false);
+}
+
+
+void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int ret;
+
+ rdpkey_reg();
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ pkey_assert(pkey < NR_PKEYS);
+ ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ ret = mprotect_pkey((void *)ptr, PAGE_SIZE, prot, pkey);
+ pkey_assert(!ret);
+ record_pkey_malloc(ptr, size);
+ rdpkey_reg();
+
+ dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+ return ptr;
+}
+
+void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
+{
+ int ret;
+ void *ptr;
+
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ /*
+ * Guarantee we can fit at least one huge page in the resulting
+ * allocation by allocating space for 2:
+ */
+ size = ALIGN_UP(size, HPAGE_SIZE * 2);
+ ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ record_pkey_malloc(ptr, size);
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ dprintf1("unaligned ptr: %p\n", ptr);
+ ptr = ALIGN_PTR_UP(ptr, HPAGE_SIZE);
+ dprintf1("  aligned ptr: %p\n", ptr);
+ ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
+ dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
+ ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
+ dprintf1("MADV_WILLNEED ret: %d\n", ret);
+ memset(ptr, 0, HPAGE_SIZE);
+
+ dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
+ return ptr;
+}
+
+int hugetlb_setup_ok;
+#define GET_NR_HUGE_PAGES 10
+void setup_hugetlbfs(void)
+{
+ int err;
+ int fd;
+ char buf[] = "123";
+
+ if (geteuid() != 0) {
+ fprintf(stderr,
+ "WARNING: not run as root, can not do hugetlb test\n");
+ return;
+ }
+
+ cat_into_file(__stringify(GET_NR_HUGE_PAGES),
+ "/proc/sys/vm/nr_hugepages");
+
+ /*
+ * Now go make sure that we got the pages and that they
+ * are 2M pages.  Someone might have made 1G the default.
+ */
+ fd = open_hugepage_file(O_RDONLY);
+ if (fd < 0) {
+ perror("opening sysfs 2M hugetlb config");
+ return;
+ }
+
+ /* -1 to guarantee leaving the trailing \0 */
+ err = read(fd, buf, sizeof(buf)-1);
+ close(fd);
+ if (err <= 0) {
+ perror("reading sysfs 2M hugetlb config");
+ return;
+ }
+
+ if (atoi(buf) != GET_NR_HUGE_PAGES) {
+ fprintf(stderr, "could not confirm 2M pages, got:"
+ " '%s' expected %d\n",
+ buf, GET_NR_HUGE_PAGES);
+ return;
+ }
+
+ hugetlb_setup_ok = 1;
+}
+
+void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
+
+ if (!hugetlb_setup_ok)
+ return PTR_ERR_ENOTSUP;
+
+ dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
+ size = ALIGN_UP(size, HPAGE_SIZE * 2);
+ pkey_assert(pkey < NR_PKEYS);
+ ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ record_pkey_malloc(ptr, size);
+
+ dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
+ return ptr;
+}
+
+void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int fd;
+
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ pkey_assert(pkey < NR_PKEYS);
+ fd = open("/dax/foo", O_RDWR);
+ pkey_assert(fd >= 0);
+
+ ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
+ pkey_assert(ptr != (void *)-1);
+
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ record_pkey_malloc(ptr, size);
+
+ dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+ close(fd);
+ return ptr;
+}
+
+void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
+
+ malloc_pkey_with_mprotect,
+ malloc_pkey_anon_huge,
+ malloc_pkey_hugetlb
+/* can not do direct with the pkey_mprotect() API:
+ * malloc_pkey_mmap_direct,
+ * malloc_pkey_mmap_dax,
+ */
+};
+
+void *malloc_pkey(long size, int prot, u16 pkey)
+{
+ void *ret;
+ static int malloc_type;
+ int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
+
+ pkey_assert(pkey < NR_PKEYS);
+
+ while (1) {
+ pkey_assert(malloc_type < nr_malloc_types);
+
+ ret = pkey_malloc[malloc_type](size, prot, pkey);
+ pkey_assert(ret != (void *)-1);
+
+ malloc_type++;
+ if (malloc_type >= nr_malloc_types)
+ malloc_type = (random()%nr_malloc_types);
+
+ /* try again if the malloc_type we tried is unsupported */
+ if (ret == PTR_ERR_ENOTSUP)
+ continue;
+
+ break;
+ }
+
+ dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__,
+ size, prot, pkey, ret);
+ return ret;
+}
+
+int last_pkey_faults;
+void expected_pkey_faults(int pkey)
+{
+ dprintf2("%s(): last_pkey_faults: %d pkey_faults: %d\n",
+ __func__, last_pkey_faults, pkey_faults);
+ dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
+ pkey_assert(last_pkey_faults + 1 == pkey_faults);
+ pkey_assert(last_si_pkey == pkey);
+ /*
+ * The signal handler shold have cleared out PKRU to let the
+ * test program continue.  We now have to restore it.
+ */
+ if (__rdpkey_reg() != shadow_pkey_reg)
+ pkey_assert(0);
+
+ __wrpkey_reg(shadow_pkey_reg);
+ dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
+ __func__, shadow_pkey_reg);
+ last_pkey_faults = pkey_faults;
+ last_si_pkey = -1;
+}
+
+void do_not_expect_pk_fault(void)
+{
+ pkey_assert(last_pkey_faults == pkey_faults);
+}
+
+int test_fds[10] = { -1 };
+int nr_test_fds;
+void __save_test_fd(int fd)
+{
+ pkey_assert(fd >= 0);
+ pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
+ test_fds[nr_test_fds] = fd;
+ nr_test_fds++;
+}
+
+int get_test_read_fd(void)
+{
+ int test_fd = open("/etc/passwd", O_RDONLY);
+
+ __save_test_fd(test_fd);
+ return test_fd;
+}
+
+void close_test_fds(void)
+{
+ int i;
+
+ for (i = 0; i < nr_test_fds; i++) {
+ if (test_fds[i] < 0)
+ continue;
+ close(test_fds[i]);
+ test_fds[i] = -1;
+ }
+ nr_test_fds = 0;
+}
+
+#define barrier() (__asm__ __volatile__("" : : : "memory"))
+__attribute__((noinline)) int read_ptr(int *ptr)
+{
+ /*
+ * Keep GCC from optimizing this away somehow
+ */
+ barrier();
+ return *ptr;
+}
+
+void test_read_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ptr_contents;
+
+ dprintf1("disabling write access to PKEY[1], doing read\n");
+ pkey_write_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ dprintf1("\n");
+ do_not_expect_pk_fault();
+}
+
+void test_read_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int ptr_contents;
+
+ dprintf1("disabling access to PKEY[%02d], doing read @ %p\n",
+ pkey, ptr);
+ rdpkey_reg();
+ pkey_access_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_read_of_access_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ int ptr_contents;
+
+ dprintf1("disabling access to PKEY[%02d], doing read @ %p\n",
+ pkey, ptr);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("reading ptr before disabling the read : %d\n",
+ ptr_contents);
+ rdpkey_reg();
+ pkey_access_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_write_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ *ptr = __LINE__;
+ dprintf1("disabling write access; after accessing the page, "
+ "to PKEY[%02d], doing write\n", pkey);
+ pkey_write_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
+ pkey_write_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+void test_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
+ pkey_access_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_access_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ *ptr = __LINE__;
+ dprintf1("disabling access; after accessing the page, "
+ " to PKEY[%02d], doing write\n", pkey);
+ pkey_access_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int ret;
+ int test_fd = get_test_read_fd();
+
+ dprintf1("disabling access to PKEY[%02d], "
+ "having kernel read() to buffer\n", pkey);
+ pkey_access_deny(pkey);
+ ret = read(test_fd, ptr, 1);
+ dprintf1("read ret: %d\n", ret);
+ pkey_assert(ret);
+}
+void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ret;
+ int test_fd = get_test_read_fd();
+
+ pkey_write_deny(pkey);
+ ret = read(test_fd, ptr, 100);
+ dprintf1("read ret: %d\n", ret);
+ if (ret < 0 && (DEBUG_LEVEL > 0))
+ perror("verbose read result (OK for this to be bad)");
+ pkey_assert(ret);
+}
+
+void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int pipe_ret, vmsplice_ret;
+ struct iovec iov;
+ int pipe_fds[2];
+
+ pipe_ret = pipe(pipe_fds);
+
+ pkey_assert(pipe_ret == 0);
+ dprintf1("disabling access to PKEY[%02d], "
+ "having kernel vmsplice from buffer\n", pkey);
+ pkey_access_deny(pkey);
+ iov.iov_base = ptr;
+ iov.iov_len = PAGE_SIZE;
+ vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
+ dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
+ pkey_assert(vmsplice_ret == -1);
+
+ close(pipe_fds[0]);
+ close(pipe_fds[1]);
+}
+
+void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ignored = 0xdada;
+ int futex_ret;
+ int some_int = __LINE__;
+
+ dprintf1("disabling write to PKEY[%02d], "
+ "doing futex gunk in buffer\n", pkey);
+ *ptr = some_int;
+ pkey_write_deny(pkey);
+ futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL,
+ &ignored, ignored);
+ if (DEBUG_LEVEL > 0)
+ perror("futex");
+ dprintf1("futex() ret: %d\n", futex_ret);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
+{
+ int err;
+ int i = get_start_key();
+
+ /* Note: 0 is the default pkey, so don't mess with it */
+ for (; i < NR_PKEYS; i++) {
+ if (pkey == i)
+ continue;
+
+ dprintf1("trying get/set/free to non-allocated pkey: %2d\n", i);
+ err = sys_pkey_free(i);
+ pkey_assert(err);
+
+ err = sys_pkey_free(i);
+ pkey_assert(err);
+
+ err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, i);
+ pkey_assert(err);
+ }
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_bad_args(int *ptr, u16 pkey)
+{
+ int err;
+ int bad_pkey = NR_PKEYS+pkey;
+
+ /* pass a known-invalid pkey in: */
+ err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, bad_pkey);
+ pkey_assert(err);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
+{
+ int err = 0;
+ int allocated_pkeys[NR_PKEYS] = {0};
+ int nr_allocated_pkeys = 0;
+ int i;
+
+ for (i = 0; i < NR_PKEYS*2; i++) {
+ int new_pkey;
+
+ dprintf1("%s() alloc loop: %d\n", __func__, i);
+ new_pkey = alloc_pkey();
+ dprintf4("%s()::%d, err: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, err, __rdpkey_reg(),
+ shadow_pkey_reg);
+ rdpkey_reg(); /* for shadow checking */
+ dprintf2("%s() errno: %d ENOSPC: %d\n", __func__, errno,
+ ENOSPC);
+ if ((new_pkey == -1) && (errno == ENOSPC)) {
+ dprintf2("%s() allocate failed pkey after %d tries\n",
+ __func__, nr_allocated_pkeys);
+ break;
+ }
+ pkey_assert(nr_allocated_pkeys < NR_PKEYS);
+ allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
+ }
+
+ dprintf3("%s()::%d\n", __func__, __LINE__);
+
+ /*
+ * ensure it did not reach the end of the loop without
+ * failure:
+ */
+ pkey_assert(i < NR_PKEYS*2);
+ /*
+ * There are NR_PKEYS pkeys supported in hardware.  NR_RESERVED_KEYS
+ * are reserved. One can be taken up by an execute-only mapping.
+ * Ensure that we can allocate at least the remaining.
+ */
+ pkey_assert(i >= (NR_PKEYS-NR_RESERVED_PKEYS-1));
+
+ for (i = 0; i < nr_allocated_pkeys; i++) {
+ err = sys_pkey_free(allocated_pkeys[i]);
+ pkey_assert(!err);
+ rdpkey_reg(); /* for shadow checking */
+ }
+}
+
+void test_ptrace_of_child(int *ptr, u16 pkey)
+{
+ __attribute__((__unused__)) int peek_result;
+ pid_t child_pid;
+ void *ignored = 0;
+ long ret;
+ int status;
+ /*
+ * This is the "control" for our little expermient.  Make sure
+ * we can always access it when ptracing.
+ */
+ int *plain_ptr_unaligned = malloc(HPAGE_SIZE);
+ int *plain_ptr = ALIGN_PTR_UP(plain_ptr_unaligned, PAGE_SIZE);
+
+ /*
+ * Fork a child which is an exact copy of this process, of course.
+ * That means we can do all of our tests via ptrace() and then plain
+ * memory access and ensure they work differently.
+ */
+ child_pid = fork_lazy_child();
+ dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
+
+ ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
+ if (ret)
+ perror("attach");
+ dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
+ pkey_assert(ret != -1);
+ ret = waitpid(child_pid, &status, WUNTRACED);
+ if ((ret != child_pid) || !(WIFSTOPPED(status))) {
+ fprintf(stderr, "weird waitpid result %ld stat %x\n",
+ ret, status);
+ pkey_assert(0);
+ }
+ dprintf2("waitpid ret: %ld\n", ret);
+ dprintf2("waitpid status: %d\n", status);
+
+ pkey_access_deny(pkey);
+ pkey_write_deny(pkey);
+
+ /* Write access, untested for now:
+ * ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
+ * pkey_assert(ret != -1);
+ * dprintf1("poke at %p: %ld\n", peek_at, ret);
+ */
+
+ /*
+ * Try to access the pkey-protected "ptr" via ptrace:
+ */
+ ret = ptrace(PTRACE_PEEKDATA, child_pid, ptr, ignored);
+ /* expect it to work, without an error: */
+ pkey_assert(ret != -1);
+ /* Now access from the current task, and expect an exception: */
+ peek_result = read_ptr(ptr);
+ expected_pkey_faults(pkey);
+
+ /*
+ * Try to access the NON-pkey-protected "plain_ptr" via ptrace:
+ */
+ ret = ptrace(PTRACE_PEEKDATA, child_pid, plain_ptr, ignored);
+ /* expect it to work, without an error: */
+ pkey_assert(ret != -1);
+ /* Now access from the current task, and expect NO exception: */
+ peek_result = read_ptr(plain_ptr);
+ do_not_expect_pk_fault();
+
+ ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
+ pkey_assert(ret != -1);
+
+ ret = kill(child_pid, SIGKILL);
+ pkey_assert(ret != -1);
+
+ wait(&status);
+
+ free(plain_ptr_unaligned);
+}
+
+void test_executing_on_unreadable_memory(int *ptr, u16 pkey)
+{
+ void *p1;
+ int scratch;
+ int ptr_contents;
+ int ret;
+
+ p1 = ALIGN_PTR_UP(&lots_o_noops_around_write, PAGE_SIZE);
+ dprintf3("&lots_o_noops: %p\n", &lots_o_noops_around_write);
+ /* lots_o_noops_around_write should be page-aligned already */
+ assert(p1 == &lots_o_noops_around_write);
+
+ /* Point 'p1' at the *second* page of the function: */
+ p1 += PAGE_SIZE;
+
+ madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+ lots_o_noops_around_write(&scratch);
+ ptr_contents = read_ptr(p1);
+ dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+
+ ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC, (u64)pkey);
+ pkey_assert(!ret);
+ pkey_access_deny(pkey);
+
+ dprintf2("pkey_reg: %x\n", rdpkey_reg());
+
+ /*
+ * Make sure this is an *instruction* fault
+ */
+ madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+ lots_o_noops_around_write(&scratch);
+ do_not_expect_pk_fault();
+ ptr_contents = read_ptr(p1);
+ dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
+{
+ int size = PAGE_SIZE;
+ int sret;
+
+ if (cpu_has_pkey()) {
+ dprintf1("SKIP: %s: no CPU support\n", __func__);
+ return;
+ }
+
+ sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
+ pkey_assert(sret < 0);
+}
+
+void (*pkey_tests[])(int *ptr, u16 pkey) = {
+ test_read_of_write_disabled_region,
+ test_read_of_access_disabled_region,
+ test_read_of_access_disabled_region_with_page_already_mapped,
+ test_write_of_write_disabled_region,
+ test_write_of_write_disabled_region_with_page_already_mapped,
+ test_write_of_access_disabled_region,
+ test_write_of_access_disabled_region_with_page_already_mapped,
+ test_kernel_write_of_access_disabled_region,
+ test_kernel_write_of_write_disabled_region,
+ test_kernel_gup_of_access_disabled_region,
+ test_kernel_gup_write_to_write_disabled_region,
+ test_executing_on_unreadable_memory,
+ test_ptrace_of_child,
+ test_pkey_syscalls_on_non_allocated_pkey,
+ test_pkey_syscalls_bad_args,
+ test_pkey_alloc_exhaust,
+};
+
+void run_tests_once(void)
+{
+ int *ptr;
+ int prot = PROT_READ|PROT_WRITE;
+
+ for (test_nr = 0; test_nr < ARRAY_SIZE(pkey_tests); test_nr++) {
+ int pkey;
+ int orig_pkey_faults = pkey_faults;
+
+ dprintf1("======================\n");
+ dprintf1("test %d preparing...\n", test_nr);
+
+ tracing_on();
+ pkey = alloc_random_pkey();
+ dprintf1("test %d starting with pkey: %d\n", test_nr, pkey);
+ ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
+ dprintf1("test %d starting...\n", test_nr);
+ pkey_tests[test_nr](ptr, pkey);
+ dprintf1("freeing test memory: %p\n", ptr);
+ free_pkey_malloc(ptr);
+ sys_pkey_free(pkey);
+
+ dprintf1("pkey_faults: %d\n", pkey_faults);
+ dprintf1("orig_pkey_faults: %d\n", orig_pkey_faults);
+
+ tracing_off();
+ close_test_fds();
+
+ printf("test %2d PASSED (iteration %d)\n",
+ test_nr, iteration_nr);
+ dprintf1("======================\n\n");
+ }
+ iteration_nr++;
+}
+
+int main(void)
+{
+ int nr_iterations = 22;
+
+ setup_handlers();
+
+ printf("has pkey support: %d\n", cpu_has_pkey());
+
+ if (!cpu_has_pkey()) {
+ int size = PAGE_SIZE;
+ int *ptr;
+
+ printf("running PKEY tests for unsupported CPU/OS\n");
+
+ ptr  = mmap(NULL, size, PROT_NONE,
+ MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ assert(ptr != (void *)-1);
+ test_mprotect_pkey_on_unsupported_cpu(ptr, 1);
+ exit(0);
+ }
+
+ pkey_setup_shadow();
+ printf("startup pkey_reg: %lx\n", rdpkey_reg());
+ setup_hugetlbfs();
+
+ while (nr_iterations-- > 0)
+ run_tests_once();
+
+ printf("done (all tests OK)\n");
+ return 0;
+}
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 97f187e..fee6181 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -6,7 +6,7 @@ include ../lib.mk
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_syscall test_mremap_vdso \
  check_initial_reg_state sigreturn ldt_gdt iopl mpx-mini-test ioperm \
- protection_keys test_vdso
+ test_vdso
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
  test_FCMOV test_FCOMI test_FISTTP \
  vdso_restorer
diff --git a/tools/testing/selftests/x86/pkey-helpers.h b/tools/testing/selftests/x86/pkey-helpers.h
deleted file mode 100644
index b202939..0000000
--- a/tools/testing/selftests/x86/pkey-helpers.h
+++ /dev/null
@@ -1,219 +0,0 @@
-#ifndef _PKEYS_HELPER_H
-#define _PKEYS_HELPER_H
-#define _GNU_SOURCE
-#include <string.h>
-#include <stdarg.h>
-#include <stdio.h>
-#include <stdint.h>
-#include <stdbool.h>
-#include <signal.h>
-#include <assert.h>
-#include <stdlib.h>
-#include <ucontext.h>
-#include <sys/mman.h>
-
-#define NR_PKEYS 16
-#define PKRU_BITS_PER_PKEY 2
-
-#ifndef DEBUG_LEVEL
-#define DEBUG_LEVEL 0
-#endif
-#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
-extern int dprint_in_signal;
-extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
-static inline void sigsafe_printf(const char *format, ...)
-{
- va_list ap;
-
- va_start(ap, format);
- if (!dprint_in_signal) {
- vprintf(format, ap);
- } else {
- int len = vsnprintf(dprint_in_signal_buffer,
-    DPRINT_IN_SIGNAL_BUF_SIZE,
-    format, ap);
- /*
- * len is amount that would have been printed,
- * but actual write is truncated at BUF_SIZE.
- */
- if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
- len = DPRINT_IN_SIGNAL_BUF_SIZE;
- write(1, dprint_in_signal_buffer, len);
- }
- va_end(ap);
-}
-#define dprintf_level(level, args...) do { \
- if (level <= DEBUG_LEVEL) \
- sigsafe_printf(args); \
- fflush(NULL); \
-} while (0)
-#define dprintf0(args...) dprintf_level(0, args)
-#define dprintf1(args...) dprintf_level(1, args)
-#define dprintf2(args...) dprintf_level(2, args)
-#define dprintf3(args...) dprintf_level(3, args)
-#define dprintf4(args...) dprintf_level(4, args)
-
-extern unsigned int shadow_pkru;
-static inline unsigned int __rdpkru(void)
-{
- unsigned int eax, edx;
- unsigned int ecx = 0;
- unsigned int pkru;
-
- asm volatile(".byte 0x0f,0x01,0xee\n\t"
-     : "=a" (eax), "=d" (edx)
-     : "c" (ecx));
- pkru = eax;
- return pkru;
-}
-
-static inline unsigned int _rdpkru(int line)
-{
- unsigned int pkru = __rdpkru();
-
- dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
- line, pkru, shadow_pkru);
- assert(pkru == shadow_pkru);
-
- return pkru;
-}
-
-#define rdpkru() _rdpkru(__LINE__)
-
-static inline void __wrpkru(unsigned int pkru)
-{
- unsigned int eax = pkru;
- unsigned int ecx = 0;
- unsigned int edx = 0;
-
- dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
- asm volatile(".byte 0x0f,0x01,0xef\n\t"
-     : : "a" (eax), "c" (ecx), "d" (edx));
- assert(pkru == __rdpkru());
-}
-
-static inline void wrpkru(unsigned int pkru)
-{
- dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
- /* will do the shadow check for us: */
- rdpkru();
- __wrpkru(pkru);
- shadow_pkru = pkru;
- dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
-}
-
-/*
- * These are technically racy. since something could
- * change PKRU between the read and the write.
- */
-static inline void __pkey_access_allow(int pkey, int do_allow)
-{
- unsigned int pkru = rdpkru();
- int bit = pkey * 2;
-
- if (do_allow)
- pkru &= (1<<bit);
- else
- pkru |= (1<<bit);
-
- dprintf4("pkru now: %08x\n", rdpkru());
- wrpkru(pkru);
-}
-
-static inline void __pkey_write_allow(int pkey, int do_allow_write)
-{
- long pkru = rdpkru();
- int bit = pkey * 2 + 1;
-
- if (do_allow_write)
- pkru &= (1<<bit);
- else
- pkru |= (1<<bit);
-
- wrpkru(pkru);
- dprintf4("pkru now: %08x\n", rdpkru());
-}
-
-#define PROT_PKEY0     0x10            /* protection key value (bit 0) */
-#define PROT_PKEY1     0x20            /* protection key value (bit 1) */
-#define PROT_PKEY2     0x40            /* protection key value (bit 2) */
-#define PROT_PKEY3     0x80            /* protection key value (bit 3) */
-
-#define PAGE_SIZE 4096
-#define MB (1<<20)
-
-static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
- unsigned int *ecx, unsigned int *edx)
-{
- /* ecx is often an input as well as an output. */
- asm volatile(
- "cpuid;"
- : "=a" (*eax),
-  "=b" (*ebx),
-  "=c" (*ecx),
-  "=d" (*edx)
- : "0" (*eax), "2" (*ecx));
-}
-
-/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
-#define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
-#define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
-
-static inline int cpu_has_pku(void)
-{
- unsigned int eax;
- unsigned int ebx;
- unsigned int ecx;
- unsigned int edx;
-
- eax = 0x7;
- ecx = 0x0;
- __cpuid(&eax, &ebx, &ecx, &edx);
-
- if (!(ecx & X86_FEATURE_PKU)) {
- dprintf2("cpu does not have PKU\n");
- return 0;
- }
- if (!(ecx & X86_FEATURE_OSPKE)) {
- dprintf2("cpu does not have OSPKE\n");
- return 0;
- }
- return 1;
-}
-
-#define XSTATE_PKRU_BIT (9)
-#define XSTATE_PKRU 0x200
-
-int pkru_xstate_offset(void)
-{
- unsigned int eax;
- unsigned int ebx;
- unsigned int ecx;
- unsigned int edx;
- int xstate_offset;
- int xstate_size;
- unsigned long XSTATE_CPUID = 0xd;
- int leaf;
-
- /* assume that XSTATE_PKRU is set in XCR0 */
- leaf = XSTATE_PKRU_BIT;
- {
- eax = XSTATE_CPUID;
- ecx = leaf;
- __cpuid(&eax, &ebx, &ecx, &edx);
-
- if (leaf == XSTATE_PKRU_BIT) {
- xstate_offset = ebx;
- xstate_size = eax;
- }
- }
-
- if (xstate_size == 0) {
- printf("could not find size/offset of PKRU in xsave state\n");
- return 0;
- }
-
- return xstate_offset;
-}
-
-#endif /* _PKEYS_HELPER_H */
diff --git a/tools/testing/selftests/x86/protection_keys.c b/tools/testing/selftests/x86/protection_keys.c
deleted file mode 100644
index 3237bc0..0000000
--- a/tools/testing/selftests/x86/protection_keys.c
+++ /dev/null
@@ -1,1395 +0,0 @@
-/*
- * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
- *
- * There are examples in here of:
- *  * how to set protection keys on memory
- *  * how to set/clear bits in PKRU (the rights register)
- *  * how to handle SEGV_PKRU signals and extract pkey-relevant
- *    information from the siginfo
- *
- * Things to add:
- * make sure KSM and KSM COW breaking works
- * prefault pages in at malloc, or not
- * protect MPX bounds tables with protection keys?
- * make sure VMA splitting/merging is working correctly
- * OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
- * look for pkey "leaks" where it is still set on a VMA but "freed" back to the kernel
- * do a plain mprotect() to a mprotect_pkey() area and make sure the pkey sticks
- *
- * Compile like this:
- * gcc      -o protection_keys    -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
- * gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
- */
-#define _GNU_SOURCE
-#include <errno.h>
-#include <linux/futex.h>
-#include <sys/time.h>
-#include <sys/syscall.h>
-#include <string.h>
-#include <stdio.h>
-#include <stdint.h>
-#include <stdbool.h>
-#include <signal.h>
-#include <assert.h>
-#include <stdlib.h>
-#include <ucontext.h>
-#include <sys/mman.h>
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <sys/ptrace.h>
-#include <setjmp.h>
-
-#include "pkey-helpers.h"
-
-int iteration_nr = 1;
-int test_nr;
-
-unsigned int shadow_pkru;
-
-#define HPAGE_SIZE (1UL<<21)
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
-#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1))
-#define ALIGN_DOWN(x, align_to) ((x) & ~((align_to)-1))
-#define ALIGN_PTR_UP(p, ptr_align_to) ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to))
-#define ALIGN_PTR_DOWN(p, ptr_align_to) ((typeof(p))ALIGN_DOWN((unsigned long)(p), ptr_align_to))
-#define __stringify_1(x...)     #x
-#define __stringify(x...)       __stringify_1(x)
-
-#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
-
-int dprint_in_signal;
-char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
-
-extern void abort_hooks(void);
-#define pkey_assert(condition) do { \
- if (!(condition)) { \
- dprintf0("assert() at %s::%d test_nr: %d iteration: %d\n", \
- __FILE__, __LINE__, \
- test_nr, iteration_nr); \
- dprintf0("errno at assert: %d", errno); \
- abort_hooks(); \
- assert(condition); \
- } \
-} while (0)
-#define raw_assert(cond) assert(cond)
-
-void cat_into_file(char *str, char *file)
-{
- int fd = open(file, O_RDWR);
- int ret;
-
- dprintf2("%s(): writing '%s' to '%s'\n", __func__, str, file);
- /*
- * these need to be raw because they are called under
- * pkey_assert()
- */
- raw_assert(fd >= 0);
- ret = write(fd, str, strlen(str));
- if (ret != strlen(str)) {
- perror("write to file failed");
- fprintf(stderr, "filename: '%s' str: '%s'\n", file, str);
- raw_assert(0);
- }
- close(fd);
-}
-
-#if CONTROL_TRACING > 0
-static int warned_tracing;
-int tracing_root_ok(void)
-{
- if (geteuid() != 0) {
- if (!warned_tracing)
- fprintf(stderr, "WARNING: not run as root, "
- "can not do tracing control\n");
- warned_tracing = 1;
- return 0;
- }
- return 1;
-}
-#endif
-
-void tracing_on(void)
-{
-#if CONTROL_TRACING > 0
-#define TRACEDIR "/sys/kernel/debug/tracing"
- char pidstr[32];
-
- if (!tracing_root_ok())
- return;
-
- sprintf(pidstr, "%d", getpid());
- cat_into_file("0", TRACEDIR "/tracing_on");
- cat_into_file("\n", TRACEDIR "/trace");
- if (1) {
- cat_into_file("function_graph", TRACEDIR "/current_tracer");
- cat_into_file("1", TRACEDIR "/options/funcgraph-proc");
- } else {
- cat_into_file("nop", TRACEDIR "/current_tracer");
- }
- cat_into_file(pidstr, TRACEDIR "/set_ftrace_pid");
- cat_into_file("1", TRACEDIR "/tracing_on");
- dprintf1("enabled tracing\n");
-#endif
-}
-
-void tracing_off(void)
-{
-#if CONTROL_TRACING > 0
- if (!tracing_root_ok())
- return;
- cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
-#endif
-}
-
-void abort_hooks(void)
-{
- fprintf(stderr, "running %s()...\n", __func__);
- tracing_off();
-#ifdef SLEEP_ON_ABORT
- sleep(SLEEP_ON_ABORT);
-#endif
-}
-
-static inline void __page_o_noops(void)
-{
- /* 8-bytes of instruction * 512 bytes = 1 page */
- asm(".rept 512 ; nopl 0x7eeeeeee(%eax) ; .endr");
-}
-
-/*
- * This attempts to have roughly a page of instructions followed by a few
- * instructions that do a write, and another page of instructions.  That
- * way, we are pretty sure that the write is in the second page of
- * instructions and has at least a page of padding behind it.
- *
- * *That* lets us be sure to madvise() away the write instruction, which
- * will then fault, which makes sure that the fault code handles
- * execute-only memory properly.
- */
-__attribute__((__aligned__(PAGE_SIZE)))
-void lots_o_noops_around_write(int *write_to_me)
-{
- dprintf3("running %s()\n", __func__);
- __page_o_noops();
- /* Assume this happens in the second page of instructions: */
- *write_to_me = __LINE__;
- /* pad out by another page: */
- __page_o_noops();
- dprintf3("%s() done\n", __func__);
-}
-
-/* Define some kernel-like types */
-#define  u8 uint8_t
-#define u16 uint16_t
-#define u32 uint32_t
-#define u64 uint64_t
-
-#ifdef __i386__
-#define SYS_mprotect_key 380
-#define SYS_pkey_alloc 381
-#define SYS_pkey_free 382
-#define REG_IP_IDX REG_EIP
-#define si_pkey_offset 0x14
-#else
-#define SYS_mprotect_key 329
-#define SYS_pkey_alloc 330
-#define SYS_pkey_free 331
-#define REG_IP_IDX REG_RIP
-#define si_pkey_offset 0x20
-#endif
-
-void dump_mem(void *dumpme, int len_bytes)
-{
- char *c = (void *)dumpme;
- int i;
-
- for (i = 0; i < len_bytes; i += sizeof(u64)) {
- u64 *ptr = (u64 *)(c + i);
- dprintf1("dump[%03d][@%p]: %016jx\n", i, ptr, *ptr);
- }
-}
-
-#define __SI_FAULT      (3 << 16)
-#define SEGV_BNDERR     (__SI_FAULT|3)  /* failed address bound checks */
-#define SEGV_PKUERR     (__SI_FAULT|4)
-
-static char *si_code_str(int si_code)
-{
- if (si_code & SEGV_MAPERR)
- return "SEGV_MAPERR";
- if (si_code & SEGV_ACCERR)
- return "SEGV_ACCERR";
- if (si_code & SEGV_BNDERR)
- return "SEGV_BNDERR";
- if (si_code & SEGV_PKUERR)
- return "SEGV_PKUERR";
- return "UNKNOWN";
-}
-
-int pkru_faults;
-int last_si_pkey = -1;
-void signal_handler(int signum, siginfo_t *si, void *vucontext)
-{
- ucontext_t *uctxt = vucontext;
- int trapno;
- unsigned long ip;
- char *fpregs;
- u32 *pkru_ptr;
- u64 si_pkey;
- u32 *si_pkey_ptr;
- int pkru_offset;
- fpregset_t fpregset;
-
- dprint_in_signal = 1;
- dprintf1(">>>>===============SIGSEGV============================\n");
- dprintf1("%s()::%d, pkru: 0x%x shadow: %x\n", __func__, __LINE__,
- __rdpkru(), shadow_pkru);
-
- trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
- ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
- fpregset = uctxt->uc_mcontext.fpregs;
- fpregs = (void *)fpregset;
-
- dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__,
- trapno, ip, si_code_str(si->si_code), si->si_code);
-#ifdef __i386__
- /*
- * 32-bit has some extra padding so that userspace can tell whether
- * the XSTATE header is present in addition to the "legacy" FPU
- * state.  We just assume that it is here.
- */
- fpregs += 0x70;
-#endif
- pkru_offset = pkru_xstate_offset();
- pkru_ptr = (void *)(&fpregs[pkru_offset]);
-
- dprintf1("siginfo: %p\n", si);
- dprintf1(" fpregs: %p\n", fpregs);
- /*
- * If we got a PKRU fault, we *HAVE* to have at least one bit set in
- * here.
- */
- dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
- if (DEBUG_LEVEL > 4)
- dump_mem(pkru_ptr - 128, 256);
- pkey_assert(*pkru_ptr);
-
- si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
- dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
- dump_mem(si_pkey_ptr - 8, 24);
- si_pkey = *si_pkey_ptr;
- pkey_assert(si_pkey < NR_PKEYS);
- last_si_pkey = si_pkey;
-
- if ((si->si_code == SEGV_MAPERR) ||
-    (si->si_code == SEGV_ACCERR) ||
-    (si->si_code == SEGV_BNDERR)) {
- printf("non-PK si_code, exiting...\n");
- exit(4);
- }
-
- dprintf1("signal pkru from xsave: %08x\n", *pkru_ptr);
- /* need __rdpkru() version so we do not do shadow_pkru checking */
- dprintf1("signal pkru from  pkru: %08x\n", __rdpkru());
- dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
- *(u64 *)pkru_ptr = 0x00000000;
- dprintf1("WARNING: set PRKU=0 to allow faulting instruction to continue\n");
- pkru_faults++;
- dprintf1("<<<<==================================================\n");
- return;
- if (trapno == 14) {
- fprintf(stderr,
- "ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
- trapno, ip);
- fprintf(stderr, "si_addr %p\n", si->si_addr);
- fprintf(stderr, "REG_ERR: %lx\n",
- (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
- exit(1);
- } else {
- fprintf(stderr, "unexpected trap %d! at 0x%lx\n", trapno, ip);
- fprintf(stderr, "si_addr %p\n", si->si_addr);
- fprintf(stderr, "REG_ERR: %lx\n",
- (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
- exit(2);
- }
- dprint_in_signal = 0;
-}
-
-int wait_all_children(void)
-{
- int status;
- return waitpid(-1, &status, 0);
-}
-
-void sig_chld(int x)
-{
- dprint_in_signal = 1;
- dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
- dprint_in_signal = 0;
-}
-
-void setup_sigsegv_handler(void)
-{
- int r, rs;
- struct sigaction newact;
- struct sigaction oldact;
-
- /* #PF is mapped to sigsegv */
- int signum  = SIGSEGV;
-
- newact.sa_handler = 0;
- newact.sa_sigaction = signal_handler;
-
- /*sigset_t - signals to block while in the handler */
- /* get the old signal mask. */
- rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
- pkey_assert(rs == 0);
-
- /* call sa_sigaction, not sa_handler*/
- newact.sa_flags = SA_SIGINFO;
-
- newact.sa_restorer = 0;  /* void(*)(), obsolete */
- r = sigaction(signum, &newact, &oldact);
- r = sigaction(SIGALRM, &newact, &oldact);
- pkey_assert(r == 0);
-}
-
-void setup_handlers(void)
-{
- signal(SIGCHLD, &sig_chld);
- setup_sigsegv_handler();
-}
-
-pid_t fork_lazy_child(void)
-{
- pid_t forkret;
-
- forkret = fork();
- pkey_assert(forkret >= 0);
- dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
-
- if (!forkret) {
- /* in the child */
- while (1) {
- dprintf1("child sleeping...\n");
- sleep(30);
- }
- }
- return forkret;
-}
-
-void davecmp(void *_a, void *_b, int len)
-{
- int i;
- unsigned long *a = _a;
- unsigned long *b = _b;
-
- for (i = 0; i < len / sizeof(*a); i++) {
- if (a[i] == b[i])
- continue;
-
- dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
- }
-}
-
-void dumpit(char *f)
-{
- int fd = open(f, O_RDONLY);
- char buf[100];
- int nr_read;
-
- dprintf2("maps fd: %d\n", fd);
- do {
- nr_read = read(fd, &buf[0], sizeof(buf));
- write(1, buf, nr_read);
- } while (nr_read > 0);
- close(fd);
-}
-
-#define PKEY_DISABLE_ACCESS    0x1
-#define PKEY_DISABLE_WRITE     0x2
-
-u32 pkey_get(int pkey, unsigned long flags)
-{
- u32 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
- u32 pkru = __rdpkru();
- u32 shifted_pkru;
- u32 masked_pkru;
-
- dprintf1("%s(pkey=%d, flags=%lx) = %x / %d\n",
- __func__, pkey, flags, 0, 0);
- dprintf2("%s() raw pkru: %x\n", __func__, pkru);
-
- shifted_pkru = (pkru >> (pkey * PKRU_BITS_PER_PKEY));
- dprintf2("%s() shifted_pkru: %x\n", __func__, shifted_pkru);
- masked_pkru = shifted_pkru & mask;
- dprintf2("%s() masked  pkru: %x\n", __func__, masked_pkru);
- /*
- * shift down the relevant bits to the lowest two, then
- * mask off all the other high bits.
- */
- return masked_pkru;
-}
-
-int pkey_set(int pkey, unsigned long rights, unsigned long flags)
-{
- u32 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
- u32 old_pkru = __rdpkru();
- u32 new_pkru;
-
- /* make sure that 'rights' only contains the bits we expect: */
- assert(!(rights & ~mask));
-
- /* copy old pkru */
- new_pkru = old_pkru;
- /* mask out bits from pkey in old value: */
- new_pkru &= ~(mask << (pkey * PKRU_BITS_PER_PKEY));
- /* OR in new bits for pkey: */
- new_pkru |= (rights << (pkey * PKRU_BITS_PER_PKEY));
-
- __wrpkru(new_pkru);
-
- dprintf3("%s(pkey=%d, rights=%lx, flags=%lx) = %x pkru now: %x old_pkru: %x\n",
- __func__, pkey, rights, flags, 0, __rdpkru(), old_pkru);
- return 0;
-}
-
-void pkey_disable_set(int pkey, int flags)
-{
- unsigned long syscall_flags = 0;
- int ret;
- int pkey_rights;
- u32 orig_pkru = rdpkru();
-
- dprintf1("START->%s(%d, 0x%x)\n", __func__,
- pkey, flags);
- pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
-
- pkey_rights = pkey_get(pkey, syscall_flags);
-
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
- pkey_assert(pkey_rights >= 0);
-
- pkey_rights |= flags;
-
- ret = pkey_set(pkey, pkey_rights, syscall_flags);
- assert(!ret);
- /*pkru and flags have the same format */
- shadow_pkru |= flags << (pkey * 2);
- dprintf1("%s(%d) shadow: 0x%x\n", __func__, pkey, shadow_pkru);
-
- pkey_assert(ret >= 0);
-
- pkey_rights = pkey_get(pkey, syscall_flags);
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
-
- dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
- if (flags)
- pkey_assert(rdpkru() > orig_pkru);
- dprintf1("END<---%s(%d, 0x%x)\n", __func__,
- pkey, flags);
-}
-
-void pkey_disable_clear(int pkey, int flags)
-{
- unsigned long syscall_flags = 0;
- int ret;
- int pkey_rights = pkey_get(pkey, syscall_flags);
- u32 orig_pkru = rdpkru();
-
- pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
-
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
- pkey_assert(pkey_rights >= 0);
-
- pkey_rights |= flags;
-
- ret = pkey_set(pkey, pkey_rights, 0);
- /* pkru and flags have the same format */
- shadow_pkru &= ~(flags << (pkey * 2));
- pkey_assert(ret >= 0);
-
- pkey_rights = pkey_get(pkey, syscall_flags);
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
-
- dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
- if (flags)
- assert(rdpkru() > orig_pkru);
-}
-
-void pkey_write_allow(int pkey)
-{
- pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
-}
-void pkey_write_deny(int pkey)
-{
- pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
-}
-void pkey_access_allow(int pkey)
-{
- pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
-}
-void pkey_access_deny(int pkey)
-{
- pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
-}
-
-int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
- unsigned long pkey)
-{
- int sret;
-
- dprintf2("%s(0x%p, %zx, prot=%lx, pkey=%lx)\n", __func__,
- ptr, size, orig_prot, pkey);
-
- errno = 0;
- sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
- if (errno) {
- dprintf2("SYS_mprotect_key sret: %d\n", sret);
- dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
- dprintf2("SYS_mprotect_key failed, errno: %d\n", errno);
- if (DEBUG_LEVEL >= 2)
- perror("SYS_mprotect_pkey");
- }
- return sret;
-}
-
-int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
-{
- int ret = syscall(SYS_pkey_alloc, flags, init_val);
- dprintf1("%s(flags=%lx, init_val=%lx) syscall ret: %d errno: %d\n",
- __func__, flags, init_val, ret, errno);
- return ret;
-}
-
-int alloc_pkey(void)
-{
- int ret;
- unsigned long init_val = 0x0;
-
- dprintf1("alloc_pkey()::%d, pkru: 0x%x shadow: %x\n",
- __LINE__, __rdpkru(), shadow_pkru);
- ret = sys_pkey_alloc(0, init_val);
- /*
- * pkey_alloc() sets PKRU, so we need to reflect it in
- * shadow_pkru:
- */
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- if (ret) {
- /* clear both the bits: */
- shadow_pkru &= ~(0x3      << (ret * 2));
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- /*
- * move the new state in from init_val
- * (remember, we cheated and init_val == pkru format)
- */
- shadow_pkru |=  (init_val << (ret * 2));
- }
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- dprintf1("alloc_pkey()::%d errno: %d\n", __LINE__, errno);
- /* for shadow checking: */
- rdpkru();
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-int sys_pkey_free(unsigned long pkey)
-{
- int ret = syscall(SYS_pkey_free, pkey);
- dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
- return ret;
-}
-
-/*
- * I had a bug where pkey bits could be set by mprotect() but
- * not cleared.  This ensures we get lots of random bit sets
- * and clears on the vma and pte pkey bits.
- */
-int alloc_random_pkey(void)
-{
- int max_nr_pkey_allocs;
- int ret;
- int i;
- int alloced_pkeys[NR_PKEYS];
- int nr_alloced = 0;
- int random_index;
- memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
-
- /* allocate every possible key and make a note of which ones we got */
- max_nr_pkey_allocs = NR_PKEYS;
- max_nr_pkey_allocs = 1;
- for (i = 0; i < max_nr_pkey_allocs; i++) {
- int new_pkey = alloc_pkey();
- if (new_pkey < 0)
- break;
- alloced_pkeys[nr_alloced++] = new_pkey;
- }
-
- pkey_assert(nr_alloced > 0);
- /* select a random one out of the allocated ones */
- random_index = rand() % nr_alloced;
- ret = alloced_pkeys[random_index];
- /* now zero it out so we don't free it next */
- alloced_pkeys[random_index] = 0;
-
- /* go through the allocated ones that we did not want and free them */
- for (i = 0; i < nr_alloced; i++) {
- int free_ret;
- if (!alloced_pkeys[i])
- continue;
- free_ret = sys_pkey_free(alloced_pkeys[i]);
- pkey_assert(!free_ret);
- }
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
- unsigned long pkey)
-{
- int nr_iterations = random() % 100;
- int ret;
-
- while (0) {
- int rpkey = alloc_random_pkey();
- ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
- dprintf1("sys_mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
- ptr, size, orig_prot, pkey, ret);
- if (nr_iterations-- < 0)
- break;
-
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- sys_pkey_free(rpkey);
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- }
- pkey_assert(pkey < NR_PKEYS);
-
- ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
- dprintf1("mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
- ptr, size, orig_prot, pkey, ret);
- pkey_assert(!ret);
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-struct pkey_malloc_record {
- void *ptr;
- long size;
-};
-struct pkey_malloc_record *pkey_malloc_records;
-long nr_pkey_malloc_records;
-void record_pkey_malloc(void *ptr, long size)
-{
- long i;
- struct pkey_malloc_record *rec = NULL;
-
- for (i = 0; i < nr_pkey_malloc_records; i++) {
- rec = &pkey_malloc_records[i];
- /* find a free record */
- if (rec)
- break;
- }
- if (!rec) {
- /* every record is full */
- size_t old_nr_records = nr_pkey_malloc_records;
- size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
- size_t new_size = new_nr_records * sizeof(struct pkey_malloc_record);
- dprintf2("new_nr_records: %zd\n", new_nr_records);
- dprintf2("new_size: %zd\n", new_size);
- pkey_malloc_records = realloc(pkey_malloc_records, new_size);
- pkey_assert(pkey_malloc_records != NULL);
- rec = &pkey_malloc_records[nr_pkey_malloc_records];
- /*
- * realloc() does not initialize memory, so zero it from
- * the first new record all the way to the end.
- */
- for (i = 0; i < new_nr_records - old_nr_records; i++)
- memset(rec + i, 0, sizeof(*rec));
- }
- dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
- (int)(rec - pkey_malloc_records), rec, ptr, size);
- rec->ptr = ptr;
- rec->size = size;
- nr_pkey_malloc_records++;
-}
-
-void free_pkey_malloc(void *ptr)
-{
- long i;
- int ret;
- dprintf3("%s(%p)\n", __func__, ptr);
- for (i = 0; i < nr_pkey_malloc_records; i++) {
- struct pkey_malloc_record *rec = &pkey_malloc_records[i];
- dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
- ptr, i, rec, rec->ptr, rec->size);
- if ((ptr <  rec->ptr) ||
-    (ptr >= rec->ptr + rec->size))
- continue;
-
- dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
- ptr, i, rec, rec->ptr, rec->size);
- nr_pkey_malloc_records--;
- ret = munmap(rec->ptr, rec->size);
- dprintf3("munmap ret: %d\n", ret);
- pkey_assert(!ret);
- dprintf3("clearing rec->ptr, rec: %p\n", rec);
- rec->ptr = NULL;
- dprintf3("done clearing rec->ptr, rec: %p\n", rec);
- return;
- }
- pkey_assert(false);
-}
-
-
-void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
-{
- void *ptr;
- int ret;
-
- rdpkru();
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- pkey_assert(pkey < NR_PKEYS);
- ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- pkey_assert(ptr != (void *)-1);
- ret = mprotect_pkey((void *)ptr, PAGE_SIZE, prot, pkey);
- pkey_assert(!ret);
- record_pkey_malloc(ptr, size);
- rdpkru();
-
- dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
- return ptr;
-}
-
-void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
-{
- int ret;
- void *ptr;
-
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- /*
- * Guarantee we can fit at least one huge page in the resulting
- * allocation by allocating space for 2:
- */
- size = ALIGN_UP(size, HPAGE_SIZE * 2);
- ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- pkey_assert(ptr != (void *)-1);
- record_pkey_malloc(ptr, size);
- mprotect_pkey(ptr, size, prot, pkey);
-
- dprintf1("unaligned ptr: %p\n", ptr);
- ptr = ALIGN_PTR_UP(ptr, HPAGE_SIZE);
- dprintf1("  aligned ptr: %p\n", ptr);
- ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
- dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
- ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
- dprintf1("MADV_WILLNEED ret: %d\n", ret);
- memset(ptr, 0, HPAGE_SIZE);
-
- dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
- return ptr;
-}
-
-int hugetlb_setup_ok;
-#define GET_NR_HUGE_PAGES 10
-void setup_hugetlbfs(void)
-{
- int err;
- int fd;
- char buf[] = "123";
-
- if (geteuid() != 0) {
- fprintf(stderr, "WARNING: not run as root, can not do hugetlb test\n");
- return;
- }
-
- cat_into_file(__stringify(GET_NR_HUGE_PAGES), "/proc/sys/vm/nr_hugepages");
-
- /*
- * Now go make sure that we got the pages and that they
- * are 2M pages.  Someone might have made 1G the default.
- */
- fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", O_RDONLY);
- if (fd < 0) {
- perror("opening sysfs 2M hugetlb config");
- return;
- }
-
- /* -1 to guarantee leaving the trailing \0 */
- err = read(fd, buf, sizeof(buf)-1);
- close(fd);
- if (err <= 0) {
- perror("reading sysfs 2M hugetlb config");
- return;
- }
-
- if (atoi(buf) != GET_NR_HUGE_PAGES) {
- fprintf(stderr, "could not confirm 2M pages, got: '%s' expected %d\n",
- buf, GET_NR_HUGE_PAGES);
- return;
- }
-
- hugetlb_setup_ok = 1;
-}
-
-void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
-{
- void *ptr;
- int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
-
- if (!hugetlb_setup_ok)
- return PTR_ERR_ENOTSUP;
-
- dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
- size = ALIGN_UP(size, HPAGE_SIZE * 2);
- pkey_assert(pkey < NR_PKEYS);
- ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
- pkey_assert(ptr != (void *)-1);
- mprotect_pkey(ptr, size, prot, pkey);
-
- record_pkey_malloc(ptr, size);
-
- dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
- return ptr;
-}
-
-void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
-{
- void *ptr;
- int fd;
-
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- pkey_assert(pkey < NR_PKEYS);
- fd = open("/dax/foo", O_RDWR);
- pkey_assert(fd >= 0);
-
- ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
- pkey_assert(ptr != (void *)-1);
-
- mprotect_pkey(ptr, size, prot, pkey);
-
- record_pkey_malloc(ptr, size);
-
- dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
- close(fd);
- return ptr;
-}
-
-void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
-
- malloc_pkey_with_mprotect,
- malloc_pkey_anon_huge,
- malloc_pkey_hugetlb
-/* can not do direct with the pkey_mprotect() API:
- malloc_pkey_mmap_direct,
- malloc_pkey_mmap_dax,
-*/
-};
-
-void *malloc_pkey(long size, int prot, u16 pkey)
-{
- void *ret;
- static int malloc_type;
- int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
-
- pkey_assert(pkey < NR_PKEYS);
-
- while (1) {
- pkey_assert(malloc_type < nr_malloc_types);
-
- ret = pkey_malloc[malloc_type](size, prot, pkey);
- pkey_assert(ret != (void *)-1);
-
- malloc_type++;
- if (malloc_type >= nr_malloc_types)
- malloc_type = (random()%nr_malloc_types);
-
- /* try again if the malloc_type we tried is unsupported */
- if (ret == PTR_ERR_ENOTSUP)
- continue;
-
- break;
- }
-
- dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__,
- size, prot, pkey, ret);
- return ret;
-}
-
-int last_pkru_faults;
-void expected_pk_fault(int pkey)
-{
- dprintf2("%s(): last_pkru_faults: %d pkru_faults: %d\n",
- __func__, last_pkru_faults, pkru_faults);
- dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
- pkey_assert(last_pkru_faults + 1 == pkru_faults);
- pkey_assert(last_si_pkey == pkey);
- /*
- * The signal handler shold have cleared out PKRU to let the
- * test program continue.  We now have to restore it.
- */
- if (__rdpkru() != 0)
- pkey_assert(0);
-
- __wrpkru(shadow_pkru);
- dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
- __func__, shadow_pkru);
- last_pkru_faults = pkru_faults;
- last_si_pkey = -1;
-}
-
-void do_not_expect_pk_fault(void)
-{
- pkey_assert(last_pkru_faults == pkru_faults);
-}
-
-int test_fds[10] = { -1 };
-int nr_test_fds;
-void __save_test_fd(int fd)
-{
- pkey_assert(fd >= 0);
- pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
- test_fds[nr_test_fds] = fd;
- nr_test_fds++;
-}
-
-int get_test_read_fd(void)
-{
- int test_fd = open("/etc/passwd", O_RDONLY);
- __save_test_fd(test_fd);
- return test_fd;
-}
-
-void close_test_fds(void)
-{
- int i;
-
- for (i = 0; i < nr_test_fds; i++) {
- if (test_fds[i] < 0)
- continue;
- close(test_fds[i]);
- test_fds[i] = -1;
- }
- nr_test_fds = 0;
-}
-
-#define barrier() __asm__ __volatile__("": : :"memory")
-__attribute__((noinline)) int read_ptr(int *ptr)
-{
- /*
- * Keep GCC from optimizing this away somehow
- */
- barrier();
- return *ptr;
-}
-
-void test_read_of_write_disabled_region(int *ptr, u16 pkey)
-{
- int ptr_contents;
-
- dprintf1("disabling write access to PKEY[1], doing read\n");
- pkey_write_deny(pkey);
- ptr_contents = read_ptr(ptr);
- dprintf1("*ptr: %d\n", ptr_contents);
- dprintf1("\n");
-}
-void test_read_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int ptr_contents;
-
- dprintf1("disabling access to PKEY[%02d], doing read @ %p\n", pkey, ptr);
- rdpkru();
- pkey_access_deny(pkey);
- ptr_contents = read_ptr(ptr);
- dprintf1("*ptr: %d\n", ptr_contents);
- expected_pk_fault(pkey);
-}
-void test_write_of_write_disabled_region(int *ptr, u16 pkey)
-{
- dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
- pkey_write_deny(pkey);
- *ptr = __LINE__;
- expected_pk_fault(pkey);
-}
-void test_write_of_access_disabled_region(int *ptr, u16 pkey)
-{
- dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
- pkey_access_deny(pkey);
- *ptr = __LINE__;
- expected_pk_fault(pkey);
-}
-void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int ret;
- int test_fd = get_test_read_fd();
-
- dprintf1("disabling access to PKEY[%02d], "
- "having kernel read() to buffer\n", pkey);
- pkey_access_deny(pkey);
- ret = read(test_fd, ptr, 1);
- dprintf1("read ret: %d\n", ret);
- pkey_assert(ret);
-}
-void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
-{
- int ret;
- int test_fd = get_test_read_fd();
-
- pkey_write_deny(pkey);
- ret = read(test_fd, ptr, 100);
- dprintf1("read ret: %d\n", ret);
- if (ret < 0 && (DEBUG_LEVEL > 0))
- perror("verbose read result (OK for this to be bad)");
- pkey_assert(ret);
-}
-
-void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int pipe_ret, vmsplice_ret;
- struct iovec iov;
- int pipe_fds[2];
-
- pipe_ret = pipe(pipe_fds);
-
- pkey_assert(pipe_ret == 0);
- dprintf1("disabling access to PKEY[%02d], "
- "having kernel vmsplice from buffer\n", pkey);
- pkey_access_deny(pkey);
- iov.iov_base = ptr;
- iov.iov_len = PAGE_SIZE;
- vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
- dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
- pkey_assert(vmsplice_ret == -1);
-
- close(pipe_fds[0]);
- close(pipe_fds[1]);
-}
-
-void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
-{
- int ignored = 0xdada;
- int futex_ret;
- int some_int = __LINE__;
-
- dprintf1("disabling write to PKEY[%02d], "
- "doing futex gunk in buffer\n", pkey);
- *ptr = some_int;
- pkey_write_deny(pkey);
- futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL,
- &ignored, ignored);
- if (DEBUG_LEVEL > 0)
- perror("futex");
- dprintf1("futex() ret: %d\n", futex_ret);
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
-{
- int err;
- int i;
-
- /* Note: 0 is the default pkey, so don't mess with it */
- for (i = 1; i < NR_PKEYS; i++) {
- if (pkey == i)
- continue;
-
- dprintf1("trying get/set/free to non-allocated pkey: %2d\n", i);
- err = sys_pkey_free(i);
- pkey_assert(err);
-
- err = sys_pkey_free(i);
- pkey_assert(err);
-
- err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, i);
- pkey_assert(err);
- }
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_syscalls_bad_args(int *ptr, u16 pkey)
-{
- int err;
- int bad_pkey = NR_PKEYS+99;
-
- /* pass a known-invalid pkey in: */
- err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, bad_pkey);
- pkey_assert(err);
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
-{
- int err;
- int allocated_pkeys[NR_PKEYS] = {0};
- int nr_allocated_pkeys = 0;
- int i;
-
- for (i = 0; i < NR_PKEYS*2; i++) {
- int new_pkey;
- dprintf1("%s() alloc loop: %d\n", __func__, i);
- new_pkey = alloc_pkey();
- dprintf4("%s()::%d, err: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, err, __rdpkru(), shadow_pkru);
- rdpkru(); /* for shadow checking */
- dprintf2("%s() errno: %d ENOSPC: %d\n", __func__, errno, ENOSPC);
- if ((new_pkey == -1) && (errno == ENOSPC)) {
- dprintf2("%s() failed to allocate pkey after %d tries\n",
- __func__, nr_allocated_pkeys);
- break;
- }
- pkey_assert(nr_allocated_pkeys < NR_PKEYS);
- allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
- }
-
- dprintf3("%s()::%d\n", __func__, __LINE__);
-
- /*
- * ensure it did not reach the end of the loop without
- * failure:
- */
- pkey_assert(i < NR_PKEYS*2);
-
- /*
- * There are 16 pkeys supported in hardware.  One is taken
- * up for the default (0) and another can be taken up by
- * an execute-only mapping.  Ensure that we can allocate
- * at least 14 (16-2).
- */
- pkey_assert(i >= NR_PKEYS-2);
-
- for (i = 0; i < nr_allocated_pkeys; i++) {
- err = sys_pkey_free(allocated_pkeys[i]);
- pkey_assert(!err);
- rdpkru(); /* for shadow checking */
- }
-}
-
-void test_ptrace_of_child(int *ptr, u16 pkey)
-{
- __attribute__((__unused__)) int peek_result;
- pid_t child_pid;
- void *ignored = 0;
- long ret;
- int status;
- /*
- * This is the "control" for our little expermient.  Make sure
- * we can always access it when ptracing.
- */
- int *plain_ptr_unaligned = malloc(HPAGE_SIZE);
- int *plain_ptr = ALIGN_PTR_UP(plain_ptr_unaligned, PAGE_SIZE);
-
- /*
- * Fork a child which is an exact copy of this process, of course.
- * That means we can do all of our tests via ptrace() and then plain
- * memory access and ensure they work differently.
- */
- child_pid = fork_lazy_child();
- dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
-
- ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
- if (ret)
- perror("attach");
- dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
- pkey_assert(ret != -1);
- ret = waitpid(child_pid, &status, WUNTRACED);
- if ((ret != child_pid) || !(WIFSTOPPED(status))) {
- fprintf(stderr, "weird waitpid result %ld stat %x\n",
- ret, status);
- pkey_assert(0);
- }
- dprintf2("waitpid ret: %ld\n", ret);
- dprintf2("waitpid status: %d\n", status);
-
- pkey_access_deny(pkey);
- pkey_write_deny(pkey);
-
- /* Write access, untested for now:
- ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
- pkey_assert(ret != -1);
- dprintf1("poke at %p: %ld\n", peek_at, ret);
- */
-
- /*
- * Try to access the pkey-protected "ptr" via ptrace:
- */
- ret = ptrace(PTRACE_PEEKDATA, child_pid, ptr, ignored);
- /* expect it to work, without an error: */
- pkey_assert(ret != -1);
- /* Now access from the current task, and expect an exception: */
- peek_result = read_ptr(ptr);
- expected_pk_fault(pkey);
-
- /*
- * Try to access the NON-pkey-protected "plain_ptr" via ptrace:
- */
- ret = ptrace(PTRACE_PEEKDATA, child_pid, plain_ptr, ignored);
- /* expect it to work, without an error: */
- pkey_assert(ret != -1);
- /* Now access from the current task, and expect NO exception: */
- peek_result = read_ptr(plain_ptr);
- do_not_expect_pk_fault();
-
- ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
- pkey_assert(ret != -1);
-
- ret = kill(child_pid, SIGKILL);
- pkey_assert(ret != -1);
-
- wait(&status);
-
- free(plain_ptr_unaligned);
-}
-
-void test_executing_on_unreadable_memory(int *ptr, u16 pkey)
-{
- void *p1;
- int scratch;
- int ptr_contents;
- int ret;
-
- p1 = ALIGN_PTR_UP(&lots_o_noops_around_write, PAGE_SIZE);
- dprintf3("&lots_o_noops: %p\n", &lots_o_noops_around_write);
- /* lots_o_noops_around_write should be page-aligned already */
- assert(p1 == &lots_o_noops_around_write);
-
- /* Point 'p1' at the *second* page of the function: */
- p1 += PAGE_SIZE;
-
- madvise(p1, PAGE_SIZE, MADV_DONTNEED);
- lots_o_noops_around_write(&scratch);
- ptr_contents = read_ptr(p1);
- dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
-
- ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC, (u64)pkey);
- pkey_assert(!ret);
- pkey_access_deny(pkey);
-
- dprintf2("pkru: %x\n", rdpkru());
-
- /*
- * Make sure this is an *instruction* fault
- */
- madvise(p1, PAGE_SIZE, MADV_DONTNEED);
- lots_o_noops_around_write(&scratch);
- do_not_expect_pk_fault();
- ptr_contents = read_ptr(p1);
- dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
- expected_pk_fault(pkey);
-}
-
-void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
-{
- int size = PAGE_SIZE;
- int sret;
-
- if (cpu_has_pku()) {
- dprintf1("SKIP: %s: no CPU support\n", __func__);
- return;
- }
-
- sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
- pkey_assert(sret < 0);
-}
-
-void (*pkey_tests[])(int *ptr, u16 pkey) = {
- test_read_of_write_disabled_region,
- test_read_of_access_disabled_region,
- test_write_of_write_disabled_region,
- test_write_of_access_disabled_region,
- test_kernel_write_of_access_disabled_region,
- test_kernel_write_of_write_disabled_region,
- test_kernel_gup_of_access_disabled_region,
- test_kernel_gup_write_to_write_disabled_region,
- test_executing_on_unreadable_memory,
- test_ptrace_of_child,
- test_pkey_syscalls_on_non_allocated_pkey,
- test_pkey_syscalls_bad_args,
- test_pkey_alloc_exhaust,
-};
-
-void run_tests_once(void)
-{
- int *ptr;
- int prot = PROT_READ|PROT_WRITE;
-
- for (test_nr = 0; test_nr < ARRAY_SIZE(pkey_tests); test_nr++) {
- int pkey;
- int orig_pkru_faults = pkru_faults;
-
- dprintf1("======================\n");
- dprintf1("test %d preparing...\n", test_nr);
-
- tracing_on();
- pkey = alloc_random_pkey();
- dprintf1("test %d starting with pkey: %d\n", test_nr, pkey);
- ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
- dprintf1("test %d starting...\n", test_nr);
- pkey_tests[test_nr](ptr, pkey);
- dprintf1("freeing test memory: %p\n", ptr);
- free_pkey_malloc(ptr);
- sys_pkey_free(pkey);
-
- dprintf1("pkru_faults: %d\n", pkru_faults);
- dprintf1("orig_pkru_faults: %d\n", orig_pkru_faults);
-
- tracing_off();
- close_test_fds();
-
- printf("test %2d PASSED (iteration %d)\n", test_nr, iteration_nr);
- dprintf1("======================\n\n");
- }
- iteration_nr++;
-}
-
-void pkey_setup_shadow(void)
-{
- shadow_pkru = __rdpkru();
-}
-
-int main(void)
-{
- int nr_iterations = 22;
-
- setup_handlers();
-
- printf("has pku: %d\n", cpu_has_pku());
-
- if (!cpu_has_pku()) {
- int size = PAGE_SIZE;
- int *ptr;
-
- printf("running PKEY tests for unsupported CPU/OS\n");
-
- ptr  = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- assert(ptr != (void *)-1);
- test_mprotect_pkey_on_unsupported_cpu(ptr, 1);
- exit(0);
- }
-
- pkey_setup_shadow();
- printf("startup pkru: %x\n", rdpkru());
- setup_hugetlbfs();
-
- while (nr_iterations-- > 0)
- run_tests_once();
-
- printf("done (all tests OK)\n");
- return 0;
-}
--
1.8.3.1

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 12/12]selftest: Updated protection key selftest

Michael Ellerman-2
Ram Pai <[hidden email]> writes:

> Added test support for PowerPC implementation off protection keys.
>
> Signed-off-by: Ram Pai <[hidden email]>
> ---
>  tools/testing/selftests/vm/Makefile           |    1 +
>  tools/testing/selftests/vm/pkey-helpers.h     |  365 +++++++
>  tools/testing/selftests/vm/protection_keys.c  | 1451 +++++++++++++++++++++++++
>  tools/testing/selftests/x86/Makefile          |    2 +-
>  tools/testing/selftests/x86/pkey-helpers.h    |  219 ----
>  tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------

Please split the move and the addition of the powerpc code into two
separate patches (move first). That way we can actually see what you're
doing to add powerpc support.

cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

Michael Ellerman-2
In reply to this post by Ram Pai
Ram Pai <[hidden email]> writes:

> Signed-off-by: Ram Pai <[hidden email]>
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 8db9ef8..a4de1b4 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
>   ld r12,_MSR(r1)
>   ld r3,PACA_EXGEN+EX_DAR(r13)
>   lwz r4,PACA_EXGEN+EX_DSISR(r13)
> + std r3,_DAR(r1)
> + std r4,_DSISR(r1)
>  #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> + beq+ 1f

This seems to be incremental on top of one of your other patches.

But I don't see why, can you please just squash this into whatever patch
adds this code in the first place.

cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 03/12] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

Michael Ellerman-2
In reply to this post by Ram Pai
Hi Ram,

Ram Pai <[hidden email]> writes:

> Sys_pkey_alloc() allocates and returns available pkey
> Sys_pkey_free()  frees up the pkey.
>
> Total 32 keys are supported on powerpc. However pkey 0,1 and 31
> are reserved. So effectively we have 29 pkeys.
>
> Signed-off-by: Ram Pai <[hidden email]>
> ---
>  include/linux/mm.h                           |  31 ++++---
>  include/uapi/asm-generic/mman-common.h       |   2 +-

Those changes need to be split out and acked by mm folks.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6..34ddac7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
>  
>  #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> -#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */

Please don't change the comments, it makes the diff harder to read.

You're actually just adding this AFAICS:

> +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */

>  #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #if defined(CONFIG_X86)
               ^

>  # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
> -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
> -# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
> -# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> -# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> -# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> -#endif
> +#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
> + || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
> +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
                                                                 ^ 4?
> +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

That appears to be inside an #if defined(CONFIG_X86) ?

>  #elif defined(CONFIG_PPC)
                 ^
Should be CONFIG_PPC64_MEMORY_PROTECTION_KEYS no?

> +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
> +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> +#define VM_PKEY_BIT4 VM_HIGH_ARCH_4  /* intel does not use this bit */
> + /* but reserved for future expansion */

But this hunk is for PPC ?

Is it OK for the other arches & generic code to add another VM_PKEY_BIT4 ?

Do you need to update show_smap_vma_flags() ?

>  # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
>  #elif defined(CONFIG_PARISC)
>  # define VM_GROWSUP VM_ARCH_1

> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 8c27db0..b13ecc6 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -76,5 +76,5 @@
>  #define PKEY_DISABLE_WRITE 0x2
>  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
>   PKEY_DISABLE_WRITE)
> -
> +#define PKEY_DISABLE_EXECUTE 0x4

How you can set that if it's not in PKEY_ACCESS_MASK?

See:

SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
{
        int pkey;
        int ret;

        /* No flags supported yet. */
        if (flags)
                return -EINVAL;
        /* check for unsupported init values */
        if (init_val & ~PKEY_ACCESS_MASK)
                return -EINVAL;


cheers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

Ram Pai
In reply to this post by Michael Ellerman-2
On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:

> Ram Pai <[hidden email]> writes:
>
> > Signed-off-by: Ram Pai <[hidden email]>
> > ---
> >  arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
> >  1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > index 8db9ef8..a4de1b4 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
> >   ld r12,_MSR(r1)
> >   ld r3,PACA_EXGEN+EX_DAR(r13)
> >   lwz r4,PACA_EXGEN+EX_DSISR(r13)
> > + std r3,_DAR(r1)
> > + std r4,_DSISR(r1)
> >  #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> > + beq+ 1f
>
> This seems to be incremental on top of one of your other patches.
>
> But I don't see why, can you please just squash this into whatever patch
> adds this code in the first place.

It was an optimization added later. But yes it can be squashed into an
earlier patch.

RP

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 00/12] powerpc: Memory Protection Keys

Balbir Singh
In reply to this post by Ram Pai
On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
> Memory protection keys enable applications to protect its
> address space from inadvertent access or corruption from
> itself.

I presume by itself you mean protection between threads?

>
> The overall idea:
>
>  A process allocates a   key  and associates it with
>  a  address  range  within    its   address   space.

OK, so this is per VMA?

>  The process  than  can  dynamically  set read/write
>  permissions on  the   key   without  involving  the
>  kernel.

This bit is not clear, how can the key be set without
involving the kernel? I presume you mean the key is set
in the PTE's and the access protection values can be
set without involving the kernel?

 Any  code that  violates   the  permissions

>  off the address space; as defined by its associated
>  key, will receive a segmentation fault.
>
> This patch series enables the feature on PPC64.
> It is enabled on HPTE 64K-page platform.
>
> ISA3.0 section 5.7.13 describes the detailed specifications.
>
>
> Testing:
> This patch series has passed all the protection key
> tests available in  the selftests directory.
> The tests are updated to work on both x86 and powerpc.

Balbir

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 00/12] powerpc: Memory Protection Keys

Anshuman Khandual
On 06/20/2017 10:40 AM, Balbir Singh wrote:
> On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
>> Memory protection keys enable applications to protect its
>> address space from inadvertent access or corruption from
>> itself.
>
> I presume by itself you mean protection between threads?

Between threads due to race conditions or from the same thread
because of programming error.

>
>>
>> The overall idea:
>>
>>  A process allocates a   key  and associates it with
>>  a  address  range  within    its   address   space.
>
> OK, so this is per VMA?

Yeah but the same key can be given to multiple VMAs. Any
change will effect every VMA who got tagged by it.

>
>>  The process  than  can  dynamically  set read/write
>>  permissions on  the   key   without  involving  the
>>  kernel.
>
> This bit is not clear, how can the key be set without
> involving the kernel? I presume you mean the key is set

With pkey_mprotect() system call, all the effected PTEs get
tagged for once. Switching the permission happens just by
writing into the register on the fly.

> in the PTE's and the access protection values can be
> set without involving the kernel?

PTE setting happens once, access protection values can be
changed on the fly through register.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC v2 11/12]Documentation: Documentation updates.

Anshuman Khandual
In reply to this post by Ram Pai
On 06/17/2017 09:22 AM, Ram Pai wrote:
> The Documentaton file is moved from x86 into the generic area,
> since this feature is now supported by more than one archs.
>
> Signed-off-by: Ram Pai <[hidden email]>
> ---
>  Documentation/vm/protection-keys.txt  | 110 ++++++++++++++++++++++++++++++++++
>  Documentation/x86/protection-keys.txt |  85 --------------------------

I am not sure whether this is a good idea. There might be
specifics for each architecture which need to be detailed
again in this new generic one.

>  2 files changed, 110 insertions(+), 85 deletions(-)
>  create mode 100644 Documentation/vm/protection-keys.txt
>  delete mode 100644 Documentation/x86/protection-keys.txt
>
> diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
> new file mode 100644
> index 0000000..b49e6bb
> --- /dev/null
> +++ b/Documentation/vm/protection-keys.txt
> @@ -0,0 +1,110 @@
> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> +found in new generation of intel CPUs on PowerPC CPUs.
> +
> +Memory Protection Keys provides a mechanism for enforcing page-based
> +protections, but without requiring modification of the page tables
> +when an application changes protection domains.

Does resultant access through protection keys should be a
subset of the protection bits enabled through original PTE
PROT format ? Does the semantics exactly the same on x86
and powerpc ?

> +
> +
> +On Intel:
> +
> +It works by dedicating 4 previously ignored bits in each page table
> +entry to a "protection key", giving 16 possible keys.
> +
> +There is also a new user-accessible register (PKRU) with two separate
> +bits (Access Disable and Write Disable) for each key.  Being a CPU
> +register, PKRU is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.
> +
> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> +to the new register.  The feature is only available in 64-bit mode,
> +even though there is theoretically space in the PAE PTEs.  These
> +permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +
> +On PowerPC:
> +
> +It works by dedicating 5 page table entry to a "protection key",
> +giving 32 possible keys.
> +
> +There is a user-accessible register (AMR) with two separate bits
> +(Access Disable and Write Disable) for each key.  Being a CPU
> +register, AMR is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.

Small nit. Space needed here.

> +NOTE: Disabling read permission does not disable
> +write and vice-versa.
> +
> +The feature is available on 64-bit HPTE mode only.
> +
> +'mtspr 0xd, mem' reads the AMR register
> +'mfspr mem, 0xd' writes into the AMR register.
> +
> +Permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +=========================== Syscalls ===========================
> +
> +There are 3 system calls which directly interact with pkeys:
> +
> + int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
> + int pkey_free(int pkey);
> + int pkey_mprotect(unsigned long start, size_t len,
> +  unsigned long prot, int pkey);
> +
> +Before a pkey can be used, it must first be allocated with
> +pkey_alloc().  An application calls the WRPKRU instruction
> +directly in order to change access permissions to memory covered
> +with a key.  In this example WRPKRU is wrapped by a C function
> +called pkey_set().
> +
> + int real_prot = PROT_READ|PROT_WRITE;
> + pkey = pkey_alloc(0, PKEY_DENY_WRITE);
> + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
> + ... application runs here
> +
> +Now, if the application needs to update the data at 'ptr', it can
> +gain access, do the update, then remove its write access:
> +
> + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
> + *ptr = foo; // assign something
> + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
> +
> +Now when it frees the memory, it will also free the pkey since it
> +is no longer in use:
> +
> + munmap(ptr, PAGE_SIZE);
> + pkey_free(pkey);
> +
> +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
> + An example implementation can be found in
> + tools/testing/selftests/x86/protection_keys.c)
> +
> +=========================== Behavior ===========================
> +
> +The kernel attempts to make protection keys consistent with the
> +behavior of a plain mprotect().  For instance if you do this:
> +
> + mprotect(ptr, size, PROT_NONE);
> + something(ptr);
> +
> +you can expect the same effects with protection keys when doing this:
> +
> + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
> + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
> + something(ptr);
> +
> +That should be true whether something() is a direct access to 'ptr'
> +like:
> +
> + *ptr = foo;
> +
> +or when the kernel does the access on the application's behalf like
> +with a read():
> +
> + read(fd, ptr, 1);
> +
> +The kernel will send a SIGSEGV in both cases, but si_code will be set
> +to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
> +the plain mprotect() permissions are violated.

I guess the right thing would be to have three files

* Documentation/vm/protection-keys.txt

        - Generic interface, system calls
        - Signal handling, error codes
        - Semantics of programming with an example

* Documentation/x86/protection-keys.txt

        - Number of active protections keys inside an address space
        - X86 protection key instruction details
        - PTE protection bits placement details
        - Page fault handling
        - Implementation details a bit ?

* Documentation/powerpc/protection-keys.txt

        - Number of active protections keys inside an address space
        - Powerpc instructions details
        - PTE protection bits placement details
        - Page fault handling
        - Implementation details a bit ?

123
Loading...