Re: copy_4K_page() doesn't use dcbtst?

Paul Mackerras
Hollis Blanchard writes:

> Hi Paul, some Xen people were just noticing that copy_4K_page
> (arch/powerpc/lib/copypage_64.S) doesn't use the dcbtst instruction. Why
> doesn't it help there?

Why would we want to read the cache lines for the destination from
memory when we're only going to overwrite them completely anyway?

A stronger argument would be for using dcbz, but IIRC it actually made
things slower (on POWER4 at least).  I suspect the hardware is
gathering the stores for the whole of each cache line automatically,
so using dcbz doesn't provide any benefit.
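
For readers less familiar with the two instructions: dcbtst is a touch-for-store hint that fetches the destination line into the cache, while dcbz establishes a zeroed line in the cache without reading it from memory. The C sketch below only illustrates where a dcbz would sit in a line-by-line page copy. It is not the kernel's copy_4K_page; the names copy_page_sketch, line_zero_hint and USE_DCBZ are hypothetical, and the 128-byte block size is an assumption that has to match what dcbz actually clears on the target CPU. The destination must be cache-line aligned.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_4K 4096u
#define CACHE_LINE   128u   /* assumption: must equal the block size dcbz clears */

/*
 * Hypothetical helper: on 64-bit PowerPC builds with USE_DCBZ defined,
 * zero the destination cache block so it need not be read from memory.
 * Everywhere else it does nothing.
 */
static inline void line_zero_hint(void *p)
{
#if defined(__powerpc64__) && defined(USE_DCBZ)
    __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
#else
    (void)p;
#endif
}

/* Copy one 4K page a cache line at a time; dst must be CACHE_LINE aligned. */
static void copy_page_sketch(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off < PAGE_SIZE_4K; off += CACHE_LINE) {
        line_zero_hint(d + off);               /* optional dcbz on the line */
        memcpy(d + off, s + off, CACHE_LINE);  /* then overwrite it fully */
    }
}
```

Built without USE_DCBZ this is a plain copy, which gives a baseline to compare the dcbz variant against on a given machine.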

I did a lot of measurements of memory copy speed on POWER4 (using
different copy loops, copy sizes, alignments, cache hot/cold cases)
and the copy_4K_page loop is the fastest I could come up with for
POWER4.  If anyone can come up with a routine that is measurably
faster on current machines, I'm happy to look at it, of course.
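
A measurement of this kind could be set up roughly as in the sketch below. It is not the harness used for the POWER4 numbers above, only a minimal illustration; bench_copy, copy_memcpy and now_ns are hypothetical names. With npages set to 1 the same page is copied repeatedly, which approximates the cache-hot case; a working set much larger than the caches approximates the cold case.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096u

typedef void (*copy_fn)(void *dst, const void *src);

/* Baseline candidate: whatever the C library's memcpy does for 4K. */
static void copy_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, PAGE_BYTES);
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Copy npages distinct pages, rounds times, using fn.
 * Returns the average nanoseconds per page copy, or -1.0 if allocation
 * fails, an argument is zero, or the copy result is wrong.
 */
static double bench_copy(copy_fn fn, size_t npages, unsigned rounds)
{
    size_t bytes = npages * PAGE_BYTES;
    unsigned char *src = aligned_alloc(PAGE_BYTES, bytes);
    unsigned char *dst = aligned_alloc(PAGE_BYTES, bytes);
    double result = -1.0;

    if (src && dst && npages > 0 && rounds > 0) {
        for (size_t i = 0; i < bytes; i++)
            src[i] = (unsigned char)(i * 31u + 7u);
        memset(dst, 0, bytes);          /* fault in the destination pages */

        uint64_t t0 = now_ns();
        for (unsigned r = 0; r < rounds; r++)
            for (size_t p = 0; p < npages; p++)
                fn(dst + p * PAGE_BYTES, src + p * PAGE_BYTES);
        uint64_t t1 = now_ns();

        if (memcmp(dst, src, bytes) == 0)   /* reject incorrect copy loops */
            result = (double)(t1 - t0) / ((double)npages * rounds);
    }
    free(src);
    free(dst);
    return result;
}
```

For example, bench_copy(copy_memcpy, 1, 100000) and bench_copy(copy_memcpy, 65536, 4) give rough hot and cold figures; any other routine with the same signature can be passed in place of copy_memcpy.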

Paul.

_______________________________________________
Linuxppc-dev mailing list
[hidden email]
https://ozlabs.org/mailman/listinfo/linuxppc-dev

Re: copy_4K_page() doesn't use dcbtst?

Hollis Blanchard
On Tue, 2006-08-29 at 10:16 +1000, Paul Mackerras wrote:

> Hollis Blanchard writes:
>
> > Hi Paul, some Xen people were just noticing that copy_4K_page
> > (arch/powerpc/lib/copypage_64.S) doesn't use the dcbtst instruction. Why
> > doesn't it help there?
>
> Why would we want to read the cache lines for the destination from
> memory when we're only going to overwrite them completely anyway?
>
> A stronger argument would be for using dcbz, but IIRC it actually made
> things slower (on POWER4 at least).  I suspect the hardware is
> gathering the stores for the whole of each cache line automatically,
> so using dcbz doesn't provide any benefit.

Yes, dcbz makes more sense.

> I did a lot of measurements of memory copy speed on POWER4 (using
> different copy loops, copy sizes, alignments, cache hot/cold cases)
> and the copy_4K_page loop is the fastest I could come up with for
> POWER4.  If anyone can come up with a routine that is measurably
> faster on current machines, I'm happy to look at it, of course.

I figured you had done measurements; we were just curious about the
unexpected results. Thanks!

--
Hollis Blanchard
IBM Linux Technology Center

Re: copy_4K_page() doesn't use dcbtst?

Segher Boessenkool
In reply to this post by Paul Mackerras
> A stronger argument would be for using dcbz, but IIRC it actually made
> things slower (on POWER4 at least).  I suspect the hardware is
> gathering the stores for the whole of each cache line automatically,
> so using dcbz doesn't provide any benefit.

It seems on 970 at least it still is a nice win.  Do you have any
good benchmarks I could run?

> I did a lot of measurements of memory copy speed on POWER4 (using
> different copy loops, copy sizes, alignments, cache hot/cold cases)
> and the copy_4K_page loop is the fastest I could come up with for
> POWER4.

Yeah, POWER4 is quite a different beast (its memory subsystem,
anyway).  I'm surprised dcbz hurt though; did you schedule it
early enough before the actual data copy?


Segher
