Avoid memory barrier in read_seqcount() through load acquire
Some architectures support load acquire, which can replace a memory
barrier and save some cycles.
A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq));

requires 13 cycles on ARM64 for an empty loop. Two read memory
barriers are needed, one for each of the seqcount_* functions.
We can replace the first read barrier with a load acquire of
the seqcount, which saves us one barrier.
On ARM64, doing so reduces the cycle count from 13 to 8.
This is a general improvement for the ARM64 architecture and not
specific to a certain processor. The cycle count here was
obtained on a Neoverse N1 (Ampere Altra).
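
As a rough illustration of the difference, here is a userspace sketch
using C11 atomics as a stand-in for the kernel primitives. The sketch_*
names are hypothetical and are not part of this patch or of the seqlock
API. The acquire fence plays the role of the read barrier, and the
acquire load plays the role of the load acquire:

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a seqcount; not the kernel type. */
struct sketch_seqcount {
	atomic_uint sequence;
};

/* Before: plain load of the sequence followed by an explicit barrier. */
static unsigned sketch_begin_barrier(struct sketch_seqcount *s)
{
	unsigned seq;

	/* An odd value means a writer is in progress, so keep polling. */
	while ((seq = atomic_load_explicit(&s->sequence,
					   memory_order_relaxed)) & 1)
		;
	/* Stand-in for the read barrier after the load. */
	atomic_thread_fence(memory_order_acquire);
	return seq;
}

/* After: the ordering is folded into the load itself. */
static unsigned sketch_begin_acquire(struct sketch_seqcount *s)
{
	unsigned seq;

	while ((seq = atomic_load_explicit(&s->sequence,
					   memory_order_acquire)) & 1)
		;
	return seq;
}

/* The retry side keeps its barrier in both variants. */
static int sketch_retry(struct sketch_seqcount *s, unsigned seq)
{
	/* Order the preceding data reads before re-reading the sequence. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence,
				    memory_order_relaxed) != seq;
}
```

Only the begin side changes. The retry side still needs its barrier
because the data reads must be ordered before the second load of the
sequence, which a load acquire cannot provide.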
We can further optimize this by using the cond_load_acquire logic,
which gives an ARM CPU a chance to enter a power saving mode while
waiting for changes to a cacheline. This avoids busy loops and
therefore saves power.
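
The waiting logic can be sketched in portable C11 as follows. This is
only an approximation under stated assumptions: the function names are
hypothetical, and the empty relax hook marks the spot where an
architecture could idle the CPU (for example by waiting for an event on
ARM64) instead of spinning. In kernel terms the equivalent call would
presumably look like smp_cond_load_acquire(&s->sequence, !(VAL & 1)).

```c
#include <stdatomic.h>

/*
 * Hypothetical hook: a real implementation could put the CPU into a
 * low power wait here until the cacheline changes. The sketch does
 * nothing, so it degrades to a plain busy loop.
 */
static inline void sketch_cpu_relax(void)
{
}

/*
 * Wait until the sequence is even (no writer active) and return it
 * with acquire ordering. Modeled loosely on the generic shape of a
 * conditional load acquire: relaxed polling, then acquire ordering
 * once the condition holds.
 */
static unsigned sketch_cond_load_acquire_even(atomic_uint *p)
{
	unsigned val;

	for (;;) {
		val = atomic_load_explicit(p, memory_order_relaxed);
		if (!(val & 1))
			break;
		sketch_cpu_relax();
	}
	/* Upgrade the final relaxed load to acquire ordering. */
	atomic_thread_fence(memory_order_acquire);
	return val;
}
```

The point of the construct is that the wait condition is known to the
primitive, so the architecture is free to choose how to wait rather
than having the caller open code a polling loop.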
The ARM documentation states that load acquire is preferable to
a load plus barrier. In general, that tends to be true on all
compute platforms that support both.
See (as quoted by Linus Torvalds):
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
"Weaker ordering requirements that are imposed by Load-Acquire and
Store-Release instructions allow for micro-architectural
optimizations, which could reduce some of the performance impacts that
are otherwise imposed by an explicit memory barrier.
If the ordering requirement is satisfied using either a Load-Acquire
or Store-Release, then it would be preferable to use these
instructions instead of a DMB"
The patch benefited significantly from the knowledge of the innards
of the seqlock code by Thomas Gleixner.
Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>