Improve aarch64 MonitorEntry/Exit assembly code

We make two kinds of changes:

1) We remove some redundant moves, which appear to have been copied
from an architecture with a two-address instruction format.

2) We avoid the use of dmb barrier instructions, and instead use
acquire/release instructions for the actual lock loads/updates.
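
As a rough illustration of (2) only, the two styles can be sketched in
portable C++ rather than aarch64 assembly. The one-word lock, the
FenceLock/AcqRelLock names and the TryLock/Unlock methods are
hypothetical and are not ART's lock word layout or entrypoints. The
comments on instruction selection describe what compilers typically
emit for aarch64, not what this change contains.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-word lock: 0 = unlocked, 1 = locked.

// Barrier style: relaxed accesses ordered by standalone fences.
// A seq_cst fence is typically lowered to a dmb on aarch64.
struct FenceLock {
  std::atomic<int> word{0};

  bool TryLock() {
    int expected = 0;
    bool ok = word.compare_exchange_strong(expected, 1,
                                           std::memory_order_relaxed,
                                           std::memory_order_relaxed);
    if (ok) {
      // Keep the critical section from moving above the lock update.
      std::atomic_thread_fence(std::memory_order_seq_cst);
    }
    return ok;
  }

  void Unlock() {
    // Keep the critical section from moving below the unlocking store.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    word.store(0, std::memory_order_relaxed);
  }
};

// Acquire/release style: the ordering is attached to the lock accesses
// themselves. On aarch64 this is typically an acquiring exclusive load
// (or an LSE atomic where available) and a releasing store.
struct AcqRelLock {
  std::atomic<int> word{0};

  bool TryLock() {
    int expected = 0;
    return word.compare_exchange_strong(expected, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  void Unlock() {
    word.store(0, std::memory_order_release);
  }
};
```

Both variants give the critical section the ordering a mutex needs. The
second one does so without separate barrier instructions, which is the
trade-off discussed below.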

(2) is a clear win on A53/A57, where acquire/release semantics appear
to add very little cost when combined with "exclusive" memory
operations, as they are here. On the cores used in 2016 Pixel phones,
the story is more mixed. But adding acquire/release to a pair of
exclusive load/store operations still seems to cost sufficiently less
than two dmb instructions that, even if 10% of lock acquisitions are
nested and unnecessarily enforce ordering, we come out slightly ahead.
ARM's advice for the future is also to move in this direction.

Test: AOSP boots. The ART test failures seen on AOSP seem attributable
to other issues.

Change-Id: I2399baeab3df93196471e65612c00d95ad4e2b62
1 file changed