[optimizing] Improve x86 parallel moves/swaps

Add a new constructor to ScratchRegisterScope that will supply a
register if there is a free one, but not spill to force one.  Use this
to generated alternate code that doesn't use a temporary, as the
spill/restore of a register generates extra instructions that aren't
necessary on x86.

Here is the benefit for a 32 bit memory-to-memory exchange with no
free registers:
<        50    	       push eax
<        53    	       push ebx
<  8B44244C    	       mov eax, [esp + 76]
<  8B5C246C    	       mov ebx, [esp + 108]
<  8944246C    	       mov [esp + 108], eax
<  895C244C    	       mov [esp + 76], ebx
<        5B    	       pop ebx
<        58    	       pop eax
---
>  FF742444    	       push [esp + 68]
>  FF742468    	       push [esp + 104]
>  8F44244C    	       pop [esp + 72]
>  8F442468    	       pop [esp + 100]

Avoid using xchg instruction, as it is slow on smaller processors.

Change-Id: Id29ee3abd998577baaee552d55d23e60ae0c7871
Signed-off-by: Mark Mendell <mark.p.mendell@intel.com>
6 files changed