Thread-local bump pointer allocator.

Added a thread-local allocator to the heap. Each thread has three
pointers which specify its thread-local buffer: start, cur, and
end. When the remaining space in the thread-local buffer isn't large
enough for an allocation, the allocator allocates a new thread-local
buffer using the bump pointer allocator.
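
A minimal sketch of the fast path and slow path described above. All
names here (ThreadLocalBuffer, SharedBumpRegion, AllocFromTlab,
kTlabSize) are hypothetical, and the buffer size, alignment, and
locking are assumptions made for illustration; this is not the code
in this change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical names throughout; this is not the ART implementation.
constexpr size_t kAlignment = 8;
constexpr size_t kTlabSize = 1024;  // Assumed buffer size for the sketch.

inline size_t RoundUp(size_t n) {
  return (n + kAlignment - 1) & ~(kAlignment - 1);
}

// Shared bump pointer region that hands out whole buffers under a lock.
class SharedBumpRegion {
 public:
  SharedBumpRegion(uint8_t* begin, size_t size)
      : pos_(begin), limit_(begin + size) {}

  uint8_t* AllocBuffer(size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    if (static_cast<size_t>(limit_ - pos_) < bytes) {
      return nullptr;  // Region exhausted; a real heap would run a GC here.
    }
    uint8_t* result = pos_;
    pos_ += bytes;
    return result;
  }

 private:
  std::mutex mu_;
  uint8_t* pos_;
  uint8_t* limit_;
};

// The three per-thread pointers named in the description.
struct ThreadLocalBuffer {
  uint8_t* start = nullptr;
  uint8_t* cur = nullptr;
  uint8_t* end = nullptr;
};

// Fast path: bump cur without locking.
// Slow path: fetch a fresh buffer from the shared region, then bump.
inline void* AllocFromTlab(ThreadLocalBuffer& tlab, SharedBumpRegion& region,
                           size_t bytes) {
  bytes = RoundUp(bytes);
  if (static_cast<size_t>(tlab.end - tlab.cur) < bytes) {
    size_t buffer_size = bytes > kTlabSize ? bytes : kTlabSize;
    uint8_t* fresh = region.AllocBuffer(buffer_size);
    if (fresh == nullptr) {
      return nullptr;
    }
    tlab.start = fresh;
    tlab.cur = fresh;
    tlab.end = fresh + buffer_size;
  }
  void* result = tlab.cur;
  tlab.cur += bytes;
  return result;
}
```

Only the slow path takes the lock in this sketch, which is the point
of keeping start, cur, and end per thread. Whatever space is left in
the old buffer when a new one is fetched is abandoned.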

The bump pointer space had to be modified to accommodate thread-local
buffers. These buffers are called "blocks", where a block is a
buffer which contains a set of adjacent objects. Blocks aren't
necessarily full and may have wasted memory towards the end. Each
block has an 8-byte header which specifies its size and is required
for traversing bump pointer spaces.
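
To illustrate why the size header is needed for traversal, here is a
rough sketch that hops from block to block using only the headers.
BlockHeader, WriteBlock, and WalkBlocks are hypothetical names, and
the assumption that the size field counts the payload bytes after the
header is mine, not something this change specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 8-byte header; the field layout is assumed.
struct BlockHeader {
  uint64_t size;  // Assumed meaning: payload bytes following the header.
};
static_assert(sizeof(BlockHeader) == 8, "header must be 8 bytes");

// Writes a header at pos and returns the address just past the block.
inline uint8_t* WriteBlock(uint8_t* pos, uint64_t payload_size) {
  BlockHeader header{payload_size};
  std::memcpy(pos, &header, sizeof(header));
  return pos + sizeof(header) + payload_size;
}

// Visits every block in [begin, end) and returns the payload sizes found.
inline std::vector<uint64_t> WalkBlocks(const uint8_t* begin,
                                        const uint8_t* end) {
  std::vector<uint64_t> sizes;
  const uint8_t* pos = begin;
  while (static_cast<size_t>(end - pos) >= sizeof(BlockHeader)) {
    BlockHeader header;
    std::memcpy(&header, pos, sizeof(header));
    pos += sizeof(header);
    if (header.size > static_cast<uint64_t>(end - pos)) {
      break;  // Header claims more bytes than remain; stop the walk.
    }
    sizes.push_back(header.size);
    pos += header.size;  // Skip the payload, including any unused tail.
  }
  return sizes;
}
```

Without the header, a walker could not tell where a partially filled
block ends and the next one begins.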

Memory usage falls between full bump pointer and ROSAlloc, since
madvised memory limits wasted RAM to an average of 1/2 page per
block.

Added a runtime option -XX:UseTLAB which specifies whether or
not to use the thread-local allocator. It's a no-op if the garbage
collector is not the semispace collector.

TODO: Smarter block accounting so that a walk doesn't have to read
objects until it either hits the end of the block or finds
GetClass() == null, which signifies that the block isn't 100% full.
This would provide a slight speedup to BumpPointerSpace::Walk.

Timings: -XX:HeapMinFree=4m -XX:HeapMaxFree=8m -Xmx48m
ritzperf memalloc:
Dalvik -Xgc:concurrent: 11678
Dalvik -Xgc:noconcurrent: 6697
-Xgc:MS: 5978
-Xgc:SS: 4271
-Xgc:CMS: 4150
-Xgc:SS -XX:UseTLAB: 3255

Bug: 9986565
Bug: 12042213

Change-Id: Ib7e1d4b199a8199f3b1de94b0a7b6e1730689cad
diff --git a/runtime/runtime.cc b/runtime/runtime.cc
index e1b4d7e..ff7b8f5 100644
--- a/runtime/runtime.cc
+++ b/runtime/runtime.cc
@@ -355,7 +355,7 @@
   parsed->heap_min_free_ = gc::Heap::kDefaultMinFree;
   parsed->heap_max_free_ = gc::Heap::kDefaultMaxFree;
   parsed->heap_target_utilization_ = gc::Heap::kDefaultTargetUtilization;
-  parsed->heap_growth_limit_ = 0;  // 0 means no growth limit.
+  parsed->heap_growth_limit_ = 0;  // 0 means no growth limit .
   // Default to number of processors minus one since the main GC thread also does work.
   parsed->parallel_gc_threads_ = sysconf(_SC_NPROCESSORS_CONF) - 1;
   // Only the main GC thread, no workers.
@@ -365,6 +365,7 @@
   parsed->stack_size_ = 0;  // 0 means default.
   parsed->max_spins_before_thin_lock_inflation_ = Monitor::kDefaultMaxSpinsBeforeThinLockInflation;
   parsed->low_memory_mode_ = false;
+  parsed->use_tlab_ = false;
 
   parsed->is_compiler_ = false;
   parsed->is_zygote_ = false;
@@ -540,6 +541,8 @@
       parsed->ignore_max_footprint_ = true;
     } else if (option == "-XX:LowMemoryMode") {
       parsed->low_memory_mode_ = true;
+    } else if (option == "-XX:UseTLAB") {
+      parsed->use_tlab_ = true;
     } else if (StartsWith(option, "-D")) {
       parsed->properties_.push_back(option.substr(strlen("-D")));
     } else if (StartsWith(option, "-Xjnitrace:")) {
@@ -925,7 +928,8 @@
                        options->low_memory_mode_,
                        options->long_pause_log_threshold_,
                        options->long_gc_log_threshold_,
-                       options->ignore_max_footprint_);
+                       options->ignore_max_footprint_,
+                       options->use_tlab_);
 
   dump_gc_performance_on_shutdown_ = options->dump_gc_performance_on_shutdown_;