Add mutator collector coordination documentation

This affects a lot of ART and should thus be documented.

It is sufficiently low level that I believe it belongs in the source
tree. It should only be of interest to ART developers, in a broad sense
that includes anyone trying to understand failures involving ART.

Various comment improvements around thread suspension and checkpoints.

Test: Built AOSP.
Change-Id: I77338ff1d6f7c6aefb7be849449770438c15c982
diff --git a/runtime/barrier.h b/runtime/barrier.h
index ad8f66f..432df76 100644
--- a/runtime/barrier.h
+++ b/runtime/barrier.h
@@ -14,7 +14,7 @@
  * limitations under the License.
  */
 
-// CAUTION: THIS IS NOT A FULLY GENERAL BARRIER API.
+// CAUTION: THIS IS NOT A FULLY GENERAL BARRIER API. Some names are unconventional.
 
 // It may either be used as a "latch" or single-use barrier, or it may be reused under
 // very limited conditions, e.g. if only Pass(), but not Wait() is called.  Unlike a standard
@@ -22,7 +22,7 @@
 // Pass() or Wait(), and only then set the count using the Increment() method.  Threads at
 // a Wait() are only awoken if the count reaches zero AFTER the decrement is applied.
 // This works because, also unlike most latch APIs, there is no way to Wait() without
-// decrementing the count, and thus nobody can spuriosly wake up on the initial zero.
+// decrementing the count, and thus nobody can spuriously wake up on the initial zero.
 
 #ifndef ART_RUNTIME_BARRIER_H_
 #define ART_RUNTIME_BARRIER_H_
@@ -52,7 +52,7 @@
   // Pass through the barrier, decrement the count but do not block.
   void Pass(Thread* self) REQUIRES(!GetLock());
 
-  // Wait on the barrier, decrement the count.
+  // Decrement the count, then wait until the count is zero.
   void Wait(Thread* self) REQUIRES(!GetLock());
 
   // The following three calls are only safe if we somehow know that no other thread both
@@ -61,13 +61,13 @@
   // If these calls are made in that situation, the offending thread is likely to go back
   // to sleep, resulting in a deadlock.
 
-  // Increment the count by delta, wait on condition if count is non zero.  If LockHandling is
+  // Increment the count by delta, wait on condition while count is non zero.  If LockHandling is
   // kAllowHoldingLocks we will not check that all locks are released when waiting.
   template <Barrier::LockHandling locks = kDisallowHoldingLocks>
   void Increment(Thread* self, int delta) REQUIRES(!GetLock());
 
-  // Increment the count by delta, wait on condition if count is non zero, with a timeout. Returns
-  // true if time out occurred.
+  // Increment the count by delta, wait on condition while count is non zero, with a timeout.
+  // Returns true if a timeout occurred.
   bool Increment(Thread* self, int delta, uint32_t timeout_ms) REQUIRES(!GetLock());
 
   // Set the count to a new value.  This should only be used if there is no possibility that
diff --git a/runtime/mutator_gc_coord.md b/runtime/mutator_gc_coord.md
new file mode 100644
index 0000000..84327c0
--- /dev/null
+++ b/runtime/mutator_gc_coord.md
@@ -0,0 +1,146 @@
+Mechanisms for Coordination Between Garbage Collector and Mutator
+-----------------------------------------------------------------
+
+Most garbage collection work can proceed concurrently with the client or
+mutator Java threads. But in certain places, for example while tracing from
+thread stacks, the garbage collector needs to ensure that Java data processed
+by the collector is consistent and complete. At these points, the mutators
+should not hold references to the heap that are invisible to the garbage
+collector. And they should not be modifying the data that is visible to the
+collector.
+
+Logically, the collector and mutator share a reader-writer lock on the Java
+heap and associated data structures. Mutators hold the lock in reader or shared mode
+while running Java code or touching heap-related data structures. The collector
+holds the lock in writer or exclusive mode while it needs the heap data
+structures to be stable. However, this reader-writer lock has a very customized
+implementation that also provides additional facilities, such as the ability
+to exclude only a single thread, so that we can specifically examine its heap
+references.
+
+In order to ensure consistency of the Java data, the compiler inserts "suspend
+points", sometimes also called "safe points", into the code. These allow a
+thread to respond to external requests.
+
+Whenever a thread is runnable, i.e. whenever a thread logically holds the
+mutator lock in shared mode, it is expected to regularly execute such a suspend
+point, and check for pending requests. They are currently implemented by
+setting a flag in the thread structure[^1], which is then explicitly tested by the
+compiler-generated code.
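+
+Conceptually, the check executed at each suspend point behaves like the
+following sketch. This is illustrative only: the compiler emits an optimized
+inline equivalent, and the helper name `SuspendCheck` here is hypothetical,
+though `TestAllFlags()` and `CheckSuspend()` are real `Thread` members:
+
+```c++
+// Sketch of the logical effect of a compiler-generated suspend point check.
+inline void SuspendCheck(Thread* self) {
+  if (UNLIKELY(self->TestAllFlags())) {  // any suspend/checkpoint flag set?
+    // Runs pending checkpoint closures and/or blocks while suspended.
+    self->CheckSuspend();
+  }
+}
+```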
+
+A thread responds to suspend requests only when it is "runnable", i.e. logically
+running Java code. When it runs native code, or is blocked in a kernel call, it
+logically releases the mutator lock. When the garbage collector needs mutator
+cooperation, and the thread is not runnable, it is assured that the mutator is
+not touching Java data, and hence the collector can safely perform the required
+action itself, on the mutator thread's behalf.
+
+Normally, when a thread makes a JNI call, it is not considered runnable while
+executing native code. This makes the transitions to and from running native JNI
+code somewhat expensive (see below). But these transitions are necessary to
+ensure that such code, which does not execute "suspend points" and thus cannot
+cooperate with the GC, doesn't delay GC completion. `@FastNative` and
+`@CriticalNative` calls avoid these transitions, instead allowing the thread to
+remain "runnable", at the expense of potentially delaying GC operations for the
+duration of the call.
+
+Although we say that a thread is "suspended" when it is not running Java code,
+it may in fact still be running native code and touching data structures that
+are not considered "Java data". This distinction can be a fine line. For
+example, a Java thread blocked on a Java monitor will normally be "suspended"
+and blocked on a mutex contained in the monitor data structure. But it may wake
+up for reasons beyond ART's control, which will normally result in touching the
+mutex. The monitor code must be quite careful to ensure that this does not cause
+problems, especially if the ART runtime was shut down in the interim and the
+monitor data structure has been reclaimed.
+
+Calls to change thread state
+----------------------------
+
+When a thread changes between running Java and native code, it has to
+correspondingly change its state between "runnable" and one of several
+other states, all of which are considered to be "suspended" for our purposes.
+When a Java thread starts to execute native code, and may thus not respond
+promptly to suspend requests, it will normally create an object of type
+`ScopedThreadSuspension`. `ScopedThreadSuspension`'s constructor changes state to
+the "suspended" state given as an argument, logically releasing the mutator lock
+and promising to no longer touch Java data structures. It also handles any
+pending suspension requests that slid in just before it changed state.
+
+Conversely, `ScopedThreadSuspension`'s destructor waits until the GC has finished
+any actions it is currently performing on the thread's behalf and effectively
+released the mutator exclusive lock, and then returns to runnable state,
+re-acquiring the mutator lock.
+
+Occasionally a thread running native code needs to temporarily access Java data
+structures again. `ScopedObjectAccess` is a similar RAII object whose
+constructor and destructor perform the above transitions in the reverse order
+from `ScopedThreadSuspension`.
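+
+A typical nesting of the two looks roughly like the sketch below.
+`DoBlockingNativeCall()` and `InspectJavaObject()` are hypothetical;
+`ScopedThreadSuspension`, `ScopedObjectAccess`, and the `kNative` state are
+real ART names:
+
+```c++
+// Sketch: a thread leaving and re-entering the runnable state.
+void NativeWorker(Thread* self) {
+  ScopedThreadSuspension sts(self, kNative);  // Runnable -> kNative;
+                                              // logically releases mutator lock
+  DoBlockingNativeCall();                     // GC may run freely; no Java data
+  {
+    ScopedObjectAccess soa(self);             // back to Runnable
+    InspectJavaObject();                      // may touch Java heap data
+  }                                           // Runnable -> kNative again
+}                                             // sts destructor: -> Runnable
+```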
+
+Mutator lock implementation
+---------------------------
+
+The mutator lock is not implemented as a conventional mutex. But it plays by the
+rules of our normal static thread-safety analysis. Thus a function that is
+expected to be called in runnable state, with the ability to access Java data,
+should be annotated with `REQUIRES_SHARED(Locks::mutator_lock_)`.
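+
+For example (a sketch; `VisitObject` is a hypothetical function, but the
+annotation is the real one used throughout ART):
+
+```c++
+// Callable only in runnable state. The static thread-safety analysis will
+// flag callers that cannot show they logically share the mutator lock.
+void VisitObject(mirror::Object* obj) REQUIRES_SHARED(Locks::mutator_lock_);
+```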
+
+There is an explicit `mutator_lock_` object, of type `MutatorMutex`. `MutatorMutex` is
+seemingly a minor refinement of `ReaderWriterMutex`, but it is used entirely
+differently. It is acquired explicitly by clients that need to hold it
+exclusively, and in a small number of cases, it is acquired in shared mode, e.g.
+via `SharedTryLock()`, or by the GC itself. However, more commonly
+`MutatorMutex::TransitionFromSuspendedToRunnable()` is used to logically acquire
+the mutator mutex, e.g. as part of `ScopedObjectAccess` construction.
+
+`TransitionFromSuspendedToRunnable()` does not physically acquire the
+`ReaderWriterMutex` in shared mode. Thus any thread acquiring the lock in exclusive mode
+must, in addition, explicitly arrange for mutator threads to be suspended via the
+thread suspension mechanism, and then make them runnable again on release.
+
+Logically the mutator lock is held in shared/reader mode if ***either*** the
+underlying reader-writer lock is held in shared mode, ***or*** if a mutator is in
+runnable state.
+
+Suspension and checkpoint API
+-----------------------------
+
+Suspend point checks enable three kinds of communication with mutator threads:
+
+**Checkpoints**
+: Checkpoint requests are used to get a thread to perform an action
+on our behalf. `RequestCheckpoint()` asks a specific thread to execute the closure
+supplied as an argument at its leisure. `RequestSynchronousCheckpoint()` in
+addition waits for the thread to complete running the closure, and handles
+suspended threads by running the closure on their behalf. In addition to these
+functions provided by `Thread`, `ThreadList` provides the `RunCheckpoint()` function
+that runs a checkpoint function on behalf of each thread, either by using
+`RequestCheckpoint()` to run it inside a running thread, or by ensuring that a
+suspended thread stays suspended, and then running the function on its behalf.
+`RunCheckpoint()` does not wait for completion of the function calls triggered by
+the resulting `RequestCheckpoint()` invocations.
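+
+A checkpoint closure is an ordinary `Closure`. The sketch below is modeled on
+real callers (e.g. the thread-dumping code), which use a `Barrier` to wait for
+completion; `DumpStackCheckpoint` and `DumpAllStacks` themselves are
+hypothetical names:
+
+```c++
+// Sketch: run a closure in every thread and wait for all of them to finish.
+class DumpStackCheckpoint : public Closure {
+ public:
+  explicit DumpStackCheckpoint(Barrier* barrier) : barrier_(barrier) {}
+  void Run(Thread* thread) override {
+    thread->DumpStack(LOG_STREAM(INFO));  // runs in, or for, the target thread
+    barrier_->Pass(Thread::Current());    // decrement count without blocking
+  }
+ private:
+  Barrier* const barrier_;
+};
+
+void DumpAllStacks(Thread* self) {
+  Barrier barrier(0);
+  DumpStackCheckpoint closure(&barrier);
+  size_t count = Runtime::Current()->GetThreadList()->RunCheckpoint(&closure);
+  // RunCheckpoint does not wait; block here until all `count` closures ran.
+  ScopedThreadStateChange tsc(self, kWaitingForCheckPointsToRun);
+  barrier.Increment(self, count);
+}
+```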
+
+**Empty Checkpoints**
+: `ThreadList` provides `RunEmptyCheckpoint()`, which waits until
+all threads have either passed a suspend point, or have been suspended. This
+ensures that no thread is still executing Java code inside the same
+suspend-point-delimited code interval it was executing before the call. For
+example, a read-barrier started before a `RunEmptyCheckpoint()` call will have
+finished before the call returns.
+
+**Thread suspension**
+: `ThreadList` provides a number of `SuspendThread...()` calls and
+a `SuspendAll()` call to suspend one or all threads until they are resumed by
+`Resume()` or `ResumeAll()`. The `Suspend...` calls guarantee that the target
+thread(s) are suspended (again, only in the sense of not running Java code)
+when the call returns.
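+
+The common pattern for a stop-the-world interval is the real `ScopedSuspendAll`
+RAII helper, sketched here:
+
+```c++
+// Sketch: suspend every other thread for the duration of a scope.
+void StopTheWorldExample() {
+  // Constructor calls ThreadList::SuspendAll(); destructor calls ResumeAll().
+  ScopedSuspendAll ssa(__FUNCTION__);
+  // All other threads are now "suspended": none is running Java code, so
+  // heap data structures may be examined or modified safely here.
+}
+```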
+
+[^1]: Some comments in the code refer to a not-yet-really-implemented scheme in
+which the compiler-generated code would load through the address at
+`tlsPtr_.suspend_trigger`. A thread suspension is requested by setting this to
+null, triggering a `SIGSEGV`, causing that thread to check for GC cooperation
+requests. The real mechanism instead sets an appropriate `ThreadFlag` entry to
+request suspension or a checkpoint. Note that the actual checkpoint function
+value is set, along with the flag, while holding `suspend_count_lock_`. If the
+target thread notices that a checkpoint is requested, it then acquires
+the `suspend_count_lock_` to read the checkpoint function.
diff --git a/runtime/thread-inl.h b/runtime/thread-inl.h
index a6da16f..9d96e9d 100644
--- a/runtime/thread-inl.h
+++ b/runtime/thread-inl.h
@@ -202,7 +202,8 @@
     new_state_and_flags.as_struct.flags = old_state_and_flags.as_struct.flags;
     new_state_and_flags.as_struct.state = new_state;
 
-    // CAS the value with a memory ordering.
+    // CAS the value, ensuring that prior memory operations are visible to any thread
+    // that observes that we are suspended.
     bool done =
         tls32_.state_and_flags.as_atomic_int.CompareAndSetWeakRelease(old_state_and_flags.as_int,
                                                                         new_state_and_flags.as_int);
diff --git a/runtime/thread.h b/runtime/thread.h
index 039fdd9..8cd16c3 100644
--- a/runtime/thread.h
+++ b/runtime/thread.h
@@ -279,13 +279,18 @@
       WARN_UNUSED
       REQUIRES(Locks::thread_suspend_count_lock_);
 
-  // Requests a checkpoint closure to run on another thread. The closure will be run when the thread
-  // gets suspended. This will return true if the closure was added and will (eventually) be
-  // executed. It returns false otherwise.
+  // Requests a checkpoint closure to run on another thread. The closure will be run when the
+  // thread notices the request, either in an explicit runtime CheckSuspend() call, or in a call
+  // originating from a compiler generated suspend point check. This returns true if the closure
+  // was added and will (eventually) be executed. It returns false otherwise.
   //
-  // Since multiple closures can be queued and some closures can delay other threads from running no
-  // closure should attempt to suspend another thread while running.
+  // Since multiple closures can be queued and some closures can delay other threads from running,
+  // no closure should attempt to suspend another thread while running.
   // TODO We should add some debug option that verifies this.
+  //
+  // This guarantees that the RequestCheckpoint invocation happens-before the function invocation:
+  // RequestCheckpointFunction holds thread_suspend_count_lock_, and RunCheckpointFunction
+  // acquires it.
   bool RequestCheckpoint(Closure* function)
       REQUIRES(Locks::thread_suspend_count_lock_);
 
@@ -1180,6 +1185,9 @@
   // Trigger a suspend check by making the suspend_trigger_ TLS value an invalid pointer.
   // The next time a suspend check is done, it will load from the value at this address
   // and trigger a SIGSEGV.
+  // Only needed if Runtime::implicit_suspend_checks_ is true and fully implemented.  It currently
+  // is always false. Client code currently just looks at the thread flags directly to determine
+  // whether we should suspend, so this call is currently unnecessary.
   void TriggerSuspend() {
     tlsPtr_.suspend_trigger = nullptr;
   }
@@ -1454,7 +1462,7 @@
   // Runs a single checkpoint function. If there are no more pending checkpoint functions it will
   // clear the kCheckpointRequest flag. The caller is responsible for calling this in a loop until
   // the kCheckpointRequest flag is cleared.
-  void RunCheckpointFunction();
+  void RunCheckpointFunction() REQUIRES(!Locks::thread_suspend_count_lock_);
   void RunEmptyCheckpoint();
 
   bool PassActiveSuspendBarriers(Thread* self)
@@ -1478,7 +1486,7 @@
     StateAndFlags() {}
     struct PACKED(4) {
       // Bitfield of flag values. Must be changed atomically so that flag values aren't lost. See
-      // ThreadFlags for bit field meanings.
+      // ThreadFlag for bit field meanings.
       volatile uint16_t flags;
       // Holds the ThreadState. May be changed non-atomically between Suspended (ie not Runnable)
       // transitions. Changing to Runnable requires that the suspend_request be part of the atomic
diff --git a/runtime/thread_list.cc b/runtime/thread_list.cc
index acc5f17..fb4f6d3 100644
--- a/runtime/thread_list.cc
+++ b/runtime/thread_list.cc
@@ -326,7 +326,7 @@
   std::vector<Thread*> suspended_count_modified_threads;
   size_t count = 0;
   {
-    // Call a checkpoint function for each thread, threads which are suspend get their checkpoint
+    // Call a checkpoint function for each thread; threads which are suspended get their checkpoint
     // manually called.
     MutexLock mu(self, *Locks::thread_list_lock_);
     MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
diff --git a/runtime/thread_list.h b/runtime/thread_list.h
index 8fc219b..1dcdf62 100644
--- a/runtime/thread_list.h
+++ b/runtime/thread_list.h
@@ -108,11 +108,13 @@
   // Used by Monitor to provide (mostly accurate) debugging information.
   bool Contains(Thread* thread) REQUIRES(Locks::thread_list_lock_);
 
-  // Run a checkpoint on threads, running threads are not suspended but run the checkpoint inside
-  // of the suspend check. Returns how many checkpoints that are expected to run, including for
-  // already suspended threads for b/24191051. Run the callback, if non-null, inside the
-  // thread_list_lock critical section after determining the runnable/suspended states of the
-  // threads.
+  // Run a checkpoint on all threads. Return the total number of threads for which the checkpoint
+  // function has been or will be called.
+  // Running threads are not suspended but run the checkpoint inside of the suspend check. The
+  // return value includes already suspended threads for b/24191051. Runs or requests the
+  // callback, if non-null, inside the thread_list_lock critical section after determining the
+  // runnable/suspended states of the threads. Does not wait for completion of the callbacks in
+  // running threads.
   size_t RunCheckpoint(Closure* checkpoint_function, Closure* callback = nullptr)
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);