Skip to content

Conversation

@dkhalanskyjb
Copy link
Collaborator

Fixes #4418
(unless it keeps happening)

This problem couldn't be reproduced locally, to this fix is purely analytical.

The problematic test attempts to launch a coroutine then await until the coroutine suspends.
The way it was doing that before the change is:

  • Hold a monitor and wait on the test body side;
  • Acquire a monitor and notify on the coroutine side right before the suspension point;
  • On the test body side, wait for the coroutine thread to enter the TIMED_WAIT state, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached.

The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed.

With this change, we establish a complete happens-before chain:

  • The test body wakes up after it's resumed as a coroutine.
  • complete on a latch happens-before the resume.
  • The suspension happens-before the complete, as suspension and the complete are done in the same thread.

With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.

Fixes #4418
(unless it keeps happening)

This problem couldn't be reproduced locally, to this fix is purely
analytical.

The problematic test attempts to launch a coroutine then await
until the coroutine suspends.
The way it was doing that before the change is:
- Hold a monitor and `wait` on the test body side;
- Acquire a monitor and `notify` on the coroutine side
  *right before* the suspension point;
- On the test body side, wait for the coroutine thread to enter the
  `TIMED_WAIT` state, indicating that its scheduler worker
  has finished its piece of work and now waits for new commands,
  which must mean the suspension point was reached.

The problem is that thread states are not synchronization
primitives, and no happens-before is established between the
code a thread executes before the state change and the code right
after the state change is observed.

With this change, we establish a complete happens-before chain:
- The test body wakes up after it's `resume`d as a coroutine.
- `complete` on a latch happens-before the `resume`.
- The suspension happens-before the `complete`,
  as suspension and the `complete` are done in the same thread.

With no way to verify the fix, it's unclear if that was
the problem, so we can only hope the change helps.
Copy link
Contributor

@murfel murfel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this helps and the test never fails in the next few months, please consider fixing the other usages of awaitCoroutine here as well.


private suspend fun sleepingNestedMethod() {
private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
yield() // Suspension point
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why this yield is needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it to ensure that the test body arrives at the latch?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's an insurance against any functions being optimized out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Removing this yield breaks the test. Adding yield back, but after your code but before delay (L206.5) makes the test pass. Is this consistent with your understanding of being optimized out?

Copy link
Contributor

@murfel murfel Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deciphered the yield() // TCE on L195 and confirm that that is needed, but this one is not clear for me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this consistent with your understanding of being optimized out?

Pretty much.

I've double-checked what's going on, here's the decompiled version of sleepingNestedMethod with yield:

   private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
      Continuation $continuation;
      label27: {
         if ($completion instanceof <undefinedtype>) {
            $continuation = (<undefinedtype>)$completion;
            if (($continuation.label & Integer.MIN_VALUE) != 0) {
               $continuation.label -= Integer.MIN_VALUE;
               break label27;
            }
         }

         $continuation = new ContinuationImpl($completion) {
            Object L$0;
            Object L$1;
            // $FF: synthetic field
            Object result;
            int label;

            @Nullable
            public final Object invokeSuspend(@NotNull Object $result) {
               this.result = $result;
               this.label |= Integer.MIN_VALUE;
               return CoroutinesDumpTest.this.sleepingNestedMethod((CoroutineDispatcher)null, (CompletableDeferred)null, (Continuation)this);
            }
         };
      }

      Object $result = $continuation.result;
      Object var6 = IntrinsicsKt.getCOROUTINE_SUSPENDED();
      switch ($continuation.label) {
         case 0:
            ResultKt.throwOnFailure($result);
            $continuation.L$0 = currentDispatcher;
            $continuation.L$1 = latch;
            $continuation.label = 1;
            if (YieldKt.yield($continuation) == var6) {
               return var6;
            }
            break;
         case 1:
            latch = (CompletableDeferred)$continuation.L$1;
            currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
            ResultKt.throwOnFailure($result);
            break;
         case 2:
            latch = (CompletableDeferred)$continuation.L$1;
            currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
            ResultKt.throwOnFailure($result);
            return Unit.INSTANCE;
         default:
            throw new IllegalStateException("call to 'resume' before 'invoke' with coroutine");
      }

      currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
      $continuation.L$0 = SpillingKt.nullOutSpilledVariable(currentDispatcher);
      $continuation.L$1 = SpillingKt.nullOutSpilledVariable(latch);
      $continuation.label = 2;
      if (DelayKt.delay(Long.MAX_VALUE, $continuation) == var6) {
         return var6;
      } else {
         return Unit.INSTANCE;
      }
   }

and without:

   private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
      currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
      Object var10000 = DelayKt.delay(Long.MAX_VALUE, $completion);
      return var10000 == IntrinsicsKt.getCOROUTINE_SUSPENDED() ? var10000 : Unit.INSTANCE;
   }

Without yield, the whole state machine of sleepingNestedMethod gets eliminated. This means the continuation of delay is not sleepingNestedMethod but directly sleepingOuterMethod. If we force the state machine of sleepingNestedMethod to get generated, delay has to return to it.

The test will also pass with this:

    private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
        /* Schedule a computation on the current single-threaded dispatcher.
        Since that thread is currently running this code,
        the start notification will happen *after* the currently running coroutine suspends. */
        currentDispatcher.dispatch(currentDispatcher) {
            coroutineThread = Thread.currentThread()
            latch.complete(Unit)
        }
        delay(Long.MAX_VALUE)
        yield() // <----------------
    }

I find this version a clearer instruction to the compiler, so I moved the yield().

@dkhalanskyjb
Copy link
Collaborator Author

dkhalanskyjb commented Dec 12, 2025

please consider fixing the other usages of awaitCoroutine here as well.

I'd gladly do that, but for the coroutines in other tests, we're interested in points when they are still running, not suspended, so I don't know how to establish a robust happens-before relationship.

@dkhalanskyjb dkhalanskyjb requested a review from murfel December 12, 2025 14:42
@murfel
Copy link
Contributor

murfel commented Dec 12, 2025

Oh. I guess a comment will do, or at least a mental note.

private suspend fun sleepingOuterMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
sleepingNestedMethod(currentDispatcher, latch)
yield() // TCE
yield() // TCE: make sure `sleepingOuterMethod` is contained in the continuation of `sleepingNestedMethod`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this one avoids being inlined into the test method, so that sleepingOuterMethod definitely appears in the stackstrace

latch.complete(Unit)
}
delay(Long.MAX_VALUE)
yield() // TCE: make sure `sleepingNestedMethod` is contained in the continuation of `delay`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one avoids being inlined into sleepingOuterMethod, so that sleepingNestedMethod definitely appears in the stacktrace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to make the step from your comment to my understanding. Firstly, because I don't understand how's yield preventing inlining, and also how's the fact that yield is contained in a continuation prevents inlining.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, let me read your huge comment above, which I didn't see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of see how it works now, but your comments skips a few steps of reasoning, so it's a bit hard to decipher without extra thinking or additional context.

Could add info to your comment:

  • Used as an explicit suspension point which enforces state-machine being generated for sleepingNestedMethod
  • Without the state machine, the delay's continuation is sleepingOuterMethod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I don't think that "TCE" comment is useful, really, either here or above. It is in fact tail-call elimination elimination, TCEE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a balance to strike so that the useful information contained in a comment doesn't get obscured by restating the knowledge that's implicitly required. Using yield() to ensure a suspend function's stack frame is preserved is a common pattern in our tests, and restating this everywhere feels excessive to me. In my opinion, https://github.com/Kotlin/KEEP/blob/main/proposals/KEEP-0164-coroutines.md, when taken into account together with the TCE mark, provide enough context for the yield().

In any case, this discussion doesn't feel relevant to the PR we're discussing. If you would like us to improve the documentation of the places where we disable the tail call optimization of suspend functions, we can do so in a separate change applied consistently in the codebase.

Comment on lines 209 to 214
private fun awaitCoroutine() = synchronized(monitor) {
while (coroutineThread == null) (monitor as Object).wait()
while (coroutineThread!!.state != Thread.State.TIMED_WAITING) {
// Wait until thread sleeps to have a consistent stacktrace
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave a comment that awaitCoroutine is problematic but there's no obvious fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants