Attempt to fix a flaky coroutine-dump-verifying test #4589

dkhalanskyjb · 2025-12-11T13:05:45Z

Fixes #4418
(unless it keeps happening)

This problem couldn't be reproduced locally, so this fix is purely analytical.

The problematic test launches a coroutine and then waits until that coroutine suspends.
Before this change, it did so as follows:

- Hold a monitor and `wait` on the test body side;
- Acquire a monitor and `notify` on the coroutine side *right before* the suspension point;
- On the test body side, wait for the coroutine thread to enter the `TIMED_WAITING` state, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached.

The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed.
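As an illustration only (this is not code from the PR), the sketch below contrasts the two ways of waiting using plain JVM threads. `Box`, `readAfterStatePolling`, and `readAfterLatch` are hypothetical names. The first function polls `Thread.state`, which the JDK documents as a monitoring aid rather than synchronization control, so the read of the plain field is a data race by specification even if it usually appears to work. The second uses a `CountDownLatch`, whose `countDown`/`await` pair does establish happens-before.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Hypothetical holder with a plain (non-volatile) field.
class Box { var payload: String? = null }

// Racy by specification: observing a thread state does not order the
// worker's earlier write before the observer's later read.
fun readAfterStatePolling(): String? {
    val box = Box()
    val worker = thread(isDaemon = true) {
        box.payload = "written"
        try { Thread.sleep(60_000) } catch (e: InterruptedException) { /* exit */ }
    }
    while (worker.state != Thread.State.TIMED_WAITING) Thread.yield()
    val seen = box.payload // data race: no guarantee this sees "written"
    worker.interrupt()
    return seen
}

// Well-defined: everything before countDown() happens-before
// everything after a successful await().
fun readAfterLatch(): String? {
    val box = Box()
    val ready = CountDownLatch(1)
    val worker = thread(isDaemon = true) {
        box.payload = "written"
        ready.countDown()
        try { Thread.sleep(60_000) } catch (e: InterruptedException) { /* exit */ }
    }
    ready.await()
    val seen = box.payload
    worker.interrupt()
    return seen
}
```

Only `readAfterLatch()` is guaranteed to return `"written"`; the polling variant may do so on a given JVM, but nothing in the memory model requires it.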

With this change, we establish a complete happens-before chain:

- The test body wakes up after it's `resume`d as a coroutine.
- `complete` on a latch happens-before the `resume`.
- The suspension happens-before the `complete`, as suspension and the `complete` are done in the same thread.
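The chain above can be sketched with a thread-based analogy that needs only `java.util.concurrent`. This is an assumption-laden stand-in, not the PR's code: a single-thread executor plays the role of the single-threaded dispatcher, the outer task returning plays the role of the coroutine suspending, and a `CountDownLatch` plays the role of the `CompletableDeferred` latch. `notificationOrder` is a hypothetical name.

```kotlin
import java.util.Collections
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors

fun notificationOrder(): List<String> {
    val events = Collections.synchronizedList(mutableListOf<String>())
    val executor = Executors.newSingleThreadExecutor()
    val latch = CountDownLatch(1)

    executor.execute {
        // Queue the notification on the thread that is running right now.
        // It cannot start until the current task has returned.
        executor.execute {
            events.add("notified")
            latch.countDown()
        }
        // Stand-in for "the coroutine reached its suspension point".
        events.add("outer task returned")
    }

    // The waiting side only proceeds once the queued notification has run,
    // which in turn can only happen after the outer task gave up the thread.
    latch.await()
    executor.shutdown()
    return events.toList()
}
```

Because the executor has a single worker, the notification task is ordered after the outer task by construction, and the latch carries that ordering over to the waiting thread.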

With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.


murfel

In case this helps and the test never fails in the next few months, please consider fixing the other usages of awaitCoroutine here as well.

murfel · 2025-12-12T12:54:33Z

kotlinx-coroutines-debug/test/CoroutinesDumpTest.kt


-    private suspend fun sleepingNestedMethod() {
+    private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
        yield() // Suspension point


I'm confused why this yield is needed?

Is it to ensure that the test body arrives at the latch?

I think that's an insurance against any functions being optimized out.

Hmm. Removing this yield breaks the test. Adding the yield back after your code but before delay (L206.5) makes the test pass. Is this consistent with your understanding of being optimized out?

I deciphered the yield() // TCE on L195 and confirm that it is needed, but this one is not clear to me.

> Is this consistent with your understanding of being optimized out?

Pretty much.

I've double-checked what's going on, here's the decompiled version of sleepingNestedMethod with yield:

    private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
        Continuation $continuation;
        label27: {
            if ($completion instanceof <undefinedtype>) {
                $continuation = (<undefinedtype>)$completion;
                if (($continuation.label & Integer.MIN_VALUE) != 0) {
                    $continuation.label -= Integer.MIN_VALUE;
                    break label27;
                }
            }

            $continuation = new ContinuationImpl($completion) {
                Object L$0;
                Object L$1;
                // $FF: synthetic field
                Object result;
                int label;

                @Nullable
                public final Object invokeSuspend(@NotNull Object $result) {
                    this.result = $result;
                    this.label |= Integer.MIN_VALUE;
                    return CoroutinesDumpTest.this.sleepingNestedMethod((CoroutineDispatcher)null, (CompletableDeferred)null, (Continuation)this);
                }
            };
        }

        Object $result = $continuation.result;
        Object var6 = IntrinsicsKt.getCOROUTINE_SUSPENDED();
        switch ($continuation.label) {
            case 0:
                ResultKt.throwOnFailure($result);
                $continuation.L$0 = currentDispatcher;
                $continuation.L$1 = latch;
                $continuation.label = 1;
                if (YieldKt.yield($continuation) == var6) {
                    return var6;
                }
                break;
            case 1:
                latch = (CompletableDeferred)$continuation.L$1;
                currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
                ResultKt.throwOnFailure($result);
                break;
            case 2:
                latch = (CompletableDeferred)$continuation.L$1;
                currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
                ResultKt.throwOnFailure($result);
                return Unit.INSTANCE;
            default:
                throw new IllegalStateException("call to 'resume' before 'invoke' with coroutine");
        }

        currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
        $continuation.L$0 = SpillingKt.nullOutSpilledVariable(currentDispatcher);
        $continuation.L$1 = SpillingKt.nullOutSpilledVariable(latch);
        $continuation.label = 2;
        if (DelayKt.delay(Long.MAX_VALUE, $continuation) == var6) {
            return var6;
        } else {
            return Unit.INSTANCE;
        }
    }

and without:

    private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
        currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
        Object var10000 = DelayKt.delay(Long.MAX_VALUE, $completion);
        return var10000 == IntrinsicsKt.getCOROUTINE_SUSPENDED() ? var10000 : Unit.INSTANCE;
    }

Without yield, the whole state machine of sleepingNestedMethod gets eliminated. This means the continuation of delay is not sleepingNestedMethod but directly sleepingOuterMethod. If we force the state machine of sleepingNestedMethod to get generated, delay has to return to it.

The test will also pass with this:

    private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
        /* Schedule a computation on the current single-threaded dispatcher.
        Since that thread is currently running this code,
        the start notification will happen *after* the currently running coroutine suspends. */
        currentDispatcher.dispatch(currentDispatcher) {
            coroutineThread = Thread.currentThread()
            latch.complete(Unit)
        }
        delay(Long.MAX_VALUE)
        yield() // <----------------
    }

I find this version a clearer instruction to the compiler, so I moved the yield().
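The effect described in this comment can be observed with the Kotlin standard library alone. The sketch below is hypothetical (none of the names come from `CoroutinesDumpTest`) and assumes the compiler applies the same tail-call optimization that the decompiled output shows. It parks a coroutine, then walks the `CoroutineStackFrame` chain of the captured continuation, which is the same kind of information a coroutine dump relies on.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.intrinsics.COROUTINE_SUSPENDED
import kotlin.coroutines.intrinsics.suspendCoroutineUninterceptedOrReturn
import kotlin.coroutines.jvm.internal.CoroutineStackFrame
import kotlin.coroutines.startCoroutine

// Continuation captured at the suspension point (hypothetical helper state).
private var parked: Continuation<Unit>? = null

// Suspends forever and remembers the continuation it was handed.
private suspend fun park(): Unit = suspendCoroutineUninterceptedOrReturn { cont ->
    parked = cont
    COROUTINE_SUSPENDED
}

// Stand-in for yield(): a suspend call that only matters by its position.
private suspend fun noop() {}

private suspend fun nestedTail() {
    park() // tail call: expected to receive the caller's continuation directly
}

private suspend fun nestedKept() {
    park()
    noop() // a suspend call after park() forces a state machine for nestedKept
}

private suspend fun outer(keepFrame: Boolean) {
    if (keepFrame) nestedKept() else nestedTail()
    noop() // keeps outer's own frame, like the existing yield() // TCE
}

// Method names found in the continuation chain of the parked coroutine.
fun suspendedFrames(keepFrame: Boolean): List<String> {
    parked = null
    suspend { outer(keepFrame) }
        .startCoroutine(Continuation<Unit>(EmptyCoroutineContext) { })
    val names = mutableListOf<String>()
    var frame = parked as? CoroutineStackFrame
    while (frame != null) {
        frame.getStackTraceElement()?.let { names += it.methodName }
        frame = frame.callerFrame
    }
    return names
}
```

Under that assumption, `suspendedFrames(true)` includes `nestedKept` while `suspendedFrames(false)` has no entry for `nestedTail`, mirroring how `sleepingNestedMethod` drops out of the dump when its state machine is eliminated.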


dkhalanskyjb · 2025-12-12T14:41:46Z

> please consider fixing the other usages of awaitCoroutine here as well.

I'd gladly do that, but for the coroutines in other tests, we're interested in points when they are still running, not suspended, so I don't know how to establish a robust happens-before relationship.

murfel · 2025-12-12T15:21:01Z

Oh. I guess a comment will do, or at least a mental note.

murfel · 2025-12-15T12:56:26Z

kotlinx-coroutines-debug/test/CoroutinesDumpTest.kt

    private suspend fun sleepingOuterMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
        sleepingNestedMethod(currentDispatcher, latch)
-        yield() // TCE
+        yield() // TCE: make sure `sleepingOuterMethod` is contained in the continuation of `sleepingNestedMethod`


So this one avoids being inlined into the test method, so that sleepingOuterMethod definitely appears in the stacktrace.

murfel · 2025-12-15T12:56:52Z

kotlinx-coroutines-debug/test/CoroutinesDumpTest.kt

            latch.complete(Unit)
        }
        delay(Long.MAX_VALUE)
+        yield() // TCE: make sure `sleepingNestedMethod` is contained in the continuation of `delay`


And this one avoids being inlined into sleepingOuterMethod, so that sleepingNestedMethod definitely appears in the stacktrace

I'm struggling to make the step from your comment to my understanding. Firstly, I don't understand how yield prevents inlining, nor how the fact that yield is contained in a continuation prevents inlining.

Oh wait, let me read your huge comment above, which I didn't see.

I kind of see how it works now, but your comment skips a few steps of reasoning, so it's a bit hard to decipher without extra thinking or additional context.

Could you add this info to your comment:

Used as an explicit suspension point which enforces state-machine being generated for sleepingNestedMethod

Without the state machine, the delay's continuation is sleepingOuterMethod.

And I don't think that "TCE" comment is useful, really, either here or above. It is in fact tail-call elimination elimination, TCEE.

There's a balance to strike so that the useful information contained in a comment doesn't get obscured by restating the knowledge that's implicitly required. Using yield() to ensure a suspend function's stack frame is preserved is a common pattern in our tests, and restating this everywhere feels excessive to me. In my opinion, https://github.com/Kotlin/KEEP/blob/main/proposals/KEEP-0164-coroutines.md, when taken into account together with the TCE mark, provides enough context for the yield().

In any case, this discussion doesn't feel relevant to the PR we're discussing. If you would like us to improve the documentation of the places where we disable the tail call optimization of suspend functions, we can do so in a separate change applied consistently in the codebase.


murfel · 2025-12-15T13:02:34Z

kotlinx-coroutines-debug/test/CoroutinesDumpTest.kt

    private fun awaitCoroutine() = synchronized(monitor) {
        while (coroutineThread == null) (monitor as Object).wait()
        while (coroutineThread!!.state != Thread.State.TIMED_WAITING) {
            // Wait until thread sleeps to have a consistent stacktrace
        }
    }


Could you leave a comment that awaitCoroutine is problematic but there's no obvious fix?

dkhalanskyjb requested a review from murfel December 11, 2025 13:05

dkhalanskyjb mentioned this pull request Dec 11, 2025

CoroutinesDumpTest#testSuspendedCoroutine failed #4418

Open

murfel reviewed Dec 12, 2025


dkhalanskyjb requested a review from murfel December 12, 2025 14:42

Clarify the yield() calls

3c44b9f

murfel reviewed Dec 15, 2025

