-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Attempt to fix a flaky coroutine-dump-verifying test #4589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Fixes #4418 (unless it keeps happening) This problem couldn't be reproduced locally, to this fix is purely analytical. The problematic test attempts to launch a coroutine then await until the coroutine suspends. The way it was doing that before the change is: - Hold a monitor and `wait` on the test body side; - Acquire a monitor and `notify` on the coroutine side *right before* the suspension point; - On the test body side, wait for the coroutine thread to enter the `TIMED_WAIT` state, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached. The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed. With this change, we establish a complete happens-before chain: - The test body wakes up after it's `resume`d as a coroutine. - `complete` on a latch happens-before the `resume`. - The suspension happens-before the `complete`, as suspension and the `complete` are done in the same thread. With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.
murfel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case this helps and the test never fails in the next few months, please consider fixing the other usages of awaitCoroutine here as well.
|
|
||
| private suspend fun sleepingNestedMethod() { | ||
| private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) { | ||
| yield() // Suspension point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused why this yield is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it to ensure that the test body arrives at the latch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's an insurance against any functions being optimized out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Removing this yield breaks the test. Adding yield back, but after your code but before delay (L206.5) makes the test pass. Is this consistent with your understanding of being optimized out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deciphered the yield() // TCE on L195 and confirm that that is needed, but this one is not clear for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this consistent with your understanding of being optimized out?
Pretty much.
I've double-checked what's going on, here's the decompiled version of sleepingNestedMethod with yield:
private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
Continuation $continuation;
label27: {
if ($completion instanceof <undefinedtype>) {
$continuation = (<undefinedtype>)$completion;
if (($continuation.label & Integer.MIN_VALUE) != 0) {
$continuation.label -= Integer.MIN_VALUE;
break label27;
}
}
$continuation = new ContinuationImpl($completion) {
Object L$0;
Object L$1;
// $FF: synthetic field
Object result;
int label;
@Nullable
public final Object invokeSuspend(@NotNull Object $result) {
this.result = $result;
this.label |= Integer.MIN_VALUE;
return CoroutinesDumpTest.this.sleepingNestedMethod((CoroutineDispatcher)null, (CompletableDeferred)null, (Continuation)this);
}
};
}
Object $result = $continuation.result;
Object var6 = IntrinsicsKt.getCOROUTINE_SUSPENDED();
switch ($continuation.label) {
case 0:
ResultKt.throwOnFailure($result);
$continuation.L$0 = currentDispatcher;
$continuation.L$1 = latch;
$continuation.label = 1;
if (YieldKt.yield($continuation) == var6) {
return var6;
}
break;
case 1:
latch = (CompletableDeferred)$continuation.L$1;
currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
ResultKt.throwOnFailure($result);
break;
case 2:
latch = (CompletableDeferred)$continuation.L$1;
currentDispatcher = (CoroutineDispatcher)$continuation.L$0;
ResultKt.throwOnFailure($result);
return Unit.INSTANCE;
default:
throw new IllegalStateException("call to 'resume' before 'invoke' with coroutine");
}
currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
$continuation.L$0 = SpillingKt.nullOutSpilledVariable(currentDispatcher);
$continuation.L$1 = SpillingKt.nullOutSpilledVariable(latch);
$continuation.label = 2;
if (DelayKt.delay(Long.MAX_VALUE, $continuation) == var6) {
return var6;
} else {
return Unit.INSTANCE;
}
}and without:
private final Object sleepingNestedMethod(CoroutineDispatcher currentDispatcher, CompletableDeferred latch, Continuation $completion) {
currentDispatcher.dispatch((CoroutineContext)currentDispatcher, CoroutinesDumpTest::sleepingNestedMethod$lambda$0);
Object var10000 = DelayKt.delay(Long.MAX_VALUE, $completion);
return var10000 == IntrinsicsKt.getCOROUTINE_SUSPENDED() ? var10000 : Unit.INSTANCE;
}Without yield, the whole state machine of sleepingNestedMethod gets eliminated. This means the continuation of delay is not sleepingNestedMethod but directly sleepingOuterMethod. If we force the state machine of sleepingNestedMethod to get generated, delay has to return to it.
The test will also pass with this:
private suspend fun sleepingNestedMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) {
/* Schedule a computation on the current single-threaded dispatcher.
Since that thread is currently running this code,
the start notification will happen *after* the currently running coroutine suspends. */
currentDispatcher.dispatch(currentDispatcher) {
coroutineThread = Thread.currentThread()
latch.complete(Unit)
}
delay(Long.MAX_VALUE)
yield() // <----------------
}I find this version a clearer instruction to the compiler, so I moved the yield().
I'd gladly do that, but for the coroutines in other tests, we're interested in points when they are still running, not suspended, so I don't know how to establish a robust happens-before relationship. |
|
Oh. I guess a comment will do, or at least a mental note. |
| private suspend fun sleepingOuterMethod(currentDispatcher: CoroutineDispatcher, latch: CompletableDeferred<Unit>) { | ||
| sleepingNestedMethod(currentDispatcher, latch) | ||
| yield() // TCE | ||
| yield() // TCE: make sure `sleepingOuterMethod` is contained in the continuation of `sleepingNestedMethod` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this one avoids being inlined into the test method, so that sleepingOuterMethod definitely appears in the stackstrace
| latch.complete(Unit) | ||
| } | ||
| delay(Long.MAX_VALUE) | ||
| yield() // TCE: make sure `sleepingNestedMethod` is contained in the continuation of `delay` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one avoids being inlined into sleepingOuterMethod, so that sleepingNestedMethod definitely appears in the stacktrace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to make the step from your comment to my understanding. Firstly, because I don't understand how's yield preventing inlining, and also how's the fact that yield is contained in a continuation prevents inlining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, let me read your huge comment above, which I didn't see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of see how it works now, but your comments skips a few steps of reasoning, so it's a bit hard to decipher without extra thinking or additional context.
Could add info to your comment:
- Used as an explicit suspension point which enforces state-machine being generated for
sleepingNestedMethod - Without the state machine, the delay's continuation is
sleepingOuterMethod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I don't think that "TCE" comment is useful, really, either here or above. It is in fact tail-call elimination elimination, TCEE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a balance to strike so that the useful information contained in a comment doesn't get obscured by restating the knowledge that's implicitly required. Using yield() to ensure a suspend function's stack frame is preserved is a common pattern in our tests, and restating this everywhere feels excessive to me. In my opinion, https://github.com/Kotlin/KEEP/blob/main/proposals/KEEP-0164-coroutines.md, when taken into account together with the TCE mark, provide enough context for the yield().
In any case, this discussion doesn't feel relevant to the PR we're discussing. If you would like us to improve the documentation of the places where we disable the tail call optimization of suspend functions, we can do so in a separate change applied consistently in the codebase.
| private fun awaitCoroutine() = synchronized(monitor) { | ||
| while (coroutineThread == null) (monitor as Object).wait() | ||
| while (coroutineThread!!.state != Thread.State.TIMED_WAITING) { | ||
| // Wait until thread sleeps to have a consistent stacktrace | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you leave a comment that awaitCoroutine is problematic but there's no obvious fix?
Fixes #4418
(unless it keeps happening)
This problem couldn't be reproduced locally, to this fix is purely analytical.
The problematic test attempts to launch a coroutine then await until the coroutine suspends.
The way it was doing that before the change is:
waiton the test body side;notifyon the coroutine side right before the suspension point;TIMED_WAITstate, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached.The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed.
With this change, we establish a complete happens-before chain:
resumed as a coroutine.completeon a latch happens-before theresume.complete, as suspension and thecompleteare done in the same thread.With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.