Skip to content

Conversation

@tnull
Copy link
Contributor

@tnull tnull commented Jan 26, 2026

Fixes #4346.

Previously, we refactored the GossipVerifier to not require holding a
circular reference. As part of this, we moved to a model where the
UtxoFutures are now polled by the background processor which checks
for completion through get_and_clear_pending_msg_events.

However, as part of this refactor we introduced race-condition: as we
only held Weak references in PendingChecksContext and the
UtxoFuture was directly dropped by the GossipVerifier after calling
resolve, the actual data was dropped with the future and gone when the
background processor attempted to retrieve and apply it via
check_resolved_futures.

Here, we fix this issue by simply converting the held channels field in
PendingChecksContext to hold an Arc, ensuring the data is still
present even after the UtxoFuture is dropped.

TODO:

  • Fix failing test_checks_backpressure_drop test which reuses the same future to resolve many lookups at once. Draft until then.

Signed-off-by: Elias Rohrer <dev@tnull.de>
@ldk-reviews-bot
Copy link

ldk-reviews-bot commented Jan 26, 2026

👋 Hi! This PR is now in draft status.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

@tnull tnull marked this pull request as draft January 26, 2026 16:16
@tnull tnull force-pushed the 2026-01-fix-gossip-verification branch from 2bbde85 to 7c871f2 Compare January 26, 2026 16:18
Previously, we refactored the `GossipVerifier` to not require holding a
circular reference. As part of this, we moved to a model where the
`UtxoFuture`s are now polled by the background processor which checks
for completion through `get_and_clear_pending_msg_events`.

However, as part of this refactor we introduced race-condition: as we
only held `Weak` references in `PendingChecksContext` and the
`UtxoFuture` was directly dropped by the `GossipVerifier` after calling
`resolve`, the actual data was dropped with the future and gone when the
background processor attempted to retrieve and apply it via
`check_resolved_futures`.

Here, we fix this issue by simply converting the held channels field in
`PendingChecksContext` to hold an `Arc`, ensuring the data is still
present even after the `UtxoFuture` is dropped.

Signed-off-by: Elias Rohrer <dev@tnull.de>
@tnull tnull force-pushed the 2026-01-fix-gossip-verification branch from 7c871f2 to 5c757b3 Compare January 26, 2026 16:18
Comment on lines +679 to +680
chan_update_a,
chan_update_b,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanna just inline the builders here? There's not really any reason for the intermediate variables at this point.

assert!(network_graph
.read_only()
.channels()
.get(&valid_announcement.contents.short_channel_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bunch of places here where we could reasonably replace things like valid_announcement.contents.short_channel_id with an scid and make the whole thing more readable.

.get(&valid_announcement.contents.short_channel_id).as_ref().unwrap()
.two_to_one.as_ref().unwrap().last_update);
assert!(
graph_lock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol this is definitely not readable, vertical or not my eyes cannot parse this.

.read_only()
.channels()
.get(&valid_announcement.contents.short_channel_id)
.unwrap()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. the rustfmt behavior of treating things like as_ref().unwrap() as "equally important" as ".get(X) or ".features" makes this so hard to follow.

// (i.e. checks where the user told us they'd call back but drop'd the `UtxoFuture`
// instead) before we commit to applying backpressure.
pending_checks.channels.retain(|_, chan| Weak::upgrade(&chan).is_some());
pending_checks.channels.retain(|_, chan| Arc::strong_count(chan) > 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this race check_resolved_futures and hit the same issue that we're fixing again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, potentially. Note that I'm still experimenting with the approach and am not yet fully sold on going the strong count, as, even if we get it just right now, we could could easily mess it up again going forward. Please defer reviewing until this is undrafted.

let pending_matches = match &pending_msgs
.unsafe_well_ordered_double_lock_self()
.channel_announce
if Arc::strong_count(e.get()) == 1 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we limit this to strong_count == 1 && !resolved? Not exactly critical but if we did just finish validating a channel update and we get a copy of it from a peer we should let the old one finish getting handled in check_resolved_futures rather than dropping it and immediately starting another validation.

Some(msgs_ref) => {
if Arc::strong_count(e.get()) == 1 {
// The future was dropped and the entry is the last reference held.
e.remove();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically re-introduces the bug we're trying to fix in a (hopefully rare) race but by receiving a channel_announcement right after we finish processing some gossip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm aware. I had dropped it before and added it back for experimentation purposes. Please hold on review until this is undrafted.

let is_from_a = (msg.channel_flags & 1) == 1;
match Weak::upgrade(e.get()) {
Some(msgs_ref) => {
if Arc::strong_count(e.get()) == 1 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we limit this to strong_count == 1 && !resolved? If we get an update and we have a pending resolution that just finished we should just set the update to handle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gossip doesnt resolve async

3 participants