-
Notifications
You must be signed in to change notification settings - Fork 432
Fix race condition in async UtxoFuture resolution
#4348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Elias Rohrer <dev@tnull.de>
|
👋 Hi! This PR is now in draft status. |
2bbde85 to
7c871f2
Compare
Previously, we refactored the `GossipVerifier` to not require holding a circular reference. As part of this, we moved to a model where the `UtxoFuture`s are now polled by the background processor which checks for completion through `get_and_clear_pending_msg_events`. However, as part of this refactor we introduced race-condition: as we only held `Weak` references in `PendingChecksContext` and the `UtxoFuture` was directly dropped by the `GossipVerifier` after calling `resolve`, the actual data was dropped with the future and gone when the background processor attempted to retrieve and apply it via `check_resolved_futures`. Here, we fix this issue by simply converting the held channels field in `PendingChecksContext` to hold an `Arc`, ensuring the data is still present even after the `UtxoFuture` is dropped. Signed-off-by: Elias Rohrer <dev@tnull.de>
7c871f2 to
5c757b3
Compare
| chan_update_a, | ||
| chan_update_b, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanna just inline the builders here? There's not really any reason for the intermediate variables at this point.
| assert!(network_graph | ||
| .read_only() | ||
| .channels() | ||
| .get(&valid_announcement.contents.short_channel_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bunch of places here where we could reasonably replace things like valid_announcement.contents.short_channel_id with an scid and make the whole thing more readable.
| .get(&valid_announcement.contents.short_channel_id).as_ref().unwrap() | ||
| .two_to_one.as_ref().unwrap().last_update); | ||
| assert!( | ||
| graph_lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol this is definitely not readable, vertical or not my eyes cannot parse this.
| .read_only() | ||
| .channels() | ||
| .get(&valid_announcement.contents.short_channel_id) | ||
| .unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. the rustfmt behavior of treating things like as_ref().unwrap() as "equally important" as ".get(X) or ".features" makes this so hard to follow.
| // (i.e. checks where the user told us they'd call back but drop'd the `UtxoFuture` | ||
| // instead) before we commit to applying backpressure. | ||
| pending_checks.channels.retain(|_, chan| Weak::upgrade(&chan).is_some()); | ||
| pending_checks.channels.retain(|_, chan| Arc::strong_count(chan) > 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this race check_resolved_futures and hit the same issue that we're fixing again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, potentially. Note that I'm still experimenting with the approach and am not yet fully sold on going the strong count, as, even if we get it just right now, we could could easily mess it up again going forward. Please defer reviewing until this is undrafted.
| let pending_matches = match &pending_msgs | ||
| .unsafe_well_ordered_double_lock_self() | ||
| .channel_announce | ||
| if Arc::strong_count(e.get()) == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we limit this to strong_count == 1 && !resolved? Not exactly critical but if we did just finish validating a channel update and we get a copy of it from a peer we should let the old one finish getting handled in check_resolved_futures rather than dropping it and immediately starting another validation.
| Some(msgs_ref) => { | ||
| if Arc::strong_count(e.get()) == 1 { | ||
| // The future was dropped and the entry is the last reference held. | ||
| e.remove(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This basically re-introduces the bug we're trying to fix in a (hopefully rare) race but by receiving a channel_announcement right after we finish processing some gossip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm aware. I had dropped it before and added it back for experimentation purposes. Please hold on review until this is undrafted.
| let is_from_a = (msg.channel_flags & 1) == 1; | ||
| match Weak::upgrade(e.get()) { | ||
| Some(msgs_ref) => { | ||
| if Arc::strong_count(e.get()) == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we limit this to strong_count == 1 && !resolved? If we get an update and we have a pending resolution that just finished we should just set the update to handle.
Fixes #4346.
Previously, we refactored the
GossipVerifierto not require holding acircular reference. As part of this, we moved to a model where the
UtxoFutures are now polled by the background processor which checksfor completion through
get_and_clear_pending_msg_events.However, as part of this refactor we introduced race-condition: as we
only held
Weakreferences inPendingChecksContextand theUtxoFuturewas directly dropped by theGossipVerifierafter callingresolve, the actual data was dropped with the future and gone when thebackground processor attempted to retrieve and apply it via
check_resolved_futures.Here, we fix this issue by simply converting the held channels field in
PendingChecksContextto hold anArc, ensuring the data is stillpresent even after the
UtxoFutureis dropped.TODO:
test_checks_backpressure_droptest which reuses the same future to resolve many lookups at once. Draft until then.