-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Improve memory accounting for ArrowBytesViewMap #20077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Improve memory accounting for ArrowBytesViewMap #20077
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR attempts to improve memory accounting for ArrowBytesViewMap by changing from calculating null buffer size based on logical length (self.nulls.len() / 8) to using the allocated size (self.nulls.allocated_size()). The intent is to account for allocated memory rather than just the used memory, which aligns with DataFusion's memory accounting goals.
Changes:
- Modified
ArrowBytesViewMap::size()method to useNullBufferBuilder::allocated_size()for calculating null buffer memory usage
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @vigneshsiva11 🙏
| let in_progress_size = self.in_progress.capacity(); | ||
| let completed_size: usize = self.completed.iter().map(|b| b.len()).sum(); | ||
| let nulls_size = self.nulls.len() / 8; | ||
| let nulls_size = self.nulls.allocated_size() / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocated size is in bytes (not bits) https://docs.rs/arrow/latest/arrow/array/struct.NullBufferBuilder.html#method.allocated_size
Return the allocated size of this builder, in bytes, useful for memory accounting.
So I don't think we should still be dividing by 8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification and the link! I’ve updated the code to use allocated_size() directly without dividing by 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code still appears to divide by 8 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the comment in datafusion/physical-plan/src/aggregates/group_values/null_builder.rs:66, "NullBufferBuilder::allocated_size returns capacity in bits". Other places in the codebase that use NullBufferBuilder directly (such as MaybeNullBufferBuilder::allocated_size() at datafusion/physical-plan/src/aggregates/group_values/null_builder.rs:67
Copilot is confused by this comment elsewhere in the source code (it's there).
Which issue does this PR close?
Rationale for this change
ArrowBytesViewMap was previously accounting for the logical number of null bits when reporting memory usage. This under-reported memory consumption is because NullBufferBuilder may allocate more memory than is currently used.
Memory accounting in DataFusion is expected to reflect allocated memory rather than logical usage to ensure accurate memory tracking.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?