Add lmdb as alternative file format
#852
Open
RasmusOrsoe wants to merge 26 commits into graphnet-team:main from RasmusOrsoe:lmdb_pr
This PR extends the list of supported backends to include lmdb, thereby addressing #834 and closing #820.
The main benefits of LMDB are threefold: it requires roughly half the space of SQLite, it has significantly faster random access than SQLite for larger events, and it provides a generic way of pre-computing data representations.
The downsides are subjective: there is no SQL syntax, and accessing large subsets of the dataset in one go is slow.
Major Changes
Adds LMDBWriter:
The writer outputs .lmdb databases, where entries are key-value pairs. The keys are created from the index column (similar to the primary key in SQLite), and the value associated with each key is all extracted data for the given event. The values are serialized, and several common serialization methods (json, pickle, etc.) are supported.
To make the files self-contained, the databases contain a __meta__ entry with information on the serialization method used, and utility functions are added that identify the correct method and use it for deserialization on queries. As a result, the user doesn't need to know the serialization method in order to read the files. Below is an example of a query:
I profiled the query speeds in a usual data loading scenario vs. the event size, and found the following relationship:

The measured query includes both the event-level truth and the pulsemap (deserialized), and is repeated 100 times for each event. Real-time computation of representations and the computational overhead of establishing connections are not included. From the figure it can be seen that for large events, lmdb offers a significant speed-up.
Additionally, the LMDBWriter accepts a list of DataRepresentations, and, if provided, the representations are calculated and stored in the file alongside the other extracted data. Another meta field is written to the files that contains the config files of the representations, allowing users to re-instantiate the data representation modules used to compute the representations. A utility function for retrieving these is added. As such, this PR also closes Graph construction before training #781. An example of retrieving the data representation from the metadata can be seen below:
It is assumed that the data representation used is part of the user's graphnet installation, i.e. exotic representations that are not yet part of the library will fail. There is no robust way around this.
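For readers unfamiliar with the self-describing layout described above, here is a minimal sketch of the idea. It is not the LMDBWriter implementation: a plain Python dict stands in for the LMDB environment, and the helper names (write_events, read_event), the serializer registry, and the choice to always store the __meta__ entry as JSON are assumptions made for illustration only.

```python
import json
import pickle

# Hypothetical registry of serialization methods: name -> (dumps, loads).
SERIALIZERS = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}

META_KEY = b"__meta__"


def write_events(store, events, method="json"):
    """Write one serialized value per event index, plus a metadata entry."""
    dumps, _ = SERIALIZERS[method]
    for index, data in events.items():
        # Keys are derived from the index column of each event.
        store[str(index).encode("utf-8")] = dumps(data)
    # Assumption: the metadata itself is always JSON, so a reader can decode
    # it without knowing which method was used for the event values.
    store[META_KEY] = json.dumps({"serialization": method}).encode("utf-8")


def read_event(store, index):
    """Look up the serialization method in the metadata, then deserialize."""
    meta = json.loads(store[META_KEY].decode("utf-8"))
    _, loads = SERIALIZERS[meta["serialization"]]
    return loads(store[str(index).encode("utf-8")])


# A dict stands in for the key-value database in this sketch.
store = {}
events = {
    7: {"truth": {"energy": 12.5}, "pulsemap": [[0.1, 2.0], [0.3, 1.5]]},
    8: {"truth": {"energy": 3.0}, "pulsemap": [[0.2, 0.7]]},
}
write_events(store, events, method="pickle")

# The caller never states the serialization method when reading.
event = read_event(store, 7)
```

The point of the design is that read_event takes only the store and the index; everything needed to decode the value travels with the file.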
Adds LMDBDataset
The dataset is compatible with the .lmdb files and largely identical to the existing SQLiteDataset. It supports str-selections and has a "pre-computed" mode, where the user may choose to query pre-computed data representations instead of calculating them in real-time.
Adds SQLiteToLMDBConverter
A pre-configured converter that converts existing SQLite databases to lmdb format, similar to our ParquetToSQLiteConverter. This converter also accepts a list of data representations, allowing you to export to lmdb alongside pre-computed representations.
Minor Changes
Extends GraphNeTDataModule to support the lmdb backend
Tagging @astrojarred @giogiopg @Aske-Rosted @sevmag and @pweigel as we've all discussed various aspects of this in the past.
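As a rough illustration of the SQLite-to-key-value conversion described for SQLiteToLMDBConverter, the sketch below groups the rows of each table by an index column and emits one serialized value per event. It is not the converter from this PR: the function name convert, the table layout, the index column name event_no, and the use of a dict in place of an LMDB environment are all assumptions.

```python
import json
import sqlite3
from collections import defaultdict


def convert(conn, tables, index_column="event_no"):
    """Collect all rows per event across tables and serialize one value per event."""
    events = defaultdict(dict)
    for table in tables:
        # Table names are interpolated directly, so they must come from a
        # trusted source in this sketch.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            record = dict(zip(columns, row))
            index = record.pop(index_column)
            events[index].setdefault(table, []).append(record)

    # A dict stands in for the key-value database.
    store = {}
    for index, data in events.items():
        store[str(index).encode("utf-8")] = json.dumps(data).encode("utf-8")
    # Record the serialization method so the output is self-describing.
    store[b"__meta__"] = json.dumps({"serialization": "json"}).encode("utf-8")
    return store


# Small in-memory database with a hypothetical truth and pulsemap table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE truth (event_no INTEGER PRIMARY KEY, energy REAL)")
conn.execute("CREATE TABLE pulses (event_no INTEGER, charge REAL, time REAL)")
conn.executemany("INSERT INTO truth VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany(
    "INSERT INTO pulses VALUES (?, ?, ?)",
    [(1, 0.5, 100.0), (1, 1.5, 120.0), (2, 0.7, 90.0)],
)

store = convert(conn, ["truth", "pulses"])
event_1 = json.loads(store[b"1"].decode("utf-8"))
```

After conversion, a single key lookup returns both the truth and the pulses of an event, which is the access pattern that replaces the per-table SQL queries.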