@RasmusOrsoe RasmusOrsoe commented Dec 9, 2025

This PR extends the list of supported backends to include lmdb, thereby addressing #834 and closing #820.

The main benefits of LMDB are threefold: It requires roughly half the space of SQLite, it has significantly faster random access for larger events than SQLite, and it provides a generic way of pre-computing data representations.

The downsides depend on the use case: there is no SQL syntax, and accessing large subsets of the dataset in one go is slow.

Major Changes

  • Adds LMDBWriter:
    The writer outputs .lmdb databases, where entries are key-value pairs. The keys are created from the index column (similar to the primary key in sqlite), and the value associated with each entry is all extracted data for the given event. The values are serialized, and several common serialization methods (json, pickle, etc.) are supported.

    To make the files self-contained, the databases contain a __meta__ entry with information on the serialization method used, and utility functions are added that will identify the correct method and use it for deserialization on queries. As a result, the user doesn't need to know the serialization method in order to read the files. Below is an example of a query:

    from graphnet.data.utilities.lmdb_utilities import query_database
    
    lmdb_path = "~/merged/merged.lmdb"
    event_no = 1
    
    # The query function automatically detects the serializer used and will deserialize the blob
    # Result is a dict with all table entries for a single event
    # Every entry in the dict is a `table`. E.g. result["truth"] 
    
    result = query_database(lmdb_path, event_no)
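    The self-describing `__meta__` pattern can be sketched with a plain dict standing in for the LMDB environment. This is an illustrative stand-in, not graphnet's actual implementation; `write_entry` and `read_entry` are hypothetical names.

```python
import json
import pickle

# Dispatch table mapping a serializer name to (serialize, deserialize) callables.
SERIALIZERS = {
    "json": (lambda v: json.dumps(v).encode(), lambda b: json.loads(b.decode())),
    "pickle": (pickle.dumps, pickle.loads),
}


def write_entry(db: dict, key: str, value, serializer: str = "pickle") -> None:
    """Store a value and record the serializer used in a __meta__ entry."""
    db["__meta__"] = json.dumps({"serializer": serializer}).encode()
    db[key] = SERIALIZERS[serializer][0](value)


def read_entry(db: dict, key: str):
    """Read __meta__ first, then deserialize with the recorded method."""
    meta = json.loads(db["__meta__"].decode())
    return SERIALIZERS[meta["serializer"]][1](db[key])


db = {}
write_entry(db, "1", {"truth": {"energy": 10.0}}, serializer="json")
print(read_entry(db, "1"))  # {'truth': {'energy': 10.0}}
```

    Because the file records its own serializer, a reader never has to be told which method was used at write time.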

    I profiled the query speed in a typical data-loading scenario as a function of event size, and found the following relationship:

    [Figure: query speed vs. event size]

    The query speed includes both event-level truth and the pulsemap (deserialized), and is repeated 100 times for each event. Real-time computation of representations and the computational overhead of establishing connections are not included. From the figure it can be seen that for large events, lmdb offers a significant speed-up.
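    The profiling procedure described above can be approximated with `time.perf_counter`. This is a generic sketch; the event payloads and repeat count are illustrative, and a dict stands in for the LMDB get.

```python
import pickle
import statistics
import time


def time_query(db: dict, key: str, repeats: int = 100) -> float:
    """Median wall-clock time of fetching and deserializing one event."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pickle.loads(db[key])  # stand-in for an LMDB get + deserialization
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Events of increasing size (number of pulses) to expose the scaling.
db = {str(n): pickle.dumps({"pulsemap": list(range(n))}) for n in (10, 1000, 100000)}
for key in db:
    print(key, f"{time_query(db, key):.2e} s")
```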

    Additionally, the LMDBWriter accepts a list of DataRepresentations; if provided, the representations are calculated and stored in the file alongside the other extracted data. Another meta field, containing the config files of the representations, is written to the files, allowing users to re-instantiate the data representation modules used to compute the representations. A utility function for retrieving these is added. As such, this PR also closes #781 (Graph construction before training). An example of retrieving a data representation from the metadata can be seen below:

    from graphnet.data.utilities.lmdb_utilities import query_database, get_data_representation_from_metadata
    
    event_no = 0
    lmdb_path = "~/merged/merged.lmdb"
    # Return list of available representations
    query_database(database=lmdb_path, index=event_no)["data_representations"].keys()
    
    # Example output: dict_keys(['KNNGraph', 'GraphDefinition'])
    
    # Get the data representation for 'KNNGraph' from the metadata
    data_representation = get_data_representation_from_metadata(lmdb_path, "KNNGraph")

    It is assumed that the data representation used is part of the user's graphnet installation, i.e. exotic representations that are not yet part of the library will fail. There is no robust way around this.
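    This limitation can be illustrated with a minimal config-based re-instantiation scheme (a generic sketch, not graphnet's actual Config machinery): the stored config names a class, and deserialization looks that name up in the local installation, so an unknown class raises.

```python
import importlib
import json


def instantiate_from_config(config_json: str):
    """Rebuild an object from a stored {"module", "class_name", "kwargs"} config.

    Raises ImportError/AttributeError if the class is not importable locally --
    the same failure mode as an 'exotic' representation that is not part of
    the installed library.
    """
    config = json.loads(config_json)
    module = importlib.import_module(config["module"])
    cls = getattr(module, config["class_name"])  # raises if the class is unknown
    return cls(**config["kwargs"])


# A class that exists in the installation reconstructs fine:
cfg = json.dumps({"module": "collections", "class_name": "Counter", "kwargs": {}})
print(type(instantiate_from_config(cfg)).__name__)  # Counter

# A class missing from the installation fails:
bad = json.dumps({"module": "collections", "class_name": "ExoticGraph", "kwargs": {}})
try:
    instantiate_from_config(bad)
except AttributeError as err:
    print("failed:", err)
```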

  • Adds LMDBDataset
    The dataset is compatible with the .lmdb files and largely identical to the existing SQLiteDataset. It supports str-selections and has a "pre-computed" mode, where the user may choose to query pre-computed data representations instead of calculating them in real-time.
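    The two access modes can be sketched as follows. This is a toy illustration, not the LMDBDataset API: `ToyDataset` and `knn_representation` are hypothetical names, and a dict stands in for the LMDB file.

```python
import pickle


def knn_representation(pulses: list) -> dict:
    """Stand-in for an expensive real-time representation computation."""
    return {"n_pulses": len(pulses), "mean": sum(pulses) / len(pulses)}


class ToyDataset:
    """Minimal sketch of pre-computed vs. real-time representation modes."""

    def __init__(self, db: dict, use_precomputed: bool):
        self._db = db
        self._use_precomputed = use_precomputed

    def __getitem__(self, key: str) -> dict:
        event = pickle.loads(self._db[key])
        if self._use_precomputed:
            # Query the stored representation instead of recomputing it.
            return event["data_representations"]["KNNGraph"]
        return knn_representation(event["pulsemap"])


db = {"1": pickle.dumps({
    "pulsemap": [1.0, 2.0, 3.0],
    "data_representations": {"KNNGraph": {"n_pulses": 3, "mean": 2.0}},
})}
# Both modes yield the same representation; only where it comes from differs.
assert ToyDataset(db, use_precomputed=True)["1"] == ToyDataset(db, use_precomputed=False)["1"]
```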

  • Adds SQLiteToLMDBConverter
    A pre-configured converter that converts existing sqlite databases to lmdb format, similar to our ParquetToSQLiteConverter. This converter also accepts a list of data representations, allowing you to export to lmdb alongside pre-computed representations.
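    The core of such a conversion is regrouping row-oriented SQL tables into one self-contained record per event. The sketch below uses only the standard library (`sqlite3` + `pickle`) with a dict standing in for the LMDB output; `sqlite_to_kv` is a hypothetical name, not the converter's actual interface.

```python
import pickle
import sqlite3


def sqlite_to_kv(connection: sqlite3.Connection, tables: list, index_column: str) -> dict:
    """Group every table's rows by the index column into one record per event.

    Returns a dict mapping each index value to a serialized
    {table_name: [row dicts]} payload, mimicking the key-value layout.
    """
    connection.row_factory = sqlite3.Row
    grouped: dict = {}
    for table in tables:
        for row in connection.execute(f"SELECT * FROM {table}"):
            record = dict(row)
            event = grouped.setdefault(str(record[index_column]), {})
            event.setdefault(table, []).append(record)
    return {key: pickle.dumps(value) for key, value in grouped.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE truth (event_no INTEGER, energy REAL)")
conn.execute("CREATE TABLE pulses (event_no INTEGER, charge REAL)")
conn.executemany("INSERT INTO truth VALUES (?, ?)", [(1, 10.0)])
conn.executemany("INSERT INTO pulses VALUES (?, ?)", [(1, 0.5), (1, 0.7)])
kv = sqlite_to_kv(conn, ["truth", "pulses"], "event_no")
print(pickle.loads(kv["1"])["pulses"])  # two pulse rows for event 1
```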

Minor Changes

  • Expanded the test suite to include the lmdb backend for unit testing
  • Replaced deprecated converters in the relevant test suite
  • Icetray conversion example adjusted to include the lmdb backend (which is now the default format in the example)
  • Added utility functions (graphnet.data.utilities.lmdb_utilities) for querying events, etc.
  • Minor updates of the documentation. I kept these very small to not bloat the PR further.
  • Updated the GraphNeTDataModule to support the lmdb backend

Tagging @astrojarred @giogiopg @Aske-Rosted @sevmag and @pweigel as we've all discussed various aspects of this in the past.

@RasmusOrsoe RasmusOrsoe changed the title Lmdb pr Add lmdb as alternative file format Dec 10, 2025
@RasmusOrsoe RasmusOrsoe requested a review from giogiopg December 10, 2025 10:32
