@kjplows kjplows commented Mar 27, 2025

This PR adds a suite of scripts that handle metadata for data processing in SBND.
These scripts are designed to manually construct the metadata for data files according to processing stage and output, and bypass the metadata construction of POMS.

A concise summary of files:

  • sbndpoms_metadata_injector.sh: Adds extra information and sets crucial environmental variables for SAM.
  • rm_empty_files.sh: Runs the count_events executable and deletes files with 0 events.
  • metadata_prescripts.sh: Inhales variables from parent files, constructs wrapper fcls, and determines the stream name.
  • metadata_postscripts.sh: Workhorse file to handle ifdh calls and metadata construction.
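As a rough illustration of the rm_empty_files.sh step, here is a minimal sketch, assuming count_events prints the number of events in a given file. The COUNT_CMD override is a hypothetical hook added only so the logic can be exercised without the real executable; it is not part of the actual script.

```shell
#!/bin/bash
# Sketch only: delete files that contain zero events.
# The real script invokes the count_events executable; COUNT_CMD is a
# hypothetical override so the logic can be tested without it.
remove_if_empty() {
    local f="$1" n
    n=$("${COUNT_CMD:-count_events}" "$f") || return 1
    if [ "$n" -eq 0 ]; then
        echo "Removing empty file: $f"
        rm -f -- "$f"
    fi
}
```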

Edit: Since this PR was originally opened in March, much more functionality has been added. This PR is a collaborative effort between myself, Thomas Wester (@tbwester), and Mateus F. Carneiro (@mattfcs), and represents the quasi-final state of the repository.

New functionality is added in two main areas: configuration files, and scripts.

Configuration file updates

MC

  1. MCP2025B MC: Added cfgs of the form MCP2025B-${stage_combination}.cfg, where ${stage_combination} is one of:
  • gen_g4_detsim_reco1: Single-stage from generation through to reco1 on a node.
  • reco2_caf: One configuration file for two single stages, reco2 and caf.
  • scrub: Using reco1 input, prepares the files to be fed into g4 again for new detector / physics variations. Single stage.
  • scrub_detsim_reco1: Single stage workflow, scrub into reco1.
  • g4_detsim_reco1: Same as gen_g4_detsim_reco1 without the generation stage.
  • reco1_reco2_caf: Same as reco2_caf adding a single reco1 stage.
  2. MCP2025C MC: Added a cfg, MCP2025C-gen_g4_detsim_reco1_reco2_caf.cfg, implementing a single-stage workflow where generation, g4, detector simulation, reco1, reco2, and caf are all run in the same job on the grid. This prevents unnecessary copying back of files and better utilises resources on the grid.

Nota bene: This cfg was made for the SBND Fall production, which implements a custom workflow "inserting" the SBN release of GENIE, genie v3_06_02_sbn1p1 hosted on the SBN OpenScienceGrid, into the cfg. There are therefore "non-standard" prescripts in this cfg, which would not normally be present in an MC workflow.

DATA

  1. MCP2025B DATA: Added two cfgs, MCP2025B-DATA[-NOPOT]-decoder_reco1_reco2_caf.cfg, implementing data processing in multiple single stages, decode --> reco1 --> reco2 --> caf (i.e. raw files are sent to the grid to be decoded, the decoded files are copied back, then sent back to the grid to run reco1, etc.).

The qualifier [NOPOT] means an identical cfg, where no POT module fcl file (e.g. run_sbndbnbinfo_sbn.fcl) is run. This is occasionally required, e.g. in the case of runs where the MWR device is not running and therefore not accessible from ifbeam, or in the processing of crossingmuon streams.

  2. MCP2025C DATA: Added three configs:
  • MCP2025C-DATA-decode_reco1_reco2_caf.cfg: Implements data processing from raw to caf in a single stage.
  • MCP2025C-DATA-NOPOT-decode_reco1_reco2_caf.cfg: Same as above, without POT module being run.
  • MCP2025C-DATA-reco2_caf.cfg: Data processing only for reco2 --> caf, done as a single stage. This is useful for reprocessing DATA runs, taking reco1 files as input.

Nota bene: The MCP2025C DATA configs implement %(extrafclfile)s, designed to let the user insert additional fcl files to be run in POMS with minimal tweaks to the config. To do this, simply add an [executable_N] stage at the place where the fcl should run, paying attention to the arguments. This is done in two places in the cfg: in the main [executable] block, where a placeholder echo executable is declared, and in the [stage_${stage}] block, where the overrides are set explicitly as executable_N.name=${name of exe} (typically lar), executable_N.arg_1=${first argument} (e.g. -c for lar -c), and so on.
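As a hypothetical sketch of what the two-part pattern described above could look like in a cfg (the stage name, executable number, and fcl name are invented for illustration; consult the actual MCP2025C DATA cfgs for the real layout):

```ini
# In the main [executable] block list: a placeholder that does nothing
# until a stage overrides it.
[executable_4]
name = echo
arg_1 = no-extra-fcl

# In the stage block, override the placeholder to actually run the fcl.
[stage_reco2]
executable_4.name  = lar
executable_4.arg_1 = -c
executable_4.arg_2 = %(extrafclfile)s
```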

Script updates

Manual metadata scripts

These scripts are meant to be run as prescripts or postscripts, overriding the typical metadata handling of the LArSoft TFileMetadata service. This increases the flexibility of metadata declaration and copy-back, and relies on "hand-crafting", on the grid, a JSON file for each output file to be copied back. A concise summary is below:
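For concreteness, a hand-crafted record of this kind might be assembled like the following sketch. Real records carry many more fields (runs, checksums, blinding numbers, etc.); the field set here shows the hand-crafting idea, not the scripts' exact schema.

```shell
#!/bin/bash
# Sketch only: emit a minimal SAM-style JSON metadata record for one
# output file. Field names beyond the basics are assumptions.
make_metadata() {
    local f="$1" tier="$2" parent="$3"
    cat <<EOF
{
  "file_name": "$(basename "$f")",
  "file_size": $(wc -c < "$f" | tr -d ' '),
  "data_tier": "$tier",
  "parents": ["$parent"]
}
EOF
}
```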

Mainline scripts

  1. metadata_prescripts.sh: Sets environmental variables populated with the desired parent file for metadata declaration, and key metadata (such as blinding numbers for data).
  2. metadata_postscripts.sh: Handles metadata declaration for individual single stages. Constructs a JSON file for the grid job output and takes charge of copying it back. Meant to be paired with cfgs where addoutput in [job_output] is trivially set to . Designed to be used with MCP2025B DATA workflows.
  3. metadata_postscripts_onestage.sh: Same as metadata_postscripts.sh, handling the single-stage workflow of MCP2025C-DATA[-NOPOT]-decode_reco1_reco2_caf.cfg.
  4. metadata_postscripts_onestage_reprocessing.sh: Same as metadata_postscripts.sh, handling the reprocessing single-stage workflow of MCP2025C-DATA-reco2_caf.cfg.
  5. metadata_postscripts_MC_onestage.sh: The MC version of metadata_postscripts_onestage.sh, designed to be used with MCP2025C-gen_g4_detsim_reco1_reco2_caf.cfg.

Supporting scripts

  1. build_event_numbers.sh: Designed to be used with DATA workflows. Figures out the events processed in each job (as event numbers in DATA files are not contiguous), and exports the ${MT_EVENTSTRING} environment variable used to populate the sbnd.event_number_list field for DATA files in SAM.
  2. rm_empty_files.sh: Runs the count_events executable and deletes files with 0 events.
  3. sbndpoms_metadata_injector.sh: An "enhanced version" of sbnpoms_metadata_injector.sh that adds extra information and sets some environmental variables.
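The event-list idea in build_event_numbers.sh can be sketched as below. The real script extracts the event numbers from the job output; here they arrive on stdin, and only the environment variable name is taken from the description above.

```shell
#!/bin/bash
# Sketch only: turn a list of (non-contiguous) event numbers, one per
# line, into the comma-separated string exported as MT_EVENTSTRING and
# ultimately written into the sbnd.event_number_list SAM field.
build_event_string() {
    paste -sd, -
}

MT_EVENTSTRING=$(printf '12\n47\n103\n' | build_event_string)
export MT_EVENTSTRING
```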

Custodianship scripts

These scripts are primarily used offline (i.e. on a GPVM) to clean up SAM datasets post-processing.

  1. directory_find_retire_delete.sh: Finds, retires, and deletes files using samweb, or alternatively only deletes the files, in one or more directories.
  2. directory_find_retire_delete_nx.sh: Same as directory_find_retire_delete.sh, @tbwester / @mattfcs can you please let me know the difference here? Thanks!
  3. duplicate_parent_find_delete_retire.sh: Iterates over a samweb dataset, finds files in that dataset deriving from the same parent, and deletes and retires all but the oldest instance of a child of that parent.
  4. duplicate_parent_find_delete_retire_lineage.sh: Same as duplicate_parent_find_delete_retire.sh, but deletes and retires the descendants of duplicate files as well. This is useful in cases where duplication has occurred early on in processing.
  5. new_sam_delete_duplicates.py: Same functionality as duplicate_parent_find_delete_retire.sh, written in Python. This script uses multiprocessing so it runs considerably faster than its Bash equivalent.
  6. sam_declare.py: Python + multiprocessing-enabled script to declare either one file, or all files in a directory, to SAM.
  7. sam_delete_duplicates.py: Python + multiprocessing-enabled script to delete and retire all duplicate files from a SAM dataset.
  8. sam_delete.py: Python + multiprocessing-enabled script to delete and retire all files from a directory. PLEASE EXERCISE MAXIMAL CAUTION WHEN USING THIS SCRIPT.
  9. sbndpoms_duplicate_file_remover.sh: Actually, this looks like it has been deactivated. @mattfcs, should I remove this script from this PR?
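The duplicate-resolution step shared by the duplicate_* scripts can be sketched as follows, assuming a listing of "parent child timestamp" triples (the real scripts obtain the lineage from samweb queries; the input format here is an assumption for illustration):

```shell
#!/bin/bash
# Sketch only: given "parent child timestamp" triples on stdin, keep the
# oldest child per parent and print the younger duplicates as deletion
# candidates.
find_duplicates() {
    sort -k1,1 -k3,3n |
    awk '{ if ($1 == prev) print $2; else prev = $1 }'
}
```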

Nota bene: Most of these scripts support -h and --test or --dry_run options. It is strongly recommended that production experts unfamiliar with these scripts use these options first, to print the commands that would be executed without actually executing them. This prevents accidental deletions, and a LOT of headaches down the line...
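The dry-run guard these scripts rely on amounts to a pattern like this (the function name and flag wiring are illustrative, not the scripts' actual option parsing):

```shell
#!/bin/bash
# Sketch only: print destructive commands instead of running them when
# DRY_RUN is enabled. The actual scripts expose this as --test/--dry_run.
DRY_RUN=${DRY_RUN:-false}
maybe_run() {
    if [ "$DRY_RUN" = true ]; then
        echo "[dry run] $*"
    else
        "$@"
    fi
}
```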

@kjplows kjplows added the enhancement New feature or request label Mar 27, 2025
@kjplows kjplows requested a review from mattfcs March 27, 2025 07:11
@kjplows kjplows self-assigned this Mar 27, 2025
@kjplows kjplows moved this to PR in progress in SBN software development Mar 27, 2025
@kjplows kjplows marked this pull request as ready for review September 1, 2025 12:57
@kjplows kjplows requested a review from tbwester September 1, 2025 12:57
@kjplows kjplows moved this from PR in progress to Open pull requests in SBN software development Sep 24, 2025
sbndpro added 2 commits December 23, 2025 10:55
There are quite a few new scripts, and some cfg files.
Main among the cfg files are MCP2025C, which implement single-stage
processing.
@kjplows kjplows changed the title Added SBND metadata production scripts for data processing SBND configs and scripts, metadata enabled, plus onestage at DATA and MC. MCP2025B and MCP2025C Dec 25, 2025
kjplows commented Dec 25, 2025

@mattfcs, @tbwester , I have done a major rewrite of the documentation for these scripts (probably need to put these in the guide 😅). If you could please take a look after the holidays that'd be great! Would be awesome to have this merged.

P.S. Happy Christmas! 🎅
