SBND configs and scripts, metadata enabled, plus onestage at DATA and MC. MCP2025B and MCP2025C #8
Edit: Since this PR was originally opened in March, much more functionality has been added. This PR is a collaborative effort between myself, Thomas Wester (@tbwester), and Mateus F. Carneiro (@mattfcs), and represents the quasi-final state of the repository.
New functionality is added in two main areas: configuration files and scripts.
## Configuration file updates
### MC
- `MCP2025B-${stage_combination}.cfg`, where `${stage_combination}` refers to one of:
  - `gen_g4_detsim_reco1`: Single stage from generation through to `reco1` on a node.
  - `reco2_caf`: One configuration file for two single stages, `reco2` and `caf`.
  - `scrub`: Using `reco1` input, prepares the files to be fed into `g4` again for new detector / physics variations. Single stage.
  - `scrub_detsim_reco1`: Single-stage workflow, `scrub` into `reco1`.
  - `g4_detsim_reco1`: Same as `gen_g4_detsim_reco1` without the generation stage.
  - `reco1_reco2_caf`: Same as `reco2_caf`, adding a single `reco1` stage.
- `MCP2025C-gen_g4_detsim_reco1_reco2_caf.cfg`: Implements a single-stage workflow where generation, `g4`, detector simulation, `reco1`, `reco2`, and `caf` are all run in the same job on the grid. This prevents unnecessary copying back of files and better utilises resources on the grid (a hedged launch example is given below).

Nota bene: This cfg was made for the SBND Fall production, which implements a custom workflow "inserting" the SBN release of GENIE, `genie v3_06_02_sbn1p1`, hosted on the SBN OpenScienceGrid, into the cfg. There are therefore "non-standard" prescripts in this cfg, which would not normally be present in an MC workflow.
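These cfgs follow the fife_launch/POMS `.cfg` layout (note the `[executable]`, `[stage_...]`, and `[job_output]` sections referenced below). As a rough, hedged sketch of a standalone launch; the `-O` override syntax is standard `fife_launch` usage, but the specific option shown is an assumption rather than something defined by these cfgs:

```bash
# Hedged sketch: launching one of the MC cfgs with fife_launch (fife_utils).
setup fife_utils                                  # UPS setup on a GPVM
fife_launch -c MCP2025B-gen_g4_detsim_reco1.cfg \
    -Osubmit.N=100                                # assumed override: job count
```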
### DATA

- `MCP2025B-DATA[-NOPOT]-decoder_reco1_reco2_caf.cfg`: Implements data processing in multiple single stages, `decode` --> `reco1` --> `reco2` --> `caf` (i.e. raw files are sent to the grid to be decoded, the decoded files are copied back, then sent back to the grid to run `reco1`, etc.). The qualifier `[-NOPOT]` denotes an otherwise identical cfg in which no POT module fcl file (e.g. `run_sbndbnbinfo_sbn.fcl`) is run. This is occasionally required, e.g. for runs where the MWR device is not running and therefore not accessible from ifbeam, or in the processing of crossing muon streams.
- `MCP2025C-DATA-decode_reco1_reco2_caf.cfg`: Implements data processing from raw to `caf` in a single stage.
- `MCP2025C-DATA-NOPOT-decode_reco1_reco2_caf.cfg`: Same as above, without the POT module being run.
- `MCP2025C-DATA-reco2_caf.cfg`: Data processing only for `reco2` --> `caf`, done as a single stage. This is useful for reprocessing DATA runs, taking `reco1` files as input.

Nota bene: The MCP2025C DATA configs implement
`%(extrafclfile)s`, designed to allow the user to insert more fcl files to be run in POMS (with minimal tweaks to the config). To do this, one simply adds an `[executable_N]` stage at the place where the extra fcl should run, paying attention to the arguments. This is done in two places in the cfg: in the main `[executable]` block, where an `echo` exe is declared as a placeholder, and in the `[stage_${stage}]` block, where it is explicitly overridden as `executable_N.name=${name of exe}` (typically `lar`), `executable_N.arg_1=${first argument}` (e.g. `-c` for `lar -c`), etc. A hedged sketch is shown below.
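To make the two-step pattern concrete, a minimal sketch, assuming a hypothetical stage named `reco2`; the argument layout is illustrative, not a copy of the actual cfgs:

```ini
# Main block: declare a placeholder executable that does nothing useful by itself.
[executable_2]
name = echo
arg_1 = no-extra-fcl-configured

# Hypothetical stage block: override the placeholder so the extra fcl is actually run.
[stage_reco2]
executable_2.name = lar
executable_2.arg_1 = -c
executable_2.arg_2 = %(extrafclfile)s
```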
## Script updates

### Manual metadata scripts
These scripts are meant to be run as prescripts or postscripts, overriding the typical metadata handling of the LArSoft `TFileMetadata` service. This is done to increase the flexibility of metadata declaration and copy-back, and relies on "hand-crafting" JSON files on the grid for each stage to be copied back. A concise summary is below:

#### Mainline scripts
- `metadata_prescripts.sh`: Sets environment variables populated with the desired parent file for metadata declaration, and with key metadata (such as blinding numbers for data).
- `metadata_postscripts.sh`: Handles metadata declaration for single individual stages. Constructs a JSON file for the grid job output (a hedged sketch of such a JSON follows this list) and takes charge of copying it back. Meant to be paired with cfgs where `addoutput` in `[job_output]` is trivially set. Designed to be used with MCP2025B DATA workflows.
- `metadata_postscripts_onestage.sh`: Same as `metadata_postscripts.sh`, handling the single-stage workflow of `MCP2025C-DATA[-NOPOT]-decode_reco1_reco2_caf.cfg`.
- `metadata_postscripts_onestage_reprocessing.sh`: Same as `metadata_postscripts.sh`, handling the reprocessing single-stage workflow of `MCP2025C-DATA-reco2_caf.cfg`.
- `metadata_postscripts_MC_onestage.sh`: The MC version of `metadata_postscripts_onestage.sh`, designed to be used with `MCP2025C-gen_g4_detsim_reco1_reco2_caf.cfg`.
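As orientation only, a minimal sketch of a hand-crafted per-stage JSON; the variable name `MT_PARENT`, the file names, and all fields except `sbnd.event_number_list` / `${MT_EVENTSTRING}` are assumptions for the sketch, not the scripts' actual contents:

```bash
#!/bin/bash
# Hedged sketch: assemble a SAM-style metadata JSON for one stage output on the grid.
outfile="hypothetical_reco1_output.root"             # assumed stage output name
parent="${MT_PARENT:-hypothetical_raw_parent.root}"  # assumed prescript-set variable
eventlist="${MT_EVENTSTRING:-[1, 5, 9]}"             # exported by build_event_numbers.sh (below)

cat > "${outfile}.json" <<EOF
{
  "file_name": "${outfile}",
  "file_size": $(stat -c %s "${outfile}"),
  "parents": [ { "file_name": "${parent}" } ],
  "sbnd.event_number_list": ${eventlist}
}
EOF
# The JSON is then copied back alongside the output file for SAM declaration.
```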
#### Supporting scripts

- `build_event_numbers.sh`: Designed to be used with DATA workflows. Figures out the events processed in each job (as event numbers in DATA files are not contiguous), and exports the `${MT_EVENTSTRING}` environment variable used to populate the `sbnd.event_number_list` field for DATA files in SAM.
- `rm_empty_files.sh`: Runs the `count_events` executable and deletes files with 0 events (a hedged sketch of the idea follows this list).
- `sbndpoms_metadata_injector.sh`: An "enhanced version" of `sbnpoms_metadata_injector.sh` that adds extra information and sets some environment variables.
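A minimal sketch of the empty-file pattern, assuming `count_events` prints the event count as the last number in its output (the real script's parsing may differ):

```bash
# Hedged sketch of the rm_empty_files.sh idea: delete artroot files with 0 events.
for f in *.root; do
  n=$(count_events "$f" 2>/dev/null | grep -oE '[0-9]+' | tail -n 1)
  if [ "${n:-0}" -eq 0 ]; then
    echo "Removing empty file: $f"
    rm -f "$f"
  fi
done
```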
### Custodianship scripts

These scripts are primarily used offline (i.e. on a GPVM) to clean up SAM datasets post-processing.
- `directory_find_retire_delete.sh`: Finds, retires, and deletes files using samweb, OR only deletes files in a directory, or multiple directories.
- `directory_find_retire_delete_nx.sh`: Same as `directory_find_retire_delete.sh`; @tbwester / @mattfcs, can you please let me know the difference here? Thanks!
- `duplicate_parent_find_delete_retire.sh`: Iterates over a samweb dataset, finds files in that dataset deriving from the same parent, and deletes and retires all but the oldest instance of a child of that parent (a hedged sketch of this logic follows this list).
- `duplicate_parent_find_delete_retire_lineage.sh`: Same as `duplicate_parent_find_delete_retire.sh`, but deletes and retires the descendants of duplicate files as well. This is useful in cases where duplication has occurred early on in processing.
- `new_sam_delete_duplicates.py`: Same functionality as `duplicate_parent_find_delete_retire.sh`, written in Python. This script uses `multiprocessing`, so it runs considerably faster than its Bash equivalent.
- `sam_declare.py`: Python + `multiprocessing`-enabled script to declare either one file, or all files in a directory, to SAM.
- `sam_delete_duplicates.py`: Python + `multiprocessing`-enabled script to delete and retire all duplicate files from a SAM dataset.
- `sam_delete.py`: Python + `multiprocessing`-enabled script to delete and retire all files from a directory. PLEASE EXERCISE MAXIMAL CAUTION WHEN USING THIS SCRIPT.
- `sbndpoms_duplicate_file_remover.sh`: Actually, this looks like it's been deactivated... @mattfcs, should I remove this script from this PR?
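A hedged sketch of the duplicate-removal logic. The dataset name is hypothetical, "keep the oldest" is simplified here to "keep the first seen", and only standard samweb subcommands (`list-files`, `file-lineage`, `retire-file`) are used:

```bash
#!/bin/bash
# Hedged sketch: keep one child per parent in a SAM dataset, flag the rest.
defname="hypothetical_reco2_dataset"   # assumed dataset definition name
declare -A seen                        # parent -> first child encountered
for f in $(samweb list-files "defname: ${defname}"); do
  parent=$(samweb file-lineage parents "$f" | head -n 1)
  if [ -n "${seen[$parent]}" ]; then
    echo "Duplicate child of ${parent}: would retire $f"
    # samweb retire-file "$f"          # enable only after a careful dry run
  else
    seen[$parent]="$f"
  fi
done
```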
Nota bene: Most of these scripts support `-h` and `--test` or `--dry_run` options. It is very strongly recommended that production experts unfamiliar with these scripts use these options first, to print out the commands that would be executed without actually executing them. This prevents accidental deletions, and a LOT of headaches down the line...
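For example (the positional argument here is hypothetical; check each script's `-h` output for the actual interface):

```bash
./sam_delete_duplicates.py -h                     # inspect the real interface first
./sam_delete_duplicates.py --dry_run my_dataset   # print the commands, execute nothing
```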