Skip to content

SPANK plugin#125

Open
nchaimov wants to merge 19 commits intollnl:develfrom
ParaToolsInc:spank-plugin
Open

SPANK plugin#125
nchaimov wants to merge 19 commits intollnl:develfrom
ParaToolsInc:spank-plugin

Conversation

@nchaimov
Copy link
Collaborator

This pull request updates the Slurm SPANK plugin. It includes the commits from #52, for which the original description was:

Besides fixing several bugs in the context of the SPANK plugin this PR mainly introduces the utilization of spank_prepend_task_argv().

spank_prepend_task_argv() was introduced in Slurm 23.11 and allows to manipulate the (to be spawned) task's argv from within a SPANK plugin. This makes Spindle's SPANK plugin working again with Slurm 23.11 and above.

It seams to me that the SPANK plugin broke once the tweaking of libc's jump table in spindleHookSpindleArgsIntoExecBE() and interceptExecForMap() stopped working due to making this table read-only.

The following additional changes are made:

  • The commits from Enhance and fix SPANK plugin #52 are rebased against the current devel branch and updated to work with the refactored config mechanism.
  • A bug is fixed where parse_location_impl would segfault if a custom getenv was used because memory would be freed which was not allocated through malloc.
  • The configure script now allows building with newer Slurm versions without rshlaunch if the SPANK plugin is enabled.
  • Bugs are fixed where spindle_args_t was not initialized in the SPANK plugin.
  • Exit processing is improved to ensure that the BEs and logging daemons properly exit at the end of a step. The isBEProc function is adapted to also support selecting a single process per node to send the shutdown message.
  • Handling is added for the case where srun --spindle is used but the Spindle level is off.
  • A test driver is added for RM=slurm-plugin
  • An additional CI job is added which tests the SPANK plugin. Both the srun launcher with rshlaunch and the SPANK plugin are tested in two separate jobs.

Known issue: Support for sessions in the SPANK plugin is not implemented and will be addressed in a future PR.

@nchaimov
Copy link
Collaborator Author

Modified CI testing to use different container names for the srun/rshlaunch vs. SPANK jobs. While this didn't make any difference for GitHub Actions, using the same names prevented running these jobs simultaneously when running the CI locally through gh act.

Copy link
Collaborator

@rountree rountree left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go.

if (result == -1) {
sdprintf(1, "ERROR: spindleExitBE returned and error on location %s\n", args.location);
return -1;
if (!args.location) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commpath branch gets rid of location. Changing args.location to args.commpath should work, but I'd want to test that out. If your PR goes in first, I can update this in the commpath branch.

j += env_value_len;
#if defined(CUSTOM_GETENV) && defined(CUSTOM_GETENV_FREE)
free(env_value);
if(env_value != env_value_str) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand how this fixes the underlying problem, but it did take a while to figure out the code paths (which is an issue of the original function, not this fix). Would like the function to be more obviously correct, but this PR probably isn't the place for that.

@nchaimov
Copy link
Collaborator Author

nchaimov commented Feb 6, 2026

I've found a bug in this when running inside an allocation on fewer nodes than were allocated. Hold off on merging until I can fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants