Access reason why slurm stopped a job
Is there a way to find out why a job was canceled by slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (like a manual cancellation). In case a resource limit was hit, I would also like to know which one.
The slurm log file contains that information explicitly. It is also written to the job's output file, with something like:
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
or
Job <jobid> exceeded <mem> memory limit, being killed:
or
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
etc.
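
To tell these cases apart automatically, the job's output file can be scanned for those messages. The following is only a minimal sketch: the function names and reason labels are made up for illustration, and the patterns are assumptions based on the message shapes quoted above, whose exact wording may differ between Slurm versions and configurations.

```python
import re

# Patterns are assumptions derived from the example messages above;
# matching is case-insensitive and keyed on the distinctive phrases only.
_REASONS = [
    ("time_limit", re.compile(r"cancelled\b.*\btime limit", re.IGNORECASE)),
    ("memory_limit", re.compile(r"exceeded\b.*\bmemory limit", re.IGNORECASE)),
    ("node_failure", re.compile(r"cancelled\b.*\bnode failure", re.IGNORECASE)),
    # Fallback: a cancellation message that names no recognised cause,
    # which is what a manual cancellation would typically look like.
    ("cancelled_other", re.compile(r"\bcancelled\b", re.IGNORECASE)),
]


def classify_line(line):
    """Return a reason label for one line, or None if nothing matches."""
    for reason, pattern in _REASONS:
        if pattern.search(line):
            return reason
    return None


def classify_output(text):
    """Return the first stop reason found in the job output, or None."""
    for line in text.splitlines():
        reason = classify_line(line)
        if reason is not None:
            return reason
    return None
```

Calling classify_output on the contents of the job's output file returns one of the labels, or None when no such message is present (for example, when the job finished normally).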