ECCE:HOWTO Run a simulation campaign on the JLab farm


These are instructions for running an ECCE simulation campaign on the JLab farm.

User-specific campaign

These are instructions for how someone can run a small campaign using their own generated event files. One of the main differences here is that the generated event files and the output files will use custom directories.

The following example uses a directory on the /work/eic2 disk. Some generated event files were already at JLab and a couple needed to be copied from BNL S3. The following was done on ifarm1802.

Checkout production scripts and macros

   export MYDIR=/work/eic2/ECCE/users/${USER}/2021.07.02.incl_highq2_xiaochao
   mkdir -p $MYDIR
   cd $MYDIR
   git clone https://github.com/ECCE-EIC/productions
   git clone https://github.com/ECCE-EIC/macros

Modify the Fun4All_G4_EICDetector.C file

The Fun4All_G4_EICDetector.C macro is the main macro for running ECCE simulations. The default settings in it are likely not configured to what you want. You will need to edit it and make sure the following lines are as follows (note that these appear in different parts of the macro).

   // Input::Simple = true;
   
   Input::READEIC = true;
   
   Enable::DSTOUT = true;

The first line needs to be commented out, along with any other lines starting with "Input::" except for the "Input::READEIC" line. You also need to enable DST output.
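
After editing, a quick way to double-check the relevant lines is to grep for them (a minimal sketch; it assumes you are in $MYDIR and that the macros repository was cloned there as above):

   # path assumes the macros repository was cloned into ${MYDIR} as shown earlier
   grep -n "Input::" macros/detectors/EICDetector/Fun4All_G4_EICDetector.C
   grep -n "Enable::DSTOUT" macros/detectors/EICDetector/Fun4All_G4_EICDetector.C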

Modify the makeSLURMJobsUser.py file

At this point you should edit the file productions/JLAB/makeSLURMJobsUser.py. Check all of the settings in the "class pars:" section. The comments above that section describe what each of the variables means:

   # MYDIR          : Directory where campaign is being run from (DST and log directories will be created here)
   # inputFileList  : File containing list of EICsmear formatted generated events files
   # macrosTopDir   : Directory "macros". (This should contain "detectors/EICDetector/Fun4All_G4_EICDetector.C")
   # nEventsPerJob  : Jobs will be broken up into this many events per job
   # nTotalEvents   : Maximum number of events to process when summing all jobs.
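
For example, setting nEventsPerJob to 2000 and nTotalEvents to 10000 should result in five jobs of 2000 events each (these particular numbers are only an illustration).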

Create the generated_events_files.list file

Create the file generated_events_files.list (or whatever name you gave for inputFileList in the script above). This should contain one ROOT filename per line, and each entry must be the full path to the file. If you happen to have all of these files in one directory, you can do this with something like:

   ls /work/eic/users/shimizu/djangoh/June2021Sim/*.root > generated_events_files.list
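
A quick sanity check that every entry in the list points to an existing file (a simple sketch in bash):

   # prints any list entries that do not exist on disk
   while read f; do [ -f "$f" ] || echo "MISSING: $f"; done < generated_events_files.list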

Run a test job

So many things can go wrong that you should consider it mandatory to run a 2-event test job. Do this by simply setting the nTotalEvents parameter to 2 and generating the submit scripts:

   python3 productions/JLAB/makeSLURMJobsUser.py 
   ./slurmJobs/submitJobs.sh

You can watch the status of your job with the following (press Ctrl-C to stop it):

   watch -n 4 sacct

If the job completes successfully, you should see the output files in the DST and log directories created under ${MYDIR}.
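
For example, one way to look for them (a sketch; the exact sub-directory names created under ${MYDIR} may differ):

   # sub-directory layout under ${MYDIR} is an assumption
   ls -ltr ${MYDIR}
   find ${MYDIR} -name "DST*.root"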


Production Campaign

Only certain people should run a full production campaign. These are campaigns of more than 1M events that will use a significant portion of the compute and storage resources allocated to EIC/ECCE. We restrict this to a limited number of people to ensure coordination with the simulation WG, so that work is not repeated at multiple sites and so that these resource-intensive campaigns are aligned with EIC/ECCE goals, since there is a limited number of them we can do.

The following example uses a directory on the /work/eic2 disk.

Pre-stage input files (i.e. generated events)

The input files should be copied to an appropriate directory at JLab prior to starting the campaign. The preferred area would be in /work/eic2/ECCE/PRODUCTION/ProductionInputs since that makes them available via xrootd and therefore accessible from anywhere (e.g. OSG).

For example, some events produced for the Yellow Report can be found here:

   /work/eic2/ECCE/PRODUCTION/ProductionInputs/YR_SIDIS/ep_18x100/
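
Staging new input files might look like the following (the source path and the MY_CHANNEL directory name are hypothetical placeholders):

   # MY_CHANNEL and the source path are hypothetical; use names matching your campaign
   mkdir -p /work/eic2/ECCE/PRODUCTION/ProductionInputs/MY_CHANNEL
   cp /path/to/my/generated/*.root /work/eic2/ECCE/PRODUCTION/ProductionInputs/MY_CHANNEL/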

Checkout production scripts

To start with, you need to create a working directory for the campaign. All top-level production campaigns should be placed in the directory /work/eic2/ECCE/PRODUCTION. Because it will be used in a few places later, set MYDIR to the full path of this directory.

   export MYDIR=/work/eic2/ECCE/PRODUCTION/2021.07.21.Electroweak_Djangoh_ep-10x100nc-q2-100
   mkdir -p $MYDIR
   cd $MYDIR

Clone the productions repository.

   git clone https://github.com/ECCE-EIC/productions

Run the setupProduction.py script. This takes two arguments: (1) the site the job submission scripts should be generated for (in this case "JLAB") and (2) a config file that specifies the parameters of the job. It requires PyROOT in order to open the input files and get the number of events in each, so you need to have your environment set up with an appropriate version of ROOT. This should be run from within the productions directory:

   cd productions
   source /apps/root/6.18.04/setroot_CUE.bash
   python3 ./setupProduction.py JLAB productionSetups/run_Electroweak_Djangoh_ep-10x100nc-q2-100.txt

The setupProduction.py script will automatically clone the macros repository and checkout the correct branch based on the configuration file. It will then call the appropriate site-specific script for generating submission scripts for each job. A master top-level script called submitJobs.sh will also be created which can be used to submit all of the jobs in one command. All submission scripts will be placed in a directory tree starting with submissionFiles. This allows you to use a common productions and macros directory for all jobs.
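
For example, once the scripts are generated you can locate and run the master script like this (a sketch; the exact layout under submissionFiles depends on the working group, generator, and channel in your config file):

   cd ${MYDIR}/productions
   find submissionFiles -name submitJobs.sh
   # run the script that find reports; the path below is only an assumption
   # ./submissionFiles/<WG>/<generator>/<channel>/slurmJobs/submitJobs.sh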

macros Directory

As noted above, the macros directory that contains all of the geometry for the ECCE detector as well as the list of evaluators to run is automatically cloned and a production tag checked out when the setupProduction.py script is run.


Troubleshooting

Locating Failed Jobs

Here is how to find failed jobs and their output.

1. Find a failed job number: Simply running sacct will list recent jobs, but often you may need to look further back in time. To find failed jobs that started within a 24hr period, do something like this:

   sacct -s FAILED -S 2021-09-23 -E 2021-09-24 -o JobID,JobName%64,CPUTime

This will give the job ID, name, and CPU time. Pick one job and note the name.

2. Scan log files: The stdout and stderr files are captured and placed in a "log" directory alongside the DST files. To find the exact location and name of these files, you can look in the slurm submission script corresponding to the job of interest. For example, if I have a job named:

   slurm-General_particleGun_singleElectron_000_4872000_02000

and I know that this was run from the campaign in:

   /work/eic2/ECCE/PRODUCTION/2021.09.23.General_particleGun_singleElectron

Then, I can look at the submission script:

   /work/eic2/ECCE/PRODUCTION/2021.09.23.General_particleGun_singleElectron/productions/submissionFiles/General/particleGun/singleElectron/slurmJobs/slurmJob_General_particleGun_singleElectron_000_4872000_02000.job

Looking in this file, the --output and --error options give the exact locations/names of these files. For example:

   /work/eic2/ECCE/MC/prop.4/prop.4.0/General/particleGun/singleElectron/log/slurm-General_particleGun_singleElectron_000_4872000_02000.out
   /work/eic2/ECCE/MC/prop.4/prop.4.0/General/particleGun/singleElectron/log/slurm-General_particleGun_singleElectron_000_4872000_02000.err
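
A quick way to pull just those two lines out of the job script (a sketch using standard grep, run from the slurmJobs directory shown above):

   grep -E -- '--output|--error' slurmJob_General_particleGun_singleElectron_000_4872000_02000.job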

Running evaluators manually in debugger

To track down problems with the evaluators you'll need to run the evaluator script manually within the singularity container and with the correct arguments.

First, specify the particular failed job you wish to debug. In this example, it is the last job in 2021.09.26.Electroweak_LQGENEP_ep-18x275-q2-100, which has the offset 999000. I set variables pointing to various directories since that makes the commands below more succinct and copyable.

   setenv OFFSET 999000
   setenv PRODDIR /work/eic2/ECCE/PRODUCTION/2021.09.26.Electroweak_LQGENEP_ep-18x275-q2-100/productions
   setenv DSTDIR `grep DSTDIR ${PRODDIR}/submissionFiles/*/*/*/*/submitParameters.dat | sed -e s/DSTDIR=//g`
   setenv MYBUILD `echo $DSTDIR | awk -F/ '{print $6}'`
   setenv MACROSTGZ `grep /work/eic2/ECCE/PRODUCTION/MACROS ${PRODDIR}/submissionFiles/*/*/*/*/*Job_*${OFFSET}_*.job | awk '{print $3}'`

Make a debug directory, unpack the macros there, and cd to it. Also create a symlink to the DST file:

   mkdir -p ${PRODDIR}/debug
   cd ${PRODDIR}/debug
   tar -xzf ${MACROSTGZ}
   cd macros/detectors/EICDetector
   ln -s ${DSTDIR}/DST*${OFFSET}_*.root
   setenv DSTFILE `ls DST*.root`

Get the singularity image used from the job script. Note that this could be a .simg file or just a directory. Run a shell using this image. Be sure to unset the LD_PRELOAD environment variable if it is set, since every command in the shell will complain otherwise. On the plus side, the containerized shell will inherit your environment, so the variables set above will still be available:

   setenv SIMG `grep SIMG= ${PRODDIR}/submissionFiles/*/*/*/*/*Job_*${OFFSET}_*.job | sed -e 's/SIMG=//g'`
   eval `/usr/bin/modulecmd csh load /apps/modulefiles/singularity/3.4.0`
   singularity shell --no-home -B /cvmfs:/cvmfs ${SIMG}

Once inside the shell, set up the environment and run the Fun4All evaluators:

   source /cvmfs/eic.opensciencegrid.org/ecce/gcc-8.3/opt/fun4all/core/bin/ecce_setup.sh -n $MYBUILD
   root.exe -q -b 'Fun4All_runEvaluators.C(0,"'${DSTFILE}'",".",0,".")'
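
If you need to step through a crash, the same command can be run under gdb inside the container (a sketch; this assumes gdb is available in the container image). Type "run" at the gdb prompt to start:

   # assumes gdb is present in the container image
   gdb --args root.exe -q -b 'Fun4All_runEvaluators.C(0,"'${DSTFILE}'",".",0,".")'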