ECCE:HOWTO Run a simulation campaign on the OSG
These are instructions for running an ECCE simulation campaign on the Open Science Grid (OSG).
Submitting from JLab
Only designated people should run a full production campaign, i.e. a campaign of more than 1M events that will use a significant portion of the compute and storage resources allocated to EIC/ECCE. We restrict this to a small group to ensure coordination with the simulation WG, so that work is not repeated at multiple sites and, since only a limited number of these large campaigns can be run, so that they stay aligned with EIC/ECCE goals.
The following example uses a directory on the /work/eic2 disk.
Pre-stage input files (i.e. generated events)
The input files should be copied to an appropriate directory at JLab before starting the campaign. Assuming they are not too big, the preferred area is /work/osgpool/eic, since files there are served via xrootd and are therefore accessible from anywhere (e.g. from OSG jobs).
For example: /work/osgpool/eic/ECCE/ProductionInputs/Electroweak/ep-10x100/Djangoh/erhic-nc-yesradiation_ep_10_100_q2_10_evt.root
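Jobs running on the OSG read these files through xrootd rather than from /work. The sketch below converts a local path under /work/osgpool/eic into a root:// URL. It assumes the xrootd export strips the leading /work (so /work/osgpool/... is served as /osgpool/...) and takes the server host as a parameter; both the prefix mapping and any host you pass in are assumptions to check against the JLab xrootd setup, not values taken from this page.

```python
from pathlib import PurePosixPath

# Assumption: files under /work/osgpool/eic are exported by xrootd with the
# leading "/work" stripped. Verify against the actual JLab xrootd setup.
LOCAL_ROOT = PurePosixPath("/work/osgpool/eic")
EXPORT_ROOT = PurePosixPath("/osgpool/eic")


def to_xrootd_url(local_path: str, host: str) -> str:
    """Map a pre-staged input file to a root:// URL (hypothetical mapping)."""
    path = PurePosixPath(local_path)
    try:
        rel = path.relative_to(LOCAL_ROOT)
    except ValueError:
        raise ValueError(
            f"{local_path} is not under {LOCAL_ROOT}; it will not be visible via xrootd"
        ) from None
    # EXPORT_ROOT / rel is absolute, so this yields root://host//osgpool/eic/...
    return f"root://{host}/{EXPORT_ROOT / rel}"
```

For example, passing the Djangoh input file above with a placeholder host such as xrootd.example.org gives root://xrootd.example.org//osgpool/eic/ECCE/ProductionInputs/Electroweak/ep-10x100/Djangoh/erhic-nc-yesradiation_ep_10_100_q2_10_evt.root. Files outside /work/osgpool/eic raise an error, which is a quick way to catch inputs that OSG jobs would not be able to reach.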
Check out production scripts
To start, create a working directory for the campaign. All top-level production campaign directories should be placed in /work/eic2/ECCE/PRODUCTION. Because this path is used in several places later, set MYDIR to its full path:
setenv MYDIR /work/eic2/ECCE/PRODUCTION/2021.07.21.Electroweak_Djangoh_ep-10x100nc-q2-10
mkdir -p $MYDIR
cd $MYDIR
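The example directory name appears to follow the pattern <date>.<category>_<generator>_<setup>, where the tag after the date matches the config file name used later (run_Electroweak_Djangoh_ep-10x100nc-q2-10.txt). A minimal sketch that builds MYDIR this way; the pattern is inferred from this single example and is an assumption, not a documented convention.

```python
from datetime import date

PRODUCTION_ROOT = "/work/eic2/ECCE/PRODUCTION"


def campaign_dir(day: date, category: str, generator: str, setup: str) -> str:
    """Build a campaign directory path following the (inferred) naming pattern."""
    # Same tag as in productionSetups/run_<tag>.txt (assumed convention)
    tag = f"{category}_{generator}_{setup}"
    # Date formatted as YYYY.MM.DD, followed by a dot and the tag
    return f"{PRODUCTION_ROOT}/{day:%Y.%m.%d}.{tag}"
```

With date(2021, 7, 21), "Electroweak", "Djangoh" and "ep-10x100nc-q2-10" this reproduces the MYDIR value used above.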
Clone the productions repository.
git clone https://github.com/ECCE-EIC/productions
Run top-level script to generate all submission scripts
Run the setupProduction.py script from within the productions directory. It takes two arguments: 1.) the site the job submission scripts should be generated for (in this case "OSG") and 2.) a config file that specifies the parameters of the job. It requires PyROOT in order to open the input files and get the number of events in each, so your environment must be set up with an appropriate version of ROOT.
cd productions
source /apps/root/6.18.04/setroot_CUE.csh
python3 ./setupProduction.py OSG productionSetups/run_Electroweak_Djangoh_ep-10x100nc-q2-10.txt
The setupProduction.py script automatically clones the macros repository and checks out the correct branch based on the configuration file. It then calls the appropriate site-specific script to generate a submission script for each job.
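setupProduction.py needs the event count of each input file so that it can divide the campaign into jobs. The sketch below shows one plausible way to do that split once the count is known (with PyROOT the count would typically come from a TTree's GetEntries(), which is not reproduced here). The events_per_job parameter and the (first_event, n_events) layout are hypothetical and are not read from the actual script or config file.

```python
from typing import List, Tuple


def split_into_jobs(n_events: int, events_per_job: int) -> List[Tuple[int, int]]:
    """Return (first_event, n_events) for each job covering one input file.

    Hypothetical illustration: the real setupProduction.py may split differently.
    """
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")
    jobs = []
    for first in range(0, n_events, events_per_job):
        # The last job picks up whatever events remain
        jobs.append((first, min(events_per_job, n_events - first)))
    return jobs
```

For example, split_into_jobs(2500, 1000) returns [(0, 1000), (1000, 1000), (2000, 500)], so every event is assigned to exactly one job.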
Submit all jobs
setupProduction.py also creates a top-level script, submitJobs.sh, that submits all of the jobs with a single command. All submission scripts are placed in a directory tree rooted at submissionFiles, which allows a common productions and macros directory to be used for all jobs. To submit to the OSG you must be logged in to scosg16 or scosg20.
ssh scosg16
setenv MYDIR /work/eic2/ECCE/PRODUCTION/2021.07.21.Electroweak_Djangoh_ep-10x100nc-q2-10
$MYDIR/productions/submissionFiles/*/*/*/osgJobs/submitJobs.sh
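The command above relies on the glob matching a single submitJobs.sh: if it matched several, the shell would run the first one and pass the others to it as arguments. A minimal dry-run sketch that lists what the same pattern expands to, using Python's standard glob module; it only returns paths and submits nothing.

```python
import glob
import os
from typing import List


def find_submit_scripts(mydir: str) -> List[str]:
    """List the submitJobs.sh scripts the shell glob would expand to (dry run)."""
    pattern = os.path.join(mydir, "productions", "submissionFiles",
                           "*", "*", "*", "osgJobs", "submitJobs.sh")
    # Sorted for a stable, shell-like ordering of matches
    return sorted(glob.glob(pattern))
```

Calling find_submit_scripts(os.environ["MYDIR"]) on scosg16 before submitting should return exactly one path; anything else is worth investigating first.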
Check output
First, note that the standard location for the production and submission scripts is the /work/eic2/ECCE/PRODUCTION tree, and the standard location for the DSTs, evaluator files, and logs is the /work/eic2/ECCE/MC tree. For example:
/work/eic2/ECCE/PRODUCTION/2021.07.21.Electroweak_Djangoh_ep-10x100nc-q2-10
/work/eic2/ECCE/MC/prop.2/c131177/Electroweak/Djangoh/ep-10x100nc-q2-10
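In the example, the last three levels of the MC path (Electroweak/Djangoh/ep-10x100nc-q2-10) are the campaign tag with underscores turned into directory separators. The sketch below derives the output directory from that tag. The two intermediate levels (prop.2 and c131177 in the example) are not explained on this page, so they are passed in as opaque strings, and category and generator names are assumed to contain no underscores.

```python
from pathlib import PurePosixPath

MC_ROOT = PurePosixPath("/work/eic2/ECCE/MC")


def mc_output_dir(campaign_tag: str, level1: str, level2: str) -> PurePosixPath:
    """Map a tag like 'Electroweak_Djangoh_ep-10x100nc-q2-10' to its MC output directory.

    level1/level2 are the two intermediate directories (e.g. 'prop.2', 'c131177'),
    treated as opaque strings. Assumes category and generator contain no '_'.
    """
    # Split at most twice so any '_' in the setup part is preserved
    category, generator, setup = campaign_tag.split("_", 2)
    return MC_ROOT / level1 / level2 / category / generator / setup
```

With the example tag and "prop.2", "c131177" this gives the MC directory shown above.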
There are a few ways to check the progress of the campaign. The first is simply to run condor_q on the OSG submit node (e.g. scosg16.jlab.org):
ssh scosg16
condor_q
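For a quick per-state tally instead of scanning the full condor_q listing, you can ask HTCondor for just the JobStatus attribute of each job (condor_q -af JobStatus prints one integer per job) and count the values. A sketch, assuming it runs on the submit node with condor_q on the PATH; it is not part of the productions repository.

```python
import subprocess
from collections import Counter
from typing import Dict

# HTCondor JobStatus attribute codes
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed",
              5: "Held", 6: "Transferring Output", 7: "Suspended"}


def summarize_job_status(af_output: str) -> Dict[str, int]:
    """Count jobs per state from the output of `condor_q -af JobStatus`."""
    counts: Counter = Counter()
    for token in af_output.split():
        counts[JOB_STATUS.get(int(token), f"Unknown({token})")] += 1
    return dict(counts)


def query_condor() -> Dict[str, int]:
    """Run condor_q on the submit node (e.g. scosg16) and summarize the result."""
    out = subprocess.run(["condor_q", "-af", "JobStatus"],
                         capture_output=True, text=True, check=True).stdout
    return summarize_job_status(out)
```

A large "Held" count is usually the first sign that something is wrong with the campaign and deserves a closer look with condor_q before more jobs run.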
The second is to use the extras/plot_Njobs_vs_time.py script to generate a pair of plots showing the number of simultaneously running jobs vs. time and the time taken to run each job.
cd $MYDIR/productions
./extras/plot_Njobs_vs_time.py
root -l Njobs_vs_time.C
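The "running jobs vs. time" curve is essentially a sweep over job start and end times. The sketch below computes that step curve from a list of (start, end) pairs; how the real script obtains those times is not shown, and the input format here is hypothetical. The per-job run time for the second plot is simply end minus start.

```python
from typing import List, Tuple


def running_jobs_vs_time(intervals: List[Tuple[float, float]]) -> List[Tuple[float, int]]:
    """Return (time, n_running) steps from (start, end) pairs via a sweep line.

    A job counts as running on the half-open interval [start, end).
    """
    events = []
    for start, end in intervals:
        events.append((start, +1))
        events.append((end, -1))
    # Tuples sort by time, then delta: ends (-1) come before starts (+1) at equal times
    events.sort()
    steps: List[Tuple[float, int]] = []
    running = 0
    for t, delta in events:
        running += delta
        if steps and steps[-1][0] == t:
            steps[-1] = (t, running)  # merge events that share a timestamp
        else:
            steps.append((t, running))
    return steps
```

For jobs [(0, 10), (5, 15), (10, 20)] this returns [(0, 1), (5, 2), (10, 2), (15, 1), (20, 0)], so the peak concurrency is 2.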
The third is to generate lists of jobs grouped by success or failure mode. Run this at the end of the campaign to get the overall status. The script creates a directory named StatusReports where all of the lists are placed.
cd $MYDIR/productions
./extras/campaignStatus.py
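To illustrate the kind of grouping campaignStatus.py produces, the sketch below sorts jobs into lists by scanning each job's log text for marker strings and writes one list per group into a StatusReports directory. The marker strings, group names, and <status>.txt file names are hypothetical placeholders, not what campaignStatus.py actually checks or writes.

```python
import os
from collections import defaultdict
from typing import Dict, List

# Hypothetical failure signatures: placeholders, not what campaignStatus.py checks
FAILURE_MARKERS = {
    "segfault": "Segmentation fault",
    "input_missing": "No such file or directory",
}


def classify_jobs(logs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group job names by the first failure marker found in their log text.

    Jobs with no matching marker are grouped under 'success' (a simplification:
    a real check should also confirm the expected output files exist).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for job, text in sorted(logs.items()):
        status = "success"
        for name, marker in FAILURE_MARKERS.items():
            if marker in text:
                status = name
                break
        groups[status].append(job)
    return dict(groups)


def write_reports(groups: Dict[str, List[str]], outdir: str = "StatusReports") -> None:
    """Write one <status>.txt job list per group (file naming is hypothetical)."""
    os.makedirs(outdir, exist_ok=True)
    for status, jobs in groups.items():
        with open(os.path.join(outdir, f"{status}.txt"), "w") as f:
            f.write("\n".join(jobs) + "\n")
```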
Publishing results
The campaign output files are not placed on the xrootd or S3 servers automatically. This lets you confirm that the campaign was successful and had no major issues before posting the files. To publish the resulting evaluator files to BOTH the JLab xrootd and BNL S3 servers in the standard locations, run the campaignPublish.py script. It reads the campaign configuration from the submitParameters.dat file created when setupProduction.py was run; this file lives in the same directory as the submission scripts. Using it, campaignPublish.py forms the correct directory names on the xrootd and S3 servers.
cd $MYDIR/productions
./extras/campaignPublish.py
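Since publishing is a separate, manual step, it can help to preview where each file will land before running campaignPublish.py. A minimal sketch, assuming the published layout mirrors the local /work/eic2/ECCE/MC tree under some remote base path; the base path and file name in the example are placeholders, not the real JLab xrootd or BNL S3 locations, which campaignPublish.py derives from submitParameters.dat.

```python
from pathlib import PurePosixPath

LOCAL_MC_ROOT = PurePosixPath("/work/eic2/ECCE/MC")


def publish_destination(local_file: str, remote_root: str) -> str:
    """Map a file under the local MC tree to the same relative path under remote_root.

    remote_root is a hypothetical destination prefix (an xrootd or S3 base path).
    Raises ValueError if local_file is not under /work/eic2/ECCE/MC.
    """
    rel = PurePosixPath(local_file).relative_to(LOCAL_MC_ROOT)
    return f"{remote_root.rstrip('/')}/{rel}"
```

For example, a hypothetical evaluator file /work/eic2/ECCE/MC/prop.2/c131177/Electroweak/Djangoh/ep-10x100nc-q2-10/eval_0.root with the placeholder base s3://example-bucket/ECCE/MC maps to s3://example-bucket/ECCE/MC/prop.2/c131177/Electroweak/Djangoh/ep-10x100nc-q2-10/eval_0.root.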