Batch Processing
Information on batch processing.
Xrootd
Distributed file management tailored to ROOT files.
- Stanford page has documentation and code (snapshots are also in ROOT)
- CERN page
PanDA
Unified batch system layered on top of the native batch systems (Condor, PBS).
Links
- Paper: 2008 Maeno
- Wiki: ATLAS wiki
- Presentations: 2006 Wenaus, 2009 Potekhin
- Interactions between BNL PAS and Daya Bay groups
- Requesting OSG Certificate from DOE
- Initially request to be in Engagement VO
- Eventually we will get a Daya Bay specific VO
- Sponsor needs to be someone who can vouch for you and who is known to OSG
Setting up Grid Certificate
Fill out this form
https://pki1.doegrids.org/ca/
(Affiliation: OSG, VO Name for OSG: BNL)
Wait for acceptance and follow instructions in resulting email message.
Here is a message from Jose Caballero (BNL) on VO:
Once you complete the process and create your userkey.pem and usercert.pem files, you can verify you have a valid certificate just by trying these commands on a machine where the GRID environment is set up:

    $ grid-proxy-init
    $ grid-proxy-info

If everything went fine, these two commands should work. And actually, to work with PanDA right now, that is all you need.

Now, if in the future you want to join a VO, or more than one, no problem. You will use the same certificate you have just got from DOE to join all the VOs you need in the future. You are not forced to join the BNL one (which I think is being used for different purposes) and you don't need to request a new certificate to join other VOs. As soon as there is a VO for DayaBay you will be able to join it, with no additional steps.

For the time being, I think your colleagues are joining VO Engage. But not sure about that. They can confirm. However, as I said, to work with PanDA you don't even need to join a VO. Not, at least, with the current setup we have.
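The check described in the message above can be sketched in Python. This is only an illustration: the helper names missing_commands and check_proxy are hypothetical, and treating a zero exit status from grid-proxy-info as "the proxy is fine" is an assumption, not something the message states.

```python
import shutil
import subprocess

# The two commands named in the message above.
PROXY_COMMANDS = ["grid-proxy-init", "grid-proxy-info"]


def missing_commands(commands=PROXY_COMMANDS):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def check_proxy():
    """Run grid-proxy-info and report whether it exited with status 0.

    Assumption: a zero exit status is taken to mean the proxy is usable.
    """
    if missing_commands(["grid-proxy-info"]):
        return False
    result = subprocess.run(["grid-proxy-info"], capture_output=True, text=True)
    return result.returncode == 0
```

If missing_commands() returns a non-empty list, the GRID environment has probably not been set up in the current shell, so check_proxy() is only meaningful once that list is empty.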
Join the Daya Bay VO
After you have the certificate in your browser you can join our Virtual Organization (VO) by going here:
https://voms.lbl.gov:8443/voms/dayabay/
Import/Export/Backup the certificate
When following the instructions from the email message received above, there may be two gotchas:
- Make sure you import the CA chain from DOEGrids.
- Go to https://pki1.doegrids.org/ca/
- Click "Retrieval" tab
- Click "Import CA Certificate Chain" on sidebar
- Select first choice "Import the CA certificate chain into your browser" and submit
- Under Firefox 3, when you "Export" the certificate, if you have an option to do "Backup" instead, use it. Otherwise the openssl step may produce errors.
You then need to create the certificate and key files from the backup file:
    cd
    mkdir .globus
    openssl pkcs12 -in certificate-backup-file.p12 -clcerts -nokeys -out .globus/usercert.pem
    openssl pkcs12 -in certificate-backup-file.p12 -nocerts -out .globus/userkey.pem
    chmod 600 .globus/userkey.pem
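As a rough sketch, the same steps can be expressed in Python: one helper assembles the two openssl invocations shown above as argument lists, and another checks that the private key is not accessible to group or others (the point of the chmod 600 step). The function names pkcs12_commands and key_is_private are hypothetical, introduced only for this example.

```python
import os
import stat


def pkcs12_commands(p12_file, globus_dir=".globus"):
    """Return the two openssl invocations from the text as argument lists."""
    cert = os.path.join(globus_dir, "usercert.pem")
    key = os.path.join(globus_dir, "userkey.pem")
    return [
        # Extract the user certificate only (no private key).
        ["openssl", "pkcs12", "-in", p12_file, "-clcerts", "-nokeys", "-out", cert],
        # Extract the private key only (no certificates).
        ["openssl", "pkcs12", "-in", p12_file, "-nocerts", "-out", key],
    ]


def key_is_private(path):
    """True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Each list returned by pkcs12_commands can be passed to subprocess.run; key_is_private(".globus/userkey.pem") should then return True once the chmod step has been applied.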
Grid proxy
On RACF:
    source /afs/cern.ch/project/gd/LCG-share/sl5/etc/profile.d/grid-env.sh
    grid-proxy-init
Submitting Jobs
See here for instructions:
http://www.opensciencegrid.org/panda
Here to download client:
http://www.opensciencegrid.org/panda_versions
Submitting jobs to PDSF
Cheng-Ju's questions and Jose's answers from email exchange of 8 Feb 2011
The website you listed below is useful to get an overview of Panda, but it's not exactly what I have in mind. What I have in mind is actually a practical user guide for panda@RACF. The page supported by Saclay is a good example of a practical guide. That page was written to help shifters run the production with Panda at Lyon. https://atlas-france.in2p3.fr/cgi-bin/twiki/bin/view/Atlas/PandaLyon
Let's say if I want to submit a job from BNL to PDSF using Panda:
- What BNL node do I need to login to?
- you don't need to log in at BNL. Any machine with the OSG client is valid.
- What initialization files do I need to source?
- you download the panda client (following the link in the web page http://www.opensciencegrid.org/panda), source setup.(c)sh, and you are done. You can skip the source step if you add the directories included in the tarball to your $PYTHONPATH
- What are the voms-proxy-init options that I need to include?
- you don't need to. To submit jobs you only need a valid grid proxy.
- What is the URL for the Panda monitoring page?
- http://panda.cern.ch/server/pandamon/query?job=*&site=NERSC-PDSF&detail=no
you can see my testing jobs there
- How do I check the status of pilots on PDSF? Do I need to submit the pilots manually or auto-submitter is already in place?
- auto-submitter will be in place. I will turn it on when we start running jobs at NERSC.
To check the status of the pilots: http://panda.cern.ch/server/pandamon/query?tp=pilots&accepts=NERSC-PDSF
- Should we use command line sendJob.py to submit jobs? Are there any example scripts?
- David/Bret wrote a dedicated script which parses the input options and prepares the right list of options for nuwa.
Then it uses directly the python interface.
The script can be found here: http://www.usatlas.bnl.gov/~caballer/panda/transformations/daya/sub_panda.py
The transformation script used at RACF was http://www.usatlas.bnl.gov/~caballer/panda/transformations/daya/nuwa_wrap_tee3.sh
I forgot, the trf for NERSC, at least during the testing steps, has been http://www.usatlas.bnl.gov/~caballer/panda/transformations/daya/nuwa_wrap_tee3_nersc.sh
These are some examples of the items that should be documented. After going over the documentation, the person should be able to march along without having to bug you every other minute.
Initial Test
Example from Jose Caballero:
    wget http://www.usatlas.bnl.gov/~caballer/panda/demo/sendjobs.tar
    tar xvf sendjobs.tar
    cd sendjobs
    source setup.sh
    ./sendJob.py --njobs 4 --computingSite TEST2 --transformation \
        http://www.usatlas.bnl.gov/~caballer/panda/transformations/fake.py --prodSourceLabel user --jobParameters "a b c 1 2 3"
(These options work with ver. 081710)
See options: http://www.opensciencegrid.org/panda#options
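When the same submission is driven from another script, it can help to assemble the sendJob.py command line programmatically. The sketch below uses only the options that appear in the example above; the helper name send_job_command is hypothetical, and nothing is assumed about other options the client may accept.

```python
import shlex


def send_job_command(njobs, site, transformation, job_parameters, label="user"):
    """Build the sendJob.py argument list using the options from the example."""
    return [
        "./sendJob.py",
        "--njobs", str(njobs),
        "--computingSite", site,
        "--transformation", transformation,
        "--prodSourceLabel", label,
        # Kept as a single argument so the spaces inside it survive.
        "--jobParameters", job_parameters,
    ]


cmd = send_job_command(
    4,
    "TEST2",
    "http://www.usatlas.bnl.gov/~caballer/panda/transformations/fake.py",
    "a b c 1 2 3",
)
# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run from inside the sendjobs directory after setup.sh has been sourced; passing a list avoids quoting problems with the job parameters.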
The job can be monitored by PandaID etc here:
http://panda.cern.ch:25980/server/pandamon/query
or, by specifying your name (replace "YourFirstName" and "YourLastName" with your own):
http://panda.cern.ch:25980/server/pandamon/query?ui=user&name=YourFirstName%20YourLastName
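The per-user monitoring URL can also be generated rather than edited by hand. This is a minimal sketch that reproduces the URL pattern shown above, with the space in the name encoded as %20; the function name user_jobs_url is hypothetical.

```python
from urllib.parse import quote, urlencode

# Base URL of the monitor query page, as given above.
MONITOR_URL = "http://panda.cern.ch:25980/server/pandamon/query"


def user_jobs_url(first_name, last_name):
    """Return the monitor URL listing jobs for the given user name."""
    params = {"ui": "user", "name": f"{first_name} {last_name}"}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return MONITOR_URL + "?" + urlencode(params, quote_via=quote)
```

For example, user_jobs_url("Jane", "Doe") yields the query URL with name=Jane%20Doe.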
Submitting pilots
Example from Jose Caballero:
    wget http://www.usatlas.bnl.gov/~caballer/panda/demo/sendpilots.tar
    tar xvf sendpilots.tar
    cd sendpilots
    source setup.sh
    ./pilotScheduler.py --queue=TEST2 --pandasite=TEST2 --pilot=default --single
Note: in steady state, pilots will be submitted via cron job.
Kill PanDA job
Example from Jose: (NOTE: This method is not recommended for killing more than a few jobs)
Now, let's say you want to kill the job with ID 12345678. Use this script. Place it in the same directory where you have sendJob.py:

    #!/usr/bin/env python
    import sys
    from userinterface.Client import killJobs

    # PandaID of the job to kill, taken from the command line
    jobid = sys.argv[1]
    killJobs(jobid)

And just type

    $ killscript.py 12345678
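A slightly more defensive variant of the kill script might validate its arguments before contacting PanDA. The sketch below is an assumption-laden illustration: parse_panda_ids and main are hypothetical names, and killJobs is called once per ID with a single value, exactly as in the script above, because the source does not say whether it accepts several IDs at once.

```python
import sys


def parse_panda_ids(args):
    """Return the arguments unchanged if every one looks like a numeric PandaID."""
    ids = []
    for arg in args:
        if not arg.isdigit():
            raise ValueError(f"not a PandaID: {arg!r}")
        ids.append(arg)
    return ids


def main(argv):
    """Kill each job named on the command line; return a process exit code."""
    ids = parse_panda_ids(argv[1:])
    if not ids:
        print("usage: killscript.py PANDAID [PANDAID ...]")
        return 1
    # Imported here so the argument check works without the PanDA client installed.
    from userinterface.Client import killJobs
    for jobid in ids:
        # Same call form as the original script: one ID per call.
        killJobs(jobid)
    return 0
```

To use it as a script, end the file with sys.exit(main(sys.argv)). As noted above, this approach is only meant for a handful of jobs.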
Condor & panda
Maxim on 'holding', 'pre-run' and the life-or-death struggle between condor and panda
In my experience, "holding" means a transitory state, such as when some data transfer is taking place; in our case, that usually happens when staging out produced data. By now, as I see, that status has cleared, as it usually does.
Pre-run means the job has been entered in the queue and is waiting for its matching pilot. I'll see what's happening if the match is still not happening -- previously, I've seen Condor problems that resulted in delays.