Batch Processing

Information on batch processing.

Xrootd

Distributed file management tailored to ROOT files.

PanDA

Unified batch system layered on top of the native batch systems (Condor, PBS).

Links

Setting up Grid Certificate

Fill out this form

https://pki1.doegrids.org/ca/

(Affiliation: OSG, VO Name for OSG: BNL)

Wait for acceptance and follow instructions in resulting email message.

Here is a message from Jose Caballero (BNL) on VO:

Once you complete the process and create your userkey.pem and usercert.pem files,
you can verify you have a valid certificate just by trying these commands on a machine where the GRID environment is set up
$ grid-proxy-init
$ grid-proxy-info
If everything went fine, these two commands should work.
And actually, to work with PanDA right now, that is all you need.
Now, if in the future you want to join a VO, or more than one, no problem. You will use the same certificate you have just got from DOE to join all the VOs you need in the future.
You are not forced to join the BNL one (which I think is being used for different purposes) and you don't need to request a new certificate to join other VOs.
As soon as there is a VO for DayaBay you will be able to join it, with no additional steps.
For the time being, I think your colleagues are joining VO Engage. But not sure about that. They can confirm.
However, as I said, to work with PanDA you don't even need to join a VO. Not, at least, with the current setup we have.
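
The check described in the message can also be scripted. The following is a minimal sketch, assuming grid-proxy-init and grid-proxy-info are on the PATH once the grid environment is set up. The function name check_proxy and its injectable run argument are illustrative only and not part of any grid toolkit. Note that grid-proxy-init normally prompts for the key pass phrase, so this is meant for interactive use.

```python
import subprocess


def check_proxy(run=subprocess.call):
    """Run the two commands from the message and record which ones succeed.

    `run` is any callable that takes an argument list and returns an exit
    code. It is injectable (an assumption of this sketch) so that the logic
    can be tried on a machine without the grid tools installed.
    """
    results = {}
    for command in ("grid-proxy-init", "grid-proxy-info"):
        try:
            # Exit code 0 is treated as success.
            results[command] = (run([command]) == 0)
        except OSError:
            # Command not found: the grid environment is probably not set up.
            results[command] = False
    return results
```

Calling check_proxy() with no arguments runs the real commands; a result of False for either one suggests that the certificate files or the grid environment need another look.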

Join the Daya Bay VO

After you have the certificate in your browser you can join our Virtual Organization (VO) by going here:

https://voms.lbl.gov:8443/voms/dayabay/

Import/Export/Backup the certificate

Following the instructions from the email message received above, there may be two gotchas:

  1. Make sure you import the CA chain from DOEGrids.
    1. Go to https://pki1.doegrids.org/ca/
    2. Click "Retrieval" tab
    3. Click "Import CA Certificate Chain" on sidebar
    4. Select first choice "Import the CA certificate chain into your browser" and submit
  2. Under Firefox 3, when you "Export" the certificate, use the "Backup" option instead if it is offered. Otherwise you may see errors from the openssl step.

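You then need to create the certificate and private key files from the backup:

cd
# create ~/.globus if it does not exist yet
mkdir -p .globus
# extract the user certificate (no private key)
openssl pkcs12 -in certificate-backup-file.p12 -clcerts -nokeys -out .globus/usercert.pem
# extract the private key (no certificates)
openssl pkcs12 -in certificate-backup-file.p12 -nocerts -out .globus/userkey.pem
# the key must not be accessible to other users
chmod 600 .globus/userkey.pem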
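
Before trying grid-proxy-init it can help to confirm that both files are in place and that the key is private. The following is a minimal sketch of such a check; the function name check_globus_dir is hypothetical, and the rule it applies (no group or other permission bits on userkey.pem) is an assumption modelled on the chmod 600 step above.

```python
import os
import stat


def check_globus_dir(globus_dir=os.path.expanduser("~/.globus")):
    """Return a list of problems found with the certificate and key files."""
    problems = []
    for name in ("usercert.pem", "userkey.pem"):
        path = os.path.join(globus_dir, name)
        if not os.path.isfile(path):
            problems.append("missing: " + path)

    key = os.path.join(globus_dir, "userkey.pem")
    if os.path.isfile(key):
        mode = stat.S_IMODE(os.stat(key).st_mode)
        # Any group/other permission bit means the key is not private.
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(
                "userkey.pem has mode %o; run chmod 600 on it" % mode
            )
    return problems
```

An empty list means nothing obvious is wrong; it does not prove that the certificate itself is valid.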

Grid proxy

On RACF:

source /afs/cern.ch/project/gd/LCG-share/sl5/etc/profile.d/grid-env.sh
grid-proxy-init

Submitting Jobs

See here for instructions:

 http://www.opensciencegrid.org/panda

Here to download client:

 http://www.opensciencegrid.org/panda_versions

Submitting jobs to PDSF

Cheng-Ju's questions and Jose's answers from email exchange of 8 Feb 2011

The website you listed below is useful to get an overview of Panda, but it's not exactly what I have in mind. What I have in mind is actually a practical user guide for panda@RACF. The page supported by Saclay is a good example of a practical guide. That page was written to help shifters run the production with Panda at Lyon. https://atlas-france.in2p3.fr/cgi-bin/twiki/bin/view/Atlas/PandaLyon

Let's say if I want to submit a job from BNL to PDSF using Panda:

  1. What BNL node do I need to login to?
    • you don't need to log in at BNL. Any machine with the OSG client is valid.
  2. What initialization files do I need to source?
    • you download the panda client (following the link on the web page http://www.opensciencegrid.org/panda), source setup.(c)sh, and you are done. You can skip the source step if you add the directories included in the tarball to your $PYTHONPATH
  3. What are the voms-proxy-init options that I need to include?
    • you don't need to. To submit jobs you only need a valid grid proxy.
  4. What is the URL for the Panda monitoring page?
  5. How do I check the status of pilots on PDSF? Do I need to submit the pilots manually or auto-submitter is already in place?
  6. Should we use command line sendJob.py to submit jobs? Are there any example scripts?


These are some examples of the items that should be documented. After going over the documentation, the person should be able to march along without having to bug you every other minute.

Initial Test

Example from Jose Caballero:

wget http://www.usatlas.bnl.gov/~caballer/panda/demo/sendjobs.tar
tar xvf sendjobs.tar
cd sendjobs
source setup.sh
./sendJob.py --njobs 4 --computingSite TEST2 --transformation \
http://www.usatlas.bnl.gov/~caballer/panda/transformations/fake.py --prodSourceLabel user --jobParameters "a b c 1 2 3"

(These options work with ver. 081710)

See options: http://www.opensciencegrid.org/panda#options
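
When the same submission is repeated with a different number of jobs or a different site, it may be convenient to build the command line programmatically. The following is a minimal sketch that uses only the options shown in the example above; the helper names send_job_command and as_shell_line are hypothetical, and the sketch assumes it is run from the sendjobs directory after sourcing setup.sh.

```python
import shlex

# Transformation URL taken from the example above.
TRANSFORMATION = (
    "http://www.usatlas.bnl.gov/~caballer/panda/transformations/fake.py"
)


def send_job_command(njobs, site, job_parameters,
                     transformation=TRANSFORMATION):
    """Build the sendJob.py argument list used in the example above."""
    return [
        "./sendJob.py",
        "--njobs", str(njobs),
        "--computingSite", site,
        "--transformation", transformation,
        "--prodSourceLabel", "user",
        "--jobParameters", job_parameters,
    ]


def as_shell_line(args):
    """Render an argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(a) for a in args)
```

The list returned by send_job_command(4, "TEST2", "a b c 1 2 3") can be passed directly to subprocess.call, which avoids shell quoting problems with the job parameters.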

The job can be monitored by PandaID etc here:

http://panda.cern.ch:25980/server/pandamon/query

or by specifying your name (replace "YourFirstName" and "YourLastName" with your own):

http://panda.cern.ch:25980/server/pandamon/query?ui=user&name=YourFirstName%20YourLastName
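
The space between the first and last name has to be encoded as %20 in the URL. A minimal sketch of building the link in Python follows; the function name user_monitor_url is hypothetical, while the base URL and the ui and name query parameters are the ones shown above.

```python
from urllib.parse import quote

MONITOR_URL = "http://panda.cern.ch:25980/server/pandamon/query"


def user_monitor_url(first_name, last_name):
    """Return the monitoring URL for the jobs of the given user."""
    full_name = "%s %s" % (first_name, last_name)
    # quote() percent-encodes the space as %20, as in the URL above.
    return "%s?ui=user&name=%s" % (MONITOR_URL, quote(full_name))
```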

Submitting pilots

Example from Jose Caballero:

wget http://www.usatlas.bnl.gov/~caballer/panda/demo/sendpilots.tar
tar xvf sendpilots.tar
cd sendpilots
source setup.sh
./pilotScheduler.py --queue=TEST2 --pandasite=TEST2 --pilot=default --single

Note: in steady state, pilots will be submitted via cron job.

Kill PanDA job

Example from Jose: (NOTE: This method is not recommended for killing more than a few jobs)

Now, let's say you want to kill the job with ID 12345678.
Use this script. Save it as killscript.py in the same directory where you have sendJob.py
-------------------------------------------------------
#!/usr/bin/env python
# Kill the PanDA job whose ID is given as the first command-line argument.
import sys
from userinterface.Client import killJobs
jobid = sys.argv[1]
killJobs(jobid)
-------------------------------------------------------
And just type
$ python killscript.py 12345678

Condor & panda

Maxim on 'holding', 'pre-run' and the life-or-death struggle between condor and panda

In my experience "holding" means a transitory state, such as when some data transfer
is taking place. In our case, that usually happens when staging out produced data.
By now, as I see, that status cleared, as it usually does.
Pre-run means the job has been entered in the queue and is waiting for its matching pilot.
I'll see what's happening if the match is still not happening -- previously, I've seen Condor 
problems that resulted in delays.