This wiki is now closed and kept for historical purposes. Please visit the new wiki at https://lbne.bnl.gov/wiki/

Computing

From DUSEL
Revision as of 09:07, 28 October 2014 by Bv (talk | contribs)

Notice:

This topic is likely out of date. Instead visit

https://lbne.bnl.gov/wiki


This topic collects information about computing for the LBNE WC (Water Cherenkov) project.

BNL RACF

LBNE has some nodes installed in the BNL RHIC/ATLAS computing facility. Some are interactive (lbne0001 - lbne0003) and some are used for batch running.

Accounts

To get an account:

Note: when requesting an account you will need to choose a user name specific to the LBNE nodes, and it must be 8 characters or fewer.

After getting an initial account you will be instructed to upload an SSH public key which is to be used to get through the SSH gateway machines.

Logging In

To log in, you must first go through the "rssh" gateway machine and then continue on to one of our interactive nodes:

local-shell> ssh USER@rssh.rhic.bnl.gov
[USER@rssh02 ~]$ ssh lbneNNNN
USER@lbne0001:~> 

We currently have 3 interactive nodes, so NNNN is 0001 through 0003.
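The two-hop login can be collapsed into one command with an SSH client configuration entry. This is only a sketch: the function name print_gateway_config is made up for illustration, and it assumes the gateway allows the TCP forwarding that "ssh -W" relies on, which this page does not confirm.

```shell
# Print an ssh_config stanza that routes the interactive nodes
# through the rssh gateway.
# Assumption: the gateway permits TCP forwarding (needed by "ssh -W").
# Usage: print_gateway_config USER
print_gateway_config() {
    cat <<EOF
Host lbne000?
    User $1
    ProxyCommand ssh -W %h:%p $1@rssh.rhic.bnl.gov
EOF
}
```

Appending the output to your local ~/.ssh/config (print_gateway_config USER >> ~/.ssh/config) should then let "ssh lbne0001" go through the gateway in one step. If forwarding is not allowed, the two-step login above still works.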

Initial account setup

Some things you may want to do after your first login:

Add public SSH key

To avoid passwords you should add your public key (id_dsa.pub or id_rsa.pub) to the

~/.ssh/authorized_keys

file. You may have to create this directory and file.
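A minimal sketch of that step is below. The function name install_pubkey is hypothetical, and the permission values (700 for the directory, 600 for the file) are the ones sshd commonly expects rather than anything specific to the RACF.

```shell
# Append a public key to authorized_keys, creating the directory and
# file with restrictive permissions if they do not exist yet.
# Usage: install_pubkey KEYFILE [SSHDIR]   (SSHDIR defaults to ~/.ssh)
install_pubkey() {
    keyfile="$1"
    sshdir="${2:-$HOME/.ssh}"
    mkdir -p "$sshdir" && chmod 700 "$sshdir" || return 1
    touch "$sshdir/authorized_keys" && chmod 600 "$sshdir/authorized_keys" || return 1
    # Skip the append if this exact key line is already present.
    grep -qxF -f "$keyfile" "$sshdir/authorized_keys" ||
        cat "$keyfile" >> "$sshdir/authorized_keys"
}
```

For example, after copying id_rsa.pub to the node, run install_pubkey id_rsa.pub. Running it twice does not duplicate the key.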

Set .forward

If you choose not to read email on the RACF then enter an address to which mail should be forwarded:

echo me@bnl.gov > ~/.forward

Deal with BNL's HTTP proxy

All HTTP outbound connections must go through a proxy for security reasons. Many programs that need to download via HTTP can be told to use the proxy by setting

bash> export http_proxy=http://192.168.1.130:3128/
tcsh> setenv http_proxy http://192.168.1.130:3128/

If FTP access is also needed then the proxy can be stated via:

bash> export ftp_proxy=http://192.168.1.130:3128/
tcsh> setenv ftp_proxy http://192.168.1.130:3128/

Note, just the variable name is different. The same HTTP URL is used.
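For sh-style shells, the two settings can be wrapped in a pair of functions in your login script so the proxy is easy to switch on and off. The function names are made up for illustration; the proxy address is the one quoted above.

```shell
# BNL proxy address as quoted above.
BNL_PROXY="http://192.168.1.130:3128/"

# Export both proxy variables for the current shell and its children.
use_bnl_proxy() {
    http_proxy="$BNL_PROXY"
    ftp_proxy="$BNL_PROXY"
    export http_proxy ftp_proxy
}

# Remove them again, e.g. when the same dotfiles are used outside BNL.
no_bnl_proxy() {
    unset http_proxy ftp_proxy
}
```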

For SVN you will need to run the command once against any (even bogus) repository so that the file ~/.subversion/servers is generated for you, then append the proxy settings to it. The settings belong in the [global] section of that file:

bash> svn co http://example.com/
bash> cat <<EOF >> ~/.subversion/servers
http-proxy-host = 192.168.1.130
http-proxy-port = 3128
EOF

Disk Storage

There are several types of storage on the cluster:

User Directory

This is your "home" directory and is visible throughout the cluster.

/lbne/u/USER

We have very limited space and it is shared by all collaborators. Do not leave large files in your home areas!

Local LBNE-specific storage

Every node owned by LBNE has several large local disks available for local storage:

/dataN

Where N is 0, 1, 2, 3,.... These disks are exported to the rest of the cluster via Xrootd.

Temporary batch home

When using the condor batch system your default current working directory is under /home. You should not count on any files written there to remain after the job is over.

AFS

AFS holds shared software installations. See below.

Software

Batch System

RACF uses Condor

Note: Condor can use your current environment if you use

getenv=True

in the JDF file. Otherwise, your job must take care to set up its own environment properly.
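As a rough illustration, a small JDF with getenv switched on can be generated like this. The function name write_jdf and the output file names are placeholders, and site-specific lines (such as the Requirements shown in the Quick Summary further down) are left out on purpose.

```shell
# Write a minimal Condor job description file (JDF) that copies the
# submitting shell's environment into the job via getenv.
# Usage: write_jdf EXECUTABLE JDF_PATH
write_jdf() {
    cat > "$2" <<EOF
universe   = vanilla
executable = $1
getenv     = True
output     = job.out
error      = job.err
log        = job.log
queue
EOF
}
```

After write_jdf /path/to/your/program test.jdf, the file would be submitted with condor_submit test.jdf.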

FNAL

Getting Started

Information on accessing Fermilab computing resources can be found here. LBNE-specific getting-started information is available on the Intensity Frontier Computing Infrastructure Wiki as well.

To access LBNE machines at Fermilab, you will need five components:

1) A Fermilab ID number

2) A Kerberos principal and password -- "principal" here means a username

3) A Services account -- for submitting Helpdesk requests via the web.

4) An FNALU account which gives you an AFS home area

5) Access to the LBNE machines.

You should get the first three of these when filling out the offsite user's forms or, if present at Fermilab, when you get your Fermilab ID card. The above link should suffice for people resident at Fermilab. The section of the form labeled "Select your new or current Fermilab Experiment, Division or Section affiliation" allows LBNE to be chosen.

Here is another page with much the same set of links for users who are not Fermilab employees:

Offsite User Getting Started Page

It no longer has a form to apply for the FNALU account. This form can be used:

Select the Request FNALU Account link on the page linked here.

but you need to get your Fermilab ID number and your Kerberos Principal first.

Fermilab uses Kerberos authentication. Some reading material on Kerberos and Fermilab's use of it can be found here.

Helpdesk Instructions

If you do not have a services account yet, you can apply for one at this link:

Service Account Application Form

Once you have your FNALU account, e-mail Tom Junk, trj@fnal.gov, and ask for access to the LBNE machines. The AFS home area the FNALU account gives you is needed for this step as it will be your home area on the LBNE machines.

More information can be found in this topic from the near detector wiki. (username "dusel", DocDB password).

Interactive Node at FNAL for LBNE Use

Here is a link to Fermilab's Intensity Frontier Computing Infrastructure Wiki

The recommended nodes for logging in and doing interactive work are lbnegpvm01.fnal.gov and lbnegpvm02.fnal.gov, although in order to get things set up and to submit jobs you will also have to log in to gpsn01.fnal.gov. The old node names gpcf026 and gpcf027 have been retired; these machines are now batch worker nodes and one is gpsn01.

Send an e-mail to trj@fnal.gov asking for your Kerberos principal to be added to the list of people who can log onto lbnegpvm01 and gpsn01. There is a large disk server, called BlueArc, which has data areas for Water Cherenkov work under /lbne/data/water/ and an applications area called /lbne/app/water/. These areas are also visible on flxi05.fnal.gov (one of the FNALU nodes), but it is recommended to use lbnegpvm01 and the batch systems for CPU and I/O intensive work. You are encouraged to create a directory for yourself in /lbne/app/users/ and put your code in there. I do not believe it is backed up, however. All machines -- gpsn01, lbnegpvm01, the local worker nodes, and the Fermigrid machines -- have read/write access to the BlueArc disks.

Batch jobs may be run on resources which have priority allocation to lbne known as the local Condor pool, and users may also submit jobs to Fermigrid. The local lbne Condor pool has of order 70 slots in it. Let's put it to use!

BlueArc Tips

Our areas are /lbne/data/users/make_your_own_directory for data files, and /lbne/app/users/make_your_own_directory for code.

All data areas under /lbne/data/ are set up so that scripts and programs cannot be run from them -- they are mounted no-execute. You will have to run scripts from a directory elsewhere or get the message "Bad Interpreter (whatever shell you tried to use in your script) Permission denied.".

The applications area /lbne/app/users/ is mounted read-only by grid jobs -- don't try to write to it from a job.

AFS Tips

Your home area on FNALU and lbnegpvm01 is in Fermilab's AFS space -- user directories are in /afs/fnal.gov/files/home/room*. Some tips on using AFS are in this guide page. Most important is your token, which is needed to grant you full access to your own files and to other files you have permission to read or modify. At Fermilab, AFS authentication is handled through Kerberos. Your Kerberos ticket has a finite lifetime (often 1 day), so you may find you are unable to modify files a day after logging in because of an expired ticket/token. Refresh with kinit, or log out and back in again. Check your AFS token with the command "tokens". See the manual linked above for more information.
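A small helper along these lines can save some confusion in long sessions. It is a sketch only: the function name refresh_ticket is made up, and it relies on the "-s" option of klist, which prints nothing and reports through the exit status whether a valid ticket is cached.

```shell
# Check for a valid Kerberos ticket and ask for a new one if needed.
# "klist -s" is silent and only sets the exit status.
refresh_ticket() {
    if klist -s; then
        echo "ticket still valid"
    else
        echo "ticket missing or expired, running kinit"
        kinit
    fi
}
```

Calling refresh_ticket before editing files or submitting jobs, followed by tokens to confirm the AFS side, should catch most expired-ticket surprises.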

A useful AFS command is "fs". It is a suite of commands for doing things like checking your quota and default file protections (which are different from the normal Unix file protection mechanism and are not affected by chmod). Examples:

 fs quota

checks your quota, and

 fs la

checks the access control list. Type

 fs help

to get a list of AFS commands available via fs.


GPU-enabled servers

We have two nodes, hpcgpu1.fnal.gov and hpcgpu2.fnal.gov. I got my account from Don Holmgren, but please also contact Tom Junk if you'd like an account. Here's some information from Don about these machines:

 The two nodes share a common home area, and a common resource area (/usr/local).   Each of these
 systems has 4 Tesla GPUs.
 We don't have any documentation posted.  To use the GPUs, you will need to
 do
   export LD_LIBRARY_PATH=/usr/local/cuda/lib64
 We are running CUDA version 4.0.17.  The Cuda tools (e.g., the nvcc compiler)
 are in
   /usr/local/cuda/bin
 so you'll need to add that to your $PATH or add it to your makefiles.
 The Cuda SDK for 4.0.17 is installed at
   /usr/local/NVIDIA_GPU_Computing_SDK-4.0.17
 There are code samples there, and also some prebuilt binaries in
    /usr/local/NVIDIA_GPU_Computing_SDK-4.0.17/C/bin/linux/release
 I recommend making sure that the "deviceQuery" binary in that last 
 directory works, as this will verify that your environment is setup correctly.
 NVidia has lots of documentation at
    http://developer.nvidia.com/nvidia-gpu-computing-documentation
 If you are using OpenCL rather than Cuda, you will find it at
    /usr/local/OpenCL
 To check OpenCL functionality, try
   /usr/local/OpenCL/bin/linux/release/oclDeviceQuery
 Good luck!  If you run into any troubles or have questions, you can send them
 to Don Holmgren or to lqcd-admin@fnal.gov.
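The environment setup Don describes can be collected into one function for a login script. This is a sketch with a made-up function name; unlike the plain export quoted above, it prepends to LD_LIBRARY_PATH instead of replacing it, which is a choice made here and not part of Don's instructions.

```shell
# Add the CUDA tools and libraries to the search paths without
# discarding whatever is already there.
# Assumption: CUDA lives under /usr/local/cuda unless a different
# directory is given as the first argument.
setup_cuda_env() {
    cuda_home="${1:-/usr/local/cuda}"
    PATH="$cuda_home/bin:$PATH"
    LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}
```

After calling setup_cuda_env, running the deviceQuery binary mentioned above is a quick way to confirm the setup.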

Local Condor Batch on lbnegpvm01.fnal.gov and gpsn01.fnal.gov

The batch system is up and running. Condor is available on gpsn01.fnal.gov and specific job submission and monitoring scripts are available on gpsn01.fnal.gov and lbnegpvm01. You can execute the job submission and monitoring commands from lbnegpvm01 and they will ssh in to gpsn01 to perform the actual commands. This has some interesting consequences: If your Kerberos ticket has expired, you may find you do not have privilege to submit or monitor your jobs.


Here are some instructions from Dennis Box who is getting the job submission going. These were instructions for me (trj) -- please change as appropriate. If you do not have write access to the BlueArc space /lbne/app/users/ and /lbne/data/water please put in a helpdesk ticket.

Log in, and

source /grid/fermiapp/common/tools/setup_condor.sh

The submission script is jobsub . It will be in your path after executing the above command.

Try a few 'hello world' type jobs running from /lbne/app/users/trj/ , to do this:

1) copy /lbne/app/users/dbox/test.sh to /lbne/app/users/trj

1.5) export GROUP=lbne (or setenv GROUP lbne for csh users)

2) jobsub /lbne/app/users/trj/test.sh (some_number) will send a sleep job to the pool for (some_number) seconds

jobsub -N (some_other_number) -submit_host gpsn01.fnal.gov /lbne/app/users/trj/test.sh (some_parameter)  

will send (some_other_number) of sleep jobs to the pool, each of which has an internal environment variable ${PROCESS} (from 0 to some_other_number-1) that you could use as an input to your script for setting seed values, etc. If this doesn't work, try

3) The output will go by default to the $CONDOR_TMP directory (/lbne/app/users/condor-tmp/trj). Within job scripts one can pipe stdout from commands to designated logfiles on the output disks -- I normally put my logfiles in the BlueArc data area close to where the simulation output is, to keep them for future reference. Nonetheless, sometimes this area is useful to look in if a job fails in some way which is not recorded in your own logfile. This directory can get littered up with .log and .err files and it is a good idea to clean it out yourself from time to time. An automated e-mail system will let you know if there are old files in there.

My other favorite option is -Q which suppresses the semi-infinite e-mails that come back.

I found that the environment variable HOME is not set by default on the gpwn* nodes, and that I had to define it to some directory (I made /lbne/app/users/trj/tmphome and set HOME to point to it) in order for the GEANT4 setup script env.csh not to crash.
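Both points, the unset HOME and the use of ${PROCESS} for seeds, can be handled at the top of a job script. The function names below are hypothetical, and the fallback directory is whatever you pass in; pick one you own.

```shell
# Make sure HOME points somewhere usable; some worker nodes leave it unset.
# Usage: ensure_home FALLBACK_DIR
ensure_home() {
    if [ -z "${HOME:-}" ]; then
        HOME="$1"
        mkdir -p "$HOME"
        export HOME
    fi
}

# Derive a distinct random seed for each segment from ${PROCESS},
# which is 0, 1, 2, ... when jobs are submitted with jobsub -N.
# Usage: seed_for_segment BASE_SEED
seed_for_segment() {
    echo $(( $1 + ${PROCESS:-0} ))
}
```

In a job script this might look like ensure_home /lbne/app/users/trj/tmphome followed by SEED=$(seed_for_segment 1000), with your own directory and base seed substituted.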

Monitoring and Killing Jobs

I just use condor_q on the appropriate node. Our local batch is set up in two pieces -- one on gpcf026.fnal.gov, the other on gpsn01.fnal.gov. Issuing condor_q after sourcing /grid/fermiapp/lbne/condor_scripts/setup_lbne_condor.sh should get you a list of running jobs. Currently condor_rm has a bug in that it ssh's to a computer on which we have no accounts. Instead just use /opt/condor/bin/condor_rm job_id to kill jobs.

A nice GUI monitor is available on the Redmine wiki page

Monitoring Page

Look for the "gpsn01" links on that page. Two views are currently available -- a condor-style view which does not separate the usage by username, and a Minos-style view which breaks it down by user.

Submitting to FermiGrid from gpsn01.fnal.gov

Here are instructions from Dennis Box on registering your account and using the FermiGrid resources, submitting jobs from our new interactive nodes:

Job output goes to /lbne/app/users/condor-tmp/[username]/ unless you specify otherwise.

One-Time Setup Instructions

You will need to sign up for the LBNE Virtual Organization (VO) and then request a grid certificate. A new web page describing this procedure is

This page on the Intensity Frontier Redmine Wiki

It will guide you through the steps given below as well, with sample grid jobs.

Older documentation/hints: there is a new step of getting a certificate from the Fermilab VOMS service, described in the above-linked Redmine page, which must be done before the request_robot_cert step described below.

1) log on to gpsn01.fnal.gov

2) run the script /grid/fermiapp/common/tools/request_robot_cert . It will respond with something like this (your name not mine):

This script will attempt to register robot cert
/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=gpsn01.fnal.gov/CN=cron/CN=Dennis D. Box/CN=UID:dbox
with vomrs server for  vomrs/vo-fermilab
when this registration is approved, you will be able to
submit to the grid from host gpsn01.fnal.gov
possible user actions:
[p]roceed with registration for submission from gpsn01.fnal.gov
[c]hange host, I want to submit from a different machine
[q]uit, get me out of here!

 please enter [p/c/q]:


3) enter 'p' to proceed with the registration. Someone from the Fermigrid department will approve your request manually, most likely the same day; you *should* get email when this is done.

4) do a kcroninit on gpsn01 and follow the instructions. When this is done, if you do not yet have a directory called /scratch/[yourusername] make it, and in it make a directory called grid:

mkdir /scratch/[yourusername]
mkdir /scratch/[yourusername]/grid

add the following line to your crontab on gpsn01

 17 */2 * * * /usr/krb5/bin/kcron  /scratch/grid/kproxy lbne > /dev/null 2>&1

using the crontab command. It is easiest to set the environment variable EDITOR to your favorite editor, and type

crontab -e

and put this in with the editor that comes up.
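If you would rather not use an editor, the entry can be added non-interactively. The sketch below uses a made-up helper, add_cron_line, which only filters text and never touches the crontab itself, so it is safe to try out. It assumes the entry contains no backslashes, because awk -v interprets escape sequences.

```shell
# Read an existing crontab on stdin and write it to stdout, appending
# the given entry only if an identical line is not already present.
# Usage: add_cron_line 'ENTRY'
add_cron_line() {
    awk -v entry="$1" '
        { print }
        $0 == entry { found = 1 }
        END { if (!found) print entry }
    '
}
```

A possible way to apply it is crontab -l 2>/dev/null | add_cron_line 'ENTRY' | crontab - with ENTRY replaced by the kcron line above. Running it a second time leaves the crontab unchanged.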

After this, the script jobsub (not lbne_jobsub) with the -g option will send your jobs to the grid instead of the local batch system. Dennis suggests using -pOFF to work around a problem at this stage. I use commands like the following to submit 100 segments to Fermigrid:

 export GROUP=lbne
 . /grid/fermiapp/common/tools/setup_condor.sh
 jobsub -pOFF -g -Q -N 100  /lbne/app/users/trj/mar23_2011/runjob_atmoscyl.sh 

Here /lbne/app/users/trj/mar23_2011/runjob_atmoscyl.sh is a job script to run 100 times. The jobs run under username lbneana, which is part of the same group as our lbne user accounts, but it is not your account. So make sure your script is group runnable -- chmod g+x jobscript. Similarly, if you write output files directly to the /lbne/data/ areas, make sure you set the permissions on the output directory so that users in your group can write to it. I found I had to execute, for example,

chmod g+w /lbne/data/water/monoenergetic_15percent_hqe

in order to allow lbneana on fermigrid to write directly to this directory. Output files created by your job will also be owned by lbneana and will not be group writeable (or deletable) by default. Be sure to chmod g+w output files as well within your job script so you can delete them later interactively if you need to. If you forget to do this, no problem -- you can always submit a job to the grid to do it as user lbneana later.
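Inside a job script, the permission handling could look like the following. The function names are made up, and using umask 002 so that new files start out group-writable is a suggestion added here, not something the instructions above prescribe.

```shell
# Call early in the job script so that newly created files and
# directories are group-writable from the start.
open_group_umask() {
    umask 002
}

# Call at the end of the job for anything that was still created
# with tighter permissions.
# Usage: make_group_writable OUTPUT_DIR
make_group_writable() {
    chmod -R g+w "$1"
}
```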

Installed Software on the Fermilab Machines

WCSim, ROOT, and Andy's Reconstruction, and Morgan and Marc's Event Display

To work on WC Simulation and reconstruction, you need GEANT4 and ROOT software. These have been built with the native compiler on lbnegpvm01. The versions set up are ROOT 5.30/00 AND GEANT4.9.4P02, compiled with g++ V4.1.2. I have also compiled WCSIM with SVN version 845 (fetched October 17, 2011, to include the 200 kton 12" tube options). WCSim has been compiled with G4DEBUG set to 0. The compiled code, data files, and libraries are in /lbne/app/users/trj/wcsimsoft. To set up environment variables to point to this code, type

 source /lbne/app/users/trj/wcsimsoft/setup.sh

for sh-style shells (sh, bash), or

 source /lbne/app/users/trj/wcsimsoft/setup.csh

for csh-style shells (csh, tcsh). Note that the environment variable G4WORKDIR is set to point to /lbne/app/users/trj/wcsimsoft/WCSIM/trunk when you do this; if you'd like to build your own WCSim, you will probably need to point G4WORKDIR at your own working directory afterwards.

After some tricks with makefiles and building a local copy of the glui library, the event3D display is now installed on lbnegpvm01 and lbnegpvm02. Just source the appropriate setup script above, and run

 /lbne/app/users/trj/wcsimsoft/event3D/runme /lbne/data/water/sim200kton_20111017/singleparticle/rootfiles/muon_plus_004800MeV_200kton.0009.root

as an example to display some high-energy muons. The titlebar menu doesn't appear, so to quit out of the program, you'll have to control-c it in the window you started it.

GLoBES

GLoBES v3.0.11 and gsl (GNU Scientific library) v.1.9 are also installed as non-root installations in the /lbne/app area.

 source /lbne/app/users/trj/globes/globes_setup.sh

for sh-style shells (sh, bash), or

 source /lbne/app/users/trj/globes/globes_setup.csh

for csh-style shells (csh, tcsh). These scripts just set up the LD_LIBRARY_PATH and PATH variables to use the installed GLoBES libraries and executables. Look through the directories under /lbne/app/users/trj/globes to find useful examples of how to build GLoBES executables. In particular, /lbne/app/users/trj/globes/globes/globes-3.0.11/bin/globes-config (which should be in your path after calling one of the setup scripts above) contains compiler and linker options and is useful when creating makefiles. See the examples and the makefile (tested that they build) in /lbne/app/users/trj/globes/globes/globes-3.0.11/examples


UPS/UPD

Fermilab has a Unix Product Distribution system, called UPS/UPD, which has several versions of ROOT and GEANT4 available for use. Not all of these are compatible with lbnegpvm01's compiler, or with each other. The main ups areas are on AFS and are thus not visible to the batch workers. Batch workers have access to a subset of UPS/UPD software. There's also a "stealth" UPS maintained in the LBNE areas by the Liquid Argon team.

I found that I had to include this line in my batch job script on the local cluster in order to get access to the setup command described below to access products:

 source /afs/fnal.gov/ups/etc/setups.sh

and if you're using csh in your job script, use

 source /afs/fnal.gov/ups/etc/setups.csh

and on Fermigrid, it's

 source /grid/fermiapp/products/lbne/etc/setups.sh

and

 source /grid/fermiapp/products/lbne/etc/setups.csh

for sh and csh, respectively.
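A job script that may run either on the local cluster or on Fermigrid can pick whichever setup file is visible. This is a sketch for sh-style scripts: the function name first_existing is made up, and trying the /grid path before the AFS path is an arbitrary ordering chosen here.

```shell
# Print the first candidate file that exists and is readable.
# Returns non-zero if none of them is visible on this machine.
first_existing() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible use in a job script (commented out, since the paths only
# exist on the Fermilab machines):
#   ups_setup="$(first_existing \
#       /grid/fermiapp/products/lbne/etc/setups.sh \
#       /afs/fnal.gov/ups/etc/setups.sh)" && . "$ups_setup"
```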

Fermilab's computing division distributes commonly used software packages via the UPS/UPD tools. For example, ROOT and GEANT4 are installed as packages in the UPS databases. ROOT runs out of the box in many versions. To list them, type

 ups list -a root

which returns a long list of possible root versions you can run. Type

 setup root v5_26_00 -q GCC_3_4_6

for example to get a recent version's symbols set up. The actual code resides on afs. There are several installations of GEANT4 as well

 ups list -a geant4

although evaluation is underway to figure out the best way to pick a GEANT4 version. So far, I (trj) have been packaging up all the root and GEANT4 libraries I need in a tarball I build on my desktop along with WCSim, but there are alternatives such as GARPI and the LAr tools.

Tips for handling FNAL Kerberos

It is easiest to copy Fermilab's krb5.conf file into /etc/krb5.conf -- make sure you make a backup copy of yours first. If you use Kerberos for other work, you may need to keep your own krb5.conf file and add Fermilab's servers to it, as follows.

System Level

Introduce the FNAL.GOV Kerberos realm. Edit:

/etc/krb5.conf

Add:

[realms]
FNAL.GOV = {
        kdc = krb-fnal-1.fnal.gov
        kdc = krb-fnal-2.fnal.gov
        kdc = krb-fnal-3.fnal.gov
        kdc = krb-fnal-4.fnal.gov
        kdc = krb-fnal-5.fnal.gov
        admin_server = krb-fnal-admin.fnal.gov
}

User level

Edit:

~/.ssh/config

to hold:

host *.fnal.gov
  PubKeyAuthentication no
  GSSAPIAuthentication yes
  GSSAPIDelegateCredentials yes

Then get a Kerberos ticket

kinit -f <PRINCIPAL>@FNAL.GOV

Log in:

ssh <USER>@<HOST>.fnal.gov

Quick Summary -- BNL

  • To get the public key
  mkdir ~/.ssh  
  chmod 700 ~/.ssh
  ssh-keygen -q -f ~/.ssh/id_rsa -t rsa
  • To log in to a BNL machine
  ssh -Y -A -t <username>@rssh.rhic.bnl.gov ssh lbneNNNN
  • To access svn code
  emacs ~/.subversion/servers 
  Edit/Add following lines in [global]
  [global]
  http-proxy-host = 192.168.1.130
  http-proxy-port = 3128
  • To start and set environment
  mkdir ~/geant4                                                          
  setenv G4WORKDIR $HOME/geant4
  klog
  ls /afs/rhic.bnl.gov/lbne/software/trunk/                             # You should always pick the latest *.csh file
  source /afs/rhic.bnl.gov/lbne/software/trunk/setup-2010-06-10.csh     
  setenv PATH ${G4WORKDIR}/bin/${G4SYSTEM}:$PATH
  • Get the latest version of WCSim
  cd ~/geant4
  mkdir WCSim
  svn co http://svn.phy.duke.edu/repos/neutrino/dusel/WCSim/trunk WCSim
  cd WCSim
  make clean
  make rootcint
  make
  make shared
  # check if all the paths are set 
  echo $G4INSTALL
  echo $G4SYSTEM 
  echo $CLHEP_BASE_DIR 
  echo $ROOTSYS 
  echo $PATH 

You are ready to work with WCSim.

  • To run the condor jobs, please follow the steps:
  1) Provide the complete paths of jobOptions.mac and tuningParameters.mac in the WCSim.cc file and run make.
  2) If your novis.mac/vis.mac requires an input file, the full path to the file should be provided.
  3) For the output file, if you would like it to be in the /dataN/lbne/<username> directory, there is no need to provide the full path.
  4) After modifying, make sure that you are able to run the jobs locally.
  5) Now, make an example condor.jdf file with the following contents
       Universe        = vanilla
       Notification    = Complete
       Executable      = /direct/lbne+u/<username>/geant4/bin/Linux-g++/WCSim
       Requirements = (CPU_Experiment == "lbne")
       Rank = (CPU_Experiment == "lbne")*10
       Priority        = +20
       GetEnv          = True
       #Initialdir      = /dataN/lbne/<username>        # IMPORTANT: Uncomment these three lines if you want to store the data in /dataN/lbne directory
       #Should_Transfer_Files = YES
       #When_To_Transfer_Output = ON_EXIT
       # Full path for the input file to WCSim, i.e. novis.mac
       Arguments       = "/direct/lbne+u/<username>/WCSim/<filename>"
       Input           = /dev/null 
       Output          = jobstatus.out
       Error           = jobstatus.err
       Log             = jobstatus.log
       # If you want to be notified about the completion of the job
       Notify_user     = <email-id>
       +Experiment     = "lbne"
       +Job_Type       = "cas"
       Queue 1
  6) To submit the job:
  
       condor_submit condor.jdf
 
  7) To view status of the job:
 
       condor_q <username>


  • In case you want to get data from a website outside BNL
 setenv http_proxy http://192.168.1.130:3128/
 setenv ftp_proxy http://192.168.1.130:3128/