BNL RACF Cluster

RACF is the BNL RHIC and ATLAS Computing Facility cluster. See the RACF Announcements page for facility news.

Daya Bay has purchased nodes on RACF: daya0001, daya0002, and daya0003, referred to as dayaNNNN below.

News

Accounts

To get an account:

Note: when requesting an account, you will need to choose a user name specific to the Daya Bay nodes, and it must be 8 characters or less.

After getting an initial account you will be instructed to upload an SSH public key which is to be used to get through the SSH gateway machines.

The default shell is tcsh. To change your shell, send a request to RT-RACF-UserAccounts@bnl.gov .

For other requests and problems consult this link.

Log in

 ssh USER@rssh.rhic.bnl.gov
 ssh dayaNNNN

where NNNN = 0001 or 0002 (daya0003 is batch-only)

If you will need to use klog, then you need to enter your kerberos password on the SSH gateway at least once:

ssh USER@rssh.rhic.bnl.gov
kinit
 - enter password as prompted -
ssh dayaNNNN

You (probably) only need to do this once.

After First Log In

After you first log in, perform these one-time setups:

Access NuWa

  • Add a line to your setup scripts so you can use NuWa
bash> echo 'source ~dayabay/nuwaenv.sh' >> ~/.bash_profile
tcsh> echo 'source ~dayabay/nuwaenv.csh' >> ~/.login

Join Mailing List

To be informed about and/or ask questions about Daya Bay related things on RACF, join this list:

https://lists.bnl.gov/mailman/listinfo/dayabay-racf

(you will also automatically be placed on a list for RACF-wide information)

Monitoring

Ganglia monitoring of Daya Bay's disks and CPUs in RACF (requires password)

NuWa Releases

Various releases and the trunk are centrally maintained. See below for status of the various releases.

To set up a base release, use nuwaenv by adding a line to your .bashrc or .cshrc file like:

source ~dayabay/nuwaenv.sh  # for bash
source ~dayabay/nuwaenv.csh # for tcsh

This defines a function/alias called "nuwaenv". To see what options it has do:

nuwaenv -h

To set up the default (not always latest) release:

nuwaenv

Each available release can be explicitly set up:

nuwaenv -r X.Y.Z

The latest available release is available via:

nuwaenv -r latest

The rarely updated "trunk" is set up via:

nuwaenv -r trunk

Optionally, specify "-g" or "-O" for debug/optimized versions:

nuwaenv -r latest [-g|-O]

Setup NuWa release by hand

In case nuwaenv is not configured for a release (or you want to use a release or a daily build that is not known to nuwaenv):

$ cd /afs/rhic.bnl.gov/dayabay/software/releases/NuWa-3.9.2/NuWa-3.9.2/
$ source setup.sh
$ cd dybgaudi/DybRelease/cmt/
$ source setup.sh

That's it.

NuWa Daily Build

Based on Simon Blyth's master/slave automatic dybinst installation setup on the daya0001 machine, the latest successful build is copied into the AFS dailies area at 3:15 a.m. every day. Seven copies are kept, one per weekday name, and each is retained for a week. Due to disk space limitations, only the optimized versions are kept for the dailies. Thanks to the keytab AFS token method suggested by John McCarthy, the AFS token is automatically renewed at the beginning of the daily cron job and released at its end. Therefore, do not run jobs that need AFS tokens overnight under the dayabay account; the token will be released at about 3:30 a.m. by the daily cron job.

The dbg/opt slave copy pools are located at:

 daya0001:/data4/slave_install/nuwa_builds   

The daily copies are located at:

 /afs/rhic.bnl.gov/dayabay/software/dailies/<day>  [<day>: monday/tuesday/....]

You can check each day's revision number by looking at the name of the NuWa copy directory inside each day's subdirectory.
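
For example, to see which revision a given day's copy holds, a quick listing is enough. This sketch assumes the copy directory name follows the NuWa-REVISION pattern used in the builds area described below (monday is just an example):

 ls -d /afs/rhic.bnl.gov/dayabay/software/dailies/monday/NuWa-*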

To set up a daily build, if you are using nuwaenv, just type

 nuwaenv -r <day> [<day>: monday/tuesday/....]

Or you can go to the dailies AFS area and source the CMT setup scripts inside the <day> subdirectory (see the sketch below). <day> must be all lowercase.
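
If you prefer to set up a daily by hand, the same sequence used for a release should work. This is only a sketch: it assumes the copy under <day> has the same internal layout as a release, which may need adjusting.

$ cd /afs/rhic.bnl.gov/dayabay/software/dailies/monday/NuWa-*/
$ source setup.sh
$ cd dybgaudi/DybRelease/cmt/
$ source setup.sh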

If you are going to run some production or long-term job, please use the snapshots in the

 /afs/rhic.bnl.gov/dayabay/software/dailies

directory.

Brett's scripts to find daily build logs

It drives me crazy trying to find the log for a failed bitten build on
RACF so I wrote a little script to help me.  Feel free to use it.

# opt log by SVN revision:
~bvdb/share/slave-log.sh r13205

# dbg log by SVN revision
~bvdb/share/slave-log.sh r13205 dbg

# log by build number
~bvdb/share/slave-log.sh 11123

-Brett.

Info on daily builds

The logging output for builds can be found at

 /data4/slave_install/dybinstall/logs/dybinst/BUILD_REVISION

where BUILD is the build number and REVISION is the revision that triggered the build. The usual log file from dybinst is in dybinst.log. There are also other log and output files from the stages of the build.

The builds are stored in

/data4/slave_install/nuwa_builds/NuWa-REVISION_slvmon_fails_1

for failed debug builds and

/data4/slave_install/nuwa_builds/NuWa-REVISION

for successful debug builds. Optimized builds are in analogous subdirectories under

/data4/slave_install/nuwa_builds/opt

and

/data4/slave_install/dybinstall/opt/logs/opt.dybinst
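
To check whether a particular revision built successfully, a listing against the naming convention above is enough. The revision number here (13205, taken from the slave-log.sh example) is purely illustrative:

 ls -d /data4/slave_install/nuwa_builds/NuWa-13205*
 ls -d /data4/slave_install/nuwa_builds/opt/NuWa-13205*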

Non-NuWa software

On 8 Nov 2010, Brett wrote

 I've gotten fed up with the default emacs on RACF and have installed v23.
If you want to use it add /afs/rhic.bnl.gov/dayabay/software/trunk/fs/bin
to the head of your PATH.  I'll entertain ideas for other software to put there.
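
To pick it up, prepend that directory to your PATH in your shell startup file. A bash sketch (tcsh users would use setenv/set path instead):

export PATH=/afs/rhic.bnl.gov/dayabay/software/trunk/fs/bin:$PATH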

Disk

The various disk areas are described below.

User home disk and quotas

  /dayabay/u/USER

There is a TOTAL of 75GB for ALL Daya Bay users that is backed up. This quota was increased from 25GB to 75GB on 4 Nov 2010. Individual user quotas were implemented at about the same time. The default quota for new users is 100MB. The quota can be increased to 5GB upon a valid request from the user.

To see your user quota (some additional RACF documentation):

  quota -s

Users should not store data in /dayabay/u/USER. Instead, users should make extensive use of the

  /home

area, which is scratch space on each dayaNNNN machine and is used by condor, the batch processing system on RACF. It is not backed up.

Users should also make extensive use of the /dataN scratch disks and xrootd.

Daya Bay scratch disks

  /dataM

on each dayaNNNN, where M=0,...,5. Not backed up.

See also:

* Rsync for copying between nodes

MDC09b data (generated on RACF)

MDC09b data generated on RACF is at

daya0001:/data0/djaffed/MDC09b/output/

MDC09b muon sample

daya0002:/data1/data/sim/MDC09b/muon

Mini dry run data

MiniDryRun data is copied to

daya0002:/data1/data/exp/2009-1010.MiniDryRun

It is also available via Xrootd. See below.

Dry run data

Automated sync with rafiman and Xrootd

The rafiman utility is used to scp .data files from PDSF, convert them to .rraw, and load them into Xrootd. To do a sync, log in as dayabay@daya0001 and from the home directory do

nuwaenv -r trunk
nuwaenv -p rafiman
rafiman -u <PDSFUSER> sync <FIRSTRUN> <LASTRUN+1> 

You can see details of the sync in the rafiman.log file. The final locations of all synced files will be printed like:

root://daya0001//xrootd/rraw/2010/TestDAQ/NoTag/0831/daq.NoTag.0005032.ADCalib.SAB-AD1.SFO-1._0001.rraw

In general, the path after

root://daya0001//xrootd/rraw/

mirrors the PDSF location.
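
As an illustration of the mirroring, using the file from the example above (the PDSF prefix shown is assumed from the copy examples further down; only the part after it is mirrored):

# PDSF source (.data):
/eliza7/dayabay/data/exp/dayabay/2010/TestDAQ/NoTag/0831/daq.NoTag.0005032.ADCalib.SAB-AD1.SFO-1._0001.data
# corresponding Xrootd URL after the sync (.rraw):
root://daya0001//xrootd/rraw/2010/TestDAQ/NoTag/0831/daq.NoTag.0005032.ADCalib.SAB-AD1.SFO-1._0001.rraw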

Some manual copying

Note: please use the rafiman approach. The .rraw files in Xrootd are smaller and can be accessed from all nodes in the RACF.

Select dry run data can be found under

daya0001:/data1/dry-run/2010/

Data from 0617 to 0626 are copied there. The subdirectories should match PDSF. Example of the copy:

bvdb@daya0001:NoTag> pwd
/data1/dry-run/2010/TestDAQ/NoTag
bvdb@daya0001:NoTag> scp -r bviren@pdsf.nersc.gov:/eliza7/dayabay/data/exp/dayabay/2010/TestDAQ/NoTag/0618 .

or

rsync -av pdsf.nersc.gov:/eliza7/dayabay/data/exp/dayabay/2010/TestDAQ/NoTag/0626 ./

If you do not find some data you want, contact Brett or David.

Muon generator data

Pre-simulated underground muon and SAB muon profiles are in

/afs/rhic.bnl.gov/dayabay/software/trunk/opt/NuWa-trunk/data

Scratch for AFS releases

The build/ and tarFiles subdirectories for externals for AFS-based releases, including trunk, are found under

 daya0001:/data4/dayabay/software/

with a parallel structure to that found under

 /afs/rhic.bnl.gov/dayabay/software/

Softlinks are made from AFS to the appropriate subdirectory in /data4.

AFS

General AFS-related information

AFS provides disks that are globally visible and can serve as the central location for Daya Bay software. RACF management prefers to keep the AFS directories and files small. AFS has special rules for creation, deletion, etc. of files.

klog

Most users will NOT need to alter the Daya Bay content in AFS. If you do, then use klog to obtain a kerberos token:

 klog 

You will be prompted for your password. To view your tokens:

 tokens

To release your tokens:

 unlog 

Tokens remain valid for 1 week by default. You need a token to avoid the following:

$ svn update
svn: Can't open file '.svn/lock': Permission denied
klog as user dayabay
  • Secret knowledge from Jiajie to obtain tokens. As user dayabay:

/usr/kerberos/bin/kinit -k -t /dc/krb5.keytab_dayabay dayabay
/usr/bin/aklog 
  • These commands are in /dayabay/u/dayabay/supervisor/daily.sh

kinit

The klog procedure above gives you a token on the node (i.e. daya0002) that you are on. You can get a token valid on all dayaNNNN nodes by using kinit. (Note that this usage apparently contradicts the kinit documentation). Information from John McCarthy:

Before you log onto either system, log onto one of the SSH gateway systems and run
kinit
Enter your kerberos password and then log onto daya0001 and/or daya0002
and you should be set.  Even if you log onto the same gateway system in
another window and then log onto either of those systems you will still be set.

More information from John McCarthy:

kinit does not give you an AFS token.  If you were to log on to daya0001  and run
kinit
you would still not have an AFS token.  Also, running
kinit
on the gateway does not get you an AFS token.  It is the act of running kinit 
on the gateway and then ssh'ing onto daya0001 that gets you the AFS token 
because there is a PAM module that the farm has installed that gets this for
you.  It is the PAM module that is getting you the token.  I can understand the
confusion though.

Access Control List

AFS access is not controlled by unix permission bits, but by an Access Control List (ACL). The following example is cribbed from a Robert Hatcher MINOS-software email:

$ pwd
/afs/rhic.bnl.gov/dayabay/software
$ fs listacl
 Access list for . is
 Normal rights:
 dayabay rlidwk
 system:administrators rlidwka
 system:anyuser rl
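
For reference, ACLs are changed with fs setacl rather than chmod. This is only a sketch: the user name and rights are placeholders, and you need a token with administrative rights on the directory:

 fs setacl -dir . -acl USER rlidwk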

NuWa releases on AFS

For installation of optimized code for NuWa-XXXX release, add

macro host-optdbg 'opt'

to

NuWa-XXXX/setup/default/cmt/requirements

before CMTCONFIG is defined. This will allow execution of setup.sh to properly load externals.

Release build checklist

  • First 4 steps below are done by /afs/rhic.bnl.gov/dayabay/software/releases/prepare_release.sh RELEASE
  1. Get tokens
  2. Set CMTCONFIG
  3. Set proxy
  4. Make and link scratch directories for build
  5. dybinst RELEASE all
  6. Make opt by default
    • cd RELEASE/setup/default/cmt/
    • echo "macro host-optdbg 'opt' " >> requirements
  7. dybinst RELEASE tests
  8. Edit /dayabay/u/dayabay/python/nuwaenv/configs/racf-dot-nuwaenv.cfg to include new release

NuWa/Daya Bay structure in AFS

  1. RACF NuWa-1.5.0 Status
  2. RACF NuWa-1.6.0 Status
  3. RACF NuWa-1.6.1 Status
  4. RACF NuWa-1.6.2 Status
  5. RACF NuWa-1.7.0 Status
  6. RACF NuWa-1.8.0 Status
  7. RACF NuWa-1.8.1 Status
  8. RACF NuWa-3.0.0 Status
  9. RACF NuWa-3.1.0 Status
  10. RACF NuWa-3.3.0 Status
  11. RACF NuWa-3.5.0 Status
  12. RACF NuWa-3.6.0 Status
  13. RACF NuWa-3.7.0 Status
  14. RACF NuWa-3.7.1 Status
  15. RACF NuWa-3.7.2 Status
  16. RACF NuWa-3.7.3 Status
  17. RACF NuWa-3.8.0 Status
  18. REMINDER: FOR EACH RELEASE, BE SURE TO UPDATE ~dayabay/.nuwaenv.cfg

To install, one must be user dayabay and get an AFS token via:

klog -principal USERNAME

Old location of NuWa releases on AFS

Central location for NuWa trunk and releases. It is doled out in 15GB parcels. Main directory is

  /afs/rhic.bnl.gov/x8664_sl5/opt/dayabay
  • Presently trunk and 1.5.0 are installed at
  /afs/rhic.bnl.gov/x8664_sl5/opt/dayabay/NuWa-XXX
where XXX is trunk or 1.5.0. Note that trunk is subject to frequent (daily) updates and rebuilds based on djaffe's whim. You don't need to use tokens to do the usual NuWa setup dance.

Tips

Due to various reasons, working on RACF can be more challenging than on your own workstation. This section collects some tips.

Copying data from PDSF

You can use scp to copy data in the usual way. SCP is rather slow. See the topic on using GlobusOnline/GlobusConnect for a much faster way.

Transparent Pass Through of SSH Gateways

The rssh.rhic.bnl.gov SSH gateways provide a very restrictive shell. After logging in, there is not much one can do except ssh to an internal node. Many of the usual SSH tricks to transparently pass through the intermediate gateway will not work. This one will:

ssh -A -t USER@rssh.rhic.bnl.gov ssh daya0001

If you plan to make multiple connections you can reuse the first connection to the gateway via:

ssh -o ControlMaster=auto -A -t USER@rssh.rhic.bnl.gov ssh daya0001

Unfortunately, the same does not work for the leg from rssh to daya0001.
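
If you use this often, a small alias in your bash startup file saves typing. The alias name is an arbitrary choice:

alias racf='ssh -o ControlMaster=auto -A -t USER@rssh.rhic.bnl.gov ssh daya0001'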

Edit Files With a local Emacs via TRAMP

The GNU emacs that comes on the RACF nodes is from the mid Pleistocene and lacks enough editing modes that it is somewhat painful to use (note: a newer emacs is now available, see Non-NuWa software above).

Thankfully Emacs (and Aquamacs) has a nice remote editing mode called TRAMP but unfortunately it is foiled by the gateway. However, it can be used with the file copy gateways like so:

C-x C-f /ssh:USER@rftpexp.rhic.bnl.gov: <TAB> <TAB>

After the <TAB> TRAMP will log in and display the contents of your home directory. You can then find your file as usual. The only difference is a little more latency. If you need to look for something starting at "/" then just add "/" like:

C-x C-f /ssh:USER@rftpexp.rhic.bnl.gov:/ <TAB> <TAB>

Small notes of caution:

TRAMP connection method

It is possible to leave off the "ssh:" method name. TRAMP will then default to whatever it has been configured to use. In particular, if it uses the "scp" method, it will make a new connection each time. Besides slowing down saves, it may run afoul of any connection rate limiters.

Sticky TRAMP prefix

You may find that, after visiting a TRAMPped file, the TRAMP prefix is "sticky" when you want to find a new file on your local file system. Emacs will auto-complete with <TAB> on local files but then, when creating/visiting, it will do so on the remote file system. To negate this misfeature you need to kill the prefix via the following:

C-x C-f C-a C-k ~/path/to/file/you/want

Accessing non-network disks

The rftpexp server used in the example above does not provide access to the /data disks. One trick to gain access to non-network disks is the following:

  • Configure so that you can SSH from daya0001 directly to your workstation in one logical hop (if your WS is behind a firewall see this page for how to handle it).
  • Set up a **remote** port forwarding **from** your workstation **to** daya0001
ssh -R2222:localhost:22 WS_USER@WORKSTATION
  • Use a Tramp URL like:
/ssh:RACF_USER@localhost#2222:/data0/

Note: if TRAMP fails to connect it may not display what the problem is. You can check that the tunnel is okay by directly connecting to it like:

ssh -p 2222 RACF_USER@localhost

Double Tunnel for Working from Home

The above won't work if you cannot SSH directly from RACF to the computer at which you are working. This may happen, for example, if you are working from home or otherwise behind some firewall. If there is an intermediate computer that you can SSH to from both RACF and "home", then you can join two tunnels. To do this run:

racf> ssh -R2222:localhost:22 INTERMEDIATE_USER@INTERMEDIATE
home> ssh -L2222:localhost:2222 INTERMEDIATE_USER@INTERMEDIATE

Then use the same TRAMP URL as above.

Setting a bookmark for easy return

The TRAMP URL can be a pain to remember. To help with this, you can use Emacs bookmarks to remember your favorite files or directories. Simply open a file or directory once and then do "C-x r m" to bookmark it. Later you can jump to a bookmark with "C-x r b" or list all your bookmarks with "C-x r l". In the listing, click on the one you want to jump to it.

Mounting RACF Filesystem through SSH

Using the FUSE filesystem client sshfs, you can mount any filesystem that you can see when logged into RACF onto your local workstation. Set up a "-R" tunnel as above, then mount a directory like:

sshfs -p 2222 RACF_USER@localhost:/racf/path /some/local/path
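
When you are finished, unmount it again. On Linux the FUSE helper does this (on a Mac, plain umount is typically used instead):

fusermount -u /some/local/path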

Optimize for GIT via SSHFS

Only the working directory needs to be on the RACF filesystem. Put .git on the local filesystem to speed up access. After mounting as above:

cd /some/local/path/.../some/racf/path
git init mystuff 

or, more probably:

git svn init http://dayabay.ihep.ac.cn/svn/dybsvn/people/ME ME

then move the resulting .git directory to the local filesystem and symlink it back:

mv .git ~/.racf-people-me.git
ln -s ~/.racf-people-me.git .git

Condor

Some handy condor commands.

Submit job specified by file JDF

  condor_submit JDF

To see the ClassAD prepared for condor

  condor_submit -verbose JDF

Status of submitted jobs

  condor_q

Where are my jobs running?

  condor_q -run

Whose job is running on the dayaNNNN machines?

 condor_q -global -run | grep daya

Status of the Daya Bay machines. Wildcards are apparently not allowed for this command.

  condor_status daya0001 daya0002 daya0003


JDF (job description file) guidance

Example JDF from MDC09b and scripts and files used to create JDFs for MDC09b

Guidance from Chris Hollowell on the coupling of Requirements and Rank statements

General queue JDF (will only prefer to run on your hosts):

 Rank = (CPU_Experiment == "dayabay")*10

Non-general queue JDF (will force execution on your hosts):

 Rank = (State == "Unclaimed")
 Requirements = (CPU_Experiment == "dayabay")

Job submission guidance

For nuwa.py jobs that produce .root and .sem output files, condor returns those files to the directory from where condor_submit was executed. If you condor_submit from daya0001:/data2/UserName/, for example, then you have the full resources of the scratch disk available and you won't fill up the puny user disk.

Here is an example that will allow you to submit from the scratch disk. The relevant lines are

# disposition of output files
should_transfer_files    = YES
when_to_transfer_output  = ON_EXIT_OR_EVICT
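
For completeness, here is a minimal sketch of a full JDF built around those lines. The executable and argument are hypothetical, and the Rank line follows the general-queue guidance above:

# minimal example JDF (run.sh and its argument are placeholders)
universe                 = vanilla
executable               = run.sh
arguments                = 0005032
output                   = job.out
error                    = job.err
log                      = job.log
Rank                     = (CPU_Experiment == "dayabay")*10
should_transfer_files    = YES
when_to_transfer_output  = ON_EXIT_OR_EVICT
queue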

xrootd

For general help with xrootd see the Xrootd topic.

xrootd on RACF

See also the general Xrootd topic. RACF specifics:

  • We access Xrootd through daya0001. Behind the scenes it may redirect requests to others.
  • We have two xrootd disk servers on daya0003 (130.199.205.53) and daya0002 (130.199.205.52).
  • To write to a file from nuwa.py, all intermediate directories must exist on all xrootd servers. You can ensure this by adding "?mkpath=1" to the end of the URL (see the example after the listing script below)
  • Currently to check what data exists one must list a directory on all nodes. This can be accomplished with a script like:
#!/bin/bash
# List a directory on each xrootd node (requires the POSIX preload library).
export LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so
export XROOTD_VMP=daya0001:/xrootd/
for n in 1 2 3
do
  ls -l root://daya000$n/$1
done
# Use the script like:
xroot-ls /xrootd/path/to/my/directory
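
Regarding the "?mkpath=1" option mentioned above, a hypothetical output URL would look like this (the path after /xrootd/ is made up for illustration):

root://daya0001//xrootd/user/MYDIR/output.root?mkpath=1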

Removing xrootd files

  • Users can only delete their own files.
  • Brett reported that it is very slow to "rm" files with the POSIX preload library.
    • An alternative is to overwrite files with a zero-length file. Ofer has confirmed that Xrootd 'knows' that the files are indeed zero length.
    • Files on the daya0002 server can be directly deleted, but it is not clear if Xrootd will 'know' about it.

Mini Dry Run Data in .rraw format

The Mini Dry Run data in .rraw format has been loaded into Xrootd. It is available via a URL like:

root://daya0002//xrootd/RawData/MiniDryRun/YYYY/MMDD/FILENAME.rraw

for example:

root://daya0002//xrootd/RawData/MiniDryRun/2009/1213/daq.NoTag.0000200.Physics.SAB-AD1.SFO-1._0001.rraw

The latest NuWa can use these files like:

nuwa.py [...] root://daya0002//xrootd/RawData/MiniDryRun/YYYY/MMDD/FILENAME.rraw

For reference, I loaded the files via:

cd /data1/data/exp/2009-1010.MiniDryRun/
for n in $(find . -name '*.rraw' -print); do 
  file=$(echo $n| sed -e 's#\./##')
  url=root://daya0001//xrootd/RawData/MiniDryRun/$file
  xrdcp $n $url
done

(Note: I (bv) wrongly used daya0002, so the files did not get spread out. The above is what I should have done.)

Some xrootd links

Tools

Google PerfTools

General usage instructions of PerfTools are described in Dealing_With_Memory_Leaks. Below are RACF-specifics.

  • Google Perftools is installed in AFS at:
 /afs/rhic.bnl.gov/rcassoft/x8664_sl5/google_perftools

You must have an active AFS token to access the shared libraries here.

  • The intermediate 'dot' program needed to generate final PS/PDF files is not available on RACF. You can have pprof explicitly make the intermediate dot file like:
pprof [...usual options...] --dot > h.dot

Then copy it to another machine with dot and run, for example:

dot -Tps -o output.ps h.dot

Reporting issues

  • General link
  • To request installation of specific software, use RT-RACF-LinuxFarms@bnl.gov

Panda

Best to read this first

Requirement:

  • Obtain a grid certificate. I tried to do it with Safari; it is probably best to do it with Firefox.

Renewing your grid certificate:

  • You will get an email notification to renew your certificate. Follow the instructions to do it within your Firefox browser.
  • Then follow the instructions to install your certificate.
  • You can copy the certificate some-name.p12 to another machine and then use the openssl commands to install it there (a sketch is given below).
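
The openssl commands are not spelled out on this page; a common sketch for splitting a .p12 into the usual ~/.globus files is shown below (the file names and location are grid conventions, not RACF-specific instructions):

openssl pkcs12 -in some-name.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in some-name.p12 -nocerts -out ~/.globus/userkey.pem
chmod 400 ~/.globus/userkey.pem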

Setup for each session

  1. set up the (ATLAS) grid environment on RACF: [1]
  2. obtain a grid proxy: grid-proxy-init
  3. download the client (see Batch_Processing#Submitting_Jobs; LATEST is the numeric date) and set it up:
wget http://www.opensciencegrid.org/pandaversions/sendjobs-LATEST.tar
tar -xvf sendjobs-LATEST.tar 
cd sendjobs-LATEST
source setup.sh 
 ./sendJob.py --njobs 1 --computingSite LBNE_DAYA_1  --transformation http://www.phy.bnl.gov/~djaffe/panda/transformations/nuwa_wrapper_example3.sh  --prodSourceLabel user --jobParameters "770077" --verbose 1

To see what you've done

  • Check status by entering "Panda Job ID", reported by sendJob.py, into "Quick search" in the Panda monitor
  • To see what is in xrootd, see the instructions here; example below:
LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so XROOTD_VMP=daya0001:/xrootd/ ls -l root://daya0003//xrootd/panda_tests/outputs
LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so XROOTD_VMP=daya0001:/xrootd/ ls -l root://daya0002//xrootd/panda_tests/outputs
  • Apparently, you must explicitly look in every node (dayaNNNN) that is part of dayabay's xrootd empire.