BNL RACF Cluster
RACF is the BNL RHIC and ATLAS Computing Facility cluster. RACF Announcements
Daya Bay has purchased nodes on RACF: daya0001, daya0002 and daya0003, referred to as dayaNNNN below.
News
Accounts
To get an account:
Note that when requesting an account you will need to choose a user name specific to the Daya Bay nodes, and that it must be 8 characters or less.
After getting an initial account you will be instructed to upload an SSH public key which is to be used to get through the SSH gateway machines.
The default shell is tcsh. To change your shell, send a request to RT-RACF-UserAccounts@bnl.gov .
For other requests and problems consult this link.
Log in
ssh USER@rssh.rhic.bnl.gov
ssh dayaNNNN
where NNNN=0001, 0002 (0003 is batch only)
If you will need to use klog, then you need to enter your Kerberos password on the SSH gateway at least once:
ssh USER@rssh.rhic.bnl.gov
kinit    (enter password as prompted)
ssh dayaNNNN
You (probably) only need to do this once.
After First Log In
After you first log in, perform these one time set ups:
Access NuWa
- Add a line to your setup scripts so you can use NuWa
bash> echo 'source ~dayabay/nuwaenv.sh' >> ~/.bash_profile
tcsh> echo 'source ~dayabay/nuwaenv.csh' >> ~/.login
Join Mailing List
In order to be informed and/or ask questions about Daya Bay related things on RACF join this list:
https://lists.bnl.gov/mailman/listinfo/dayabay-racf
(you will also automatically be placed on a list for RACF-wide information)
Monitoring
Ganglia monitoring of Daya Bay's disks and CPUs in RACF (requires password)
NuWa Releases
Various releases and the trunk are centrally maintained. See below for status of the various releases.
To set up a base release, use nuwaenv: add a line to your .bashrc or .cshrc file like:
source ~dayabay/nuwaenv.sh   # for bash
source ~dayabay/nuwaenv.csh  # for tcsh
This defines a function/alias called "nuwaenv". To see what options it has do:
nuwaenv -h
To set up the default (not always latest) release:
nuwaenv
Each available release can be explicitly set up:
nuwaenv -r X.Y.Z
The latest available release is available via:
nuwaenv -r latest
The rarely updated "trunk" is set up via:
nuwaenv -r trunk
Optionally, specify "-g" or "-O" for debug/optimized versions:
nuwaenv -r latest [-g|-O]
Setup NuWa release by hand
In case nuwaenv is not configured for a release (or you want to use a release or a daily build that is not known to nuwaenv):
$ cd /afs/rhic.bnl.gov/dayabay/software/releases/NuWa-3.9.2/NuWa-3.9.2/
$ source setup.sh
$ cd dybgaudi/DybRelease/cmt/
$ source setup.sh
That's it.
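As a quick sanity check after the manual setup (a sketch; it assumes the release puts nuwa.py on your PATH, as used elsewhere on this page):
which nuwa.py    # should resolve inside the release tree
nuwa.py --help   # prints usage if the environment is complete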
NuWa Daily Build
Based on Simon Blyth's master/slave automatic dybinst installation set up on the daya0001 machine, the latest successful build is copied into the AFS dailies area at 3:15 a.m. every day. Seven copies are kept, one per day of the week, each retained for a week. Due to disk space limitations, only the optimized version is kept for each daily. Thanks to the keytab AFS-token method suggested by John McCarthy, the AFS token is renewed automatically at the beginning of the daily cron job and released at the end. Therefore do not run jobs that need AFS tokens overnight under the dayabay account: the token will be released at about 3:30 a.m. by the daily cron job.
The dbg/opt slave copy pool is located at:
daya0001:/data4/slave_install/nuwa_builds
The daily copies are located at:
/afs/rhic.bnl.gov/dayabay/software/dailies/<day> [<day>: monday/tuesday/....]
You can check each day's revision number by looking at the name of the NuWa copy directory inside that day's subdirectory.
To set up a daily build, if you are using nuwaenv, just type
nuwaenv -r <day> [<day>: monday/tuesday/....]
Or you can go to the dailies AFS area and source the CMT setup script inside the <day> subdirectory. <day> must be all lower-case.
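For example, to set up Monday's daily by hand (a sketch; it assumes the layout inside each <day> directory mirrors a normal release, with the NuWa copy directory name carrying the revision number as noted above):
cd /afs/rhic.bnl.gov/dayabay/software/dailies/monday
cd NuWa-*/                                       # the NuWa copy for that day
source setup.sh
cd dybgaudi/DybRelease/cmt/ && source setup.sh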
If you are going to run a production or long-term job, please use the snapshots in the
/afs/rhic.bnl.gov/dayabay/software/dailies
directory.
Brett's scripts to find daily build logs
It drives me crazy trying to find the log for a failed bitten build on RACF so I wrote a little script to help me. Feel free to use it.
# opt log by SVN revision:
~bvdb/share/slave-log.sh r13205
# dbg log by SVN revision
~bvdb/share/slave-log.sh r13205 dbg
# log by build number
~bvdb/share/slave-log.sh 11123
-Brett
Info on daily builds
The logging output for builds can be found at
/data4/slave_install/dybinstall/logs/dybinst/BUILD_REVISION
where BUILD is the build number and REVISION is the revision that triggered the build. The usual log file from dybinst is in dybinst.log. There are also other log and output files from the stages of the build.
The builds are stored in
/data4/slave_install/nuwa_builds/NuWa-REVISION_slvmon_fails_1
for failed debug builds and
/data4/slave_install/nuwa_builds/NuWa-REVISION
for successful debug builds. Optimized builds are in analogous subdirectories under
/data4/slave_install/nuwa_builds/opt
and
/data4/slave_install/dybinstall/opt/logs/opt.dybinst
Non-NuWa software
On 8 Nov 2010, Brett wrote
I've gotten fed up with the default emacs on RACF and have installed v23. If you want to use it, add /afs/rhic.bnl.gov/dayabay/software/trunk/fs/bin to the head of your PATH. I'll entertain ideas for other software to put there.
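For example, in bash (a sketch; for tcsh use set path = ( ... $path ) instead):
export PATH=/afs/rhic.bnl.gov/dayabay/software/trunk/fs/bin:$PATH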
Disk
The various disk areas are described.
User home disk and quotas
/dayabay/u/USER
There is a TOTAL of 75GB for ALL Daya Bay users that is backed up. This quota was increased from 25GB to 75GB on 4 Nov 2010. Individual user quotas were implemented at about the same time. The default quota for new users is 100MB. The quota can be increased to 5GB upon a valid request from the user.
To see your user quota (some additional RACF documentation):
quota -s
Users should not store data in /dayabay/u/USER; instead, users should make extensive use of the
/home
area, which is scratch space on each dayaNNNN machine used by Condor, the batch processing system on RACF. It is not backed up.
Users should also make extensive use of the /dataN scratch disks and xrootd.
Daya Bay scratch disks
/dataM
on each dayaNNNN, where M=0,...,5. Not backed up.
See also:
* Rsync for copying between nodes
MDC09b data (generated on RACF)
MDC09b data generated on RACF is at
daya0001:/data0/djaffed/MDC09b/output/
MDC09b muon sample
daya0002:/data1/data/sim/MDC09b/muon
Mini dry run data
MiniDryRun data is copied to
daya0002:/data1/data/exp/2009-1010.MiniDryRun
It is also available via Xrootd. See below.
Dry run data
Automated sync with rafiman and Xrootd
The rafiman utility is used to scp .data files from PDSF, convert them to .rraw and load them into Xrootd. To do a sync, log in as dayabay@daya0001 and from the home directory do
nuwaenv -r trunk
nuwaenv -p rafiman
rafiman -u <PDSFUSER> sync <FIRSTRUN> <LASTRUN+1>
You can see details of the sync in the rafiman.log file. The final locations of all synced files will be printed like:
root://daya0001//xrootd/rraw/2010/TestDAQ/NoTag/0831/daq.NoTag.0005032.ADCalib.SAB-AD1.SFO-1._0001.rraw
In general, the path after
root://daya0001//xrootd/rraw/
mirrors the PDSF location.
Some manual copying
Note: please use the rafiman approach. The .rraw files in Xrootd are smaller and can be accessed from all nodes in the RACF.
Selected dry run data can be found under
daya0001:/data1/dry-run/2010/
Data from 0617 to 0626 are copied there. The subdirectories should match PDSF. Example of the copy:
bvdb@daya0001:NoTag> pwd
/data1/dry-run/2010/TestDAQ/NoTag
bvdb@daya0001:NoTag> scp -r bviren@pdsf.nersc.gov:/eliza7/dayabay/data/exp/dayabay/2010/TestDAQ/NoTag/0618 .
or
rsync -av pdsf.nersc.gov:/eliza7/dayabay/data/exp/dayabay/2010/TestDAQ/NoTag/0626 ./
If you do not find some data you want contact Brett or David.
Muon generator data
Pre-simulated underground muon profiles and the SAB muon profile are in
/afs/rhic.bnl.gov/dayabay/software/trunk/opt/NuWa-trunk/data
Scratch for AFS releases
The build/ and tarFiles subdirectories for externals for AFS-based releases, including trunk, are found under
daya0001:/data4/dayabay/software/
with a parallel structure to that found under
/afs/rhic.bnl.gov/dayabay/software/
Softlinks are made from AFS to the appropriate subdirectory in /data4.
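A sketch of the softlink pattern (the release name and subdirectory are illustrative):
ln -s /data4/dayabay/software/releases/NuWa-X.Y.Z/tarFiles \
      /afs/rhic.bnl.gov/dayabay/software/releases/NuWa-X.Y.Z/tarFiles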
AFS
AFS provides disks that are globally visible and can serve as the central location for Daya Bay software. RACF management prefers to keep the AFS directories and files small. AFS has special rules for creation, deletion, etc. of files.
klog
Most users will NOT need to alter the Daya Bay content in AFS. If you do, then use klog to obtain a Kerberos token:
klog
You will be prompted for your password. To view your tokens:
tokens
To release your tokens:
unlog
Tokens remain valid for 1 week by default. You need a token to avoid the following:
$ svn update
svn: Can't open file '.svn/lock': Permission denied
klog as user dayabay
- Secret knowledge from Jiajie to obtain tokens. As user dayabay:
/usr/kerberos/bin/kinit -k -t /dc/krb5.keytab_dayabay dayabay
/usr/bin/aklog
- These commands are in
/dayabay/u/dayabay/supervisor/daily.sh
kinit
The klog procedure above gives you a token only on the node (e.g. daya0002) that you are on. You can get a token valid on all dayaNNNN nodes by using kinit. (Note that this usage apparently contradicts the kinit documentation.) Information from John McCarthy:
Before you log onto either system, log onto one of the SSH gateway systems and run kinit. Enter your Kerberos password and then log onto daya0001 and/or daya0002 and you should be set. Even if you log onto the same gateway system in another window and then log onto either of those systems, you will still be set.
More information from John McCarthy:
kinit does not give you an AFS token. If you were to log on to daya0001 and run kinit you would still not have an AFS token. Also, running kinit on the gateway does not get you an AFS token. It is the act of running kinit on the gateway and then ssh'ing onto daya0001 that gets you the AFS token because there is a PAM module that the farm has installed that gets this for you. It is the PAM module that is getting you the token. I can understand the confusion though.
Access Control List
AFS access is not controlled by unix permission bits, but by an Access Control List (ACL). Following example cribbed from Robert Hatcher MINOS-software email:
$ pwd
/afs/rhic.bnl.gov/dayabay/software
$ fs listacl
Access list for . is
Normal rights:
  dayabay rlidwk
  system:administrators rlidwka
  system:anyuser rl
NuWa releases on AFS
For installation of optimized code for NuWa-XXXX release, add
macro host-optdbg 'opt'
to
NuWa-XXXX/setup/default/cmt/requirements
before CMTCONFIG is defined. This allows setup.sh to properly load the externals.
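One way to apply this from the shell (a sketch; substitute the actual release for NuWa-XXXX):
cd NuWa-XXXX/setup/default/cmt/
echo "macro host-optdbg 'opt' " >> requirements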
Release build checklist
- The first 4 steps below are done by /afs/rhic.bnl.gov/dayabay/software/releases/prepare_release.sh RELEASE (a condensed sketch follows this list)
- Get tokens
- Set CMTCONFIG
- Set proxy
- Make and link scratch directories for build
- dybinst RELEASE all
- Make opt by default
- cd RELEASE/setup/default/cmt/
- echo "macro host-optdbg 'opt' " >> requirements
- dybinst RELEASE tests
- Edit /dayabay/u/dayabay/python/nuwaenv/configs/racf-dot-nuwaenv.cfg to include new release
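A condensed sketch of the checklist (the release number is a placeholder; the first four steps are normally handled by prepare_release.sh):
klog                                              # get AFS tokens as user dayabay
dybinst X.Y.Z all                                 # full build
cd NuWa-X.Y.Z/setup/default/cmt/
echo "macro host-optdbg 'opt' " >> requirements   # make opt the default
cd -
dybinst X.Y.Z tests                               # run the test suite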
NuWa/Daya Bay structure in AFS
- Use the Release build checklist to install a release
- Status of all releases on RACF
- RACF NuWa-1.5.0 Status
- RACF NuWa-1.6.0 Status
- RACF NuWa-1.6.1 Status
- RACF NuWa-1.6.2 Status
- RACF NuWa-1.7.0 Status
- RACF NuWa-1.8.0 Status
- RACF NuWa-1.8.1 Status
- RACF NuWa-3.0.0 Status
- RACF NuWa-3.1.0 Status
- RACF NuWa-3.3.0 Status
- RACF NuWa-3.5.0 Status
- RACF NuWa-3.6.0 Status
- RACF NuWa-3.7.0 Status
- RACF NuWa-3.7.1 Status
- RACF NuWa-3.7.2 Status
- RACF NuWa-3.7.3 Status
- RACF NuWa-3.8.0 Status
- REMINDER: FOR EACH RELEASE, BE SURE TO UPDATE ~dayabay/.nuwaenv.cfg
- Some statistics on Release_sizes
- Initial setup info
- NuWa Daily Build
- OptimizedNuWaTrunkStatus under opt/ (11aug2011: rarely updated now that daily builds are available)
- DebugNuWaTrunkStatus under dbg/ (11aug2011: rarely updated now that daily builds are available)
To install, one must be user dayabay and get an AFS token via:
klog -principal USERNAME
Old location of NuWa releases on AFS
Central location for NuWa trunk and releases. It is doled out in 15GB parcels. Main directory is
/afs/rhic.bnl.gov/x8664_sl5/opt/dayabay
- Presently trunk and 1.5.0 are installed at
/afs/rhic.bnl.gov/x8664_sl5/opt/dayabay/NuWa-XXX
- where XXX is trunk or 1.5.0. Note that trunk is subject to frequent (daily) updates and rebuilds based on djaffe's whim. You don't need tokens to do the usual NuWa setup dance.
Tips
For various reasons, working on RACF can be more challenging than working on your own workstation. This section collects some tips.
Copying data from PDSF
You can use scp to copy data in the usual way, but scp is rather slow. See the topic on using GlobusOnline/GlobusConnect for a much faster way.
Transparent Pass Through of SSH Gateways
The rssh.rhic.bnl.gov SSH gateways provide a very restrictive shell. After logging in, there is not much one can do except ssh to an internal node. Many of the usual SSH tricks to transparently pass through the intermediate gateway will not work. This one will:
ssh -A -t USER@rssh.rhic.bnl.gov ssh daya0001
If you plan to make multiple connections you can reuse the first connection to the gateway via:
ssh -o ControlMaster=auto -A -t USER@rssh.rhic.bnl.gov ssh daya0001
Unfortunately, the same does not work for the leg from rssh to daya0001.
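For convenience, a shell alias built on the working pattern above (a sketch for bash; the alias name is arbitrary):
alias racf-daya1='ssh -o ControlMaster=auto -A -t USER@rssh.rhic.bnl.gov ssh daya0001'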
Edit Files With a local Emacs via TRAMP
The GNU emacs that comes on the RACF nodes is from the mid-Pleistocene and is missing enough editing modes to make it somewhat painful to use. (Note: a newer emacs is now available; see Non-NuWa software above.)
Thankfully, Emacs (and Aquamacs) has a nice remote editing mode called TRAMP, but unfortunately it is foiled by the gateway. However, it can be used with the file-copy gateways like so:
C-x C-f /ssh:USER@rftpexp.rhic.bnl.gov: <TAB> <TAB>
After the <TAB>, TRAMP will log in and display the contents of your home directory. You can then find your file as usual. The only difference is a little more latency. If you need to look for something starting at "/" then just add "/" like:
C-x C-f /ssh:USER@rftpexp.rhic.bnl.gov:/ <TAB> <TAB>
Small notes of caution:
- TRAMP connection method
It is possible to leave off the "ssh:" method name. TRAMP will then default to whatever it has been configured to use. In particular, if it uses the "scp" method, it will make a new connection each time. Besides slowing down saves, this may run afoul of any connection rate limiters.
- Sticky TRAMP prefix
You may find that when you want to find a new file on your local file system after visiting a TRAMPped file, the TRAMP prefix is "sticky". Emacs will auto-complete with <TAB> on local files but then, when creating/visiting, it will do so on the remote file system. To negate this misfeature you need to kill the prefix via the following:
C-x C-f C-a C-k ~/path/to/file/you/want
Accessing non-network disks
The rftpexp server used in the example above does not provide access to the /data disks. One trick to gain access to non-network disks is the following:
- Configure so that you can SSH from daya0001 directly to your workstation in one logical hop (if your WS is behind a firewall see this page for how to handle it).
- Set up a **remote** port forwarding **from** your workstation **to** daya0001
ssh -R2222:localhost:22 WS_USER@WORKSTATION
- Use a Tramp URL like:
/ssh:RACF_USER@localhost#2222:/data0/
Note: if TRAMP fails to connect it may not display what the problem is. You can check that the tunnel is okay by directly connecting to it like:
ssh -p 2222 RACF_USER@localhost
Double Tunnel for Working from Home
The above won't work if you cannot SSH directly from RACF to the computer at which you are working. This may happen, for example, if you are working from home or otherwise behind some firewall. If there is an intermediate computer that you can SSH to from both RACF and "home", then you can join two tunnels. To do this run:
racf> ssh -R2222:localhost:22 INTERMEDIATE_USER@INTERMEDIATE
home> ssh -L2222:localhost:2222 INTERMEDIATE_USER@INTERMEDIATE
Then use the same TRAMP URL as above.
Setting a bookmark for easy return
The TRAMP URL can be a pain to remember. To help with this, you can use Emacs bookmarks to remember your favorite files or directories. Simply open a file or directory once and then do "C-x r m" to bookmark it. Later you can jump to a bookmark with "C-x r b" or list all your bookmarks with "C-x r l". In the listing, click on the one you want to jump to it.
Mounting RACF Filesystem through SSH
Using the FUSE filesystem client sshfs, you can mount onto your local workstation any filesystem that you can see when logged into RACF. Set up a "-R" tunnel as above, then mount a directory like:
sshfs -p 2222 RACF_USER@localhost:/racf/path /some/local/path
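When finished, unmount it (Linux; on a Mac use umount /some/local/path):
fusermount -u /some/local/path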
Optimize for GIT via SSHFS
Only the working directory needs to be on the RACF filesystem. Put .git on the local filesystem to speed up its access. After mounting as above:
cd /some/local/path/.../some/racf/path
git init mystuff
or more probable:
git svn init http://dayabay.ihep.ac.cn/svn/dybsvn/people/ME ME
then move the resulting .git directory to the local filesystem and symlink it back:
mv .git ~/.racf-people-me.git
ln -s ~/.racf-people-me.git .git
Condor
- Effective condor documentation - quickstart-like.
- BNL condor documentation.
- Condor manuals
- Condor's file transfer mechanism
Some handy condor commands.
Submit job specified by file JDF
condor_submit JDF
To see the ClassAD prepared for condor
condor_submit -verbose JDF
Status of submitted jobs
condor_q
Where are my jobs running?
condor_q -run
Whose job is running on the dayaNNNN machines?
condor_q -global -run | grep daya
Status of the Daya Bay machines. Wildcards are apparently not allowed for this command.
condor_status daya0001 daya0002 daya0003
JDF (job description file) guidance
Example JDF from MDC09b and scripts and files used to create JDFs for MDC09b
Guidance from Chris Hollowell on the coupling of Requirements and Rank statements
General queue JDF (will only prefer to run on your hosts):
Rank = (CPU_Experiment == "dayabay")*10
Non-general queue JDF (will force execution on your hosts):
Rank = (State == "Unclaimed")
Requirements = (CPU_Experiment == "dayabay")
Job submission guidance
For nuwa.py jobs that produce .root and .sem output files, condor returns those files to the directory from where condor_submit was executed. If you condor_submit from daya0001:/data2/UserName/, for example, then you have the full resources of the scratch disk available and you won't fill up the puny user disk.
Here is an example that will allow you to submit from the scratch disk. The relevant lines are
# disposition of output files
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
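Putting the pieces together, a minimal JDF sketch (the executable, arguments and log file names are hypothetical):
universe                = vanilla
executable              = run_nuwa.sh
arguments               = 0005032
output                  = job.$(Cluster).out
error                   = job.$(Cluster).err
log                     = job.$(Cluster).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Rank                    = (CPU_Experiment == "dayabay")*10
queue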
xrootd
For general help with xrootd see the Xrootd topic.
xrootd on RACF
See also the general Xrootd topic. RACF specifics:
- We access Xrootd through daya0001. Behind the scenes it may redirect requests to others.
- We have two xrootd disk servers on daya0003 (130.199.205.53) and daya0002 (130.199.205.52).
- To write to a file from nuwa.py, all intermediate directories must exist on all xrootd servers. You can ensure this by adding "?mkpath=1" to the end of the URL (see the example after this list).
- Currently to check what data exists one must list a directory on all nodes. This can be accomplished with a script like:
#!/bin/bash
export LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so
export XROOTD_VMP=daya0001:/xrootd/
for n in 1 2 3; do
  ls -l root://daya000$n/$1
done
# Use the script like:
#   xroot-ls /xrootd/path/to/my/directory
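An illustrative output URL using mkpath (the path is a placeholder):
root://daya0001//xrootd/user/SOMEPATH/output.root?mkpath=1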
Removing xrootd files
- Users can only delete their own files.
- Brett reported that it is very slow to "rm" files with the POSIX preload library.
- An alternative is to overwrite files with a zero-length file (see the sketch after this list). Ofer has confirmed that Xrootd 'knows' that these files are indeed zero length.
- Files on the daya0002 server can be directly deleted, but it is not clear if Xrootd will 'know' about it.
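A sketch of the zero-length overwrite trick mentioned above (the file path is illustrative):
touch empty.rraw
xrdcp -f empty.rraw root://daya0001//xrootd/path/to/unwanted/file.rraw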
Mini Dry Run Data in .rraw format
The Mini Dry Run data in .rraw format has been loaded into Xrootd. It is available via a URL like:
root://daya0002//xrootd/RawData/MiniDryRun/YYYY/MMDD/FILENAME.rraw
for example:
root://daya0002//xrootd/RawData/MiniDryRun/2009/1213/daq.NoTag.0000200.Physics.SAB-AD1.SFO-1._0001.rraw
The latest NuWa can use these files like:
nuwa.py [...] root://daya0002//xrootd/RawData/MiniDryRun/YYYY/MMDD/FILENAME.rraw
For reference, I loaded the files via:
cd /data1/data/exp/2009-1010.MiniDryRun/
for n in $(find . -name '*.rraw' -print); do
  file=$(echo $n | sed -e 's#\./##')
  url=root://daya0001//xrootd/RawData/MiniDryRun/$file
  xrdcp $n $url
done
(Note: I (bv) wrongly used daya0002, so the files did not get spread out. The above is what I should have done.)
Some xrootd links
- xrootd project homepage and wiki pages
- xrdcp documentation
- note that xrdcp --help claims that it is possible to copy a directory into xrootd, but it does not work. The xrdcp documentation has no such claim.
- Authenticating to rootd
- Using Xrootd from root
- root-talk
- David's double-open problem and solution in root-talk
- subscribe to xrootd mail list
Tools
Google PerfTools
General usage instructions for PerfTools are described in Dealing_With_Memory_Leaks. Below are RACF specifics.
- Google Perftools is installed in AFS at:
/afs/rhic.bnl.gov/rcassoft/x8664_sl5/google_perftools
You must have an active AFS token to access the shared libraries here.
- The 'dot' program needed to generate the final PS/PDF files is not available on RACF. You can have pprof explicitly make the intermediate dot file like:
pprof [...usual options...] --dot > h.dot
Then copy it to another machine with dot and run, for example:
dot -Tps -o output.ps h.dot
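Since rftpexp.rhic.bnl.gov is the file-copy gateway, one way to fetch the dot file to your local machine is (a sketch; it assumes h.dot was written in your RACF home directory):
scp USER@rftpexp.rhic.bnl.gov:h.dot .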
Reporting issues
- General link
- To request installation of specific software, use RT-RACF-LinuxFarms@bnl.gov
Panda
Best to read this first
Requirement:
- Obtain a grid certificate. I tried to do it with Safari; it is probably best to do it with Firefox.
Renewing your grid certificate:
- You will get an email notification to renew your certificate. Follow the instructions to do it within your Firefox browser.
- Then follow the instructions to install your certificate.
- You can copy the certificate some-name.p12 to another machine and then use the openssl commands to install it there.
Setup for each session
- set up the (ATLAS) grid environment on RACF: [1]
- obtain a grid proxy: grid-proxy-init
- download the client (see Batch_Processing#Submitting_Jobs; LATEST is the numeric date), then set up and submit:
wget http://www.opensciencegrid.org/pandaversions/sendjobs-LATEST.tar
tar -xvf sendjobs-LATEST.tar
cd sendjobs-LATEST
source setup.sh
./sendJob.py --njobs 1 --computingSite LBNE_DAYA_1 --transformation http://www.phy.bnl.gov/~djaffe/panda/transformations/nuwa_wrapper_example3.sh --prodSourceLabel user --jobParameters "770077" --verbose 1
To see what you've done
- Check status by entering "Panda Job ID", reported by sendJob.py, into "Quick search" in the Panda monitor
- To see what is in xrootd, see the instructions here; example below:
LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so XROOTD_VMP=daya0001:/xrootd/ ls -l root://daya0003//xrootd/panda_tests/outputs
LD_PRELOAD=$ROOTSYS/lib/libXrdPosixPreload.so XROOTD_VMP=daya0001:/xrootd/ ls -l root://daya0002//xrootd/panda_tests/outputs
- Apparently, you must explicitly look in every node (dayaNNNN) that is part of dayabay's xrootd empire.