PDSF Cluster

Getting an account

See this private wiki topic.

PDSF usage, rules, and recommendations

PDSF is a shared cluster, meaning that Daya Bay shares the cluster with many other experiments and scientists. Please be courteous and aware of how your jobs impact others.

Production jobs on PDSF run under two production accounts: dybspade and dybprod. These accounts have higher priority than normal user accounts. However, user jobs can still have a deleterious effect on production jobs if they overload resources or cause administrators to suspend dayabay jobs during system recovery/stabilization.

  • Rules for Batch Jobs: You must obey these rules when running batch jobs on PDSF. Failure to do so may result in jobs being removed from the system to resolve problems.
    • Do not write to /home or ~/... areas from batch jobs
      • Personal quotas are very small. If you fill up your personal quota, this will impact the functionality of multiple volumes served by the same hardware.
      • Use /eliza7/dayabay/scratch/mynamehere areas instead
      • N.B. If you do not specify otherwise, error and log files will be written to the directory from which the job was submitted. This will often be your ~/... area. Use the -o and -e flags.
        • eg. % qsub -o /eliza7/dayabay/scratch/mynamehere/job.out -e /eliza7/dayabay/scratch/mynamehere/job.err myjob.csh (paths are illustrative; substitute your own scratch directory)
    • Use the eliza16io resource flag if you are reading or writing from /eliza16/dayabay/... areas (eg. all data files)
      • When submitting a job which will open a file on /eliza16/dayabay/... for either read or write, add "-l eliza16io=1" to the qsub command to submit your batch jobs
        • eg. % qsub -l eliza16io=1 ~/hellow16.csh
    • Use the eliza7io resource flag if you are reading or writing from /eliza7/dayabay/... areas (eg. user areas)
      • When submitting a job which will open a file on /eliza7/dayabay/... for either read or write, add "-l eliza7io=1" to the qsub command to submit your batch jobs
        • eg. % qsub -l eliza7io=1 ~/hellow7.csh
    • Use the projectio resource flag if you are reading or writing from /project/dayabay/... areas.
      • When submitting a job which will open a file on /project/dayabay/... for either read or write, add "-l projectio=1" to the qsub command to submit your batch jobs
        • eg. % qsub -l projectio=1 ~/hellowp.csh


  • Best Practices for Batch Jobs:
    • Use ~/.sge_request
      • See "% man sge_request"
      • The parameters in your ~/.sge_request file are used by default when submitting SGE batch jobs, unless overridden by qsub command-line arguments. Using ~/.sge_request ensures that you do not forget important arguments when submitting jobs.
      • eg. a minimal ~/.sge_request might contain the single line "-l h_vmem=3G" (illustrative; see the sge section below)
    • Use $TMPDIR for any temporary scratch files.
      • This is temporary space on the hard drive of the batch node on which you are running. Files may be deleted *after* your job ends, but please clean up any scratch files at the end of your job.
      • N.B. Absolute disk I/O performance and latency will be better on these local disks. Consider using $TMPDIR for very disk-intensive tasks and then moving files to /eliza or /project areas at the end of the job (see the sketch after this list).
  • Rules for Interactive Running:
  • Best Practices for Interactive Running:
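
A minimal batch-script sketch (bash) tying the batch rules above together: send logs to scratch with -o/-e, do heavy I/O in $TMPDIR, and copy results to the group scratch area at the end. The paths and the analysis command (my_analysis) are illustrative, not official:

#!/bin/bash
# Submit with, eg:
#   % qsub -l eliza7io=1 -o /eliza7/dayabay/scratch/mynamehere/job.out -e /eliza7/dayabay/scratch/mynamehere/job.err myjob.sh
cd $TMPDIR                                          # fast local disk on the batch node
my_analysis                                         # hypothetical command that writes result.root here
cp result.root /eliza7/dayabay/scratch/mynamehere/  # copy output to group scratch
rm -f result.root                                   # clean up scratch files before the job ends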

PDSF resources (disk, CPU, etc)

  • Special commands on PDSF
    •  % module load myquota; myquota -g dayabay
      • Use this command to see our group quotas on PDSF data vaults.
    •  % module load myquota; myquota -u tull
      • Use this command to see your user quotas on PDSF data vaults.
    •  % prjquota dayabay
      • Use this command to see our group quotas on GFS.
  • Status as of January 2012
    • HPSS = 350 TB offline (tape) storage
    • PDSF = ~150 cores (fair share guaranteed)
      • PDSF uses a contribution & fair-share business model. Daya Bay is guaranteed 150 cores (150 job slots), but can use as many of the 1200 cores as are available beyond that minimum.
    • Interactive nodes:
      • pdyb-04.nersc.gov = dedicated interactive node for dayabay. Please log in here for larger interactive jobs and to avoid contention with non-dayabay users.
    • /eliza7/dayabay = 6 TB
      • Old production area
      • => New user area + scratch
    • /eliza16/dayabay = 155 TB
      • New production area, Spade/Ingest, ODM, etc
      • DO NOT USE THIS AREA FOR USER DISKS
    • /project/projectdirs/dayabay/scratch
      • New user area. Please put user and group files that do not fit on /eliza7 here.
    • /common/dayabay = inode limited
      • Production Releases & Builds (eg. bitten slave)
    • /project/projectdirs/dayabay/www
      • Science Data Gateway = ODM, etc
    • Dedicated local disks = ~8 TB total
      • Currently unused => testing of xrootd & FUSE

PDSF environment

chos

PDSF allows you to change the operating system from the command line. You need to use Scientific Linux to access the dayabay software. This should be the default setting for dayabay users, but you can set it manually by running:

% chos sl53

You can also make it the default by adding a file '.chos' to your home directory, containing the one line sl53:

% echo sl53 > ~/.chos

module

PDSF uses a package called module to load specific software packages. Common commands look like:

  • To list the modules you've loaded:
% module list
  • To list the modules available to be loaded:
% module avail
  • To load a specific version of SVN (omitting the /1.6.9 loads the default version). This allows you to use all SVN commands, read the SVN man pages, etc.:
% module load subversion/1.6.9
  • To unload the previously loaded SVN module:
% module unload subversion/1.6.9
  • To show what the SVN module does to your environment:
% module show subversion/1.6.9

svn

We use 'svn' to access our software. On PDSF, you must load this module if you want to use svn:

% module load subversion/1.5.0

You must load this module each time you log in. You can make it automatic by adding this line to your .cshrc or .bashrc file, for example:
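
For bash users (csh/tcsh users would add the same module load line to ~/.cshrc instead):

% echo 'module load subversion/1.5.0' >> ~/.bashrc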

sge

The default memory limit for PDSF jobs is 1 GB. This is too low for many of our simulation jobs. You should use the qsub option '-l h_vmem=3G' to increase the memory limit. You can make this automatic for all jobs by adding it to a file '.sge_request' in your home directory:

% echo '-l h_vmem=3G' > ~/.sge_request

flex

This is needed to build the OpenMotif library. Normal users should not need to worry about this, as that library should already have been built as part of the NuWa installation. If it is needed, execute the following commands:

% module add flex
% export LIBRARY_PATH=$LD_LIBRARY_PATH

non-ASCII characters in log files

By default, PDSF sets LANG=en_US.UTF-8. Sometimes this causes non-ASCII characters in the output log file. If you run into this problem, set the environment variable LANG=en_US, for example:
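
% export LANG=en_US

(bash syntax; in csh/tcsh use: setenv LANG en_US)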

gdb

From Brett's email: After typing

% stty sane

GDB behaves as expected.

Getting Started

  • initial clean up/set up:

read this: http://newweb.nersc.gov/users/computational-systems/pdsf/getting-started/shells-and-startup-files/#shells

  • To change your shell go to http://nim.nersc.gov and log in. In the upper right find the "Actions" menu and select "Change Shell".
  • To avoid being logged out when inactive, you can add this line to your ~/.ssh/config file:
ServerAliveInterval 15
  • histsize: HISTSIZE is initially limited to 50 lines, but my usual command to change it did not work. Add these lines to .bashrc (they do not work for me in .profile):

export HISTFILESIZE=1000000000

export HISTSIZE=1000000

NuWa Setup without nuwaenv

(To set up with nuwaenv, go to NuWa Setup) On PDSF, installations of the software can be found here:

% ls -lrt /common/dayabay/releases/NuWa

To set up your environment without nuwaenv, you must start in the NuWa-trunk/ subdirectory of your chosen release. In what follows, <NuWaRelease> refers to the path to the chosen release at your site. Note that this is an absolute path. For example, on PDSF, the latest debugged release is here:

 /common/dayabay/releases/NuWa/trunk-dbg/NuWa-trunk/

To learn the <NuWaRelease> paths at various sites see the Local Setup Methods section.

To get the basic NuWa setup:

cd <NuWaRelease>
source setup.sh

This sets up CMT and a few other basic things but does not tell your shell about any projects or external packages. To incorporate the base release, do the following (this path is relative to your selected NuWa release, so it assumes you are currently in <NuWaRelease>):

cd dybgaudi/DybRelease/cmt
source setup.sh
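
As a quick sanity check (an informal suggestion, not an official step: after the above, nuwa.py should be on your PATH):

% which nuwa.py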


Location of official data files

The locations of the official production files on PDSF are stored in the data warehouse. Information on how to use the warehouse catalog is on the following wiki page: https://wiki.bnl.gov/dayabay/index.php?title=Accessing_Data_in_a_Warehouse

Running and Monitoring Jobs

The queuing system on PDSF is Sun Grid Engine (SGE). Our software is compiled for 64-bit processors. To submit a job to the 64-bit queue, use the following line:

% qsub <your_script.sh>

Note: the maximum CPU allocation is 24 hrs, so make sure your job can finish in less than a day.

You can check the status of your jobs with the following line:

% qstat -u $USER

You can get an overview of the queue activities (by group and users):

% sgeusers

After a job completes, you can check if there was some error or other problem:

% qacct -j <jobid>

To see all recent jobs:

% qacct -o $USER -j

Note: maxvmem shows the maximum memory used by your job; your job will be terminated if it exceeds the memory allocation specified in your .sge_request file.

For testing, I find it useful to open an interactive session on the queue:

% qsh 

Sometimes you will need to submit this command multiple times to get an open node. If you don't mind waiting, you can use:

% qsh -now n

To debug jobs that use >2 GB of memory, you can use the following (note that the default shell in the window that opens is the annoying csh):

% qsh -l h_vmem=4G


If your jobs are getting killed due to memory use, see the section 'sge' under 'PDSF Environment' above.

PDSF for Analysis Sprint 11a @ CalTech

  • Using NX is advised for any interactive X11 work. (see below)
  • Using qsub/qlogin is advised for any CPU/Memory-intensive work (see below)
  • For the most part, NX and qsub/qlogin are mutually-exclusive. Though you can get them to work together, you do not get the benefits of both. So choose your connection method based upon your work.
  • We have reserved 2 nodes for interactive use during the CalTech Analysis Sprint. Each of these nodes is a 12-core, 48 GB Westmere machine, so we have 24 qsh/qlogin slots. To use these nodes, use the qsh or qlogin command as follows:
    • Using qsh to pop up an xterm window on your $DISPLAY, with a shell running on the 1st reserved node:
      • mynode% ssh -X pdsf.nersc.gov
      • pdsf% qsh -l dayabay=1
    • To use qlogin to log in to the 1st reserved node in your current terminal window (no X):
      • mynode% ssh pdsf.nersc.gov
      • ssh-keygen -t dsa; cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
        • qlogin does not properly recognize passwords. You must set up the ability to log in with an ssh key instead. This needs to be done only once.
      • pdsf% qlogin -l dayabay=1
      • pdsf% setenv CHOS sl44; chos
        • The additional CHOS/chos commands properly set up CHOS in the new qlogin shell (otherwise CHOS is not set up).

General info

PDSF webpage: http://www.nersc.gov/nusers/systems/PDSF/

CHOS: http://www.nersc.gov/nusers/systems/PDSF/chos/index.php

SGE: http://www.nersc.gov/nusers/systems/PDSF/software/SGE.php

NX: http://www.nomachine.com/

PDSF stands for "Parallel Distributed Systems Facility".

Mixed Monte Carlo Samples

The first set of Miao's mixed MC samples can be found in

/eliza7/dayabay/data/exp/dayabay/2010/generated/MixedEvent/0411

Reporting problems

You can report problems to PDSF using the NERSC Online Consulting webpage: click on the link 'Ask NERSC Consultants' at https://help.nersc.gov/cgi-bin/consult.cfg/php/enduser/home.php .
