PDSF Cluster

From Daya Bay
Revision as of 07:32, 15 October 2014 by Djaffe (talk | contribs) (→‎trouble with module?)


Getting an account

See this private wiki topic.

PDSF usage, rules, and recommendations

PDSF is a shared cluster, meaning that Daya Bay shares the cluster with many other experiments and scientists. Please be courteous and aware of how your jobs impact others.

Production jobs on PDSF are run under 2 production accounts: dybspade and dybprod. These accounts have higher priority than normal user accounts. However, user jobs can still have a deleterious effect on production jobs if they overload resources or cause administrators to suspend dayabay jobs during system recovery/stabilization.

  • Rules for Batch Jobs: You must obey these rules when running batch jobs on PDSF. Failure to do so may result in your jobs being removed from the system to resolve problems.
    • Do not write to $HOME or ~/... areas from batch jobs
      • Personal quotas are very small. If you fill up your personal quota, this will impact the functionality of multiple volumes served by the same hardware.
      • Use /project/projectdirs/dayabay/scratch/mynamehere areas instead
      • N.B. If you do not specify otherwise, error and log files will be written to the directory from which the job was submitted. This will often be your ~/... area. Use the -o and -e flags to redirect them.
    • Use the projectio resource flag if you are reading or writing from /project/dayabay/... areas (e.g. all raw and production data files)
      • When submitting a job which will open a file on /project/dayabay/... for either read or write, add "-l projectio=1" to the qsub command to submit your batch jobs
        • e.g. % qsub -l projectio=1 ~/hellowp.csh
    • Use the eliza16io resource flag if you are reading or writing from /eliza16/dayabay/... areas.
      • When submitting a job which will open a file on /eliza16/dayabay/... for either read or write, add "-l eliza16io=1" to the qsub command to submit your batch jobs
        • e.g. % qsub -l eliza16io=1 ~/hellow16.csh

  • Best Practices for Batch Jobs:
    • Use ~/.sge_request
      • See "% man sge_request"
      • The parameters in the ~/.sge_request file are used by default when submitting SGE batch jobs, unless overridden by qsub command-line arguments. Using ~/.sge_request ensures that you do not forget important arguments when submitting jobs.
      • For example, a ~/.sge_request containing the single line "-l h_vmem=3G" raises the default memory limit for every job you submit.
    • Use $TMPDIR for any temporary scratch files.
      • This is temporary space on the hard drive of the batch node on which you are running. Files may be deleted *after* your job ends, but please clean up any scratch files at the end of your job.
      • N.B. Absolute disk I/O performance and latency will be better for these local disks. Consider using $TMPDIR for very disk-intensive tasks and then move files to /project/projectdirs/dayabay/scratch/$USER areas at the end of the job.
  • Rules for Interactive Running:
    • Do not run jobs that take longer than ~15 minutes on interactive nodes (PDSF or pdyb-04)
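
The batch-job rules above can be combined into a minimal script sketch. Everything here is illustrative, not a Daya Bay convention: the script name (myjob.sh), the stand-in workload, and the fallback output directory are assumptions made so the sketch runs anywhere; on PDSF you would point OUTDIR at /project/projectdirs/dayabay/scratch/$USER.

```shell
#!/bin/bash -l
# Illustrative batch-script sketch: work in node-local $TMPDIR,
# then copy results to group scratch rather than $HOME.
set -e

SUBMITDIR=$PWD
WORKDIR=${TMPDIR:-/tmp}                  # SGE sets $TMPDIR on the batch node
# In real use this would be /project/projectdirs/dayabay/scratch/$USER;
# the local fallback below is only so the sketch runs anywhere.
OUTDIR=${OUTDIR:-$SUBMITDIR/job-output}

cd "$WORKDIR"
echo "ran on $(uname -n)" > result.txt   # stand-in for the real workload
mkdir -p "$OUTDIR"
cp result.txt "$OUTDIR/"
rm -f result.txt                         # clean node-local scratch before exit
```

Following the rules above, such a script would be submitted with something like % qsub -l projectio=1 -o /project/projectdirs/dayabay/scratch/$USER/logs -e /project/projectdirs/dayabay/scratch/$USER/logs myjob.sh (paths illustrative).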

PDSF resources (disk, CPU, etc)

  • Special commands on PDSF
    • % module load myquota; myquota -g dayabay
      • Use this command to see our group quotas on PDSF data vaults.
    • % module load myquota; myquota -u tull
      • Use this command to see your user quotas on PDSF data vaults.
    • % prjquota dayabay
      • Use this command to see our group quotas on GFS.
  • Status as of August 2014
    • HPSS = 1.5 PB offline (tape) storage
    • PDSF = ~370 core (fair share guaranteed)
      • PDSF uses a contribution & fair share business model. Daya Bay is guaranteed 370 cores (370 job slots), but can utilize as many of the 2200 cores as are available beyond that minimum.
    • Interactive nodes:
      • pdyb-04.nersc.gov = dedicated interactive node for dayabay. Please log in here for larger interactive jobs and to avoid contention with non-dayabay users.
    • /project/projectdirs/dayabay/scratch
      • New user area. This is the area to store your large output files.
    • /common/dayabay = inode limited
      • Production Releases & Builds (eg. bitten slave)
    • /project/projectdirs/dayabay/www
      • Science Data Gateway = ODM, etc
    • Dedicated local disks = ~8 TB total
      • Currently unused => testing of xrootd & FUSE

PDSF environment


PDSF allows you to change the operating system on the command line. You need to use Scientific Linux for access to the dayabay software. I think this is the default setting for dayabay users, but you can set it manually by running:

% chos sl53

You can also make it the default by adding a file '.chos' to your home directory containing the one line sl53:

% echo sl53 > ~/.chos


PDSF uses a package called module to load specific software packages. Common commands look like:

  • To list the modules you've loaded:
% module list
  • To list the modules available to be loaded:
% module avail
  • To load a specific version of SVN (omitting the /1.6.9 loads the default version). This will allow you to use all SVN commands, read the SVN manpages, etc.:
% module load subversion/1.6.9
  • To unload the previously loaded SVN module:
% module unload subversion/1.6.9
  • To show what the SVN module does to your environment:
% module show subversion/1.6.9

trouble with module?

Quick pro tip from Matt Kramer: If you've recently found yourself unable to use PDSF's "module" system from bash scripts submitted via qsub, change the first line of your script to the following:

#!/bin/bash -l

The "-l" tells bash to run as though it's a login shell, forcing it to source the files that enable the use of "module".

I heard that I wasn't the only one having this problem, so I'm sharing this solution, which was provided to me by Lisa Gerhardt of NERSC.
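
A quick way to see what "-l" changes, runnable on any machine with bash (nothing here is PDSF-specific): bash's read-only login_shell shell option reports whether the current shell is a login shell. The first command below reports a non-login shell; the second, thanks to -l, reports a login shell, which is why the startup files that define "module" get sourced.

```shell
# Without -l, a non-interactive bash is not a login shell and
# skips the login startup files (/etc/profile, ~/.bash_profile, ...):
bash -c  'shopt -q login_shell && echo login || echo non-login'
# With -l, bash behaves as a login shell, so on PDSF the files
# that define the "module" function are sourced:
bash -lc 'shopt -q login_shell && echo login || echo non-login'
```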


We use 'svn' to access our software. On PDSF, you must load this module if you want to use svn:

% module load subversion/1.5.0

You must load this module each time you log in. You can make it automatic by adding this line to your .cshrc or .bashrc file.


The default memory limit for PDSF jobs is 1 GB. This is too low for many of our simulation jobs. You should use the qsub option '-l h_vmem=3G' to increase the default memory limit. You can make this automatic for all jobs by adding it to a file '.sge_request' in your home directory:

% echo '-l h_vmem=3G' > ~/.sge_request
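
A fuller ~/.sge_request might look like the following. The h_vmem line comes from this section; the log-directory lines are illustrative assumptions (any qsub option may appear in the file, just as it would on the command line):

```shell
# ~/.sge_request -- default qsub options for all batch jobs (sketch)
-l h_vmem=3G
-o /project/projectdirs/dayabay/scratch/mynamehere/logs
-e /project/projectdirs/dayabay/scratch/mynamehere/logs
```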


The flex module is needed to build the OpenMotif library. The normal user should not need to worry about this, as that library should already have been built as part of the NuWa installation. In the case where it is needed, the following command should be executed:

% module add flex

non-ASCII characters in log files

By default, PDSF sets LANG=en_US.UTF-8. Sometimes this can cause the output log file to contain non-ASCII characters. If you run into this problem, just set the environment variable LANG=en_US.
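
For example, in bash (csh users would use setenv LANG en_US instead):

```shell
# Override the default UTF-8 locale for this session:
export LANG=en_US
echo "$LANG"
```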


From Brett's email: After typing

stty sane

GDB behaves as expected.

Getting Started

  • initial clean up/set up:

read this: http://newweb.nersc.gov/users/computational-systems/pdsf/getting-started/shells-and-startup-files/#shells

  • To change your shell go to http://nim.nersc.gov and log in. In the upper right find the "Actions" menus and select "Change Shell"
  • To avoid inactive logouts you can add this line to your ~/.ssh/config file:
ServerAliveInterval 15
  • histsize: HISTSIZE is initially limited to 50 lines, but my usual command to change it did not work. Add these lines to .bashrc (this does not work for me in .profile):

export HISTFILESIZE=1000000000

export HISTSIZE=1000000

NuWa Setup without nuwaenv

The recommended method for setting up the Daya Bay software environment is to use nuwaenv (see NuWa Setup using nuwaenv). However, if you prefer to set up the environment manually, see below. On PDSF, installations of the software can be found here:

% ls -lrt /common/dayabay/releases/NuWa

To set up your environment without nuwaenv, you must start in the NuWa-trunk/ subdirectory of your chosen release. In what follows, <NuWaRelease> refers to the path to the chosen release at your site. Note that this is an absolute path. For example, on PDSF the releases live under /common/dayabay/releases/NuWa, as listed above.

To learn the <NuWaRelease> paths at various sites see the Local Setup Methods section.

To get the basic NuWa setup:

cd <NuWaRelease>
source setup.sh

This sets up CMT and a few other basic things but does not tell your shell about any projects or external packages. To incorporate the base release do (assuming you are currently in <NuWaRelease> -- this path is relative to your selected NuWa release):

cd dybgaudi/DybRelease/cmt
source setup.sh
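
Putting the two steps together as a single sketch, where <NuWaRelease> remains a placeholder for the absolute path of your chosen release:

```shell
cd <NuWaRelease>                 # absolute path of the chosen release
source setup.sh                  # basic CMT setup
cd dybgaudi/DybRelease/cmt       # relative to <NuWaRelease>
source setup.sh                  # pulls in the base release and external packages
```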

Location of official data files

The locations of the official production files on PDSF are stored in the data warehouse. Information on how to use the warehouse catalog is on the following wiki page: https://wiki.bnl.gov/dayabay/index.php?title=Accessing_Data_in_a_Warehouse

Running and Monitoring Jobs

The queuing system on PDSF is Sun Grid Engine (SGE). Our software is compiled for 64-bit processors. To submit a job to the 64-bit queue, use the following line:

% qsub <your_script.sh>

Note: the maximum CPU allocation is 24 hours, so make sure your job can finish in less than a day.

You can check the status of your jobs with the following line:

% qstat -u $USER

You can get an overview of the queue activities (by group and users):

% sgeusers

After a job completes, you can check if there was some error or other problem:

% qacct -j <jobid>

To see all recent jobs:

% qacct -o $USER -j

Note: maxvmem shows the maximum memory used by your job; your job will be terminated if it exceeds the memory allocation specified in your .sge_request file.

For testing, I find it useful to open an interactive session on the queue:

% qsh 

Sometimes you will need to submit this command multiple times to get an open node. If you don't mind waiting, you can use:

% qsh -now n

To debug jobs that use >2 GB of memory, you can use (note that the default shell in the window that opens is the annoying csh):

% qsh -l h_vmem=4G

If your jobs are getting killed due to memory use, see the discussion of h_vmem and .sge_request under 'PDSF environment' above.

General info

PDSF webpage: http://www.nersc.gov/users/computational-systems/pdsf/

CHOS: http://www.nersc.gov/users/computational-systems/pdsf/software-and-tools/chos/

SGE: http://www.nersc.gov/users/computational-systems/pdsf/using-the-sge-batch-system/

NX: http://www.nomachine.com/

PDSF stands for "Parallel Distributed Systems Facility".

Reporting problems

You can report problems to PDSF using the NERSC Online Consulting webpage: click on the link 'Ask NERSC Consultants' at https://help.nersc.gov/cgi-bin/consult.cfg/php/enduser/home.php.