GlobusOnline and GlobusConnect
Copying files from PDSF to, say, RACF can take a long time with the usual SCP mechanism. The bottleneck is latency inherent to SSH, not the bandwidth of the connection. One solution is to use GridFTP, but setting up a GridFTP server is more work than is desirable when one end of the transfer is your personal RACF account or your laptop.
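To see why latency rather than bandwidth can dominate, consider a rough bound: a stream that keeps at most one flow-control window of unacknowledged data in flight cannot move more than one window per round trip, however fast the link is. The sketch below evaluates that bound. The window size and round-trip time are illustrative assumptions, not measurements of the PDSF to RACF path.

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound in bytes/s when at most one window is in flight per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


# Assumed values, for illustration only.
WINDOW_BYTES = 64 * 1024  # hypothetical flow-control window
RTT_SECONDS = 0.070       # hypothetical cross-country round-trip time

bound = window_limited_throughput(WINDOW_BYTES, RTT_SECONDS)
print(f"at most {bound / 1e6:.2f} MB/s per stream")
```

Under this simple model, a larger window or several streams in parallel raises the ceiling, which is the kind of tuning GridFTP supports.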
Globus Online (GO) provides a web-based application for managing data transfers between "endpoints". One endpoint is an established GridFTP server. The other endpoint is either another GridFTP server or an instance of the Globus Connect (GC) application. GC can run on the top-3 platforms (Linux, Mac OS X and that other one). When you run GC it connects to GO, registers itself and waits to receive file transfers. You then give GO the credentials for each endpoint, and they are cached for a brief time. After setup you can select large amounts of data and initiate transfers.
And it's fast. How fast? Transfers of raw data files between PDSF and RACF proceed 6 times faster than NuWa can read the data in and about 5 times faster than normal SCP.
You will need your login credentials for PDSF (or another endpoint with an established GridFTP server) ready, and you must be logged in to the account that will serve as the second endpoint.
You need to do two things to get started:
Go to https://www.globusonline.org/SignUp to request a GO account. This is just for accessing the web application and is independent of any credentials for file access.
GlobusConnect (GC) provides an endpoint where no established GridFTP server may exist. This includes your personal RACF account, your laptop or even your home machine(s). Download it for your platform from https://www.globusonline.org/globus_connect/
After installing GC, sign in to GO to get a one-time setup key.
FIXME: add other GC setup help.
Start a GC instance:
cd /path/to/globusconnect-1.0/
./globusconnect
A little monitoring GUI should pop up with a green "connected" and yellow "idle" status.
Initiate Transfers Via the Web App
Log in to GO and click on "File Transfer". Select your endpoint ("nersc#pdsf") from the drop-down list on one side of the file transfer GUI. It will be cached for your next visit. Enter your credentials (PDSF username/password).
On the other half of the transfer GUI select the endpoint name that you used when you set up your GC. It should show up towards the top of the drop-down list.
Initiate Transfers Via the Command Line
You can use the GSISSH command shipped with GC to log in to a special GO server and issue transfer, status and other commands. To set this up you will need to deal with certificates. If you do not have a grid certificate you can get one from DOE Grids. RACF online documentation also has some certificate instructions.
Initiate Transfers Via Python
There is a RESTful API in development, along with a Python interface to it that lets one control GO programmatically. It is somewhat experimental at the time of writing. An entry point is here.
Area of Development
In principle, GO/GC would allow your batch job to directly stage data from PDSF. However, to make this work one must be able to start an instance of GC and activate it from the batch job. There is a Python API that can in principle allow this but setting this up is still a work in progress (contact bv if you want to get involved). For now, you must manually pre-stage the files to some local file system.
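Until direct staging works, a batch job can at least fail early when its input has not been pre-staged. The following is a minimal sketch of such a check. The function names and the example paths are hypothetical, and it only tests that the files exist on the local file system; it does not talk to GO or GC.

```python
import os


def missing_files(paths):
    """Return the paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


def require_prestaged(paths):
    """Raise if any input file is absent; otherwise return the paths as a list."""
    missing = missing_files(paths)
    if missing:
        raise FileNotFoundError("not pre-staged: " + ", ".join(missing))
    return list(paths)


if __name__ == "__main__":
    # Hypothetical local copies of files transferred earlier via GO.
    inputs = ["/local/scratch/run0001.data", "/local/scratch/run0002.data"]
    print("missing:", missing_files(inputs))
```

Calling require_prestaged() at the top of the job script stops the job before any processing starts if a transfer was forgotten or is incomplete.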