Run7 nanoDST to RCF operations
The procedure for transferring nanoDSTs to RCF during the Run7 min bias reconstruction project

http://www.hep.vanderbilt.edu/~maguirc/Run7/nanoDSTTransferToRCF

Contact Person: Charles F. Maguire
Creation Date: May 19, 2007
Last update May 29, 2007: Revisions for actual operations

Normal operations for the transfer of nanoDST files to RCF from the ACCRE newstore areas

  1. The main cautoRun7 script arranges for previously obtained reconstruction output to be placed in a set of 14 subdirectories of the newstore directory, specifically /gfps3/RUN7PRDF/prod/run7/output/newstore. This move is done when cautoRun7 starts running, after checks are made that it is OK to submit a new set of reconstruction jobs. The script which does these moves is move.sh. After the moves are finished, the cautoRun7 script determines a new set of jobs to be run and submits those to PBS via the submit.pl script. After submit.pl has completed all of its job submissions, the last step of the cautoRun7 script is to initiate an fdtStartNanoServer.csh job on the vmps02 node.

  2. The fdtStartNanoServer.csh script running on the vmps02 node starts the fdtStartNanoServer.pl script.

  3. The fdtStartNanoServer.pl script will start the FDT script fdtNanoServer.csh on the vmps02 node, provided that there is no nanoDST file transfer already in progress and that there is disk space available on eon0 or eon1 at firebird. Otherwise the fdtStartNanoServer.pl script will exit without doing anything. A BACKUP SCRIPT NEEDS TO BE WRITTEN TO TAKE OVER WHEN THIS OCCURS! Both of these situations would be very abnormal. The cautoRun7 script itself should not initiate a new set of transfers unless the previous set was confirmed to have arrived successfully at RCF. The buffer disks at firebird should not generally be more than 50% full.

  4. The fdtNanoServer.csh script will invoke the fdtStartNanoClientEon0(1).csh script on firebird, and will also create an fdtInProgress file to block any new job launches by the cautoRun7 script. The fdtInProgress file will be removed by the fdtNanoServer.csh script after the FDT process has finished transferring all the files to firebird.

  5. The fdtStartNanoClient.csh script will wait 30 seconds and then execute the fdtNanoClientEon0(1).csh script on firebird. The 30-second wait is the usual precaution to make sure that the server process starts first.

  6. When the FDT transfer is finished, the fdtNanoClientEon0(1).csh script will call the gridFTPNanoEon0(1).pl script to do the gridFTP transfers to RCF.

  7. The gridFTPNanoEon0(1).pl script composes and executes the three grid transfer .csh scripts, which run in parallel. Originally the grid transfer scripts had to be composed for each transfer, since an older version of gridFTP required that each file be named separately. The newest version of gridFTP allows whole directories to be copied with one command. Nonetheless, I left the same three grid transfer .csh scripts to be composed anew each time. Each transfer script sends a completion file as its last command. This completion (handshake) file contains the names and sizes of all the files which have been transferred, and it is used at RCF to check that the file transfers had no errors. Before the grid transfers begin, a gridFtpInProgress file is sent to the ACCRE nanoDstTransfering area. This file blocks any start of the cautoRun7 script for the production of new nanoDSTs.

  8. After starting the three grid transfer scripts, the gridFTPNanoEon0(1).pl script checks every 5 minutes until an eon0(1)UploadSuccess.txt file is found.

  9. The eon0(1)UploadSuccess.txt file is sent using a gridFTP transfer by the confirmAndEraseData59(58,63) script, which runs every 30 minutes as a cron job in the maguire account on the rftpexp01 node. This script checks that the files arrived at RCF correctly. As the disk-specific script names show, this is one of the current weak points of the project: the output areas at RCF have to be adjusted manually instead of automatically.

  10. After the UploadSuccess.txt file appears, the nanoDST files in the firebird newstore areas are removed, and a /home/phenix/nanoDstTransfering/moveNanoToStore.csh script is executed on the vmps02 node.

  11. The moveNanoToStore.csh script will move the files from the newstore area to the store area. After this move, the gridFtpInProgress file is removed from the /home/phenix/nanoDstTransfering area, which permits new job submissions by the cautoRun7 master script.
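
The handshake check described in steps 7 and 9 can be sketched as follows. This is only an illustration, not the actual confirmAndEraseData script: the handshake file layout (one "name size" pair per line, size in bytes) and the function names read_handshake and check_transfer are assumptions made for the example.

```python
import os


def read_handshake(path):
    """Parse a handshake file into {file name: expected size in bytes}.

    Assumed layout (not taken from the production scripts): one file per
    line, written as "<name> <size>".
    """
    expected = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, size = line.rsplit(None, 1)
            expected[name] = int(size)
    return expected


def check_transfer(handshake_path, dest_dir):
    """Compare the files in dest_dir against the handshake file.

    Returns two sorted lists: names that are missing from dest_dir, and
    names that are present but have a different size than expected.
    """
    expected = read_handshake(handshake_path)
    missing, wrong_size = [], []
    for name, size in sorted(expected.items()):
        full_path = os.path.join(dest_dir, name)
        if not os.path.isfile(full_path):
            missing.append(name)
        elif os.path.getsize(full_path) != size:
            wrong_size.append(name)
    return missing, wrong_size
```

A transfer would count as confirmed only when both returned lists are empty. A non-empty list gives the set of files to be sent again.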

Abnormal conditions during the transfer of nanoDST files to RCF from the ACCRE newstore areas

  1. The gridFTP copy could hang, so that not all the files are transferred. In that case one has to run manual commands to recover the missing files. This would mean killing the hanging globus job, if there is one. Then one has to find out manually which files are missing at RCF and arrange to have those copied to RCF by another gridFTP script. Once this is done, the automatic checking process should resume and verify that the files are all present.

  2. The transfer should proceed at a minimum rate of 15 MBytes/second over many hours. It is possible that the destination disk at RCF may become heavily loaded by other users. In that case, one would have to find another disk destination area at RCF and restart the transfers from the beginning to that new area.

  3. I have written new gridFTP monitoring scripts which will send an e-mail if the average gridFTP transfer rate drops below 15 MBytes/second. These scripts were in use during the May 28-29 gridFTP transfer to data63, and no alarm message was sent during those 11 hours of transfer.
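
The rate check behind such an alarm can be sketched as follows. This is a minimal illustration and not the monitoring scripts themselves: the sampling format (time in seconds, cumulative bytes transferred), the function names, and the choice of 1 MByte = 1,000,000 bytes are assumptions. Sending the e-mail is left out.

```python
MIN_RATE_MB_PER_S = 15.0   # minimum acceptable average rate, from the text above
BYTES_PER_MB = 1_000_000   # assumption: 1 MByte taken as 10**6 bytes


def average_rate(samples):
    """Average transfer rate in MBytes/second over a list of samples.

    samples is a list of (time_in_seconds, cumulative_bytes) pairs, oldest
    first. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    (t_first, b_first), (t_last, b_last) = samples[0], samples[-1]
    if t_last <= t_first:
        return None
    return (b_last - b_first) / BYTES_PER_MB / (t_last - t_first)


def needs_alarm(samples, threshold=MIN_RATE_MB_PER_S):
    """True when the average rate is known and below the threshold."""
    rate = average_rate(samples)
    return rate is not None and rate < threshold
```

For example, 12,000,000,000 bytes moved in 600 seconds is an average of 20 MBytes/second, so no alarm would be raised. Half that volume in the same time gives 10 MBytes/second and would raise one.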