Run7 CAUTO operation
Description of the main cron job for Run7 Min Bias PRDFs reconstruction

http://www.hep.vanderbilt.edu/~maguirc/Run7/cautoOperations.html

Contact Person: Charles F. Maguire
Creation Date: May 16, 2007
Last update May 29, 2007: Revisions after actual operations (database updating information is not yet accurate)

Introductory description [database updating description is obsolete]

The cautoRun7 script is the main control script that assembles jobs to be submitted to the PBS system for Run7 reconstruction on ACCRE. The script is based on the Run6 cauto script, with changes needed because of the differences between Run6 pp and Run7 AuAu collisions. The Run7 jobs take three times as much CPU time (12-14 hours) as did the Run6 jobs. Moreover, some of the ACCRE compute nodes for 2007 are quad-CPU nodes, which have been found to be only 75% efficient in the CPU time/wall clock time ratio, compared to 92% for the dual-CPU nodes. Hence the actual run time on a quad-CPU node is more than 17 real hours; for example, a 13 CPU-hour job at 75% efficiency takes about 13/0.75 = 17.3 wall clock hours.

The Run6 jobs, 100 at a time, were submitted twice per day, at 4:05 AM and 4:05 PM. This left plenty of time for one cycle to be completed before the next one was attempted. It also allowed the database to be updated on VUPAC starting at midnight and finishing before 3 AM, without interference to the PBS jobs.

For Run7, with the 17 hour minimum reconstruction cycle time and longer database update times (~4.5 hours), it is not efficient to have a fixed cycle of job submission and database updating. Instead there will be an hourly attempt at job submission and at database updating. There are provisions to ensure that no new jobs are submitted before the old cycle is finished, and similarly that no database update is attempted while there are PBS jobs waiting in the queue. The database update will be started, if needed, after two hours have elapsed since the final PBS job started running. The two hours ensure that the PBS jobs will have completed their database accesses.

The database updating cron job also does not attempt database updates between 6 PM and midnight. New database restore files arrive daily from RCF at 11 PM and are transferred to VUPAC at 11:35 PM. A database update attempt started after 6 PM could find its restore files overwritten after 11 PM, so the first attempt at a database update is made at midnight.
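
The guard logic described above lives in the hourly cron scripts; the following shell sketch only illustrates the kind of checks involved. The lastJobStart timestamp file and the use of plain qstat are assumptions made for illustration, not the actual implementation.

    # Illustrative sketch of the database-update guard conditions (not the production script)
    hour=`date +%H`
    if [ $hour -ge 18 ] ; then
        exit 0                          # no update attempts between 6 PM and midnight
    fi
    nQueued=`qstat -u phnxreco | grep -c " Q "`
    if [ $nQueued -gt 0 ] ; then
        exit 0                          # PBS jobs still waiting in the queue
    fi
    now=`date +%s`
    lastStart=`cat lastJobStart`        # hypothetical timestamp of when the final PBS job started
    if [ `expr $now - $lastStart` -lt 7200 ] ; then
        exit 0                          # wait two hours after the final job starts running
    fi
    # ... proceed with the database update ...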

An important function of the cautoRun7 control script and its subscripts is to keep track of which runs have been processed, and which new runs need to be processed. This is done by the makefilelist.sh script described below.

Normal operational steps for the cautoRun7 production script

  1. Check host
    The script will run only on the vmps18 gateway node. If the script finds itself anywhere else, it will exit with a wrong node message. The script is triggered every half hour at 5 and 35 minutes after the hour by a cron job of the phnxreco account:

    5,35 * * * * /gpfs3/RUN7PRDF/prod/run7/cautoRun7 >& /gpfs3/RUN7PRDF/prod/run7/cautoRun7.log

    The script is presently a softlink in the /gpfs3/RUN7PRDF/prod/run7 subdirectory of the phnxreco account on ACCRE. The link points to the actual file, which is in a CVS controlled area, /home/phnxreco/cvs/online/vanderbilt/run7AuAu. In fact all scripts which are executed either on the ACCRE or the VUPAC nodes are located in the home directory $CVSRUN7 area. The scripts on the firebird node are also in CVS, but there is no CVS checkout active yet on the firebird node.

    The first thing that the cautoRun7 script does is to check for the presence of the file /gpfs3/RUN7PRDF/prod/run7/PRODUCTION_SUSPENDED. Such a file would be placed manually in this directory if one wanted to stop the production for some reason. If the file is found, then the cautoRun7 script exits immediately; its presence has the same effect as commenting out the command in the crontab file. (The checks made in this step are sketched at the end of the step.)

    The second thing that the cautoRun7 script does is to check for the presence of the file /gpfs3/RUN7PRDF/prod/run7/cautoRun7InProgress. This file is an example of an automatically generated InProgress signal file. Such signal files are crucially important and are generated and deleted automatically as required. In this case, the cautoRun7 script takes about 40 minutes to complete its work before new jobs start running. So we don't want another cautoRun7 script running 30 minutes after the first one starts.

    The environment variable $PROD7 in the phnxreco account login points to the /gpfs3 subdirectory location. The $PROD7 area is the "top" directory for the Run7 production since large outputs are being generated and stored in some of its subdirectories.
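
    The host, suspension-file, and InProgress checks described in this step can be summarized by the shell sketch below. This is only an illustration of the logic, not the actual cautoRun7 source; the exact hostname test is an assumption.

      # Sketch of the step 1 guard checks (illustrative only)
      if [ `hostname` != "vmps18" ] ; then
          echo "Wrong node: cautoRun7 runs only on the vmps18 gateway node" ; exit 1
      fi
      if [ -f /gpfs3/RUN7PRDF/prod/run7/PRODUCTION_SUSPENDED ] ; then
          exit 0                # production has been suspended manually
      fi
      if [ -f /gpfs3/RUN7PRDF/prod/run7/cautoRun7InProgress ] ; then
          exit 0                # a previous cautoRun7 cycle is still in progress
      fi
      touch /gpfs3/RUN7PRDF/prod/run7/cautoRun7InProgress    # claim the InProgress signal file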

  2. Check for still active FDT or gridFTP transfers
    The cautoRun7 script itself will trigger a copy of the previous cycle's output to the RCF area. This copy is done first by an FDT transfer from ACCRE to firebird, and then by a gridFTP transfer from firebird to RCF. A new production cycle should not take place until the previous production cycle has been completely copied. One can work around this manually, but one must be careful to see that no new files are being stored in the newstore area while previous files in the newstore area are still present. A zero-length signal file, either fdtInProgress or gridFtpInProgress, will be present in the /home/phnxreco/nanoDstTransfering directory if an FDT or a gridFTP transfer is still copying the previous production's output.
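
    A minimal shell sketch of this check is given below; the signal file names and directory are those quoted above, while the surrounding logic is only illustrative.

      # Sketch of the step 2 transfer-in-progress check (illustrative only)
      transferDir=/home/phnxreco/nanoDstTransfering
      if [ -f $transferDir/fdtInProgress -o -f $transferDir/gridFtpInProgress ] ; then
          echo "Previous cycle output is still being copied to RCF" ; exit 0
      fi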

  3. Check jobs still running in PBS
    If there are any jobs still running in the phnxreco account from a previous job submission, the script will exit with a report of the number of jobs still running.

    Note: this restricts the use of the phnxreco account in PBS to be only for Run7 min bias PRDFs reconstruction until further notice.

    There was a flaw in this check. If the status checking software in CVS is broken, which does happen a few times per month, then the status check would falsely appear to be OK. The cautoRun7 script was fixed to verify that the status checking software on ACCRE is functioning properly.
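
    The sketch below illustrates this check. The real script uses the status checking software mentioned above; here plain qstat stands in for it, and the grep pattern is an assumption.

      # Sketch of the step 3 running-jobs check (illustrative only)
      qstat > /dev/null 2>&1              # first verify that the status checking itself works
      if [ $? -ne 0 ] ; then
          echo "PBS status checking is not functioning; exiting" ; exit 1
      fi
      nJobs=`qstat -u phnxreco | grep -c phnxreco`
      if [ $nJobs -gt 0 ] ; then
          echo "$nJobs PBS jobs still present from the previous submission; exiting" ; exit 0
      fi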

  4. Check the database status to determine the current maximum run number, and which runs are calibrated
    This check involves three steps.
    1. Running of /gpfs3/RUN7PRDF/prod/run7/test/checkcalib
      The checkcalib script executes a single-line psql command on the Phenix table in the database to produce a calib.out file. The calib.out file is a list of run numbers which (as of May 26, 2007) starts at 228827 and ends at 23593. The calib.out file is used by the makefilelist.sh script described below.
    2. Running of /gpfs3/RUN7PRDF/prod/run7/test/checkrun
      The checkrun script executes two command lines. The first command is a psql command to produce a run.out file. The second command is a grep command on the run.out file to produce a run.info file. The run.info file has a single run number (236001 as of May 26, 2007), and this information is used to set the maxrun variable in the makefilelist.sh script.
    3. Checking OK status of the /home/phnxreco/prod/CONTROL file
      An OK line is checked to be the first line in the CONTROL file. That file was sent to the ACCRE phnxreco account as part of the database updating done on the VUPAC farm. If there is no such OK line, then the cautoRun7 script will exit with a message to that effect.
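
    The shell sketch below condenses these three checks. The psql host, table, and column names are placeholders; the real queries are contained in the checkcalib and checkrun scripts.

      # Sketch of the step 4 database and CONTROL file checks (placeholder queries)
      psql -h DBHOST -t -c "SELECT runnumber FROM calibrated_runs" -o calib.out   # placeholder
      psql -h DBHOST -t -c "SELECT max(runnumber) FROM run_table"  -o run.out     # placeholder
      grep '[0-9]' run.out > run.info          # single line with the maximum run number
      if [ "`head -1 /home/phnxreco/prod/CONTROL`" != "OK" ] ; then
          echo "No OK line at the top of the CONTROL file; exiting" ; exit 1
      fi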

  5. Move old contents using the move.sh script
    This script runs the PERL script move.pl which functions as follows:
    1. The first step is to produce anew two files called ok.txt.move and fail.txt.move which are located in the /gpfs3/RUN7PRDF/prod/run7/list directory. These are obsolete files which were used during Run5 as part of a bbftp-based copy of the output files to RCF.
    2. The second step is to construct an internal file list of the full paths of all the log.txt files located in the subdirectories /gpfs3/RUN7PRDF/prod/run7/output/runs/batch*/run_* . These log files are scanned to determine which PRDF runs (run number and segment number) have failed; the failures are noted in the fail.txt.move file.
    3. Successful reconstruction output files are placed in the /gpfs3/RUN7PRDF/prod/run7/output/newstore areas according to the file type (CNT, DST, ...). These files will be transported to RCF. After the files are transported to RCF they are moved to the equivalent /gpfs3/RUN7PRDF/prod/run7/output/store areas.
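
    The move.pl logic can be pictured with the shell sketch below. The real script is Perl, and the way the file type is extracted from the file name here is an assumption made for illustration.

      # Sketch of the sorting of good output files into the newstore areas (illustrative only)
      for file in /gpfs3/RUN7PRDF/prod/run7/output/runs/batch*/run_*/*.root ; do
          type=`basename $file | cut -d_ -f1`          # e.g. CNT, DST, ... (assumed naming convention)
          mkdir -p /gpfs3/RUN7PRDF/prod/run7/output/newstore/$type
          mv $file /gpfs3/RUN7PRDF/prod/run7/output/newstore/$type/
      done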

  6. Execute the cleanup script clean.sh script
    This script runs the two scripts clean_batch.sh and clean_newstore.sh .
    1. The clean_batch.sh script removes all the files in /gpfs3/RUN7PRDF/prod/run7/output/runs/batch* .
    2. The clean_newstore.sh script removes all the files in /gpfs3/RUN7PRDF/prod/run7/output/newstore/fail/*/* .
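
    In shell terms the two cleanup scripts amount to commands like the following (a sketch only; the actual scripts may differ in detail).

      # Sketch of the step 6 cleanup (illustrative only)
      rm -rf /gpfs3/RUN7PRDF/prod/run7/output/runs/batch*                 # clean_batch.sh
      rm -f  /gpfs3/RUN7PRDF/prod/run7/output/newstore/fail/*/*           # clean_newstore.sh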

  7. Execute the mklist.sh script to make a list of new input files.
    The mklist.sh script has three command lines. The first executes the makefilelist.sh script, which produces files called total_todo*.txt located in a list subdirectory. The second command is a line count on these files. The third command is the date. The outputs of these commands are written to the mklist.log file. The principal script, makefilelist.sh, functions as follows:
    1. A minimum run number is set to 228828, which was the first PRDF transferred to Vanderbilt for Run7. A maximum run number is set to the value in the run.info file produced by the checkrun script in step 4 above.
    2. The PRDF data file directory is presently set to /blue/phnxreco/RUN7PRDF/auauMinBias200GeV. A get_status script is run on the data directory /blue/phenix/RUN7PRDF/auauMinBias200GeV and the work directory /gpfs3/RUN7PRDF/prod areas. Three output files are produced: done.txt, ok.txt, and fail.txt. The done.txt file has the full path file names (including the /blue/phenix/..) which are in the "store" directory, namely those files which are already at RCF. The ok.txt file has the successfully reconstructed full path file names which are in the "newstore" directory awaiting transfer to RCF. The fail.txt file has the full path file names in the "newstore" fail subdirectory which means that these runs were not successfully reconstructed.
    3. The file list from the data file directory (/blue/phenix/...) is updated to produce a filelist_all.txt file. This accounts for any newly transferred files.
    4. The filelist_all.txt file is sorted into a unique-names filelist.txt file, although this file seems to be identical to the original file.
    5. The three status files ok.txt, done.txt, and total_done.txt are reduced to a unique-names tmp.txt file, which then replaces the previous total_done.txt file.
    6. A difference is made between the filelist.txt and the total_done.txt files to produce a diff.txt file.
    7. The diff.txt file is reduced to a total_todo_all.txt file by checking the relevant run range. This run range is from the minimum run number to the maximum (and likely uncalibrated) run number.
    8. A comparison is then made against the list of calibration runs. The current total_todo.txt file is erased and recreated with a zero length. The calib.out file is used to parse the total_todo_all.txt file with the result to be appended to the initially empty total_todo.txt file. This file then contains all the unprocessed input files which have valid calibrations.
    9. There are then four file lists prepared:
      1) todo_all_runlist.txt, produced from the full-path file names in total_todo_all.txt, giving a unique column of run numbers without the file segment numbers,
      2) todo_runlist.txt, produced from the calibrated file names list total_todo.txt, again giving a unique column of run numbers,
      3) done_runlist.txt, produced from the already-done file names list total_done.txt, giving a unique column of run numbers, and
      4) runlist.txt, produced from the total file list filelist.txt, giving the unique column of run numbers.
    10. The total_todo.txt file is copied to the history subdirectory with a name that includes the date and time, todo.txt_YYYY-MM-DD_HH:MM:SS. This file archives which input files were still to be processed as of a given date and time.

      Summarizing, the PRDF data source directory serves as the input for all the run numbers, while the "done" and the "newstore" directories contain run numbers which have already been processed. The difference between those two sets of run numbers makes up the list of runs still to be processed. Only those unprocessed runs which also are calibrated will be made available for the job submission scripts.
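
    The bookkeeping can be condensed into the shell sketch below. The sort/comm/grep commands and the six-digit run-number extraction are illustrative assumptions; the real makefilelist.sh may implement the same steps differently.

      # Condensed sketch of the makefilelist.sh set difference and calibration filter
      sort -u filelist_all.txt > filelist.txt                      # unique list of all PRDF files
      sort -u ok.txt done.txt total_done.txt > tmp.txt             # everything already processed
      mv tmp.txt total_done.txt
      comm -23 filelist.txt total_done.txt > diff.txt              # files not yet processed
      minrun=228828
      maxrun=`cat run.info`
      awk -v min=$minrun -v max=$maxrun '
          match($0, /[0-9][0-9][0-9][0-9][0-9][0-9]/) {
              run = substr($0, RSTART, RLENGTH) + 0
              if (run >= min+0 && run <= max+0) print
          }' diff.txt > total_todo_all.txt                         # keep only the valid run range
      > total_todo.txt
      for run in `cat calib.out` ; do
          grep $run total_todo_all.txt >> total_todo.txt           # keep only calibrated runs
      done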

  8. Execute the job submission script with a perl -w submit.pl >& submitPerl.log command
    For Run7 there is no launch.csh script as there was in Run6. Instead the submit.pl script itself contains the qsub commands. The submit.pl script acts as follows:
    1. An identification key is made from a time command. This identification tag is attached to the /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID subdirectory name.
    2. A count of the number of jobs to be run is obtained from the total_todo.txt file constructed previously. A maximum limit of 200 jobs is preset in the submit.pl script. The output of 200 jobs is about 760 GBytes, and that amount of data can be copied to RCF in the length of one, ~17 hour production cycle. If the transfer rate to RCF becomes too slow, then we will have to reduce to 150 or 100 jobs per cycle. Similarly, we would have to reduce to 150 or 100 jobs per cycle if we found that say in a 20 hour period only 150 or 100 jobs were completing.
    3. An output file steerlist.txt is constructed from the input file total_todo.txt by eliminating a certain number of skipped files from the beginning of the file; this number of skipped files is presently hardcoded at 0. If there are fewer than 201 files in the total_todo.txt list, then all of these run numbers are written to the steerlist.txt file. If there are more than 200 files in the total_todo.txt list, then only the last 100 are written to the steerlist.txt file.
    4. The /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID directory is created and the steerlist.txt file is copied to this directory.
    5. A "for" loop is executed for the variable ifile going from 0 to the number jobs to be run. In this for loop a set of commands is constructed to make /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID/run_ifile directories which will be the working directories for the reconstruction jobs. Into each of these working directories softlinks will be placed which link to the various input files needed during the events reconstruction.
    6. After the /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID/run_ifile areas are created and filled, the next step is to create the PBS production script for each of the 200 jobs. These PBS scripts are submitted by the submit.perl script in a multi-try fashion. Sometimes the PBS system is too busy and will reject a single qsub command. So the submit.perl script is set up to repeatedly try the qsub command, up to four more times, with a 20 second delay between each try. The information on the qsub result, whether it failed or succeeded, is written to the submitPerl.log.
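
    The multi-try submission can be pictured with the shell sketch below; the actual logic is inside the Perl script submit.pl.

      # Sketch of the multi-try qsub logic (illustrative shell version of the Perl code)
      pbsScript=$1                     # PBS production script for one job
      try=0
      submitted=no
      while [ $try -lt 5 -a "$submitted" = "no" ] ; do
          qsub $pbsScript >> submitPerl.log 2>&1
          if [ $? -eq 0 ] ; then
              submitted=yes
          else
              try=`expr $try + 1`
              sleep 20                 # wait 20 seconds before retrying a busy PBS server
          fi
      done
      if [ "$submitted" = "no" ] ; then
          echo "qsub abandoned for $pbsScript after 5 tries" >> submitPerl.log
      fi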


  9. Count how many jobs are in the PBS queue with the phnxreco account name. This count is written to the cautoRun7.log file.

  10. Start the process for the transfer of the nanoDSTs
    See the WWW site http://www.hep.vanderbilt.edu/~maguirc/Run7/nanoDSTTransferToRCF.html which details the steps by which the transfer of the nanoDSTs to RCF is accomplished. Also look at the WWW site http://www.hep.vanderbilt.edu/~maguirc/Run7/run7CronJobs.html which describes the various cron jobs used in the entire project.

Abnormal conditions during the job submission

  1. As mentioned already, the qsub command could fail. If 5 successive tries fail, then the qsub is abandoned for that job and the qsub for the next job is attempted. Very likely, if one job fails after 5 qsub tries, there will be other jobs which fail too. The submitPerl.log would have to be examined to see which jobs failed, and one could attempt to resubmit these manually.

  2. There was a change in the account name from phenix to phnxreco on May 25. This caused some initial confusion in the PBS accounting system, including one submission cycle where 65 jobs were killed immediately by one bad node. This is an example of the CPU-farm black hole effect. A single node may be defective and kill jobs prematurely. However, the job scheduler will not know the node is defective and will continue to submit successive jobs to this apparently idle but defective node. A very large set of jobs could be lost in this manner.

  3. Individual jobs can fail to connect to the database server properly. Typically this means that the job will crash very quickly. However, a failed database connection for the Trigger information will allow the job to continue running while producing many error messages. An automated script needs to be written which will kill such jobs and resubmit them, along with any jobs which initially failed because of a database connection problem. Note of May 29: I believe this problem was caused by bad coding in the phnxVandy_setup.csh file, which always made a new softlink for the .odbc.ini file. I changed that to not use a softlink, and there have been no DB connection failures since Sunday May 29.

  4. The logic of how the run numbers are chosen for a new set of runs is not fully understood by me. I tried to take out some CNTs from the store area to force these runs to be re-done. However, many other runs were then marked for re-doing also. So we have to understand this logic better.

  5. Recent jobs (older PRDFs, I think?) have been taking longer than 15 CPU hours instead of the expected 12-13 CPU hours, so I had to increase the wall clock time limit to 22 hours. I don't understand why this increase has occurred; perhaps there are more events in these files?

  6. Sometimes jobs will crash because of a faulty node. This happened on May 28 with three jobs. If the crash occurs within a few hours of the initial submission, it is worth it to resubmit the jobs. On the other hand, if more than 6 hours have elapsed, then one should not resubmit the jobs. They will be picked up in the next cycle.