Run7 CAUTO operation
Description of the main cron job for Run7 Min Bias PRDFs reconstruction
Contact Person: Charles F. Maguire
Creation Date: May 16, 2007
Last Update: May 29, 2007 (revisions after actual operations; the database updating
information is not yet accurate)
The cautoRun7 script is the main control script to assemble jobs to
be submitted to the PBS system for Run7 reconstruction on ACCRE. The
script is based on the Run6 cauto script, but with changes
needed because of the differences between Run6 pp and Run7 AuAu
collisions. The Run7 jobs take three times as much CPU (12-14 hours) as
did the Run6 jobs. Moreover, some of the compute nodes in ACCRE
for 2007 are quad-CPU nodes. The quad-CPU nodes have been found to be only
75% efficient in the ratio of CPU time to wall clock time, compared with
92% for the dual-CPU nodes. Hence, the actual run time on a quad-CPU
node is more than 17 real hours.
The Run6 jobs, 100 at a time, were submitted twice per day, at
4:05 AM and 4:05 PM. This left plenty of time for one cycle to be completed
before the next one was attempted. It also allowed the database to be
updated on VUPAC starting at midnight and finishing before 3 AM, without
interfering with the PBS jobs.
For Run7, with the 17 hour minimum reconstruction cycle time, and longer database
update times (~4.5 hours), it is not efficient to have a fixed cycle of
job submission and database updating. Instead there will be an hourly attempt
at job submission, and at database updating. There will be provisions to ensure
that no new jobs are submitted before the old cycle is finished, and similarly
that no database update is attempted while there are PBS jobs waiting in the queue.
The database update will be started, if needed, after two hours have
elapsed since the final PBS job started running. This two-hour delay ensures
that the PBS jobs will have completed their database accesses.
The database updating cron job also does not submit database update
attempts between 6 PM and midnight. New database restore files arrive
daily from RCF at 11 PM and are transferred to VUPAC at 11:35 PM. A database
update attempt started after 6 PM could find its restore files overwritten
after 11 PM. So the first attempt at a database update will be at midnight.
An important function of the cautoRun7 control script and its subscripts
is to keep track of which runs have been processed, and which new
runs need to be processed. This is done by the makefilelist.sh script
described below.
- Check host
The script will run only on the vmps18 gateway node. If the
script finds itself anywhere else, it will exit with a wrong node message. The
script is triggered every half hour at 5 and 35 minutes after the hour by a
cron job of the phnxreco account:
5,35 * * * * /gpfs3/RUN7PRDF/prod/run7/cautoRun7 >&
/gpfs3/RUN7PRDF/prod/run7/cautoRun7.log
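As an illustration, the host check at the top of cautoRun7 could be written as in the
following sketch; the exact test and the wording of the wrong node message in the real
script may differ:

    #!/bin/bash
    # Hypothetical sketch of the host check in cautoRun7
    NODE=`hostname -s`
    if [ "$NODE" != "vmps18" ]; then
        echo "cautoRun7: wrong node $NODE, this script runs only on vmps18"
        exit 1
    fi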
The script is presently located as a softlink in the
/gpfs3/RUN7PRDF/prod/run7 subdirectory of the phnxreco account on ACCRE.
The link points to the actual file, which is in a CVS controlled area,
/home/phnxreco/cvs/online/vanderbilt/run7AuAu. In fact all scripts
which are executed on either the ACCRE or the VUPAC nodes are located in
the home directory $CVSRUN7 area. The scripts on the firebird node are also in
CVS, but there is no CVS yet active on that node.
The first thing that the cautoRun7 script does is to check for the presence of
the file /gpfs3/RUN7PRDF/prod/run7/PRODUCTION_SUSPENDED. Such a file
would be manually placed in this directory location. If such a file is
found, then the cautoRun7 script will exit immediately. The presence
of this file has the same effect as commenting out the command in the
crontab file. This suspension file would be created manually if one
wanted to stop the production for some reason.
The second thing that the cautoRun7 script does is to check for the presence of
the file /gpfs3/RUN7PRDF/prod/run7/cautoRun7InProgress. This file
is an example of an automatically generated InProgress signal file.
Such signal files are crucially important and are generated and deleted
automatically as required. In this case, the cautoRun7 script takes about
40 minutes to complete its work before new jobs start running, so we do not
want another cautoRun7 instance starting 30 minutes after the first one starts.
The environment variable $PROD7 in the phnxreco account login points to
the /gpfs3 subdirectory location. The $PROD7 area is the "top" directory for the Run7 production since
large outputs are being generated and stored in some of its subdirectories.
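The two guard checks described above amount to logic like the following sketch, assuming
$PROD7 resolves to the /gpfs3/RUN7PRDF/prod/run7 area; the exit messages are illustrative
only:

    #!/bin/bash
    PROD7=/gpfs3/RUN7PRDF/prod/run7
    # Manual kill switch: exit at once if production has been suspended
    if [ -f $PROD7/PRODUCTION_SUSPENDED ]; then
        echo "cautoRun7: PRODUCTION_SUSPENDED file found, exiting"
        exit 0
    fi
    # Automatic lock: do not start a second cautoRun7 while one is still working
    if [ -f $PROD7/cautoRun7InProgress ]; then
        echo "cautoRun7: previous instance still in progress, exiting"
        exit 0
    fi
    touch $PROD7/cautoRun7InProgress   # removed when this cycle's work is finished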
- Check for still active FDT or gridFTP transfers
The cautoRun7
script itself will trigger a copy of the previous cycle's output to the
RCF area. This copy is done first by an FDT transfer from ACCRE to firebird,
and then by a gridFTP transfer from firebird to RCF. A new production cycle
should not take place until the previous production cycle has been completely
copied. One can work around this manually, but one must be careful that
no new files are stored in the newstore area while the previous cycle's
files are still present there. A zero-length
signal file, either fdtInProgress or gridFtpInProgress, will be present
in the /home/phnxreco/nanoDstTransfering directory if an
FDT or a gridFTP transfer is still in progress copying the previous production's
output.
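In sketch form, this check is simply a test for the two signal files (the message below is
illustrative):

    #!/bin/bash
    TRANSFER=/home/phnxreco/nanoDstTransfering
    if [ -f $TRANSFER/fdtInProgress ] || [ -f $TRANSFER/gridFtpInProgress ]; then
        echo "cautoRun7: previous cycle's output is still being copied to RCF, exiting"
        exit 0
    fi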
- Check jobs still running in PBS
If there are any jobs still
running in the phnxreco account from a previous job submission, the
script will exit with a report of the number of jobs still running.
Note: this
restricts the use of the phnxreco account in PBS to be only for
Run7 min bias PRDFs reconstruction until further notice.
There was a flaw in this check: if the status checking software in CVS
is broken, which does happen a few times per month, then the status check would
falsely appear to be OK. The cautoRun7 script was fixed to verify that the status
checking software on ACCRE is functioning properly.
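A sketch of both PBS checks is given below; the use of qstat -u and the way the broken
status software is detected here are assumptions about the actual implementation:

    #!/bin/bash
    # First make sure the status checking software itself is working
    if ! qstat -u phnxreco > /dev/null 2>&1; then
        echo "cautoRun7: PBS status checking is not functioning, exiting"
        exit 1
    fi
    # Then count jobs left over from the previous submission cycle
    NJOBS=`qstat -u phnxreco | grep -c phnxreco`
    if [ "$NJOBS" -gt 0 ]; then
        echo "cautoRun7: $NJOBS phnxreco jobs still in PBS, exiting"
        exit 0
    fi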
- Check the database status to determine the current maximum run number,
and which runs are calibrated
This check involves three steps.
- Running of /gpfs3/RUN7PRDF/prod/run7/test/checkcalib
The checkcalib
script executes a single-line psql command on the Phenix table in the database
to produce a calib.out file.
The calib.out file is a list of run numbers, which (as of May 26, 2007) starts at
228827 and ends at 23593. The calib.out file is used by the
makefilelist.sh script described below.
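A hedged sketch of what the checkcalib psql command could look like is shown below; the
database host, database name, table, and column names are placeholders, not the actual
schema:

    #!/bin/bash
    # Hypothetical sketch of checkcalib
    psql -h dbhost.example.edu -d calibrations -t \
         -c "SELECT runnumber FROM calibrated_runs ORDER BY runnumber;" > calib.out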
- Running of /gpfs3/RUN7PRDF/prod/run7/test/checkrun
The checkrun script
executes two command lines. The first command is a psql command to produce a
run.out file. The second command is a grep command on the
run.out file to produce a run.info file. The run.info
file has a single run number (236001 as of May 26, 2007), and this information is used
to set the maxrun variable in the makefilelist.sh script.
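Again in hedged sketch form, with placeholder host, database, and table names:

    #!/bin/bash
    # Hypothetical sketch of checkrun
    psql -h dbhost.example.edu -d rundatabase -t \
         -c "SELECT max(runnumber) FROM run;" > run.out
    # extract the run number from the psql output into run.info
    grep -o '[0-9][0-9]*' run.out > run.info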
- Checking OK status of the /home/phnxreco/prod/CONTROL file
The first line of the CONTROL file is checked to be an OK
line. That file
is sent to the ACCRE phnxreco account
as part of the database updating done on the VUPAC farm. If there is
no such OK line, then the cautoRun7 script will exit with a message to that effect.
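A minimal sketch of this check, assuming the first line literally contains the string OK:

    #!/bin/bash
    CONTROL=/home/phnxreco/prod/CONTROL
    if ! head -1 $CONTROL | grep -q OK; then
        echo "cautoRun7: no OK line at the top of $CONTROL, exiting"
        exit 1
    fi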
- Move old contents using the move.sh script
This script runs the
PERL script move.pl which functions as follows:
- The first step is to produce anew two
files called ok.txt.move and fail.txt.move which are
located in the /gpfs3/RUN7PRDF/prod/run7/list directory. These are obsolete
files which were used during Run5 as part of a bbftp-based copy of
the output files to RCF.
- The second step is to
construct an internal filelist of the full paths of all the log.txt files
located in the subdirectories /gpfs3/RUN7PRDF/prod/run7/output/runs/batch*/run_* .
These log files are examined to identify the PRDF runs (run number and segment
number) which have failed. The failed runs are noted in the fail.txt.move file.
- Successful reconstruction output files are placed in the
/gpfs3/RUN7PRDF/prod/run7/output/newstore areas according to the file type
(CNT, DST, ...). These files will be transported to RCF. After the
files are transported to RCF they are moved to the equivalent
/gpfs3/RUN7PRDF/prod/run7/output/store areas.
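The real move.pl is a Perl script; the bash sketch below only illustrates the
scan-and-sort logic. The success string searched for in log.txt and the CNT/DST output
file name patterns are assumptions:

    #!/bin/bash
    PROD7=/gpfs3/RUN7PRDF/prod/run7
    for log in $PROD7/output/runs/batch*/run_*/log.txt; do
        rundir=`dirname $log`
        if grep -q "Completed" $log; then
            # sort the good output files into newstore by file type (CNT, DST, ...)
            mv $rundir/CNT_*.root $PROD7/output/newstore/CNT/ 2>/dev/null
            mv $rundir/DST_*.root $PROD7/output/newstore/DST/ 2>/dev/null
        else
            # note this failed job's working directory for fail.txt.move
            echo $rundir >> $PROD7/list/fail.txt.move
        fi
    done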
- Execute the cleanup script clean.sh
This script runs the
two scripts clean_batch.sh and clean_newstore.sh.
- The clean_batch.sh script removes all the files in
/gpfs3/RUN7PRDF/prod/run7/output/runs/batch* .
- The clean_newstore.sh script
removes all the files in /gpfs3/RUN7PRDF/prod/run7/output/newstore/fail/*/* .
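In sketch form the two cleanup steps are just two recursive removals; whether the batch
directories themselves or only their contents are removed is an assumption:

    #!/bin/bash
    # clean_batch.sh: clear the per-job working areas of the finished cycle
    rm -rf /gpfs3/RUN7PRDF/prod/run7/output/runs/batch*
    # clean_newstore.sh: clear the files left by failed runs
    rm -rf /gpfs3/RUN7PRDF/prod/run7/output/newstore/fail/*/*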
- Execute the mklist.sh script to make a list of new input files.
The mklist.sh script has three command lines. The first is to
execute the makefilelist.sh script, which produces files called
total_todo*.txt located in the list subdirectory.
The second command is a line count on these files. The third command is
a date command. The outputs of these commands are written to the mklist.log
file. The principal script makefilelist.sh functions as follows:
- A minimum run number is set to 228828, which was the first PRDF
transferred to Vanderbilt for Run7. A maximum run number is set to be
the value in the run.info file produced in step 3b described above.
- The PRDF data file directory is presently set to
/blue/phnxreco/RUN7PRDF/auauMinBias200GeV.
A get_status script is run on the data directory /blue/phenix/RUN7PRDF/auauMinBias200GeV
and the work directory /gpfs3/RUN7PRDF/prod areas. Three output files are produced:
done.txt, ok.txt, and fail.txt.
The done.txt file has the full path file names (including the /blue/phenix/..) which are in the "store" directory,
namely those files which are already at RCF. The ok.txt file has the successfully
reconstructed full path file names which are in the "newstore" directory awaiting
transfer to RCF. The fail.txt file has the full path file names in the "newstore" fail
subdirectory which means that these runs were not successfully reconstructed.
- The file list from the data file directory (/blue/phenix/...)
is updated to produce a filelist_all.txt file. This accounts
for any newly transferred files.
- The filelist_all.txt file is sorted into a unique-names
filelist.txt file, although this file seems to be identical
to the original file.
- The three status files ok.txt, done.txt, and total_done are reduced to a
unique-names tmp.txt file,
which then replaces the previous total_done file.
- A difference is made between the filelist.txt and the
total_done.txt files to produce a diff.txt file.
- The diff.txt file is reduced to a total_todo_all.txt file
by checking the relevant run range. This run range is from the minimum
run number to the maximum (and likely uncalibrated) run number.
- A comparison is then made against the list of calibration runs.
The current total_todo.txt file is erased and recreated with
a zero length. The calib.out file is used to parse the
total_todo_all.txt file, with the results appended to the initially
empty total_todo.txt file. This file then contains all the unprocessed
input files which have valid calibrations.
- There are then four file lists prepared:
1) todo_all_runlist.txt, made from the full-path file names in total_todo_all.txt,
giving a unique column of run numbers without the file segment numbers,
2) todo_runlist.txt, made from the calibrated file names list
total_todo.txt, again giving a unique column of run numbers,
3) done_runlist.txt, made from the already-done file names list
total_done.txt, giving a unique column of run numbers, and
4) runlist.txt, made from the total file name list filelist.txt,
giving a unique column of run numbers.
- The total_todo.txt file is copied to the history subdirectory with a
name plus date and time as todo.txt_YYYY-MM-DD_HH:MM:SS. This file thus
archives which files were still to be processed as of a given date and time.
Summarizing, the PRDF data source directory serves as the input for all the
run numbers, while the "done" and the "newstore" directories contain run
numbers which have already been processed. The difference between those two
sets of run numbers makes up the list of runs still to be processed. Only
those unprocessed runs which also are calibrated will be made available
for the job submission scripts.
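A condensed sketch of this set logic is given below. It assumes each list file holds one
full-path PRDF file name per line and that all the list files sit in the same directory;
the temporary file names and the exact commands in the real makefilelist.sh may differ:

    #!/bin/bash
    cd /gpfs3/RUN7PRDF/prod/run7/list
    MINRUN=228828
    MAXRUN=`cat run.info`
    # all PRDF input files currently on disk, sorted with duplicates removed
    sort -u filelist_all.txt > filelist.txt
    # everything already at RCF or awaiting transfer counts as processed
    sort -u ok.txt done.txt total_done > tmp.txt
    mv tmp.txt total_done
    # still to do = on disk but not yet processed
    comm -23 filelist.txt total_done > diff.txt
    # keep only files whose run number lies in the [MINRUN, MAXRUN] range
    awk -v lo=$MINRUN -v hi=$MAXRUN \
      'match($0, /[0-9][0-9][0-9][0-9][0-9][0-9]/) {
         r = substr($0, RSTART, RLENGTH) + 0
         if (r >= lo + 0 && r <= hi + 0) print
       }' diff.txt > total_todo_all.txt
    # keep only runs that already have valid calibrations (run numbers in calib.out)
    > total_todo.txt
    grep -F -f calib.out total_todo_all.txt >> total_todo.txt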
- Execute the job submission script with a perl -w submit.pl >& submitPerl.log command
For Run7 there is no launch.csh script as there was in Run6. Instead the
submit.pl script itself contains the qsub commands.
The submit.pl script acts as follows:
- An identification key is made from a time command. This
identification tag is attached to the
/gpfs3/RUN7PRDF/prod/run7/output/runs/batchID
subdirectory name.
- A count of the number of jobs to be run is obtained from the
total_todo.txt file constructed previously. A maximum limit
of 200 jobs is preset in the submit.pl script. The output
of 200 jobs is about 760 GBytes, and that amount of data
can be copied to RCF in the length of one, ~17 hour production cycle.
If the transfer rate to RCF becomes too slow, then we will have to
reduce to 150 or 100 jobs per cycle. Similarly, we would have to
reduce to 150 or 100 jobs per cycle if we found that, say, in a
20-hour period only 150 or 100 jobs were completing.
- An output file steerlist.txt is constructed from the
input file total_todo.txt by eliminating from the beginning
of the file a certain number of skipped files. This number of skipped
files is presently hardcoded at 0. If there are fewer than 201 files
in the total_todo.txt list, then all of these run numbers are written
to the steerlist.txt file. If there are more than 200 files in
the total_todo.txt list, then only the last 100 are written
to the steerlist.txt file.
- The /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID directory is created
and the steerlist.txt file is copied to this directory.
- A "for" loop is executed for the variable ifile going from 0 to the
number jobs to be run. In this for loop a set of commands is constructed
to make /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID/run_ifile directories
which will be the working directories for the reconstruction jobs.
Into each of these working directories softlinks will be placed which
link to the various input files needed during the events reconstruction.
- After the /gpfs3/RUN7PRDF/prod/run7/output/runs/batchID/run_ifile
areas are created and filled, the next step is to create the PBS production
script for each of the 200 jobs. These PBS scripts are submitted
by the submit.pl script in a multi-try fashion. Sometimes the
PBS system is too busy and will reject a single qsub command, so the
submit.pl script is set up to repeatedly try the qsub command,
up to four more times, with a 20 second delay between each try (a sketch
of this retry logic is shown below). The information on each qsub result,
whether it failed or succeeded, is written to the submitPerl.log file.
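The actual submission loop lives in submit.pl and is written in Perl; the following bash
sketch only illustrates the retry behavior described above (taking the PBS script path as
an argument is an illustrative simplification):

    #!/bin/bash
    PBS_SCRIPT=$1            # PBS production script for one job
    NTRY=0
    until qsub $PBS_SCRIPT >> submitPerl.log 2>&1; do
        NTRY=`expr $NTRY + 1`
        if [ $NTRY -ge 5 ]; then
            echo "qsub abandoned for $PBS_SCRIPT after 5 tries" >> submitPerl.log
            break
        fi
        sleep 20             # give the busy PBS server time before the next try
    done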
- Count how many jobs are in the PBS queue with the phnxreco account name.
This count is written to the cautoRun7.log file.
- Start the process for the transfer of the nanoDSTs
See the WWW site
http://www.hep.vanderbilt.edu/~maguirc/Run7/nanoDSTTransferToRCF.html which
details the steps by which the transfer of the nanoDSTs to RCF is accomplished.
Also look at the WWW site
http://www.hep.vanderbilt.edu/~maguirc/Run7/run7CronJobs.html which
describes the various cron jobs used in the entire project.
- As mentioned already, the qsub command could fail. If 5 successive tries
fail, then the qsub is abandoned for that job and the qsub for the next
job is attempted. Very likely, if one job fails after 5 qsub tries, there will
be other jobs which fail too. The submitPerl.log file would have to
be examined to see which jobs failed, and one could attempt to resubmit
these manually.
- There was a change in the account name from phenix to
phnxreco on May 25. This caused some initial confusion in
the PBS accounting system, including one submission cycle where
65 jobs were killed immediately by one bad node. This is an
example of the CPU-farm black hole effect. A single node
may be defective and kill jobs prematurely. However, the job
scheduler will not know the node is defective and will continue
to submit successive jobs to this apparently idle but defective node.
A very large set of jobs could be lost in this manner.
- Individual jobs can fail to connect to the database server
properly. Typically this means that the job will crash very
quickly. However, a failed database connection for the Trigger
information will allow the job to continue running while producing
many error messages. An automated script needs to be written
which will kill such jobs, and resubmit them along with any job which initially
failed because of a database connection problem. Note of May 29: I
believe this problem was because of bad coding in the phnxVandy_setup.csh
file which always made a new softlink for the .odbc.ini file. I changed
that to not use a softlink, and there have been no DB connection
failures since Sunday May 29.
- The logic of how the run numbers are chosen for a new set of
runs is not fully understood by me. I tried to take out some
CNT's from the store area to force these runs to be re-done.
However, many other runs were then also marked for re-doing.
So we have to understand this logic better.
- Recent jobs (older PRDFs I think?) have been taking longer than 15 CPU
hours instead of the expected 12-13 CPU hours. So I had to increase the wall
clock time to 22 hours. I don't understand why this increase has occurred;
perhaps there are more events in these files?
- Sometimes jobs will crash because of a faulty node. This happened
on May 28 with three jobs. If the crash occurs within a few hours of
the initial submission, it is worthwhile to resubmit the jobs. On the
other hand, if more than 6 hours have elapsed, then one should not
resubmit the jobs. They will be picked up in the next cycle.