The /tmp directory on remote workers must be open for read/write access; PROOF and ROOT write there.
I have redirected all possible temporary files to the PoD working directory, but there are still some files which ROOT/PROOF writes to /tmp, including the socket files of proof/xrootd.
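To check quickly whether /tmp on a node is usable, a minimal sketch (run it on each worker, e.g. via ssh or as a test job; the file name pattern is arbitrary):

```shell
# Check that /tmp is writable; PROOF/xrootd need it for socket files.
if tmpfile=$(mktemp /tmp/pod_check.XXXXXX); then
    echo "/tmp is writable"
    rm -f "$tmpfile"
else
    echo "/tmp is NOT writable"
fi
```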
Since AFS doesn't support pipes, you need to change the PoD server working directory in the PoD user defaults configuration, so that the new directory no longer resides on AFS. Something like this should work:
[server]
#
# PoD working directory
#
work_dir=/tmp/manafov/
PoD sets up workers on the remote nodes and makes the PROOF master think (only when the PoD packet-forwarding connection is used) that all of its workers are on the localhost. Actually, PoD hides the remote PROOF workers from the PROOF server and acts as a "proxy" between them. Since the default value of PROOF_MaxSlavesPerNode is 2, only 2 slaves get packets. Because all slaves appear (to the PROOF server) to be on the localhost, the other Y-2 workers won't get packets.
For more information, see the PROOF Wiki:

PROOF_MaxSlavesPerNode
  Type: int
  Description: Parameter for the packetizers. Limit the number of slaves accessing data on any single node.
  Default Value: In TPacketizer the default value is 4. In TPacketizerAdaptive and TPacketizerProgressive it is 2.
According to another source of information, the default number of workers reading remotely from one file node (worker machine) is not 2, but rather the number of CPU cores of the master node.
To resolve this issue, you need to change one parameter of your PROOF session (100 is only an example):
proof->SetParameter( "PROOF_MaxSlavesPerNode", (Long_t)100 );
Hopefully, in the future it will be possible to do this through the XROOTD configuration file, and PoD will manage it for you automatically.
If PoD doesn't work for you out of the box at CERN on LSF and PoD jobs fail to start PoD workers, then most probably you are facing the so-called "gLite environment" issue. Check the logs from the PoD jobs; if you see something like the following in the std_XXX.out of the job, then most probably xproofd will fail to start:
LD_LIBRARY_PATH=/tmp/PoD_cgmwU22625:/opt/d-cache/dcap/lib:/opt/d-cache/dcap/lib64:/opt/glite/lib:/opt/glite/lib64:/opt/globus/lib:\
/opt/lcg/lib:/opt/lcg/lib64:/usr/lib64:/afs/cern.ch/user/m/mbellomo/PoD/3.6/lib:/afs/cern.ch/sw/lcg/external/qt/4.4.2/x86_64-slc5-gcc43-opt/lib:\
/afs/cern.ch/sw/lcg/external/Boost/1.47.0_python2.6/x86_64-slc5-gcc43-opt//lib:/afs/cern.ch/sw/lcg/app/releases/ROOT/5.30.00/x86_64-slc5-gcc43-opt/root/lib:\
/afs/cern.ch/sw/lcg/contrib/gcc/4.3.5/x86_64-slc5-gcc34-opt/lib64:/afs/cern.ch/sw/lcg/contrib/mpfr/2.3.1/x86_64-slc5-gcc34-opt/lib:\
/afs/cern.ch/sw/lcg/contrib/gmp/4.2.2/x86_64-slc5-gcc34-opt/lib:/opt/classads/lib64/:/opt/c-ares/lib/
Note the "glite" entries in the paths.
As a workaround, you can try to use Section 5.2, "User's environment on workers"; in that script you need just two lines:
#! /usr/bin/env bash
export LD_LIBRARY_PATH=/afs/cern.ch/sw/lcg/external/qt/4.4.2/x86_64-slc5-gcc43-opt/lib:\
/afs/cern.ch/sw/lcg/external/Boost/1.44.0_python2.6/x86_64-slc5-gcc43-opt//lib:/afs/cern.ch/sw/lcg/app/releases/ROOT/5.30.00/x86_64-slc5-gcc43-opt/root/lib:\
/afs/cern.ch/sw/lcg/contrib/gcc/4.3.5/x86_64-slc5-gcc34-opt/lib64:/afs/cern.ch/sw/lcg/contrib/mpfr/2.3.1/x86_64-slc5-gcc34-opt/lib:\
/afs/cern.ch/sw/lcg/contrib/gmp/4.2.2/x86_64-slc5-gcc34-opt/lib:/afs/cern.ch/alice/library/afs_volumes/vol12/geant3/lib/tgt_linuxx8664gcc:\
/afs/cern.ch/alice/library/afs_volumes/vol12/AliRoot/lib/tgt_linuxx8664gcc
or any suitable LD_LIBRARY_PATH you like, just without the gLite libs.
Another solution is to find out what injects the gLite environment into your LSF jobs and get rid of it.
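Instead of hardcoding a full replacement path, you can also filter the offending entries out of the existing LD_LIBRARY_PATH. A minimal sketch, assuming the gLite/Globus entries can be recognized by the substrings "glite" and "globus" in your installation:

```shell
# Drop any path entries containing "glite" or "globus" from a
# colon-separated library path (the substring match is an assumption
# about how your gLite installation names its directories).
strip_glite() {
    echo "$1" | tr ':' '\n' | grep -v -e glite -e globus | paste -sd: -
}
# Usage:
# export LD_LIBRARY_PATH=$(strip_glite "$LD_LIBRARY_PATH")
```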
If your home directory is on AFS, then you need to give Condor permission to access some of your PoD directories. Namely, Condor needs full access to the following folders: $HOME/.PoD/wrk and $HOME/.PoD/log. The latter is the default path for PoD logs. If you changed this path in the PoD user defaults settings and the new log directory is also on AFS, then you need to open it accordingly.
You can do that by issuing the following commands:
fs setacl -dir $HOME/.PoD/wrk -acl system:anyuser rlidwk
fs setacl -dir $HOME/.PoD/log -acl system:anyuser rlidwk
You may want to compile CLASSADS with namespace support, because the gLite UI contains a CLASSADS build compiled without namespace support, while some of the gLite API libraries (WMSUI, for example) require classads with namespace support. This issue will prevent GAW from being built properly.
Download classads-0.9.9 from here.
tar -xzvf classads-0.9.9.tar.gz
cd classads-0.9.9
./configure --enable-namespace
make
make install
Be advised that some Linux distributions ship a ClassAds package of their own, for example Fedora 9.
If you have gLiteUI installed from the relocatable tarball, then you may hit this gLite bug and see the following (or similar) error messages while compiling the GAW library:
grep: /opt/globus/lib/libglobus_ftp_control_gcc32dbg.la: No such file or directory
/bin/sed: can't read /opt/globus/lib/libglobus_ftp_control_gcc32dbg.la: No such file or directory
libtool: link: `/opt/globus/lib/libglobus_ftp_control_gcc32dbg.la' is not a valid libtool archive
One solution is simply to copy the Globus libs to /opt/:
cd $GLOBUS_LOCATION
mkdir -p /opt/globus
cp -rv lib /opt/globus/
If you have gLiteUI installed from the relocatable tarball, then you may face this gLite bug, which will prevent GAW from being built properly.
One of the solutions is simply to get these headers from another installation.
For a long time now, the gLite UI_TAR installation (and, I suspect, gLite UI as well) has been missing "globus_config.h" (see CERN Savannah - gLite Bug #31180). The gLite API references this file but does not provide it. Users therefore have to find it somewhere else to let GAW use the gLite API.
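Locating the missing header can be sketched as a small search helper; the fallback search root /opt/globus and the idea of copying the header into the Globus include directory are assumptions, adjust them to your installation:

```shell
# Search a Globus/gLite tree for the missing globus_config.h
# (gLite Bug #31180); the search root is passed as an argument,
# /opt/globus is only an assumed fallback.
find_globus_config() {
    find "${1:-/opt/globus}" -name globus_config.h 2>/dev/null | head -n 1
}
# Once found, copy it next to the other Globus headers, e.g.:
# hdr=$(find_globus_config "$GLOBUS_LOCATION")
# [ -n "$hdr" ] && cp "$hdr" "$GLOBUS_LOCATION/include/"
```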