Setting up the Workflow
For preparing the workflow, we need to prepare the input-files and the tools folders.

input-files Folder

The ligand database which should be screened needs to be stored on the cluster file system to be available to VFVS. New or custom libraries can be prepared with VFLP, or ready-prepared libraries can be used, such as the REAL Database of Enamine, which we provide on the VirtualFlow homepage:
Here in this tutorial, we use the REAL library. For this purpose, we go to the homepage via the link above and use the slider filters to select a smaller subset of the library, consisting of approximately 50,000 compounds, as in the image below:
After that, we download the wget file for the tranches (tranches.sh) and the collection-length file (collections.txt). The file tranches.sh is a script containing the shell commands which download the selected part of the ligand database. The collection-length file contains the number of ligands in each of the ligand collections belonging to the selected tranches.
We replace the file tools/templates/todo.all with the file collections.txt. The todo.all file is the central todo file of the workflow.
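A minimal sketch of this step, assuming collections.txt was downloaded to the root folder of the VFVS installation (adjust the path to wherever you stored the file):

cp collections.txt tools/templates/todo.all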
Then we go to the input-files/ligand-library directory:
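Assuming we start in the root folder of the VFVS installation:

cd input-files/ligand-library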
Next, we move the file tranches.sh into this folder, and source it:
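A sketch, assuming tranches.sh was initially placed in the root folder of the VFVS installation (the source path is an assumption; adjust it to wherever you stored the file):

mv ../../tranches.sh .
source tranches.sh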
This will download all the ligand tranches/collections which were selected before in the tranche table.
The docking input files comprise the receptor structure and the docking program configuration file. All file paths specified in the docking program configuration files need to be relative to the tools folder.
The preparation of the docking program input files depends on the docking program and the receptor to be used; this is external to VirtualFlow. For most docking programs, a separate homepage and tutorials exist, such as for AutoDock Vina: http://vina.scripps.edu/
We now change the directory back to the input-files folder:
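Since we are currently in input-files/ligand-library, one level up brings us back:

cd ..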
For this tutorial, we will use the docking input files of Tutorial 1. They can be downloaded here. Using the wget command, we can get the file directly on the cluster:
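A sketch of the download step; the URL below is a placeholder, substitute the actual download link referenced above:

wget <download-link-from-above>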
And then, we can extract the files:
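Assuming the downloaded archive is a tar file named input-files.tar (the filename is an assumption; use the name of the file you downloaded):

tar -xvf input-files.tar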
[For more information about the preparation of the input-files folder in general, see the corresponding section in the documentation.]
tools Folder

At first, we change the directory to the tools folder:
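Assuming we are still in the input-files folder:

cd ../tools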
tools/templates/all.ctrl File

As a next step, we will edit the file tools/templates/all.ctrl. It needs to be adjusted according to the cluster/batch system which you are using. In particular, the following settings are cluster dependent (see the example after this list):
batchsystem: The resource manager which is used by your cluster.
partition: The partition (queue) to be used for running the jobs of this workflow/tutorial.
timelimit: Each partition/queue normally has a time limit, therefore make sure your jobs don't exceed it.
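A sketch of how these settings might look in all.ctrl for a SLURM cluster; the partition name and time limit are assumptions which must match your cluster:

batchsystem=SLURM        # the resource manager of your cluster (e.g. SLURM)
partition=shared         # hypothetical partition name; use one available on your cluster
timelimit=7-00:00:00     # example 7-day limit in SLURM format; adjust to your partition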
Everything else is set up in a way which should work on most clusters. The jobs are pre-configured such that each job uses one CPU core on one node.
Very few clusters require that full nodes are used, and sometimes even a minimum number of nodes per job. In this case, the following settings need to be adjusted as well (see the sketch after this list):
steps_per_job: This equals the number of nodes which should be used per job, and it should be at least as large as the minimum number of nodes per job, since one job step is used for each node.
cpus_per_step: In this case, where entire nodes are used, this should be set to the number of CPUs per node to utilize them fully.
queues_per_step: This should be set to the value of cpus_per_step, since in this case we have one CPU per queue (i.e., the docking programs which are executed within the queues are set to use just one CPU core).
cpus_per_queue: This is the number of CPU cores which are used per queue, and thus also per docking program instance. Normally, it is most efficient to set up the workflow such that each docking program uses one CPU core, and that there is one queue per core.
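A sketch for a hypothetical cluster which requires full nodes with 24 CPU cores each (the core count is an assumption):

steps_per_job=1          # one job step (node) per job; increase if a minimum node count is required
cpus_per_step=24         # number of CPU cores per node
queues_per_step=24       # one queue per CPU core
cpus_per_queue=1         # each docking program instance uses one core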
Regarding the parameters ligands_todo_per_queue and ligands_per_refilling_step, for this tutorial we set them to something smaller than the default values, which are more suitable for large-scale production runs. A value of 1000 might be favorable for both of them:
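In all.ctrl, this corresponds to:

ligands_todo_per_queue=1000
ligands_per_refilling_step=1000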
Regarding the logging of errors, it is favorable to enable full logging when trying to get a workflow working:
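A sketch of possible logging settings in all.ctrl; the parameter names follow the all.ctrl template, but verify them against the comments in your version of the file:

verbosity_logfiles=debug                   # full logging output
store_queue_log_files=all_uncompressed     # keep every queue log file, uncompressed
error_sensitivity=high                     # treat most problems as errors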
After everything seems to run smoothly, one can change the logging options and restart the workflow, since extensive uncompressed logging takes up too much disk space for large-scale production runs.
Regarding the docking scenario options, the docking_scenario_inputfolders parameter needs to be set according to the names of the docking scenario folders in the input-files folder. The docking programs are specified with the docking_scenario_programs option. For now, we use the following settings:
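A sketch, assuming the Tutorial 1 input files provide two docking scenario folders named qvina02_rigid_receptor1 and smina_rigid_receptor1 (the scenario/folder names are assumptions; verify them against the folders you extracted):

docking_scenario_names=qvina02_rigid_receptor1:smina_rigid_receptor1
docking_scenario_programs=qvina02:smina_rigid
docking_scenario_replicas=1:1
docking_scenario_inputfolders=../input-files/qvina02_rigid_receptor1:../input-files/smina_rigid_receptor1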
To submit jobs to the batch systems on clusters, job files are needed. What they look like depends on the type of batch system. For most batch systems, VirtualFlow has pre-configured job files stored in the tools/templates/ folder:
SLURM: template1.slurm.sh
PBS: template1.pbs.sh
SGE: template1.sge.sh
LSF: template1.lsf.sh
Torque/MOAB: template1.torque.sh
For most clusters, the job files do not have to be edited manually. However, some clusters require specific/custom settings in the job files. If this is the case for your cluster, you need to edit the job templates and adjust them accordingly.
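For example, on a SLURM cluster one might need to add an account or constraint line to template1.slurm.sh; the values below are hypothetical:

#SBATCH --account=my_project     # hypothetical account name
#SBATCH --constraint=intel       # hypothetical node constraint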
[For more information about the preparation of the tools folder in general, see the corresponding section in the documentation.]
workflow and output-files Folders

All other files are already set up, in particular the file tools/templates/todo.all, which contains the ligand collections which should be processed/screened, as well as the docking input files in the folder input-files, which also contains the input ligand database.
This means we are ready to prepare the folders, which can be done with the command vf_prepare_folders.sh. The command will delete old files from previous runs if present, and prepare the workflow folder, which is used by VirtualFlow during runtime to organize and log the workflow. To prepare the folders, simply run the command:
./vf_prepare_folders.sh
The command will ask you if you really want to reset/prepare the relevant folders, since this may delete files from previous runs.
[For more information about the preparation of the workflow and output-files folders in general, see the corresponding section in the documentation.]