Server Pipelines
Our lab has access to powerful computing resources and support through the Center for High-Throughput Computing (CHTC). Our core bioinformatics and image processing pipelines will be deployed through CHTC servers. All pipelines will be maintained on GitHub and associated with Docker environments to ensure reproducibility. Many of our pipelines will use Nextflow.
In general, pipelines will be run in three steps:
- Staging: input files will be transferred to the CHTC server from lab storage
- Pipeline: files will be processed using established pipelines
- Output: desired outputs will be transferred from the CHTC server to lab storage
A. Center for High-Throughput Computing (CHTC)
Consult official CHTC and HTCondor documentation before getting started. Register for an account using this form.
The HTC System
- Execute (Compute) nodes

The CHTC has an extensive set of execute nodes. To establish priority access for certain pipelines, our lab has secured a prioritized node that can be accessed on demand using a designated flag.

- Typical nodes: 20 cores, 128 GB RAM
- High-memory nodes: e.g., 80 cores, 4 TB RAM
- Dedicated lab node: 40 cores (80 threads with hyperthreading), 512 GB RAM, 3.8 TB HD
- Submit nodes

Jobs on the CHTC are deployed from submit nodes. You can `ssh` into our assigned submit node (submit2) to run and monitor jobs using your UW net-id and password:

```
ssh {net-id}@submit2.chtc.wisc.edu
```

Note: If you correctly updated your `~/.bash_profile` by following the macOS environment setup instructions, then you can use the simple `submit` command to `ssh` into the node.
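As a rough illustration, shortcuts along the following lines in `~/.bash_profile` would provide the `submit` and `transfer` commands referenced in the notes in this section; the exact definitions live in the macOS environment setup instructions, and the alias form shown here is an assumption:

```
# Hypothetical ~/.bash_profile shortcuts (names taken from the notes; definitions assumed)
alias submit='ssh {net-id}@submit2.chtc.wisc.edu'       # log into the CHTC submit node
alias transfer='ssh {net-id}@transfer.chtc.wisc.edu'    # log into the CHTC transfer server
```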
- File system

Each net-id is associated with a `home` folder, where we manage job submission scripts. Our lab also has a shared `staging` folder for transferring large files in and out of the CHTC system. The CHTC does not use a shared file system, but you can request the storage you need for any given job.

```
/
├── home/{net-id}/                       [quota: 20 GB, submit script dir]
└── staging/groups/zamanian_group/       [quota: 1 TB | 100 files]
    ├── input/                           [input dir: unprocessed (raw) data]
    ├── output/                          [output dir: processed job outputs]
    └── WBP.tar.gz                       [permanent storage of WBP data]
```
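To keep an eye on these quotas, ordinary disk-usage commands are sufficient; the following is a simple illustration (not a CHTC-specific tool), run from a node with access to these paths:

```
# approximate size of your home directory
du -sh ~/

# size and file count of the shared staging folder (quota is 1 TB and 100 files)
du -sh /staging/groups/zamanian_group/
find /staging/groups/zamanian_group/ -type f | wc -l
```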
Deploying Pipelines
- Staging - transfer input data for processing (ResearchDrive -> CHTC)
In almost all cases, you will directly transfer your input data from ResearchDrive to the CHTC staging input folder. Most raw data on ResearchDrive is unarchived and uncompressed. However, our pipelines expect a single archived folder (.tar) as input and will deliver a single archived folder as output. Use the command below to transfer an unarchived folder on ResearchDrive to CHTC input and have it archived on arrival.
ResearchDrive -> CHTC transfer of unarchived raw data folder (archived on arrival)
```
# Log into transfer server and navigate to staging input dir
ssh {net-id}@transfer.chtc.wisc.edu
cd /staging/groups/zamanian_group/input/

# Example of transferring sequencing data
smbclient -k //research.drive.wisc.edu/mzamanian/ -D "UWBC-Dropbox/Bioinformatics Resource Center" -Tc 201105_AHLVWJDSXY.tar "201105_AHLVWJDSXY"

# Example of transferring ImageXpress data
smbclient -k //research.drive.wisc.edu/mzamanian/ -D "ImageXpress/raw" -Tc 20201118-p01-MZ_172.tar "20201118-p01-MZ_172"
```
ResearchDrive -> CHTC transfer of unarchived metadata folder (archived on arrival)
```
# Log into transfer server and navigate to staging metadata dir
ssh {net-id}@transfer.chtc.wisc.edu
cd /staging/groups/zamanian_group/metadata/

# Example of transferring ImageXpress metadata
smbclient -k //research.drive.wisc.edu/mzamanian/ -D "ImageXpress/metadata" -Tc 20201118-p01-MZ_172.tar "20201118-p01-MZ_172"
```
Note: If you correctly updated your `~/.bash_profile` by following the macOS environment setup instructions, then you can use the simple `transfer` command to `ssh` into the node.

For ImageXpress data, an entire experiment may include >10 plates that will take hours to days to transfer. To facilitate batch transfers, we have included two scripts in the `input/` and `metadata/` directories of `/staging/groups/zamanian_group/` called `transfer_images.sh` and `transfer_metadata.sh`. These scripts reference the text file `/staging/groups/zamanian_group/plates.txt` and loop through the plate names, sequentially transferring them from ResearchDrive and archiving them upon arrival (a simplified sketch of this loop is shown after the `screen` example below). Edit `plates.txt` in the terminal using `vi` or `nano`, or edit the file locally and then upload it to `/staging/groups/zamanian_group/input/` (be sure to include a single plate name on each line, with a blank line at the end of the file). Run `sh transfer_metadata.sh` and `sh transfer_images.sh` to initiate the transfers.

Both of these scripts require a continuous ssh connection while they transfer. Transferring metadata will only take a few minutes, but transferring multiple plates of images will take several hours. The transfer process can be sent to the background (allowing you to close the ssh connection) by using the `screen` tool. There are a number of helpful tutorials online, but a few sample commands are shown below.

Background transfer using `screen`
```
# Log into transfer server and navigate to staging input dir
ssh {net-id}@transfer.chtc.wisc.edu
cd /staging/groups/zamanian_group/input/

# Start a screen named 'transfer'
screen -S transfer

# Initiate the transfer
sh transfer_images.sh

# Detach from the screen by pressing Ctrl+a and then d

# Reattach to the screen
screen -r transfer

# Close the screen
exit
```
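For reference, the batch transfer scripts follow roughly this pattern. This is a simplified sketch of `transfer_images.sh` based on the description above, not a copy of the script in staging; the exact smbclient share, directory, and options may differ:

```
#!/bin/bash
# Simplified sketch (assumed structure); see the actual script in /staging/groups/zamanian_group/input/
# Loop over plate names in plates.txt, transferring each plate from ResearchDrive and archiving it on arrival
while read -r plate; do
    [ -z "$plate" ] && continue    # skip blank lines
    echo "Transferring ${plate}..."
    smbclient -k //research.drive.wisc.edu/mzamanian/ -D "ImageXpress/raw" -Tc "${plate}.tar" "${plate}"
done < /staging/groups/zamanian_group/plates.txt
```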
Rarely, you may have to transfer data from other sources to CHTC staging input. You can run simple transfer commands from your computer:
```
scp [dir] {net-id}@transfer.chtc.wisc.edu:/staging/groups/zamanian_group/input/
```
- Pipeline - Submit and manage CHTC jobs
CHTC uses HTCondor for job scheduling. Submission files should follow lab conventions and be consistent with the CHTC documentation. An example submit script with annotations is shown below. This submit script (Core_RNAseq-nf.sub) loads a pre-defined Docker environment and runs a bash executable script (Core_RNAseq-nf.sh) with defined arguments on the execute node. Other options define log files, resource requirements, and transfer of files in/out of `home`. Avoid transferring large files in/out of `home`! We transfer in our large data through `/staging/groups/zamanian_group/input/` and we move job output files to `/staging/groups/zamanian_group/output/` within the job executable script to avoid their transfer to `home` upon job completion. The only files that should be transferred back to `home` are small log files.

Example CHTC job submission scripts (.sub / .sh)
```
# Core_RNAseq-nf.sub
# Input data in /staging/{net-id}/input/$(dir)
# Run: condor_submit Core_RNAseq-nf.sub dir=191211_AHMMC5DMXX script=Core_RNAseq-nf.sh

# request Zamanian Lab server
Accounting_Group = PathobiologicalSciences_Zamanian

# load docker image; request execute server with staging
universe = docker
docker_image = zamanianlab/chtc-rnaseq:v1
Requirements = (Target.HasCHTCStaging == true)

# executable (/home/{net-id}/) and arguments
executable = $(script)
arguments = $(dir)

# log, error, and output files
log = $(dir)_$(Cluster)_$(Process).log
error = $(dir)_$(Cluster)_$(Process).err
output = $(dir)_$(Cluster)_$(Process).out

# transfer files in-out of /home/{net-id}/
transfer_input_files =
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# memory, disk and CPU requests
request_cpus = 80
request_memory = 500GB
request_disk = 1500GB

# submit 1 job
queue 1
### END
```
```
#!/bin/bash

# set home () and mk dirs
export HOME=$PWD
mkdir input work output

# echo core, thread, and memory
echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
grep 'cpu cores' /proc/cpuinfo | uniq
echo $(free -g)

# transfer input data from staging ($1 is ${dir} from args)
cp -r /staging/groups/zamanian_group/input/$1.tar input
cd input && tar -xvf $1.tar && rm $1.tar && mv */*/* $1 && cd ..

# clone nextflow git repo
git clone https://github.com/zamanianlab/Core_RNAseq-nf.git

# run nextflow command
export NXF_OPTS='-Xms1g -Xmx8g'
nextflow run Core_RNAseq-nf/WB-pe.nf -w work -c Core_RNAseq-nf/chtc.config --dir $1 \
  --star --qc --release "WBPS15" --species "brugia_malayi" --prjn "PRJNA10729" --rlen "150"

# rm files you don't want transferred back to /home/{net-id}
rm -r work input

# tar output folder and delete it
cd output && tar -cvf $1.tar $1 && rm -r $1 && cd ..

# remove staging output tar if there from previous run
rm -f /staging/groups/zamanian_group/output/$1.tar

# mv large output files to staging output folder; avoid their transfer back to /home/{net-id}
mv output/$1.tar /staging/groups/zamanian_group/output/
```
Log into the submit node to submit a job:

```
ssh {net-id}@submit2.chtc.wisc.edu
condor_submit Core_RNAseq-nf.sub dir=191211_AHMMC5DMXX script=Core_RNAseq-nf.sh
```
Other useful commands for monitoring and managing jobs

```
# check on job status
condor_q

# remove a specific job
condor_rm [job id]

# remove all jobs for user
condor_rm $USER

# interactive shell to running job on remote machine
condor_ssh_to_job [job id]
exit
```
- Output - transfer output data (CHTC -> ResearchDrive)

To transfer your job output folder from the CHTC staging output directory to ResearchDrive:

CHTC -> ResearchDrive transfer

```
# log into CHTC staging server and navigate to output folder
ssh {net-id}@transfer.chtc.wisc.edu
cd /staging/groups/zamanian_group/output/

# connect to lab ResearchDrive
smbclient -k //research.drive.wisc.edu/mzamanian

# turn off prompting and turn on recursive
smb: \> prompt
smb: \> recurse

# navigate to ResearchDrive dir for processed data (example)
smb: \> cd /ImageXpress/proc/

# transfer output data folder (example)
smb: \> mput 20201119-p01-MZ_200.tar
```
Output data can also be transferred to your computer directly from the CHTC (as shown in the command below), or from the mounted ResearchDrive if the data have already been moved to ResearchDrive.
```
scp -r {net-id}@transfer.chtc.wisc.edu:/staging/groups/zamanian_group/output/[dir] .
```
B. Docker
We will use Docker to establish consistent environments (containers) for our established pipelines. We will maintain Docker images on Docker Hub under the organization name 'zamanianlab'. These images can be loaded directly from Docker Hub in our CHTC submit scripts. The Dockerfiles used to create these images should be maintained in our GitHub Docker Repo. Install Docker Desktop for Mac and create a Docker Hub account to be associated with our organization on Docker Hub (zamanianlab).
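For example, with Docker Desktop installed you can pull a lab image from Docker Hub to your own machine (the image tag here matches the one used in the example submit script above):

```
docker pull zamanianlab/chtc-rnaseq:v1
```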
Building Docker Images
- Create a lab Docker Hub repo (e.g., zamanianlab/chtc-rnaseq)
- Create a Dockerfile and auxiliary (e.g., yaml) files in a folder with the repo name in the Docker GitHub repo.
The Dockerfile provides instructions to build a Docker image. In this case, we are starting with the official miniconda Docker image and then installing necessary conda packages into this image. You can search for existing Docker images on Docker Hub to build on, instead of starting from scratch.
Dockerfile
```
FROM continuumio/miniconda3
MAINTAINER mzamanian@wisc.edu

# install (nf tracing)
RUN apt-get update && apt-get install -y procps

# install conda packages
COPY conda_env.yml .
RUN \
  conda env update -n root -f conda_env.yml \
  && conda clean -a
```
The following yml file lists the `conda` packages to be installed. You can search for packages on Anaconda Cloud.

conda_env.yml
```
name: rnaseq-nf
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.8.5
  - nextflow=20.07.1
  - bwa=0.7.17
  - hisat2=2.2.1
  - stringtie=2.1.2
  - fastqc=0.11.9
  - multiQC=1.9
  - fastp=0.20.1
  - bedtools=2.29.2
  - bedops=2.4.39
  - sambamba=0.7.0
  - samtools=1.9
  - picard=2.20.6
  - bcftools=1.9
  - snpeff=4.3.1t
  - mrbayes=3.2.7
  - trimal=1.4.1
  - mafft=7.471
  - muscle=3.8.1551
  - seqtk=1.3
  - raxml=8.2.12
  - htseq=0.12.4
  - mirdeep2=2.0.1.2
```
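Package names and available versions can also be checked from the command line; this is a generic conda command, with the package and channels below taken from the file above as an example:

```
conda search -c bioconda -c conda-forge nextflow
```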
- Build Docker image
```
cd [/path/to/Dockerfile]
docker build -t zamanianlab/chtc-rnaseq .
```
- Test Docker image interactively
```
docker run -it --rm=TRUE zamanianlab/chtc-rnaseq /bin/bash
# Ctrl+D to exit
```
- Push Docker image to Docker Hub
```
docker push zamanianlab/chtc-rnaseq
```
Some useful Docker commands
```
# list docker images (same as `docker images`)
docker image ls

# remove an image
docker rmi [image]

# remove all docker containers
# (run first, because images are attached to containers)
docker rm -f $(docker ps -a -q)

# remove every Docker image
docker rmi -f $(docker images -q)
```
Testing Docker Pipelines
Before deploying a new pipeline on large datasets, test it using subsampled data: first locally, then on the CHTC server, and finally run the pipeline on the CHTC server with your full dataset. An example using RNAseq data is provided below.
- First, subsample your data into a more manageable size and store it in the staging `subsampled` folder.
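As an illustration, paired-end FASTQ files can be subsampled with `seqtk` (which is included in the lab conda environment); the file names, seed, and read count below are placeholders:

```
# subsample both mates with the same seed (-s) so read pairs stay matched
seqtk sample -s100 sample_R1.fastq.gz 100000 | gzip > sub_sample_R1.fastq.gz
seqtk sample -s100 sample_R2.fastq.gz 100000 | gzip > sub_sample_R2.fastq.gz
```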
- Run Docker container locally

```
docker run -it --rm=TRUE zamanianlab/chtc-rnaseq:v2 /bin/bash
```
- Simulate the steps in your submit scripts

Running commands in local Docker container
```
# set home to working directory
export HOME=$PWD

# make input, work, and output directories for nextflow
mkdir input work output

# clone GitHub repo that contains pipeline in development
git clone https://github.com/zamanianlab/Core_RNAseq-nf.git

# transfer sub-sampled files from CHTC staging into your input folder
scp -r {net-id}@transfer.chtc.wisc.edu:/staging/groups/zamanian_group/subsampled/191211_AHMMC5DMXX.tar input

# run your pipeline commands
# example of a nextflow command using chtc-local.config matched to your hardware specs
nextflow run Core_RNAseq-nf/WB-pe.nf -w work -c Core_RNAseq-nf/chtc-local.config --dir "191211_AHMMC5DMXX" --release "WBPS14" --species "brugia_malayi" --prjn "PRJNA10729" --rlen "150"
```
- Make changes to your GitHub pipeline, `push` those changes to GitHub, `pull` those changes into your local container, and re-run the Nextflow command until the pipeline behaves as expected.
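For example, after pushing changes to GitHub, the clone inside the running container can be updated before re-running the Nextflow command:

```
# update the previously cloned pipeline repo inside the container
cd Core_RNAseq-nf && git pull && cd ..
```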