Server Pipelines

Our lab has access to powerful computing resources and support through the Center for High-Throughput Computing (CHTC). We also have access to virtual servers through the UW Bioinformatics Resource Center (BRC), where sequencing data generated by the UW Biotechnology Center are delivered. Our core bioinformatics and image processing pipelines will be deployed on CHTC servers. All pipelines will be maintained on GitHub and associated with Docker environments to ensure reproducibility. Many of our pipelines will use Nextflow.

In general, pipelines will be run in three steps:

  • Staging: input files will be transferred to the CHTC server from lab storage
  • Pipeline: files will be processed using established pipelines
  • Output: desired outputs will be transferred from the CHTC server to lab storage
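
At the command level, these three steps typically map onto the transfer and submit commands detailed in Section B; a minimal sketch (placeholders in brackets/braces):

    # 1. Staging: transfer input data from lab storage to CHTC staging
    scp -r [dir] {net-id}@transfer.chtc.wisc.edu:/staging/{net-id}/input/

    # 2. Pipeline: submit the job from a CHTC submit node
    condor_submit Core_RNAseq-nf.sub dir=[dir] netid={net-id} script=Core_RNAseq-nf.sh

    # 3. Output: transfer processed outputs back to lab storage
    scp -r {net-id}@transfer.chtc.wisc.edu:/staging/{net-id}/output/[dir] .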

A. Lab Storage

Large data sets (raw and processed) will be stored in locations with reliable backup.

  • BRC (UW Biotech Center)

    Sequencing data generated by the UW Biotechnology Center is delivered to their BRC servers. These data can be accessed in numerous ways.

  • UW ResearchDrive

    ResearchDrive provides 5 TB (expandable) of secure backed-up storage. Lab members will have access to the lab ResearchDrive using their UW net-id and password. Our sequencing data from the BRC is automatically moved into ResearchDrive. We will also use ResearchDrive to store imaging and all other large data types. General instructions on how to connect to and transfer data in and out of ResearchDrive are provided here. Data will be transferred from ResearchDrive to the CHTC for pipeline-based processing. Outputs will be transferred from the CHTC back to ResearchDrive for long-term storage. Subsets of outputs will be transferred into our lab Box account for post-processing, analysis, and plotting. The current ResearchDrive directory structure is shown below:

    /
    ├── ImageXpress/                          [IX storage]
    │   ├── raw/                              [Raw exported data]
    │   ├── metadata/                         [Experiment metadata]
    │   └── proc/                             [CHTC-processed data]
    │
    ├── UWBC-Dropbox/                         [Auto-deposited data from the UWBC]
    │   ├── Bioinformatics Resource Center/   [Sequencing data]
    │   └── DNA Sequencing Sanger/            [Sanger data]
    │
    ├── Box/                                  [Box backup]
    │
    └── External/                             [External data]
    

    Instructions for connecting to ResearchDrive and transferring files in and out of it are linked above. Files can be transferred into the mounted ResearchDrive using a number of approaches, including simple drag-and-drop or command-line rsync or cp.

    Example on a Mac OS X system:

    Finder > Go > Connect to Server...
    smb://research.drive.wisc.edu/mzamanian
    
    rsync -rltv ~/Desktop/Data/[dir] /Volumes/mzamanian/ImageXpress/raw/
    
  • SVM Research Data Storage

    SVM PIs will soon have access to secure and backed-up research data storage through the school, provisioned initially with 335 TB of storage capacity. Once this storage solution is in place, we will migrate from UW ResearchDrive to the on-premises storage solution.

B. Center for High-Throughput Computing (CHTC)

Consult official CHTC and HTCondor documentation before getting started. Register for an account using this form.

The HTC System

  1. Compute Nodes

    The CHTC has an extensive set of execute nodes that can be accessed for free. To establish priority access for certain pipelines, our lab has also secured a dedicated node that can be accessed on-demand by setting a designated accounting group in the submit file (see the snippet after this list).

    • Typical nodes: 20 cores, 128 GB RAM
    • High-memory nodes: e.g., 80 cores, 4 TB RAM
    • Dedicated lab node: 40 cores (80 hyperthreading), 512 GB RAM, 3.8 TB HD
      [ Maximum request on lab node: CPU = 80, Memory = 500 GB, Disk = 3500 GB ]
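
    In the submit file (shown in full under "Deploying Pipelines" below), the dedicated lab node is requested with the lab accounting group line:

    # request the dedicated Zamanian Lab node
    Accounting_Group = PathobiologicalSciences_Zamanian
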
  2. Submit nodes

    Jobs on the CHTC are deployed from submit nodes. You can ssh into your assigned submit node to run and monitor jobs using your UW net-id and password.

    ssh {net-id}@submit3.chtc.wisc.edu

  3. File system

    The CHTC does not use a shared file system, but you can request the storage you need for any given job. Each net-id will be associated with a home folder, where you will manage your job submission scripts. Each net-id will also be associated with a staging folder, for transfer of files in and out of the CHTC system.

    /
    ├── home/{net-id}/              [initial quota: 20 GB, submit script dir]
    └── staging/{net-id}/           [initial quota: 200 GB | 1000 files]
        ├── input/                  [input dir: unprocessed (raw) data]
        └── output/                 [output dir: processed job outputs]
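
    A minimal one-time sketch for creating the input/ and output/ subdirectories shown above (assuming they are not already present in your staging folder):

    # log into the CHTC transfer node
    ssh {net-id}@transfer.chtc.wisc.edu

    # create the input/output subdirectories used by our pipelines
    mkdir -p /staging/{net-id}/input /staging/{net-id}/output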
    

Deploying Pipelines

  1. Staging input data for processing

    Transfer a single folder containing your job input files to the staging input directory. You can run transfer commands from your computer or from the BRC server (sequencing data).

    scp -r [dir] {net-id}@transfer.chtc.wisc.edu:/staging/{net-id}/input/

    More typically, you will be transferring directly between ResearchDrive and CHTC. To transfer a directory from ResearchDrive to the CHTC staging input folder:

    ResearchDrive -> CHTC transfer:
    # log into CHTC staging server and navigate to input folder
    ssh {net-id}@transfer.chtc.wisc.edu
    cd /staging/{net-id}/input/
    
    # connect to lab ResearchDrive
    smbclient -k //research.drive.wisc.edu/mzamanian
    
    # turn off prompting and turn on recursive
    smb: \> prompt
    smb: \> recurse
    
    # navigate to ResearchDrive dir with raw data (example)
    smb: \> cd /ImageXpress/raw/
    
    # transfer raw data folder (example)
    smb: \> mget 20200922-p01-NJW_114
    

  2. Creating job submit scripts

    CHTC uses HTCondor for job scheduling. Submission files (.sub) should follow lab conventions and be consistent with the CHTC documentation. An example submit script with annotations is shown below. This submit script (Core_RNAseq-nf.sub) loads a pre-defined Docker environment and runs a bash executable script (Core_RNAseq-nf.sh) with defined arguments (staged data location).

    Other options define standard log files, resource requirements (cpu, memory, and hard disk), and transfer of files in/out of home. Avoid transferring large files in/out of home! We transfer in our large data through /staging/{net-id}/input/ and we move job output files to /staging/{net-id}/output/ within the job executable script to avoid their transfer to home upon job completion. The only files that should be transferred back to home are small log files.

    Core_RNAseq-nf.sub:
    # Core_RNAseq-nf.sub
    # Input data in /staging/{net-id}/input/$(dir)
    # Run: condor_submit Core_RNAseq-nf.sub dir=191211_AHMMC5DMXX netid=mzamanian script=Core_RNAseq-nf.sh
    
    # request Zamanian Lab server
    Accounting_Group = PathobiologicalSciences_Zamanian
    
    # load docker image; request execute server with large data staging
    universe = docker
    docker_image = zamanianlab/chtc-rnaseq:v1
    Requirements = (Target.HasCHTCStaging == true)
    
    # executable (/home/{net-id}/) and arguments
    executable = $(script)
    arguments = $(dir) $(netid)
    
    # log, error, and output files
    log = $(dir)_$(Cluster)_$(Process).log
    error = $(dir)_$(Cluster)_$(Process).err
    output = $(dir)_$(Cluster)_$(Process).out
    
    # transfer files in-out of /home/{net-id}/
    transfer_input_files =
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    
    # memory, disk and CPU requests
    request_cpus = 80
    request_memory = 500GB
    request_disk = 1500GB
    
    # submit 1 job
    queue 1
    ### END
    

    The submit script runs the annotated bash script below on the execute server. This pipeline creates input, work, and output dirs in the loaded Docker environment. It transfers the input data from staging into input, clones a GitHub repo (Nextflow pipeline), and runs a Nextflow command. Nextflow uses work for intermediate processing and writes any files we have marked for retention into output, which gets transferred back to staging. input and work are deleted before job completion.

    Core_RNAseq-nf.sh:
    #!/bin/bash
    
    # set home (to $PWD) and make dirs
    export HOME=$PWD
    mkdir input work output
    
    # echo core, thread, and memory
    echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
    grep 'cpu cores' /proc/cpuinfo | uniq
    echo $(free -g)
    
    # transfer input data from staging ($1 is ${dir} and $2 is ${netid} from args)
    cp -r /staging/$2/input/$1 input
    
    # clone nextflow git repo
    git clone https://github.com/zamanianlab/Core_RNAseq-nf.git
    
    # run nextflow
    export NXF_OPTS='-Xms1g -Xmx8g'
    nextflow run Core_RNAseq-nf/WB-pe.nf -w work -c Core_RNAseq-nf/chtc.config \
      --dir $1 --star --release "WBPS14" --species "brugia_malayi" --prjn "PRJNA10729" --rlen "150"
    
    # rm files you don't want transferred back to /home/{net-id}
    rm -r work
    rm -r input
    
    # remove staging output folder if there from previous run
    rm -rf /staging/$2/output/$1
    
    # mv large output files to staging output folder; avoid their transfer back to /home/{net-id}
    mv output/$1/ /staging/$2/output/
    

  3. Submitting and managing jobs

    Submit the job from the submit node using condor_submit:

    condor_submit Core_RNAseq-nf.sub dir=191211_AHMMC5DMXX netid=mzamanian script=Core_RNAseq-nf.sh

    Other useful commands for monitoring and managing jobs:
    # check on job status
      condor_q
    
    # remove a specific job
      condor_rm [job id]
    
    # remove all jobs for user
      condor_rm $USER
    
    # interactive shell into a running job on the remote machine
      condor_ssh_to_job [job id]
      exit
    

  4. Transferring output data

    To transfer your job output folder from the CHTC staging output directory to your local computer:

    scp -r {net-id}@transfer.chtc.wisc.edu:/staging/{net-id}/output/[dir] .

    To transfer your job output directly from the CHTC staging output directory to ResearchDrive:

    CHTC -> ResearchDrive transfer:
    # log into CHTC staging server and navigate to output folder
    ssh {net-id}@transfer.chtc.wisc.edu
    cd /staging/{net-id}/output/
    
    # connect to lab ResearchDrive
    smbclient -k //research.drive.wisc.edu/mzamanian
    
    # turn off prompting and turn on recursive
    smb: \> prompt
    smb: \> recurse
    
    # navigate to ResearchDrive dir for processed data (example)
    smb: \> cd /ImageXpress/proc/
    
    # transfer output data folder (example)
    smb: \> mput 20200922-p01-NJW_114
    

C. Docker

We will use Docker to establish consistent environments (containers) for our established pipelines. We will maintain Docker images on Docker Hub under the organization name 'zamanianlab'. These images can be loaded directly from Docker Hub in our CHTC submit scripts. The Dockerfiles used to create these images should be maintained in our GitHub Docker repo. Install Docker Desktop for Mac and create a Docker Hub account to be associated with our organization (zamanianlab).
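
After installing Docker Desktop and logging in with an account that belongs to the zamanianlab organization, you can confirm access by pulling one of the existing lab images (the image name below is the one used in the submit script above):

    # log in with your Docker Hub credentials
    docker login

    # pull an existing lab image and confirm it is available locally
    docker pull zamanianlab/chtc-rnaseq:v1
    docker image ls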

Building Docker Images

  1. Create a lab Docker Hub repo (e.g., zamanianlab/chtc-rnaseq)

  2. Create the Dockerfile and auxiliary (e.g., yaml) files in a folder with the repo name in the Docker GitHub repo.

    The Dockerfile provides instructions to build a Docker image. In this case, we are starting with the official miniconda Docker image and then installing necessary conda packages into this image. You can search for existing Docker images on Docker Hub to build on, instead of starting from scratch.

    Dockerfile:
    FROM continuumio/miniconda3
    MAINTAINER mzamanian@wisc.edu
    
    # install (nf tracing)
    RUN apt-get update && apt-get install -y procps
    
    # install conda packages
    COPY conda_env.yml .
    RUN \
       conda env update -n root -f conda_env.yml \
    && conda clean -a
    

    The yml file lists the conda packages to be installed. You can search for packages on Anaconda Cloud.

    conda_env.yml:
      name: rnaseq-nf
    
      channels:
        - bioconda
        - conda-forge
        - defaults
    
      dependencies:
        - python=3.8.5
        - nextflow=20.07.1
        - bwa=0.7.17
        - hisat2=2.2.1
        - stringtie=2.1.2
        - fastqc=0.11.9
        - multiqc=1.9
        - fastp=0.20.1
        - bedtools=2.29.2
        - bedops=2.4.39
        - sambamba=0.7.0
        - samtools=1.9
        - picard=2.20.6
        - bcftools=1.9
        - snpeff=4.3.1t
        - mrbayes=3.2.7
        - trimal=1.4.1
        - mafft=7.471
        - muscle=3.8.1551
        - seqtk=1.3
        - raxml=8.2.12
        - htseq=0.12.4
        - mirdeep2=2.0.1.2
    

  3. Build Docker image

    cd [/path/to/Dockerfile]
    docker build -t zamanianlab/chtc-rnaseq .
    
  4. Test Docker image interactively

    docker run -it --rm=TRUE zamanianlab/chtc-rnaseq /bin/bash
    # ctrl+D to exit
    
  5. Push Docker image to Docker Hub

    docker push zamanianlab/chtc-rnaseq
    

    Some useful Docker commands:
    # list docker images
      docker image ls    # same as: docker images
    
    # remove images
      docker rmi [image]
    
    ## remove all docker containers
    # run first because images are attached to containers
      docker rm -f $(docker ps -a -q)
    # remove every Docker image
      docker rmi -f $(docker images -q)
    

Testing Docker Pipelines

Before deploying a new pipeline on large datasets, test the pipeline using subsampled data. Test locally with subsampled data, then on the CHTC server with subsampled data, and finally run the pipeline on the CHTC server with your full dataset. An example is provided below, using RNAseq data.

  1. First, subsample your data:

    ...
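
    The subsampling step above is intentionally left open; one possible sketch (not a prescribed lab method) is to subsample paired-end FASTQ files with seqtk, which is already included in the conda environment above. File names and the read count here are hypothetical:

    # keep mates paired by using the same seed (-s) for both files (hypothetical file names)
    seqtk sample -s100 sample_R1.fastq.gz 100000 | gzip > sub_R1.fastq.gz
    seqtk sample -s100 sample_R2.fastq.gz 100000 | gzip > sub_R2.fastq.gz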
    
  2. Run Docker container locally

    docker run -it --rm=TRUE zamanianlab/chtc-rnaseq /bin/bash
    
  3. Simulate the steps in your submit scripts

    Running commands in the local Docker container:
    # set home to working directory
    export HOME=$PWD
    
    # make input, work, and output directories for nextflow
    mkdir input work output
    
    # clone GitHub repo that contains pipeline in development
    git clone https://github.com/zamanianlab/Core_RNAseq-nf.git
    
    # transfer sub-sampled files from CHTC staging into your input folder
    scp -r mzamanian@transfer.chtc.wisc.edu:/staging/mzamanian/input/191211_AHMMC5DMXX/ input
    
    # run nextflow command using chtc-local.config matched to your hardware specs
    nextflow run Core_RNAseq-nf/WB-pe.nf -w work -c Core_RNAseq-nf/chtc-local.config --dir "191211_AHMMC5DMXX" --release "WBPS14" --species "brugia_malayi" --prjn "PRJNA10729" --rlen "150"
    

  4. Make changes to your pipeline, push them to GitHub, pull them into your local container, and re-run the Nextflow command until the pipeline behaves as expected. A sketch of this loop is shown below.
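
    A sketch of that edit-test loop (assuming you edit the repo clone on your own machine and the container from step 2 is still running):

    # on your machine: commit and push the pipeline edits
    cd Core_RNAseq-nf
    git add .
    git commit -m "describe the change"
    git push

    # inside the local container: pull the edits and re-run the same Nextflow command
    cd $HOME/Core_RNAseq-nf && git pull && cd $HOME
    nextflow run Core_RNAseq-nf/WB-pe.nf -w work -c Core_RNAseq-nf/chtc-local.config --dir "191211_AHMMC5DMXX" --release "WBPS14" --species "brugia_malayi" --prjn "PRJNA10729" --rlen "150"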