## Copyright Broad Institute, 2017
##
## This WDL pipeline implements data pre-processing and initial variant calling (GVCF
## generation) according to the GATK Best Practices (June 2016) for germline SNP and
## Indel discovery in human whole-genome sequencing (WGS) data.
##
## Requirements/expectations :
## - Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
## - One or more read groups, one per uBAM file, all belonging to a single sample (SM)
## - Input uBAM files must additionally comply with the following requirements:
## - - filenames all have the same suffix (we use ".unmapped.bam")
## - - files must pass validation by ValidateSamFile
## - - reads are provided in query-sorted order
## - - all reads must have an RG tag
## - GVCF output names must end in ".g.vcf.gz"
## - Reference genome must be Hg38 with ALT contigs
##
## Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
## For program versions, see docker containers.
##
## LICENSING :
## This script is released under the WDL source code license (BSD-3) (see LICENSE in
## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
## be subject to different licenses. Users are responsible for checking that they are
## authorized to run all programs before running this script. Please see the docker
## page at https://hub.docker.com/r/broadinstitute/genomes-in-the-cloud/ for detailed
## licensing information pertaining to the included programs.

# WORKFLOW DEFINITION
workflow PairedEndSingleSampleWorkflow {

  File ref_dict
  Int preemptible_tries

  # Create list of sequences for scatter-gather parallelization
  call CreateSequenceGroupingTSV {
    input:
      ref_dict = ref_dict,
      preemptible_tries = preemptible_tries
  }

  # We need disk to localize the sharded input and output due to the scatter for BQSR.
  # If we take the number we are scattering by and reduce by 3 we will have enough disk space
  # to account for the fact that the data is not split evenly.
  Int num_of_bqsr_scatters = length(CreateSequenceGroupingTSV.sequence_grouping)

  # Outputs that will be retained when execution is complete
  output {
  }
}

# TASK DEFINITIONS

# Generate sets of intervals for scatter-gathering over chromosomes
task CreateSequenceGroupingTSV {
  File ref_dict
  Int preemptible_tries

  # Use python to create the Sequencing Groupings used for BQSR and PrintReads Scatter.
  # It outputs to stdout where it is parsed into a wdl Array[Array[String]]
  # e.g. [["1"], ["2"], ["3", "4"], ["5"], ["6", "7", "8"]]
  command <<< python <>>
  runtime {
    preemptible: preemptible_tries
    docker: "python:2.7"
    memory: "2 GB"
  }
  output {
    Array[Array[String]] sequence_grouping = read_tsv("sequence_grouping.txt")
    Array[Array[String]] sequence_grouping_with_unmapped = read_tsv("sequence_grouping_with_unmapped.txt")
  }
}
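
# ILLUSTRATIVE SKETCH (not the script embedded in the task above, which is not shown here).
# The CreateSequenceGroupingTSV comments describe turning a sequence dictionary into
# groups such as [["1"], ["2"], ["3", "4"], ...]. One plausible way to do that is to pack
# consecutive contigs greedily so that no group is longer than the longest single contig.
# That packing rule, and the helper names parse_sequence_dict, group_sequences and to_tsv,
# are assumptions made for this example; the real script may differ.

```python
def parse_sequence_dict(lines):
    """Return (name, length) pairs from the @SQ records of a SAM sequence dictionary."""
    sequences = []
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        # Each field looks like TAG:VALUE; split on the first colon only, because
        # some contig names themselves contain colons.
        fields = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        sequences.append((fields["SN"], int(fields["LN"])))
    return sequences


def group_sequences(sequences):
    """Greedily pack consecutive sequences (assumed rule: cap = longest sequence)."""
    if not sequences:
        return []
    cap = max(length for _, length in sequences)
    groups = [[sequences[0][0]]]
    running = sequences[0][1]
    for name, length in sequences[1:]:
        if running + length <= cap:
            # Still fits under the cap: extend the current group.
            groups[-1].append(name)
            running += length
        else:
            # Would overflow: start a new group with this sequence.
            groups.append([name])
            running = length
    return groups


def to_tsv(groups):
    """Serialize groups one per line, tab-separated, the layout read_tsv expects."""
    return "\n".join("\t".join(group) for group in groups)


if __name__ == "__main__":
    demo = [
        "@HD\tVN:1.5\n",
        "@SQ\tSN:chr1\tLN:100\n",
        "@SQ\tSN:chr2\tLN:90\n",
        "@SQ\tSN:chr3\tLN:40\n",
        "@SQ\tSN:chr4\tLN:50\n",
    ]
    print(to_tsv(group_sequences(parse_sequence_dict(demo))))
```

# Writing one group per line with tab-separated names matches what read_tsv parses into
# an Array[Array[String]]; a variant with an extra "unmapped" line could be produced by
# appending ["unmapped"] to the groups before serializing.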