Here we run an analysis pipeline in R for the identification of circular RNAs (circRNAs). The pipeline recreates parts of the KNIFE pipeline described by Szabo et al. (2015). The code provided here uses the systemPipeR package with functions designed to run the pipeline on a local desktop. Consult the systemPipeR documentation to modify the code so it can run on a cluster.

Preparatory steps

To install bioinformatics software and resolve dependencies, you can use the Bioconda package manager. First download Miniconda, then create an environment, activate it, add the Bioconda channels, and install the required software. For example, here we create a local environment named rnaseq and install bowtie2 and trim-galore.

conda create -n rnaseq python=2 anaconda
source activate rnaseq
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda install bowtie2
conda install trim-galore
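
The later preprocessing steps also call fastq-dump from the SRA Toolkit; if it is not already available on your system, it can be installed into the same environment (the Bioconda package is named sra-tools):

conda install sra-tools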

1. Create project directories

First start a project and create subdirectories under the project’s working directory

folders <- c(
  "parameters", "src",
  "data/genome_data/hg19/fasta",
  "data/genome_data/hg19/bowtie2_index",
  "data/biosamples",
  "results/trimmed",
  "results/aligned/genome",
  "results/aligned/ribosomal",
  "results/aligned/junctions_reg",
  "results/aligned/junctions_scrambled",
  "results/analysis/reports"
)

for (folder in folders) {
  dir.create(file.path(getwd(), folder), recursive = TRUE)
}

2. Download genome data

To get started, we need to download the hg19_fastas.tar.gz archive, which contains FASTA files for the ribosomal RNA, the linear junctions, the scrambled junctions, and the reference genome. The compressed tar file is available at this link.
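
A minimal sketch of the download and extraction step, assuming the archive is placed under data/genome_data/hg19 (the URL below is only a placeholder for the actual link):

# placeholder for the actual download link to hg19_fastas.tar.gz
hg19_archive_url <- "<link to hg19_fastas.tar.gz>"
hg19_archive <- "./data/genome_data/hg19/hg19_fastas.tar.gz"
download.file(hg19_archive_url, hg19_archive)
# extract the fasta files into the fasta subdirectory created above
untar(hg19_archive, exdir = "./data/genome_data/hg19/fasta")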

3. Build index files

We first load the systemPipeR package and other required packages. We also create the external files needed by the command-line software: the param files, which specify the command-line parameters, and the targets files, which contain the relative paths of the required data files.

#source("http://bioconductor.org/biocLite.R")
#biocLite("tgirke/systemPipeR", dependencies = TRUE)
library(systemPipeR)
library(dplyr)
library(pander)

The prebuilt index files are large and might take a long time to download on a regular connection, so it can be faster to simply run the Bowtie2 index builder to create them. The following two lines build the indices. The param and targets files are available here.

args_bowtie2_index <- systemArgs(sysma = "./parameters/bowtie2_index.param",
                                 mytargets = "./parameters/hg19_fasta_filepaths.txt")
runCommandline(args = args_bowtie2_index)

4. Download sample data

We use the same HER2 Positive Breast Tumor dataset used in the KNIFE paper as sample data. The data consist of paired-end reads from an rRNA-depleted total RNA-seq library. More details about the experiment can be found here. To ensure reproducibility and data provenance, we download the SRA run file directly from NCBI's Sequence Read Archive.

download.file("ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX374/SRX374866/SRR1027187/SRR1027187.sra", "./data/biosamples/SRR1027187.sra")

Next we convert the SRA file to fastq using fastq-dump. The options provided specify the following: split the paired reads into two files, dump the biological reads only, and compress the output using gzip. We also keep the defaults by not adding read ids and leaving the quality encoding as ASCII+33.

system("fastq-dump --split-files --skip-technical --gzip ./data/biosamples/SRR1027187.sra --outdir ./data/biosamples/")

5. Trim data

Reads in the raw data can contain low-quality bases and partial adapter sequences that need to be removed. Trimmed reads that are too short should also be filtered out in order to obtain good alignments.

args_trim <- systemArgs(sysma = "./parameters/trim_galore.param",
                           mytargets = "./parameters/data_filepaths.txt")
trim_command <- paste(sysargs(args_trim)[1],
                      unlist(strsplit(sysargs(args_trim)[2], " "))[8],
                      sep = " ")
(trim_command)
[1] "trim_galore --length 50 --paired --output_dir /hd2/KNIFE_project_01/results/trimmed/  /hd2/KNIFE_project_01/data/biosamples/SRR1027187_1.fastq.gz  /hd2/KNIFE_project_01/data/biosamples/SRR1027187_2.fastq.gz"

Although we are treating each paired-end read independently, we add the --paired option to the trim_galore command to perform a validation test that ensures both reads of a pair meet the minimum length specified by the --length option. To run trim-galore we use the system function

system(trim_command)

6. Run data quality control checks

Lastly, we generate and plot FASTQ quality summary statistics. The FastQC manual provides a good explanation of the quality-control plots. First we define a helper function

runFastqPlots <- function(file_list, report_name) {
  fqlist <- bplapply(seq_along(file_list),
                     function(x)
                       seeFastq(
                         fastq = file_list[x],
                         batchsize = 100000,
                         klength = 8
                       ),
                     BPPARAM = MulticoreParam(workers = 4))
  png(
    sprintf(
      "./results/analysis/reports/fastq_report_%s.png",
      report_name
    ),
    height = 800,
    width = 200 * length(fqlist)
  )
  seeFastqPlot(unlist(fqlist, recursive = FALSE))
  dev.off()
}

then we run it on the raw data

runFastqPlots(file_list = infile1(args_trim),
              report_name = "SRR1027187")

and on the trimmed data

data_trimmed_filepaths <- paste0(outfile1(args_trim),
                                 SampleName(args_trim),
                                 c("_val_1.fq.gz", "_val_2.fq.gz"))
names(data_trimmed_filepaths) <- paste0(SampleName(args_trim),"_trimmed")

runFastqPlots(file_list = data_trimmed_filepaths,
              report_name = "SRR1027187_trimmed")

Next we plot the reports side by side

[Figure: FASTQ quality report plots for the raw data (fastqcplots) and the trimmed data (fastqc_trimmed_plots)]
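
For example, in an R Markdown report the two PNG files written by runFastqPlots could be embedded next to each other (the file names follow the sprintf pattern used in the helper function above):

knitr::include_graphics(c(
  "./results/analysis/reports/fastq_report_SRR1027187.png",
  "./results/analysis/reports/fastq_report_SRR1027187_trimmed.png"
))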

The plots for the trimmed data look much better. Although the mean quality distribution for the reads is bimodal, there are no low quality peaks.

Alignment Steps

Now we independently align each paired-end read dataset to four separate Bowtie2 indices: for the genome, ribosomal RNA, linear exon–exon junctions, and scrambled exon–exon junctions. First we load Bowtie2 parameter files and filepaths

args_align_genome <- systemArgs(sysma = "./parameters/bowtie2_align_genome.param",
                                mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_ribo <- systemArgs(sysma = "./parameters/bowtie2_align_ribosomal.param",
                              mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_juncts_reg <- systemArgs(sysma = "./parameters/bowtie2_align_junctions_reg.param",
                                    mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_juncts_scram <- systemArgs(sysma = "./parameters/bowtie2_align_junctions_scrambled.param",
                                      mytargets = "./parameters/data_trimmed_filepaths.txt")

then we run the Bowtie2 alignment commands through systemPipeR

# align to genome
genome_bam_files <- runCommandline(args = args_align_genome)
# align to ribosome
ribo_bam_files <- runCommandline(args = args_align_ribo)
# align to linear junctions
juncts_reg_bam_files <- runCommandline(args = args_align_juncts_reg)
# align to scrambled junctions
juncts_scram_files <- runCommandline(args = args_align_juncts_scram)

We can open “submitargs01_log” for each alignment to display the alignment summary and the system command used. For the genome alignment, we can open the file inside R like this

readLines(file.path("./results/aligned/genome","submitargs01_log"))
 [1] "bowtie2 -p 8 --no-unal --score-min L,0,-0.24 --rdg 50,50 --rfg 50,50 -x /hd2/KNIFE_project_01/data/genome_data/hg19/bowtie2_index/hg19_genome -U /hd2/KNIFE_project_01/results/trimmed/SRR1027187_1_val_1.fq.gz  -S /hd2/KNIFE_project_01/results/aligned/genome/SRR1027187_1_val_1.sam"
 [2] "19308315 reads; of these:"
 [3] "  19308315 (100.00%) were unpaired; of these:"
 [4] "    6692805 (34.66%) aligned 0 times"
 [5] "    8628824 (44.69%) aligned exactly 1 time"
 [6] "    3986686 (20.65%) aligned >1 times"
 [7] "65.34% overall alignment rate"
 [8] "bowtie2 -p 8 --no-unal --score-min L,0,-0.24 --rdg 50,50 --rfg 50,50 -x /hd2/KNIFE_project_01/data/genome_data/hg19/bowtie2_index/hg19_genome -U /hd2/KNIFE_project_01/results/trimmed/SRR1027187_2_val_2.fq.gz  -S /hd2/KNIFE_project_01/results/aligned/genome/SRR1027187_2_val_2.sam"
 [9] "19308315 reads; of these:"
[10] "  19308315 (100.00%) were unpaired; of these:"
[11] "    6753734 (34.98%) aligned 0 times"
[12] "    8563704 (44.35%) aligned exactly 1 time"
[13] "    3990877 (20.67%) aligned >1 times"
[14] "65.02% overall alignment rate"                                                                                                                                                                                                                                                          

Preprocessing steps

The previous alignment step produced 8 BAM files: two BAM files (one per read mate) for each of the genome, ribosomal, linear junction, and scrambled junction alignments. We are interested in junctional reads, i.e. the reads among the 8 files that align to a scrambled or linear junction. Since the majority of the reads in the genome and ribosomal BAM files are not junctional, we need to filter them out.

1. Get read IDs for genomic mate pairs

Since the next preprocessing steps are memory intensive, we initially filter out reads in which both mates align to the genome or the ribosomal index. We also filter out unaligned reads and return lists of read IDs for which both mates align to the indices, or for which one mate aligns but the other does not. To proceed, we need to get the names of the reads that are found in both the R1 and R2 genomic alignment files.

getReadIds <- function(bamfile) {
  scanBam(bamfile,
          param = ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE),
                               what = "qname"))[[1]][[1]]
  }

getCommonGenomicPairedReads <- function(genome_bamfiles, ribo_bamfiles) {

  genome_reads <- bplapply(genome_bamfiles,
                           getReadIds,
                           BPPARAM = MulticoreParam(workers = 2))
  ribosomal_reads <- bplapply(ribo_bamfiles,
                              getReadIds,
                              BPPARAM = MulticoreParam(workers = 2))

  genomic_reads_1 <- union(genome_reads[[1]], ribosomal_reads[[1]])
  genomic_reads_2 <- union(genome_reads[[2]], ribosomal_reads[[2]])
  common_reads <- intersect(genomic_reads_1, genomic_reads_2)
  r1_not_r2 <- setdiff(genomic_reads_1, genomic_reads_2)
  r2_not_r1 <- setdiff(genomic_reads_2, genomic_reads_1)

  return(
    list(
      common_reads = common_reads,
      r1_not_r2 = r1_not_r2,
      r2_not_r1 = r2_not_r1
    )
  )
}

Note that the code above uses the parallel lapply function bplapply. If your system has less than 32 GB of RAM, you can modify the code to run with a single worker or use lapply instead (see the sketch after the next code block). Here we retrieve the locations of the aligned files and feed them into the function defined above.

genome_aligned <- outpaths(args_align_genome)
ribo_aligned <- outpaths(args_align_ribo)

reads_to_ignore <- getCommonGenomicPairedReads(genome_aligned, ribo_aligned)
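
A minimal sketch of the lower-memory alternative, keeping the same logic but running it serially:

# inside getCommonGenomicPairedReads(), replace the bplapply calls with base
# lapply (or pass BPPARAM = SerialParam()) to use a single worker
genome_reads    <- lapply(genome_bamfiles, getReadIds)
ribosomal_reads <- lapply(ribo_bamfiles, getReadIds)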

2. Read alignment data and run initial filtering

Now we use these read IDs to create two filters. The first filters out genomic reads in which both mates (R1 and R2) align. The second is a junctional filter that keeps only junctional reads that overlap the junction, located at position MIDPOINT = 150 of each junction reference sequence, by at least JUNC_OVERLAP = 8 nucleotides, and that also satisfy the first filter.

genomic_reads_filter <-
  FilterRules(list(
    include = function(x) !(mcols(x)$qname %in% reads_to_ignore$common_reads)
  ))

MIDPOINT <- 150
JUNC_OVERLAP <- 8

junction_reads_filter <-
  FilterRules(list(
    include = function(x)
      (
        (MIDPOINT + JUNC_OVERLAP + 1 - width(x) <= start(x)) &
        (start(x) <= MIDPOINT - JUNC_OVERLAP + 1) &
        !(mcols(x)$qname %in% reads_to_ignore$common_reads)
      )
  ))
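
As a quick check of the arithmetic, assume a 60 nt read: the filter keeps alignments whose start position lies between 99 and 143, so the read extends at least 8 nt on each side of the junction in the junction reference sequence:

# worked example of the junction-overlap window for a 60 nt read
read_width <- 60
c(earliest_start = MIDPOINT + JUNC_OVERLAP + 1 - read_width,  # 150 + 8 + 1 - 60 = 99
  latest_start   = MIDPOINT - JUNC_OVERLAP + 1)               # 150 - 8 + 1 = 143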

As part of the filtering step, we also update the alignment score for the junction reads such that the N-penalty is corrected.

countAmbiguousBases <- function(align_obj) {
  # count N bases recorded in the MD tag (used to correct the N-penalty)
  sapply(gregexpr("N0|0N", mcols(align_obj)$MD),
         function(x) {
           if (attributes(x)$match.length[1] == -1) {
             return(0)
           } else {
             return(length(x))
           }
         })
}

#MD and XN do agree
filterAlignments <-
  function(bamfile, param, filter, updateScore = TRUE) {
    x <- readGAlignments(file = bamfile, param = param)
    x <- subsetByFilter(x, filter)
    seqlevels(x) <- seqlevelsInUse(x)
    # correct for N-penalty in junction alignments and update alignment score
    if (updateScore) {
      mcols(x)$AS <- mcols(x)$AS + countAmbiguousBases(x)
    }
    return(x)
  }

# get file paths
linear_junctions_aligned <- outpaths(args_align_juncts_reg)
scrambled_junctions_aligned <- outpaths(args_align_juncts_scram)

# run filters

param <-
  ScanBamParam(
    what = c("qname","mapq"), # get the read name and the mapping quality
    flag = scanBamFlag(isUnmappedQuery = FALSE),
    tag = c("AS", "XS", "XN", "MD") # get tags
  )

genomic_filtered <- bplapply(
  genome_aligned,
  filterAlignments,
  param = param,
  filter = genomic_reads_filter,
  updateScore = FALSE,
  BPPARAM = MulticoreParam(workers = 2)
)

linear_junctions_filtered <- bplapply(
  linear_junctions_aligned,
  filterAlignments,
  param = param,
  filter = junction_reads_filter,
  BPPARAM = MulticoreParam(workers = 2)
)

scrambled_junctions_filtered <- bplapply(
  scrambled_junctions_aligned,
  filterAlignments,
  param = param,
  filter = junction_reads_filter,
  BPPARAM = MulticoreParam(workers = 2)
)

Let’s take a look at the first few rows of the GAlignments object for the linear junction R1 aligned reads

linear_junctions_filtered[[1]][1:3,] %>% as.data.frame() %>% pander()
Table continues below
seqnames strand cigar qwidth start
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 59M 59 106
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 59M 59 109
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ + 60M 60 122
end width njunc qname mapq AS XS XN MD
164 59 0 SRR1027187.35004121 39 0 -14 0 59
167 59 0 SRR1027187.35267243 24 -5 NA 0 56T2
181 60 0 SRR1027187.3091383 42 0 NA 0 60

We can also examine the frequency of the CIGAR strings

linear_junctions_filtered[[1]] %>%
  cigar() %>%
  table %>%
  sort(decreasing = TRUE) %>%
  head() 
.
   60M    59M    58M    57M    56M    55M
397762 138477  45942  13278   3761   3041 

3. Convert GAlignment objects into dataframes

toDataFrame <- function(x) {
  x <- as.data.frame(x)
  x$seqnames <- as.character(x$seqnames)
  x$strand <- as.character(x$strand)
  x <- subset(x, select = -c(cigar, qwidth, njunc, XS, XN, MD))
  return(x)
}

appendJunctionTags <- function(x) {
  df <- data.frame(do.call(rbind,
                           lapply(as.character(seqnames(x)),
                                  function(x) unlist(strsplit(x, "[:]|[|]")))),
                   stringsAsFactors = FALSE)
  colnames(df) <- c("g_chrom", "g_gene1", "g_pos1", "g_gene2",
                    "g_pos2", "junc_type", "junc_strand")
  df$g_pos1 <- as.integer(df$g_pos1)
  df$g_pos2 <- as.integer(df$g_pos2)
  mcols(x) <- cbind(mcols(x), df)
  return(toDataFrame(x))
}

genomic_filtered_dt <- bplapply(
  genomic_filtered,
  toDataFrame,
  BPPARAM = MulticoreParam(workers = 2)
)

linear_junctions_filtered_dt <- bplapply(
  linear_junctions_filtered,
  appendJunctionTags,
  BPPARAM = MulticoreParam(workers = 2)
)

scrambled_junctions_filtered_dt <- bplapply(
  scrambled_junctions_filtered,
  appendJunctionTags,
  BPPARAM = MulticoreParam(workers = 2)
)

Let’s take a look at the dataframe for the linear junctions R1 aligned reads

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
linear_junctions_filtered_dt[[1]][1:3,] %>% pander()
Table continues below
seqnames strand start end width qname mapq AS g_chrom g_gene1 g_pos1
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 106 164 59 SRR1027187.35004121 39 0 chr10 ZMYND11 226068
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 109 167 59 SRR1027187.35267243 24 -5 chr10 ZMYND11 226068
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ + 122 181 60 SRR1027187.3091383 42 0 chr10 ZMYND11 226068
g_gene2 g_pos2 junc_type junc_strand
ZMYND11 255828 reg +
ZMYND11 255828 reg +
ZMYND11 255828 reg +

4. Filter R1 junctional reads and pair them with R2 mates

A read (R1) is a linear junction read if it aligns to a linear junction and does not align to the genome or the ribosomal index. In the code, these reads are stored in linear_1. A read (R1) is a scrambled junction read if it aligns to a scrambled junction and does not align to the genome, the ribosomal index, or a linear junction. Scrambled junction reads are stored in scrambled_1. The mate read (R2) can be a genomic, linear junction, or scrambled junction read, as determined by whether the read name is found in the corresponding R2 alignments. An R1 read can have a single R2 mate or multiple R2 mates, and some have none at all; these are the unmapped reads.

createJunctionsReadsDF <- function(genomic,
                                   linear,
                                   scrambled,
                                   r1_not_r2,
                                   suffixes = c(".r1", ".r2g", ".r2lin", ".r2scram")) {

  linear_1 <- linear[[1]][!(linear[[1]]$qname %in% r1_not_r2), ]
  scrambled_1 <- scrambled[[1]][!(scrambled[[1]]$qname %in% c(r1_not_r2, linear_1$qname)), ]

  genomic_mate <- genomic[[2]][(genomic[[2]]$qname %in% c(linear_1$qname,scrambled_1$qname)), ]
  linear_mate <- linear[[2]][(linear[[2]]$qname %in% c(linear_1$qname,scrambled_1$qname)), ]
  scrambled_mate <- scrambled[[2]][(scrambled[[2]]$qname %in% c(linear_1$qname, scrambled_1$qname)), ]

  linear_genomic <- left_join(linear_1,
                              genomic_mate,
                              by = "qname",
                              suffix = suffixes[c(1, 2)])

  linear_linear <- left_join(linear_1,
                             linear_mate ,
                             by = "qname",
                             suffix = suffixes[c(1, 3)])

  linear_scrambled <- left_join(linear_1,
                                scrambled_mate,
                                by = "qname",
                                suffix = suffixes[c(1, 4)])

  scrambled_genomic <- left_join(scrambled_1,
                                 genomic_mate,
                                 by = "qname",
                                 suffix = suffixes[c(1, 2)])

  scrambled_linear <- left_join(scrambled_1,
                                linear_mate ,
                                by = "qname",
                                suffix = suffixes[c(1, 3)])

  scrambled_scrambled <- left_join(scrambled_1,
                                   scrambled_mate,
                                   by = "qname",
                                   suffix = suffixes[c(1, 4)])

  linear_df <- cbind(linear_linear[, 1:15],
                     linear_genomic[, 16:22],
                     linear_linear[, 16:29],
                     linear_scrambled[, 16:29])

  scrambled_df <- cbind(scrambled_linear[, 1:15],
                        scrambled_genomic[, 16:22],
                        scrambled_linear[, 16:29],
                        scrambled_scrambled[, 16:29])

  return(list(linear_df = linear_df, scrambled_df = scrambled_df))
}

R1R2MatesJunctionReads <- createJunctionsReadsDF(genomic_filtered_dt,
                                                 linear_junctions_filtered_dt,
                                                 scrambled_junctions_filtered_dt,
                                                 reads_to_ignore$r1_not_r2)

5. Classify Mate Reads

Here we classify each mate read (R2) according to which of its alignments (genomic, linear junction, or scrambled junction) has the highest alignment score. We also create a new variable, ascore, that averages the alignment score of R1 with that of its best-scoring mate.

classifyMateReads <- function(df, suffixes = c(".r1", ".r2g", ".r2lin", ".r2scram")) {

    scores_df <- df[, paste0("AS", suffixes)]
    map_categ <- apply(scores_df[, 2:4] , 1, function(x) sum(!is.na(x)))
    max_score <- apply(scores_df[, 2:4] , 1, which.max)

    mate_categ <- sapply(max_score, function(x)
      if (is.null(attributes(x)))
        {return(0)}
      else
        {return(as.integer(x))}
      )

    matescore <- scores_df[cbind(1:nrow(scores_df), (mate_categ + 1))]
    ascore <- 0.5 * (scores_df$AS.r1 + matescore)
    df <- cbind(df,
                mate_categ = mate_categ,
                ascore = ascore,
                map_categ = map_categ)
    return(df)
  }

linear_df <- R1R2MatesJunctionReads$linear_df %>% classifyMateReads()
scrambled_df <- R1R2MatesJunctionReads$scrambled_df %>% classifyMateReads()

cat( " number of linear junction reads",
     NROW(linear_df), "\n",
     "number of scrambled junction reads",
     NROW(scrambled_df), "\n",
  "total number of junctional reads:",
    NROW(linear_df) + NROW(scrambled_df), "\n",
    "number of junctions with aligned reads:",
    length(unique(linear_df$seqnames.r1)) +
      length(unique(scrambled_df$seqnames.r1)))
 number of linear junction reads 600658
 number of scrambled junction reads 1599
 total number of junctional reads: 602257
 number of junctions with aligned reads: 108949

Let’s take a look at the dataframe for the scrambled junction R1 aligned reads and their paired mates

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
scrambled_df[1:3,]  %>% pander()
Table continues below
seqnames.r1 strand.r1 start.r1 end.r1 width.r1 qname mapq.r1 AS.r1 g_chrom.r1
chr10|RBM17:6139151|FAM208B:5762506|rev|+ - 143 202 60 SRR1027187.5794479 0 -5 chr10
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ - 126 185 60 SRR1027187.4048280 42 -1 chr10
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ - 130 185 56 SRR1027187.29265526 42 0 chr10
Table continues below
g_gene1.r1 g_pos1.r1 g_gene2.r1 g_pos2.r1 junc_type.r1 junc_strand.r1 seqnames.r2g strand.r2g start.r2g end.r2g
RBM17 6139151 FAM208B 5762506 rev + chr10 + 5754806 5754865
ATP5C1 7844817 ATP5C1 7839009 rev + NA NA NA NA
ATP5C1 7844817 ATP5C1 7839009 rev + NA NA NA NA
Table continues below
width.r2g mapq.r2g AS.r2g seqnames.r2lin strand.r2lin start.r2lin end.r2lin width.r2lin
60 42 0 NA NA NA NA NA
NA NA NA chr10|ATP5C1:7844388|ATP5C1:7844720|reg|+ + 101 160 60
NA NA NA NA NA NA NA NA
Table continues below
mapq.r2lin AS.r2lin g_chrom.r2lin g_gene1.r2lin g_pos1.r2lin g_gene2.r2lin g_pos2.r2lin junc_type.r2lin junc_strand.r2lin
NA NA NA NA NA NA NA NA NA
42 -1 chr10 ATP5C1 7844388 ATP5C1 7844720 reg +
NA NA NA NA NA NA NA NA NA
Table continues below
seqnames.r2scram strand.r2scram start.r2scram end.r2scram width.r2scram mapq.r2scram AS.r2scram
NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ + 130 185 56 42 0
Table continues below
g_chrom.r2scram g_gene1.r2scram g_pos1.r2scram g_gene2.r2scram g_pos2.r2scram junc_type.r2scram junc_strand.r2scram
NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
chr10 ATP5C1 7844817 ATP5C1 7839009 rev +
mate_categ ascore map_categ
1 -2.5 1
2 -1 1
3 0 1

Let’s also output the frequency of R1 reads that have 0, 1, 2, or 3 potential paired-end mate reads (R2)

linear_junction_mapcateg <- linear_df$map_categ %>% table() %>% as.data.frame()
rownames(linear_junction_mapcateg) <- c("Unmapped", "OneMateMatch", "TwoMateMatches", "ThreeMateMatches")
scrambled_junction_mapcateg <- scrambled_df$map_categ %>% table() %>% as.data.frame()
rownames(scrambled_junction_mapcateg) <- c("Unmapped", "OneMateMatch", "TwoMateMatches", "ThreeMateMatches")
linear_junction_mapcateg 
                 .   Freq
Unmapped         0  45424
OneMateMatch     1 541523
TwoMateMatches   2  13490
ThreeMateMatches 3    221
scrambled_junction_mapcateg 
                 . Freq
Unmapped         0  217
OneMateMatch     1 1346
TwoMateMatches   2   35
ThreeMateMatches 3    1

6. Classify Junction Reads

We split the reads into those that have a mapped mate (R2) and those that do not. We initially assign all reads as uncategorized (i.e. read_categ = 0).

linear_reads_split <- split(linear_df, linear_df$map_categ == 0)
scrambled_reads_split <- split(scrambled_df, scrambled_df$map_categ == 0)
unmapped_reads <- rbind(linear_reads_split[[2]], scrambled_reads_split[[2]])[,1:15]

linear_reads_mapped <- cbind(linear_reads_split[[1]] , read_categ = 0)
scrambled_reads_mapped <- cbind(scrambled_reads_split[[1]] , read_categ = 0)

Let’s take a look at the unmapped reads

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
unmapped_reads[1:3,] %>% pander()
Table continues below
  seqnames.r1 strand.r1 start.r1 end.r1 width.r1 qname mapq.r1 AS.r1
1 chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 106 164 59 SRR1027187.35004121 39 0
7 chr10|GTPBP4:1046808|GTPBP4:1046887|reg|+ - 119 171 53 SRR1027187.12128237 23 -5
19 chr10|GTPBP4:1056460|GTPBP4:1058404|reg|+ - 104 160 57 SRR1027187.25775441 37 0
  g_chrom.r1 g_gene1.r1 g_pos1.r1 g_gene2.r1 g_pos2.r1 junc_type.r1 junc_strand.r1
1 chr10 ZMYND11 226068 ZMYND11 255828 reg +
7 chr10 GTPBP4 1046808 GTPBP4 1046887 reg +
19 chr10 GTPBP4 1056460 GTPBP4 1058404 reg +

Now we proceed to classify scrambled junctional reads as circular or decoy and linear junctional reads as linear or anomalous.

classifyDecoyOrAnomalous <- function(x) {

  cond1 <- x$strand.r1 != x[, c("strand.r2g", "strand.r2lin", "strand.r2scram")][cbind(1:nrow(x), x$mate_categ)]
  cond2 <- x$g_chrom.r1 == x[, c("seqnames.r2g", "g_chrom.r2lin", "g_chrom.r2scram")][cbind(1:nrow(x), x$mate_categ)]
  x$read_categ[!(cond1 & cond2)] <- 1 # decoy (scrambled) or anomaly (linear)

  return(x)
}

classifyCircularPairedMates <- function(x) {

  cond1 <- x$mate_categ == 3
  cond2 <- x$seqnames.r1 == x$seqnames.r2scram
  cond3 <- !(x$read_categ == 1)
  x$read_categ[(cond1 & cond2 & cond3)] <- 2 #(circ)

  return(x)
}

classifyCircularLinearMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1)
  r1_end <- pmax(x$g_pos1.r1, x$g_pos2.r1)
  r2_start <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$start.r2lin
  r2_end <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$end.r2lin

  cond1 <- r2_start >=  r1_start - BUFFER
  cond2 <- r2_start <=  r1_end + BUFFER
  cond3 <- r2_end >= r1_start - BUFFER
  cond4 <- r2_end <= r1_end + BUFFER
  cond5 <- x$mate_categ == 2
  cond6 <- !(x$read_categ %in% c(1,2))

  x$read_categ[(cond1 & cond2 & cond3 & cond4 & cond5 & cond6 )] <- 3 #(circ_lin)
  return(x)
}

classifyCircularGenomicMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1)
  r1_end <- pmax(x$g_pos1.r1, x$g_pos2.r1)
  r2_start <- x$start.r2g
  r2_end <- x$end.r2g

  cond1 <- r2_start >=  r1_start - BUFFER
  cond2 <- r2_start <=  r1_end + BUFFER
  cond3 <- r2_end >= r1_start - BUFFER
  cond4 <- r2_end <= r1_end + BUFFER
  cond5 <- x$mate_categ == 1
  cond6 <- !(x$read_categ %in% c(1,2,3))

  x$read_categ[(cond1 & cond2 & cond3 & cond4 & cond5 & cond6 )] <- 4 #(circ_gen)
  return(x)
}

classifyLinearPairedMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1) - MIDPOINT + x$start.r1
  r2_start <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$start.r2lin

  cond1_p <- r2_start >= r1_start - BUFFER
  cond1_m <- r1_start >= r2_start - BUFFER
  cond1 <- ifelse(x$strand.r1 == "+", cond1_p, cond1_m)
  cond2 <- x$mate_categ == 2
  cond3 <- !(x$read_categ == 1)
  x$read_categ[(cond1 & cond2 & cond3)] <- 2 #(linear_linear)
  return(x)

}

classifyLinearGenomicMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1) - MIDPOINT + x$start.r1
  r2_start <- x$start.r2g

  cond1_p <- r2_start >= r1_start - BUFFER
  cond1_m <- r1_start >= r2_start - BUFFER
  cond1 <- ifelse(x$strand.r1 == "+", cond1_p, cond1_m)
  cond2 <- x$mate_categ == 1
  cond3 <- !(x$read_categ %in% c(1,2))
  x$read_categ[(cond1 & cond2 & cond3)] <- 3 #(linear_genomic)
  return(x)
}

classifyLinearIgnore <- function(x) {
  cond1 <- x$mate_categ == 3
  cond2 <- !(x$read_categ %in% c(1))
  x$read_categ[(cond1 & cond2)] <- 4 #(ignore)
  return(x)
}

linear_reads_mapped <-
  linear_reads_mapped %>%
  classifyDecoyOrAnomalous() %>%
  classifyLinearPairedMates() %>%
  classifyLinearGenomicMates() %>%
  classifyLinearIgnore()

scrambled_reads_mapped <-
  scrambled_reads_mapped %>%
  classifyDecoyOrAnomalous() %>%
  classifyCircularPairedMates() %>%
  classifyCircularLinearMates() %>%
  classifyCircularGenomicMates()

The frequencies of the read categories are as follows:

linear_junction_reads <- linear_reads_mapped$read_categ %>%  table  %>% as.data.frame()
rownames(linear_junction_reads) <- c("Uncategorized","Anomaly", "LinearLinear", "linearGenome" ,"Ignore")
scrambled_junction_reads <- scrambled_reads_mapped$read_categ %>%  table %>%  as.data.frame()
rownames(scrambled_junction_reads) <- c("Uncategorized","Decoy", "CircularCircular", "CircularLinear" ,"CircularGenome")
linear_junction_reads 
              .   Freq
Uncategorized 0   2561
Anomaly       1  11913
LinearLinear  2 196786
linearGenome  3 343574
Ignore        4    400
scrambled_junction_reads
                 . Freq
Uncategorized    0   59
Decoy            1   58
CircularCircular 2   52
CircularLinear   3  356
CircularGenome   4  857

The total running time for the 6 preprocessing steps (assuming ptm was recorded with ptm <- proc.time() before step 1) is:

proc.time() - ptm
   user  system elapsed
104.764   2.000 684.007 
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] pander_0.6.0               dplyr_0.5.0
 [3] systemPipeR_1.9.2          ShortRead_1.32.0
 [5] GenomicAlignments_1.10.0   SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             BiocParallel_1.8.1
 [9] Rsamtools_1.26.1           Biostrings_2.42.1
[11] XVector_0.14.0             GenomicRanges_1.26.1
[13] GenomeInfoDb_1.10.2        IRanges_2.8.1
[15] S4Vectors_0.12.1           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8            locfit_1.5-9.1         lattice_0.20-34
 [4] GO.db_3.4.0            assertthat_0.1         rprojroot_1.1
 [7] digest_0.6.10          R6_2.2.0               plyr_1.8.4
[10] BatchJobs_1.6          backports_1.0.4        RSQLite_1.1-1
[13] evaluate_0.10          ggplot2_2.2.1          zlibbioc_1.20.0
[16] GenomicFeatures_1.26.2 lazyeval_0.2.0         annotate_1.52.1
[19] Matrix_1.2-7.1         checkmate_1.8.2        rmarkdown_1.3
[22] GOstats_2.40.0         splines_3.3.2          stringr_1.1.0
[25] pheatmap_1.0.8         RCurl_1.95-4.8         biomaRt_2.30.0
[28] munsell_0.4.3          sendmailR_1.2-1        rtracklayer_1.34.1
[31] base64enc_0.1-3        BBmisc_1.10            htmltools_0.3.5
[34] fail_1.3               tibble_1.2             edgeR_3.16.5
[37] XML_3.98-1.5           AnnotationForge_1.16.0 bitops_1.0-6
[40] grid_3.3.2             RBGL_1.50.0            xtable_1.8-2
[43] GSEABase_1.36.0        gtable_0.2.0           DBI_0.5-1
[46] magrittr_1.5           scales_0.4.1           graph_1.52.0
[49] stringi_1.1.2          hwriter_1.3.2          genefilter_1.56.0
[52] limma_3.30.7           latticeExtra_0.6-28    brew_1.0-6
[55] rjson_0.2.15           RColorBrewer_1.1-2     tools_3.3.2
[58] Category_2.40.0        survival_2.40-1        yaml_2.1.14
[61] AnnotationDbi_1.36.0   colorspace_1.3-2       memoise_1.0.0
[64] knitr_1.15.1          

References

Szabo, Linda, Robert Morey, Nathan J. Palpant, Peter L. Wang, Nastaran Afari, Chuan Jiang, Mana M. Parast, Charles E. Murry, Louise C. Laurent, and Julia Salzman. 2015. “Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development.” Genome Biology 16: 126. doi:10.1186/s13059-015-0690-5.