Here we run an analysis pipeline in R for the identification of circular RNAs (circRNAs). The pipeline recreates parts of the KNIFE pipeline described by Szabo et al. (2015). The code provided here uses the systemPipeR package with functions designed to run the pipeline on a local desktop. Consult the systemPipeR documentation to modify the code so it can run on a cluster.

Preparatory steps

To install bioinformatics software and resolve dependencies, you can use the Bioconda package manager. First download Miniconda, then create an environment, activate it, add the Bioconda channels, and install the required software. For example, here we create a local environment named rnaseq and install bowtie2 and trim-galore.

conda create -n rnaseq python=2 anaconda
source activate rnaseq
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda install bowtie2
conda install trim-galore
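
The later preprocessing steps also call fastq-dump from the SRA Toolkit; if it is not already available on your system, it can be installed into the same environment (the Bioconda package is named sra-tools):

conda install sra-tools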

1. Create project directories

First start a project and create subdirectories under the project’s working directory

folders <- c(
  "parameters", "src",
  "data/genome_data/hg19/fasta",
  "data/genome_data/hg19/bowtie2_index",
  "data/biosamples",
  "results/trimmed",
  "results/aligned/genome",
  "results/aligned/ribosomal",
  "results/aligned/junctions_reg",
  "results/aligned/junctions_scrambled",
  "results/analysis/reports"
)

for (folder in folders) {
  dir.create(file.path(getwd(), folder), recursive = TRUE)
}

2. Download genome data

To get started, we need to download the hg19_fastas.tar.gz archive, which contains FASTA files for the ribosomal RNA, the linear junctions, the scrambled junctions, and the reference genome. The compressed tar file is available at this link.
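
A minimal sketch of the download and extraction step, assuming the archive is placed under data/genome_data/hg19 (the URL below is only a placeholder for the actual link):

# placeholder for the actual download link to hg19_fastas.tar.gz
hg19_archive_url <- "<link to hg19_fastas.tar.gz>"
hg19_archive <- "./data/genome_data/hg19/hg19_fastas.tar.gz"
download.file(hg19_archive_url, hg19_archive)
# extract the fasta files into the fasta subdirectory created above
untar(hg19_archive, exdir = "./data/genome_data/hg19/fasta")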

3. Build index files

We first load the systemPipeR package and other required packages. We also create the external files needed by the command-line software: the param files, which specify the command-line parameters, and the targets files, which contain the relative paths of the required data files.

#source("http://bioconductor.org/biocLite.R")
#biocLite("tgirke/systemPipeR", dependencies = TRUE)
library(systemPipeR)
library(dplyr)
library(pander)

The prebuilt index files are large and might take a long time to download on a regular connection, so it can be faster to simply run the Bowtie2 index builder to create them. The following two lines build the indices. The param and targets files are available here.

args_bowtie2_index <- systemArgs(sysma = "./parameters/bowtie2_index.param",
                                 mytargets = "./parameters/hg19_fasta_filepaths.txt")
runCommandline(args = args_bowtie2_index)

4. Download sample data

We use the same HER2 Positive Breast Tumor dataset used in the KNIFE paper as sample data. The data consist of paired-end reads from an rRNA-depleted total RNA-seq library. More details about the experiment can be found here. To ensure reproducibility and data provenance, we download the SRA run file directly from NCBI's Sequence Read Archive.

download.file("ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX374/SRX374866/SRR1027187/SRR1027187.sra", "./data/biosamples/SRR1027187.sra")

Next we convert the SRA file to fastq using fastq-dump. The options provided specify the following: split the paired reads into two files, dump the biological reads only, and compress the output using gzip. We also keep the defaults by not adding read ids and leaving the quality encoding as ASCII+33.

system("fastq-dump --split-files --skip-technical --gzip ./data/biosamples/SRR1027187.sra --outdir ./data/biosamples/")

5. Trim data

Reads in the raw data can contain low-quality bases and partial adapter sequences that need to be removed. Trimmed reads that are too short should also be filtered out in order to obtain good alignments.

args_trim <- systemArgs(sysma = "./parameters/trim_galore.param",
                           mytargets = "./parameters/data_filepaths.txt")
trim_command <- paste(sysargs(args_trim)[1],
                      unlist(strsplit(sysargs(args_trim)[2], " "))[8],
                      sep = " ")
(trim_command)
[1] "trim_galore --length 50 --paired --output_dir /hd2/KNIFE_project_01/results/trimmed/  /hd2/KNIFE_project_01/data/biosamples/SRR1027187_1.fastq.gz  /hd2/KNIFE_project_01/data/biosamples/SRR1027187_2.fastq.gz"

Although we are treating each paired-end read independently, we add the --paired option to the trim_galore command to perform a validation test that ensures both reads of a pair meet the minimum length specified by the --length option. To run trim-galore we use the system function

system(trim_command)

6. Run data quality control checks

Lastly, we generate and plot FASTQ quality summary statistics. The FastQC manual provides a good explanation of the quality-control plots. First we define a helper function

runFastqPlots <- function(file_list, report_name) {
  fqlist <- bplapply(seq_along(file_list),
                     function(x)
                       seeFastq(
                         fastq = file_list[x],
                         batchsize = 100000,
                         klength = 8
                       ),
                     BPPARAM = MulticoreParam(workers = 4))
  png(
    sprintf(
      "./results/analysis/reports/fastq_report_%s.png",
      report_name
    ),
    height = 800,
    width = 200 * length(fqlist)
  )
  seeFastqPlot(unlist(fqlist, recursive = FALSE))
  dev.off()
}

then we run it on the raw data

runFastqPlots(file_list = infile1(args_trim),
              report_name = "SRR1027187")

and on the trimmed data

data_trimmed_filepaths <- paste0(outfile1(args_trim),
                                 SampleName(args_trim),
                                 c("_val_1.fq.gz", "_val_2.fq.gz"))
names(data_trimmed_filepaths) <- paste0(SampleName(args_trim),"_trimmed")

runFastqPlots(file_list = data_trimmed_filepaths,
              report_name = "SRR1027187_trimmed")

Next we plot the reports side by side

[Figure: FASTQ quality report plots for the raw data (fastqcplots) and the trimmed data (fastqc_trimmed_plots)]
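
For example, in an R Markdown report the two PNG files written by runFastqPlots could be embedded next to each other (the file names follow the sprintf pattern used in the helper function above):

knitr::include_graphics(c(
  "./results/analysis/reports/fastq_report_SRR1027187.png",
  "./results/analysis/reports/fastq_report_SRR1027187_trimmed.png"
))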

The plots for the trimmed data look much better. Although the mean quality distribution for the reads is bimodal, there are no low quality peaks.

Alignment Steps

Now we independently align each paired-end read dataset to four separate Bowtie2 indices: for the genome, ribosomal RNA, linear exon–exon junctions, and scrambled exon–exon junctions. First we load Bowtie2 parameter files and filepaths

args_align_genome <- systemArgs(sysma = "./parameters/bowtie2_align_genome.param",
                                mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_ribo <- systemArgs(sysma = "./parameters/bowtie2_align_ribosomal.param",
                              mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_juncts_reg <- systemArgs(sysma = "./parameters/bowtie2_align_junctions_reg.param",
                                    mytargets = "./parameters/data_trimmed_filepaths.txt")
args_align_juncts_scram <- systemArgs(sysma = "./parameters/bowtie2_align_junctions_scrambled.param",
                                      mytargets = "./parameters/data_trimmed_filepaths.txt")

then we run the Bowtie2 alignment commands through systemPipeR

# align to genome
genome_bam_files <- runCommandline(args = args_align_genome)
# align to ribosome
ribo_bam_files <- runCommandline(args = args_align_ribo)
# align to linear junctions
juncts_reg_bam_files <- runCommandline(args = args_align_juncts_reg)
# align to scrambled junctions
juncts_scram_files <- runCommandline(args = args_align_juncts_scram)

We can open “submitargs01_log” for each alignment to display the alignment summary and the system command used. For the genome alignment, we can open the file inside R like this

readLines(file.path("./results/aligned/genome","submitargs01_log"))
 [1] "bowtie2 -p 8 --no-unal --score-min L,0,-0.24 --rdg 50,50 --rfg 50,50 -x /hd2/KNIFE_project_01/data/genome_data/hg19/bowtie2_index/hg19_genome -U /hd2/KNIFE_project_01/results/trimmed/SRR1027187_1_val_1.fq.gz  -S /hd2/KNIFE_project_01/results/aligned/genome/SRR1027187_1_val_1.sam"
 [2] "19308315 reads; of these:"
 [3] "  19308315 (100.00%) were unpaired; of these:"
 [4] "    6692805 (34.66%) aligned 0 times"
 [5] "    8628824 (44.69%) aligned exactly 1 time"
 [6] "    3986686 (20.65%) aligned >1 times"
 [7] "65.34% overall alignment rate"
 [8] "bowtie2 -p 8 --no-unal --score-min L,0,-0.24 --rdg 50,50 --rfg 50,50 -x /hd2/KNIFE_project_01/data/genome_data/hg19/bowtie2_index/hg19_genome -U /hd2/KNIFE_project_01/results/trimmed/SRR1027187_2_val_2.fq.gz  -S /hd2/KNIFE_project_01/results/aligned/genome/SRR1027187_2_val_2.sam"
 [9] "19308315 reads; of these:"
[10] "  19308315 (100.00%) were unpaired; of these:"
[11] "    6753734 (34.98%) aligned 0 times"
[12] "    8563704 (44.35%) aligned exactly 1 time"
[13] "    3990877 (20.67%) aligned >1 times"
[14] "65.02% overall alignment rate"                                                                                                                                                                                                                                                          

Preprocessing steps

The previous alignment step produced 8 BAM files: two BAM files (one per read mate) for each of the genome, ribosomal, linear junction, and scrambled junction alignments. We are interested in junctional reads, i.e. the reads among the 8 files that align to a scrambled or linear junction. Since the majority of the reads in the genome and ribosomal BAM files are not junctional, we need to filter them out.

1. Get read IDs for genomic mate pairs

Since the next preprocessing steps are memory intensive, we initially filter out reads in which both mates align to the genome or the ribosomal index. We also filter out unaligned reads and return lists of read IDs for which both mates align to the indices, or for which one mate aligns but the other does not. To proceed, we need to get the names of the reads that are found in both the R1 and R2 genomic alignment files.

getReadIds <- function(bamfile) {
  scanBam(bamfile,
          param = ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE),
                               what = "qname"))[[1]][[1]]
  }

getCommonGenomicPairedReads <- function(genome_bamfiles, ribo_bamfiles) {

  genome_reads <- bplapply(genome_bamfiles,
                           getReadIds,
                           BPPARAM = MulticoreParam(workers = 2))
  ribosomal_reads <- bplapply(ribo_bamfiles,
                              getReadIds,
                              BPPARAM = MulticoreParam(workers = 2))

  genomic_reads_1 <- union(genome_reads[[1]], ribosomal_reads[[1]])
  genomic_reads_2 <- union(genome_reads[[2]], ribosomal_reads[[2]])
  common_reads <- intersect(genomic_reads_1, genomic_reads_2)
  r1_not_r2 <- setdiff(genomic_reads_1, genomic_reads_2)
  r2_not_r1 <- setdiff(genomic_reads_2, genomic_reads_1)

  return(
    list(
      common_reads = common_reads,
      r1_not_r2 = r1_not_r2,
      r2_not_r1 = r2_not_r1
    )
  )
}

Note that the code above uses the parallel lapply function bplapply. If your system has less than 32 GB of RAM, you can modify the code to run with a single worker or use lapply instead (see the sketch after the next code block). Here we retrieve the locations of the aligned files and feed them into the function defined above.

genome_aligned <- outpaths(args_align_genome)
ribo_aligned <- outpaths(args_align_ribo)

reads_to_ignore <- getCommonGenomicPairedReads(genome_aligned, ribo_aligned)
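
A minimal sketch of the lower-memory alternative, keeping the same logic but running it serially:

# inside getCommonGenomicPairedReads(), replace the bplapply calls with base
# lapply (or pass BPPARAM = SerialParam()) to use a single worker
genome_reads    <- lapply(genome_bamfiles, getReadIds)
ribosomal_reads <- lapply(ribo_bamfiles, getReadIds)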

2. Read alignment data and run initial filtering

Now we use these read IDs to create two filters. The first filters out genomic reads in which both mates (R1 and R2) align. The second is a junctional filter that keeps only junctional reads that overlap the junction, located at position MIDPOINT = 150 of each junction reference sequence, by at least JUNC_OVERLAP = 8 nucleotides, and that also satisfy the first filter.

genomic_reads_filter <-
  FilterRules(list(
    include = function(x) !(mcols(x)$qname %in% reads_to_ignore$common_reads)
  ))

MIDPOINT <- 150
JUNC_OVERLAP <- 8

junction_reads_filter <-
  FilterRules(list(
    include = function(x)
      (
        (MIDPOINT + JUNC_OVERLAP + 1 - width(x) <= start(x)) &
        (start(x) <= MIDPOINT - JUNC_OVERLAP + 1) &
        !(mcols(x)$qname %in% reads_to_ignore$common_reads)
      )
  ))
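
As a quick check of the arithmetic, assume a 60 nt read: the filter keeps alignments whose start position lies between 99 and 143, so the read extends at least 8 nt on each side of the junction in the junction reference sequence:

# worked example of the junction-overlap window for a 60 nt read
read_width <- 60
c(earliest_start = MIDPOINT + JUNC_OVERLAP + 1 - read_width,  # 150 + 8 + 1 - 60 = 99
  latest_start   = MIDPOINT - JUNC_OVERLAP + 1)               # 150 - 8 + 1 = 143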

As part of the filtering step, we also update the alignment score for the junction reads such that the N-penalty is corrected.

countAmbiguousBases <- function(align_obj) {
  # count N bases recorded in the MD tag (used to correct the N-penalty)
  sapply(gregexpr("N0|0N", mcols(align_obj)$MD),
         function(x) {
           if (attributes(x)$match.length[1] == -1) {
             return(0)
           } else {
             return(length(x))
           }
         })
}

#MD and XN do agree
filterAlignments <-
  function(bamfile, param, filter, updateScore = TRUE) {
    x <- readGAlignments(file = bamfile, param = param)
    x <- subsetByFilter(x, filter)
    seqlevels(x) <- seqlevelsInUse(x)
    # correct for N-penalty in junction alignments and update alignment score
    if (updateScore) {
      mcols(x)$AS <- mcols(x)$AS + countAmbiguousBases(x)
    }
    return(x)
  }

# get file paths
linear_junctions_aligned <- outpaths(args_align_juncts_reg)
scrambled_junctions_aligned <- outpaths(args_align_juncts_scram)

# run filters

param <-
  ScanBamParam(
    what = c("qname","mapq"), # get the read name and the mapping quality
    flag = scanBamFlag(isUnmappedQuery = FALSE),
    tag = c("AS", "XS", "XN", "MD") # get tags
  )

genomic_filtered <- bplapply(
  genome_aligned,
  filterAlignments,
  param = param,
  filter = genomic_reads_filter,
  updateScore = FALSE,
  BPPARAM = MulticoreParam(workers = 2)
)

linear_junctions_filtered <- bplapply(
  linear_junctions_aligned,
  filterAlignments,
  param = param,
  filter = junction_reads_filter,
  BPPARAM = MulticoreParam(workers = 2)
)

scrambled_junctions_filtered <- bplapply(
  scrambled_junctions_aligned,
  filterAlignments,
  param = param,
  filter = junction_reads_filter,
  BPPARAM = MulticoreParam(workers = 2)
)

Let’s take a look at the first few rows of the GAlignments object for the linear junction R1 aligned reads

linear_junctions_filtered[[1]][1:3,] %>% as.data.frame() %>% pander()
Table continues below
seqnames strand cigar qwidth start
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 59M 59 106
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 59M 59 109
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ + 60M 60 122
end width njunc qname mapq AS XS XN MD
164 59 0 SRR1027187.35004121 39 0 -14 0 59
167 59 0 SRR1027187.35267243 24 -5 NA 0 56T2
181 60 0 SRR1027187.3091383 42 0 NA 0 60

We can also examine the frequency of the CIGAR strings

linear_junctions_filtered[[1]] %>%
  cigar() %>%
  table %>%
  sort(decreasing = TRUE) %>%
  head() 
.
   60M    59M    58M    57M    56M    55M
397762 138477  45942  13278   3761   3041 

3. Convert GAlignment objects into dataframes

toDataFrame <- function(x) {
  x <- as.data.frame(x)
  x$seqnames <- as.character(x$seqnames)
  x$strand <- as.character(x$strand)
  x <- subset(x, select = -c(cigar, qwidth, njunc, XS, XN, MD))
  return(x)
}

appendJunctionTags <- function(x) {
  df <- data.frame(do.call(rbind,
                           lapply(as.character(seqnames(x)),
                                  function(x) unlist(strsplit(x, "[:]|[|]")))),
                   stringsAsFactors = FALSE)
  colnames(df) <- c("g_chrom", "g_gene1", "g_pos1", "g_gene2",
                    "g_pos2", "junc_type", "junc_strand")
  df$g_pos1 <- as.integer(df$g_pos1)
  df$g_pos2 <- as.integer(df$g_pos2)
  mcols(x) <- cbind(mcols(x), df)
  return(toDataFrame(x))
}

genomic_filtered_dt <- bplapply(
  genomic_filtered,
  toDataFrame,
  BPPARAM = MulticoreParam(workers = 2)
)

linear_junctions_filtered_dt <- bplapply(
  linear_junctions_filtered,
  appendJunctionTags,
  BPPARAM = MulticoreParam(workers = 2)
)

scrambled_junctions_filtered_dt <- bplapply(
  scrambled_junctions_filtered,
  appendJunctionTags,
  BPPARAM = MulticoreParam(workers = 2)
)

Let’s take a look at the dataframe for the linear junctions R1 aligned reads

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
linear_junctions_filtered_dt[[1]][1:3,] %>% pander()
Table continues below
seqnames strand start end width qname mapq AS g_chrom g_gene1 g_pos1
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 106 164 59 SRR1027187.35004121 39 0 chr10 ZMYND11 226068
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 109 167 59 SRR1027187.35267243 24 -5 chr10 ZMYND11 226068
chr10|ZMYND11:226068|ZMYND11:255828|reg|+ + 122 181 60 SRR1027187.3091383 42 0 chr10 ZMYND11 226068
g_gene2 g_pos2 junc_type junc_strand
ZMYND11 255828 reg +
ZMYND11 255828 reg +
ZMYND11 255828 reg +

4. Filter R1 junctional reads and pair them with R2 mates

A read (R1) is a linear junction read if it aligns to a linear junction and does not align to the genome or the ribosomal index. In the code, these reads are stored in linear_1. A read (R1) is a scrambled junction read if it aligns to a scrambled junction and does not align to the genome, the ribosomal index, or a linear junction. Scrambled junction reads are stored in scrambled_1. The mate read (R2) can be a genomic, linear junction, or scrambled junction read, as determined by whether the read name is found in the corresponding R2 alignments. An R1 read can have a single R2 mate or multiple R2 mates, and some have none at all; these are the unmapped reads.

createJunctionsReadsDF <- function(genomic,
                                   linear,
                                   scrambled,
                                   r1_not_r2,
                                   suffixes = c(".r1", ".r2g", ".r2lin", ".r2scram")) {

  linear_1 <- linear[[1]][!(linear[[1]]$qname %in% r1_not_r2), ]
  scrambled_1 <- scrambled[[1]][!(scrambled[[1]]$qname %in% c(r1_not_r2, linear_1$qname)), ]

  genomic_mate <- genomic[[2]][(genomic[[2]]$qname %in% c(linear_1$qname,scrambled_1$qname)), ]
  linear_mate <- linear[[2]][(linear[[2]]$qname %in% c(linear_1$qname,scrambled_1$qname)), ]
  scrambled_mate <- scrambled[[2]][(scrambled[[2]]$qname %in% c(linear_1$qname, scrambled_1$qname)), ]

  linear_genomic <- left_join(linear_1,
                              genomic_mate,
                              by = "qname",
                              suffix = suffixes[c(1, 2)])

  linear_linear <- left_join(linear_1,
                             linear_mate ,
                             by = "qname",
                             suffix = suffixes[c(1, 3)])

  linear_scrambled <- left_join(linear_1,
                                scrambled_mate,
                                by = "qname",
                                suffix = suffixes[c(1, 4)])

  scrambled_genomic <- left_join(scrambled_1,
                                 genomic_mate,
                                 by = "qname",
                                 suffix = suffixes[c(1, 2)])

  scrambled_linear <- left_join(scrambled_1,
                                linear_mate ,
                                by = "qname",
                                suffix = suffixes[c(1, 3)])

  scrambled_scrambled <- left_join(scrambled_1,
                                   scrambled_mate,
                                   by = "qname",
                                   suffix = suffixes[c(1, 4)])

  linear_df <- cbind(linear_linear[, 1:15],
                     linear_genomic[, 16:22],
                     linear_linear[, 16:29],
                     linear_scrambled[, 16:29])

  scrambled_df <- cbind(scrambled_linear[, 1:15],
                        scrambled_genomic[, 16:22],
                        scrambled_linear[, 16:29],
                        scrambled_scrambled[, 16:29])

  return(list(linear_df = linear_df, scrambled_df = scrambled_df))
}

R1R2MatesJunctionReads <- createJunctionsReadsDF(genomic_filtered_dt,
                                                 linear_junctions_filtered_dt,
                                                 scrambled_junctions_filtered_dt,
                                                 reads_to_ignore$r1_not_r2)

5. Classify Mate Reads

Here we classify each mate read (R2) according to which of its alignments (genomic, linear junction, or scrambled junction) has the highest alignment score. We also create a new variable, ascore, that averages the alignment score of R1 with that of its best-scoring mate.

classifyMateReads <- function(df, suffixes = c(".r1", ".r2g", ".r2lin", ".r2scram")) {

    scores_df <- df[, paste0("AS", suffixes)]
    map_categ <- apply(scores_df[, 2:4] , 1, function(x) sum(!is.na(x)))
    max_score <- apply(scores_df[, 2:4] , 1, which.max)

    mate_categ <- sapply(max_score, function(x)
      if (is.null(attributes(x)))
        {return(0)}
      else
        {return(as.integer(x))}
      )

    matescore <- scores_df[cbind(1:nrow(scores_df), (mate_categ + 1))]
    ascore <- 0.5 * (scores_df$AS.r1 + matescore)
    df <- cbind(df,
                mate_categ = mate_categ,
                ascore = ascore,
                map_categ = map_categ)
    return(df)
  }

linear_df <- R1R2MatesJunctionReads$linear_df %>% classifyMateReads()
scrambled_df <- R1R2MatesJunctionReads$scrambled_df %>% classifyMateReads()

cat( " number of linear junction reads",
     NROW(linear_df), "\n",
     "number of scrambled junction reads",
     NROW(scrambled_df), "\n",
  "total number of junctional reads:",
    NROW(linear_df) + NROW(scrambled_df), "\n",
    "number of junctions with aligned reads:",
    length(unique(linear_df$seqnames.r1)) +
      length(unique(scrambled_df$seqnames.r1)))
 number of linear junction reads 600658
 number of scrambled junction reads 1599
 total number of junctional reads: 602257
 number of junctions with aligned reads: 108949

Let’s take a look at the dataframe for the scrambled junction R1 aligned reads and their paired mates

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
scrambled_df[1:3,]  %>% pander()
Table continues below
seqnames.r1 strand.r1 start.r1 end.r1 width.r1 qname mapq.r1 AS.r1 g_chrom.r1
chr10|RBM17:6139151|FAM208B:5762506|rev|+ - 143 202 60 SRR1027187.5794479 0 -5 chr10
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ - 126 185 60 SRR1027187.4048280 42 -1 chr10
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ - 130 185 56 SRR1027187.29265526 42 0 chr10
Table continues below
g_gene1.r1 g_pos1.r1 g_gene2.r1 g_pos2.r1 junc_type.r1 junc_strand.r1 seqnames.r2g strand.r2g start.r2g end.r2g
RBM17 6139151 FAM208B 5762506 rev + chr10 + 5754806 5754865
ATP5C1 7844817 ATP5C1 7839009 rev + NA NA NA NA
ATP5C1 7844817 ATP5C1 7839009 rev + NA NA NA NA
Table continues below
width.r2g mapq.r2g AS.r2g seqnames.r2lin strand.r2lin start.r2lin end.r2lin width.r2lin
60 42 0 NA NA NA NA NA
NA NA NA chr10|ATP5C1:7844388|ATP5C1:7844720|reg|+ + 101 160 60
NA NA NA NA NA NA NA NA
Table continues below
mapq.r2lin AS.r2lin g_chrom.r2lin g_gene1.r2lin g_pos1.r2lin g_gene2.r2lin g_pos2.r2lin junc_type.r2lin junc_strand.r2lin
NA NA NA NA NA NA NA NA NA
42 -1 chr10 ATP5C1 7844388 ATP5C1 7844720 reg +
NA NA NA NA NA NA NA NA NA
Table continues below
seqnames.r2scram strand.r2scram start.r2scram end.r2scram width.r2scram mapq.r2scram AS.r2scram
NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
chr10|ATP5C1:7844817|ATP5C1:7839009|rev|+ + 130 185 56 42 0
Table continues below
g_chrom.r2scram g_gene1.r2scram g_pos1.r2scram g_gene2.r2scram g_pos2.r2scram junc_type.r2scram junc_strand.r2scram
NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
chr10 ATP5C1 7844817 ATP5C1 7839009 rev +
mate_categ ascore map_categ
1 -2.5 1
2 -1 1
3 0 1

Let’s also output the frequency of R1 reads that have 0, 1, 2, or 3 potential paired-end mate reads (R2)

linear_junction_mapcateg <- linear_df$map_categ %>% table() %>% as.data.frame()
rownames(linear_junction_mapcateg) <- c("Unmapped", "OneMateMatch", "TwoMateMatches", "ThreeMateMatches")
scrambled_junction_mapcateg <- scrambled_df$map_categ %>% table() %>% as.data.frame()
rownames(scrambled_junction_mapcateg) <- c("Unmapped", "OneMateMatch", "TwoMateMatches", "ThreeMateMatches")
linear_junction_mapcateg 
                 .   Freq
Unmapped         0  45424
OneMateMatch     1 541523
TwoMateMatches   2  13490
ThreeMateMatches 3    221
scrambled_junction_mapcateg 
                 . Freq
Unmapped         0  217
OneMateMatch     1 1346
TwoMateMatches   2   35
ThreeMateMatches 3    1

6. Classify Junction Reads

We split the reads into those that have a mapped mate (R2) and those that do not. We initially assign all reads as uncategorized (i.e. read_categ = 0).

linear_reads_split <- split(linear_df, linear_df$map_categ == 0)
scrambled_reads_split <- split(scrambled_df, scrambled_df$map_categ == 0)
unmapped_reads <- rbind(linear_reads_split[[2]], scrambled_reads_split[[2]])[,1:15]

linear_reads_mapped <- cbind(linear_reads_split[[1]] , read_categ = 0)
scrambled_reads_mapped <- cbind(scrambled_reads_split[[1]] , read_categ = 0)

Let’s take a look at the unmapped reads

panderOptions("table.style", "simple")
panderOptions("table.split.table", 140)
unmapped_reads[1:3,] %>% pander()
Table continues below
  seqnames.r1 strand.r1 start.r1 end.r1 width.r1 qname mapq.r1 AS.r1
1 chr10|ZMYND11:226068|ZMYND11:255828|reg|+ - 106 164 59 SRR1027187.35004121 39 0
7 chr10|GTPBP4:1046808|GTPBP4:1046887|reg|+ - 119 171 53 SRR1027187.12128237 23 -5
19 chr10|GTPBP4:1056460|GTPBP4:1058404|reg|+ - 104 160 57 SRR1027187.25775441 37 0
  g_chrom.r1 g_gene1.r1 g_pos1.r1 g_gene2.r1 g_pos2.r1 junc_type.r1 junc_strand.r1
1 chr10 ZMYND11 226068 ZMYND11 255828 reg +
7 chr10 GTPBP4 1046808 GTPBP4 1046887 reg +
19 chr10 GTPBP4 1056460 GTPBP4 1058404 reg +

Now we proceed to classify scrambled junctional reads as circular or decoy and linear junctional reads as linear or anomalous.

classifyDecoyOrAnomalous <- function(x) {

  cond1 <- x$strand.r1 != x[, c("strand.r2g", "strand.r2lin", "strand.r2scram")][cbind(1:nrow(x), x$mate_categ)]
  cond2 <- x$g_chrom.r1 == x[, c("seqnames.r2g", "g_chrom.r2lin", "g_chrom.r2scram")][cbind(1:nrow(x), x$mate_categ)]
  x$read_categ[!(cond1 & cond2)] <- 1 # decoy (scrambled) or anomaly (linear)

  return(x)
}

classifyCircularPairedMates <- function(x) {

  cond1 <- x$mate_categ == 3
  cond2 <- x$seqnames.r1 == x$seqnames.r2scram
  cond3 <- !(x$read_categ == 1)
  x$read_categ[(cond1 & cond2 & cond3)] <- 2 #(circ)

  return(x)
}

classifyCircularLinearMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1)
  r1_end <- pmax(x$g_pos1.r1, x$g_pos2.r1)
  r2_start <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$start.r2lin
  r2_end <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$end.r2lin

  cond1 <- r2_start >=  r1_start - BUFFER
  cond2 <- r2_start <=  r1_end + BUFFER
  cond3 <- r2_end >= r1_start - BUFFER
  cond4 <- r2_end <= r1_end + BUFFER
  cond5 <- x$mate_categ == 2
  cond6 <- !(x$read_categ %in% c(1,2))

  x$read_categ[(cond1 & cond2 & cond3 & cond4 & cond5 & cond6 )] <- 3 #(circ_lin)
  return(x)
}

classifyCircularGenomicMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1)
  r1_end <- pmax(x$g_pos1.r1, x$g_pos2.r1)
  r2_start <- x$start.r2g
  r2_end <- x$end.r2g

  cond1 <- r2_start >=  r1_start - BUFFER
  cond2 <- r2_start <=  r1_end + BUFFER
  cond3 <- r2_end >= r1_start - BUFFER
  cond4 <- r2_end <= r1_end + BUFFER
  cond5 <- x$mate_categ == 1
  cond6 <- !(x$read_categ %in% c(1,2,3))

  x$read_categ[(cond1 & cond2 & cond3 & cond4 & cond5 & cond6 )] <- 4 #(circ_gen)
  return(x)
}

classifyLinearPairedMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1) - MIDPOINT + x$start.r1
  r2_start <- pmin(x$g_pos1.r2lin, x$g_pos2.r2lin)  - MIDPOINT +  x$start.r2lin

  cond1_p <- r2_start >= r1_start - BUFFER
  cond1_m <- r1_start >= r2_start - BUFFER
  cond1 <- ifelse(x$strand.r1 == "+", cond1_p, cond1_m)
  cond2 <- x$mate_categ == 2
  cond3 <- !(x$read_categ == 1)
  x$read_categ[(cond1 & cond2 & cond3)] <- 2 #(linear_linear)
  return(x)

}

classifyLinearGenomicMates <- function(x, BUFFER = 15, MIDPOINT = 150) {

  r1_start <- pmin(x$g_pos1.r1, x$g_pos2.r1) - MIDPOINT + x$start.r1
  r2_start <- x$start.r2g

  cond1_p <- r2_start >= r1_start - BUFFER
  cond1_m <- r1_start >= r2_start - BUFFER
  cond1 <- ifelse(x$strand.r1 == "+", cond1_p, cond1_m)
  cond2 <- x$mate_categ == 1
  cond3 <- !(x$read_categ %in% c(1,2))
  x$read_categ[(cond1 & cond2 & cond3)] <- 3 #(linear_genomic)
  return(x)
}

classifyLinearIgnore <- function(x) {
  cond1 <- x$mate_categ == 3
  cond2 <- !(x$read_categ %in% c(1))
  x$read_categ[(cond1 & cond2)] <- 4 #(ignore)
  return(x)
}

linear_reads_mapped <-
  linear_reads_mapped %>%
  classifyDecoyOrAnomalous() %>%
  classifyLinearPairedMates() %>%
  classifyLinearGenomicMates() %>%
  classifyLinearIgnore()

scrambled_reads_mapped <-
  scrambled_reads_mapped %>%
  classifyDecoyOrAnomalous() %>%
  classifyCircularPairedMates() %>%
  classifyCircularLinearMates() %>%
  classifyCircularGenomicMates()

The frequencies of the read categories are as follows:

linear_junction_reads <- linear_reads_mapped$read_categ %>%  table  %>% as.data.frame()
rownames(linear_junction_reads) <- c("Uncategorized","Anomaly", "LinearLinear", "linearGenome" ,"Ignore")
scrambled_junction_reads <- scrambled_reads_mapped$read_categ %>%  table %>%  as.data.frame()
rownames(scrambled_junction_reads) <- c("Uncategorized","Decoy", "CircularCircular", "CircularLinear" ,"CircularGenome")
linear_junction_reads 
              .   Freq
Uncategorized 0   2561
Anomaly       1  11913
LinearLinear  2 196786
linearGenome  3 343574
Ignore        4    400
scrambled_junction_reads
                 . Freq
Uncategorized    0   59
Decoy            1   58
CircularCircular 2   52
CircularLinear   3  356
CircularGenome   4  857

The total running time for the 6 preprocessing steps (assuming ptm was recorded with ptm <- proc.time() before step 1) is:

proc.time() - ptm
   user  system elapsed
104.764   2.000 684.007 
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] pander_0.6.0               dplyr_0.5.0
 [3] systemPipeR_1.9.2          ShortRead_1.32.0
 [5] GenomicAlignments_1.10.0   SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             BiocParallel_1.8.1
 [9] Rsamtools_1.26.1           Biostrings_2.42.1
[11] XVector_0.14.0             GenomicRanges_1.26.1
[13] GenomeInfoDb_1.10.2        IRanges_2.8.1
[15] S4Vectors_0.12.1           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8            locfit_1.5-9.1         lattice_0.20-34
 [4] GO.db_3.4.0            assertthat_0.1         rprojroot_1.1
 [7] digest_0.6.10          R6_2.2.0               plyr_1.8.4
[10] BatchJobs_1.6          backports_1.0.4        RSQLite_1.1-1
[13] evaluate_0.10          ggplot2_2.2.1          zlibbioc_1.20.0
[16] GenomicFeatures_1.26.2 lazyeval_0.2.0         annotate_1.52.1
[19] Matrix_1.2-7.1         checkmate_1.8.2        rmarkdown_1.3
[22] GOstats_2.40.0         splines_3.3.2          stringr_1.1.0
[25] pheatmap_1.0.8         RCurl_1.95-4.8         biomaRt_2.30.0
[28] munsell_0.4.3          sendmailR_1.2-1        rtracklayer_1.34.1
[31] base64enc_0.1-3        BBmisc_1.10            htmltools_0.3.5
[34] fail_1.3               tibble_1.2             edgeR_3.16.5
[37] XML_3.98-1.5           AnnotationForge_1.16.0 bitops_1.0-6
[40] grid_3.3.2             RBGL_1.50.0            xtable_1.8-2
[43] GSEABase_1.36.0        gtable_0.2.0           DBI_0.5-1
[46] magrittr_1.5           scales_0.4.1           graph_1.52.0
[49] stringi_1.1.2          hwriter_1.3.2          genefilter_1.56.0
[52] limma_3.30.7           latticeExtra_0.6-28    brew_1.0-6
[55] rjson_0.2.15           RColorBrewer_1.1-2     tools_3.3.2
[58] Category_2.40.0        survival_2.40-1        yaml_2.1.14
[61] AnnotationDbi_1.36.0   colorspace_1.3-2       memoise_1.0.0
[64] knitr_1.15.1          

References

Szabo, Linda, Robert Morey, Nathan J. Palpant, Peter L. Wang, Nastaran Afari, Chuan Jiang, Mana M. Parast, Charles E. Murry, Louise C. Laurent, and Julia Salzman. 2015. “Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development.” Genome Biology 16: 126. doi:10.1186/s13059-015-0690-5.