rSeq: RNA-Seq Analyzer

README for rSeq: RNA-Seq Analyzer

Hui Jiang

Table of Contents

Introduction

Usage

Download and install

Gene and isoform expression estimation

License

Method and References

Acknowledgements

Introduction

rSeq is a set of tools for RNA-Seq data analysis. It consists of programs that deal with many aspects of RNA-Seq data analysis, such as reference sequence generation, sequence mapping, gene and isoform expressions (RPKMs) computation, etc. There are also many other features that will be gradually added into rSeq in the future.

Usage

Download and install

Simply download the zipped file from the rSeq website and unzip the file to somewhere on the hard drive. If you downloaded the source code, simply type "make" to compile it. You will need a recent C++ compiler which supports C++11 standard.

Gene and isoform expression estimation

The procedure for gene and isoform expression estimation consists of three steps: transcript sequence preparation, read mapping and expression estimation.

Note: since rSeq 0.1.0 a new way for generating transcripts and estimating gene and isoform expression values is used. The old way (here are the old user manual and old FAQ) is still supported but not recommended.

[1] Prepare transcript sequences.

Transcript sequences should be stored in a file in the FASTA format. Name different transcripts (isoforms) of a gene as gene_name$$isoform_id such as WASH5P$$NR_024540 and here is an example. Note: gene_name will be the index used by rSeq to output gene expression. So if you want something other than gene names, such as Entrez gene ID, you should use them in the transcript sequence file.

If you can prepare transcripts according to the above format then you can jump to step 2 directly. rSeq also provides two methods for transcript sequence generation. The first method is easier but it only works with the UCSC refFlat annotation. The second method is slightly more complicated but it is recommended, and it works with gene annotation in the GTF format (e.g., the ENCODE annotation), UCSC refFlat format and UCSC KnownGene format (BED format). The second method is also required to generated alignment results in BED files (described in more details in a later section).

Method 1) Transcript sequences for many species can be downloaded from the UCSC genome browser. For example, weekly updated human RefSeq mRNA sequences from GenBank can be download at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/refMrna.fa.gz. Then you need to convert the transcript sequence names downloaded from the UCSC genome browser in the format needed by rSeq by running rSeq with the following command:

rseq annotate_transcripts refMrna.fa kgXref.txt

Where "refMrna.fa" is the unzipped file containing transcript sequence downloaded from the UCSC genome browser. "kgXref.txt" is the unziped file containing the gene name/transcript id corresponding table also download from the UCSC genome browser. For human genome, it is at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/kgXref.txt.gz. When finished, you will get a "refMrna.fa.new.fa" file, which contains all the annotated transcript sequences in FASTA format.

Method 2) Download gene annotation file in UCSC refFlat format, UCSC known Gene format (BED format) or the GTF format (e.g., the ENCODE annotation). As an example, the refFlat format annotation file for hg19 can be downloaded at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz.Also download the genome sequences. For example, the genome sequences of all the chromosomes for hg19 can be downloaded at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz. You can then generate transcript sequences in the format needed by rSeq by running rSeq with the following command (here I used the refFlat format as an example, but it should also work for annotation files in GTF format or BED format).

rseq generate_transcripts refFlat.txt chrfilelist.txt

where "refFlat.txt" is the annotation file in UCSC refFlat format and "chrfilelist.txt" is the list of filenames of DNA sequences of all the chromosomes. And here is an example. When finished, you will get a "refFlat.txt.fa" file, which contains all the transcript sequences in FASTA format.

Note: starting from rSeq 0.2.1, generate_transcripts also accepts the whole genome in a single fasta file, i.e.

rseq generate_transcripts refFlat.txt genome.fasta

From now on I will demonstrate using rSeq with transcript sequences generated with method 1. If you are using method 2, simply replace all the "refMrna.fa.new.fa" with "refFlat.txt.fa" in the command lines.

[2] Map the reads to the transcript sequences.

The rSeq package includes SeqMap as the default aligner, which usually works well with ~10M reads for desktop PCs with 2GB memory. For larger data sets, you can either use a computer with larger memory or split the read file into several files, map each file respectively and then concatenate the final mapping results as one single file.

If you can handle it technically, it is recommended that you use other aligners which output in either the SAM format or the Eland-multi format. We have tested rSeq with SeqMap, Eland, BWA, Bowtie and Bowtie2. Aligners other than SeqMap need to be installed by yourself. Simply go to the aligner's website and follow its instructions to download and install it. Each aligner has its own options and features. Here I only summarize the most basic usage of a few aligners. For better control of the alignment procedure or advanced features, please follow the user manual of the particular aligner that you are using.

For SeqMap (provided togetherthe rSeq package), use the following command,

seqmap 2 reads.input refMrna.fa.new.fa reads.mapped.output /eland:3

Where "reads.input" is the file containing the sequencing reads (in plain read sequences, FASTA or FASTQ format, see SeqMap's user manual for more details). refMrna.fa.new.fa" is the annotated transcript sequence file. "reads.mapped.output" is the output file. For paired-end reads, map the two files for each end separately. Please refer to SeqMap's user manual for more options.

For BWA, first use the following command to index the transcript sequences:

bwa index refMrna.fa.new.fa

Then use the following commands to align single-end reads:

bwa aln refMrna.fa.new.fa reads.fastq > reads.bwa.sai
bwa samse -n 100 -f reads.mapped.sam refMrna.fa.new.fa reads.bwa.sai reads.fastq

Where "reads.fastq" is the read file in the FASTQ format. BWA (and Bowtie and Bowtie2) also accept FASTQ or other input formats. See their user manuals for the usages. "reads.mapped.sam" is the output file in SAM format. For paired-end reads, either map files for each end separately, or use the following commands:

bwa aln refMrna.fa.new.fa reads_1.fastq > reads_1.bwa.sai
bwa aln refMrna.fa.new.fa reads_2.fastq > reads_2.bwa.sai
bwa sampe -a 10000 -n 100 -f reads.mapped.sam refMrna.fa.new.fa reads_1.bwa.sai reads_2.bwa.sai reads_1.fastq reads_2.fastq

Please refer to BWA's user manual for more options.

For Bowtie, first use the following command to index the transcript sequences:

bowtie-build refMrna.fa.new.fa refMrna.fa.new.fa

Then use the following command to align single-end reads:

bowtie -k 100 -S refMrna.fa.new.fa reads.fastq reads.mapped.sam

For paired-end reads, either map files for each end separately, or use the following command:

bowtie -k 100 -S refMrna.fa.new.fa -1 reads_1.fastq -2 reads_2.fastq reads.mapped.sam

Please refer to Bowtie's user manual for more options.

For Bowtie2, first use the following command to index the transcript sequences:

bowtie2-build refMrna.fa.new.fa refMrna.fa.new.fa

Then use the following command to align single-end reads:

bowtie2 -x refMrna.fa.new.fa -U reads.fastq -S reads.mapped.sam

For paired-end reads, either map files for each end separately, or use the following command:

bowtie2 -x refMrna.fa.new.fa -1 reads_1.fastq -2 reads_2.fastq -S reads.mapped.sam

Please refer to Bowtie2's user manual for more options.

For Eland (provided by Illumina), run with option "--multi", i.e.,

eland reads.txt refMrna.fa.new.fa.squashed/ reads.mapped.output --multi

Where "reads.txt" is the file of the sequence reads in plain read sequences. refMrna.fa.new.fa.squahshed/" is the directory containing the squashed sequences generated with the "squashGenome" tool in the Eland package. For paired-end reads, either map files for each end separately.

[3] Compute gene and isoform expressions.

Use the following command:

rseq expression_analysis refMrna.fa.new.fa mapped.reads.output1 [mapped.reads.output2]

where "mapped.reads.output1" (and also mapped.reads.output2 for paired-end reads when each ends are mapped individually) are the output files from the aligners, in either the Eland-multi format or the SAM format. When finished, you will get an output file: "mapped.reads.output1.exp.xls" (together with several other files for diagnostic purpose which you should simply ignore). It contains the estimated gene and isoform expression values, indexed by gene names and isoform IDs. It is a tab-delimited text file with one gene per line and can be opened by any text editor or Excel. For example, the following line means that gene SDF4 has an expression estimate 103.293, in the unit of RPKM. The gene has two isoforms NM_016547 and NM_016176 with expression estimates 0 and 103.293 respectively.

SDF4 103.293 2 NM_016547,NM_016176 0,103.293

Here is an example of the output file.

[4] Generate BED alignment file.

rSeq can generate alignment file in UCSC BED format, which can be visualized in UCSC Genome Browser as a custom track or in CisGenome Browser. Use the following command:

rseq expression_analysis -a refFlat.txt refFlat.txt.fa mapped.reads.output1 [mapped.reads.output2]

where "refFlat.txt" is the gene annotation file in UCSC refFlat format and "refFlat.txt.fa" is the transcript sequence file generated using with the second method, i.e., "rseq generate_transcripts refFlat.txt chrfilelist.txt". When finished, you will get a "mapped.reads.output1.reads.bed" file, which contains all the uniquely aligned reads in UCSC BED Format. Here and Here are two schreenshots of visualizing aligned paired-end RNA-Seq reads in CisGenome Browser.

License

Anyone can use the source codes, documents or the excutable file of rSeq free of charge for non-commercial use. For commercial use, please contact the author.

Method and References

For isoform-specific gene expression estimation, rSeq uses the method described in

Jiang, H., Wong, W.H. (2009) Statistical Inferences for Isoform Expression in RNA-Seq, Bioinformatics, 25(8), 1026–1032. [online]

For paired-end RNA-Seq analysis, rSeq uses the method described in

Salzman, J., Jiang, H., Wong, W. H. (2011) Statistical Modeling of RNA-Seq Data, Statistical Science, 26 (1): 62-83. [online][arxiv]

Acknowledgements

rSeq was developed and tested with the help of the members and several collaborators of the Wong lab.