rSeq: RNA-Seq Analyzer

README for rSeq: RNA-Seq Analyzer (for rSeq version 0.0.7, not recommended)

Hui Jiang

Table of Contents

Overview

Usage

Download and install

Gene expression estimation

Isoform-specific gene expression estimation

Paired-end RNA-Seq

License

Method

Acknowledgements

Overview

rSeq is a set of tools for RNA-Seq data analysis. It consists of programs that deal with many aspects of RNA-Seq data analysis, such as reference sequence generation, sequence mapping, gene and isoform expressions (RPKMs) computation, etc. There are also many other features that will be gradually added into rSeq in the future.

Usage

Download and install

Simply download the zipped file from the rSeq website and unzip the file to somewhere on the hard drive. If you are downloading the source code files, simply type "make" to compile it.

Gene expression estimation

The procedure of gene expression estimation consists of three steps: transcript sequence preparation, read mapping and gene expression estimation.

[1] Prepare transcript sequences.

Transcript sequences should be in a file of FASTA format. Here is an example:

>WASH5P
CCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTC
AGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACA
CAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACA
GAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGA
AGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAG
GTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGT
CTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGT
CCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATG
GAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGG
TCCTTTCCCAGAGATGCCCTTGCGCCTCATGACCAGCTTGTTGAAGAGAT

If a gene has several transcripts (isoforms), name the transcripts as gene_id$$transcript_id, such as

>WASH5P$$NR_024540
CCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTC
AGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACA
CAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACA
GAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGA
AGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAG
GTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGT
CTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGT
CCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATG
GAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGG
TCCTTTCCCAGAGATGCCCTTGCGCCTCATGACCAGCTTGTTGAAGAGAT

Note: gene_id will be the index used by rSeq to output gene expression. So if you want something other than gene names, such as Entrez gene ID, you should use them in the transcript sequence file.

For your convenience, rSeq provides an option "gen_trans" to generate transcript sequences, e.g.,

rseq gen_trans refFlat.txt chrfilelist.txt

Where "refFlat.txt" is the annotation file in UCSC refFlat format. As an example, the refFlat format annotation file for hg19 can be downloaded at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz

Notice that other gene annotations files downloaded from UCSC, such as knowGene.txt and ensGene.txt etc, are not in exactly the same format as the refFlat.txt, therefore manually converting to the refFlat format is needed before using them with rSeq.

"chrfilelist.txt" is the list of filenames of DNA sequences of all the chromosomes. For example:

/data/chr1.fa
/data/chr2.fa
/data/chr3.fa

As an example, the genome sequences of all the chromosomes for hg19 can be downloaded at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

When finish, you will get a "refFlat.txt.fa" file, which contains all the transcript sequences in FASTA format.

[2] Map the reads to the transcript sequences.

To map the reads, you can use either SeqMap (already provided in the rSeq package) or Eland (provided by Illumina with the sequencer). For SeqMap, use the following command,

seqmap 2 reads.txt refFlat.txt.fa output.txt /eland:3

For Eland, run with option "--multi", i.e.,

eland reads.txt refFlat.txt.fa.squashed/ output.txt --multi

Where "reads.txt" is the file of the sequence reads (in plain read sequences or FASTA format, see SeqMap or Eland's user manual for more details), "refFlat.txt.fa.squahshed/" is the directory containing the squashed sequences generated with the "squashGenome" tool in the ELAND package, "output.txt" is the output filename.

[3] Compute gene expressions. Use the following command:

rseq comp_exp -r rl refFlat.txt.fa output.txt

Where "rl" is the length of the reads (default is 25), "refFlat.txt.fa" is the transcript sequence file generated in step 1, and "output.txt" is the mapping output file generated in step 2.

When finished, you will get an output file: "output.txt.2.gene.exp". It contains the computed gene expressions, indexed by gene names. It is a tab-delimited text files with each gene a line. The fields for each gene are:

name: name or refseqID of the gene
ucount: count of unique reads
count: count of all reads, including unique reads and reweighted multi-reads
ratio: ratio of unique reads
total: total number of mapped reads
len: effective length of the gene (gene length - read length)
exp: expression value
lower: lower bound of the 95% confidence interval of gene expression (estimated by normal approximation of Binomial RVs)
upper: upper bound of the 95% confidence interval of gene expression
reduced_len: 0 by default, reserved for internal usage

Here is an example of the output file:

Isoform-specific gene expression estimation

For isoform-specific gene expression estimation, the three steps are similar as for gene expression estimation, with the following variations.

1) Prepare transcript sequences.

Run rSeq with option "gen_exons_junctions -v 2" to generate a new reference sequence file "refFlat.txt.exons_junctions.fa" and another file "refFlat.txt.subexons.txt", e.g.,

rseq gen_exons_junctions -v 2 -r rl refFlat.txt chrfilelist.txt

where rl is the read length, default is 25.

2) Map the reads.

Run Eland or SeqMap to remap the reads to file "refFlat.txt.exons_junctions.fa", e.g.,

seqmap 2 reads.txt refFlat.txt.exons_junctions.fa output.txt /eland:3

eland reads.txt refFlat.txt.exons_junctions.fa.squashed/ output.txt --multi

3) Isoform-specific gene expression estimation.

Run rSeq with option "comp_exp -v 2" with file "refFlat.txt.subexons.txt", e.g.,

rseq comp_exp -v 2 -r rl refFlat.txt.subexons.txt output.txt

Where "rl" is the length of the reads (default is 25), "refFlat.txt.subexons.txt" is the file generated in step 1, and "output.txt" is the mapping output file generated in step 2.

When finished, you will get an output file: "output.txt.2.b.exp". It contains the computed gene and isoform expressions, indexed by gene and isoform IDs. It is a tab-delimited text files with each gene a line. For example, the following line means that gene SDF4 has expression value 103.293, it has two isoforms NM_016547 and NM_016176 with expression values 0 and 103.293 respectively.

SDF4 103.293 2 NM_016547,NM_016176 0,103.293

Here is an example of the output file:

Paired-end RNA-Seq

For isoform-specific gene expression estimation based on paired-end RNA-Seq reads, the first two steps are the same as for single-end RNA-Seq reads, except that you will need to map the two files for both ends of the reads separately in step 2. For the third step, run rSeq with option "paired_end -v 2" with file "refFlat.txt.subexons.txt", e.g.,

rseq paired_end -v 2 -r rl refFlat.txt.subexons.txt output1.txt output2.txt

Where "rl" is the length of the reads (default is 25), "refFlat.txt.subexons.txt" is the file generated in step 1, and "output1.txt" and "output2.txt" are the two mapping output files for the two ends of reads generated in step 2.

When finished, you will get an output file: "output1.txt.exp", in the same format as the one that you can get with single-end RNA-Seq reads, as described above.

License

Anyone can use the source codes, documents or the excutable file of rSeq free of charge for non-commercial use. For commercial use, please contact the author.

Method

For gene expression estimation, rSeq uses the RPKM method described in

"Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq", Ali Mortazavi, et al., Nature Methods - 5, 621 - 628 (2008).

For isoform-specific gene expression estimation, rSeq uses the method described in

"Statistical Inferences for isoform expression in RNA-Seq", Hui Jiang and Wing Hung Wong, Bioinformatics, 25(8):1026-1032, (2009).

For paired-end RNA-Seq analysis, rSeq uses the method described in

"Statistical Modeling of RNA-Seq Data", Julia Salzman, Hui Jiang and Wing Hung Wong, Technical Report in Biostatistics, Stanford University (2010).

Acknowledgements

rSeq was developed and tested with the help of the members and several collaborators of the Wong lab.