rSeq: RNA-Seq Analyzer

FAQ for rSeq: RNA-Seq Analyzer (for rSeq version 0.0.7, not recommended)

You've developed a Baysian approach to estimate confidence intervals for isoform expression estimates in the paper "Statistical Inferences for Isoform Expression in RNA-Seq". Is that algorithm implemented in rSeq?

Yes or No. That work was written in MATLAB therefore is not part of rSeq yet. I am planning to rewrite them in C++ and then put into rSeq. For the MATLAB code that does Bayesian inference, please follow the instructions here.

When I run gen_trans it loads and reads all the chromosome files but it doesn't write anything in the refFlat.txt.fa file. What's wrong?

You need to run the command in the directory of the chromosome sequences, or you will have to put the full path of the chromosome files in chrlist.txt.

How much memory does rSeq need?

Usually 2GB memory should be sufficient. For more than 10M reads, SeqMap may need more than 2GB memory, which usually requires a 64-bit system. To reduce memory usage, you can split the read file into two or more parts and map each part separately and then concatenate the mapping results.

How can I splite the read file and combine the results?

Use linux command "split" to split the reads, "cat" to concatenate the results.

What seqmap does with duplicate reads? does it count them in the output but just leave them out of the alignment?

SeqMap finds all the duplicate reads, maps one of them, and replicates the result for the rest of them. So don't worry, all of them are there in the output file.

Are the mapping results in Seqmap output in strict order? I'm planning to split my reads data into several pieces, map them with Seqmap and combine the results as input to rSeq. I'm wondering whether this partition will break the structure in Seqmap output file.

Yes, they are in strict order. You can go ahead splitting the data files.

Can rSeq compute expressions on SeqMap data run with a 3 bp mismatch allowance?

Yes, simply run rSeq with option "-n 3".

I’m trying to use gen_trans and rseq will only recognize one file, then everything else in my file list is invalid. I checked a handful of the files and they are all in the correct fasta format.

You may have created or edited the chrfilelist.txt file in Windows and then used it in Linux. Windows and Linux have different ways of encoding text files. If that's reason, you can either create and edit the file in Linux directly (e.g., using vi) or transform them into Linux format using command unix2dos.

I’m mapping some reads to a hg18 fasta file and I’m getting this message as seqmap runs: "......Bad charactor R found when processing transcript chr16_random. Skipped......." What does this mean? Is it skipping the entire chromosome? Or just that bp?

Don't worry. Only that base is skipped.

I got the following error in running SeqMap:
bad format in line: @SRR002051.1 :8:1:325:773 length=33 AAAGAACATTAAAGCTATATTATAAGCAAAGAT NM
internal error: read file failed

It is because you have space in your read tag "@SRR002051.1 :8:1:325:773 length=33". Changing it to ""@SRR002051.1" should solve the problem. You should use /eland:3 in the mapping.

SeqMap told me that there is not enough memory but actually I have.

Sometimes SeqMap's memory detection function has some problem. If you are sure that is the case you can turn it off by using option "/available_memory:8000" to specify say 8G available memory as in the example.

When I run "rseq gen_trans" I got the error message " is not a vaid fasta file." and only one file in my file list is processed.

Did you edit the file "chrfilelist.txt" in Windows and then use it in Linux? Windows and Linux have different ways of encoding text files. If that's reason, you can either create and edit the file in Linux directly (e.g., using vi) or transform them into Linux format using command unix2dos.

What happes if a gene has multiple isoforms when I run "rseq comp_exp"?

When multiple isoforms are annotated with the same gene, rseq will take the longest of them to compute the gene length.

Where can I download the UCSC refFlat files, say, for Arabidopsis Thalina, in order to generate the transcript sequences?

It seems that UCSC does not provide support for Arabidopsis so we will have to prepare the file by ourselves (based on the NCBI release). Fortunately, you can download the package prepared by Hongkai Ji at Johns Hopkins at http://www.biostat.jhsph.edu/~hji/cisgenome/index_files/download.htm, choose the Genome Database for TAIR8 and unzip it. You can find the genome sequences and the refFlat.txt file (in subdirectory "annotation").

How does rSeq the 95% confidence intervals for gene expression estimates?

The 95% confidence intervals are computed by using normal approximation of binomial random variables, i.e., we assume the number of reads mapped to a gene is a binomial random variable with parameters (p, N), where p is the gene expression, and N is the total number of mapped reads.

What makes rSeq any different than ERANGE?

rSeq is a set of tools for RNA-Seq analysis, including quality control, sequence alignment, gene expression calculation, isoform expression calculation and so on. For gene expression calculation, rSeq uses the Mortazavi's approach to estimate gene level RPKM and to split multi-reads. However, rSeq maps the reads to annotated transcripts rather than genome+splices as does in ERANGE, which can help reduce multi-reads, reduce running time, fully exploit splice reads and etc. For isoform expression, rSeq uses the approach described in "Statistical Inferences for isoform expression in RNA-Seq", Bioinformatics, 25(8):1026-1032, (2009).

Can rSeq output refseq names along with gene names?

One gene can have several transcripts each with a different RefSeq ID. If you feed rSeq with an annotation file with Entrez gene IDs instead gene names, rSeq will output in Entrez gene IDs.

Can you tell me how you calculated the transcript length for each gene? For example, for this gene: Mrc2 (NM_008626). My calculation is: 5801 but in Rseq output, the length is 5765. It seems your length is always a bit shorter (sometimes a lot) than mine. Did you exclude some regions from calculation?

rSeq uses effective gene length, which is the actual gene length minus the read length, which is the number of possible different reads that
you can get from that gene.

Is it important to specify the read length "-r" option, if I do not specify this, will the program automatically detect the length of the reads?

If you don't specify it, it will take the default value 25. In the isoform expression case, it is important to specify the read length.

What command will tell me the RSAT version I’m using?

Sorry, no such command yet. To make sure you are using the latest version, you can download it again from the website.

Does rSeq do de novo isoform discovery?

No, it does not. For de novo splicing event discovery, you can try SpliceMap.

Can I use other mapping software such as MAQ or Bowtie with rSeq?

No, right now only SeqMap and Eland are supported. The mapping usually take only a few minutes.

Does rSeq accept GFF file as gene annotations?

No, it does not.

Does rSeq output per exon RPKMs?

No, it does not.

Does rSeq do differential expression?

No, it does not.

Even though I have an annotation annotation.txt that looks valid, I don't get any subexons:
"loading refFlat file: annotation.txt
0 duplicate transcripts.
checking data.
total 0 genes, 0 transcripts, 0 genes have unique transcript, max num transcript a gene is 0.
0 genes have too small overlap.
generating results.
generating subexons.
checking subexons.
generating subexon vectors.
checking subexon vectors.
writing subexon file: f.txt.subexons.txt Loading fasta file E.fa.
1 sequences read, total length is 2974."

Make sure there is no white space, such as spaces, and except tabs in your annotation file.