FAQ for SeqMap: A Tool For Mapping Millions Of Short Sequences To The Genome
How to quick start?
1) Download the binary executable file according your platform, or download the source package and use a standard C++ compiler to generate a executable file on your machine. For example, using g++:
g++ -O3 -m64 -o seqmap match.cpp.
2) Prepare input sequences and reference genome in two FASTA format files, respectively. For example, file input.fa:
>1
AATATGAAATCGGGCATTCGTAAGA
>2
AGAAAATCGGACCACAAGAATTGGC
3) Invoke the program by typing the following command:
./seqmap 2 input.fa genome.fa output.txt
For Eland output format: (the default output format is /eland:2)
./seqmap 2 probes.fa trans.fa output.txt /eland (or /eland:n, where n=1,2 or
3)
Output format for all matches (use option /output_all_matches)
TranscriptId TranscriptCoord TranscriptSeq ProbeId ProbeSeq MismatchNumber [Strand]
Mostly used options:
use "./seqmap 3 probes.fa trans.fa output.txt" to search for 3bp mismatches.
use option "/cut:1,25" to take the first 25mer of the probes for the
mapping (default is to map the full-length probes).
use option "/allow_insdel:1" to allow 1bp insertion/deletion in the
mapping (default is disallowed).
use option "/forward_strand" to map to forward strand only (default
is to map to both strands).
I have a distinct FASTA file for each chromosome, what shall I do?
Simply concatenate the files to get a single FASTA file. For example, on Unix/Linux machines, use:
cat *.fa > genome.fa
What is the license for SeqMap?
There is no license require for non-commercial use of SeqMap, which means anyone can use the source codes, documents or the excutable file of SeqMap free of charge. For commercial use, please contact the author.
What platform does SeqMap run on?
Being written in ANSI C++ source code, SeqMap can be compiled and run on any platform.
What are the dependencies?
It doesn't have any dependencies so that you can just download it and run it immediately.
What is it written in?
It is written in ANSI C++.
How to get the source code?
The source code is freely downloadable from the website.
What input file formats does it support?
It supports FASTA file formats for input. Also, input files with each raw DNA sequence per line is also accepted.
What output file formats it generates?
It generates several output formats, which can be set by command line parameters. By default, it outputs in the Eland format.
Does SeqMap throw away non-unique mapped targets like Eland?
No, it doesn't. In output_all_matches mode, all targets will be output. In /output_statistics or /eland:3 mode, you can set parameter /output_top_matches:N to keep the top N targets. In /eland:1 or /eland:2 modes, only unique targets will be output.
How many mismatch/in/del does SeqMap allow?
There is a limit set by SeqMap to be maximum 5 mismatches (ins&del included) and maximum 3 ins&dels. But it is mostly set empirically because usually the memory usage, running time and number of targets will blow up with parameters beyond that.
What is the memory requirement?
The memory usage varies a lot with the running parameters. Basically, a PC with 2GB memory can do 3bp-mismatch matching for about 8M reads at a time. We are mostly running SeqMap on 64-bit machines with 8~16GB memory, where memory requirement usually is not a big problem.
When I was runing SeqMap under Linux, there is a error message returned:
./seqmap: error while loading shared libraries:
libstdc++.so.6: cannot open shared object file: No such file or directory
./seqmap: /lib/ssa/libstdc++.so.6: version `GLIBCXX_3.4' not found (required
by ./seqmap)
It seems that you are using an older version of GLIBC than I am using. But if you download the source codes and compile it on your machine, that should work.
If the reference sequence or the probe FASTA file has ambiguous DNA
code such as 'N', 'H', '.', etc, how does seqmap deal with it?
By default, SeqMap will try to match probes with 'N', although that particular
position will always be counted as a mismatch. Use option /skip_N to ignore
all probes has letters other than 'A', 'C', 'G' or 'T'. If the reference sequence
has other letters, that position will be skipped. Just try a simple example
and you will know how it works.
What algorithm does SeqMap use?
SeqMap is a fast sequence mapping software. Unlike BLAT, SeqMap indexes the
short sequences rather than the genome. Given the numbers of maximum allowed
mutations, insertions and deletions, SeqMap splits the short sequences into
several parts. By keeping some parts rather than all of them to be fixed, the
non-candidates can be eliminated in the very first step. All the candidates
that are left will then be collected and a local alignment algorithm will be
running on them to finally determine the matched targets. Similar algorithm
has been used several times in some papers and softwares (e.g., ELAND by Illumina/Solexa).
However, SeqMap can also do insertion/deletion detection and is very fast. Look
at the SeqMap paper for more details:
Hui
Jiang and Wing Hung Wong (2008)
SeqMap : mapping massive amount of oligonucleotides to the genome.
Bioinformatics. doi: 10.1093/bioinformatics/btn429
Can SeqMap do a 35-nt search?
Yes, it maps probes of any length by default.
Although Solexa usually generates 35-nt tags, the error rate goes up quickly after the 25th nt, so doing the mapping using the first 25-nt or 30-nt is reasonable and turns out to be quite successful.
How to output in genome coordinates order?
If you want to output in genome coordinates order, you can use option /output_all_matches.
Can SeqMap run in parallel?
Yes, the users can simply split the reads into several files or split the genome into several files (e.g., one FASTA file for each chromosome), and then invoke several SeqMap processes on a multi-core CPU or a multi-node cluster to do the mapping in parallel. Depending on the user¡¯s purpose for the mapping, some simple scripts or programs can be used to combine the mapping results at the end. We provide a sample C++ code for such purpose here, which works for the /output_statistics output format. For the default output format, just put all the output files together and no more action is needed.
Are 3 or more mismatches needed for mapping Solexa reads?
Yes, for reads that are longer than 30nt, the 3bp mismatch mapping gives mostly true signal rather than noise. We map 11M RNA-Seq reads to mouse chr19 using 2bp and 3bp mismatch mapping, respectively. The mapping results are shown in the following table, where we can see that 3bp mapping gives 18.5% more uniquely mapped reads and 42% of them fall into transcribed regions annotated by RefSeq genes, which occupies only 2~3% of the genome.
mapping |
uniquely mapped reads |
reads mapped to transcribed regions |
2bp mismatch |
308,095 |
195,986 |
3bp mismatch |
365,172 |
220,050 |