README for SeqMap: Short Sequence Mapping Tool version 1.0.8

Hui Jiang

--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------
1. Overview
2. Usage
3. Input Format
4. Output Format
5. Program Parameters
6. SeqMap Algorithms
7. Acknowledgements

--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------

SeqMap is a tool for mapping large numbers of oligonucleotides to a genome. It
is designed to find all the places in a genome where an oligonucleotide could
potentially come from. The oligonucleotides can be either generated by a
high-throughput sequencing machine or extracted from the probes on a
high-density microarray. With a carefully designed index-filtering algorithm
and an efficient implementation, SeqMap can map tens of millions of short
sequences to a genome of several billion nucleotides. During the mapping,
several mutations and insertions/deletions of nucleotide bases in the
sequences can be tolerated and detected. Various input and output formats are
supported, as well as many command line options for tuning almost every step
of the mapping process. A typical mapping can be done in a few hours on an
ordinary PC. Running SeqMap in parallel on a cluster is also straightforward.

Note: in this document, the terms "short sequences", "reads", "tags", "probes",
and "oligonucleotides" all refer to the sequences that are to be mapped. The
terms "transcripts" and "genome sequences" both refer to the sequences that
are mapped to.

--------------------------------------------------------------------------------
2. Usage
--------------------------------------------------------------------------------

Using SeqMap is straightforward. First, download the program from its website.
To build the executable from source instead, download the source package and
compile it with the following command:

g++ -O3 -m64 -o seqmap match.cpp

This assumes a g++ compiler and a 64-bit OS. To build a 32-bit version, remove
the -m64 option and add the -m32 option if needed. Most other compilers have
similar options.

Then, use the following command to do the mapping:

./seqmap <num_mismatch> <probe_file> <trans_file> <output_file> [options]

Parameters enclosed in "<" and ">" are required, and parameters enclosed in
"[" and "]" are optional. <num_mismatch> is the maximum number of mismatched
base pairs that will be tolerated and detected during the mapping; it also
counts insertions and deletions when they are enabled (see /allow_insdel in
Section 5). <probe_file> is the input file which contains the sequences to be
mapped. <trans_file> is the input file which contains the sequences to be
mapped to. <output_file> is the output file. [options] are the optional
command line parameters.
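
When SeqMap is driven from a script, the command line above can be assembled
programmatically. The following is a minimal Python sketch, not part of SeqMap
itself: the helper name build_seqmap_command and the file names are
hypothetical, and only options listed in Section 5 are used.

```python
import shlex


def build_seqmap_command(num_mismatch, probe_file, trans_file, output_file,
                         options=(), executable="./seqmap"):
    """Assemble the argument list in the positional order described above.

    Hypothetical helper for illustration; it only builds the list and does
    not check that the options are valid SeqMap options.
    """
    if num_mismatch < 0:
        raise ValueError("num_mismatch must be non-negative")
    return [executable, str(num_mismatch), probe_file, trans_file,
            output_file, *options]


# Example file names are placeholders.
cmd = build_seqmap_command(
    2, "probes.fa", "genome.fa", "output.txt",
    options=["/output_all_matches", "/allow_insdel:1"],
)
print(shlex.join(cmd))
# To actually run the mapping: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting
problems when file names contain spaces.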

For details of the input file formats, the output file formats and the
optional parameters, please refer to Sections 3, 4 and 5, respectively.

--------------------------------------------------------------------------------
3. Input Format
--------------------------------------------------------------------------------

Currently SeqMap supports two formats for the input probe file: FASTA format
and raw DNA sequence format with one sequence per line. The reference genome file
has to be in FASTA format. In FASTA format, each sequence has two parts: a tag
line beginning with ">" and the sequence itself, which can span multiple lines.

Here is an example of a FASTA file:

>1
AATATGAAATCGGGCATTCGTAAGA
>2
AGAAAATCGGACCACAAGAATTGGC
>3
AAGCCGGTTAAAAAATAAACTAAGT
>4
GCGGATGTTCCTATACACCGAGTCG
>5
ACGAAGATTGGTGAAGAGAAAACTA
>6
CTAGAGGCGTTAACCGACATTGTTA
>7
GTGTTTTCGCGCCCTGTTCTGCCAG
>8
GTCGGTGTCGCGGGCCTTGCGCGGA
>9
AACTCTAATTACAAATTATACTTTA
>10
ACCGAGACTCGTAGAATATCATTTT
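
A file like the one above can be read with a few lines of code. The following
Python sketch is only an illustration (read_fasta is a hypothetical helper, not
SeqMap's own parser); it joins sequences that span multiple lines and treats
everything after ">" as the tag.

```python
def read_fasta(lines):
    """Yield (tag, sequence) pairs from an iterable of FASTA lines."""
    tag, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if tag is not None:
                yield tag, "".join(chunks)  # finish the previous record
            tag, chunks = line[1:], []
        elif tag is not None:
            chunks.append(line)  # sequence data, possibly one of many lines
    if tag is not None:
        yield tag, "".join(chunks)  # finish the last record


example = """>1
AATATGAAATCGGGCATTCGTAAGA
>2
AGAAAATCGG
ACCACAAGAATTGGC
"""
records = list(read_fasta(example.splitlines()))
for tag, seq in records:
    print(tag, len(seq), seq)
```

For a real file, pass the open file object instead of example.splitlines().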

For most users, reference genomes in the FASTA format can be downloaded at
http://hgdownload.cse.ucsc.edu/downloads.html

A detailed description of the FASTA format can be found at 
http://en.wikipedia.org/wiki/Fasta_format

--------------------------------------------------------------------------------
4. Output Format
--------------------------------------------------------------------------------

SeqMap can generate several output formats. By default, it outputs in Eland
format. This is also the format used when SeqMap runs with the option "/eland"
or "/eland:n", where n can be 1, 2 or 3 (the default is 2).

Another output format is selected with the option "/output_all_matches". In this
format, SeqMap outputs all the mapped targets in genome coordinate order, i.e.,
as it scans the genome, it outputs each match as soon as it finds one. An example
of the output file in this format is given below:

trans_id        trans_coord     target_seq      probe_id        probe_seq       num_mismatch
1       313902  AACTCCGGGAGGGCCGCTTTGTATG       509644  AACTCCGGGAGTGCCGCTTTGTAGG       2
1       423680  TTTCACAATCAATGGATCAGGCCGC       129326  TTTCACAATCATTGGATCAGGCCAC       2
1       537816  CTTGAATTCAGTAAATAGTTTAACG       330515  CTTGAATTTAGTAAATAGTTTACCG       2
2       297292  CGTCAAATTTCGTCCTTTTCGCTGT       636826  CGTCAATTTTCGTCCTTTTCGGTGT       2
2       326279  CGTAGGACCATTCAGGCCGTTAAGC       986424  CGTAGGAGCATTCAGGCCGTTATGC       2
2       870729  GTTAACCTGTGGTAAGTAACGTAGT       433048  GTTAACCTGGGGTAAGTAACGTATT       2
3       204747  TAGCTCATTAACAGGGGATCTTAGG       917614  TAGCTCATTAATAGCGGATCTTAGG       2
3       601827  GTCGTTTTATTCCGCCTGGAGAGGT       321632  GTCGTCTGATTCCGCCTGGAGAGGT       2
3       674797  TCGCACTTGGGGCTAAATGGGCATC       336321  TCGCACTTCGGGCTAAATGGGAATC       2
3       927627  CAGCCAAAGATACGCAGCTCAGTCT       619563  GAGGCAAAGATACGCAGCTCAGTCT       2
4       305440  GACGGAAATCCATATAAGGTAGGGA       80583   GACGGAAATCGAGATAAGGTAGGGA       2

There are six columns in the output file. Their meanings are:

field           meaning
trans_id        ID of the transcript of the mapped target
trans_coord     coordinate of the mapped target in the transcript
target_seq      mapped sequence of the mapped target in the transcript
probe_id        ID of the mapped probe
probe_seq       sequence of the mapped probe
num_mismatch    total number of mutations (including insertions and deletions, if permitted) in the mapping
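
The six-column format can be loaded for downstream analysis with a short
script. The Python sketch below is an illustration only: the names Match and
parse_all_matches are hypothetical, and it assumes that columns are separated
by whitespace and that IDs contain no whitespace.

```python
from typing import NamedTuple


class Match(NamedTuple):
    trans_id: str
    trans_coord: int
    target_seq: str
    probe_id: str
    probe_seq: str
    num_mismatch: int


def parse_all_matches(lines):
    """Parse /output_all_matches output into a list of Match records."""
    matches = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "trans_id":
            continue  # skip blank lines and the header line
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        trans_id, coord, target_seq, probe_id, probe_seq, n = fields
        matches.append(Match(trans_id, int(coord), target_seq,
                             probe_id, probe_seq, int(n)))
    return matches


sample = """trans_id        trans_coord     target_seq      probe_id        probe_seq       num_mismatch
1       313902  AACTCCGGGAGGGCCGCTTTGTATG       509644  AACTCCGGGAGTGCCGCTTTGTAGG       2
4       305440  GACGGAAATCCATATAAGGTAGGGA       80583   GACGGAAATCGAGATAAGGTAGGGA       2
"""
matches = parse_all_matches(sample.splitlines())
for m in matches:
    print(m.probe_id, "->", m.trans_id, m.trans_coord, m.num_mismatch)
```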

If SeqMap runs with the option "/output_statistics", it outputs in another
format, in which the mapped targets are listed in the original order of the
probes in the input file. Here is an example of an output file in this format:

probe_id        #mismatch=0     #mismatch=1     #mismatch=2     trans_id        coord   #mismatch       trans_id        coord   #mismatch       trans_id        coord   #mismatch
15385   0       1       0       7       147096  1
48341   0       0       1       5       364275  2
80583   0       0       1       4       305440  2
129326  0       0       1       1       423680  2
151804  0       0       1       6       177193  2
193752  0       0       1       8       289880  2
218856  0       0       1       7       516924  2

As shown above, the number of columns in this format is not fixed: it varies
with the parameters used. The meanings of the columns are:

field           meaning
probe_id        ID of the mapped probe
#mismatch=n     number of mapped targets with n mutations (including insertions and deletions, if permitted)
trans_id        ID of the transcript of the mapped target
coord           coordinate of the mapped target in the transcript
#mismatch       total number of mutations (including insertions and deletions, if permitted) in the mapping
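
Because the number of columns varies, a parser for this format has to work out
the layout from the header. The Python sketch below is an illustration only
(parse_statistics is a hypothetical helper). It assumes the layout seen in the
example above: whitespace-separated columns, one "#mismatch=n" count column per
n, followed by zero or more (trans_id, coord, #mismatch) triples per probe.

```python
def parse_statistics(lines):
    """Parse /output_statistics output.

    Returns {probe_id: (counts, targets)}, where counts[n] is the number of
    targets with n mutations and targets is a list of
    (trans_id, coord, num_mismatch) tuples.
    """
    lines = iter(lines)
    header = next(lines).split()
    # Count columns are named "#mismatch=0", "#mismatch=1", ...
    n_counts = sum(1 for name in header if name.startswith("#mismatch="))
    result = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        probe_id = fields[0]
        counts = [int(x) for x in fields[1:1 + n_counts]]
        rest = fields[1 + n_counts:]
        if len(rest) % 3 != 0:
            raise ValueError(f"incomplete target triple in line: {line!r}")
        targets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                   for i in range(0, len(rest), 3)]
        result[probe_id] = (counts, targets)
    return result


sample = """probe_id #mismatch=0 #mismatch=1 #mismatch=2 trans_id coord #mismatch
15385   0       1       0       7       147096  1
48341   0       0       1       5       364275  2
"""
stats = parse_statistics(sample.splitlines())
for probe_id, (counts, targets) in stats.items():
    print(probe_id, counts, targets)
```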

--------------------------------------------------------------------------------
5. Program Parameters
--------------------------------------------------------------------------------

Many parameters/options can be given when running SeqMap. Their meanings are
listed below (parameters/options marked with a * are for advanced users only):

parameters/options                              meaning
<number of mismatches>                          maximum number of mutations (including insertions and deletions, if permitted) in the mapping
<probe file name>                               name of the input file which contains the sequences to be mapped
<transcript FASTA file name>                    name of the input file which contains the sequences to be mapped to
<output file name>                              name of the output file
[/eland[:style]]                                output in Eland format, style can be 1, 2 or 3, default is 2
[/output_top_matches:num_top_matches]           output the top "num_top_matches" targets
[/forward_strand]                               search forward strand of the genome sequences only, default is to search both strands
[/allow_insdel:num_insdel]                      enable insertions and deletions in the mapping, with at most "num_insdel" insertions/deletions allowed; default is disabled
[/cut:start,end]                                take the [start,end] portion (both included) of the probes for the mapping
[/match_shorter_probes]                         match probes that are shorter than probe_len
[/skip_N]                                       skip probes that have N or .
[/no_repeats]                                   do not search repeat regions (lowercase letters) in the genome
[/silent]                                       run in silent mode, suppressing most debug output; default is verbose mode
[/available_memory:memory_size(in MB)]          disable memory detection by providing the available memory manually
[/zero_indexed]                                 output coordinates in 0-indexed manner
[/output_all_matches]                           output all matches
[/exact_mismatch]                               output targets with exact number of mismatches only
*[/output_alignment]                            output the alignment between the pair of sequences in /output_all_matches mode, default is disabled
*[/output_statistics]                           output in probe order                   
*[/do_not_output_probe_without_match]           do not output probe if it has no target
*[/do_not_duplicate_probes]                     do not duplicate probes for reverse strand search
*[/use_hash]                                    use hash method instead of index-filtering method, if possible
*[/no_hash_1bp_mismatch]                        do not use hash method if num_mismatch is not zero
*[/limit_num_part:max_num_part]                 limit the maximum number of parts that a probe can be split into; may reduce memory usage at the cost of longer running time
*[/interpolation_search]                        use interpolation search in the algorithm instead of binary search
*[/shift_mask]                                  may reduce memory usage at the cost of longer running time
*[/no_store_key]                                may reduce memory usage at the cost of longer running time
*[/no_fast_index]                               may reduce memory usage at the cost of longer running time
*[/fast_index_fraction:fraction]
*[/no_filter_results]
*[/filter_selected_probes:selected_probes_file] filter the probes in file "selected_probes_file"
*[/output_filtered_probes]                      output the filtered probes
*[/filter_low_quality_probes]                   filter low quality probes

--------------------------------------------------------------------------------
6. SeqMap Algorithms
--------------------------------------------------------------------------------

SeqMap indexes the short sequences rather than the genome sequences. Given the
maximum numbers of allowed mutations, insertions and deletions, SeqMap splits
each short sequence into several parts. By requiring only some of the parts,
rather than all of them, to match exactly, non-candidates can be eliminated in
the very first step. All the potential candidates are then collected, and a
local alignment algorithm is run on them to determine the final matched targets.
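
The filtering idea can be illustrated with a deliberately simplified Python
sketch. This is not SeqMap's actual implementation: it handles mutations
(substitutions) only, searches one strand, assumes all probes have the same
length, and replaces the local alignment step with a plain mismatch count. It
uses the simplest form of the split: a probe cut into k+1 parts must have at
least one part intact if it matches a target with at most k mutations. All
function names are hypothetical, and reported positions are 0-indexed.

```python
from collections import defaultdict


def split_bounds(length, num_parts):
    """Return (start, end) bounds cutting range(length) into num_parts pieces."""
    return [(i * length // num_parts, (i + 1) * length // num_parts)
            for i in range(num_parts)]


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_probes(probes, genome, max_mismatch):
    """Return sorted (position, probe_id, mismatches) hits on one strand."""
    if not probes:
        return []
    length = len(next(iter(probes.values())))
    if any(len(seq) != length for seq in probes.values()):
        raise ValueError("this sketch assumes equal-length probes")
    bounds = split_bounds(length, max_mismatch + 1)

    # Index the probes, not the genome: (part number, part sequence) -> IDs.
    index = defaultdict(list)
    for probe_id, seq in probes.items():
        for part, (start, end) in enumerate(bounds):
            index[part, seq[start:end]].append(probe_id)

    hits = set()
    for pos in range(len(genome) - length + 1):
        window = genome[pos:pos + length]
        for part, (start, end) in enumerate(bounds):
            # Filter: only probes sharing this exact part are candidates.
            for probe_id in index.get((part, window[start:end]), ()):
                d = count_mismatches(probes[probe_id], window)
                if d <= max_mismatch:  # verify the candidate
                    hits.add((pos, probe_id, d))
    return sorted(hits)


genome = "TTACGTACGGATCCATT"
probes = {"p1": "ACGTACGG", "p2": "GGATGCAT"}
hits = map_probes(probes, genome, max_mismatch=1)
print(hits)
```

Most genome windows share no exact part with any probe, so they are rejected
by a dictionary lookup without ever being compared base by base.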

Similar algorithms have been used in several papers and software packages.
However, to the best of the author's knowledge, SeqMap is the first to use this
approach for insertion/deletion detection.

See the SeqMap paper for more details:
Hui Jiang and Wing Hung Wong (2008)
SeqMap: mapping massive amount of oligonucleotides to the genome.
Bioinformatics. doi: 10.1093/bioinformatics/btn429
http://bioinformatics.oxfordjournals.org/cgi/reprint/btn429v1

--------------------------------------------------------------------------------
7. Acknowledgements
--------------------------------------------------------------------------------

SeqMap was developed and tested with the help of members and several
collaborators of the Wing Wong lab.