README for SeqMap: Short Sequence Mapping Tool
version 1.0.8
Hui Jiang

--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------

1. Overview
2. Usage
3. Input Format
4. Output Format
5. Program Parameters
6. SeqMap Algorithms
7. Acknowledgements

--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------

SeqMap is a tool for mapping large numbers of oligonucleotides to a genome. It
is designed to find all the places in a genome that an oligonucleotide could
have come from. The oligonucleotides can either be generated by a
high-throughput sequencing machine or extracted from the probes on a
high-density microarray. With a carefully designed index-filtering algorithm
and an efficient implementation, SeqMap can map tens of millions of short
sequences to a genome of several billion nucleotides. During the mapping,
several mutations and insertions/deletions of nucleotide bases in the sequences
can be tolerated and detected. Various input and output formats are supported,
as well as many command line options for tuning almost every step of the
mapping process. A typical mapping can be done in a few hours on an ordinary
PC. Running SeqMap in parallel on a cluster is also straightforward.

Note: in this document, the words "short sequences", "reads", "tags", "probes"
and "oligonucleotides" all refer to the sequences that are to be mapped. The
words "transcripts" and "genome sequences" refer to the sequences that are
mapped to.

--------------------------------------------------------------------------------
2. Usage
--------------------------------------------------------------------------------

SeqMap is easy to use. First, download the program from its website.
To compile the executable from the source code instead, download the source
package and compile it with the following command:

    g++ -O3 -m64 -o seqmap match.cpp

This assumes a g++ compiler and a 64-bit OS. For a 32-bit build, remove the
-m64 option and add the -m32 option if needed. Most other compilers have
similar parameters.

Then, use the following command to do the mapping:

    ./seqmap <number of mismatches> <probe FASTA file name> <transcript FASTA file name> <output file name> [options]

Parameters enclosed in "<" and ">" are required, and parameters enclosed in
"[" and "]" are optional.

<number of mismatches> is the maximum number of mismatched base pairs that
will be tolerated and detected during the mapping; it can also include
insertions and deletions if the user needs them.
<probe FASTA file name> is the input file that contains the sequences to be
mapped.
<transcript FASTA file name> is the input file that contains the sequences to
be mapped to.
<output file name> is the output file.
[options] are optional command line parameters.

For details of the input/output file formats and the optional parameters,
please refer to Sections 3, 4 and 5, respectively.

--------------------------------------------------------------------------------
3. Input Format
--------------------------------------------------------------------------------

Currently SeqMap supports two formats for the input probe file: FASTA format
and raw DNA sequence format with one sequence per line. The reference genome
file has to be in FASTA format. In FASTA format, each sequence has two parts:
a tag line, which starts with ">", and the sequence itself, which can span
multiple lines.
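The FASTA layout described above can be read with a few lines of C++. The
following is a minimal sketch for illustration only, not SeqMap's own parser.
The names read_fasta and fasta_usage_example are hypothetical, and the sketch
assumes that tag lines start with ">" and that every line up to the next tag
line belongs to the same sequence:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of SeqMap: reads FASTA records from a stream.
// Each record is returned as (tag, sequence); sequence lines are concatenated.
std::vector<std::pair<std::string, std::string>> read_fasta(std::istream& in) {
    std::vector<std::pair<std::string, std::string>> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            // A tag line starts a new record; drop the leading '>'.
            records.emplace_back(line.substr(1), std::string());
        } else if (!records.empty()) {
            // A sequence line extends the current record.
            records.back().second += line;
        }
        // Sequence data that appears before the first tag line is ignored.
    }
    return records;
}

// Usage example: parse a two-record FASTA snippet held in a string.
inline void fasta_usage_example() {
    std::istringstream in(">1\nAATATG\nAAATCG\n>2\nAGAAAA\n");
    auto records = read_fasta(in);
    assert(records.size() == 2);
    assert(records[0].second == "AATATGAAATCG");  // the two lines are joined
}
```

A raw probe file with one sequence per line needs no such parsing, since each
line is already a complete sequence.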
Here is an example of a FASTA file:

>1
AATATGAAATCGGGCATTCGTAAGA
>2
AGAAAATCGGACCACAAGAATTGGC
>3
AAGCCGGTTAAAAAATAAACTAAGT
>4
GCGGATGTTCCTATACACCGAGTCG
>5
ACGAAGATTGGTGAAGAGAAAACTA
>6
CTAGAGGCGTTAACCGACATTGTTA
>7
GTGTTTTCGCGCCCTGTTCTGCCAG
>8
GTCGGTGTCGCGGGCCTTGCGCGGA
>9
AACTCTAATTACAAATTATACTTTA
>10
ACCGAGACTCGTAGAATATCATTTT

For most users, reference genomes in FASTA format can be downloaded from
http://hgdownload.cse.ucsc.edu/downloads.html

A detailed description of the FASTA format can be found at
http://en.wikipedia.org/wiki/Fasta_format

--------------------------------------------------------------------------------
4. Output Format
--------------------------------------------------------------------------------

SeqMap can generate several output formats. By default it outputs in Eland
format. This is also the format used when SeqMap runs with the option "/eland"
or "/eland:n", where n can be 1, 2 or 3 (the default is 2).

Another output format is selected with the option "/output_all_matches". In
this format, SeqMap outputs all the mapped targets in genome coordinate order,
i.e., as it scans the genome it outputs each matched sequence as soon as it
finds one.
An example of an output file in this format is given below:

trans_id  trans_coord  target_seq                 probe_id  probe_seq                  num_mismatch
1         313902       AACTCCGGGAGGGCCGCTTTGTATG  509644    AACTCCGGGAGTGCCGCTTTGTAGG  2
1         423680       TTTCACAATCAATGGATCAGGCCGC  129326    TTTCACAATCATTGGATCAGGCCAC  2
1         537816       CTTGAATTCAGTAAATAGTTTAACG  330515    CTTGAATTTAGTAAATAGTTTACCG  2
2         297292       CGTCAAATTTCGTCCTTTTCGCTGT  636826    CGTCAATTTTCGTCCTTTTCGGTGT  2
2         326279       CGTAGGACCATTCAGGCCGTTAAGC  986424    CGTAGGAGCATTCAGGCCGTTATGC  2
2         870729       GTTAACCTGTGGTAAGTAACGTAGT  433048    GTTAACCTGGGGTAAGTAACGTATT  2
3         204747       TAGCTCATTAACAGGGGATCTTAGG  917614    TAGCTCATTAATAGCGGATCTTAGG  2
3         601827       GTCGTTTTATTCCGCCTGGAGAGGT  321632    GTCGTCTGATTCCGCCTGGAGAGGT  2
3         674797       TCGCACTTGGGGCTAAATGGGCATC  336321    TCGCACTTCGGGCTAAATGGGAATC  2
3         927627       CAGCCAAAGATACGCAGCTCAGTCT  619563    GAGGCAAAGATACGCAGCTCAGTCT  2
4         305440       GACGGAAATCCATATAAGGTAGGGA  80583     GACGGAAATCGAGATAAGGTAGGGA  2

There are six columns in the output file. Their meanings are:

field         meaning
trans_id      ID of the transcript of the mapped target
trans_coord   coordinate of the mapped target in the transcript
target_seq    sequence of the mapped target in the transcript
probe_id      ID of the mapped probe
probe_seq     sequence of the mapped probe
num_mismatch  total number of mutations (including insertions and deletions,
              if permitted) that occurred in the mapping

If SeqMap runs with the option "/output_statistics", it outputs in another
format, in which the mapped targets are sorted in the original order of the
probes in the input file. Here is an example of an output file in this format:

probe_id  #mismatch=0  #mismatch=1  #mismatch=2  trans_id  coord   #mismatch  trans_id  coord  #mismatch  trans_id  coord  #mismatch
15385     0            1            0            7         147096  1
48341     0            0            1            5         364275  2
80583     0            0            1            4         305440  2
129326    0            0            1            1         423680  2
151804    0            0            1            6         177193  2
193752    0            0            1            8         289880  2
218856    0            0            1            7         516924  2

The output file has several columns, and the number of columns varies with the
parameters used.
The meanings of these columns are:

field         meaning
probe_id      ID of the mapped probe
#mismatch=n   number of mapped targets with n mutations (including insertions
              and deletions, if permitted)
trans_id      ID of the transcript of the mapped target
coord         coordinate of the mapped target in the transcript
#mismatch     total number of mutations (including insertions and deletions,
              if permitted) in the mapping

--------------------------------------------------------------------------------
5. Program Parameters
--------------------------------------------------------------------------------

Many parameters/options can be given when running SeqMap. Their meanings are
listed below (parameters/options that begin with a * are for advanced users
only):

parameters/options
    meaning

<number of mismatches>
    maximum number of mutations (including insertions and deletions, if
    permitted) in the mapping
<probe FASTA file name>
    name of the input file that contains the sequences to be mapped
<transcript FASTA file name>
    name of the input file that contains the sequences to be mapped to
<output file name>
    name of the output file
[/eland[:style]]
    output in Eland format; style can be 1, 2 or 3, default is 2
[/output_top_matches:num_top_matches]
    output the top "num_top_matches" targets
[/forward_strand]
    search the forward strand of the genome sequences only; the default is to
    search both strands
[/allow_insdel:num_insdel]
    enable insertions and deletions in the mapping; at most "num_insdel"
    insertions/deletions are allowed; disabled by default
[/cut:start,end]
    use only the [start,end] portion (both ends included) of the probes for
    the mapping
[/match_shorter_probes]
    match probes that are shorter than probe_len
[/skip_N]
    skip probes that contain "N" or "."
[/no_repeats]
    do not search repeat regions (lowercase letters) in the genome
[/silent]
    run in silent mode, without printing much debug information; the default
    is verbose mode
[/available_memory:memory_size(in MB)]
    disable memory detection by providing the memory size manually
[/zero_indexed]
    output coordinates in a 0-indexed manner
[/output_all_matches]
    output all matches
[/exact_mismatch]
    output only targets with exactly the given number of mismatches
*[/output_alignment]
    output the alignment between the pair of sequences in /output_all_matches
    mode; disabled by default
*[/output_statistics]
    output in probe order
*[/do_not_output_probe_without_match]
    do not output a probe if it has no target
*[/do_not_duplicate_probes]
    do not duplicate probes for the reverse strand search
*[/use_hash]
    use the hash method instead of the index-filtering method, if possible
*[/no_hash_1bp_mismatch]
    do not use the hash method if num_mismatch is not zero
*[/limit_num_part:max_num_part]
    limit the maximum number of parts that a probe can be split into; may
    reduce memory usage at the cost of increased running time
*[/interpolation_search]
    use interpolation search in the algorithm instead of binary search
*[/shift_mask]
    may reduce memory usage at the cost of increased running time
*[/no_store_key]
    may reduce memory usage at the cost of increased running time
*[/no_fast_index]
    may reduce memory usage at the cost of increased running time
*[/fast_index_fraction:fraction]
*[/no_filter_results]
*[/filter_selected_probes:selected_probes_file]
    filter the probes listed in the file "selected_probes_file"
*[/output_filtered_probes]
    output the filtered probes
*[/filter_low_quality_probes]
    filter low quality probes

--------------------------------------------------------------------------------
6. SeqMap Algorithms
--------------------------------------------------------------------------------

SeqMap indexes the short sequences rather than the genome sequences.
Given the maximum numbers of allowed mutations, insertions and deletions,
SeqMap splits each short sequence into several parts. By requiring only some
of the parts, rather than all of them, to match exactly, non-candidate
locations can be eliminated in the very first step. All the potential
candidates are then collected, and a local alignment algorithm is run on them
to determine the matched targets. Similar algorithms have been used in several
papers and software packages. However, to the best of the author's knowledge,
SeqMap is the first to use this approach for insertion/deletion detection.

See the SeqMap paper for more details:

Hui Jiang and Wing Hung Wong (2008) SeqMap: mapping massive amount of
oligonucleotides to the genome. Bioinformatics. doi: 10.1093/bioinformatics/btn429
http://bioinformatics.oxfordjournals.org/cgi/reprint/btn429v1

--------------------------------------------------------------------------------
7. Acknowledgements
--------------------------------------------------------------------------------

SeqMap was developed and tested with the help of the members and several
collaborators of the Wing Wong lab.
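
Note on Section 6: the filtering idea described there can be illustrated with
a short C++ sketch. This is not SeqMap's actual implementation. It assumes
substitutions only (no insertions or deletions) and the simplest splitting
scheme: a read that may contain at most m mismatches is cut into m + 1 parts,
so at least one part must match exactly (pigeonhole principle). For brevity
the sketch searches the genome directly with std::string::find, whereas SeqMap
indexes the reads and scans the genome. The names hamming and
filter_and_verify are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Count mismatching positions between `read` and the genome window that
// starts at `start` and has the same length as `read`.
static std::size_t hamming(const std::string& genome, std::size_t start,
                           const std::string& read) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < read.size(); ++i)
        if (genome[start + i] != read[i]) ++d;
    return d;
}

// Hypothetical illustration (not SeqMap's code): returns the 0-indexed start
// positions in `genome` where `read` matches with at most `max_mismatch`
// substitutions, in increasing order.
std::vector<std::size_t> filter_and_verify(const std::string& genome,
                                           const std::string& read,
                                           std::size_t max_mismatch) {
    const std::size_t parts = max_mismatch + 1;
    assert(read.size() >= parts);  // every part must be non-empty
    std::set<std::size_t> hits;    // ordered, de-duplicated positions
    if (read.size() > genome.size()) return {};
    for (std::size_t p = 0; p < parts; ++p) {
        // Part p covers read[begin, end); part lengths differ by at most 1.
        const std::size_t begin = p * read.size() / parts;
        const std::size_t end = (p + 1) * read.size() / parts;
        const std::string part = read.substr(begin, end - begin);
        // Filtering: only places where this part occurs exactly are candidates.
        for (std::size_t pos = genome.find(part); pos != std::string::npos;
             pos = genome.find(part, pos + 1)) {
            if (pos < begin) continue;  // read would start before the genome
            const std::size_t start = pos - begin;
            if (start + read.size() > genome.size()) continue;  // past the end
            // Verification: compare the whole read at the candidate position.
            if (hamming(genome, start, read) <= max_mismatch) hits.insert(start);
        }
    }
    return std::vector<std::size_t>(hits.begin(), hits.end());
}
```

For example, calling filter_and_verify("ACGTACGTTTGACCA", "ACGTTT", 2) splits
the read into "AC", "GT" and "TT", looks up each part, and verifies only the
candidate positions derived from those exact hits instead of every position in
the genome.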