Introduction
What is IndexToolkit?
Welcome to the IndexToolkit web site.
The IndexToolkit project is an open-source project that
accelerates high-speed-request access to plain-text sequence databases by preprocessing them.
For FASTA-format protein sequence databases, IndexToolkit provides an indexing framework for integrating existing and new indexing methods.
IndexToolkit reference:
Dequan Li, Wen Gao, Charles X. Ling, Xiaobiao Wang, Ruixiang Sun and Simin He.
IndexToolkit: an open source toolbox to index protein databases for high-throughput proteomics. Bioinformatics, 2006, Volume 22, Number 20, pp. 2572-2573.
For new users, we have presented some information in a Q&A (Questions and Answers) format and hope it will be useful. The Q&A is intended as a guide only; we recommend that users obtain detailed information to determine its applicability to their specific requirements in proteomics research.
Before designing and coding IndexToolkit, we set ourselves these goals:
- Simplicity of use
- High speed and memory optimization
- Powerful set of features
- Ease of extension and addition of new implementations
The project includes
user-friendly tools and an Application Programming Interface (API) to
improve your systems' quality and speed. The project can be extended for many purposes,
such as database searching or modification discovery in proteomics.
IndexToolkit can be applied easily and quickly to integrate index files, and the faster searching they allow, into your application, such as a database-searching program or a sequence database analysis toolbox. The IndexToolkit project is written in C++ using the Standard Template Library (STL).
The IndexToolkit project is open source under the GNU GPL license.
To install and run IndexToolkit, simply download the source and add the class files to your project. See the Installation section for details.
The IndexToolkit project is developed by the Bioinformatics Group, Joint Research & Development Laboratory (JDL), Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS).
Please send comments and bug reports to Dequan Li (dqli@jdl.ac.cn).

- (2006-09-15) Fixed a bug: if the last line of the FASTA database was blank, work_bench did not exit. Thanks to Xin Wang for reporting it.
- (2006-05-27) The Q&A (Questions and Answers) web page was completed.
- (2006-05-25) A project for building IndexToolkit as a Dynamic Link Library (DLL) was added. It makes it easier to integrate IndexToolkit with developers' applications.
- (2006-05-15) IndexToolkit version 1.1 was released. Compared with version 1.0, version 1.1 adds support for memory-mapped files.
Highlighted new features of IndexToolkit
- Advanced FASTA processing and indexed data storage
  - IndexToolkit is designed for developers of high-throughput software who require high-speed FASTA data processing and storage functionality within their applications. The API contains methods for storing, querying, and retrieving indexed protein, peptide, and ion sequences and masses.
- FASTA processing and indexing made easy
  - A parser for reading the FASTA format and retrieving specific protein information.
  - Various indexing methods to optimize access to frequently used peptide and ion sequence data.
  - Dedicated API support for index data management and development.
  - User-friendly tools, with continuous development and upgrades.
- Powered by open source
  - To achieve maximal flexibility, we chose open source. The detailed index file formats are described here.
- Easy to integrate
  - IndexToolkit is easy to add to developers' projects.
  - A sequence-retrieval development framework that is easily extensible for high-speed-request proteomic applications.
  - A powerful C++ class library that helps to import, index, and retrieve databases.
- Applied to search engines
  - Already applied in the pFind 2.0 project.
  - Planned for application to the open-source X!Tandem.
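As a rough illustration of the FASTA parsing mentioned in the feature list, the sketch below reads records from a FASTA stream and looks one up by the start of its description line. It is not the IndexToolkit API: `FastaRecord`, `parseFasta`, `parseFastaString`, and `findByPrefix` are hypothetical names chosen for this example. Blank lines, including a trailing blank line at the end of the database, are simply skipped.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration; not part of IndexToolkit.
struct FastaRecord {
    std::string header;    // text after '>' on the description line
    std::string sequence;  // residues, concatenated across sequence lines
};

// Read every record from a FASTA stream. Blank lines, including a
// trailing blank line at the end of the input, are skipped.
std::vector<FastaRecord> parseFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // Sequence lines that appear before the first '>' are ignored.
    }
    return records;
}

// Convenience wrapper for parsing FASTA text held in memory.
std::vector<FastaRecord> parseFastaString(const std::string& text) {
    std::istringstream in(text);
    return parseFasta(in);
}

// Linear lookup by the start of the description line (e.g. an accession).
// Returns nullptr when no record matches.
const FastaRecord* findByPrefix(const std::vector<FastaRecord>& records,
                                const std::string& prefix) {
    assert(!prefix.empty());  // an empty prefix would match every record
    for (const auto& r : records) {
        if (r.header.compare(0, prefix.size(), prefix) == 0) return &r;
    }
    return nullptr;
}
```

The lookup in `findByPrefix` is deliberately a plain linear scan; it is the kind of access that an index is meant to replace.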
Why do we use IndexToolkit?
Firstly, with IndexToolkit you can improve your systems' quality and speed.
Speed is a crucial measure of usability for high-throughput proteomic applications.
Currently, downloaded protein sequence databases are commonly in FASTA format. FASTA is a plain-text format and is very inefficient to search.
For example, to find a specific protein entry in the database, a program must read every character or every line, one by one, until it finds the entry. Moreover, for each observed MS/MS spectrum searched against a protein sequence database, the search retrieves every protein entry, predicts all possible peptides from a theoretical digest of that entry, selects as candidates the peptides whose mass matches the spectrum's precursor mass within a given range, and compares those candidates against the spectrum to compute a score. In other words, to find all candidate peptide sequences for a given MS/MS spectrum, a program has to scan every protein sequence in the database from beginning to end, digest each protein sequence into peptide sequences, and calculate the mass of each peptide. This search process is complicated and time-consuming for an enormous number of MS/MS spectra, because much time is wasted repeating the same computation, such as digestion and peptide mass calculation, for every match performed.
Clearly, with well-managed index methods, the retrieval process can be made faster.
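The difference between rescanning the database and querying an index can be sketched as follows. This is a minimal illustration under stated assumptions, not IndexToolkit's actual index format or API. It assumes trypsin-like cleavage (cut after K or R unless the next residue is P), no missed cleavages, and standard monoisotopic residue masses. All names (`digestTrypsin`, `peptideMass`, `buildMassIndex`, `candidates`) are hypothetical. Digestion and mass calculation happen once when the index is built; each spectrum then needs only a binary search over the mass-sorted peptides.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Monoisotopic residue mass in daltons; unknown letters contribute 0.
double residueMass(char aa) {
    switch (aa) {
        case 'G': return 57.02146;  case 'A': return 71.03711;
        case 'S': return 87.03203;  case 'P': return 97.05276;
        case 'V': return 99.06841;  case 'T': return 101.04768;
        case 'C': return 103.00919; case 'L': return 113.08406;
        case 'I': return 113.08406; case 'N': return 114.04293;
        case 'D': return 115.02694; case 'Q': return 128.05858;
        case 'K': return 128.09496; case 'E': return 129.04259;
        case 'M': return 131.04049; case 'H': return 137.05891;
        case 'F': return 147.06841; case 'R': return 156.10111;
        case 'Y': return 163.06333; case 'W': return 186.07931;
        default:  return 0.0;
    }
}

const double kWaterMass = 18.01056;  // added once per peptide

double peptideMass(const std::string& peptide) {
    double mass = kWaterMass;
    for (char aa : peptide) mass += residueMass(aa);
    return mass;
}

// Assumed rule: cut after K or R unless followed by P; no missed cleavages.
std::vector<std::string> digestTrypsin(const std::string& protein) {
    std::vector<std::string> peptides;
    std::size_t start = 0;
    for (std::size_t i = 0; i < protein.size(); ++i) {
        const bool isSite = protein[i] == 'K' || protein[i] == 'R';
        const bool blocked = i + 1 < protein.size() && protein[i + 1] == 'P';
        const bool atEnd = i + 1 == protein.size();
        if ((isSite && !blocked) || atEnd) {
            peptides.push_back(protein.substr(start, i - start + 1));
            start = i + 1;
        }
    }
    return peptides;
}

struct IndexedPeptide {
    double mass;
    std::string sequence;
};

// Done once: digest every protein, compute masses, sort by mass.
std::vector<IndexedPeptide> buildMassIndex(const std::vector<std::string>& proteins) {
    std::vector<IndexedPeptide> index;
    for (const auto& protein : proteins)
        for (const auto& pep : digestTrypsin(protein))
            index.push_back({peptideMass(pep), pep});
    std::sort(index.begin(), index.end(),
              [](const IndexedPeptide& a, const IndexedPeptide& b) { return a.mass < b.mass; });
    return index;
}

// Done per spectrum: binary search for masses within precursor +/- tolerance.
std::vector<std::string> candidates(const std::vector<IndexedPeptide>& index,
                                    double precursorMass, double tolerance) {
    assert(tolerance >= 0.0);
    auto lo = std::lower_bound(index.begin(), index.end(), precursorMass - tolerance,
                               [](const IndexedPeptide& e, double m) { return e.mass < m; });
    auto hi = std::upper_bound(index.begin(), index.end(), precursorMass + tolerance,
                               [](double m, const IndexedPeptide& e) { return m < e.mass; });
    std::vector<std::string> out;
    for (auto it = lo; it != hi; ++it) out.push_back(it->sequence);
    return out;
}
```

With this layout, the per-spectrum cost grows with the logarithm of the number of indexed peptides plus the number of candidates returned, instead of with the size of the whole database.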
Secondly, besides SEQUEST and Mascot,
researchers keep developing novel search algorithms for protein identification, but the lack of index support hampers the spread of new algorithms and their adaptation to real applications. Those researchers need indexing tools, and IndexToolkit is just such a tool.
Thirdly, IndexToolkit is an open-source project. Developers can download it freely and extend it in their bioinformatics solutions without limit. To develop high-throughput proteomic software such as database search programs, researchers now have an open-source way
to work at a high level without worrying about the low-level speed of access to the protein sequence database.
Finally, for higher speed, the core module of the IndexToolkit package is coded entirely in C++ using the Standard Template Library (STL), which simplifies porting to multiple operating systems including Windows, Linux, and Solaris. We also developed some example tools, such as MassCat and DBIndex_WinDemo, to illustrate how to integrate IndexToolkit with your GUI applications.
The IndexToolkit source code was built and tested using Visual Studio .NET 2003.
Installation
Note: The source code compiles cleanly under Visual Studio .NET 2003. The provided solution is a Visual Studio .NET 2003 solution.
The following document describes a number of issues that are important when using IndexToolkit on the Windows platform.
Installation
------------
Download the IndexToolkit and utilities source files from the download page. IndexToolkit is distributed as a zip-format compressed archive. WinZip and
PKZIP are the most common tools for uncompressing these archives.
You can find PKZIP for Windows at http://www.pkware.com
To extract the files, create a new directory to hold all of
IndexToolkit and extract the archive into it. Then open the Visual Studio .NET 2003 solution file (*.sln) and compile it.
Compilers
---------
We compiled IndexToolkit for the Windows platform (Windows 2000/XP/2003) using Microsoft Visual Studio .NET 2003.
Setting up an environment for IndexToolkit
-----------------------------------
First, for IndexToolkit to run successfully, you must edit the setup.ini file with a text editor or with Setting.exe from the package. setup.ini is a text file that specifies the paths of the FASTA databases, the cleavage rules of the enzymes, and some common settings for the index files.
setup.ini currently has three sections: 'Database', 'Enzyme', and 'Filter'. The 'Database' section gives the name and path of every database; the names of the index files are usually based on the database name. The 'Enzyme' section gives the name and cleavage rules of every enzyme; the enzyme name forms one part of the names of its peptide index files. The 'Filter' section gives filter settings that determine how many peptides are stored in the index files (such as the maximum number of missed cleavage sites and the minimum and maximum length of peptides to be indexed).
After editing the settings, save the setup.ini file and run IndexToolkit.
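To make the three-section layout concrete, here is a small sketch that reads an INI-style file into sections and key/value pairs. It assumes setup.ini uses the conventional `[Section]` and `key=value` syntax, which this page does not confirm. The section names come from the description above, but every key name and value in `kSampleSetup` is invented for illustration, and `parseIni` is not the parser used by IndexToolkit or Setting.exe.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

using IniSection = std::map<std::string, std::string>;
using IniFile = std::map<std::string, IniSection>;

// Hypothetical setup.ini content; all key names and values are invented.
const char* const kSampleSetup =
    "[Database]\n"
    "name=yeast\n"
    "path=C:\\db\\yeast.fasta\n"
    "[Enzyme]\n"
    "name=Trypsin\n"
    "[Filter]\n"
    "max_missed_cleavages=2\n"
    "min_length=6\n"
    "max_length=40\n";

// Strip leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    assert(e != std::string::npos);  // guaranteed, since b was found
    return s.substr(b, e - b + 1);
}

// Parse "[Section]" and "key=value" lines; skip blanks and comments.
IniFile parseIni(std::istream& in) {
    IniFile ini;
    std::string line;
    std::string section;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == ';' || line[0] == '#') continue;
        if (line.front() == '[' && line.back() == ']') {
            section = trim(line.substr(1, line.size() - 2));
            ini[section];  // register the section even if it has no keys
            continue;
        }
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // ignore malformed lines
        ini[section][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return ini;
}

// Convenience wrapper for parsing settings held in memory.
IniFile parseIniString(const std::string& text) {
    std::istringstream in(text);
    return parseIni(in);
}
```

Calling `parseIniString(kSampleSetup)` yields one map per section, so a filter value could then be read as `ini["Filter"]["max_missed_cleavages"]`.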
ACKNOWLEDGEMENTS
This work was supported by the National Key Basic Research & Development Program (973) of China under Grant No. 2002CB713807 and
by the China National Key Technologies R&D Programme under Grant No. 2004BA711A21.