Sequence Database Setup: Contaminants
Overview
If you search a single-organism database, it's usually a good idea to include sequences for common
contaminants, such as keratins, BSA, and trypsin.
Two groups make their collections available for download. The Max Planck Institute of Biochemistry,
Martinsried, maintains a file of some 262 proteins selected from IPI. The Global Proteome Machine
Organization's common Repository of Adventitious Proteins (cRAP) contains some 111 proteins selected
from Swiss-Prot. (Numbers as of November 2009.)
In Mascot 2.3, you simply select the contaminants database in the search form, along with the target
database. For Mascot 2.2 and earlier, you need to append the contaminant sequences to the end of the
target database Fasta file.
This can be complicated by the requirement to have a uniform syntax for all the title lines. One database may
have Swiss-Prot style accessions and the other NCBI-style accessions. If so, you either have to find a
parse rule that works with both or modify the title lines of one database using a script or text editor.
If both target and contaminants databases have accessions drawn from the same pool, remember to watch for
duplicates. It may be safer to leave the CON_ prefix in place for the MPI collection, or add a prefix for the
GPM collection.
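For Mascot 2.2 and earlier, the append step described above could be scripted along the following lines. This is only a minimal sketch: the function name merge_fasta, the acc_rule argument, and the default CON_ prefix are illustrative assumptions, not part of Mascot. It assumes that a single accession rule matches the title lines of both files, adds the prefix to each contaminant accession, and stops if an accession would appear twice.

```python
import re


def merge_fasta(target, contaminants, acc_rule, prefix="CON_"):
    """Append contaminant entries to the target Fasta text.

    acc_rule is a Python regex with one group that captures the accession.
    It must match every title line in both inputs (hypothetical helper).
    """
    rule = re.compile(acc_rule)
    seen = set()
    out = []

    # Copy the target database unchanged, recording its accessions.
    for line in target.splitlines():
        if line.startswith(">"):
            m = rule.match(line)
            if m is None:
                raise ValueError("rule does not match: " + line)
            seen.add(m.group(1))
        if line.strip():
            out.append(line)

    # Append the contaminants, prefixing each accession.
    for line in contaminants.splitlines():
        if line.startswith(">"):
            m = rule.match(line)
            if m is None:
                raise ValueError("rule does not match: " + line)
            acc = prefix + m.group(1)
            if acc in seen:
                raise ValueError("duplicate accession: " + acc)
            seen.add(acc)
            # Rewrite only the accession part of the title line.
            line = line[:m.start(1)] + acc + line[m.end(1):]
        if line.strip():
            out.append(line)

    return "\n".join(out) + "\n"
```

With an empty prefix, an accession shared by the two files raises an error instead of being written twice, which is the duplicate case to watch for.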
Download
http://www.biochem.mpg.de/en/rd/maxquant/Downloads/fasta/contaminants.zip
for contaminants from MPI
ftp://ftp.thegpm.org/fasta/cRAP/crap.fasta
for cRAP from GPM
Taxonomy
Configuring taxonomy is not appropriate for a contaminants database, because you want to include all contaminants in every search.
Parse Rules
Typical Fasta title line from the MPI collection:
>IPI:CON_Trypsin|SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig).
Accession from Fasta title: ">IPI:CON_\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
Typical Fasta title line from the GPM collection:
>sp|ALBU_BOVIN|
Accession from Fasta title: ">sp|\([^|]*\)"
Description from Fasta title (same): ">sp|\([^|]*\)"
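One way to check these rules before configuring them is to try them against sample title lines outside Mascot. The sketch below translates the rules into Python regular expressions, which is an approximation of Mascot's own matching and not a substitute for testing the definition in Mascot: the \( \) group markers become ( ), and a literal | outside a character class is escaped. The names MPI_ACC, MPI_DESC, GPM_ACC, and apply_rule are illustrative, and the MPI title is assumed to sit on a single line with a space before the description.

```python
import re

# Python translations of the parse rules above (illustrative names).
MPI_ACC = r">IPI:CON_([^|]*)"
MPI_DESC = r">[^ ]* (.*)"
GPM_ACC = r">sp\|([^|]*)"


def apply_rule(rule, title):
    """Return the text captured by the rule's group, or None if no match."""
    m = re.match(rule, title)
    return m.group(1) if m else None


mpi_title = ">IPI:CON_Trypsin|SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig)."
gpm_title = ">sp|ALBU_BOVIN|"

print(apply_rule(MPI_ACC, mpi_title))   # Trypsin
print(apply_rule(MPI_DESC, mpi_title))  # Trypsin - Sus scrofa (Pig).
print(apply_rule(GPM_ACC, gpm_title))   # ALBU_BOVIN
```

Note that the MPI accession rule as written captures the text after CON_, so the accession comes out as Trypsin without the prefix.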
Configuration
The MPI collection was downloaded to C:\inetpub\mascot\sequence\contaminants\current,
decompressed using gzip, and renamed to contaminants_20090624.fasta.
The GPM collection was downloaded to C:\inetpub\mascot\sequence\cRAP\current,
and renamed to cRAP_20090731.fasta.
Always test a new definition before applying the changes to mascot.dat.