README

Mattbench consists of benchmark sets at two homology levels, roughly comparable to the SABMark "Superfamily" and "Twilight Zone" sets. The superfamily set consists of 225 distinct multiple alignments, each containing between 3 and 15 protein domains. The domains within each multiple alignment share at most 50% sequence identity with one another, and all are chosen from the same superfamily in the "Touring Protein Space with Matt" hierarchy. Each multiple alignment also comes with a set of decoys: domains that are similar in sequence, but not in structure, to the domains in the multiple alignment. Specifically, each decoy is chosen so that its alignment with every domain in the superfamily set is worse than the Matt superfamily threshold, while its blastp e-value against at least one of the superfamily domains is better than at least one of the pairwise e-values within the superfamily set.
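
The decoy criterion described above can be written as a small predicate. This is only a sketch of the stated rule, not the script used to build Mattbench: the function name, the argument layout, and the 'meets_threshold' callback are hypothetical, and the actual Matt superfamily threshold (and whether a larger or smaller score counts as better) is not given in this README, so the caller supplies it. For e-values, smaller is better.

```python
from typing import Callable, Sequence


def is_decoy_candidate(
    alignment_scores: Sequence[float],
    candidate_evalues: Sequence[float],
    within_set_evalues: Sequence[float],
    meets_threshold: Callable[[float], bool],
) -> bool:
    """Apply the decoy rule to one candidate domain.

    alignment_scores   -- candidate vs. each domain in the set (hypothetical scores)
    candidate_evalues  -- blastp e-value of the candidate vs. each domain in the set
    within_set_evalues -- blastp e-values for the pairs of domains inside the set
    meets_threshold    -- caller-supplied test for "at least as good as the
                          Matt superfamily threshold" (assumed, not from the README)
    """
    # Structure: the candidate must fall short of the threshold against
    # every domain in the set.
    structurally_dissimilar = not any(meets_threshold(s) for s in alignment_scores)

    # Sequence: some candidate e-value must beat (be smaller than) at least
    # one within-set e-value, i.e. the best candidate e-value must be
    # smaller than the worst within-set e-value.
    sequence_similar = min(candidate_evalues) < max(within_set_evalues)

    return structurally_dissimilar and sequence_similar
```

A call might look like is_decoy_candidate(scores, evalues, set_evalues, lambda s: s >= cutoff), where 'cutoff' and the direction of the comparison are placeholders for whatever the real threshold is.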


The twilight zone set consists of 34 distinct multiple alignments, constructed similarly to the superfamily set, except that the maximum sequence identity is 20% and the domains within each alignment are chosen from the same fold rather than the same superfamily.
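
To spot-check the identity limits (50% for superfamily, 20% for twilight) on a set of aligned sequences, something like the following could be used. It is a rough sketch under assumptions: the gap character is taken to be '-', and identity is computed as matches divided by the number of columns where both sequences have a residue. That is only one of several common conventions, and this README does not say which one was used to build the benchmark, so small disagreements near the limits are to be expected.

```python
from itertools import combinations
from typing import Sequence


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences of equal length.

    Only columns where neither sequence has a gap are counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def max_pairwise_identity(aligned: Sequence[str]) -> float:
    """Largest percent identity over all pairs of aligned sequences."""
    return max(percent_identity(a, b) for a, b in combinations(aligned, 2))
```

Under these assumptions, a value from max_pairwise_identity that is far above the stated limit for a set would suggest a different identity definition rather than a problem with the data.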


This tarball contains two directories, 'superfamily' and 'twilight'. Each contains numbered directories, each corresponding to a Matt superfamily or fold, respectively. Each of those benchmark instances then contains three directories:


decoys/

- contains between 3 and 15 decoy domains for this set


domains/

- contains between 3 and 15 positive (aligning) domains for this set


matt_alignment/

- contains four files. If this benchmark number is 'n':


    n.fasta

    - FASTA format showing the Matt alignment represented as sequence

    n.pdb

    - PDB file showing the structural alignment

    n.spt

    - Jmol script that will produce a nice cartoon view based on this PDB file

    n.txt

    - the raw Matt output for this alignment
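

The layout above can be walked with a short script. The following is a minimal sketch, assuming the tarball has been unpacked into some root directory and that the files in matt_alignment/ are named after their numbered directory (the 'n' above). The function names and the dictionary keys are made up for illustration. The file extensions inside domains/ and decoys/ are not stated in this README, so the sketch simply lists every file it finds there.

```python
from pathlib import Path
from typing import Dict, List


def _files(directory: Path) -> List[Path]:
    """Sorted regular files in a directory, or an empty list if it is missing."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


def list_instances(root: str, level: str) -> List[dict]:
    """Describe each numbered benchmark directory under root/level.

    'level' is expected to be 'superfamily' or 'twilight'.
    """
    instances = []
    for inst in sorted(p for p in (Path(root) / level).iterdir() if p.is_dir()):
        n = inst.name
        instances.append({
            "id": n,
            "domains": _files(inst / "domains"),
            "decoys": _files(inst / "decoys"),
            # Assumes the alignment file is named after the directory number.
            "fasta": inst / "matt_alignment" / f"{n}.fasta",
        })
    return instances


def read_fasta(path: Path) -> Dict[str, str]:
    """Read a FASTA file into {header: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}
```

For example, list_instances("mattbench", "twilight") would return one entry per numbered directory, where "mattbench" stands for wherever the tarball was unpacked, and read_fasta(entry["fasta"]) would give the aligned sequences for that entry.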