Computational Cell Biology, Teasdale Group IMB
SVMtm Predictor Description
About the prediction method
SVMtm is a support vector machine (SVM) based predictor of transmembrane (TM)
helices. First, SVMs generate a transmembrane profile (a per-residue
propensity) for the protein. The SVMs are trained with SVM_light (Joachims,
1999). Several coding schemes were tested, including 21-UNIT and hydropathy
scales (Yuan et al., 2004). The hydropathy scales are normalized to zero mean
and unit standard deviation, as shown in Table 1. Second, the MaxSubSeq
algorithm (Fariselli et al., 2003) is used to find TM segments in the TM
profile. The length of a transmembrane segment is limited to the range from 15
to 35 amino acids when the algorithm maximizes the sequence
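The two steps above can be illustrated with a minimal Python sketch. It is not the SVMtm implementation. The function names are hypothetical, the Kyte-Doolittle values stand in for whichever scales Table 1 lists, and the use of the population standard deviation is an assumption. The segment search returns only the single best-scoring window of 15 to 35 residues, which is a simplification of MaxSubSeq.

```python
from itertools import accumulate
from statistics import fmean, pstdev

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982), used here
# only as an example of a hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize_scale(scale):
    """Rescale a hydropathy scale to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used.
    """
    mean = fmean(scale.values())
    sd = pstdev(scale.values())
    return {aa: (value - mean) / sd for aa, value in scale.items()}


def best_segment(profile, min_len=15, max_len=35):
    """Return (start, end, score) of the highest-scoring window.

    start is inclusive and end is exclusive (0-based). Only windows whose
    length lies between min_len and max_len are considered. Returns None
    if the profile is shorter than min_len.
    """
    # prefix[i] holds the sum of the first i profile values.
    prefix = [0.0] + list(accumulate(profile))
    best = None
    for start in range(len(profile)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(profile):
                break
            score = prefix[end] - prefix[start]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best
```

For a toy profile with 20 positive values flanked by negative values, `best_segment` selects exactly the positive stretch, because extending the window in either direction lowers the summed score.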
Performance of SVMtm predictor
The method was evaluated with different coding schemes, and only the
best-performing scheme is implemented in the web predictor.
Seven-fold cross-validation was performed on a non-redundant dataset of 148
well-annotated membrane proteins (Moller et al., 2000). We define two
accuracies based on segments (Qsp and Qse) and two
accuracies based on proteins (Q0 and Q1).
Specificity Qsp = TP / P and sensitivity Qse = TP / T
where TP, P and T are the numbers of correctly predicted, predicted and observed
transmembrane segments, respectively. A predicted segment is counted as correct
if it overlaps an observed segment by at least 9 residues.
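The segment-level measures can be computed as in the following sketch. It is an illustration only. Segments are assumed to be (start, end) pairs with inclusive positions, the function names are hypothetical, and each observed segment is assumed to be matched by at most one predicted segment, a detail the definition above does not specify.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_true_positives(predicted, observed, min_overlap=9):
    """Count predicted segments that overlap an observed segment.

    Assumption: one-to-one matching, so an observed segment that has
    already been matched cannot be matched again.
    """
    used = set()
    tp = 0
    for p in predicted:
        for i, o in enumerate(observed):
            if i not in used and overlap(p, o) >= min_overlap:
                used.add(i)
                tp += 1
                break
    return tp


def segment_accuracy(predicted, observed, min_overlap=9):
    """Return (Qsp, Qse) = (TP / P, TP / T)."""
    tp = count_true_positives(predicted, observed, min_overlap)
    qsp = tp / len(predicted) if predicted else 0.0
    qse = tp / len(observed) if observed else 0.0
    return qsp, qse
```

To obtain dataset-wide values, TP, P and T would be summed over all proteins before dividing.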
Q0 = T0 / TA and Q1 = ( T0 + T1 ) / TA
where T0 is the number of correctly predicted proteins, T1 is the number of
proteins with exactly one wrongly predicted transmembrane segment, and TA is
the total number of proteins. A protein is correctly predicted if both the
number of transmembrane segments and every individual segment are predicted
correctly. T1 counts proteins in which one segment is over-predicted, one
segment is under-predicted, or the number of segments is correct but one
segment is mispredicted. Results are given in Table 2.
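The protein-level measures can be sketched as follows. This is one possible reading of the definitions, not the evaluation code used for Table 2. Each protein is assumed to be summarized by three counts (predicted segments, observed segments, correctly predicted segments), and the function names are hypothetical.

```python
def classify_protein(n_pred, n_obs, n_correct):
    """Return 0 for a correctly predicted protein, 1 for a protein with
    exactly one wrongly predicted segment, and 2 otherwise."""
    if n_pred == n_obs == n_correct:
        return 0
    # One extra segment predicted, all observed segments found.
    over = n_pred == n_obs + 1 and n_correct == n_obs
    # One observed segment missing, all predicted segments correct.
    under = n_pred == n_obs - 1 and n_correct == n_pred
    # Correct number of segments, but one of them is misplaced.
    missed = n_pred == n_obs and n_correct == n_obs - 1
    return 1 if (over or under or missed) else 2


def protein_accuracy(counts):
    """Return (Q0, Q1) for a list of (n_pred, n_obs, n_correct) tuples."""
    if not counts:
        raise ValueError("at least one protein is required")
    classes = [classify_protein(*c) for c in counts]
    total = len(classes)           # TA
    t0 = classes.count(0)          # T0
    t1 = classes.count(1)          # T1
    return t0 / total, (t0 + t1) / total
```

With five proteins of which one is fully correct and three have a single wrong segment, the sketch gives Q0 = 0.2 and Q1 = 0.8.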
Table 2. Prediction of transmembrane segments based on different sequence
[Table body not preserved; column heading: Prediction accuracy (%)]
To discriminate between membrane proteins and soluble proteins, we take the
maximum TM score of each protein. With a threshold of 10, 98.8% of soluble
proteins have a maximum score below the threshold, while 98.6% of TM proteins
have a maximum score above it. This threshold is therefore used to filter the
final results: if the maximum TM score of a predicted TM protein is less than
10, the protein is re-assigned as a soluble protein.
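The filtering rule amounts to a single comparison, shown here as a small sketch. The function name and the return labels are hypothetical, and treating a protein without any predicted segment as soluble is an assumption.

```python
def classify_by_max_score(segment_scores, threshold=10.0):
    """Label a protein from the scores of its predicted TM segments.

    A protein is kept as a membrane protein only if its highest TM
    score reaches the threshold; otherwise it is re-assigned as soluble.
    """
    if not segment_scores or max(segment_scores) < threshold:
        return "soluble"
    return "membrane"
```

Note that a single high-scoring segment is enough to keep the protein: a protein with scores 1.9 and 12.5 stays a membrane protein, although the low-scoring segment may still be a false positive.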
The method is poor at distinguishing N-terminal TM segments from signal
peptides. Predicted N-terminal TM segments should be verified with other
methods.
Illustration of prediction results
The 21-UNIT coding scheme gave the best values for all four accuracies and was
therefore used to build the SVMtm predictor. For each submitted protein, the TM
profile and TM segments are shown in a plot (JPG format). A score is given
after each predicted TM segment so that the user can easily evaluate the
strength of the TM signal. For example, in the figure above, the preprotein
translocase secY subunit B (SWISS-PROT ID: CYOE_ECOLI) is predicted to have 8
TM segments. The transmembrane profile generated by the support vector machines
is shown as a dashed red line. The solid green line marks the transmembrane
segments, with value 1 for transmembrane residues and 0 for all others. SVMtm
reports the start and end positions of each TM segment as well as its TM score.
Larger scores indicate stronger transmembrane signals. One segment has a score
of 1.9, while the other 7 segments all have scores larger than 12. This segment
is in fact a false positive. Its low score suggests that it needs further
examination by other methods.
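The 0/1 track drawn as the solid green line can be derived from the reported segment positions, as in this small sketch. It assumes that positions are 1-based and inclusive, which the description does not state, and the function name is hypothetical.

```python
def segments_to_track(sequence_length, segments):
    """Build a per-residue track with 1 inside TM segments and 0 elsewhere.

    Assumption: segments are (start, end) pairs with 1-based, inclusive
    positions.
    """
    track = [0] * sequence_length
    for start, end in segments:
        for pos in range(start, end + 1):
            track[pos - 1] = 1
    return track
```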
References
Boyd, D., Schierle, C. and Beckwith, J. (1998) How many membrane proteins are
there? Protein Sci., 7, 201-205.
Eisenberg, D., Schwarz, E., Komaromy, M. and Wall, R. (1984) Analysis of
membrane and surface protein sequences with the hydrophobic moment plot. J.
Mol. Biol., 179, 125-142.
Fariselli, P., Finelli, M., Marchignoli, D., Martelli, P.L., Rossi, I. and
Casadio, R. (2003) MaxSubSeq: an algorithm for segment-length optimization.
The case study of the transmembrane spanning segments. Bioinformatics,
Joachims, T. (1999) Making large-scale SVM learning practical. In: Schölkopf,
B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector
Learning. MIT Press, pp. 41-54.
Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the
hydropathic character of a protein. J. Mol. Biol., 157,
Moller, S., Kriventseva, E.V. and Apweiler, R. (2000) A collection of well
characterized integral membrane proteins. Bioinformatics, 16,
Yuan, Z., Mattick, J.S. and Teasdale, R.D. (2004) SVMtm: support vector
machines to predict transmembrane segments. J. Comput. Chem., 25(5), 632-636.
Last updated: 09-May-2006