Computational Cell Biology, Teasdale Group IMB

SVMtm Predictor Description

About the prediction method

SVMtm is a support vector machine (SVM) based predictor of transmembrane (TM) helices. Prediction proceeds in two steps. First, transmembrane profiles (per-residue propensities) are generated for the protein by support vector machines, which are trained with SVM_light (Joachims, 1999). Several coding schemes have been used, including 21-UNIT and hydropathy scales (Yuan et al., 2004). The hydropathy scales are normalized to zero mean and unit standard deviation, as shown in Table 1. Second, the MaxSubSeq algorithm (Fariselli et al., 2003) is used to find TM segments in the TM profiles. The length of a transmembrane segment is restricted to the range of 15 to 35 amino acids while the algorithm maximizes the global score of the sequence.
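The segment-finding step can be illustrated with a small dynamic program. This is only a simplified sketch of the idea, not the MaxSubSeq algorithm as published: it assumes the profile is a list of per-residue scores that are positive for TM-like residues and negative otherwise, and it picks non-overlapping segments of 15 to 35 residues so that the summed score is maximal. The function name and the example profile are hypothetical.

```python
from typing import List, Optional, Tuple


def find_segments(profile: List[float], min_len: int = 15,
                  max_len: int = 35) -> Tuple[List[Tuple[int, int]], float]:
    """Choose non-overlapping segments (length min_len..max_len) that
    maximize the total profile score. Positions are 1-based, inclusive."""
    n = len(profile)
    # prefix[i] holds the sum of the first i profile values
    prefix = [0.0]
    for value in profile:
        prefix.append(prefix[-1] + value)

    best = [0.0] * (n + 1)                       # best score using residues 1..i
    back: List[Optional[int]] = [None] * (n + 1)  # start index of a segment ending at i
    for i in range(1, n + 1):
        best[i] = best[i - 1]                    # residue i is outside any segment
        for length in range(min_len, min(max_len, i) + 1):
            start = i - length
            candidate = best[start] + prefix[i] - prefix[start]
            if candidate > best[i]:
                best[i] = candidate
                back[i] = start

    # Trace back from the end to recover the chosen segments
    segments = []
    i = n
    while i > 0:
        if back[i] is None:
            i -= 1
        else:
            segments.append((back[i] + 1, i))
            i = back[i]
    segments.reverse()
    return segments, best[n]


# Hypothetical profile: two TM-like stretches separated by negative scores
example_profile = [-1.0] * 10 + [1.0] * 20 + [-1.0] * 10 + [1.0] * 18 + [-1.0] * 5
example_segments, example_score = find_segments(example_profile)
```

Unlike a real predictor, this sketch allows two segments to abut without a loop between them; a minimum loop length would have to be added as a further constraint.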

Table 1. Normalized hydropathy scales (image: Table1.jpg)
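The normalization to zero mean and unit standard deviation can be sketched as follows, using the Kyte-Doolittle (KD) hydropathy values (Kyte and Doolittle, 1982) as input. Whether the original normalization used the population or the sample standard deviation is not stated on this page; the population form is an assumption here, so the resulting values may differ slightly from those in Table 1.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize_scale(scale):
    """Rescale a hydropathy scale to zero mean and unit standard deviation.

    Assumption: the population standard deviation (pstdev) is used.
    """
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


kd_normalized = normalize_scale(KD_SCALE)
```

The ordering of the residues is unchanged by the transformation: isoleucine remains the most hydrophobic residue and arginine the least.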

Performance of SVMtm predictor

The method was evaluated with several coding schemes, and only the best-performing scheme is implemented in the web predictor.

Seven-fold cross-validation was performed on a non-redundant dataset of 148 well-annotated membrane proteins (Moller et al., 2000). We define two segment-based accuracies (Qsp and Qse) and two protein-based accuracies (Q0 and Q1).
Specificity Qsp = TP / P and sensitivity Qse = TP / T
where TP, P and T are the numbers of correctly predicted, predicted and observed transmembrane segments, respectively. A predicted segment is counted as correct if it overlaps an observed segment by at least 9 residues.
Q0 = T0 / TA and Q1 = ( T0 + T1 ) / TA
where T0 is the number of correctly predicted proteins, that is, proteins for which both the number of transmembrane segments and every individual segment are predicted correctly. T1 is the number of proteins with exactly one wrongly predicted segment: one segment over-predicted, one segment under-predicted, or the correct number of segments with one of them mispredicted. TA is the total number of proteins. Results are given in Table 2.
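As an illustration of these definitions, the following sketch computes Qsp, Qse, Q0 and Q1 from lists of predicted and observed segments. It is not the evaluation code behind Table 2. Segments are assumed to be (start, end) pairs with inclusive positions, each predicted segment may match at most one observed segment (a simple greedy matching, which is an assumption), and the example proteins are hypothetical.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_correct(pred, obs, min_overlap=9):
    """Count observed segments matched one-to-one by a predicted segment."""
    used = set()
    correct = 0
    for o in obs:
        for j, p in enumerate(pred):
            if j not in used and overlap(p, o) >= min_overlap:
                used.add(j)
                correct += 1
                break
    return correct


def evaluate(proteins):
    """proteins: list of (predicted_segments, observed_segments) pairs."""
    TP = P = T = T0 = T1 = 0
    for pred, obs in proteins:
        tp = count_correct(pred, obs)
        TP += tp
        P += len(pred)
        T += len(obs)
        # Wrong segments: extra, missing or mispredicted ones
        wrong = max(len(pred), len(obs)) - tp
        if wrong == 0:
            T0 += 1
        elif wrong == 1:
            T1 += 1
    TA = len(proteins)
    return {
        "Qsp": TP / P if P else 0.0,
        "Qse": TP / T if T else 0.0,
        "Q0": T0 / TA if TA else 0.0,
        "Q1": (T0 + T1) / TA if TA else 0.0,
    }


# Hypothetical data: (predicted, observed) segments for four proteins
example = [
    ([(10, 30), (50, 70)], [(12, 31), (48, 69)]),             # fully correct
    ([(10, 30), (50, 70), (90, 110)], [(12, 31), (48, 69)]),  # one over-predicted
    ([(10, 30)], [(12, 31), (48, 69), (100, 120)]),           # two missed
    ([(10, 30), (80, 100)], [(12, 31), (48, 69)]),            # one mispredicted
]
scores = evaluate(example)
```

In this example the first protein counts towards T0, the second and fourth towards T1, and the third towards neither.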


Table 2. Prediction accuracy (%) of transmembrane segments for different sequence coding schemes

Coding scheme   Qsp    Qse    Q0     Q1
21-UNIT         92.0   93.4   63.5   86.5
JTT             91.6   93.0   61.5   83.1
KD              91.0   92.9   60.1   86.5
EB              90.1   92.7   56.1   83.8

To discriminate between membrane proteins and soluble proteins, we take the maximum TM score of each protein. With a threshold of 10, 98.8% of soluble proteins have a maximum score below the threshold, while 98.6% of TM proteins have a maximum score above it. This threshold is therefore used to filter the final results: a predicted TM protein whose maximum TM score is less than 10 is re-assigned as a soluble protein.
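The filtering rule can be written in a few lines. This is a minimal sketch assuming that the scores of the predicted TM segments are available as a list of numbers; the function name and the labels it returns are hypothetical.

```python
TM_SCORE_THRESHOLD = 10.0  # threshold stated in the text


def classify_protein(segment_scores, threshold=TM_SCORE_THRESHOLD):
    """Return 'transmembrane' or 'soluble' from the predicted segment scores.

    A protein with no predicted segment, or whose maximum segment score
    is below the threshold, is treated as soluble.
    """
    if not segment_scores or max(segment_scores) < threshold:
        return "soluble"
    return "transmembrane"
```

Only the maximum score matters for this decision, so a protein with one strong segment stays classified as transmembrane even if some of its other segments score low.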

The method does not discriminate well between N-terminal TM segments and signal peptides. Predicted N-terminal TM segments should therefore be verified with other methods.


Illustration of prediction results

Figure: SVMtm prediction plot for CYOE_ECOLI (image: cyoe.jpg)

The 21-UNIT coding scheme, which gives the best values for all four accuracies (tied with KD on Q1), is used in the SVMtm predictor. For each predicted protein, the TM profile and the TM segments are shown in a plot (JPG format). A score is also given for each predicted TM segment so that the user can easily evaluate the strength of the TM signal. For example, in the figure above, the preprotein translocase secY subunit B (SWISS-PROT ID: CYOE_ECOLI) is predicted to have 8 TM segments. The transmembrane profile generated by the support vector machines is shown as a dashed red line. The solid green line marks the predicted transmembrane segments, with value 1 for transmembrane residues and 0 for all others. The SVMtm predictor reports the start and end positions of each TM segment together with its TM score.

Start End Score
13 28 13.91
38 55 15.54
80 105 24.4
108 126 18.7
160 174 1.9
209 224 14.91
229 247 14.14
266 281 12.77

Larger scores indicate stronger transmembrane signals. One segment (residues 160-174) has a score of 1.9, while the other seven segments all have scores above 12. This low-scoring segment is in fact a false positive, and its low score suggests that it needs further examination by other methods.
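Low-scoring segments such as this one can be picked out of the result table automatically. The sketch below parses the Start/End/Score rows shown above and lists the segments that fall below a review cutoff. The page does not define a per-segment cutoff; the value of 10 used here is a hypothetical choice borrowed from the protein-level threshold.

```python
# Start / End / Score rows copied from the result table above
RESULT_ROWS = """\
13 28 13.91
38 55 15.54
80 105 24.4
108 126 18.7
160 174 1.9
209 224 14.91
229 247 14.14
266 281 12.77
"""


def parse_segments(text):
    """Parse whitespace-separated 'start end score' rows into tuples."""
    segments = []
    for line in text.splitlines():
        if line.strip():
            start, end, score = line.split()
            segments.append((int(start), int(end), float(score)))
    return segments


def flag_weak_segments(segments, cutoff=10.0):
    """Return the segments scoring below the (hypothetical) review cutoff."""
    return [seg for seg in segments if seg[2] < cutoff]


segments = parse_segments(RESULT_ROWS)
weak = flag_weak_segments(segments)
```

With this cutoff, only the segment at residues 160-174 is flagged for further examination.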

References

  • Boyd, D., Schierle, C. and Beckwith, J. (1998) How many membrane proteins are there? Protein Sci., 7, 201-205.
  • Eisenberg, D., Schwarz, E., Komaromy, M. and Wall, R. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 179, 125-142.
  • Fariselli, P., Finelli, M., Marchignoli, D., Martelli, P.L., Rossi, I. and Casadio, R. (2003) MaxSubSeq: an algorithm for segment-length optimization. The case study of the transmembrane spanning segments. Bioinformatics, 19, 500-505.
  • Joachims, T. (1999) Making Large-Scale SVM Learning Practical. In: Schölkopf, B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, pp. 41-54.
  • Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-132.
  • Moller, S., Kriventseva, E.V. and Apweiler, R. (2000) A collection of well characterized integral membrane proteins. Bioinformatics, 16, 1159-1160.
  • Yuan, Z., Mattick, J.S. and Teasdale, R.D. (2004) SVMtm: Support Vector Machines to Predict Transmembrane Segments. J. Comput. Chem., 25(5), 632-636.

Last updated: 09-May-2006