Datasets of N-terminal transmembrane helix protein sequences used in the study were generated from two sources: a set of 188 well-annotated membrane proteins (Moller et al. 2000) and Swiss-Prot 40.0 (Boeckmann et al. 2003).


Redundancy reduction was carried out on all protein datasets obtained from these two sources. Pairwise identities between protein segments were calculated with ClustalW (Thompson et al., 1994), and the largest representative dataset was then determined using the algorithm developed by Hobohm et al. (1992). All protein segments within a dataset share less than 25% pairwise identity.
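The selection step can be illustrated with a minimal sketch. It assumes the pairwise percent identities have already been computed (for example from ClustalW alignments) and are supplied as a dictionary. It uses a greedy neighbour-removal strategy in the spirit of Hobohm et al. (1992), not necessarily the exact procedure used in the study. The function name largest_representative_set, the input layout, and the toy identities are hypothetical.

```python
def largest_representative_set(ids, identity, threshold=25.0):
    """Greedy redundancy reduction (illustrative sketch only).

    ids       -- iterable of segment identifiers
    identity  -- dict mapping (id_a, id_b) pairs to percent identity
    threshold -- segments with identity >= threshold count as neighbours
    """
    # Build the neighbour list: pairs at or above the identity threshold.
    neighbours = {i: set() for i in ids}
    for (a, b), pct in identity.items():
        if pct >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = set(neighbours)
    while kept:
        # Pick the segment with the most remaining neighbours
        # (ties broken by sorted order, so the result is deterministic).
        worst = max(sorted(kept), key=lambda i: len(neighbours[i]))
        if not neighbours[worst]:
            break  # no pair above the threshold remains
        kept.remove(worst)
        for n in neighbours.pop(worst):
            neighbours[n].discard(worst)
    return sorted(kept)


# Toy example with made-up identities (not data from the study).
toy_ids = ["a", "b", "c", "d"]
toy_identity = {
    ("a", "b"): 40.0,
    ("b", "c"): 30.0,
    ("a", "c"): 10.0,
    ("a", "d"): 5.0,
    ("b", "d"): 12.0,
    ("c", "d"): 8.0,
}
representative = largest_representative_set(toy_ids, toy_identity)
```

In the toy example, segment "b" is similar to both "a" and "c", so removing it alone leaves a set in which every remaining pair is below the 25% threshold.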


Two datasets were extracted from the 188 well-annotated membrane proteins: M27.seg, comprising 27 eukaryotic protein sequences, and M70.seg, comprising 70 Gram-negative bacterial sequences. A third dataset, anchor247.seg, comprising 247 signal anchor type II membrane proteins, was derived from Swiss-Prot 40.0.


com272.seg, comprising 272 eukaryotic protein sequences, was generated by merging M27.seg and anchor247.seg with redundancy reduction, while com89.seg, comprising 89 Gram-negative bacterial sequences, was generated from Swiss-Prot 40.0 and M70.seg.


Four further datasets were derived from the SignalP server (Nielsen et al. 1999): gram-sig232.seg (232 Gram-negative bacterial signal peptides), gram-non186.seg (186 non-secretory soluble bacterial proteins), eu-sig943.seg (943 eukaryotic signal peptides), and eu-non820.seg (820 non-secretory soluble eukaryotic proteins).


Gram-negative bacterial protein N-terminal segment datasets

Eukaryotic protein N-terminal segment datasets