Motivation: Gene Ontology (GO) is widely used to annotate protein functions and to understand their biological roles. Currently, fewer than 1% of the more than 70 million proteins in UniProt have experimental GO annotations, which makes automatic function prediction (AFP) of proteins strongly necessary. AFP is a multi-label classification problem, because a single protein can carry multiple GO terms. For most of these proteins, little information is available beyond the sequence itself, which underlines the importance of sequence-based AFP (SFAP: sequences are the only input). Furthermore, homology-based SFAP tools are competitive in AFP competitions, but they do not work well for so-called difficult proteins, which have only low similarity to proteins that are already annotated. The important and challenging problem now is therefore to develop an SFAP method that works particularly well for such difficult proteins.
Methods: The key to the method is to extract diverse and useful information (evidence) from the input sequences and to integrate it into a predictor efficiently and effectively. We propose GOLabeler, which integrates five component classifiers, each trained on a different type of feature: GO term frequency, sequence alignment, amino acid trigrams, domains and motifs, and biophysical properties. The integration is done in the framework of learning to rank (LTR), a relatively new machine learning paradigm that is powerful for multi-label classification.
Results: Extensive and thorough experiments with GOLabeler on large-scale datasets showed a significant performance advantage over state-of-the-art competing methods in numerous aspects.
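
The integration step described under Methods can be illustrated with a minimal sketch. It is not GOLabeler's implementation: the function names, the toy scores, and the fixed weights are all hypothetical, and the weighted sum merely stands in for a trained LTR model. The sketch shows two ideas only: counting overlapping amino acid trigrams in a sequence, and turning the per-term scores of several component classifiers into one feature vector per candidate GO term, which a ranker then orders.

```python
from collections import Counter


def trigram_frequencies(sequence):
    """Relative frequencies of overlapping amino acid trigrams in a sequence."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def ranking_features(component_scores, go_terms):
    """Build one feature vector per candidate GO term.

    component_scores: list of dicts, one per component classifier,
    each mapping a GO term to that component's score for the query protein.
    A term a component did not score gets 0.0 (an assumption of this sketch).
    """
    return {
        term: [scores.get(term, 0.0) for scores in component_scores]
        for term in go_terms
    }


def rank_terms(features, weights):
    """Order GO terms by a weighted sum of component scores.

    The fixed weights are a placeholder for a learned ranking model.
    """
    combined = {
        term: sum(w * x for w, x in zip(weights, vector))
        for term, vector in features.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    # Toy scores from two hypothetical components for one query protein.
    scores = [
        {"GO:0005515": 0.9, "GO:0003677": 0.2},
        {"GO:0003677": 0.8},
    ]
    feats = ranking_features(scores, ["GO:0005515", "GO:0003677"])
    print(rank_terms(feats, [0.5, 0.5]))
    print(trigram_frequencies("MKVMKV"))
```

In a real system the weights would be replaced by a ranking model trained on annotated proteins, and each component would be a classifier in its own right; the sketch fixes only the data flow from component scores to a ranked list of GO terms.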
