TY - JOUR
T1 - EnzML: multi-label prediction of enzyme classes using InterPro signatures.
JF - BMC Bioinformatics
Y1 - 2012
A1 - De Ferrari, Luna
A1 - Stuart Aitken
A1 - van Hemert, Jano
A1 - Goryanin, Igor
AB - BACKGROUND: Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function. RESULTS: We present EnzML, a multi-label classification method that can efficiently account also for proteins with multiple enzymatic functions: 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters. CONCLUSIONS: InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
VL - 13
ER -

TY - CONF
T1 - A model of social collaboration in Molecular Biology knowledge bases
T2 - Proceedings of the 6th Conference of the European Social Simulation Association (ESSA'09)
Y1 - 2009
A1 - De Ferrari, Luna
A1 - Stuart Aitken
A1 - van Hemert, Jano
A1 - Goryanin, Igor
AB - Manual annotation of biological data cannot keep up with data production. Open annotation models using wikis have been proposed to address this problem. In this empirical study we analyse 36 years of knowledge collection by 738 authors in two Molecular Biology wikis (EcoliWiki and WikiPathways) and two knowledge bases (OMIM and Reactome). We first investigate authorship metrics (authors per entry and edits per author) which are power-law distributed in Wikipedia and we find they are heavy-tailed in these four systems too. We also find surprising similarities between the open (editing open to everyone) and the closed systems (expert curators only). Secondly, to discriminate between driving forces in the measured distributions, we simulate the curation process and find that knowledge overlap among authors can drive the number of authors per entry, while the time the users spend on the knowledge base can drive the number of contributions per author.
JF - Proceedings of the 6th Conference of the European Social Simulation Association (ESSA'09)
PB - European Social Simulation Association
ER -