Primary objective: to perform data mining on a real-world data set from a biology lab in the School of Biological Sciences, with the aim of extracting patterns that lead to hypotheses about the mode of action of compounds and the function of genes.
Functional genomic screens, especially in budding yeast, generate huge amounts of data. The Tyers Lab (http://tyerslab.bio.ed.ac.uk/) generates chemical-genetic data to test the effect of small molecules on the growth of different yeast deletion mutants. Combined with data published by other labs, this data set is large enough for advanced data mining.
Possibilities for this data analysis range from different clustering algorithms to pattern matching and association analysis. It is also possible to include structural similarity calculations between compounds. The analysis should lead to the definition of a set of chemical-genetic signatures that are associated with specific effects on eukaryotic cells (such as novel detoxification pathways). The biologists in the lab are also looking for new hypotheses about the mode of action of compounds and about the function of yeast genes that are as yet uncharacterized (up to 1000 of the 6000 yeast genes are still uncharacterized).
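To make the two kinds of similarity mentioned above concrete, here is a minimal sketch in Python. It assumes each compound has a chemical-genetic profile (one growth score per deletion mutant) and a structural fingerprint stored as a set of "on" bit positions. The compound names, scores, fingerprints, and function names are all made up for illustration and are not taken from the lab's data; Pearson correlation and the Tanimoto coefficient are only one possible choice of measures.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two fingerprint bit sets."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)

def most_similar_pairs(profiles):
    """Rank all compound pairs by profile correlation, highest first."""
    names = sorted(profiles)
    pairs = [
        (pearson(profiles[a], profiles[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return sorted(pairs, reverse=True)

# Hypothetical toy data: one growth score per deletion mutant.
profiles = {
    "compound_A": [-2.0, -1.5, 0.1, 0.0],
    "compound_B": [-1.8, -1.2, 0.0, 0.2],
    "compound_C": [0.1, 0.0, -2.2, -1.9],
}
# Hypothetical fingerprints: positions of the bits that are set.
fingerprints = {
    "compound_A": {1, 4, 7, 9},
    "compound_B": {1, 4, 7, 12},
    "compound_C": {2, 3, 15},
}

if __name__ == "__main__":
    for score, a, b in most_similar_pairs(profiles):
        struct = tanimoto(fingerprints[a], fingerprints[b])
        print(f"{a} vs {b}: profile r={score:.2f}, Tanimoto={struct:.2f}")
```

Pairs with highly correlated profiles but low structural similarity may be of particular interest, since they could point to chemically unrelated compounds with a shared mode of action. A pairwise score of this kind can also serve as the distance input to a clustering algorithm.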
In this project you are expected:
- To identify the data mining procedures and algorithms suitable for extracting patterns from these data.
- To identify solutions for handling the large amount of data (e.g. distributed computing paradigms such as MapReduce).
- To develop the data mining workflow using existing or new implementations.
- To deliver this workflow in a form that the biologists in the lab can interact with and use as a tool after the project.
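As a rough illustration of the MapReduce paradigm mentioned in the list above, the following single-process Python sketch walks through the map, shuffle, and reduce phases for one simple question: which compounds inhibit the growth of each deletion mutant? The record layout (compound, gene, score), the threshold, and all names are assumptions made for this example. A real deployment would run the same logic on a distributed framework rather than in one process.

```python
from collections import defaultdict

# Hypothetical cutoff: scores at or below this count as "sensitive".
THRESHOLD = -1.0

def map_record(record):
    """Map phase: emit (gene, compound) for each sensitive interaction."""
    compound, gene, score = record
    if score <= THRESHOLD:
        yield gene, compound

def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_gene(gene, compounds):
    """Reduce phase: collapse one gene's values to a sorted unique list."""
    return gene, sorted(set(compounds))

def run(records):
    """Chain the three phases over an iterable of records."""
    mapped = (pair for record in records for pair in map_record(record))
    grouped = shuffle(mapped)
    return dict(reduce_gene(gene, cs) for gene, cs in grouped.items())

# Hypothetical toy records: (compound, deletion mutant, growth score).
records = [
    ("cmpd_1", "gene_a", -2.1),
    ("cmpd_2", "gene_a", -1.3),
    ("cmpd_1", "gene_b", 0.2),
    ("cmpd_3", "gene_b", -1.0),
]

if __name__ == "__main__":
    for gene, compounds in sorted(run(records).items()):
        print(gene, compounds)
```

Because the map step looks at one record at a time and the reduce step at one gene at a time, both can in principle be spread across many machines, which is what makes this style of computation attractive for large screens.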