Principal goal: to apply machine learning to identify small molecules that are likely to show relevant bioactivity and are therefore candidates for follow-up wet-lab experiments.
Biologists now routinely screen hundreds of small molecules for bioactivity. These screens return many hit compounds that could be interesting for follow-up experiments. Quite often there are too many hits to follow up, and many of the molecules are ubiquitous binders.
One way to deal with this is to classify the hit molecules into groups by learning from their molecular structure and properties. To do that, a dataset of molecules and experimental results is submitted, and the compounds are classified based on their structure and on features that correspond to bioactivity in the screen.
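To make the binning idea concrete, here is a minimal sketch, not the method used by the existing tools. It assumes each molecule has already been reduced to a set of substructure feature IDs (in practice these might come from a cheminformatics library's fingerprints) and assigns a query molecule to the majority bin among its most similar labelled neighbours. The function names, the toy data, and the choice of Tanimoto similarity with k-nearest neighbours are all illustrative assumptions.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(query, labelled, k=3):
    """Return the majority bin among the k labelled molecules most similar to query.

    `labelled` is a list of (feature_set, bin_label) pairs.
    """
    neighbours = sorted(
        labelled,
        key=lambda item: tanimoto(query, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up toy data: each molecule is a set of hypothetical feature IDs,
# labelled with the bin a researcher assigned to it.
labelled = [
    ({1, 2, 3, 4}, "A"),
    ({1, 2, 3, 5}, "A"),
    ({7, 8, 9}, "B"),
    ({7, 8, 10}, "B"),
]

predicted_bin = classify({1, 2, 3, 6}, labelled, k=3)
print(predicted_bin)
```

A real pipeline would replace the hand-written feature sets with computed descriptors and would likely use a trained model rather than a plain neighbour vote.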
The current tools can only handle binned results: the researchers have to specify which compounds are positive or negative hits, or which belong to specific groups A, B, C, etc. Your job would be to deliver a quantitative refinement of the classification. For example, instead of binning the results, you could assign each compound a real-valued number that represents its interestingness, or alternatively rank the hit compounds.
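As an illustration of what a quantitative refinement could look like, the sketch below assigns each hit a real-valued score and then ranks the hits by it. The scoring formula is purely hypothetical: it assumes a normalised activity value in [0, 1] is available for each compound, along with a count of how many other screens the compound also hit in, and it down-weights compounds that hit in many screens as likely ubiquitous binders. The compound IDs and numbers are made up.

```python
def interestingness(activity, other_screen_hits, n_other_screens):
    """Hypothetical score: activity in this screen, down-weighted by promiscuity.

    Promiscuity is the fraction of other screens in which the compound was a hit.
    """
    promiscuity = other_screen_hits / n_other_screens if n_other_screens else 0.0
    return activity * (1.0 - promiscuity)


# Made-up records: (compound ID, normalised activity, hits in other screens).
compounds = [
    ("cmpd_1", 0.90, 8),  # very active here, but a hit in most other screens
    ("cmpd_2", 0.70, 0),  # active and selective
    ("cmpd_3", 0.40, 1),  # weakly active, mostly selective
]
N_OTHER_SCREENS = 10  # assumed number of other screens with data available

scored = [
    (cid, interestingness(activity, hits, N_OTHER_SCREENS))
    for cid, activity, hits in compounds
]

# Rank from most to least interesting.
ranking = sorted(scored, key=lambda pair: pair[1], reverse=True)
for cid, score in ranking:
    print(f"{cid}\t{score:.2f}")
```

The same ranking step works with any other real-valued score, for example a predicted probability from a trained model.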
This project will allow you to apply machine learning algorithms to real data and to work with chemo-informatics libraries.
This project has the potential to be published as an application note in the journal Bioinformatics when finished.