We are building a data-intensive machine as a research platform to explore data-intensive computational strategies. We are interested in computations over large bodies of data, where data handling is a dominant issue. Computational challenges with these properties are becoming ever more prevalent as digital sensors and computational/societal data sources become cheaper, more powerful and more ubiquitous. The use of algorithms over such data is of growing importance in medicine, planning, engineering, policy and science.
It is not possible to resource every computation that interests people, as the cost of accessing data is high (in time, joules or £). Therefore, there is interest in effective strategies for getting the best possible answer within some cost bound.
The project would explore strategies that take account of data locality and make the best use of each data transfer. For example, each computing node samples data stored locally, and once a disk access has been made, all of the records that can contribute to the answer are used.
More precisely: we seek the best answer to f(proj, D), where f is some function we can potentially understand, such as mean, max, min, k-largest, etc., and for which we imagine a well-provisioned library built by collaborations of Data-Intensive Distributed Computing (DIDC) engineers and Data Analysts. D is a (large) collection of data already distributed over nodes and storage units by an algorithm we may know, and may influence. We may use samples from astronomy, biology or seismology. proj is a user-defined function, such that proj(d_i) generates the value of interest to be analysed by f. So we want the best estimate we can afford of f(proj(d_i)) for all d_i in D. Note that we cannot have prior knowledge of proj: it can be any function that interests the person analysing the data. We cannot predict its cost, but we can learn it as the algorithm runs, or over successive runs. That learning is probably beyond the scope of the student project.
A naive approach would be to keep choosing d_i randomly from D, looking it up, and using that value, iterating until time runs out (strictly, until the aggregation part of the algorithm has time to run). Such an approach would make pathologically poor use of the time budget t, as it would access whole blocks or whole files to use one object, and those files would not be co-located with the process evaluating f, so there would also be a lot of communication delays.
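For concreteness, here is a minimal Python sketch of this naive object-at-a-time sampler, assuming f is a mean; `read_block`, `block_index` and the toy `proj` are placeholder names for illustration, not part of any existing framework.

```python
import random
import time

def naive_estimate(read_block, block_index, proj, time_budget_s):
    """Naive strategy: repeatedly pick one random d_i, fetch the whole
    storage block that contains it, but use only that single object."""
    total, count = 0.0, 0
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        block_id, offset = random.choice(block_index)  # choose one d_i at random
        block = read_block(block_id)                   # whole-block (or whole-file) read ...
        total += proj(block[offset])                   # ... but only one record is used
        count += 1
    return total / count if count else None            # crude estimate of mean(proj(D))

# Toy usage with an in-memory stand-in for (remote) storage blocks.
storage = {b: [random.gauss(0.0, 1.0) for _ in range(1000)] for b in range(10)}
index = [(b, i) for b in storage for i in range(len(storage[b]))]
print(naive_estimate(storage.__getitem__, index, proj=lambda d: d * d, time_budget_s=0.1))
```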
So the approach the student would take would be to invent a new function f_p that does part of the work, and to organise a framework for running f_p as follows:
Clone f_p across n (a control or tunable parameter) nodes of the computational framework as f_{p,k} workers. If we run for time t, then nt is a reasonable indicator of cost. Each f_{p,k} then randomly accesses local data units and uses all of the d_i in each data unit u_j read. The f_{p,k} either stream intermediate results or send final results to an aggregator, ag, when t has elapsed. The results sent are a tuple of some such form as \langle u_n, u_{max}, \hat{f_k}, \hat{\epsilon_k}, count_k \rangle, where u_n is the number of data storage units processed, u_{max} is the number of data storage units available locally, \hat{f_k} is the local estimate of f(proj(D)), \hat{\epsilon_k} is the local estimate of the error, and count_k is the number of d_i that were sampled.
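To make the worker's role concrete, one possible Python sketch of an f_{p,k} follows, again assuming f is a mean so that the local estimate and error are the sample mean and its standard error; `read_unit`, `proj` and the exact tuple layout are illustrative choices, not prescribed.

```python
import random
import statistics
import time

def worker_fpk(local_unit_ids, read_unit, proj, time_budget_s):
    """One f_{p,k} worker: read whole local storage units in random order,
    use every d_i in each unit read, then report a summary tuple."""
    deadline = time.monotonic() + time_budget_s
    unread = list(local_unit_ids)
    u_max = len(unread)                       # u_{max}: units available locally
    random.shuffle(unread)                    # random access order over local units
    values, units_read = [], 0
    while unread and time.monotonic() < deadline:
        unit = read_unit(unread.pop())        # one disk access ...
        values.extend(proj(d) for d in unit)  # ... contributes every record it holds
        units_read += 1
    f_hat = statistics.fmean(values) if values else None      # local estimate of f(proj(D))
    eps_hat = (statistics.stdev(values) / len(values) ** 0.5   # local error estimate:
               if len(values) > 1 else None)                   # standard error of the mean
    # Tuple shape as above: (u_n, u_max, f_hat, eps_hat, count_k)
    return units_read, u_max, f_hat, eps_hat, len(values)
```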
The aggregator would produce its best estimate \hat{f}(proj(D)), its best estimate of the error \hat{\epsilon}, and an estimate of the coverage p, the percentage of D sampled.
These would then be fed to a user or consumer algorithm.
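As a sketch of what the aggregator might do, again under the assumption that f is a mean, the function below weights local estimates by their sample counts, pools the error estimates in quadrature, and reports coverage as the percentage of storage units read; these are reasonable but not the only possible choices.

```python
def aggregate(worker_tuples):
    """Combine the (u_n, u_max, f_hat, eps_hat, count_k) tuples from all
    f_{p,k} workers into (f_hat, eps_hat, coverage)."""
    usable = [t for t in worker_tuples if t[2] is not None]
    total_count = sum(t[4] for t in usable)
    # Count-weighted mean of the local estimates (appropriate when f is a mean).
    f_hat = sum(t[2] * t[4] for t in usable) / total_count
    # Crude pooled error: per-worker standard errors combined in quadrature,
    # each weighted by that worker's share of the samples.
    eps_hat = sum(((t[3] or 0.0) * t[4] / total_count) ** 2 for t in usable) ** 0.5
    # Coverage p: storage units read as a percentage of the units available.
    coverage = 100.0 * sum(t[0] for t in worker_tuples) / sum(t[1] for t in worker_tuples)
    return f_hat, eps_hat, coverage
```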
Students would decide on a data-distribution description schema, e.g. one that says how D is distributed over nodes, how it is divided into storage units on each node, and perhaps how it was randomised on arrival. They can make the project harder by moving from simple statistical functions towards progressively more challenging data-mining algorithms. They can show how their strategy progressively improves over the naive method. They can show that their indicators of result quality reflect the true goodness of the result by comparing with an exhaustive computation.
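As a starting point, such a data-distribution description could be as small as the following Python dataclasses; the field names are purely illustrative and students would be expected to refine or replace them.

```python
from dataclasses import dataclass, field

@dataclass
class UnitDescriptor:
    unit_id: str            # storage unit (block or file) identifier on the node
    n_records: int          # number of d_i the unit holds

@dataclass
class NodeDescriptor:
    node_id: str
    units: list = field(default_factory=list)   # list of UnitDescriptor

@dataclass
class DistributionDescriptor:
    """Minimal description of how D is laid out: which nodes hold it, how each
    node divides it into storage units, and whether records were randomised
    (shuffled) on arrival, which justifies treating units as random samples."""
    dataset: str
    placement: str                  # e.g. "round-robin" or "hash(record_key)"
    randomised_on_arrival: bool
    nodes: list = field(default_factory=list)   # list of NodeDescriptor
```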
A more advanced, streaming framework might be tried by an ambitious student.
We now have F(f, t_0, e, r, D) which, when called, generates a distributed algorithm similar to the above, where t_0 is the time to run, f is the function to evaluate, e is an estimate of f(D), and r is an estimate of the current error, coverage, etc.
After the algorithm has distributed its elements and returned a result to show the user, or to satisfy a consumer, it repeats with another time step t_1 to obtain a better answer, taking care not to re-use any d_i used in earlier steps. The t_i might be supplied as a Fibonacci series, and the process run continuously until coverage is complete or the consumer/user says, "Enough".
This may work as a student project (or two), and the second framework is easily expressed in DISPEL, given a supply of incremental algorithms of the first sort. Note, though, that local housekeeping to remember which data units have been sampled would need to persist over the iterations, even on nodes that are not sampled again, and be cleaned up at the end.
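One possible shape for that second framework, sketched in Python, is given below; `run_workers`, `aggregate` and `satisfied` stand in for the first-framework components and the consumer's stopping test, and only the Fibonacci time budgets and the persistent per-node bookkeeping are being illustrated.

```python
def fibonacci_budgets(t0):
    """Yield t_0, t_1, ... scaled as a Fibonacci series of time budgets."""
    a, b = t0, t0
    while True:
        yield a
        a, b = b, a + b

def iterative_framework(nodes, run_workers, aggregate, t0, satisfied):
    """Sketch of F(f, t_0, e, r, D): repeat the one-shot framework with growing
    time budgets, never re-reading a storage unit, until the user/consumer is
    satisfied or every unit has been covered."""
    sampled = {node: set() for node in nodes}       # persistent per-node housekeeping
    estimate = None
    for budget in fibonacci_budgets(t0):
        # Workers on each node sample only units not already listed in `sampled`.
        tuples, units_used = run_workers(nodes, exclude=sampled, time_budget_s=budget)
        for node, units in units_used.items():
            sampled[node] |= units                  # remember what has been read
        estimate = aggregate(tuples)                # (f_hat, eps_hat, coverage)
        if satisfied(estimate) or estimate[2] >= 100.0:
            break
    sampled.clear()                                 # clean up the housekeeping at the end
    return estimate
```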
| Attachment | Size |
|---|---|
| MScAffordableMalcolmAtkinson20110119.pdf | 98.98 KB |