MIST is a collection of algorithms for mining the most interesting patterns from a dataset, that is, only those patterns that are most relevant to a data analyst doing exploratory data analysis.
We provide algorithms for mining three different types of pattern: itemsets (sets of data attributes), sequences (subsequences of data attributes), and API calls (subsequences of API calls).
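As a rough illustration of the difference between these pattern types, the sketch below checks whether a single data record contains an itemset (order is ignored) or a sequence (order matters). It assumes that a subsequence may skip intervening elements; the function names and toy data are hypothetical and are not part of the MIST code, so please refer to the papers for the exact definitions.

```python
def contains_itemset(transaction, itemset):
    """True if every item of `itemset` occurs in `transaction`, in any order."""
    return set(itemset) <= set(transaction)


def contains_subsequence(sequence, pattern):
    """True if `pattern` occurs in `sequence` in the same order.

    Assumption: gaps between matched elements are allowed.
    """
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the match,
    # so later symbols must appear after earlier ones.
    return all(symbol in remaining for symbol in pattern)


# Hypothetical toy records.
record = ["a", "b", "c"]
print(contains_itemset(record, ["c", "a"]))      # order is ignored
print(contains_subsequence(record, ["c", "a"]))  # order matters

# An API-call pattern is a sequence whose elements are call names.
calls = ["open", "read", "close"]
print(contains_subsequence(calls, ["open", "close"]))
```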
Unlike frequent pattern mining algorithms, which return huge numbers of highly redundant patterns, our algorithms are designed from the ground up to mine only the most interesting patterns, greatly reducing redundancy.
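To give a feel for the redundancy that frequent pattern mining produces, here is a minimal sketch of brute-force frequent itemset enumeration on a toy database. It is not one of the MIST algorithms; the data, the function name `frequent_itemsets`, and the support threshold are all made up for illustration.

```python
from itertools import combinations

# Hypothetical toy transaction database.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"eggs", "tea"},
]


def frequent_itemsets(transactions, min_support):
    """Brute-force: return every itemset found in at least `min_support` transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


frequent = frequent_itemsets(transactions, min_support=3)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, support)
print(len(frequent), "frequent itemsets")
```

Every itemset reported here is a subset of {bread, butter, milk}, so a single underlying pattern is returned many times over. This is the kind of redundancy that mining only the most interesting patterns is meant to avoid.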
To achieve this, we define a probabilistic model of transactions and apply a statistical inference algorithm to efficiently infer the interesting patterns directly from the database. For more technical details, please see the papers accompanying the algorithms below.
Interesting Itemset Miner mines sets of attributes from data.
This is an implementation of the algorithm from our PKDD paper.
Interesting Sequence Miner mines sequences of attributes from data.
This is an implementation of the algorithm from our KDD paper.
Probabilistic API Miner mines sequences of API calls from data.
This is an implementation of the algorithm from our FSE paper.
The datasets used in our papers are available in the datasets/ subdirectory of the source code for each algorithm (see above).
Jaroslav Fowkes is a Postdoc at the University of Edinburgh and a member of the machine learning group. His research focuses on developing novel statistical methods for exploratory data analysis as well as natural language processing techniques for the analysis of program source code text.
Charles Sutton is a lecturer (= US Assistant Professor) at the University of Edinburgh and a member of the machine learning group. His research aims to develop new statistical methods for interactive machine learning and for handling data about the operation and performance of large-scale computer systems.