

 
Data remains raw text until it is mined and the information within it is put to use. Making sense of data through mining has applications across industry and academia. In this article, we explore the best open source tools that can aid us in data mining.
Data mining, also known as knowledge discovery in databases (KDD), is the process of analysing enormous amounts of data and extracting useful information from it. It can quickly answer business questions that would otherwise take a lot of time to resolve. Some of its applications include market segmentation (identifying the characteristics of customers who buy a certain product from a certain brand), fraud detection (identifying transaction patterns that are likely to indicate online fraud), and market basket and trend analysis (finding which products or services are frequently purchased together). This article focuses on the various open source options available and their significance in different contexts.
A brief look at mining tasks
For those who are new to data mining, let’s take a brief look at some of the common mining tasks.
Pre-processing: This involves all the preliminary tasks that can help in getting started with any of the actual mining tasks. Pre-processing could be removing anomalies and noise from the data that’s about to be mined, filling in missing values, normalising the data or compressing data using techniques like generalisation and aggregation.
Clustering: This is partitioning a large data set into groups of related items, where the groups emerge from the data rather than being defined in advance.
Classification: This is tagging or classifying data items into different user-defined categories.
Outlier analysis: This helps in identifying those data elements which are deviant or distant from the rest of the elements in a data set. This can help in anomaly detection.
Associative analysis: This helps in bringing out hidden relationships among data items in a large data set. It can help in predicting the occurrence of a particular item in a transaction or an event whenever some other item is present. You can think of this as a conditional probability.
Regression: This is used to predict the values of a dependent variable by constructing a model or a mathematical function out of independent variables.
Summarisation: This helps in coming up with a compact description of the whole data set.
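To make a few of these tasks concrete, here is a minimal sketch in plain Python, using only the standard library. It is an illustration, not a description of how any particular mining tool works: the function names, the sample data and the z-score threshold of 2 are all assumptions chosen for this example. It covers pre-processing (min-max normalisation), outlier analysis (a simple z-score rule), associative analysis (the conditional probability of one item given another) and regression (a least-squares straight line).

```python
from statistics import mean, pstdev


def min_max_normalise(values):
    """Pre-processing: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_outliers(values, threshold=2.0):
    """Outlier analysis: return values lying more than `threshold`
    population standard deviations away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def confidence(transactions, antecedent, consequent):
    """Associative analysis: estimate P(consequent | antecedent) as the
    share of transactions containing the antecedent that also contain
    the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept.
    Assumes the x values are not all identical."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


if __name__ == "__main__":
    # Hypothetical sample data, made up for this illustration.
    amounts = [12, 15, 14, 13, 16, 15, 14, 95]
    baskets = [
        {"bread", "butter"},
        {"bread", "butter", "milk"},
        {"bread", "milk"},
        {"milk"},
    ]
    print(min_max_normalise([10, 20, 30]))                     # [0.0, 0.5, 1.0]
    print(z_score_outliers(amounts))                           # [95]
    print(round(confidence(baskets, "bread", "butter"), 2))    # 0.67
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))                # (2.0, 1.0)
```

Dedicated data mining tools use far more robust methods than these, but the sketch shows the shape of each task: the input is a collection of records, and the output is either cleaned data, a flagged subset, a probability or a fitted function.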
Data mining combines techniques from pattern recognition, statistics and machine learning, among others. Machine learning and data mining overlap considerably, since machine learning algorithms are used to mine data, but in this article we will restrict ourselves to tools that specialise in data mining.
