Machine learning is a branch of artificial intelligence that concerns the construction and study of systems made to automatically screen (often big amounts of) data ("big data") in order to discover regularities in regard to which other regularities in so far unscreened data can be predicted. An often quoted example is a machine learning system trained on email messages to learn to distinguish between spam and nonspam.
Having a clear target in respect to which a machine (actually an algorithm) can be trained  such as the distinction between spam and not spam  is an instance of supervised machine learning. A first step in this kind of machine learning often is to find out which parts of the data will gain most information in respect to the target classification. The so called information gain methodology uses the concept of Shannonentropy.
Consider a data set of the following form and the task to derive from it (to "learn") what kind of attributes of the surveyed persons will best "predict" whether someone will invest into buying (here: adopting) a photo voltaic installation for his home.
Name 
Account balance 
Age 
Employed 
Adopt 
Alfred 
450000 
45 
Yes  Yes 
Bob 
35000 
34 
No 
No 
Chris 
115000  37 
Yes 
No 
Dora 
29000  25 
No 
No 
Edgar 
72000  51 
Yes 
Yes 
Each row indicates an instance of the data set, each column a feature and each cell an attribute. Rows represent so called feature vectors. The last column contains the so called target feature, that is, the feature for which the causes are sought and are assumed to be derivable from the other features. In the example case, this is the question which of the three features (account balance, age or employment status) provides most predictive information about whether someone will adopt photo voltaic or not. One way to approach an answer to this question is to use a technique called information gain through entropy.
Entropy is a measure of disorder, suggested by Claude E. Shannon (1948), and can be used to classify the segmentation as provided by the features in the data set.
For example, the feature account balance allows segmenting the instances (i.e. the surveyed persons) into such with more than Euro 100.000 balance (Alfred and Chris) and less than Euro 100.000 (Bob, Dora and Edgar), whereas the feature age allows segmenting the instances into such older than 40 (Alfred and Edgar) and younger than 40 (Bob, Chris and Dora). The disorder (or entropy) of a segmentation indicates how mixed or impure a feature is with respect to the target attribute. It varies between 0 and 1, with 0 indicating absolute order (or purity) and 1 indicating disorder (or impurity).
Entropy (E) is defined as:
\[E = p_1 log_2(p_1)  p_2 log_2(p_2) ... \]with \(p_i\) indicating the probability (or proportion) of property \(i\) within the set. In respect to account balance for instance, two properties could be "less than" or "more than Euro 100.000" on the account.
The following code calculates the entropy (according to the above formula) of the mixed set [1 ,1 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ] (or alternatively for any other given set), showing that an impure feature set of 10 attributes with two of them not corresponding to the target value would have a relatively high entropy of about 0.72 of 1, or in other words a rather low information gain in respect to the target.
The following code snippet calculates the successive entropies of 100 randomly generated binary sets of length 10000 with probabilities for the occurrence of ones in them ranging from 0 to 100. The plot beneath illustrates that pure sets  in this case consisting of just zeros or just ones  have an entropy of zero while maximally mixed sets  50% zeros and 50% ones  have an entropy of 1.
In the above example, account balance generates an impure segmentation (with \(E = \frac{3}{5}* log_2 \frac{3}{5} + \frac{2}{5}* log_2 \frac{2}{5} = (0.6*0.74 + 0.4*1.32)= (0.44 0.53)=0.97 \))  there is only one adopter classified in the segment "account balance > 100.000". But age generates a pure segmentation (\(entropy = 0\)), indicating that all adopters are older than 40.
Based on entropy, information gain measures how much an (added) feature decreases the entropy over the whole segmentation.
If the original set (or the one on which a new segmentation with a new attribute is performed) is called the parent set and the new sets are called children sets, the information gain (IG) with these children can be defined as:
\[IG(parent,children)=E(parent)[\pi(c_1)*E(c_1)+\pi(c_2)*E(c_2)+...]\]with \(\pi(c_i)\) indicating the proportion of instances belonging to a child.
Once the information gain of each feature is known, features can be sorted in respect to the information gain they provide. By splitting each feature at an appropriate value a decision tree can be constructed and used to determine whether certain combinations of ranges of feature values (attributes) would indicate the target correctly  in this case the question of photo voltaic adoption or notadoption.
Three features at the instances and two classes in the target variable
X 
Y 
Z 
C 

1 
1 
1 
I 

1 
1 
0 
I 

0 
0  1 
II 

1 
0  0 
II 
\(X=1\) yields two time I, but also one time II: \(E_{child_1}= (1/3)log_2(1/3)(2/3)log_2(2/3) = 0.5284 + 0,39=0.9184\)
\(X=0\) yields one time II: \(E_{child_2}= 0\)
\[E_{parent}=1\] \[Gain=1\frac{3}{4}*0.9184\frac{1}{4}*0=0.3112\]\(y=1\) yields two time I: \(E_{child_1}= 0\)
\(Y=0\) yields two times II: \(E_{child_2}= 0\)
\[E_{parent}=1\] \[Gain=1\frac{1}{2}*0\frac{1}{2}*0=1\]\(Z=1\) yields on time I and one time II: \(E_{child_1}= 1\)
\(X=0\) yields on time I and one time II: \(E_{child_2}= 1\)
\[E_{parent}=1\] \[Gain=1\frac{1}{2}*1\frac{1}{2}*1=0\]So obviously, splitting on Y yields the highest information gain, and splitting on Z the worst, which is easily deduced from the above table.
There are several good machine learning software collections available on the web and many websites and software companies offer prefabricated tools. However, since machine learning systems are complex systems in a strict sense of the word  with global effects emerging (often counterintuitively) from local interaction of components  using them as "black boxes", that is without understanding their internal mechanisms, involves the risk of getting suboptimal or even false results. One way to mitigate this problem is to use open source library packages like Weka for Java from the New Zealand University of Waikato or the scikit.learn toolkit for Python, which offer the chance to check internal operations.
The following information gain example is calculated with scikit.learn from a data set generated in the course of a survey of an Austrian bottomup initiative for participation in a communal photovoltaic plant. The survey focused on the seven features environmental awareness, belief in technical progress, importance of home comforts, modernity of lifestyle, importance of social inclusion, importance of independence from mainstream energy providers, importance of financial aspects.The last ten instances of the dataset (in terms of the numbers in the leftmost column) are shown below:
The feature values (attributes) vary between 1 ("not relevant") and 5 ("very relevant"). The last column shows the target values, with 1 indicating "adoption" and 0 "nonadoption".
As can be seen from the list below, the two last features importance of independence from mainstream energy providers and importance of financial aspects, although not highly informative, provide most predictive information in respect to the question whether someone will participate in the communal photovoltaic plant or not. These two features will be used in the following to illustrate other particular techniques of data mining. The indicator Best Split suggests values with the help of which the feature can be divided in order to distinguish the fraction of nonadopters (\(<\) Best Split) from the one of adopters (\(\geq\) Best Split). Sorting all features in respect to information gain (with respect to adoption in this case), generates a decision tree.
The leaves of a decision tree allow to estimate probabilities for class memberships, that is, for the probability of an instance belonging to the class of adopters or to the class of nonadopters. For example, if a leaf contains \(n\) positive instances (adopters) and \(m\) negative instances (nonadopters), the probability of any new instance being positive can be estimated as \(\frac{n}{n+m}\). this is called a frequencybased estimate of class membership probability.
The data set used in the above example is unusual small for machine learning. It consists of just 236 instances, so no wonder that the information gain of features is rather small. Using small data sets bears the risk that some leaves of the decision tree contain only few instances. In this case the frequencybased estimate can be massively twisted. A leaf with two positive instances and no negative one would have the same estimate as a leaf with 20 positive instances and no negative one.
In order to smooth this distortion, one can use the so called Laplace correction which moderates the influence of leaves with only few instances. Its formula for binary class probability estimation is:
\[p(c)=\frac{n+1}{n+m+2}\]