Machine learning is a branch of artificial intelligence concerned with the construction and study of systems that automatically screen (often large amounts of) data ("big data") in order to discover regularities from which regularities in so far unscreened data can be predicted. An often quoted example is a machine learning system trained on email messages to learn to distinguish between spam and non-spam.
Having a clear target with respect to which a machine (actually an algorithm) can be trained, such as the distinction between spam and non-spam, indicates an instance of supervised machine learning. A first step in this kind of machine learning often is to find out which parts of the data yield the most information with respect to the target classification. The so-called information gain methodology uses the concept of Shannon entropy.
Consider a data set of the following form and the task to learn from it which attributes of the registered persons best predict whether someone will invest in buying (= adopt) a photovoltaic installation for her home.
Name     Account balance   Age   Employed   Adopt
Alfred   450000            45    Yes        Yes
Bob      35000             34    No         No
Chris    115000            37    Yes        No
Dora     29000             25    No         No
Edgar    72000             51    Yes        Yes
Each row indicates an instance of the data set, each column a feature and each cell an attribute. Rows represent so-called feature vectors. The last column contains the so-called target attributes, that is, the attributes whose causes are sought and suspected to be hidden somewhere in the other attributes. In this case the question is which of the three other features (account balance, age or employed) provides the most predictive information about whether someone will adopt photovoltaics or not. One possibility to approach this task is a technique called information gain through entropy.
Entropy is a measure of disorder, suggested by Claude E. Shannon (1948), and can be used to characterize a segmentation generated with one of the attributes in the data set.
For example, the feature account balance would allow segmenting the instances (i.e. the registered persons) into those with a balance of more than 100000 (Alfred and Chris) and those with less than 100000 (Bob, Dora and Edgar), whereas the feature age would allow segmenting them into those older than 40 (Alfred and Edgar) and those younger than 40 (Bob, Chris and Dora). The disorder (or entropy) of a segmentation indicates how mixed or impure a segment is with respect to the target attribute. It varies between 0 and 1, with 0 indicating absolute order (or purity) and 1 indicating maximal disorder (or impurity).
Entropy (E) is defined as:
\[E = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \dots \]with \(p_i\) indicating the probability (the relative percentage) of property \(i\) within the set.
The following code calculates the entropy (according to the above formula) of the mixed set [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] (or alternatively of any other given set), showing that an impure set of 10 instances, two of which do not correspond to the target value, has a relatively high entropy of about 0.72 out of 1, or in other words a rather low information gain with respect to the target.
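A minimal Python sketch of such an entropy calculation (the function name `entropy` and the use of `collections.Counter` are illustrative choices, not the original listing):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # a pure set has zero entropy by definition
    total = len(values)
    # E = -p_1*log2(p_1) - p_2*log2(p_2) - ...
    return -sum((c / total) * log2(c / total) for c in counts.values())

mixed_set = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(entropy(mixed_set), 2))  # -> 0.72
```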
The following code snippet calculates the successive entropies of 101 randomly generated binary sets of length 10000, with the probability for the occurrence of ones ranging from 0% to 100%. The resulting plot illustrates that pure sets, in this case consisting of just zeros or just ones, have an entropy of zero, while maximally mixed sets, 50% zeros and 50% ones, have an entropy of 1.
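A self-contained Python sketch of such a sweep (the fixed random seed is an illustrative choice for reproducibility; the plotting call is indicated only as a comment, since it merely reproduces the described curve):

```python
import random
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

random.seed(0)  # fixed seed so the sketch is reproducible
probabilities = [p / 100 for p in range(101)]  # 0%, 1%, ..., 100% ones
entropies = [entropy([1 if random.random() < p else 0
                      for _ in range(10000)])
             for p in probabilities]

# Pure sets at the ends have entropy 0, the 50/50 mix entropy ~1:
print(entropies[0], round(entropies[50], 2), entropies[100])  # -> 0.0 1.0 0.0
# A plot of probabilities against entropies (e.g. with
# matplotlib.pyplot.plot) yields the inverted-U curve described above.
```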
In the above example, account balance generates an impure segmentation (with \(E = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} = 0.6 \cdot 0.737 + 0.4 \cdot 1.322 = 0.44 + 0.53 = 0.97\)): there is only one adopter in the segment "account balance > 100000". Age, in contrast, generates a pure segmentation (\(E = 0\)): all adopters are older than 40.
Based on entropy, information gain measures how much an (added) feature decreases entropy, and thus increases order, over the whole segmentation it creates.
If the original set (or the one on which a new segmentation with a new attribute is performed) is called the parent set and the new sets are called the child sets, then information gain (IG) can be defined as:
\[IG(parent, children) = E(parent) - \left[p(c_1) \cdot E(c_1) + p(c_2) \cdot E(c_2) + \dots\right]\]with \(p(c_i)\) indicating the proportion of instances belonging to child \(c_i\).
Once the information gain of each feature is known, features can be sorted by the information gain they provide with respect to the given target. By splitting each feature at an appropriate value, a decision tree can be constructed and used to determine whether certain combinations of ranges of feature values indicate the target correctly, in this case the question of adoption or non-adoption.
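Applied to the data set of the five persons above, this sorting can be sketched as follows (the split points 100000 and 40 are taken from the text; function and variable names are illustrative, not any particular library's API):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_label_sets):
    """IG = E(parent) - [p(c_1)*E(c_1) + p(c_2)*E(c_2) + ...]"""
    total = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(child) / total * entropy(child) for child in child_label_sets)

# (account balance, age, employed, adopt) for Alfred ... Edgar
data = [(450000, 45, True,  True),
        ( 35000, 34, False, False),
        (115000, 37, True,  False),
        ( 29000, 25, False, False),
        ( 72000, 51, True,  True)]
adopt = [row[3] for row in data]

splits = {"balance > 100000": lambda r: r[0] > 100000,
          "age > 40":         lambda r: r[1] > 40,
          "employed":         lambda r: r[2]}

gains = {}
for name, condition in splits.items():
    inside  = [r[3] for r in data if condition(r)]
    outside = [r[3] for r in data if not condition(r)]
    gains[name] = information_gain(adopt, [inside, outside])

# Sorting by gain puts age first: its segmentation is pure.
for name in sorted(gains, key=gains.get, reverse=True):
    print(name, round(gains[name], 3))
# -> age > 40 0.971, employed 0.42, balance > 100000 0.02
```

The ranking confirms the discussion above: splitting on age separates adopters from non-adopters perfectly, while the account balance split is barely informative.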
Consider, for example, the following four instances with three features (X, Y and Z) and two classes (I and II) in the target variable C:

X   Y   Z   C
1   1   1   I
1   1   0   I
0   0   1   II
1   0   0   II
\(X=1\) yields two times I, but also one time II: \(E_{child_1} = -(1/3) \log_2(1/3) - (2/3) \log_2(2/3) = 0.5283 + 0.3900 = 0.9183\)
\(X=0\) yields one time II: \(E_{child_2} = 0\)
\[E_{parent} = 1\] \[Gain = 1 - \frac{3}{4} \cdot 0.9183 - \frac{1}{4} \cdot 0 = 0.3113\]
\(Y=1\) yields two times I: \(E_{child_1} = 0\)
\(Y=0\) yields two times II: \(E_{child_2} = 0\)
\[E_{parent} = 1\] \[Gain = 1 - \frac{1}{2} \cdot 0 - \frac{1}{2} \cdot 0 = 1\]
\(Z=1\) yields one time I and one time II: \(E_{child_1} = 1\)
\(Z=0\) yields one time I and one time II: \(E_{child_2} = 1\)
\[E_{parent} = 1\] \[Gain = 1 - \frac{1}{2} \cdot 1 - \frac{1}{2} \cdot 1 = 0\]
So splitting on Y obviously yields the highest information gain and splitting on Z the lowest, as can easily be verified against the above table.
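The same numbers can be checked mechanically. A short Python sketch (function names are illustrative; the rows are those of the X/Y/Z table above):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_label_sets):
    """IG = E(parent) - [p(c_1)*E(c_1) + p(c_2)*E(c_2) + ...]"""
    total = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(child) / total * entropy(child) for child in child_label_sets)

rows = [(1, 1, 1, "I"), (1, 1, 0, "I"), (0, 0, 1, "II"), (1, 0, 0, "II")]
classes = [r[3] for r in rows]

gains = {}
for i, feature in enumerate("XYZ"):
    ones  = [r[3] for r in rows if r[i] == 1]
    zeros = [r[3] for r in rows if r[i] == 0]
    gains[feature] = information_gain(classes, [ones, zeros])

print({f: round(g, 4) for f, g in gains.items()})
# -> {'X': 0.3113, 'Y': 1.0, 'Z': 0.0}
```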
There are several good machine learning software collections available on the web, and many websites and software companies offer prefabricated tools. However, since machine learning systems are complex systems in a strict sense of the term, with relevant global effects emerging (often counterintuitively) from the local interaction of components, using them as "black boxes", that is, without understanding their internal mechanisms, involves the risk of getting suboptimal or even false results. One way to mitigate this problem is to use open source library packages such as Weka for Java, from the University of Waikato in New Zealand, or the scikit-learn toolkit for Python, which offer the chance to look at their internal operations.
The following information gain example is calculated with scikit-learn from a data set generated in the course of a survey of an Austrian bottom-up initiative for participation in a communal photovoltaic plant. The survey focused on the seven features environmental awareness, belief in technical progress, importance of home comforts, modernity of lifestyle, importance of social inclusion, importance of independence from mainstream energy providers, and importance of financial aspects. The last ten instances of the data set look as follows:
The feature values vary between 1 (lowest) and 5 (highest). The last column shows the target values, with 1 for adopt and 0 for non-adopt.
As can be seen from the list below, the two last features, importance of independence from mainstream energy providers and importance of financial aspects, although not highly informative, provide the most predictive information with respect to the question of whether someone will participate in the communal photovoltaic plant or not. These two features will be used in the following to explain some particular techniques of data mining. The indicator Best Split suggests values at which the particular feature could be divided in order to differentiate between the fractions of non-adopters (\(<\) Best Split) and adopters (\(\geq\) Best Split). Sorting all features with respect to the fractions of the target values (i.e. adoption in this case) generates a decision tree.
The leaves of a decision tree allow one to estimate probabilities for class membership, that is, the probability of someone belonging to the class of adopters or to the class of non-adopters. For example, if a leaf contains \(n\) positive instances (adopters) and \(m\) negative instances (non-adopters), the probability of any new instance being positive can be estimated as \(\frac{n}{n+m}\). This is called a frequency-based estimate of class membership probability.
The data set used in the above example is unusually small for machine learning. It consists of just 236 instances, so it is no wonder that the information gain of the features is rather small. Using small data sets bears the risk that some leaves of the decision tree contain only few instances, in which case the frequency-based estimate can be massively distorted. A leaf with two positive instances and no negative one would have the same estimate as a leaf with 20 positive instances and no negative one.
In order to smooth this distortion, one can use the so-called Laplace correction, which moderates the influence of leaves with only few instances. Its formula for binary class probability estimation is:
\[p(c)=\frac{n+1}{n+m+2}\]
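A short numeric illustration of the correction (function names are illustrative), comparing the raw frequency-based estimate with the Laplace-corrected one for the two pure leaves mentioned above:

```python
def frequency_estimate(n, m):
    """Raw frequency-based class probability estimate n / (n + m)."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# Two pure leaves: 2 positives vs. 20 positives, no negatives in either.
for n in (2, 20):
    print(n, frequency_estimate(n, 0), round(laplace_estimate(n, 0), 3))
# -> 2 1.0 0.75
# -> 20 1.0 0.955
```

Both leaves get the same raw estimate of 1.0, but the Laplace correction pulls the two-instance leaf down to 0.75 while leaving the 20-instance leaf close to 1, reflecting the greater confidence the larger leaf deserves.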