Machine learning

Machine learning is a branch of artificial intelligence that concerns the construction and study of systems made to automatically screen (often big amounts of) data ("big data") in order to discover regularities in regard to which other regularities in so far un-screened data can be predicted. An often quoted example is a machine learning system trained on email messages to learn to distinguish between spam and non-spam.

Having a clear target in respect to which a machine (actually an algorithm) can be trained - such as the distinction between spam and not spam - is an instance of supervised machine learning. A first step in this kind of machine learning often is to find out which parts of the data will gain most information in respect to the target classification. The so called information gain methodology uses the concept of Shannon-entropy.

Entropy, information gain and decision trees

Consider a data set of the following form and the task to derive from it (to "learn") what kind of attributes of the surveyed persons will best "predict" whether someone will invest into buying (here: adopting) a photo voltaic installation for his home.

Account balance
Yes Yes
115000 37
29000 25
72000 51

Each row indicates an instance of the data set, each column a feature and each cell an attribute. Rows represent so called feature vectors. The last column contains the so called target feature, that is, the feature for which the causes are sought and are assumed to be derivable from the other features. In the example case, this is the question which of the three features (account balance, age or employment status) provides most predictive information about whether someone will adopt photo voltaic or not. One way to approach an answer to this question is to use a technique called information gain through entropy.

Entropy is a measure of disorder, suggested by Claude E. Shannon (1948), and can be used to classify the segmentation as provided by the features in the data set.

For example, the feature account balance allows segmenting the instances (i.e. the surveyed persons) into such with more than Euro 100.000 balance (Alfred and Chris) and less than Euro 100.000 (Bob, Dora and Edgar), whereas the feature age allows segmenting the instances into such older than 40 (Alfred and Edgar) and younger than 40 (Bob, Chris and Dora). The disorder (or entropy) of a segmentation indicates how mixed or impure a feature is with respect to the target attribute. It varies between 0 and 1, with 0 indicating absolute order (or purity) and 1 indicating disorder (or impurity).

Entropy (E) is defined as:

\[E = -p_1 log_2(p_1) - p_2 log_2(p_2)- ... \]

with \(p_i\) indicating the probability (or proportion) of property \(i\) within the set. In respect to account balance for instance, two properties could be "less than" or "more than Euro 100.000" on the account.

Calculating entropy with Python:

The following code calculates the entropy (according to the above formula) of the mixed set [1 ,1 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ] (or alternatively for any other given set), showing that an impure feature set of 10 attributes with two of them not corresponding to the target value would have a relatively high entropy of about 0.72 of 1, or in other words a rather low information gain in respect to the target.

Entropy 1

The following code snippet calculates the successive entropies of 100 randomly generated binary sets of length 10000 with probabilities for the occurrence of ones in them ranging from 0 to 100. The plot beneath illustrates that pure sets - in this case consisting of just zeros or just ones - have an entropy of zero while maximally mixed sets - 50% zeros and 50% ones - have an entropy of 1.

Entropy 2

Pureness and information gain

In the above example, account balance generates an impure segmentation (with \(E = -\frac{3}{5}* log_2 \frac{3}{5} + \frac{2}{5}* log_2 \frac{2}{5} = -(0.6*-0.74 + 0.4*-1.32)= -(-0.44 -0.53)=0.97  \)) -- there is only one adopter classified in the segment "account balance > 100.000". But age generates a pure segmentation (\(entropy = 0\)), indicating that all adopters are older than 40.

Based on entropy, information gain measures how much an (added) feature decreases the entropy over the whole segmentation.

If the original set (or the one on which a new segmentation with a new attribute is performed) is called the parent set and the new sets are called children sets, the information gain (IG) with these children can be defined as:


with \(\pi(c_i)\) indicating the proportion of instances belonging to a child.

Once the information gain of each feature is known, features can be sorted in respect to the information gain they provide. By splitting each feature at an appropriate value a decision tree can be constructed and used to determine whether certain combinations of ranges of feature values (attributes) would indicate the target correctly - in this case the question of photo voltaic adoption or not-adoption.

Decision tree

A simple example

Three features at the instances and two classes in the target variable




0 1

0 0


1st case: split on attribute X

\(X=1\) yields two time I, but also one time II: \(E_{child_1}= -(1/3)log_2(1/3)-(2/3)log_2(2/3) = 0.5284 + 0,39=0.9184\)

\(X=0\) yields one time II: \(E_{child_2}= 0\)

\[E_{parent}=1\] \[Gain=1-\frac{3}{4}*0.9184-\frac{1}{4}*0=0.3112\]

2nd case: split on attribute Y

\(y=1\) yields two time I: \(E_{child_1}= 0\)

\(Y=0\) yields two times II: \(E_{child_2}= 0\)

\[E_{parent}=1\] \[Gain=1-\frac{1}{2}*0-\frac{1}{2}*0=1\]

3rd case: split on attribute Z

\(Z=1\) yields on time I and one time II: \(E_{child_1}= 1\)

\(X=0\) yields on time I and one time II: \(E_{child_2}= 1\)

\[E_{parent}=1\] \[Gain=1-\frac{1}{2}*1-\frac{1}{2}*1=0\]

So obviously, splitting on Y yields the highest information gain, and splitting on Z the worst, which is easily deduced from the above table.

Tools and an example

There are several good machine learning software collections available on the web and many websites and software companies offer pre-fabricated tools. However, since machine learning systems are complex systems in a strict sense of the word - with global effects emerging (often counter-intuitively) from local interaction of components - using them as "black boxes", that is without understanding their internal mechanisms, involves the risk of getting sub-optimal or even false results. One way to mitigate this problem is to use open source library packages like Weka for Java from the New Zealand University of Waikato or the scikit.learn toolkit for Python, which offer the chance to check internal operations.

The following information gain example is calculated with scikit.learn from a data set generated in the course of a survey of an Austrian bottom-up initiative for participation in a communal photo-voltaic plant. The survey focused on the seven features environmental awareness, belief in technical progress, importance of home comforts, modernity of lifestyle, importance of social inclusion, importance of independence from mainstream energy providers, importance of financial aspects.The last ten instances of the dataset (in terms of the numbers in the left-most column) are shown below:


The feature values (attributes) vary between 1 ("not relevant") and 5 ("very relevant"). The last column shows the target values, with 1 indicating "adoption" and 0 "non-adoption".

As can be seen from the list below, the two last features importance of independence from mainstream energy providers and importance of financial aspects, although not highly informative, provide most predictive information in respect to the question whether someone will participate in the communal photo-voltaic plant or not. These two features will be used in the following to illustrate other particular techniques of data mining. The indicator Best Split suggests values with the help of which the feature can be divided in order to distinguish the fraction of non-adopters (\(<\) Best Split) from the one of adopters (\(\geq\) Best Split). Sorting all features in respect to information gain (with respect to adoption in this case), generates a decision tree.


Laplace correction

The leaves of a decision tree allow to estimate probabilities for class memberships, that is, for the probability of an instance belonging to the class of adopters or to the class of non-adopters. For example, if a leaf contains \(n\) positive instances (adopters) and \(m\) negative instances (non-adopters), the probability of any new instance being positive can be estimated as \(\frac{n}{n+m}\). this is called a frequency-based estimate of class membership probability.

The data set used in the above example is unusual small for machine learning. It consists of just 236 instances, so no wonder that the information gain of features is rather small. Using small data sets bears the risk that some leaves of the decision tree contain only few instances. In this case the frequency-based estimate can be massively twisted. A leaf with two positive instances and no negative one would have the same estimate as a leaf with 20 positive instances and no negative one.

In order to smooth this distortion, one can use the so called Laplace correction which moderates the influence of leaves with only few instances. Its formula for binary class probability estimation is: