Machine learning

Machine learning is a branch of artificial intelligence that concerns the construction and study of systems made to automatically screen (often big amounts of) data ("big data") in order to discover regularities in regard to which other regularities in so far un-screened data can be predicted. An often quoted example is a machine learning system trained on email messages to learn to distinguish between spam and non-spam.

Having a clear target in respect to which a machine (actually an algorithm) can be trained - such as the distinction between spam and not spam - indicates and instance of supervised machine learning. A first step in this kind of machine learning often is to find out which parts of data will gain most information in respect to the target classification. The so called information gain methodology uses the concept of Shannon-entropy.

Entropy, information gain and decision trees

Consider a data set of the following form and the task to learn from it what kind of attributes of the persons registered will predict best whether someone will invest into buying (= adopt) a photo voltaic installation for her home.

Account balance
Yes Yes
115000 37
29000 25
72000 51

Each row indicates an instance of the data set, each column a feature and each cell an attribute. Rows represent so called feature vector. The last column contains the so called target attributes, that is, the attributes for which the causes are sought and are suspected to be somewhere hidden in the other attributes. In this case, this is the question, which of the three other features (account balance, age or employed) provides most predictive information about whether someone will adopt photo voltaic or not. One possibility to approach a solution for this task is to use a technique called information gain through entropy.

Entropy is a measure of disorder, suggested by Claude E. Shannon (1948), and can be used to classify a segmentation which can be generated with one of the attributes in the data set.

For example, the feature account balance would allow segmenting the instances (i.e. the registered persons) into such with more than 100000 balance (Alfred and Chris) and less than 100000 (Bob, Dora and Edgar), whereas the attribute age would allow segmenting the instances (i.e. the registered persons) into such older than 40 (Alfred and Edgar) and younger than 40 (Bob, Chris and Dora). The disorder (or entropy) of a segmentation indicates how mixed or impure a segment is with respect to the target attribute. It varies between 0 and 1, with 0 indicating absolute order (or purity) and 1 indicating disorder (or impurity).

Entropy (E) is defined as:

\[E = -p_1 log_2(p_1) - p_2 log_2(p_2)- ... \]

with \(p_i\) indicating the probability (the relative percentage) of property \(i\) within the set.

Calculating entropy with Python:

The following code calculates the entropy (according to the above formula) of the mixed set [1 ,1 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ] (or alternatively for any other given set), showing that an impure feature set of 10 instances with two of them not corresponding to the target value would have a relatively high entropy of about 0.72 of 1, or in other words a rather low information gain in respect to the target.

Entropy 1

The following code snippet calculates the successive entropies of 100 randomly generated binary sets of length 10000 with probabilities for the occurrence of ones in them ranging from 0 to 100. The plot beneath illustrates that pure sets - in this case consisting of just zeros or just ones - have an entropy of zero while maximally mixed sets - 50% zeros and 50% ones - have an entropy of 1.

Entropy 2

Pureness and information gain

In the above example, account balance generates an impure segmentation (with \(E = -\frac{3}{5}* log_2 \frac{3}{5} + \frac{2}{5}* log_2 \frac{2}{5} = -(0.6*-0.74 + 0.4*-1.32)= -(-0.44 -0.53)=0.97  \)) -- there is only one adopter classified in the segment "account balance > 100000". But age generates a pure segmentation (\(entropy = 0\)) -- all adopters are older than 40.

Based on entropy, information gain measures how much an (added) feature improves or decreases entropy over the whole segmentation.

If the original set (or the one on which a new segmentation with a new attribute is performed) is called the parent set and the new set is called the child set, then information gain (IG) can be defined as:


with \(p(c_i)\) indicating the proportion of instances belonging to this child.

Once the information gain of each feature is known, features can be sorted in respect to the information gain they provide in respect to the given target. By splitting each feature at an appropriate value a decision tree can be constructed and used to determine whether certain combinations of ranges of feature values would indicate the target correctly - in this case the question of adoption or not-adoption.

Decision tree

A simple example

Three features at the instances and two classes in the target variable




0 1

0 0


1st case: split on attribute X

\(X=1\) yields two time I, but also one time II: \(E_{child_1}= -(1/3)log_2(1/3)-(2/3)log_2(2/3) = 0.5284 + 0,39=0.9184\)

\(X=0\) yields one time II: \(E_{child_2}= 0\)

\[E_{parent}=1\] \[Gain=1-\frac{3}{4}*0.9184-\frac{1}{4}*0=0.3112\]

2nd case: split on attribute Y

\(y=1\) yields two time I: \(E_{child_1}= 0\)

\(Y=0\) yields two times II: \(E_{child_2}= 0\)

\[E_{parent}=1\] \[Gain=1-\frac{1}{2}*0-\frac{1}{2}*0=1\]

3rd case: split on attribute Z

\(Z=1\) yields on time I and one time II: \(E_{child_1}= 1\)

\(X=0\) yields on time I and one time II: \(E_{child_2}= 1\)

\[E_{parent}=1\] \[Gain=1-\frac{1}{2}*1-\frac{1}{2}*1=0\]

So obviously splitting on Y yields the highest information gain, and splitting on Z the worst, which is easily deduced from the above table.

Tools and an example

There are several good machine learning software collections available on the web and many websites and software companies offer pre-fabricated tools. However, since machine learning systems are complex systems in a strict sense of the term - with relevant global effects emerging (often counter-intuitively) from local interaction of components - using them as "black boxes", that is without understanding their internal mechanisms, involves the risk of getting sub-optimal or even false results. One way to mitigate this problem is to use open source library packages like Weka for Java from the New Zealand University of Waikato or the scikit.learn toolkit for Python, which offer the chance to look at their internal operations.

The following information gain example is calculated with scikit.learn from a data set generated in the course of a survey of an Austrian bottom-up initiative for participation in a communal photo-voltaic plant. The survey focused on the seven features environmental awareness, belief in technical progress, importance of home comforts, modernity of lifestyle, importance of social inclusion, importance of independence from mainstream energy providers, importance of financial aspects.The last ten instances of the dataset look like follows:


The feature values vary between 1 (lowest) and 5 (highest). The last column shows the target values, with 1 for adopt and 0 for not-adopt.

As can be seen from the list below, the two last features importance of independence from mainstream energy providers and importance of financial aspects, although not highly informative, provide most predictive information in respect to the question whether someone will participate in the communal photo-voltaic plant or not. These two features will be used in the following to explain some particular techniques of data mining. The indicator Best Split suggests values at which the particular feature could be divided in order to differentiate between the fractions of non-adopters (\(<\) Best Split) and adopters (\(\geq\) Best Split). Sorting all features in respect to the fractions of the target values (i.e. adoption in this case), generates a decision tree.


Laplace correction

The leaves of a decision tree allow to estimate probabilities for class memberships, that is, for the probability of someone belonging to the class of adopters or to the class of non-adopters. For example, if a leaf contains \(n\) positive instances (adopter) and \(m\) negative instances (non-adopters), the probability of any new instance being positive can be estimated as \(\frac{n}{n+m}\). this is called a frequency-based estimate of class membership probability.

The data set used in the above example is unusual small for machine learning. It consists of just 236 instances, so no wounder that the information gain of features is rather small. Using small data sets bears the risk that some leaves of the decision tree contain only few instances. In this case the frequency-based estimate can be massively twisted. A leaf with two positive instances and no negative one would have the same estimate as a leaf with 20 positive instances and no negative one.

In order to smooth this distortion, one can use the so called Laplace correction which moderates the influence of leaves with only few instances. Its formula for binary class probability estimation is: