Non-linear data discrimination

Different to linear data discrimination, non-linear discrimination considers more complex possibilities to distinguish classes in data than just straight lines, respectively linear hyper planes in higher dimensions.

Naive Bayes

An interesting classifier in this respect is Naive Bayes. It builds on the famous rule of the English statistician Thomas Bayes (1701-1761) and is often used in email-spam-filters to classify mail into spam and non-spam.

In order to understand the working principle of this classifier, lets consider the probability of someone from a rural village investing in a photo voltaic installation given some kind of evidence \(E\) - for example high environmental awareness of this person. This could be indicated as \(p(investment \mid evidence)\), reading "the probability of \(investment\) given that \(evidence\)" .

In general, if there are two events, \(A\) and \(B\), with the probabilities \(p(A)\) and \(p(B)\), the so called joint probability of \(p(AB)\), that is, the probability that both events occur, can be calculated as \(p(A)*p(B \mid A)\) or \(p(B)*p(A \mid B)\).

This implies that \(p(A)*p(B \mid A) = p(B)*p(A \mid B)\) and so, if both sides are divided by \(p(A)\), one gets \(p(B \mid A) = \frac{p(A \mid B)* p(B)}{p(A)}\), the famous Bayes' Rule, named after Thomas Bayes.

Lets assume that \(B\) is some kind of Hypothesis \(H\), for example the hypothesis that someone will invest in a PV-installation, and \(E\) is an evidence, for example the high environmental awareness of this person. Renaming Bayes' rule gives \(p(H \mid E) = \frac{p(E \mid H)* p(H)}{p(E)}\). The advantage of this transformation is that all three terms in it might be more easily assessed than the probability of someone investing in PV given high environmental awareness. The probability of high environmental awareness given that someone owns a PV-installation, as well as the probabilities of PV-installations and of high environmental awareness in general, that is, its occurrence in the overall population might be empirically detectable.

Applying Bayes' rule to data

Now lets call the event that a target variable will take on a particular value \(C = c\), for example that a person adopts a PV-installation (or that an incoming email is spam). Rewriting Bayes' rule gives \(p(C = c \mid E) = \frac{p(E \mid C = c)* p(C = c)}{p(E)}\). In the data set, the evidence \(E\) would be the feature vector of the person (that is, the vector containing all know characteristics (i.e. attributes) of this person, see here). \(p(C = c \mid E)\) is called the posterior probability. \(p(C = c)\), the so called prior probability, can be taken as the "base rate" of \(c\), that is, the prevalence of \(c\) in the whole population. The term \(p(E \mid C = c )\), which is the so called likelihood of seeing the evidence \(E\) when \(C=c\), can be computed from the data as the percentage of examples of class \(c\) that have feature vector \(E\). Finally \(p(E)\) is the likelihood of \(E\) in general, calculated as the percentage occurrence of \(E\) among all examples.

One problem with this calculation however is the fact that the feature vectors in the data can be very specific. Usually it gets more specific, the larger the vector is. Large feature vectors hence might not allow to estimate their probability of occurrence with any confidence, since hardly any vectors will be exactly the same. To overcome this, it is often assumed that the features are conditionally independent given that \(C=c\), meaning that a probability \(p(e_i)\) does not say anything about the probability \(p(e_j)\) and therefore (if we write \(c\) instead of \(C=c\) for the sake of simplicity) Bayes' rule can be regarded as \(p(c\mid E)=\frac{p(e_1\mid c)*p(e_2\mid c)*...*p(e_n\mid c)*p(c)}{p(E)}\). Each of the \(p(e_i\mid c)\) terms can be computed directly from the data. Instead of looking for the match of entire feature vectors, it can be derived from the proportion of the time an individual feature \(e\) is found in the fraction of \(C=c\), that is for example in the fraction of PV-adopters.

An advantage of Naive Bayes is that it is an "incremental learner", meaning that it updates its "knowledge" in real time with every single new instance that is added to the data set. It does not have to be started anew. Each new adopter or non-adopter, or each new mail or spam-mail adds more information to the system. The plot below shows a classification of the data used here, done with the Naive Bayes-algorithm as provided by scikit.learn (GaussianNB).


Nearest Neighbor

Another way to generate a non-linear discrimination function through the space of data instances is to build on the principle of similarity of data points, or more precise, the similarity of feature vectors. The reasoning behind this is simply that data instances inhabiting the same region in data space might have more in common than data instances from different regions. Since feature vectors have the mathematical form of coordinates, albeit maybe in a higher dimensional space, one common way to compute similarity is Euclidian distance.

According to the Pythagorean theorem, in two dimensions, the Euclidian distance of two points \(A\) and \(B\) with vectors \((x_A, y_A)\) and \((x_B, y_B)\) can be calculated with the formula \(\sqrt{(x_B - x_ A)^2 + (y_B - y_A)^2}\). In higher dimensions calculation proceeds analogously. The distance of data points hence can be compared and used to differentiate data points in respect to their distance from each other.

So in the case of predictive modeling, that is, in the case of an attempt to predict the behavior of a so far unknown data instance from known data (will it adopt PV or will it not adopt), one would look at (already assessed) close-by data points, so called neighbors, and orientate prediction on the behavior of these nearest neighbors. If someone's feature vector is close (by means of Euclidian distance for instance) to three other people who adopted PV, chances are high that this someone will also adopt.

One question concerning this method is, how many neighbors shall be considered. Since the number of neighbors is often given as \(k\) the methodology is also called \(k\)-Nearest Neighbors or shorthand \(k\)-NN.

Another problem concerns the question what happens if one considers, say 4 neighbors and two are adopters and two others are non-adopters. In order to break such ties \(k\) is often taken to be an odd number and the prediction is then orientated on the majority rule. Another often used possibility however is to weigh neighbors in respect to their distance. Intuitively, one would agree that the closer ones of the \(k\) neighbors should have more predictive power than the ones further away. The following plot shows the same data as above with a discriminant generated by a nearest neighbor algorithm, left with \(k=3\), middle with \(k=5\) and right with \(k=7\). As you can see, both, the left and the right result, classify one point incorrectly. Only the middle calculation with \(k=5\) gets all points right.