Text mining

Text mining, also known as text analytics, is the (often automated) process of deriving structured information and meaningful numeric indices from mostly unstructured text, often gathered from the internet, so that this information can be analyzed with statistical and machine learning methods. Text mining usually involves several natural language processing steps, including tokenization, stop word removal, stemming, parsing, categorization, text clustering, word frequency analysis, and part-of-speech tagging, among others.

The following example, a text mining survey of a scientific journal, gives a brief overview of some of the possibilities that text mining offers.

Mining an ecological open access journal

In a first data retrieval step, the complete collection of papers from an open access journal on ecological conservation was downloaded from the internet. The resulting collection comprises 475 scientific papers in PDF format, the first from October 2002 and the last from July 2014, with a total data volume of 345 MB. A Python script, drawing mainly on the Natural Language Toolkit (the NLTK 3.0 module), was used to transform these papers into a plain-text file of about 22 MB. Removing stop words (short words such as articles or prepositions that carry little to no relevant information) along with some special words that repeat in every paper, such as the journal's name, and then converting the texts into NLTK's own text format produced a text corpus of 1,798,948 words.
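The stop word removal described above can be sketched in plain Python. This is a minimal illustration, not the script used for the survey: the stop word set, the custom word set, the regular-expression tokenizer, and the sample sentence are all assumptions made for the example. NLTK ships a much longer English stop word list, which requires a separate data download.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are"}
# Hypothetical corpus-specific words, e.g. parts of the journal's name.
CUSTOM_WORDS = {"journal"}


def preprocess(raw_text):
    """Lowercase the text, split it into alphabetic tokens,
    and drop stop words and corpus-specific words."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in CUSTOM_WORDS]


# Made-up sample sentence, not taken from the journal.
sample = "The species richness of the forest is declining in the Journal area."
tokens = preprocess(sample)
print(tokens)
```

A token list like this one is the kind of input that NLTK's text objects and frequency tools operate on.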

In a first analytical step, a frequency distribution of the 30 most frequent words in the journal was generated.
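A word frequency count of this kind can be sketched with the standard library's Counter. NLTK's FreqDist class is a subclass of Counter and offers the same most_common() method, so the idea carries over directly. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for the preprocessed corpus (assumption).
tokens = ["species", "forest", "species", "habitat", "forest", "species"]

freq = Counter(tokens)

# Ask for the 30 most frequent words; fewer are returned if the
# vocabulary is smaller than that.
top = freq.most_common(30)
for word, count in top:
    print(f"{word:10s} {count}")
```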

Frequency distribution

Frequency distributions are not confined to single-word frequencies. The following plot shows a distribution of the most frequent two-word combinations in the text corpus.
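Counting two-word combinations (bigrams) works the same way once adjacent tokens are paired up. The sketch below pairs tokens with zip; the helper name and the token list are assumptions for illustration. NLTK provides a bigrams() function that yields pairs of adjacent tokens in the same manner.

```python
from collections import Counter


def bigrams(tokens):
    """Yield each pair of adjacent tokens (hypothetical helper)."""
    return zip(tokens, tokens[1:])


# Made-up tokens, not taken from the journal.
tokens = ["protected", "area", "species", "richness", "protected", "area"]

bigram_freq = Counter(bigrams(tokens))
for pair, count in bigram_freq.most_common(3):
    print(" ".join(pair), count)
```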

Frequency distribution - two words

NLTK's analytical tools allow for a wide variety of interesting text surveys. One example is the NLTK function concordance(), which makes it possible to investigate the contexts in which a word appears. The function generates the following output, in this case for the word 'species'. (Keep in mind that stop words were removed in this example.)
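To illustrate what a concordance does, here is a small keyword-in-context sketch in plain Python. It is a simplified stand-in and not NLTK's implementation; the function name, the width parameter (counted in tokens), and the token list are assumptions. With NLTK, a comparable listing is printed by calling concordance() on a Text object built from the token list.

```python
def concordance(tokens, target, width=3):
    """Return one line per occurrence of `target`, showing up to
    `width` tokens of context on each side (hypothetical helper)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keyword lines up.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines


# Made-up tokens with stop words already removed.
tokens = ["rare", "species", "occur", "forest", "invasive", "species", "threaten"]
lines = concordance(tokens, "species", width=2)
for line in lines:
    print(line)
```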


The results of this survey can be used to generate frequency distributions of the contexts in which a certain word appears, in this case again the word 'species', considering only the words immediately before and after the concordanced word.
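A context frequency count of this kind could look as follows. The sketch counts the word directly before and directly after each occurrence of the target word; the function name and the token list are assumptions for illustration.

```python
from collections import Counter


def context_counts(tokens, target):
    """Count the words immediately before and after each
    occurrence of `target` (hypothetical helper)."""
    before, after = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


# Made-up tokens, not taken from the journal.
tokens = ["plant", "species", "richness", "bird", "species",
          "richness", "plant", "species", "diversity"]
before, after = context_counts(tokens, "species")
print("before:", before.most_common())
print("after: ", after.most_common())
```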

Frequency distribution in context

The Python module networkx makes it possible to depict context interdependencies as networks.
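As a minimal sketch of this step, context counts can be turned into a weighted networkx graph in which the central word is linked to each of its context words. The counts below are made up, and using the count as the edge weight is an assumption about how such a network might be built, not a description of the original survey.

```python
import networkx as nx

# Made-up context counts for the word 'species' (assumption).
context_freq = {"plant": 2, "bird": 1, "richness": 2, "diversity": 1}

G = nx.Graph()
for word, count in context_freq.items():
    # One edge per context word; the count becomes the edge weight.
    G.add_edge("species", word, weight=count)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print(sorted(G.neighbors("species")))
```

The resulting graph can then be drawn with networkx's drawing functions, which rely on matplotlib being installed.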

Frequency distribution as network

Iterating through contexts and finding the contexts of the context words makes it possible to generate rich networks of word-context usage. With graph visualization tools like Gephi, these networks can be displayed and analyzed in various ways.
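The iteration over contexts of context words can be sketched as a breadth-first expansion from a seed word. Everything in this example is an assumption made for illustration: the helper names, the depth and top_n parameters, and the token list. The returned edge dictionary could be loaded into a networkx graph and written out with networkx's write_gexf() function, which produces a file format that Gephi can open.

```python
from collections import Counter, deque


def neighbours(tokens, target):
    """Count words directly before or after `target` (hypothetical helper)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1
    return counts


def context_network(tokens, seed, depth=2, top_n=3):
    """Expand outward from `seed`, following the `top_n` most frequent
    context words of each visited word, up to `depth` levels."""
    edges = {}
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        word, level = queue.popleft()
        if level >= depth:
            continue
        for ctx, count in neighbours(tokens, word).most_common(top_n):
            # Sorted tuple as key, so each undirected edge is stored once.
            edges[tuple(sorted((word, ctx)))] = count
            if ctx not in seen:
                seen.add(ctx)
                queue.append((ctx, level + 1))
    return edges


# Made-up tokens, not taken from the journal.
tokens = ["plant", "species", "richness", "plant",
          "species", "diversity", "forest", "plant"]
edges = context_network(tokens, "species", depth=2)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```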

Frequency distribution as network