Welcome to the seventh blog in a series on machine learning. Once again, this material is a supplement to the introductory course in machine learning on Udacity. We will be reviewing the basics of natural language processing as well as categorizing the RokkinCat blog posts with clustering.

Natural language processing (NLP) is a method to translate between computer and human languages. It is a method of getting a computer to understandably read a line of text without the computer being fed some sort of clue or calculation. ~ Techopedia

Simply put, natural language processing (NLP) uses text or speech as your feature (input), and yields some label (output) for classifying data. In fact, there is a good chance that you have experienced natural language processing at some point today. Siri and Google are both prime examples that illustrate the power of natural language processing. They both take your inquiry (whether it be from typing or speech) and execute some action.

Here we will be exploring the different ways of classifying data as well as operations that you can do to improve your results.


There are a variety of operations that you can perform on your dataset to improve classification. Some of these include ngrams, normalization, stemming, and tokenization.

While you can attempt to recreate some of these operations yourself (I am guilty of this), sometimes it is best to leave the heavy lifting to the professionals. For this reason, I will be using the Natural Language Toolkit (NLTK). I recommend you try it out as well. It is a pretty neat tool with tons of different strategies for analyzing text.


Tokenizing

Tokenizing is one of the most critical operations in NLP and should be one of the first filters that you apply to your data. Tokenizing is the process of converting strings of text into data structures (such as a list of strings) so that the data is easier to manipulate.

Jason went to the Chinese festival

We can tokenize the above sentence by simply using Python's built-in str.split, or we can use some of the more sophisticated tokenizers provided by NLTK.

sentence = "Jason went to the Chinese festival"
tokens = sentence.split()
print(tokens)
# ['Jason', 'went', 'to', 'the', 'Chinese', 'festival']
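One limitation of split() worth seeing in action: it only breaks on whitespace, so punctuation stays glued to the neighboring word. The sketch below is a minimal regex-based stand-in using only the standard library, not NLTK's implementation. The tokenize helper and the trailing period on the sentence are my own additions for illustration.

```python
import re

def tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

# The trailing period is added here purely for illustration.
sentence = "Jason went to the Chinese festival."

naive_tokens = sentence.split()
regex_tokens = tokenize(sentence)

print(naive_tokens)   # the last token keeps its period attached
print(regex_tokens)   # the period becomes a token of its own
```

Whether punctuation should be kept as separate tokens or dropped entirely depends on your domain, so treat the pattern above as a starting point.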


Ngrams

If you are looking to put additional stress on the pairing of words, then the ngram is an important strategy to consider. With this strategy, you can analyze the pairing of words to understand the context of a sentence better. Consider the following statements.

Josh wants to get Chinese

Alex wants to get Chinese food

Jason went to the Chinese festival

As humans, we can understand the context of these sentences. However, this is much more difficult for a computer. The ngram strategy breaks each sentence up into overlapping groups of n consecutive words. If we were to transform these sentences into bigrams (n = 2), we would get:

[["Josh", "wants"], ["wants", "to"], ["to", "get"], ["get", "Chinese"]]

[["Alex", "wants"], ["wants", "to"], ["to", "get"], ["get", "Chinese"], ["Chinese", "food"]]

[["Jason", "went"], ["went", "to"], ["to", "the"], ["the", "Chinese"], ["Chinese", "festival"]]

Notice how each word is paired with the words around it? These pairings will ultimately help a computer understand the context. Notice how the first two sentences both contain ["get", "Chinese"]? We can also see that the second sentence contains ["Chinese", "food"]. Because the two sentences share so many pairs, we can reasonably infer that the first sentence is about “food” as well, even though the word never appears in it.
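The bigrams above can be generated with a few lines of plain Python. This is a minimal sketch rather than NLTK's implementation. The ngrams helper is a name I made up for illustration, and it produces lists to match the notation used above.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

first = ngrams("Josh wants to get Chinese".split())
second = ngrams("Alex wants to get Chinese food".split())

# Pairs that appear in both sentences hint at a shared context.
shared = [pair for pair in first if pair in second]

print(first)
print(shared)
```

Passing n=3 would give trigrams instead, at the cost of fewer matches between sentences.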


Stemming

So what happens when we have two sentences that mean the same thing, but are represented with different grammatical structures?

I like to code

I like coding

Even though code and coding are equivalent in context, the differing grammatical conjugations will confuse the computer. The computer will think that code and coding are two completely different words – and therefore have two different contexts.

Stemming is an operation that processes a word so that the morphological affixes (prefixes, suffixes, etc.) do not influence the meaning of a sentence. Using NLTK, we can properly stem the above sentences.

from nltk.stem.lancaster import LancasterStemmer

st = LancasterStemmer()
one = "I like coding"
two = "I like to code"

# Stem every word, then join the stems back into a sentence
res1 = " ".join(st.stem(w) for w in one.split())
res2 = " ".join(st.stem(w) for w in two.split())

print(res1)
print(res2)
# i lik cod
# i lik to cod

With this example, we transformed like into lik and code into cod. You may also spot an issue in the above result: “cod” is a real word with a completely different meaning. While I personally like both “code” and “cod”, I am sure this does not apply to everyone 😉. The stemmed output therefore misrepresents the context of the sentence. This means that we will need to either find a different stemmer or accept this error. Accepting the error is reasonable when you are working in a limited domain where you do not expect to find conflicting words.

NLTK contains many different types of stemmers, and also gives you the option to specify your own prefixes/suffixes to remove. This can be especially useful when you need to have additional customizations in your domain.
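As one example of trying a different stemmer, here is a minimal sketch that swaps in NLTK's PorterStemmer, which is generally less aggressive than the Lancaster stemmer. I am assuming NLTK is installed. The Porter rules should map “code” and “coding” to the same stem, but run it on your own vocabulary before relying on it.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

one = "I like coding"
two = "I like to code"

# Stem each word individually and keep the results as lists
stems_one = [porter.stem(w) for w in one.split()]
stems_two = [porter.stem(w) for w in two.split()]

print(stems_one)
print(stems_two)

# Both forms should collapse to a single shared stem
print(porter.stem("coding") == porter.stem("code"))
```

No stemmer is right for every domain, so it is worth comparing a couple of them on a sample of your data.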

Bag of Words

Moving onward! Now that we have covered some of the basic text manipulation strategies, we can start classifying some data.

The bag of words technique focuses on evaluating the frequency of words to understand the context. A simple way to illustrate this is to do some examples (who would have thought? 😱 )

Josh wants to get Chinese

Alex wants to get Chinese food

Jason went to the Chinese festival

In short, all we do is count the number of times each word appears!

Josh: 1
wants: 2
to: 3
get: 2
Chinese: 3
Alex: 1
food: 1
Jason: 1
went: 1
the: 1
festival: 1

Furthermore, since we cannot do arithmetic on the words themselves, it is common to treat each unique word as an index into an array. Our word counts are then represented as:

dictionary = ["Josh", "wants", "to", "get", "Chinese", "Alex", "food", "Jason", "went", "the", "festival"]

0: 1
1: 2
2: 3
3: 2
4: 3
5: 1
6: 1
7: 1
8: 1
9: 1
10: 1
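The counting and indexing steps above can be sketched in plain Python. This is a hand-rolled illustration rather than what Sklearn does internally. The vectorize helper and the index_of lookup are names I introduced for the example.

```python
from collections import Counter

sentences = [
    "Josh wants to get Chinese",
    "Alex wants to get Chinese food",
    "Jason went to the Chinese festival",
]

tokens = [word for s in sentences for word in s.split()]
counts = Counter(tokens)

# dict.fromkeys keeps first-seen order, which reproduces the dictionary above
dictionary = list(dict.fromkeys(tokens))
index_of = {word: i for i, word in enumerate(dictionary)}

def vectorize(sentence):
    """Count how often each dictionary word occurs in one sentence."""
    vector = [0] * len(dictionary)
    for word in sentence.split():
        vector[index_of[word]] += 1
    return vector

# Counts over all three sentences, indexed by dictionary position
corpus_vector = [counts[word] for word in dictionary]

print(corpus_vector)
print(vectorize(sentences[0]))
```

In practice each document gets its own vector like the one returned by vectorize, and those per-document vectors are what a classifier or clustering algorithm consumes.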

Sklearn provides tools for building bag of words representations, along with more sophisticated ways of analyzing text. One of these, known as TF-IDF (term frequency-inverse document frequency), weights less common words more heavily than common ones.
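To see that weighting in action, here is a small sketch that runs Sklearn's TfidfVectorizer with its default settings over our three example sentences. “Chinese” appears in every sentence while “festival” appears in only one, so “festival” should receive the larger weight in the third sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Josh wants to get Chinese",
    "Alex wants to get Chinese food",
    "Jason went to the Chinese festival",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(sentences).toarray()

# vocabulary_ maps each (lowercased) word to its column index
vocab = vectorizer.vocabulary_

third = weights[2]
print("chinese: ", third[vocab["chinese"]])
print("festival:", third[vocab["festival"]])
```

Note that the vectorizer lowercases its input by default, which is why the lookups use “chinese” rather than “Chinese”.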

Using this knowledge, as well as the knowledge learned from the clustering post, we will be able to categorize the RokkinCat blog posts.

Blog Clustering

Even though the posts on our RokkinCat blog are pretty tidy and already categorized appropriately, why not recategorize them with machine learning?

First, I gathered all of the blog posts that we have written over the last few years. Since there were only 26 of them, I manually copy-pasted them to local files. I also gave each file a prefix so that I could easily determine which category the post belonged to:

e: elixir
ml: machine learning
m: miscellaneous
h: hack n tell

After that, I determined how many clusters we should use. I decided to use 5 clusters: 2 for miscellaneous, and 1 each for the others.

Next, I used Sklearn’s CountVectorizer to vectorize the blog posts before clustering them. Unfortunately, this gave terrible results: all of the test posts were assigned to the same cluster! 🤔 I assumed the issue was that I had not filtered out common words such as “the” and “a”. Therefore, I switched to Sklearn’s TfidfVectorizer to vectorize our blog post data.

Thankfully, this worked out as planned. In the case that the TfidfVectorizer returned undesirable results, I would have resorted to using the NLTK library to perform some of the operations mentioned in the previous sections. This would have aided in filtering out the noise in the test data.
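As an aside, if common words are the suspected culprit, another option is to strip them out explicitly. The sketch below is not what I used for the blog posts. It simply assumes Sklearn's built-in English stop word list and compares the vocabulary with and without it on our earlier example sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Josh wants to get Chinese",
    "Alex wants to get Chinese food",
    "Jason went to the Chinese festival",
]

plain = CountVectorizer()
filtered = CountVectorizer(stop_words="english")

plain.fit(sentences)
filtered.fit(sentences)

# vocabulary_ holds every word the vectorizer decided to keep
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

The same stop_words argument is accepted by TfidfVectorizer, so the two approaches can be combined.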

So without further ado, let us look at the code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import glob, os

# 2 clusters for miscellaneous, and 1 each for the other categories
NUM_OF_CENTROIDS = 5

# Getting a list of files to train on
fps = glob.glob('*.txt')

# Create the vectorizer
# make sure to specify the input is a list of files.
cv = TfidfVectorizer(input='filename')

# Learn the vocabulary from the training posts and build the feature matrix.
# KMeans is unsupervised, so there are no y values to supply.
x = cv.fit_transform(fps).toarray()

cluster_clf = KMeans(n_clusters=NUM_OF_CENTROIDS)
cluster_clf.fit(x)

# Remember to save 10-20% of your data for testing!
directory = 'cluster_testdata'
test_files = [os.path.join(directory, post) for post in os.listdir(directory)]

# Once we have our list of filepaths, let's predict!
# Use transform (not fit_transform) so the test data reuses the same vocabulary.
test_x = cv.transform(test_files).toarray()
prediction = cluster_clf.predict(test_x)

print(test_files)
print(prediction)

For how complicated this problem is, we were able to get by with 20 lines of code! This could even be further condensed if we choose to sacrifice readability.

When I chose the test data, I was sure to choose 1 file from each category.

As the output below shows, the clustering did a fairly good job of categorizing this data. The machine learning and elixir posts each landed in a cluster of their own.

['cluster_testdata/e1.txt', 'cluster_testdata/h3.txt', 'cluster_testdata/m3.txt', 'cluster_testdata/ml3.txt']

[2 3 3 0]

h3.txt was the 4th hack n tell post, while m3.txt was the welcome post for Alex. One interesting theory for these posts appearing in the same cluster is that “Alex Solo” was mentioned in both of the blog posts. One method of possibly correcting this issue is to remove names from the data.
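When there are more than a handful of test files, it helps to group the predictions by cluster instead of reading the two lists side by side. Here is a minimal sketch that reuses the output above and relies on the file naming scheme (category prefix followed by a number). The category_of helper is my own addition for illustration.

```python
import os
import re
from collections import defaultdict

# The results printed above
test_files = ['cluster_testdata/e1.txt', 'cluster_testdata/h3.txt',
              'cluster_testdata/m3.txt', 'cluster_testdata/ml3.txt']
prediction = [2, 3, 3, 0]

def category_of(path):
    """Extract the leading letters of the file name, e.g. 'ml' from 'ml3.txt'."""
    return re.match(r"[a-z]+", os.path.basename(path)).group(0)

# Map each cluster id to the categories of the posts assigned to it
clusters = defaultdict(list)
for path, cluster in zip(test_files, prediction):
    clusters[cluster].append(category_of(path))

for cluster, categories in sorted(clusters.items()):
    print(cluster, categories)
```

Any cluster that lists more than one category, such as cluster 3 here, is worth a closer look.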


Conclusion

Sklearn and NLTK both have some extremely useful functions for processing text. In most cases, you can get away with only using Sklearn’s functions. However, NLTK is great for fine-tuning data.

In addition, there are many useful operations for fine-tuning data. Some of these strategies include: tokenizing, stemming, ngrams, and bag of words. Coupling these strategies together allows the computer to understand the context of the data and can yield better results for your test data.