Statistical Classification

AIMA Book chapters recommended: 18.3 (Learning Decision Trees), 18.4 (Evaluating and Choosing The Best Hypothesis)

Note

To use the classification module of SimpleAI you need to have NumPy installed.

To train and use the statistical classification algorithms in this library you will need to write code that specifies your problem. Essentially, this boils down to:

  • Define some attributes to be used as the input feature vector by the algorithms.
  • Define a target attribute to be predicted by the algorithms.
  • Give a dataset to be used as a training set by the algorithms.

This chapter explains how to use the statistical classification facilities of simpleai with an example that can be found in simpleai/samples/machine_learning/language_classification.py.

Defining your dataset

A dataset can be any iterable Python object, and the objects being iterated (the observations themselves) can be any Python objects you want.

For instance, suppose we have a dataset of sentences in some language and the classification task is to identify the language of each sentence. Then, it’s perfectly valid for our dataset to be a list of (target_language, sentence_text) objects.
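
As a minimal illustration (with made-up sentences), such a dataset could be as simple as a plain list:

tiny_dataset = [
    ("english", "this is a sentence in english"),
    ("spanish", "esta es una oración en español"),
    ("english", "another sentence written in english"),
]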

The example of language classification followed here is fully coded in simpleai/samples/machine_learning/language_classification.py.

In the code, each observation is represented with a small class instead of just a tuple, for clarity:

class Sentence(object):
    def __init__(self, language, text):
        self.language = language
        self.text = text

Since another goal of this example is to learn a decision tree from a large dataset without putting it in memory all at once, the dataset is read using a custom iterator that only stores a single observation at a time:

class OnlineCorpusReader(object):
    def __init__(self, input_files=None):
        self.input_files = input_files or [("english", "text.en"),
                                           ("spanish", "text.es")]

    def __iter__(self):
        for language, filename in self.input_files:
            with open(filename) as corpus_file:
                for text in corpus_file:
                    yield Sentence(language, text.lower())

With that, we can create a dataset of sentences from files that have one sentence per line, such as the Europarl corpus.

Defining your attributes

In order to do automatic classification you’ll have to define the attributes (the features) that are going to be used in the learning phase.

The way attributes are represented is slightly different from usual in this library. Normally, a classification problem uses a vector of attributes as input, in which each value in the vector is the value of some attribute. So if the vector has size N, you have N attributes.

To do the same thing in this library you have to provide N functions, each of which takes an observation and returns an attribute value. Each function is applied to the observation, and the resulting N values form the classical feature vector.

Back to the language classification example, let’s assume that our attributes/features are the frequency counts of each letter in the sentence. Then, we can define the attributes like this:

class LetterCount(Attribute):
    def __init__(self, letter):
        self.letter = letter
        self.name = "Counts for letter {!r}".format(letter)

    def __call__(self, sentence):
        return sentence.text.count(self.letter)

# ...
# somewhere else:
for letter in "abcdefghijklmnopqrstuvwxyz":
    attribute = LetterCount(letter)
# ...

Here the attributes inherit from the Attribute class, which is recommended but not strictly necessary. The only requirement an attribute has to meet is to be a callable object (a function, a method, a class that defines __call__, etc.).

So, a bare minimum valid attribute that counts the letter "a" in an observation could have been defined like this:

def attribute_count_a(observation):
    return observation.text.count("a")

And that would have been all that is needed.
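
To make the link with the classical feature vector explicit, here is a small sketch (not part of the sample code) that applies a list of attribute callables to one observation:

attributes = [LetterCount(letter) for letter in "abc"]
attributes.append(attribute_count_a)

observation = Sentence("english", "a simple sentence")

# Each callable is applied to the observation; the results form the feature vector.
feature_vector = [attribute(observation) for attribute in attributes]
print(feature_vector)  # [1, 0, 1, 1]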

If you are wondering “Why, oh, why did you do it this way?”, it’s because not all datasets exist as feature vectors: there could be text, images, graphs, etc., so using attribute functions is a way of explicitly (and neatly) declaring all the preprocessing done to the data without altering the original data in any way (i.e., it stays read-only).

Defining your problem

The ClassificationProblem is where the previously defined attributes live. Your problem also has to define the target attribute: the attribute the classifier has to guess, i.e., a method that, given an example from the dataset, returns the correct classification for it.

Back to the language classification example, the problem definition would be:

class LanguageClassificationProblem(ClassificationProblem):
    def __init__(self):
        super(LanguageClassificationProblem, self).__init__()
        for letter in "abcdefghijklmnopqrstuvwxyz":
            attribute = LetterCount(letter)
            self.attributes.append(attribute)

    def target(self, sentence):
        return sentence.language

Here we define an instance of the LetterCount attribute for each letter in the English alphabet.

Once defined, each attribute must be stored in the attributes list of your ClassificationProblem. In this example, the target just returns the language of a Sentence.

Using a classifier

Once all is defined, you can train one of the implemented classifiers like Naive Bayes or a Decision Tree.

from simpleai.machine_learning.classifiers import NaiveBayes

input_files = [("english", "europarl-v7.es-en.en"),
               ("spanish", "europarl-v7.es-en.es")]

dataset = OnlineCorpusReader(input_files)
problem = LanguageClassificationProblem()
classifier = NaiveBayes(dataset, problem)

test = Sentence(None, "is this an english sentence?")
print(classifier.classify(test))
test = Sentence(None, "es ésta una oración en español?")
print(classifier.classify(test))
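
Since training over a large corpus can be slow, the trained classifier can be persisted and restored with the save and load methods documented below; a short sketch (the file name is arbitrary):

classifier.save("language_classifier.pickle")

# Later, possibly in another process:
restored = NaiveBayes.load("language_classifier.pickle")
print(restored.classify(Sentence(None, "one more english sentence")))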

Classifier API

class simpleai.machine_learning.models.Classifier(dataset, problem)[source]

Base of all classifiers. This specifies the classifier API.

Each classifier holds at least a dataset and a ClassificationProblem.

attributes

The attributes of the problem. A list of callable objects.

classify(example)[source]

Returns the classification for example.

distance(a, b)[source]

Custom distance between a and b.

learn()[source]

Does the training. Returns nothing.

classmethod load(filepath)[source]

Loads a pickled version of the classifier saved in filepath.

save(filepath)[source]

Pickles the classifier and saves it into filepath.

target

The problem’s target. A callable that takes an observation and returns the correct classification for it.
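
Any custom classifier is expected to follow this API. As a rough, hypothetical sketch only (assuming the base class stores the dataset and problem and calls learn() during construction; otherwise call learn() explicitly), a trivial majority-class baseline could look like this:

from collections import Counter

from simpleai.machine_learning.models import Classifier


class MajorityClassifier(Classifier):
    """Toy baseline: always predicts the most frequent target in the dataset."""

    def learn(self):
        # self.dataset and self.target are part of the documented API above.
        counts = Counter(self.target(example) for example in self.dataset)
        self.most_common = counts.most_common(1)[0][0]

    def classify(self, example):
        # Ignores the example and returns the majority class seen in training.
        return self.most_common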

Available classifiers

Classifiers implemented:
class simpleai.machine_learning.classifiers.DecisionTreeLearner(dataset, problem)[source]

This implementation features an algorithm that strictly follows the pseudocode given in AIMA.

It’s obviously inefficient in many ways (perhaps incomplete too), but it’s intended to be used pedagogically.

See the other implementations in this same file for a discussion of the issues they solve.

This algorithm is equivalent to ID3.

classify(example)[source]

Returns the classification for example.

importance(attribute, examples)[source]

AIMA implies that importance should be information gain. Since AIMA only defines it for binary features, this implementation is based on the Wikipedia article: http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

learn(examples, attributes, parent_examples)[source]

A decision tree learner that strictly follows the pseudocode given in AIMA. In 3rd edition, see Figure 18.5, page 702.
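
As with NaiveBayes above, the constructor takes a dataset and a problem, so (as a rough sketch, best suited to a small dataset given the note on efficiency) it can be swapped into the earlier language example:

from simpleai.machine_learning.classifiers import DecisionTreeLearner

classifier = DecisionTreeLearner(dataset, problem)
print(classifier.classify(Sentence(None, "is this an english sentence?")))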

class simpleai.machine_learning.classifiers.DecisionTreeLearner_Queued(dataset, problem)[source]

This implementation has a few improvements over the one based on the book:

  • It uses a queue instead of recursion, so the Python stack limit is never reached.
  • In case an attribute has a value not seen in training, the intermediate nodes can give a “best so far” classification.
  • Abusive re-iteration of the training examples is avoided by calculating all the information gains of a single node split at the same time.

This algorithm is equivalent to ID3.

classify(example)[source]

Returns the classification for example.

learn()[source]

Does the training. Returns nothing.

class simpleai.machine_learning.classifiers.DecisionTreeLearner_LargeData(dataset, problem, minsample=1)[source]

This implementation is specifically designed to handle large datasets that don’t fit into memory, and it has more improvements over the queued one:

  • Data is processed one observation at a time, so the training data doesn’t need to fit in memory.
  • The training data is read approximately log(N) times in full (from the first example to the last) for a dataset with N examples. This is because the gains of all splits of all leaf nodes are estimated simultaneously, so every time the training data is read completely, a whole new level of the tree (i.e., all leaves at equal depth) is expanded at once.

This algorithm is equivalent to ID3.

It is very important to note that, in order to have a small memory footprint, the minsample argument has to be set to a reasonable size; otherwise there will be one tree leaf for every example in the training set, which totally defeats the purpose of having a large-data version of the algorithm.
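
For instance, a short usage sketch (the minsample value of 100 is arbitrary):

from simpleai.machine_learning.classifiers import DecisionTreeLearner_LargeData

classifier = DecisionTreeLearner_LargeData(dataset, problem, minsample=100)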

learn()[source]

Does the training. Returns nothing.

class simpleai.machine_learning.classifiers.NaiveBayes(dataset, problem)[source]

Implements a classifier that uses Bayes’ theorem.

classify(example)[source]

Returns the classification for example.

learn()[source]

Does the training. Returns nothing.

class simpleai.machine_learning.classifiers.KNearestNeighbors(dataset, problem, k=1)[source]

Classifies objects based on the closest training examples. It uses the k nearest examples from the training set and returns the most common classification among them.

To use this classifier the problem must define a distance method to measure the distance between two examples.
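
The library does not dictate what that distance should be. As an illustrative sketch only (reusing LanguageClassificationProblem from the example above), a problem could define a Euclidean distance over the feature vectors of the two examples:

class LanguageKNNProblem(LanguageClassificationProblem):
    def distance(self, example_a, example_b):
        # Build each example's feature vector by applying every attribute,
        # then compute the Euclidean distance between the two vectors.
        vector_a = [attribute(example_a) for attribute in self.attributes]
        vector_b = [attribute(example_b) for attribute in self.attributes]
        return sum((a - b) ** 2 for a, b in zip(vector_a, vector_b)) ** 0.5

With that in place, the classifier could be built as KNearestNeighbors(dataset, LanguageKNNProblem(), k=3), where k is the number of neighbors consulted.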

classify(example)[source]

Returns the classification for example.

learn()[source]

Does the training. Returns nothing.

save(filepath)[source]

Saves the classifier to filepath. Because this classifier needs to save the dataset, the dataset must be something that can be pickled, not something like an iterator.