Any-gram Kernel Implementation

This package is the implementation of the any-gram kernels described in [1]. In short, if you have a data set of text instances (typically sentences), each with a label (a continuous number or a categorical value), and want to train a model on it, you can use any-gram kernels. They provide an effective way to use all possible n-grams in the text without having to 1) extract them manually or 2) limit yourself to specific n-gram orders, which would otherwise have to be determined experimentally.

One feature of the any-gram kernels implemented in this package is that they can use word embeddings instead of string tokens to compute the kernel value. This helps account for word similarity and goes beyond exact string matching when computing the kernel value.

Note that any-gram kernels can be combined with other precomputed kernels. This provides a means to combine n-grams with hand-crafted features. A wrapper (anygramwrapper.py) is provided in the package for this purpose.

There are three methods to compute any-gram kernels:

  1. string match (SM): uses exact string match to compute the similarity between instances.
  2. word embedding similarity threshold (WEST): uses a threshold on the similarity of two word embeddings, above which the words are considered a match and below which a mismatch.
  3. word embedding similarity score (WESS): uses the similarity score of two word embeddings, instead of a binary match/mismatch, to compute the kernel value.

The method can be specified when creating an any-gram kernel object (see the example below). According to the experiments in the paper, WESS is the preferred method when using word embeddings, as it performs better than WEST.

Content

The project contains the following files:

  • agk.c: the actual kernel function(s), written in C
  • anygramkernel.py: a Python class to prepare and compute the kernel by calling the kernel function in agk.c via ctypes
  • anygramwrapper.py: a wrapper for preparing data, computing the kernel using anygramkernel.py, and training and evaluating models using scikit-learn
  • we.py: a Python class for loading and managing pre-trained word embeddings
  • glove.we: a subset of the GloVe pre-trained word embeddings covering the toy data used in the example

Prerequisites

The any-gram kernel code requires a C compiler and Python 2.7, and has only been tested on Linux. In principle, it can be used with any kernel-based learning algorithm implementation that can call Python code. However, a wrapper (anygramwrapper.py) is provided to make using the any-gram kernels easier; it uses the Support Vector Machine implementation in the scikit-learn package. To use the wrapper, scikit-learn must therefore be installed.
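
If scikit-learn is not already available, it can typically be installed with pip:

pip install scikit-learn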

Installation

To compile and build the C code (agk.c), run the following command in Linux:

gcc -shared -Wl,-soname,agk -o agk.so -fPIC agk.c
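
This produces the shared library agk.so, which anygramkernel.py loads via ctypes. A minimal sketch of that loading mechanism (the function name here is a placeholder, not the package's actual symbol):

import ctypes

# load the shared library built by the command above
agklib = ctypes.CDLL("./agk.so")

# the exported function names and signatures are defined in agk.c;
# 'computeKernel' is a placeholder used only to illustrate the call pattern:
# agklib.computeKernel.restype = ctypes.c_double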

Usage Example

The following example shows how to train a model using the SVM in scikit-learn and any-gram kernels with the WESS method. It uses some toy data to show how the input should be formatted.

The pre-trained word embeddings are loaded from a file which is assumed to be in text format, where each line contains a word followed by its vector, all separated by spaces/tabs. See Tips and Tricks below if you need to load word embeddings in a different format.
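
For illustration, each line of such a file looks like the following (the words and values here are made up):

the 0.418 0.24968 -0.41242 0.1217
toy 0.68047 -0.039263 0.30186 -0.17792

A minimal sketch of reading this format into a Python dictionary, independent of the package's own loader in we.py:

# read a text-format embedding file into a {word: vector} dictionary
embeddings = {}
with open("glove.we") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = [float(x) for x in parts[1:]]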

The input data needs to be provided (to the wrapper) through two python lists, one containing a list of text instances and one containing the labels matching the text instances.

The test() method of the wrapper class performs prediction on the test set and returns the results. The results also include the accuracy of the prediction, which naively assumes that the task at hand is classification. You can either modify the code to implement an appropriate metric, ignore the built-in evaluation and compute the metric from the returned predictions, or use the predict() method instead of test() and evaluate the returned predictions yourself (see the sketch after the example below).

You can ignore the word embeddings and use only the string match method by replacing "wess" with "sm" when creating the AnyGram object. Interestingly, you will get 100% accuracy on the toy data.

import anygramkernel as agk
import anygramwrapper as agkw

# toy data

tr = [("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 0 .", 0)]

te = [("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 0 .", 0),
      ("This is a toy example data with label 1 .", 1),
      ("This is a toy example data with label 1 .", 1)]



# creating an any-gram kernel object (AnyGram)
k = agk.AnyGram(pMethod = "wess", pMaxTxtLen = 80)

# wrapping the any-gram object with a wrapper object (AnyGramWrapper)
w = agkw.AnyGramWrapper(k)

# loading word embeddings
w.loadEmbeddings("glove.we", True)

# loading data
w.loadTrainSet([i[0] for i in tr], [i[1] for i in tr])
w.loadTestSet([i[0] for i in te], [i[1] for i in te])

# precomputing the kernel (can be skipped and left to the training function)
w.precomputeTrainKernel()
 
# training the model using precomputed kernels
w.train(pflgUsePrecompKernel = True)

# save (pickle) the model (if needed)
w.saveModel("model")

# prediction and evaluation on test set
preds, acc = w.test([i[0] for i in te])
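
As noted above, if accuracy is not the appropriate metric for your task, one option is to evaluate the returned predictions yourself. A sketch using scikit-learn's metrics, assuming preds is a flat list of predicted labels:

from sklearn.metrics import f1_score

# gold labels for the test set
gold = [i[1] for i in te]

# macro-averaged F1 as an example of a task-appropriate metric
print(f1_score(gold, preds, average = "macro"))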

Tips and Tricks

Precomputed Kernels

Most of the time, training a model involves a set of hyperparameters which need to be tuned to find their optimum values. Tuning involves training several models, each with a different combination of hyperparameter values, which in turn involves computing the kernel. When these hyperparameters are independent of the kernel function, recomputing the kernel is redundant; the kernel can therefore be computed once and reused across all tuning runs. The wrapper class in anygramwrapper.py has a method which serves this purpose.
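
A sketch of this pattern, reusing the example above; note that the pC parameter name is hypothetical and stands for whatever SVM hyperparameters the wrapper forwards to scikit-learn:

# compute the train kernel once
w.precomputeTrainKernel()

# reuse it across all tuning runs
for c in [0.01, 0.1, 1, 10]:
    w.train(pflgUsePrecompKernel = True, pC = c)
    # ... evaluate each model and keep the best setting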

Loading Pre-trained Word Embeddings

The word embeddings are loaded by the wrapper using the WordEmbedding class in we.py. The wrapper and the AnyGram class only use the method that loads the general text format, but WordEmbedding also implements methods to load other formats (e.g. word2vec or binary GloVe format). You can edit the methods involved to use whichever loader suits your word embedding file format.
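
Alternatively, you can convert your embeddings to the expected text format outside the package. For instance, binary word2vec vectors can be converted with gensim (assuming gensim 4.x; the file names are illustrative):

from gensim.models import KeyedVectors

# load binary word2vec vectors
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary = True)

# write them in the plain text format described above:
# one word per line, word and vector values separated by spaces, no header
with open("vectors.we", "w") as f:
    for word in kv.index_to_key:
        f.write(word + " " + " ".join(str(x) for x in kv[word]) + "\n")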

Using Auxiliary Input

In addition to the textual input, any-gram kernels can use auxiliary input to supply additional information when the kernel is computed. The auxiliary input should be provided on a per-token basis. For example, to use POS tags as auxiliary input, they should first be converted to one-hot vectors, and every instance should then be constructed as a 2D list whose first dimension matches the tokens in the sentence and whose second dimension holds the one-hot vectors. You can use the loadTrainAux() and loadTestAux() methods of the wrapper class to load auxiliary data, as sketched below.
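
A sketch of building such auxiliary input for the first toy training sentence above; the tag set is illustrative, and the exact argument structure expected by loadTrainAux() is an assumption based on the description:

# a small tag set; index positions define the one-hot encoding
tags = ["DET", "VERB", "NOUN", "ADJ", "ADP", "NUM", "PUNCT"]

def onehot(tag):
    # one-hot vector (as a list) for a single POS tag
    return [1 if t == tag else 0 for t in tags]

# POS tags for "This is a toy example data with label 0 ."
sentTags = ["DET", "VERB", "DET", "ADJ", "NOUN", "NOUN", "ADP", "NOUN", "NUM", "PUNCT"]

# one 2D list per instance: tokens x one-hot vectors
aux = [[onehot(t) for t in sentTags]]

w.loadTrainAux(aux)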

Combining Kernels

Precomputed any-gram kernels can be combined with any other precomputed kernel, provided that the other kernel (which can itself be an any-gram kernel) is supplied as a matrix of the same shape as the any-gram kernel. The supported combination operations are addition, multiplication, arithmetic mean and geometric mean. An example use case is using hand-crafted features alongside any-grams, where the any-gram kernels are combined with other kernels (e.g. RBF) computed over those features. You can use the combinePrecomputedTrainKernel() and combinePrecomputedTestKernel() methods of the wrapper class for this purpose.
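
For example, an RBF kernel over hand-crafted features could be precomputed with scikit-learn and then combined; the combination call shown here is a hypothetical pattern, so check the wrapper for the exact signature:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# hand-crafted feature matrix: one row per training instance (values made up)
feats = np.random.rand(9, 5)   # 9 training instances, 5 features each

# precomputed RBF kernel with the same shape as the any-gram train kernel (9 x 9)
rbf = rbf_kernel(feats)

# hypothetical call pattern
w.combinePrecomputedTrainKernel(rbf, "add")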

Contact

You can contact Rasoul Kaljahi (rasoul.kaljahi@adaptcentre.ie) for any further questions regarding the use of the package and the any-gram kernels themselves.


References

[1] Rasoul Kaljahi, Jennifer Foster. 2017. Any-gram Kernels for Sentence Classification: A Sentiment Analysis Case Study. arXiv:1712.07004.