Getting Started with Neural Networks and Word2Vec using Keras

Shan-Hung Wu & DataLab
Fall 2016
In [1]:
from IPython.display import Image
from IPython.display import display

# inline plotting instead of popping out
%matplotlib inline

# load utility classes/functions that have been taught in previous labs
# e.g., plot_decision_regions()
import os, sys
module_path = os.path.abspath(os.path.join('.'))
sys.path.append(module_path)
from lib import *
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brandonwu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In this lab, we will show how to train a neural network (NN) for text classification using the Keras library. We then train another neural network, called word2vec, that embeds words into a dense vector space where semantically similar words are mapped to nearby points. The weights learned by the word2vec network can then be used to generate the input of the text classification network in order to reduce memory consumption.

Keras: High-Level NN Library

Keras is an easy-to-use, high-level neural-network library written in Python that runs on top of either TensorFlow or Theano, two well-known low-level neural-network libraries that offer the necessary computing primitives (including GPU parallelism). The high-level, modular API offered by Keras is compatible with Scikit-learn and allows fast prototyping.

Installation and Configuration

You can install Keras using:

> pip install --upgrade keras

In the rest of this and the following labs, we will use the Theano backend. Go to the configuration file:

~/.keras/keras.json

If it isn't there, you can create it. The default configuration file looks like this:

{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

Simply change the field "backend" to "theano".
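
For example, the modified file would look like this (only the "backend" field needs to change; you may optionally also set "image_dim_ordering" to "th" if you prefer Theano-style dimension ordering):

{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}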

Note that Theano currently supports Python only up to version 3.4. So, you may have to switch to a Python 3.4 kernel in Jupyter notebook. If there is no Python 3.4 kernel, you can create one on the command line using Anaconda:

> conda create --name [env-name] python=3.4

then restart the Jupyter notebook server. In case of missing packages (e.g., matplotlib or nltk), activate the Python 3.4 environment in the command line:

Windows:

> activate [env-name]

Linux or Mac OS X:

> source activate [env-name]

then install the missing packages, e.g.,

> conda install nltk

To run Theano properly on the CPU, you need a C++ compiler. On Mac OS X, this means that you should have installed Xcode and run it at least once. On Windows, you need Visual Studio. For more instructions on Windows, please refer to this blog post.

Sequential Model

The Sequential model is a linear stack of layers commonly used to implement a feedforward neural network. A typical NN for binary classification looks like this:

Each circle denotes a nonlinear function called a unit or neuron. The units are organized into multiple layers, where the layer closest to the output (colored in red) is called the output layer and the rest are called hidden layers. The units are interconnected by weights (solid lines). We denote by $W_{i,j}^{(k)}$ the weight from the $i$-th unit at the $(k-1)$-th layer to the $j$-th unit at the $k$-th layer. We use the word "feedforward" because information flows from input to output.

In binary classification where $y\in \{0,1\}$, we can think of this NN as a complex nonlinear function that takes a data point $\boldsymbol{x}$ as the input and outputs

$$a^{(L)}=\rho$$

from the output unit, assuming

$$\Pr(y=1\,|\,\boldsymbol{x})\sim\mathrm{Bernoulli}(\rho).$$

With $\rho$, we know the entire distribution of $\Pr(y\,|\,\boldsymbol{x})$ that we can use either to define a cost function (for training the weights $W_{i,j}^{(k)}$'s) or to make prediction:

$$\hat{y}=\arg\max_{y}\mathrm{P}(y\,|\,\boldsymbol{x};\rho)=\arg\max_{y}\rho^{y}(1-\rho)^{(1-y)}.$$

We can regard each unit as a simple nonlinear function. The design of units in modern NNs is guided by math and engineering principles as discussed in the lecture. Typically, we use the Rectified Linear Unit (ReLU) as a hidden unit:

and the Sigmoid unit as the output unit:

We can see that they differ only in the activation function. Note that with the Sigmoid output unit, we have

$$\hat{y}=\arg\max_{y}\rho^{y}(1-\rho)^{(1-y)}=\mathrm{sign}(z^{(L)})$$

Look familiar? This NN is basically a logistic regressor that takes $\boldsymbol{a}^{(L-1)}$ as its input.
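
To make the forward pass concrete, here is a minimal NumPy sketch (not part of the lab code; the weights are random placeholders) of a 512-32-1 network with a ReLU hidden layer and a Sigmoid output unit, together with the prediction rule above:

import numpy as np

def relu(z):
    # ReLU activation: a = max(0, z)
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid activation: a = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights for a 512-32-1 network (random placeholders)
rng = np.random.RandomState(0)
W1, b1 = 0.01 * rng.randn(512, 32), np.zeros(32)
W2, b2 = 0.01 * rng.randn(32, 1), np.zeros(1)

x = rng.randn(512)                  # a single input vector
a1 = relu(x.dot(W1) + b1)           # hidden-layer activations a^(1)
rho = sigmoid(a1.dot(W2) + b2)[0]   # output a^(L) = rho = Pr(y=1 | x)

# prediction: argmax_y rho^y (1 - rho)^(1 - y), i.e., predict 1 iff rho > 0.5
y_hat = int(rho > 0.5)
print(rho, y_hat)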

Layers and Activation Functions

The two main components for building a neural network architecture in Keras are Layer and Activation. There are many useful layers (such as embedding, dense, convolutional, and recurrent layers) and activation functions (such as sigmoid, relu, tanh, and softmax) in Keras. You can find what you want in the Keras documentation.

For now, let's create a binary classifier with 1 hidden layer and typical neurons:

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential() 
model.add(Dense(input_dim=512, output_dim=32))
model.add(Activation('relu'))
model.add(Dense(output_dim=1))
model.add(Activation('sigmoid'))
Using Theano backend.

You can also create a Sequential model by passing a list of layer instances to the constructor:

In [3]:
model = Sequential([
    Dense(input_dim=512, output_dim=32),
    Activation('relu'),
    Dense(output_dim=1),
    Activation('sigmoid'),
])

Cost Function

Most NNs are trained using the maximum likelihood (ML) principle, with the goal of maximizing the log-likelihood $\log\mathrm{P}(\mathbb{X}\,|\,\Theta)$, where $\Theta=\{\boldsymbol{W}^{(1)}, \cdots, \boldsymbol{W}^{(L)}\}$:

$$\arg\max_{\Theta}\log\mathrm{P}(\mathbb{X}\,|\,\Theta)=\arg\min_{\Theta}-\log\mathrm{P}(\mathbb{X}\,|\,\Theta).$$

When examples are i.i.d., this can be simplified to

$$\arg\min_{\Theta}\sum_{i=1}^{N}{-\log\mathrm{P}(\boldsymbol{y}^{(i)}\,|\,\boldsymbol{x}^{(i)},\Theta)}.$$

When $\Pr(y=1\,|\,\boldsymbol{x})\sim\mathrm{Bernoulli}(\rho)$ in binary classification, we have

$$-\log\mathrm{P}(y^{(i)}\,|\,\boldsymbol{x}^{(i)};\Theta)=-\log[(a^{(L)})^{y^{(i)}}(1-a^{(L)})^{1-y^{(i)}}]=-y^{(i)}\log a^{(L)}-(1-y^{(i)})\log(1-a^{(L)}).$$

The goal is to maximize $a^{(L)}=\rho$ when we see a positive example, and to minimize it when we see a negative one. As discussed in the lecture, this problem can also be regarded as minimizing the cross entropy from the predicted distribution $\mathrm{P}(\mathrm{y}\,|\,\boldsymbol{\mathrm{x}};\Theta)$ to the data distribution $\mathrm{Empirical}(\mathrm{y}\,|\,\boldsymbol{\mathrm{x}};\mathbb{X})$.
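
For a single example, this negative log-likelihood is essentially what the 'binary_crossentropy' loss used below computes. A minimal NumPy sketch (for illustration only):

import numpy as np

def binary_cross_entropy(y, rho, eps=1e-7):
    # -y * log(rho) - (1 - y) * log(1 - rho), clipped for numerical stability
    rho = np.clip(rho, eps, 1.0 - eps)
    return -y * np.log(rho) - (1 - y) * np.log(1 - rho)

print(binary_cross_entropy(1, 0.9))  # confident and correct -> small loss
print(binary_cross_entropy(1, 0.1))  # confident but wrong   -> large loss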

Once we have defined our objective, we can solve it using an optimization algorithm. Most NNs employ the SGD algorithm, as it offers many practical advantages discussed in the lecture.

Back to our Keras example. After defining the network architecture, we need to configure how to train the network, which is done via the compile method:

In [4]:
# for a binary classification problem

from keras.optimizers import SGD

# instantiate an optimizer with specific parameters
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9)
model.compile(optimizer=sgd,
              loss='binary_crossentropy',
              metrics=['accuracy'])

There are lots of optimizers, loss functions, and metrics already defined in Keras. For the optimizer, you can either instantiate it before passing it to model.compile(), as in the example above, or refer to it by its name, as below. In the latter case, the default parameters of the optimizer will be used.

In [5]:
# or use default parameters
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Let's wrap it up by defining a model-generating function:

In [3]:
def gen_nn(input_dim=512, width=32, depth=2):
    model = Sequential()
    model.add(Dense(input_dim=input_dim, output_dim=width))
    model.add(Activation('relu'))
    for k in range(2, depth):
        model.add(Dense(output_dim=width))
        model.add(Activation('relu'))
    model.add(Dense(output_dim=1))
    model.add(Activation('sigmoid'))
    model.compile(optimizer='sgd',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

Text Classification

Having a compiled model, we can train it by calling the fit() method and passing the training examples. Let's apply our model to classifying the text in the 20 News Groups dataset.

The 20 News Groups Dataset

The 20 Newsgroups dataset comprises around 18000 newsgroup posts (data points) on 20 topics (labels). We can load the dataset using sklearn. For simplicity, we consider a binary classification task that distinguishes the "science" label from the "entertainment" (recreation) label only.

In [4]:
from sklearn.datasets import fetch_20newsgroups

# we take 2 news groups:
# 1) rec : recreational activities (entertainment)
# 2) sci : science
categories = [
    'rec.autos',
    'rec.motorcycles',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.crypt',
    'sci.electronics',
    'sci.med',
    'sci.space',
]

# remove headers, footers, and quotes; we only need the content
remove = ('headers', 'footers', 'quotes')

# after fetching the data,
# the content is stored in newsgroups.data
# and the labels in newsgroups.target
newsgroups = fetch_20newsgroups(subset='all', categories=categories,
                                     shuffle=True, random_state=0,
                                     remove=remove)

We get about 8000 data points:

In [5]:
print('#Data size: %s' % np.shape(newsgroups.data))
print('Labels: %s' % np.unique(newsgroups.target))
#Data size: 7931
Labels: [0 1 2 3 4 5 6 7]

After loading the dataset, we need to do some text preprocessing and split it into training and testing sets. We can apply what we learned in the last competition to the text preprocessing (see lib.py for more details):

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# for simplicity, we only take 512 features
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features=512, sublinear_tf=True, 
                             max_df=0.5, preprocessor=preprocessor, tokenizer=tokenizer_stem_nostop)

X = vectorizer.fit_transform(newsgroups.data).toarray()
y = (newsgroups.target > 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# the dictionary map from word to feature index
dictionary = vectorizer.vocabulary_

# construct inverse_dictionary for later use
inverse_dictionary = {v: k for k, v in dictionary.items()}

Now, we can feed the data into our compiled NN model:

In [6]:
import time

batch_size = 32

model = gen_nn(input_dim=X_train.shape[1])
%time his = model.fit(X_train, y_train, nb_epoch=400, \
                      batch_size=batch_size, \
                      validation_split=0.2, \
                      shuffle=True, verbose=0)
# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
CPU times: user 47.3 s, sys: 906 ms, total: 48.2 s
Wall time: 29.5 s

Test loss: 0.387
Test accuracy: 0.825

The accuracy is about 82%. We can plot the learning curve to see the convergence:

In [7]:
train_loss = his.history['loss']
val_loss = his.history['val_loss']

# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-val.png', dpi=300)
plt.show()

From the above figure, we can see that the model begins to overfit the data after about epoch 150. When training an NN, it is common to early-stop the SGD algorithm once overfitting is detected. We can reduce the number of epochs to 150 and retrain:

In [8]:
model = gen_nn(input_dim=X_train.shape[1])
%time his = model.fit(X_train, y_train, \
                      nb_epoch=150, \
                      batch_size=batch_size, \
                      validation_split=0.2, \
                      shuffle=True, verbose=0)
# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
CPU times: user 19.4 s, sys: 393 ms, total: 19.8 s
Wall time: 11.5 s

Test loss: 0.357
Test accuracy: 0.835

Now we have slightly better accuracy on the test set.

Callbacks

In the above, we don't actually "early stop" SGD; we just train an NN twice. It is generally hard to know how many epochs we need in the first training, and starting over again in the second training is time consuming. Keras provides hooks for callback functions that you can implement to inspect the internal states and statistics of the model during training and to perform monitoring tasks. For early stopping, there is an EarlyStopping class that stops the current training process for you:

In [11]:
from keras.callbacks import EarlyStopping

model = gen_nn(input_dim=X_train.shape[1])

# monitor the validation loss
# min_delta : minimum change in the monitored quantity to qualify as an improvement
# patience : number of epochs with no improvement after which training will be stopped
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=20, verbose=0)

%time his = model.fit(X_train, y_train, \
                      nb_epoch=1000, \
                      batch_size=batch_size, \
                      validation_split=0.2, \
                      shuffle=True, verbose=0, \
                      callbacks=[early_stop])

# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])

train_loss = his.history['loss']
val_loss = his.history['val_loss']

# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.xlim(0, len(train_loss))
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-val-earlystop.png', dpi=300)
plt.show()
CPU times: user 19.3 s, sys: 1.55 s, total: 20.8 s
Wall time: 20.6 s

Test loss: 0.360
Test accuracy: 0.832

As we can see, it takes less time to get the desired performance.

Speeding Up Training with GPU (Optional)

Theano supports GPUs (Nvidia only) to speed up training. You can follow the steps below to enable the GPU in Theano:

  1. Install Nvidia driver;
  2. Install CUDA 8.0;
  3. Download and install cuDNN;
  4. Save the following to ~/.theanorc:

     [global]
     floatX=float32
     device=gpu0 (remove this line if you want to switch back to the CPU)
    
     [cuda]
     root=/usr/local/cuda-8.0
    
     [nvcc]
     fastmath=True
    
     [blas]
     ldflags = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm
     

Enabling the GPU on Windows requires some extra steps; see this blog post. If you encounter any problems, Google them! In Jupyter notebook, you have to restart the kernel after modifying the ~/.theanorc configuration file.

No Speed Up?

If you rerun the above code blocks with the GPU enabled, you may find that the training time does not improve much. This is mainly because we use a very small batch size (32). In this case, you will not take advantage of GPU parallelism, since the bottleneck is the data transfer between the CPU and GPU---at each SGD iteration, the CPU needs to send a batch of data to the GPU to compute the updated weights.

Adjusting Hyperparameters

We can reduce the overhead of data transfer by setting a larger batch size. Let's run the code below twice, using the CPU the first time and the GPU the second, and compare the training times:

In [18]:
from itertools import product
import time
import pickle
import os.path

input_dim = [512, 2048]
batch_size = [32, X_train.shape[0]]
layer = [2, 4]
width = [32, 512]

train_time = {}
# encode each configuration as 4 binary flags selecting (input_dim, batch_size, layer, width);
# 0 picks the first value of each list above, 1 picks the second
index = [tuple("{0:04b}".format(i)) for i in [0, 1, 2, 4, 8, 5]]
for (i, b, d, w) in index:
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features=input_dim[int(i)], sublinear_tf=True, 
                             max_df=0.5, preprocessor=preprocessor, tokenizer=tokenizer_stem_nostop)
    
    X = vectorizer.fit_transform(newsgroups.data).toarray()
    y = (newsgroups.target > 3).astype(int)
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    
    model = gen_nn(input_dim=X_train.shape[1], width=width[int(w)], depth=layer[int(d)])
        
    start = time.time()
    his = model.fit(X_train, y_train,
                    nb_epoch=1000,
                    batch_size=batch_size[int(b)],
                    validation_split=0.2,
                    shuffle=True, verbose=0,)
    train_time[(i, b, d, w)] = time.time() - start
    
# save train_time using pickle
# rerun this block using different configuration (CPU/GPU)
# load the saved train_time and compare results
file_name = 'cpu_train_time.pickle'
if not os.path.isfile(file_name):
    with open(file_name, 'wb') as handle:
        pickle.dump(train_time, handle)
else:
    with open(file_name, 'rb') as handle:
        cpu_train_time = pickle.load(handle)
    
    gpu_train_time = train_time

    # compare CPU & GPU time
    cpu_time = [cpu_train_time[(i, b, d, w)] for (i, b, d, w) in index]
    gpu_time = [gpu_train_time[(i, b, d, w)] for (i, b, d, w) in index]

    n_groups = 6
    index = np.arange(n_groups)
    bar_width = 0.35
    opacity = 0.4

    rects1 = plt.bar(index, cpu_time, bar_width,
                     alpha=opacity,
                     color='b',
                     label='CPU')

    rects2 = plt.bar(index + bar_width, gpu_time, bar_width,
                     alpha=opacity,
                     color='r',
                     label='GPU')

    plt.ylabel('Time (sec)')
    plt.title('Training time comparison')
    plt.xticks(index + bar_width, ('Default', 'Width-512', 'Layer-4', 'Batch-full', 
                                   'Dim-2048', 'Width-512\n+ Batch-full'))
    plt.legend()
    plt.savefig('./output/fig-nn-train-time.png', dpi=300)
    plt.show()

We obtained the above results by comparing the time of an Intel Core i5 CPU with that of a GeForce GTX 1070 GPU. We can see that when using a large batch size, the GPU time gets close to the CPU time due to less data transfer. Furthermore, if we increase the width of each hidden layer (from 32 to 512), the GPU outperforms the CPU because the weight matrix ($\boldsymbol{W}^{(k)}$) at each layer becomes larger and can be updated more efficiently with GPU parallelism. This is a common situation when we have a big model.

On the other hand, some other hyperparameters, such as the number of layers and the input dimension, do not seem to affect the GPU speedup much. Increasing the number of layers does not help because the weight matrices ($\boldsymbol{W}^{(k)}$'s) of different layers are updated sequentially during each SGD iteration.

Word2Vec

So far, we have used tf-idf to transform a document into a vector. This vector can be seen as the weighted sum of the one-hot vectors of the distinct words in the document. One-hot vectors of words have some limitations. For example, we know that "cat" is semantically more similar to "dog" than to "cow". However, the one-hot vectors of these three words are all equally far apart (any two of them have cosine similarity 0). Can we embed words into a vector space that preserves the semantic distance between words?
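
Before answering, a quick NumPy check (not part of the lab code) confirms that one-hot vectors carry no similarity information:

import numpy as np

vocab = ['cat', 'dog', 'cow']
onehot = np.eye(len(vocab))  # each row is the one-hot vector of one word

# every pair of distinct one-hot vectors has cosine similarity 0
# and Euclidean distance sqrt(2), regardless of the words' meanings
print(onehot[0].dot(onehot[1]))               # cat . dog = 0
print(onehot[0].dot(onehot[2]))               # cat . cow = 0
print(np.linalg.norm(onehot[0] - onehot[1]))  # ~1.414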

Word2vec is an NN-based word embedding method. It represents words in a continuous, low-dimensional vector space (i.e., the embedding space) where semantically similar words are mapped to nearby points. The main idea of word2vec is to use an NN with only 1 linear hidden layer (i.e., each hidden unit has the linear activation function $a_j^{(1)}=z_j^{(1)}$) to predict the neighbors of a given word, as shown below:

This is based on a key observation: semantically similar words are often used interchangeably in different contexts. For example, the words "cat" and "dog" may both appear in the context "___ is my favorite pet." When feeding "cat" and "dog" into the NN to predict their nearby words, these two words will likely share the same/similar hidden representation ($\boldsymbol{a}^{(1)}$ at layer 1) in order to predict the same/similar nearby words ($\boldsymbol{a}^{(2)}$ at layer 2). After training the NN, we can use the weight matrix $\boldsymbol{W}^{(1)}$ to encode a one-hot vector $\boldsymbol{x}$ into a low-dimensional dense vector $\boldsymbol{h}$ by $\boldsymbol{h}=\boldsymbol{W}^{(1)\top}\boldsymbol{x}$.
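
The encoding step is just a matrix-vector product (equivalently, a row lookup). A minimal sketch, assuming a hypothetical vocabulary size V and embedding dimension D, with random placeholder weights:

import numpy as np

V, D = 10000, 64                       # hypothetical vocabulary size and embedding dimension
rng = np.random.RandomState(0)
W1 = 0.01 * rng.randn(V, D)            # learned input-to-hidden weights W^(1), shape (V, D)

word_index = 42                        # index of some word in the vocabulary
x = np.zeros(V)
x[word_index] = 1.0                    # its one-hot vector

h = W1.T.dot(x)                        # h = W^(1)^T x, the dense embedding of the word
assert np.allclose(h, W1[word_index])  # since x is one-hot, this is just a row lookup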

In the literature, the architecture of a word2vec NN comes in two flavors: the Continuous Bag-of-Words (CBOW) model and the skip-gram (SG) model.

[Figure: the CBOW and skip-gram architectures]

Algorithmically, these models are similar, except that CBOW predicts a target word (e.g., "cat") from its context words ("is," "my," "favorite," and "pet"), while the skip-gram does the inverse and predicts the context words from the target word. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information by treating an entire context as one example, e.g., (C:["is," "my," "favorite," "pet"], T:"cat"). For the most part, this turns out to be useful for smaller datasets. On the other hand, the skip-gram model treats each context-target pair, e.g., (T:"cat", C:"pet"), as a new observation and has been shown to capture the semantics better when we have a large dataset. We focus on the skip-gram model in the following.

Note that the weights are shared across words to ensure that each word has a single embedding. This is called weight tying. Also, word2vec is an unsupervised learning task, as it does not require explicit labels. An NN can be used for both supervised and unsupervised learning tasks.

Cost Function and Output Layer

Like most NNs, a skip-gram word2vec model is trained using the maximum likelihood (ML) principle:

$$\arg\min_{\Theta}\sum_{i=1}^{N}{-\log\mathrm{P}(\boldsymbol{y}^{(i)}\,|\,\boldsymbol{x}^{(i)},\Theta)}.$$

Unlike the binary classifier NN we have seen previously, it needs to output multiple classes (the vocabulary size $V$ in total). In a multiclass task where $y=1,\cdots,V$, we usually assume

$$\Pr(y\,|\,\boldsymbol{x})\sim\mathrm{Categorical}(y\,|\,\boldsymbol{x};\boldsymbol{\rho})=\prod_{i=1}^{V}\rho_{i}^{1(y;\,y=i)}.$$

It is natural to use $V$ Softmax units in the output layer. That is, the activation function of each unit outputs one dimension of the softmax function, a generalization of the logistic sigmoid:

$$a_i^{(L)}=\rho_i=\mathrm{softmax}(\boldsymbol{z}^{(L)})_{i}=\frac{\exp(z_{i}^{(L)})}{\sum_{j=1}^{{\color{red}V}}\exp(z_{j}^{(L)})}.$$

The cost function then becomes:

$$\arg\min_{\Theta}\sum_{i}-\log\prod_{j}\left(\frac{\exp(z_{j}^{(L)})}{\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)})}\right)^{1(y^{(i)};y^{(i)}=j)}=\arg\min_{\Theta}\sum_{i}-z_{y^{(i)}}^{(L)}+\log\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)})$$

Basically, we want to maximize $\rho_j$ when seeing an example of class $j$. However, this objective incurs a high training cost when $V$ is large. Recall from the lecture that, at every SGD training step, we need to compute the gradient of the cost function with respect to $\boldsymbol{z}^{(L)}$. This gradient involves the $z_{i}^{(L)}$ of every unit at the output layer, which in turn leads to a lot of weight updates in $\boldsymbol{W}^{(1)}$ and $\boldsymbol{W}^{(2)}$ at every training step. Training will be very slow.
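
To see this concretely, here is a small NumPy sketch (illustrative only) of the softmax cross-entropy loss and its gradient with respect to $\boldsymbol{z}^{(L)}$; the gradient is dense, i.e., nonzero for essentially every one of the $V$ output units:

import numpy as np

V = 10000                              # vocabulary size (illustrative)
rng = np.random.RandomState(0)
z = rng.randn(V)                       # output-layer pre-activations z^(L)
y = 7                                  # index of the observed class (the context word)

p = np.exp(z - z.max())                # softmax, shifted for numerical stability
p /= p.sum()
loss = -np.log(p[y])                   # = -z_y + log(sum_k exp(z_k))

grad_z = p.copy()
grad_z[y] -= 1.0                       # d loss / d z^(L) = softmax(z) - onehot(y)
print(np.count_nonzero(grad_z))        # ~V nonzero entries -> many weight updates per step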

Negative Sampling

To speed up the training process, we can instead replace the $V$ Softmax units at the output layer with $V$ Sigmoid units and use a new cost function:

$$\arg\min_{\Theta}\sum_{i=1}^{N}\left[-\log\mathrm{P}(y^{(i)}=1\,|\,\boldsymbol{x}^{(i)},\Theta)-\sum_{j\in\mathbb{N}^{(i)}}\log\mathrm{P}(y^{(j)}=0\,|\,\boldsymbol{x}^{(j)},\Theta)\right],$$

which amounts to:

$$\arg\min_{\Theta}\sum_{i}\left[-\log\rho_{y^{(i)}}-\sum_{j\in\mathbb{N}^{(i)}}\log(1-\rho_{y^{(j)}})\right],$$

where $\rho_{y^{(i)}}=\sigma(z_{y^{(i)}}^{(L)})$ and $\mathbb{N}^{(i)}$, $\vert\mathbb{N}^{(i)}\vert=U\ll V$, is a set of indices of noise words that have never appeared in the same context as the target word. The cost function is minimized when the model assigns a high $\rho_{y^{(i)}}$ to the context word and low $\rho_{y^{(j)}}$'s to all its noise words. It can be shown that the new model approximates the original one (with the softmax output units) when trained on infinitely many examples. Moreover, the new model is computationally advantageous because the cost function involves only $O(U)$ entries of $\boldsymbol{z}^{(L)}$ (thus $O(U)$ output units). This leads to far fewer weight updates at each training step and thus higher efficiency.

The above trick for improving the training efficiency of a multiclass classification task is called negative sampling. Note that each example now contains one context word plus $U$ noise words, so we need some extra preprocessing steps.

Preparing Training Examples

With negative sampling, each training example consists of one correct context word and $U$ noise words. To give an example, let's consider the corpus:

the quick brown fox jumped over the lazy dog ...

Suppose the context window spans 2 words (1 to the left and 1 to the right). We scan through the corpus by moving the window from left to right and get the context words (C) and target words (T):

(C:[the, brown], T:quick), 
(C:[quick, fox], T:brown), 
(C:[brown, jumped], T:fox), 
...

Recall that the skip-gram model tries to predict each context word from its target word. So, we break a (C:[context1, context2, ...], T:target) pair into (T:target, C:context1), (T:target, C:context2), and so on. Our dataset becomes:

(T:quick, C:the), 
(T:quick, C:brown), 
(T:brown, C:quick), 
(T:brown, C:fox), 
...

To support negative sampling, we need to sample $U$ noise words (N) for each of the above pairs according to some distribution (typically the unigram distribution). For example, with $U=2$, our final dataset could be:

(T:quick, C:the, N:[over, lazy]), 
(T:quick, C:brown, N:[fox, over]), 
(T:brown, C:quick, N:[the, lazy]), 
(T:brown, C:fox, N:[jumped, dog]), 
...

Given an example tuple, we regard the context word (C) as positive and the noise words (N1 and N2) as negative when evaluating the cost function defined above. Specifically, the Sigmoid units in the output layer output $\rho_C$ for the context word and $\rho_{N1}$ and $\rho_{N2}$ for the noise words, and the loss for this example is

$$-\log\rho_{C}-\log(1-\rho_{N1})-\log(1-\rho_{N2}).$$
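
Putting these preprocessing steps together, here is a toy sketch (for illustration only; it is not the preprocessing Gensim performs internally) that generates (target, context, noise) tuples from the example corpus, using a window of 1 word on each side and $U=2$ noise words sampled uniformly:

import numpy as np

corpus = "the quick brown fox jumped over the lazy dog".split()
vocab = sorted(set(corpus))
window, U = 1, 2                       # 1 context word on each side, U noise words per pair
rng = np.random.RandomState(0)

examples = []
for t, target in enumerate(corpus):
    for c in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if c == t:
            continue
        context = corpus[c]
        # sample U noise words that are neither the target nor the context
        # (uniformly here; word2vec actually uses a smoothed unigram distribution)
        candidates = [w for w in vocab if w not in (target, context)]
        noise = list(rng.choice(candidates, size=U, replace=False))
        # in the loss above, the context word is the positive example
        # and the noise words are the negatives
        examples.append((target, context, noise))

print(examples[:4])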

Gensim: a Word2Vec Library

Gensim is a free Python library for extracting semantic topics from documents. It implements word2vec with both the CBOW and skip-gram models. Let's install it first:

> conda install gensim

Next, we train a skip-gram model with negative sampling using the 20 news groups dataset.

In [9]:
import gensim

# preprocess the text
corpus = []
for text in newsgroups.data:
    corpus.append(tokenizer_stem_nostop(preprocessor(text)))

# size : embed dimension
# min_count : filter words without min frequency
# sg : 0 for CBOW; 1 for skip-gram
# negative : how many noise words should be drawn
%time model_w2v = gensim.models.Word2Vec(corpus, size=64, min_count=5, sg=1, negative=5, workers=2)
CPU times: user 23.3 s, sys: 205 ms, total: 23.5 s
Wall time: 13.3 s

NOTE: if you see a warning like

Pattern library is not installed, lemmatization won't be available.

don't worry; it appears because lemmatization (a kind of stemming) in Gensim requires the Python pattern package, which does not support Python 3. We don't need lemmatization since we preprocessed the text ourselves.

Capturing Semantics

After training, we can check whether the model captures the semantics of words:

In [10]:
display(model_w2v.most_similar('car'))
display(model_w2v.most_similar('baseball'))
display(model_w2v.most_similar('electronic'))
display(model_w2v.most_similar('american'))
[('dealer', 0.8815475702285767),
 ('cars', 0.8629363179206848),
 ('carnum', 0.8581898212432861),
 ('auto', 0.8568898439407349),
 ('tires', 0.8530106544494629),
 ('toyota', 0.8468404412269592),
 ('nissan', 0.837895929813385),
 ('taurus', 0.8325661420822144),
 ('driving', 0.8298842906951904),
 ('ford', 0.8290390968322754)]
[('football', 0.8473756313323975),
 ('hockey', 0.8451310396194458),
 ('manager', 0.8335633277893066),
 ('basketball', 0.8320602178573608),
 ('pro', 0.8268144130706787),
 ('fans', 0.8199853897094727),
 ('hockeynum', 0.8151143789291382),
 ('hall', 0.8130113482475281),
 ('predictions', 0.8118510246276855),
 ('stat', 0.8079237341880798)]
[('networks', 0.8962923884391785),
 ('computer', 0.889359712600708),
 ('bulletin', 0.8893052339553833),
 ('identification', 0.8794335126876831),
 ('privacynum', 0.8599065542221069),
 ('policies', 0.8597185611724854),
 ('computers', 0.8585481643676758),
 ('telecommunications', 0.8580593466758728),
 ('identity', 0.8573257327079773),
 ('automated', 0.8567608594894409)]
[('national', 0.8311284780502319),
 ('america', 0.7881642580032349),
 ('association', 0.786353349685669),
 ('city', 0.7755874991416931),
 ('statenum', 0.7703492641448975),
 ('canadian', 0.7644796371459961),
 ('north', 0.7620282173156738),
 ('florida', 0.7597941756248474),
 ('california', 0.7586327791213989),
 ('associationnum', 0.754423975944519)]

Looks good! Some semantically similar words are indeed close to the input in the embedding space.

A more interesting result is that certain directions in the embedding space specialize towards certain semantic relationships, e.g. male-female, verb tense, and even country-capital relationships between words, as illustrated in the figure below:

Based on this, we can perform arithmetic operations over the word2vec vectors. For example,

$$\text{bmw}-\text{europe}+\text{america}=\,?$$
In [107]:
display(model_w2v.most_similar(positive=['america', 'bmw'], negative=['europe']))
[('owner', 0.8824532628059387),
 ('dodge', 0.8130589723587036),
 ('owners', 0.8003060221672058),
 ('civic', 0.7949789762496948),
 ('camaro', 0.7898281812667847),
 ('weekend', 0.7856303453445435),
 ('friend', 0.784346342086792),
 ('rain', 0.7831990718841553),
 ('mustang', 0.7787527441978455),
 ('owned', 0.7767765522003174)]

We can see some meaningful results such as "dodge," a typical muscle car brand from the U.S. As another example:

$$\text{soccer}-\text{europe}+\text{america}=\,?$$
In [108]:
display(model_w2v.most_similar(positive=['america', 'soccer'], negative=['europe']))
[('basketball', 0.934349000453949),
 ('leaguenum', 0.9208471775054932),
 ('leagues', 0.9182654619216919),
 ('championship', 0.9121876955032349),
 ('teamsnum', 0.9097894430160522),
 ('stanley', 0.9097744822502136),
 ('winning', 0.9021807312965393),
 ('espn', 0.8976176381111145),
 ('playoff', 0.8952078819274902),
 ('espnnum', 0.8938581943511963)]

We see "basketball," which makes sense. The above results show that directions in the embedding space are roughly aligned with some semantic relationships, such as (country, car brand) and (country, sport).

Text Classification using Word2Vec-based Features

Next, let's replace the one-hot-vector representation in our previous text classification model with the word2vec representation to see how the semantics help. We first define a function that transforms the tf-idf features of a document into word2vec-based features by summing the word2vec embeddings of its words weighted by their tf-idf scores:

In [102]:
def tfidf_word2vec_transform(model_w2v, embed_dim, inv_dic_tfidf, score_tfidf):
    tfidf_word2vec = np.zeros((score_tfidf.shape[0], embed_dim))
    for i in range(score_tfidf.shape[0]):
        doc_word2vec = np.zeros(embed_dim)
        for j in range(score_tfidf.shape[1]):
            doc_word2vec += score_tfidf[i, j] * model_w2v[inv_dic_tfidf[j]]
        tfidf_word2vec[i, :] = doc_word2vec
    return tfidf_word2vec

X_train_w2v = tfidf_word2vec_transform(model_w2v, 64, inverse_dictionary, X_train)
X_test_w2v = tfidf_word2vec_transform(model_w2v, 64, inverse_dictionary, X_test)
In [106]:
batch_size = 32

model = gen_nn(input_dim=X_train_w2v.shape[1])
        
%time his = model.fit(X_train_w2v, y_train, \
                      nb_epoch=1000, \
                      batch_size=batch_size, \
                      validation_split=0.2, \
                      shuffle=True, verbose=0, \
                      callbacks=[early_stop])

# evaluate trained model
score = model.evaluate(X_test_w2v, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
train_loss = his.history['loss']
val_loss = his.history['val_loss']

# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.xlim(0, len(train_loss))
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-w2v-val.png', dpi=300)
plt.show()
CPU times: user 19.3 s, sys: 162 ms, total: 19.4 s
Wall time: 19.8 s

Test loss: 0.368
Test accuracy: 0.826

We can see from the above that the word2vec-based feature representation has a much smaller dimension (64) but does not compromise the accuracy much. This offers a good trade-off in large-scale learning tasks where memory is a scarce resource. The semantics do help in this case!

In [ ]: