Fall 2016

In [1]:

```
from IPython.display import Image
from IPython.display import display
# inline plotting instead of popping out
%matplotlib inline
# load utility classes/functions that has been taught in previous labs
# e.g., plot_decision_regions()
import os, sys
module_path = os.path.abspath(os.path.join('.'))
sys.path.append(module_path)
from lib import *
```

**word2vec** embeds words into a dense vector space where semantically similar words are mapped to nearby points. The weights learned by the word2vec network can be used to generate the input of the text-classification network, reducing memory consumption.

Keras is a simple-to-use, high-level neural-network library written in Python that runs on top of either TensorFlow or Theano, two well-known low-level neural-network libraries that offer the necessary computing primitives (including GPU parallelism). The high-level, modular API offered by Keras is compatible with Scikit-learn and allows fast prototyping.

You can install `Keras` using:

`> pip install --upgrade keras`

In the rest of this and the following labs, we will use the `theano` backend. Go to the configuration file:

`~/.keras/keras.json`

If it isn't there, you can create it. The default configuration file looks like this:

```
{
"image_dim_ordering": "tf",
"epsilon": 1e-07,
"floatx": "float32",
"backend": "tensorflow"
}
```

Simply change the field `"backend"` to `"theano"`.
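After the change, the configuration file should look like this (the other fields stay unchanged):

```
{
"image_dim_ordering": "tf",
"epsilon": 1e-07,
"floatx": "float32",
"backend": "theano"
}
```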

Note that Theano currently supports only up to Python 3.4. So, you may have to switch to a Python 3.4 kernel in Jupyter notebook. If there is no Python 3.4 kernel, you can create one on the command line using Anaconda:

`> conda create --name [env-name] python=3.4`

then restart the Jupyter notebook server. In case of missing packages (e.g., `matplotlib` or `nltk`), activate the Python 3.4 environment in the command line:

Windows:

`> activate [env-name]`

Linux or Mac OS X:

`> source activate [env-name]`

then install the missing packages, e.g.,

`> conda install nltk`

To run Theano properly on CPU you need a C++ compiler. On Mac OS X this means that you should have installed Xcode and run it at least once. On Windows you need Visual Studio. For more instructions on Windows, please refer to this blog post.

The `Sequential` model is a linear stack of layers, commonly used to implement a **feedforward neural network**. A typical NN for binary classification looks like this:

Each circle denotes a non-linear function called a **unit** or **neuron**. The units are organized in multiple **layers**, where the layer closest to the output (colored in red) is called the output layer and the rest are called hidden layers. The units are interconnected by weights (solid lines). We denote $W_{i,j}^{(k)}$ as the weight from the $i$-th unit at the $(k-1)$-th layer to the $j$-th unit at the $k$-th layer. We use the word "feedforward" because information flows from input to output.

In binary classification where $y\in \{0,1\}$, we can think of this NN as a complex nonlinear function that takes a data point $\boldsymbol{x}$ as the input and outputs

$$a^{(L)}=\rho$$

from the output unit, assuming

$$\Pr(y=1\,|\,\boldsymbol{x})\sim\mathrm{Bernoulli}(\rho).$$

With $\rho$, we know the entire distribution of $\Pr(y\,|\,\boldsymbol{x})$, which we can use either to define a cost function (for training the weights $W_{i,j}^{(k)}$'s) or to make predictions:

$$\hat{y}=\arg\max_{y}\mathrm{P}(y\,|\,\boldsymbol{x};\rho)=\arg\max_{y}\rho^{y}(1-\rho)^{(1-y)}.$$

We can regard each unit as a simple nonlinear function. The design of units in modern NNs is guided by math and engineering principles, as discussed in the lecture. Typically, we use the **Rectified Linear Unit (ReLU)** as a hidden unit:

and the **Sigmoid unit** as the output unit:

We can see that they differ only in the **activation function**. Note that with the Sigmoid output unit, we have

Looks familiar? This NN is basically a logistic regressor that takes $\boldsymbol{a}^{(L-1)}$ as the input.
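A minimal NumPy sketch of this architecture's forward pass (untrained random weights and arbitrary sizes, for illustration only):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)          # hidden activation

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # output activation

rng = np.random.RandomState(0)
x = rng.randn(512)                                # a single input vector
W1, b1 = rng.randn(512, 32) * 0.01, np.zeros(32)  # input -> hidden weights
W2, b2 = rng.randn(32, 1) * 0.01, np.zeros(1)     # hidden -> output weights

a1 = relu(x.dot(W1) + b1)           # hidden layer: ReLU units
rho = sigmoid(a1.dot(W2) + b2)[0]   # output unit: Pr(y=1 | x)
y_hat = int(rho > 0.5)              # predicted class
```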

The two main components for building neural-network architectures in Keras are **Layer** and **Activation**. There are many useful layers (e.g., embedding, dense, convolutional, recurrent) and activation functions (e.g., sigmoid, relu, tanh, softmax) in Keras. You can find what you need in the Keras documentation.

For now, let's create a binary classifier with 1 hidden layer and typical neurons:

In [2]:

```
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(input_dim=512, output_dim=32))
model.add(Activation('relu'))
model.add(Dense(output_dim=1))
model.add(Activation('sigmoid'))
```

You can also create a `Sequential` model by passing a list of layer instances to the constructor:

In [3]:

```
model = Sequential([
    Dense(input_dim=512, output_dim=32),
    Activation('relu'),
    Dense(output_dim=1),
    Activation('sigmoid'),
])
```

Most NNs are trained using the maximum likelihood (ML) principle, with the goal of maximizing the log-likelihood $\log\mathrm{P}(\mathbb{X}\,|\,\Theta)$, where $\Theta=\{\boldsymbol{W}^{(1)}, \cdots, \boldsymbol{W}^{(L)}\}$:

$$\arg\max_{\Theta}\log\mathrm{P}(\mathbb{X}\,|\,\Theta)=\arg\min_{\Theta}-\log\mathrm{P}(\mathbb{X}\,|\,\Theta).$$

When examples are i.i.d., this can be simplified to

$$\arg\min_{\Theta}\sum_{i=1}^{N}{-\log\mathrm{P}(\boldsymbol{y}^{(i)}\,|\,\boldsymbol{x}^{(i)},\Theta)}.$$

When $\Pr(y=1\,|\,\boldsymbol{x})\sim\mathrm{Bernoulli}(\rho)$ in binary classification, we have

$$-\log\mathrm{P}(y^{(i)}\,|\,\boldsymbol{x}^{(i)};\Theta)=-\log[(a^{(L)})^{y^{(i)}}(1-a^{(L)})^{1-y^{(i)}}]=-y^{(i)}\log a^{(L)}-(1-y^{(i)})\log(1-a^{(L)}).$$

The goal is to maximize $a^{(L)}=\rho$ when we see a positive example, and to minimize it when we see a negative one. As discussed in the lecture, this problem can also be regarded as minimizing the **cross entropy** from the predicted distribution $\mathrm{P}(\mathrm{y}\,|\,\boldsymbol{\mathrm{x}};\Theta)$ to the data distribution $\mathrm{Empirical}(\mathrm{y}\,|\,\boldsymbol{\mathrm{x}};\mathbb{X})$.
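As a concrete check, the per-example cross-entropy loss can be evaluated directly in NumPy (labels and predictions here are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y, rho, eps=1e-7):
    # -y*log(rho) - (1-y)*log(1-rho), clipped for numerical safety
    rho = np.clip(rho, eps, 1 - eps)
    return -(y * np.log(rho) + (1 - y) * np.log(1 - rho))

y = np.array([1, 0, 1])          # labels
rho = np.array([0.9, 0.1, 0.2])  # predicted Pr(y=1 | x)
loss = binary_cross_entropy(y, rho)
# confident correct predictions get a small loss;
# a confident wrong one (rho=0.2 for y=1) gets a large loss
```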

Having defined our objective, we can solve it using an optimization algorithm. Most NNs employ the SGD algorithm, as it offers many practical advantages, as discussed in the lecture.

Back to our Keras example. After defining the network architecture, we need to configure how to train the network, which is done via the `compile` method:

In [4]:

```
# for a binary classification problem
from keras.optimizers import SGD
# instantiate an optimizer with specific parameters
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9)
model.compile(optimizer=sgd,
loss='binary_crossentropy',
metrics=['accuracy'])
```

You can pass an optimizer instance to `model.compile()`, as in the above example, or you can refer to the optimizer by name, as below. In the latter case, the default parameters of the optimizer will be used.

In [5]:

```
# or use default parameters
model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['accuracy'])
```

Let's wrap it up by defining a model-generating function:

In [3]:

```
def gen_nn(input_dim=512, width=32, depth=2):
    model = Sequential()
    model.add(Dense(input_dim=input_dim, output_dim=width))
    model.add(Activation('relu'))
    for k in range(2, depth):
        model.add(Dense(output_dim=width))
        model.add(Activation('relu'))
    model.add(Dense(output_dim=1))
    model.add(Activation('sigmoid'))
    model.compile(optimizer='sgd',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```

Having a compiled model, we can train it by calling the `fit()` method and passing the training examples. Let's apply our model to classifying the text in the 20 News Groups dataset.

The 20 newsgroups dataset comprises around 18,000 newsgroup posts (data points) on 20 topics (labels). We can load the dataset using `sklearn`. For simplicity, we consider a binary classification task that distinguishes the "science" labels from the "entertainment" labels only.

In [4]:

```
from sklearn.datasets import fetch_20newsgroups
# we take 2 news groups:
# 1) rec : recreational activities (entertainment)
# 2) sci : science
categories = [
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
]
# remove the following context, we only need the content
remove = ('headers', 'footers', 'quotes')
# after fetching the data
# content will store in newsgroups.data
# label will store in newsgroups.target
newsgroups = fetch_20newsgroups(subset='all', categories=categories,
shuffle=True, random_state=0,
remove=remove)
```

We get about 8000 data points:

In [5]:

```
print('#Data size: %s' % np.shape(newsgroups.data))
print('Labels: %s' % np.unique(newsgroups.target))
```

Next, we transform each document into a tf-idf feature vector using the utility functions taught in previous labs (see `lib.py` for more details):

In [5]:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# for simplicity, we only take 512 features
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features=512, sublinear_tf=True,
max_df=0.5, preprocessor=preprocessor, tokenizer=tokenizer_stem_nostop)
X = vectorizer.fit_transform(newsgroups.data).toarray()
y = (newsgroups.target > 3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=0)
# the dictionary map from word to feature index
dictionary = vectorizer.vocabulary_
# construct inverse_dictionary for later use
inverse_dictionary = {v: k for k, v in dictionary.items()}
```

Now, we can feed the data into our compiled NN model:

In [6]:

```
import time
batch_size = 32
model = gen_nn(input_dim=X_train.shape[1])
%time his = model.fit(X_train, y_train, nb_epoch=400, \
batch_size=batch_size, \
validation_split=0.2, \
shuffle=True, verbose=0)
# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
```

The accuracy is about 82%. We can plot the learning curve to see the convergence:

In [7]:

```
train_loss = his.history['loss']
val_loss = his.history['val_loss']
# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-val.png', dpi=300)
plt.show()
```

The validation loss starts to increase after a certain number of epochs, indicating overfitting. One remedy is to **early stop** the SGD algorithm once overfitting is detected. We can reduce the number of epochs to 150 and retrain:

In [8]:

```
model = gen_nn(input_dim=X_train.shape[1])
%time his = model.fit(X_train, y_train, \
nb_epoch=150, \
batch_size=batch_size, \
validation_split=0.2, \
shuffle=True, verbose=0)
# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
```

Now we have slightly better accuracy on the test set.

In the above, we don't actually "early stop" SGD; we just train an NN twice. It is generally hard to know how many epochs we need in the first training, and starting over in the second training is time-consuming. Keras provides hooks for callback functions that you can implement to get a view of the internal states and statistics of the model during training and to perform monitoring tasks. For early stopping, there is an `EarlyStopping` class that helps you stop the current training process:

In [11]:

```
from keras.callbacks import EarlyStopping
model = gen_nn(input_dim=X_train.shape[1])
# monitor the validation loss
# min_delta : minimum change in the monitored quantity to qualify as an improvement
# patience : number of epochs with no improvement after which training will be stopped
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=20, verbose=0)
%time his = model.fit(X_train, y_train, \
nb_epoch=1000, \
batch_size=batch_size, \
validation_split=0.2, \
shuffle=True, verbose=0, \
callbacks=[early_stop])
# evaluate trained model
score = model.evaluate(X_test, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
train_loss = his.history['loss']
val_loss = his.history['val_loss']
# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.xlim(0, len(train_loss))
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-val-earlystop.png', dpi=300)
plt.show()
```

As we can see, it takes less time to get the desired performance.

Theano supports GPU (Nvidia only) to speed up the training. You can follow the steps below to enable the GPU in Theano:

- Install the Nvidia driver;
- Install CUDA 8.0;
- Download and install cuDNN;
- Save the following to `~/.theanorc` (to switch back to CPU, remove the `device=gpu0` line):

```
[global]
floatX=float32
device=gpu0

[cuda]
root=/usr/local/cuda-8.0

[nvcc]
fastmath=True

[blas]
ldflags = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm
```

It requires some extra steps to enable the GPU on Windows; see this blog post. If you encounter any problem, Google it! In Jupyter notebook, you have to restart the kernel once you modify the `~/.theanorc` configuration file.

If you rerun the above code blocks with the GPU enabled, you may find that the training time does **not** improve much. This is mainly because we use a very small batch size (32). In this case, you will not take advantage of GPU parallelism, since the bottleneck is the data transfer between the CPU and GPU---at each SGD iteration, the CPU needs to send a batch of data to the GPU to compute the updated weights.

We can reduce the overhead of data transfer by setting a larger batch size. Let's run the code below twice, **using CPU in the first time and GPU in the second**, and compare the training time:

In [18]:

```
from itertools import product
import time
import pickle
import os.path
input_dim = [512, 2048]
batch_size = [32, X_train.shape[0]]
layer = [2, 4]
width = [32, 512]
train_time = {}
index = [tuple("{0:04b}".format(i)) for i in [0, 1, 2, 4, 8, 5]]
for (i, b, d, w) in index:
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                                 max_features=input_dim[int(i)], sublinear_tf=True,
                                 max_df=0.5, preprocessor=preprocessor,
                                 tokenizer=tokenizer_stem_nostop)
    X = vectorizer.fit_transform(newsgroups.data).toarray()
    y = (newsgroups.target > 3).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = gen_nn(input_dim=X_train.shape[1], width=width[int(w)],
                   depth=layer[int(d)])
    start = time.time()
    his = model.fit(X_train, y_train,
                    nb_epoch=1000,
                    batch_size=batch_size[int(b)],
                    validation_split=0.2,
                    shuffle=True, verbose=0)
    train_time[(i, b, d, w)] = time.time() - start
# save train_time using pickle on the first (CPU) run;
# rerun this block with the other configuration (GPU),
# then load the saved CPU times and compare the results
file_name = 'cpu_train_time.pickle'
if not os.path.isfile(file_name):
    with open(file_name, 'wb') as handle:
        pickle.dump(train_time, handle)
else:
    with open(file_name, 'rb') as handle:
        cpu_train_time = pickle.load(handle)
    gpu_train_time = train_time
    # compare CPU & GPU time
    cpu_time = [cpu_train_time[(i, b, d, w)] for (i, b, d, w) in index]
    gpu_time = [gpu_train_time[(i, b, d, w)] for (i, b, d, w) in index]
    n_groups = 6
    index = np.arange(n_groups)
    bar_width = 0.35
    opacity = 0.4
    rects1 = plt.bar(index, cpu_time, bar_width,
                     alpha=opacity, color='b', label='CPU')
    rects2 = plt.bar(index + bar_width, gpu_time, bar_width,
                     alpha=opacity, color='r', label='GPU')
    plt.ylabel('Time (sec)')
    plt.title('Training time comparison')
    plt.xticks(index + bar_width, ('Default', 'Width-512', 'Layer-4', 'Batch-full',
                                   'Dim-2048', 'Width-512\n+ Batch-full'))
    plt.legend()
    plt.savefig('./output/fig-nn-train-time.png', dpi=300)
    plt.show()
```

We get the above results by comparing the training time of an Intel Core i5 CPU with a GeForce GTX 1070 GPU. We can see that when using a large batch size, the GPU time gets close to the CPU time due to less data transfer. Furthermore, if we increase the width of the network (from 32 to 512) at each hidden layer, the GPU outperforms the CPU because the weight matrix ($\boldsymbol{W}^{(k)}$) at each layer becomes larger and can be updated more efficiently by GPU parallelism. This is a common situation when we have a big model.

On the other hand, some other hyperparameters, such as the number of layers and input dimension, do not seem to impact the GPU speedup. The number of layers does not help because the weight matrices ($\boldsymbol{W}^{(k)}$'s) across different layers are updated sequentially during each SGD iteration.

So far, we have used tf-idf to transform a document into a vector. This vector can be seen as a weighted sum of the **one-hot vectors** of the distinct words in the document. One-hot word vectors have some limitations. For example, we know that "cat" is semantically more similar to "dog" than to "cow". However, the one-hot vectors of these three words are all equally similar to each other: the inner product between any two of them is 0. Can we embed words into a vector space that preserves the semantic distance between words?
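A tiny NumPy illustration of the problem, using a hypothetical three-word vocabulary:

```python
import numpy as np

vocab = {'cat': 0, 'dog': 1, 'cow': 2}
one_hot = np.eye(len(vocab))   # each row is a word's one-hot vector

# every pair of distinct words is equally (dis)similar:
# the inner product between any two one-hot vectors is 0
print(one_hot[vocab['cat']].dot(one_hot[vocab['dog']]))  # 0.0
print(one_hot[vocab['cat']].dot(one_hot[vocab['cow']]))  # 0.0
```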

**Word2vec** is an NN-based word embedding method. It represents words in a continuous, low-dimensional vector space (i.e., the embedding space) where semantically similar words are mapped to nearby points. The main idea of word2vec is to use an NN with only **1 linear hidden layer** (i.e., each hidden unit has the linear activation function $a_j^{(1)}=z_j^{(1)}$) to predict the neighbors of a given word, as shown below:

This is based on a key observation that **semantically similar words are often used interchangeably in different contexts**. For example, the words "cat" and "dog" may both appear in the context "___ is my favorite pet." When feeding "cat" and "dog" into the NN to predict their nearby words, these two words will likely share the same/similar hidden representation ($\boldsymbol{a}^{(1)}$ at layer 1) in order to predict the same/similar nearby words ($\boldsymbol{a}^{(2)}$ at layer 2). After training the NN, we can use the weight matrix $\boldsymbol{W}^{(1)}$ to encode a one-hot vector $\boldsymbol{x}$ into a low-dimensional dense vector $\boldsymbol{h}$, by $\boldsymbol{h}=\boldsymbol{W}^{(1)\top}\boldsymbol{x}$.
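Since $\boldsymbol{x}$ is one-hot, the product $\boldsymbol{h}=\boldsymbol{W}^{(1)\top}\boldsymbol{x}$ simply picks out one row of $\boldsymbol{W}^{(1)}$, so the embedding lookup is just a row read. A quick NumPy sketch (random weights, arbitrary sizes) confirms this:

```python
import numpy as np

V, embed_dim = 1000, 64        # vocabulary size, embedding dimension
rng = np.random.RandomState(0)
W1 = rng.randn(V, embed_dim)   # input-to-hidden weight matrix W^(1)

word_index = 42                # index of some word in the vocabulary
x = np.zeros(V)
x[word_index] = 1.0            # its one-hot vector

h = W1.T.dot(x)                # hidden representation of the word
assert np.allclose(h, W1[word_index])  # same as reading off one row of W^(1)
```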

In the literature, the architecture of a word2vec NN comes in two flavors: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram (SG) model.

CBOW | Skip-gram |

Algorithmically, these models are similar, except that CBOW predicts target words (e.g., "cat") from context words ("is," "my," "favorite," and "pet"), while skip-gram does the inverse and predicts context words from the target words. This inversion might seem like an arbitrary choice, but statistically CBOW smoothes over a lot of the distributional information by treating an entire context as one example, e.g., (C:["is," "my," "favorite," "pet"], T:"cat"). For the most part, this turns out to be useful for smaller datasets. On the other hand, skip-gram treats each target-context pair (e.g., (T:"cat", C:"pet")) as a new observation and has been shown to capture the semantics better when we have a large dataset. We focus on the skip-gram model in the following.

Note that the weights are shared across words to ensure that each word has a single embedding. This is called **weight tying**. Also, word2vec is an **unsupervised learning** task, as it does not require explicit labels. An NN can be used for both supervised and unsupervised learning tasks.

As most NNs, a skip-gram word2vec model is trained using the maximum likelihood (ML) principle:

$$\arg\min_{\Theta}\sum_{i=1}^{N}{-\log\mathrm{P}(\boldsymbol{y}^{(i)}\,|\,\boldsymbol{x}^{(i)},\Theta)}.$$

Unlike the binary classifier NN we have seen previously, it needs to output multiple classes (the vocabulary size $V$ in total). In a multiclass task where $y=1,\cdots,V$, we usually assume

$$\Pr(y\,|\,\boldsymbol{x})\sim\mathrm{Categorical}(y\,|\,\boldsymbol{x};\boldsymbol{\rho})=\prod_{i=1}^{V}\rho_{i}^{1(y;\,y=i)}.$$

It is natural to use $V$ **Softmax units** in the output layer. That is, the activation function of each unit outputs one dimension of the softmax function, a generalization of the logistic sigmoid:

The cost function then becomes:

$$\arg\min_{\Theta}\sum_{i}-\log\prod_{j}\left(\frac{\exp(z_{j}^{(L)})}{\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)})}\right)^{1(y^{(i)};\,y^{(i)}=j)}=\arg\min_{\Theta}\sum_{i}-z_{y^{(i)}}^{(L)}+\log\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)}).$$

Basically, we want to maximize $\rho_j$ when seeing an example of class $j$. However, this objective incurs a high training cost when $V$ is large. Recall from the lecture that, at every SGD training step, we need to compute the gradient of the cost function with respect to $\boldsymbol{z}^{(L)}$. This gradient involves the $z_{i}^{(L)}$ of **every unit** at the output layer, which in turn leads to a lot of weight updates in $\boldsymbol{W}^{(1)}$ and $\boldsymbol{W}^{(2)}$ at every training step. The training will be very slow.
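The right-hand side above is the log-sum-exp form of the softmax cost, which can be evaluated directly. A small NumPy sketch with made-up pre-activations (the max is subtracted first, a standard trick for numerical stability that does not change the result):

```python
import numpy as np

def softmax_cross_entropy(z, y):
    # -z_y + log(sum_k exp(z_k)), computed stably
    z = z - z.max()
    return -z[y] + np.log(np.exp(z).sum())

z = np.array([2.0, 1.0, 0.1])        # output-layer pre-activations z^(L)
loss = softmax_cross_entropy(z, y=0)

# this equals -log(softmax(z)[y])
p = np.exp(z) / np.exp(z).sum()
assert np.allclose(loss, -np.log(p[0]))
```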

To speed up the training process, we can instead replace the $V$ Softmax units at the output layer with $V$ Sigmoid units and use a new cost function:

$$\arg\min_{\Theta}\sum_{i=1}^{N}\left[-\log\mathrm{P}(y^{(i)}=1\,|\,\boldsymbol{x}^{(i)},\Theta)-\sum_{j\in\mathbb{N}^{(i)}}\log\mathrm{P}(y^{(j)}=0\,|\,\boldsymbol{x}^{(j)},\Theta)\right],$$

which amounts to:

$$\arg\min_{\Theta}\sum_{i}\left[-\log\rho_{y^{(i)}}-\sum_{j\in\mathbb{N}^{(i)}}\log(1-\rho_{y^{(j)}})\right],$$

where $\rho_{y^{(i)}}=\sigma(z_{y^{(i)}}^{(L)})$ and $\mathbb{N}^{(i)}$, $\vert\mathbb{N}^{(i)}\vert=U\ll V$, is a set of indexes of the **noise words** that have never appeared in the same context as the target word. The cost function is minimized when the model assigns a high $\rho_{y^{(i)}}$ to the context word and low $\rho_{y^{(j)}}$'s to all its noise words. It can be shown that the new model approximates the original one (with the Softmax output units) when trained on infinite examples. But the new model is computationally advantageous because the cost function involves only $O(U)$ attributes of $\boldsymbol{z}^{(L)}$ (thus $O(U)$ output units). This leads to far fewer weight updates at each training step and higher efficiency.
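A NumPy sketch of this per-example cost, with made-up pre-activation scores and $U=2$ noise words:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_context = 2.0                   # score z^(L) of the context word
z_noise = np.array([-1.5, -0.5])  # scores of the U=2 noise words

# -log(rho_C) - sum_j log(1 - rho_Nj)
loss = -np.log(sigmoid(z_context)) - np.log(1 - sigmoid(z_noise)).sum()
# low when the context word scores high and the noise words score low
```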

The above trick for improving training efficiency in a multiclass classification task is called **negative sampling**. Note that each example now contains one context word plus $U$ noise words, so we need some extra preprocessing steps.

With negative sampling, each training example consists of one correct context word and $U$ noise words. To give an example, consider the corpus:

the quick brown fox jumped over the lazy dog ...

Suppose the context window is 2 words (1 at left and 1 at right). We scan through the corpus by moving the window from left to right and get the context words (C) and target words (T):

(C:[the, brown], T:quick), (C:[quick, fox], T:brown), (C:[brown, jumped], T:fox), ...

Recall that the skip-gram model tries to predict **each** context word from its target word. So, we break a `(C:[context1, context2, ...], T:target)` pair into `(T:target, C:context1)`, `(T:target, C:context2)`, and so on. Our dataset becomes:

(T:quick, C:the), (T:quick, C:brown), (T:brown, C:quick), (T:brown, C:fox), ...
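The windowing and pair-breaking steps above can be sketched in a few lines (a toy illustration; boundary words are skipped so that every target has a full window):

```python
corpus = "the quick brown fox jumped over the lazy dog".split()

# slide a window over the corpus: 1 context word on each side of the target
pairs = []
for t in range(1, len(corpus) - 1):
    target = corpus[t]
    for context in (corpus[t - 1], corpus[t + 1]):
        pairs.append((target, context))  # (T:target, C:context)

print(pairs[:4])
# [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```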

To support negative sampling, we need to sample $U$ noise words (N) for each of the above pairs, following some distribution (typically the unigram distribution). For example, with $U=2$, our final dataset could be:

(T:quick, C:the, N:[over, lazy]), (T:quick, C:brown, N:[fox, over]), (T:brown, C:quick, N:[the, lazy]), (T:brown, C:fox, N:[jumped, dog]), ...

Given an example tuple, we regard the context word (C) as positive and the noise words (N1 and N2) as negative when evaluating the cost function defined above. Specifically, the Sigmoid units in the output layer output $\rho_C$ for the context word and $\rho_{N1}$ and $\rho_{N2}$ for the noise words, and the loss for this example is

$$-\log\rho_{C}-\log(1-\rho_{N1})-\log(1-\rho_{N2}).$$

Gensim is a free Python library for extracting semantic topics from documents. It implements word2vec with both the CBOW and skip-gram models. Let's install it first:

`> conda install gensim`

Next, we train a skip-gram model with negative sampling using the 20 news groups dataset.

In [9]:

```
import gensim
# preprocess the text
corpus = []
for text in newsgroups.data:
    corpus.append(tokenizer_stem_nostop(preprocessor(text)))
# size : embed dimension
# min_count : filter words without min frequency
# sg : 0 for CBOW; 1 for skip-gram
# negative : how many noise words should be drawn
%time model_w2v = gensim.models.Word2Vec(corpus, size=64, min_count=5, sg=1, negative=5, workers=2)
```

NOTE: if you see a warning like

`Pattern library is not installed, lemmatization won't be available.`

don't worry: the lemmatization (a kind of stemming) in Gensim requires the Python `pattern` package, which does not support Python 3. We don't need lemmatization since we preprocess the text ourselves.

After training, we can check whether the model captures the semantics of words:

In [10]:

```
display(model_w2v.most_similar('car'))
display(model_w2v.most_similar('baseball'))
display(model_w2v.most_similar('electronic'))
display(model_w2v.most_similar('american'))
```

Looks good! Some semantically similar words are indeed close to the input in the embedding space.

A more interesting result is that certain directions in the embedding space specialize towards certain semantic relationships, e.g. male-female, verb tense, and even country-capital relationships between words, as illustrated in the figure below:

Based on this, we can perform arithmetic operations over the word2vec vectors. For example,

$$\text{bmw}-\text{europe}+\text{america}=\,?$$

In [107]:

```
display(model_w2v.most_similar(positive=['america', 'bmw'], negative=['europe']))
```

We can see some meaningful results, such as "dodge," a typical muscle-car brand from the U.S. As another example:

$$\text{soccer}-\text{europe}+\text{america}=\,?$$

In [108]:

```
display(model_w2v.most_similar(positive=['america', 'soccer'], negative=['europe']))
```

We see "basketball," which makes sense. The above results show that directions in the embedding space are roughly aligned with some semantic relationships, such as (country, car brand) and (country, sport).

Next, let's replace the one-hot-based representation in our previous text classification model with the word2vec representation to see how the semantics help. We first define a function that transforms the tf-idf features of a document into word2vec-based features by **averaging the word2vec embeddings of words, weighted by their tf-idf scores**:

In [102]:

```
def tfidf_word2vec_transform(model_w2v, embed_dim, inv_dic_tfidf, score_tfidf):
    tfidf_word2vec = np.zeros((score_tfidf.shape[0], embed_dim))
    for i in range(score_tfidf.shape[0]):
        doc_word2vec = np.zeros(embed_dim)
        for j in range(score_tfidf.shape[1]):
            doc_word2vec += score_tfidf[i, j] * model_w2v[inv_dic_tfidf[j]]
        tfidf_word2vec[i, :] = doc_word2vec
    return tfidf_word2vec
X_train_w2v = tfidf_word2vec_transform(model_w2v, 64, inverse_dictionary, X_train)
X_test_w2v = tfidf_word2vec_transform(model_w2v, 64, inverse_dictionary, X_test)
```

In [106]:

```
batch_size = 32
model = gen_nn(input_dim=X_train_w2v.shape[1])
%time his = model.fit(X_train_w2v, y_train, \
nb_epoch=1000, \
batch_size=batch_size, \
validation_split=0.2, \
shuffle=True, verbose=0, \
callbacks=[early_stop])
# evaluate trained model
score = model.evaluate(X_test_w2v, y_test, verbose=0)
print('\nTest loss: %.3f' % score[0])
print('Test accuracy: %.3f' % score[1])
train_loss = his.history['loss']
val_loss = his.history['val_loss']
# visualize training history
plt.plot(range(1, len(train_loss)+1), train_loss, color='blue', label='Train loss')
plt.plot(range(1, len(val_loss)+1), val_loss, color='red', label='Val loss')
plt.xlim(0, len(train_loss))
plt.legend(loc="upper right")
plt.xlabel('#Epoch')
plt.ylabel('Loss')
plt.savefig('./output/fig-nn-w2v-val.png', dpi=300)
plt.show()
```
