DataLab Cup 2: Image Captioning

Shan-Hung Wu & DataLab
Fall 2016
In [1]:
import os
import _pickle as cPickle
import urllib.request

import pandas as pd
import scipy.misc
import numpy as np

from keras.models import Model
from keras.layers import Input, Dense, Embedding, Reshape, GRU, merge
from keras.optimizers import RMSprop
from keras.models import load_model
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from IPython.display import Image, display, SVG

from pre_trained.cnn import PretrainedCNN

%matplotlib inline
output_notebook()
Using Theano backend.
Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5105)
Loading BokehJS ...

Task: Image Captioning

Given a set of images, your task is to generate suitable sentences to describe each of the images.

You'll compete on a modified release of the 2014 Microsoft COCO dataset, which is the standard testbed for image captioning.

  • 102,739 images in the training set, where each image is annotated with 5 captions
  • 20,548 images in the testing set (you must generate 1 caption for each image)

Model: Image-Captioning

Given an image, in order to generate a descriptive sentence for it, our model must meet several requirements:

  1. our model should be able to extract high-level concepts of the image, such as the scene, the background, and the colors or positions of objects => use a CNN to extract image features
  2. our model should be able to generate a sentence => use an RNN to generate the next word based on the current (or previous) words
  3. the length of captions may vary, so our model must know where to stop => use a special <ED> token
  4. if we'd like to use an RNN to generate the next word based on the current word, then our model requires a first word => use a special <ST> token

So, a naive model looks like the following:

In [2]:
def image_caption_model(vocab_size=2187, embedding_matrix=None, lang_dim=100, img_dim=256, clipnorm=1):
    # text: current word
    lang_input = Input(shape=(1,))
    if embedding_matrix is not None:
        x = Embedding(output_dim=lang_dim, input_dim=vocab_size, init='glorot_uniform', input_length=1, weights=[embedding_matrix])(lang_input)
    else:
        x = Embedding(output_dim=lang_dim, input_dim=vocab_size, init='glorot_uniform', input_length=1)(lang_input)
    lang_embed = Reshape((lang_dim,))(x)
    # img
    img_input = Input(shape=(img_dim,))
    # text + img => GRU
    x = merge([img_input, lang_embed], mode='concat', concat_axis=-1)
    x = Reshape((1, lang_dim+img_dim))(x)
    x = GRU(128)(x)
    # predict next word
    out = Dense(vocab_size, activation='softmax')(x)
    model = Model(input=[img_input, lang_input], output=out)
    # choose objective and optimizer
    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=1e-3, clipnorm=clipnorm))
    return model

model = image_caption_model()
with open('model_ckpt/image-caption.svg', 'rb') as f:
    arch = f.read()
display(SVG(arch))
[Model architecture diagram: lang_input (None, 1) → Embedding (None, 1, 100) → Reshape (None, 100); img_input (None, 256) merged with the embedded word → (None, 356) → Reshape (None, 1, 356) → GRU (None, 128) → Dense softmax (None, 2187)]

The inputs and outputs are slightly different during training and testing:

  • training: we have the correct caption, and each training (input, output) pair uses the correct current word and the image as input to predict the next word (see the sketch after this list)
  • testing: we start generating the caption by providing <ST> and the image as input, sample a word as the next word, and feed the sampled word back as input at the next timestep, generating words until the <ED> token is sampled
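For concreteness, here is a minimal sketch (not part of the assignment code; the token IDs and names below are made up for illustration) of how one encoded caption is expanded into training pairs:

# hypothetical encoded caption: <ST> a cat sitting <ED>
encoded = [0, 5, 17, 9, 1]
img_feat = 'the 256-d feature vector of the corresponding image'

# each pair feeds (image, current word) as input and predicts the next word
pairs = [((img_feat, encoded[i - 1]), encoded[i]) for i in range(1, len(encoded))]
# => [((img, <ST>), a), ((img, a), cat), ((img, cat), sitting), ((img, sitting), <ED>)]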

Preprocess: Text

Dealing with raw strings is inefficient, so we'll train on an encoded version of the captions. All necessary vocabulary has been extracted into dataset/text/vocab.pkl, and we'd like to represent each caption as a sequence of integer IDs. However, since the length of captions may vary, our model needs to know where to start and stop, so we'll add two special tokens, <ST> and <ED>, at the beginning and end of each caption. Also, the smaller the vocabulary, the more efficient training will be, so we'll replace rare words with a <RARE> token. In summary, we're going to

  • add <ST> and <ED> tokens at the beginning and end of each caption
  • replace rare words with the <RARE> token
  • represent captions as sequences of vocabulary IDs
In [3]:
vocab = cPickle.load(open('dataset/text/vocab.pkl', 'rb'))
print('total {} vocabularies'.format(len(vocab)))
total 26900 vocabularies
In [4]:
def count_vocab_occurance(vocab, df):
    voc_cnt = {v:0 for v in vocab}
    for img_id, row in df.iterrows():
        for w in row['caption'].split(' '):
            voc_cnt[w] += 1
    return voc_cnt

df_train = pd.read_csv(os.path.join('dataset', 'train.csv'))

print('count vocabulary occurances...')
voc_cnt = count_vocab_occurance(vocab, df_train)

# words appearing < 100 times will be treated as rare
thrhd = 100
x = np.array(list(voc_cnt.values()))
print('{} words appear >= 100 times'.format(np.sum(x[(-x).argsort()] >= thrhd)))
count vocabulary occurances...
2184 words appear >= 100 times
In [5]:
def build_voc_mapping(voc_cnt, thrhd):
    """
    enc_map: voc --encode--> id
    dec_map: id --decode--> voc
    """
    def add(enc_map, dec_map, voc):
        enc_map[voc] = len(dec_map)
        dec_map[len(dec_map)] = voc
        return enc_map, dec_map
    # add <ST>, <ED>, <RARE>
    enc_map, dec_map = {}, {}
    for voc in ['<ST>', '<ED>', '<RARE>']:
        enc_map, dec_map = add(enc_map, dec_map, voc)
    for voc, cnt in voc_cnt.items():
        if cnt < thrhd: # rare words => <RARE>
            enc_map[voc] = enc_map['<RARE>']
        else:
            enc_map, dec_map = add(enc_map, dec_map, voc)
    return enc_map, dec_map

enc_map, dec_map = build_voc_mapping(voc_cnt, thrhd)
# save enc/decoding map to disk
cPickle.dump(enc_map, open('dataset/text/enc_map.pkl', 'wb'))
cPickle.dump(dec_map, open('dataset/text/dec_map.pkl', 'wb'))
vocab_size = len(dec_map)
In [6]:
def caption_to_ids(enc_map, df):
    img_ids, caps = [], []
    for idx, row in df.iterrows():
        icap = [enc_map[x] for x in row['caption'].split(' ')]
        icap.insert(0, enc_map['<ST>'])
        icap.append(enc_map['<ED>'])
        img_ids.append(row['img_id'])
        caps.append(icap)
    return pd.DataFrame({'img_id':img_ids, 'caption':caps}).set_index(['img_id'])


enc_map = cPickle.load(open('dataset/text/enc_map.pkl', 'rb'))
print('[transform captions into sequences of IDs]...')
df_proc = caption_to_ids(enc_map, df_train)
df_proc.to_csv('dataset/text/train_enc_cap.csv')
[transform captions into sequences of IDs]...
In [7]:
def decode(dec_map, ids):
    return ' '.join([dec_map[x] for x in ids])

dec_map = cPickle.load(open('dataset/text/dec_map.pkl', 'rb'))

print('And you can decode back easily to see full sentence...\n')
for idx, row in df_proc.iloc[:8].iterrows():
    print('{}: {}'.format(idx, decode(dec_map, row['caption'])))
And you can decode back easily to see full sentence...

536654.jpg: <ST> a group of three women sitting at a table sharing a cup of tea <ED>
536654.jpg: <ST> three women wearing hats at a table together <ED>
536654.jpg: <ST> three women with hats at a table having a tea party <ED>
536654.jpg: <ST> several woman dressed up with fancy hats at a tea party <ED>
536654.jpg: <ST> three women wearing large hats at a fancy tea event <ED>
15839.jpg: <ST> a twin door refrigerator in a kitchen next to cabinets <ED>
15839.jpg: <ST> a black refrigerator freezer sitting inside of a kitchen <ED>
15839.jpg: <ST> black refrigerator in messy kitchen of residential home <ED>

Preprocess: Image

Since the raw images take about 20 GB and may take days to download, they're not included in the released files. But if you'd like to download the original images, you can fetch them from MS-COCO on the fly:

In [8]:
def download_image(img_dir, img_id):
    urllib.request.urlretrieve('http://mscoco.org/images/{}'.format(img_id.split('.')[0]), os.path.join(img_dir, img_id))

Transfer Learning: pre-trained CNN

Our task, image captioning, requires a good understanding of images, such as

  • objects appearing in the image
  • relative positions of objects
  • colors, sizes, etc.

Training a good CNN from scratch is challenging and time-consuming, so we'll use an existing pre-trained CNN model. The one we've prepared for you is VGG-16 (also known as OxfordNet), the well-known model from the 2014 ILSVRC, in pre_trained/cnn.py.

In [9]:
cnn_mdl = PretrainedCNN(mdl_name='vgg16')

with open('model_ckpt/cnn-model.svg', 'rb') as f:
    arch = f.read()
display(SVG(arch))
[VGG-16 architecture diagram: input → block1_conv1-2 → block1_pool → block2_conv1-2 → block2_pool → block3_conv1-3 → block3_pool → block4_conv1-3 → block4_pool → block5_conv1-3 → block5_pool → flatten → fc1 → fc2 → predictions]

VGG-16 consists of 16 weight layers, and we'll take the output of fc2, the last layer before the prediction layer, as the input to our image-captioning model. However, since we have about 120,000 images, representing each image with 4,096 dimensions would make training inefficient and space-consuming. Therefore, a dimensionality reduction technique, PCA, is used to reduce the image feature dimension from 4,096 to 256. In summary, for each image,

  • the raw image is fed into VGG-16
  • we take the output of the fc2 layer
  • we apply PCA to reduce the dimension to 256

We've done the tedious work for you (using the functions in utils.py), and the reduced 256-dimensional image features are saved in dataset/train_img256.pkl and dataset/test_img256.pkl.

This should be enough for you to train a good image-captioning model. However, you're always welcome to use other CNN models to extract image features.
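For reference, here is a minimal sketch of that feature-extraction pipeline. It assumes cnn_mdl.model exposes the underlying Keras VGG-16 graph (the actual interface of PretrainedCNN may differ), that scikit-learn is available, and that preprocessed_images is a hypothetical array of images already preprocessed for VGG-16:

from keras.models import Model
from sklearn.decomposition import PCA

# assumption: cnn_mdl.model is the underlying Keras VGG-16 model with named layers
vgg = cnn_mdl.model
fc2_extractor = Model(input=vgg.input, output=vgg.get_layer('fc2').output)

feats = fc2_extractor.predict(preprocessed_images)  # shape (N, 4096)
pca = PCA(n_components=256)
feats_256 = pca.fit_transform(feats)                # shape (N, 256), as stored in train_img256.pkl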

In [10]:
img_train = cPickle.load(open('dataset/train_img256.pkl', 'rb'))
img_test = cPickle.load(open('dataset/test_img256.pkl', 'rb'))

Transfer Learning: pre-trained word embedding

Image captioning also requires a good understanding of word meaning, so it's a good idea to use pre-trained word embeddings. We'll take advantage of GloVe, released by the Stanford NLP Group. As an example, we choose the smallest release, pre_trained/glove.6B.100d.txt, which is trained on a 6-billion-token corpus of Wikipedia and Gigaword. Again, you're welcome to use any pre-trained word embedding.
First, we have to prepare the embedding matrix for the embedding layer of our image-captioning model:

In [11]:
def generate_embedding_matrix(w2v_path, dec_map, lang_dim=100):
    out_vocab = []
    embeddings_index = {}
    f = open(w2v_path, 'r')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    # prepare embedding matrix
    embedding_matrix = np.random.rand(len(dec_map), lang_dim)
    for idx, wd in dec_map.items():
        if wd in embeddings_index.keys():
            embedding_matrix[idx] = embeddings_index[wd]
        else:
            out_vocab.append(wd)
    print('words: "{}" not in pre-trained vocabulary list'.format(','.join(out_vocab)))
    return embedding_matrix

dec_map = cPickle.load(open('dataset/text/dec_map.pkl', 'rb'))
embedding_matrix = generate_embedding_matrix('pre_trained/glove.6B.100d.txt', dec_map)
words: "<ST>,<ED>,<RARE>,selfie,skiis" not in pre-trained vocabulary list

Training

Since our model only accepts (image + current word, next word) pairs as training instances, generating all training instances up front would require at least $2{,}184$ (vocabulary size) $\times\ 10$ (caption length) $\times\ 102{,}739 \times 5$ (#image-caption pairs) $\times\ 32$ (bits per float32) $/\ 8$ (bits per byte) $\approx 40$ GB of storage. That takes too much space and cannot possibly fit in GPU memory (a GTX 1070 only has 8 GB).
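As a quick back-of-the-envelope check of that estimate (counting one float32 one-hot next-word vector per training instance):

vocab_size, max_cap_len, n_images, caps_per_img = 2184, 10, 102739, 5
bytes_per_float32 = 32 // 8
total_bytes = vocab_size * max_cap_len * n_images * caps_per_img * bytes_per_float32
print('{:.1f} GB'.format(total_bytes / 1e9))  # ~44.9 GB, i.e. on the order of 40 GB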

Therefore, we expand only a few image-caption pairs into training instances at runtime. So first, let's prepare the batch-generating function:

In [12]:
def generate_batch(img_map, df_cap, vocab_size, size=32):
    # randomly sample `size` captions and expand each into
    # (image, current word) -> next word training instances
    imgs, curs, nxts = None, [], None
    for idx in np.random.randint(df_cap.shape[0], size=size):
        row = df_cap.iloc[idx]
        cap = eval(row['caption'])  # the caption was saved to CSV as a stringified list of IDs
        if row['img_id'] not in img_map.keys():
            continue
        img = img_map[row['img_id']]
        for i in range(1, len(cap)):
            nxt = np.zeros((vocab_size))  # one-hot encoding of the next word
            nxt[cap[i]] = 1
            curs.append(cap[i-1])
            nxts = nxt if nxts is None else np.vstack([nxts, nxt])
            imgs = img if imgs is None else np.vstack([imgs, img])
    return imgs, np.array(curs).reshape((-1,1)), nxts

Sanity Check: overfitting small data

It's good practice to test your model by overfitting a small dataset: if your model cannot even converge on a small dataset, something is wrong. Let's generate some training/validation examples:

In [13]:
df_cap = pd.read_csv('dataset/text/train_enc_cap.csv')
img1, cur1, nxt1 = generate_batch(img_train, df_cap, vocab_size, size=200)
img2, cur2, nxt2 = generate_batch(img_train, df_cap, vocab_size, size=50)

Create our model and load the pre-trained word embedding matrix.

In [14]:
model = image_caption_model(vocab_size=vocab_size, embedding_matrix=embedding_matrix)

Start training, and dump the trained model and training history to disk when finished.

In [15]:
hist = model.fit([img1, cur1], nxt1, batch_size=32, nb_epoch=200, verbose=0, 
          validation_data=([img2, cur2], nxt2), shuffle=True)

# dump training history, model to disk
hist_path, mdl_path = 'model_ckpt/demo.pkl', 'model_ckpt/demo.h5'
cPickle.dump({'loss':hist.history['loss'], 'val_loss':hist.history['val_loss']}, open(hist_path, 'wb'))
model.save(mdl_path)

Quick Visualization

Within a few minutes, you should be able to generate some grammatically correct captions, though they may not relate to the images well. Let's sample some training images and see what our model says.

In [16]:
def generate_caption(model, enc_map, dec_map, img, max_len=10):
    gen = []
    st, ed = enc_map['<ST>'], enc_map['<ED>']
    cur = st
    while len(gen) < max_len:
        X = [np.array([img]), np.array([cur])]
        cur = np.argmax(model.predict(X)[0])
        if cur != ed:
            gen.append(dec_map[cur])
        else:
            break
    return ' '.join(gen)

def eval_human(model, img_map, df_cap, enc_map, dec_map, img_dir, size=1):
    for idx in np.random.randint(df_cap.shape[0], size=size):
        row = df_cap.iloc[idx]
        cap = eval(row['caption'])
        img_id = row['img_id']
        img = img_map[img_id]
        img_path = os.path.join(img_dir, img_id)
        # download image on-the-fly
        if not os.path.exists(img_path):
            download_image(img_dir, img_id)
        # show image
        display(Image(filename=img_path))
        # generated caption
        gen = generate_caption(model, enc_map, dec_map, img)
        print('[generated] {}'.format(gen))
        # groundtruth caption
        print('[groundtruth] {}'.format(' '.join([dec_map[cap[i]] for i in range(1,len(cap)-1)])))
def eval_plot(mdl_path, hist_path, img_path, img_map, df_cap, enc_map, dec_map, size):
    # plot history
    hist = cPickle.load(open(hist_path, 'rb'))
    fig = figure()
    fig.line(range(1,len(hist['loss'])+1), hist['loss'], color='red', legend='training loss')
    fig.line(range(1,len(hist['val_loss'])+1), hist['val_loss'], color='blue', legend='valid loss')
    fig.xaxis.axis_label, fig.yaxis.axis_label = '#batch', 'categorical-loss'
    show(fig)
    # eval captioning
    model = load_model(mdl_path)
    eval_human(model, img_map, df_cap, enc_map, dec_map, img_path, size=size)
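Note that generate_caption above decodes greedily with argmax, whereas the testing procedure described earlier samples the next word from the predicted distribution. A minimal sketch of such a sampling-based decoder, reusing the model, enc_map, dec_map, and np objects from this notebook (the temperature parameter is an illustrative addition, not part of the assignment code):

def sample_caption(model, enc_map, dec_map, img, max_len=10, temperature=1.0):
    # sample the next word instead of taking argmax; lower temperature => closer to greedy
    gen, cur, ed = [], enc_map['<ST>'], enc_map['<ED>']
    while len(gen) < max_len:
        probs = model.predict([np.array([img]), np.array([cur])])[0]
        probs = np.exp(np.log(probs + 1e-8) / temperature)
        probs = probs / np.sum(probs)                 # renormalize into a distribution
        cur = np.random.choice(len(probs), p=probs)
        if cur == ed:
            break
        gen.append(dec_map[cur])
    return ' '.join(gen)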
In [17]:
enc_map = cPickle.load(open('dataset/text/enc_map.pkl', 'rb'))
dec_map = cPickle.load(open('dataset/text/dec_map.pkl', 'rb'))

eval_plot(mdl_path, hist_path, 'dataset/image', img_train, df_cap, enc_map, dec_map, 5)
[generated] a group of looking of looking of looking of looking
[groundtruth] the group of people are holding their cell phones together
[generated] mirror hand roof
[groundtruth] many <RARE> on land placed closely together with a chair on top of each
[generated] a player player player player player player player player player
[groundtruth] a man holding his hand in the air