ANN - Natural Language Processing

Sentiment Analysis

Given the text of a movie review, we are going to classify it as a positive or a negative review.

We’ll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

IMDB Dataset

Download the Dataset

import tensorflow as tf

imdb = tf.keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
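
The num_words=10000 argument keeps only the 10,000 most frequent words in the dataset; every rarer word is replaced by a single out-of-vocabulary index. This caps the vocabulary, and with it the size of the embedding layer we build later.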

Explore the Dataset

train_data.shape, train_labels.shape, test_data.shape, test_labels.shape
((25000,), (25000,), (25000,), (25000,))
len(train_data[0])
218

Each of these numbers corresponds to a word.

word_2_int = tf.keras.datasets.imdb.get_word_index(path='imdb_word_index.json')

word_2_int['hello'], word_2_int['world']
(4822, 179)
len(word_2_int)
88584
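We can invert this index to decode an encoded review back into (approximate) text. One subtlety: load_data reserves the indices 0 (padding), 1 (start of sequence) and 2 (unknown), so the indices stored in train_data are offset by 3 relative to get_word_index. A minimal decoding sketch (decode_review is our own helper, not a library function):

int_2_word = {i: w for w, i in word_2_int.items()}

def decode_review(encoded_review):
    # undo the offset of 3 introduced by load_data; '?' marks reserved/unknown indices
    return ' '.join(int_2_word.get(i - 3, '?') for i in encoded_review)

decode_review(train_data[0])

Going the other way, we can encode a plain-text sentence as a list of integers: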
def sentence_2_int(sentence):
    # look up the integer index of each word in the sentence
    sentence_2_int_list = []
    for i in sentence.lower().split(' '):
        sentence_2_int_list.append(word_2_int[i])
    return sentence_2_int_list

sentence = "Worst movie i've ever seen"
sentence_2_int(sentence)
[246, 17, 204, 123, 107]
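
Note that this simple version raises a KeyError for any word that is not in the index, and it ignores the offset of 3 mentioned above, so its output does not exactly match the encoding the model is trained on. A slightly more robust sketch (the name sentence_2_int_robust and the cutoff of 10000 are our own choices, matching the vocabulary size used above):

def sentence_2_int_robust(sentence):
    # 0 (pad), 1 (start) and 2 (unknown) are reserved, so real words are shifted by 3
    unknown_index = 2
    encoded = []
    for word in sentence.lower().split(' '):
        index = word_2_int.get(word)
        # unseen words and words outside the top-10000 vocabulary map to the unknown token
        if index is None or index + 3 >= 10000:
            encoded.append(unknown_index)
        else:
            encoded.append(index + 3)
    return encoded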

Preparing the data

Padding

Every sentence will be of a different length, but to pass the sentences through the ANN model we need fixed-length data. So we

  • pad short sentences
  • truncate very long sentences to a fixed length (a toy example follows)
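
Here is a minimal demonstration of pad_sequences (with the default truncating='pre', long sequences are cut from the front):

tf.keras.preprocessing.sequence.pad_sequences([[1, 2], [1, 2, 3, 4, 5]],
                                              value=0, padding='post', maxlen=4)
# [[1, 2, 0, 0],   <- short sequence padded at the end
#  [2, 3, 4, 5]]   <- long sequence truncated from the front
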
pad_value = 0
sentence_len = 100

train_data = tf.keras.preprocessing.sequence.pad_sequences(train_data,
                                                           value=pad_value,
                                                           padding='post',
                                                           maxlen=sentence_len)

test_data = tf.keras.preprocessing.sequence.pad_sequences(test_data,
                                                          value=pad_value,
                                                          padding='post',
                                                          maxlen=sentence_len)

train_data.shape, train_labels.shape, test_data.shape, test_labels.shape
((25000, 100), (25000,), (25000, 100), (25000,))

ANN Model

We will introduce a new layer into our model, called the Embedding layer.

The words of each sentence are now represented by numbers. We cannot feed those numbers into the model directly; we need to convert each number into a vector so the model can tell which word it is.

But how do we decide the vectors? We leave that to the model.

So the Embedding layer takes an integer and converts it into a trainable vector.
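
For instance, an Embedding layer with a 10,000-word vocabulary and 16-dimensional vectors turns each integer in a batch into its own trainable 16-vector (a standalone sketch, separate from the model below):

demo_embedding = tf.keras.layers.Embedding(10000, 16)
demo_embedding(tf.constant([[4822, 179]])).shape  # (1, 2, 16): one 16-dim vector per word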

vocab_size = 10000
embedding_dim = 16

import tensorflow as tf
from tensorflow import keras

tf.keras.backend.clear_session()

model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=sentence_len),
                             # the model will take as input an integer matrix of size (batch, input_length).
                             # now model.output_shape == (None, 100, 16), where None is the batch dimension.
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Flatten(),
                             tf.keras.layers.Dense(units=sentence_len*embedding_dim), 
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Dense(units=500),
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Dense(units=1), 
                             tf.keras.layers.Activation('sigmoid')
                             ])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
dropout (Dropout)            (None, 100, 16)           0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 1600)              2561600   
_________________________________________________________________
activation (Activation)      (None, 1600)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 500)               800500    
_________________________________________________________________
activation_1 (Activation)    (None, 500)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 501       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 3,522,601
Trainable params: 3,522,601
Non-trainable params: 0
_________________________________________________________________
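The parameter counts can be checked by hand: the embedding stores 10,000 × 16 = 160,000 weights, the first dense layer has 1,600 × 1,600 weights plus 1,600 biases = 2,561,600 parameters, the second has 1,600 × 500 + 500 = 800,500, and the output layer has 500 + 1 = 501, for a total of 3,522,601.
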
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

tf_history = model.fit(train_data, train_labels, batch_size=2000, epochs=10, verbose=True, validation_data=(test_data, test_labels))
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 [==============================] - 12s 489us/sample - loss: 0.6917 - acc: 0.5162 - val_loss: 0.6860 - val_acc: 0.5558
Epoch 2/10
25000/25000 [==============================] - 12s 485us/sample - loss: 0.6434 - acc: 0.6450 - val_loss: 0.5741 - val_acc: 0.7036
Epoch 3/10
25000/25000 [==============================] - 12s 482us/sample - loss: 0.4729 - acc: 0.7753 - val_loss: 0.4278 - val_acc: 0.8019
Epoch 4/10
25000/25000 [==============================] - 12s 487us/sample - loss: 0.3406 - acc: 0.8524 - val_loss: 0.3833 - val_acc: 0.8292
Epoch 5/10
25000/25000 [==============================] - 12s 483us/sample - loss: 0.2646 - acc: 0.8873 - val_loss: 0.3691 - val_acc: 0.8388
Epoch 6/10
25000/25000 [==============================] - 12s 482us/sample - loss: 0.2038 - acc: 0.9180 - val_loss: 0.3913 - val_acc: 0.8380
Epoch 7/10
25000/25000 [==============================] - 12s 484us/sample - loss: 0.1641 - acc: 0.9350 - val_loss: 0.4083 - val_acc: 0.8380
Epoch 8/10
25000/25000 [==============================] - 12s 483us/sample - loss: 0.1318 - acc: 0.9490 - val_loss: 0.4307 - val_acc: 0.8374
Epoch 9/10
25000/25000 [==============================] - 12s 484us/sample - loss: 0.1029 - acc: 0.9595 - val_loss: 0.4761 - val_acc: 0.8348
Epoch 10/10
25000/25000 [==============================] - 12s 485us/sample - loss: 0.0871 - acc: 0.9660 - val_loss: 0.4966 - val_acc: 0.8368
model.save('trained_model.h5')

Model Pipeline

Load the Trained Model

trained_model = tf.keras.models.load_model('trained_model.h5')

trained_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
dropout (Dropout)            (None, 100, 16)           0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 1600)              2561600   
_________________________________________________________________
activation (Activation)      (None, 1600)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 500)               800500    
_________________________________________________________________
activation_1 (Activation)    (None, 500)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 501       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 3,522,601
Trainable params: 3,522,601
Non-trainable params: 0
_________________________________________________________________

Sentence to Vectors

import numpy as np

def sentence_2_int(sentence):
    # encode each word as its integer index
    sentence_2_int_list = []
    for i in sentence.lower().split(' '):
        sentence_2_int_list.append(word_2_int[i])
    # reshape into a batch of one sentence, then pad/truncate to the training length
    arr = np.array(sentence_2_int_list).reshape(1, -1)
    arr = tf.keras.preprocessing.sequence.pad_sequences(arr, value=0, padding='post', maxlen=100)
    return arr

sentence = "Worst movie i've ever seen"
sentence_2_int(sentence)
array([[246,  17, 204, 123, 107,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=int32)
trained_model.predict(sentence_2_int(sentence))
array([[0.22403198]], dtype=float32)
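The sigmoid output is the predicted probability that the review is positive; 0.224 is below the 0.5 threshold, so this sentence is classified as negative.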
def sentence_2_prediction(sentence):
    vector = sentence_2_int(sentence)
    # probability of the positive class from the sigmoid output
    prob = trained_model.predict(vector)
    if prob[0][0] > 0.5:
        print('Positive Review :D')
    else:
        print('Negative Review :(')
sentence = 'Good Movie i really enjoyed it'

sentence_2_prediction(sentence)
Positive Review :D
sentence = 'worst movie'

sentence_2_prediction(sentence)
Negative Review :(

If you try many different sentences, you may notice the model actually doesn't perform very well. There are several possible reasons:

  • vocabulary size
  • sequence length (the number of words each sentence is padded or truncated to)
  • model architecture
  • embedding dimension

Try retraining the model with different values for each of these to improve its performance; a parameterized rebuild is sketched below.
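
As a starting point, here is a sketch of the same architecture with the hyperparameters pulled out as arguments (build_model and the hidden_units name are illustrative choices, not tuned values):

def build_model(vocab_size=10000, sentence_len=100, embedding_dim=16, hidden_units=500):
    # expose the knobs listed above as arguments so different settings are easy to try
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=sentence_len),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units=hidden_units, activation='relu'),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model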
