
How to predict movie genres

  • Writer: Vesna Lukic
  • Jan 10, 2020
  • 5 min read

Updated: Feb 14, 2020



Did you know it is possible to predict the genre a movie belongs to by knowing its synopsis? Read ahead to see how it is done!


1. Reading in the data


First, we read the data into Python as a pandas DataFrame.


import pandas as pd

# Load the training data into a DataFrame
train_set = pd.read_csv('/Users/vesna/Desktop/NLP_challenge/train.csv')



There are 36518 rows and 23 columns. The input data is made up of sentences that describe movies, and the labels consist of a set of relevant movie genres that capture the essence of the synopsis. Therefore, the problem is relatively high-dimensional. We will try a neural network approach to solve it.
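As a quick sanity check, we can inspect the dimensions and a few rows of the DataFrame (a minimal sketch; the 'synopsis' and 'genres' column names are the ones used in the code below):


# Quick look at the data: its dimensions and the two columns we will work with
print(train_set.shape)                          # (36518, 23)
print(train_set[['synopsis', 'genres']].head())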


2. Cleaning and exploring the data


Given that some entries have multiple genres, we want to separate them out and convert them into one-hot columns, such that a genre's column contains a 1 if that genre is present and a 0 if not. We also want to get rid of empty entries in the synopsis.


# Collect the unique genre names from the space-separated genre strings
list_genres = []
for genre_string in train_set.genres:
    list_genres.extend(genre_string.split())

unique_genres = list(set(list_genres))

# Add a one-hot column per genre: 1 if the movie is tagged with it, 0 otherwise
for genre in unique_genres:
    train_set[genre] = train_set.genres.str.contains(genre, regex=False).astype(int)

# Remove rows with empty or missing synopses
train_set = train_set[train_set["synopsis"] != ""]
train_set = train_set.dropna()


It is useful to explore how many movies contain each specific genre. This can be done using a bar plot, as follows:


import numpy as np
import matplotlib.pyplot as plt

# Count how many movies carry each genre and plot the counts
label_cols = train_set[unique_genres]
plot_label_cols = pd.DataFrame(np.sum(label_cols))
plot_label_cols['genres'] = plot_label_cols.index
plot_label_cols.columns = ['Count', 'Genres']

plt.bar(plot_label_cols['Genres'], plot_label_cols['Count'])


[Bar plot: number of movies per genre]

There are 19 individual genres. Drama, Comedy and Thriller are the most frequent, while Western, IMAX and Film-noir are the least frequent. Given the wide spread of counts across genres, we are dealing with an imbalanced class problem.
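As a quick way to quantify this imbalance, we can look at the fraction of movies tagged with each genre, using the label_cols DataFrame built above (a small sketch, not part of the original pipeline):


# Fraction of movies tagged with each genre, from most to least common
genre_frequencies = label_cols.mean().sort_values(ascending=False)
print(genre_frequencies)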


Another pre-processing step involves removing punctuation, numbers, single characters and multiple spaces as they are not directly relevant to the contents of the synopsis.


import re
import string

def preprocess_text(sen):
    # Remove punctuation
    sentence = sen.translate(str.maketrans('', '', string.punctuation))
    # Remove digits
    sentence = re.sub(r'\d+', '', sentence)
    # Remove single characters surrounded by spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces into a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence


X = []
sentences = list(train_set["synopsis"])

for sen in sentences:
    X.append(preprocess_text(sen))

cleaned_synopsis = X


The next step is word tokenization, where each synopsis string is split into individual word tokens. We then count the unique words to see how large the vocabulary is. Since all the input sequences must have equal length, we find the length of the longest synopsis and pad the shorter entries up to it.


from nltk.tokenize import word_tokenize

# Build the vocabulary from all tokens in the cleaned synopses
all_words = []
for sent in cleaned_synopsis:
    tokenize_word = word_tokenize(sent)
    for word in tokenize_word:
        all_words.append(word)

unique_words = set(all_words)
vocab_length = len(unique_words) + 1

## Size of vocabulary is 88,470 words

# Length (in tokens) of the longest synopsis, used for padding
word_count = lambda sentence: len(word_tokenize(sentence))
longest_sentence = max(cleaned_synopsis, key=word_count)
length_long_sentence = len(word_tokenize(longest_sentence))



3. Exploring word embeddings


In order to put the data into a form that a machine learning algorithm can read, it is necessary to convert the words into numbers.


There are a number of ways to do this. The first is to generate an integer for each word from scratch, for example by using the ‘one_hot’ function in Keras.
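As a small illustration (a sketch only; the sample sentence and the toy vocabulary size are made up, and the printed numbers are just an example), Keras' one_hot hashes each word to an integer between 1 and the chosen vocabulary size:


from keras.preprocessing.text import one_hot

vocab_size = 50  # assumed toy vocabulary size, just for this example
print(one_hot('a young wizard attends a school of magic', vocab_size))
# e.g. [12, 7, 33, 41, 12, 9, 25, 18] -- repeated words map to the same integer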


A different way is to use pre-trained word embeddings, where a set of vectors has been generated by training on another body of text. Words that are similar should have vectors that are more closely related; for example, ‘flower’ and ‘leaf’ should have more similar vectors than ‘flower’ and ‘computer’.
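This intuition can be checked directly with a short sketch, assuming the GloVe embeddings_dictionary built in section 4.2 below has already been loaded:


import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

flower, leaf, computer = (embeddings_dictionary[w] for w in ('flower', 'leaf', 'computer'))
print(cosine_similarity(flower, leaf))      # expected to be the higher of the two
print(cosine_similarity(flower, computer))  # expected to be lower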


We explore both options.


4. Model construction


We divide our data so that 80% is available for training and 20% is held out as a test set.


from sklearn.model_selection import train_test_split

# padded_sentences is created in section 4.1 below
X = padded_sentences
y = label_cols

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=17)


We try a deep neural network made up of an embedding layer, a flatten layer and two dense layers. The output should be a vector over the 19 individual genres, so the final layer has 19 nodes. The parameters can be tuned after observing performance on the validation set.


from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

n_dim_vector = 300  # embedding dimension; assumed to match the 300-dimensional GloVe vectors used later

model = Sequential()
embedding_layer = Embedding(vocab_length, n_dim_vector, input_length=length_long_sentence, trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(128, activation='sigmoid'))
model.add(Dense(19, activation='sigmoid'))


4.1 No pre-trained word embeddings


First we explore not using any pre-trained word embeddings. We use the binary cross-entropy cost function with the Adam optimizer and train for 5 epochs; the trained model is then applied to the test set.


from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

# Integer-encode each synopsis and pad every sequence to the length of the longest one
embedded_sentences = [one_hot(sent, vocab_length) for sent in cleaned_synopsis]
padded_sentences = pad_sequences(embedded_sentences, maxlen=length_long_sentence, padding='post')


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])


history = model.fit(X_train, y_train, batch_size=128, epochs=5, verbose=1, validation_split=0.2)


The output is as follows:


Train on 23371 samples, validate on 5843 samples

Epoch 1/5

23371/23371 [==============================] - 38s 2ms/step - loss: 0.2865 - acc: 0.8951 - val_loss: 0.2710 - val_acc: 0.9006

Epoch 2/5

23371/23371 [==============================] - 38s 2ms/step - loss: 0.2442 - acc: 0.9085 - val_loss: 0.2710 - val_acc: 0.9004

Epoch 3/5

23371/23371 [==============================] - 38s 2ms/step - loss: 0.2015 - acc: 0.9206 - val_loss: 0.2782 - val_acc: 0.8994

Epoch 4/5

23371/23371 [==============================] - 38s 2ms/step - loss: 0.1582 - acc: 0.9389 - val_loss: 0.2936 - val_acc: 0.8956

Epoch 5/5

23371/23371 [==============================] - 41s 2ms/step - loss: 0.1211 - acc: 0.9567 - val_loss: 0.3136 - val_acc: 0.8929


When applying the trained model to the test set, we observe a classification accuracy of 90.2%. It seems that only one epoch of training is sufficient to achieve the best classification accuracy, so rerunning the model with one epoch gives:


Train on 23371 samples, validate on 5843 samples

Epoch 1/1

23371/23371 [==============================] - 16s 675us/step - loss: 0.2867 - acc: 0.8973 - val_loss: 0.2749 - val_acc: 0.8997


The new classification accuracy observed is 90.1%. So, as suggested by the declining validation accuracy in the first run, training for more than one epoch does not improve the result.
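For reference, a minimal sketch of how the quoted test-set accuracy can be obtained, assuming the X_test and y_test arrays from the train_test_split above:


# Evaluate the trained model on the held-out test set
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: %.4f, test accuracy: %.4f' % (loss, acc))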


4.2 Using pre-trained word embeddings


We can use pre-trained word embeddings from GloVe (Global Vectors for Word Representation). Several are available; we use the one trained on 'Common Crawl', which has 42B tokens, a 1.9M-word vocabulary, and 300-dimensional embedding vectors. From this pre-trained file we build an embedding matrix, which is passed to the 'weights' argument of the 'Embedding' layer in our deep neural network. We again train for one epoch, in order to be comparable to the previous run, where no pre-trained word embeddings were used.


from numpy import asarray, zeros

glove_file = open('/Users/vesna/Desktop/NLP_challenge/glove.42B.300d.txt', encoding="utf8")

n_dim_vector = 300

# Build a dictionary mapping each GloVe word to its 300-dimensional vector
embeddings_dictionary = dict()

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()


from keras.preprocessing.text import Tokenizer

# Index each word in the synopses and fill the embedding matrix row by row
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(cleaned_synopsis)
embedding_matrix = zeros((vocab_length, n_dim_vector))

for word, index in word_tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector


model = Sequential()

embedding_layer = Embedding(vocab_length, n_dim_vector, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)

model.add(embedding_layer)

model.add(Flatten())

model.add(Dense(128, activation='sigmoid'))

model.add(Dense(19, activation='sigmoid'))


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])


history = model.fit(X_train, y_train, batch_size=128, epochs=1, verbose=1, validation_split=0.2)


The output is as follows:


Epoch 1/1

23371/23371 [==============================] - 38s 2ms/step - loss: 0.2882 - acc: 0.8946 - val_loss: 0.2720 - val_acc: 0.9004


We again observe a classification accuracy of 90.1% on the validation set. This is the same result as obtained without pre-trained embedding vectors, so in this case the pre-trained embeddings do not appear to provide any additional advantage. This could be because the body of text they were trained on differs significantly in context from the movie synopses.
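Finally, as a rough sketch of how the trained model could be used to predict genres for a new synopsis (the example sentence and the 0.5 probability threshold are assumptions, and the encoding follows the one_hot/pad_sequences scheme from section 4.1):


# Encode a new synopsis the same way as the training data and predict its genres
new_synopsis = preprocess_text("A detective hunts a serial killer through the streets of a rainy city.")
encoded = one_hot(new_synopsis, vocab_length)
padded = pad_sequences([encoded], maxlen=length_long_sentence, padding='post')

probs = model.predict(padded)[0]
predicted_genres = [genre for genre, p in zip(unique_genres, probs) if p > 0.5]
print(predicted_genres)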






