# Word2Vec & Doc2Vec (gensim)
Analysis of Social Media Contents

Alessandro Ortis - University of Catania

Partially based on:  [Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus by TextMiner](http://textminingonline.com/dive-into-nltk-part-x-play-with-word2vec-models-based-on-nltk-corpus)

## Exploring the `gutenberg` corpus

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works. Most of the items in its collection are full texts of public domain books.

In [1]:
import nltk
#nltk.download('gutenberg')

In [2]:
from nltk.corpus import gutenberg
#gutenberg.readme().replace('\n', ' ')

In [3]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
bible_kjv_sents = gutenberg.sents('bible-kjv.txt')
len(bible_kjv_sents)

30103

## Implementing Word2Vec

In [5]:
from string import punctuation

discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation and word.isalpha()] 
                                            for sent in bible_kjv_sents]
discard_punctuation_and_lowercased_sents[3]

['in',
 'the',
 'beginning',
 'god',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth']

In [8]:
discard_punctuation_and_lowercased_sents[2]

['the', 'first', 'book', 'of', 'moses', 'called', 'genesis']

In [9]:
from gensim.models import word2vec

# `vector_size` is the embedding dimension (this argument was named `size` in gensim < 4.0).
bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, vector_size=200)
# Alternative: hierarchical softmax instead of negative sampling.
#bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, vector_size=200, hs=1, negative=0)
bible_kjv_word2vec_model.save('bible_word2vec_gensim')
# model = word2vec.Word2Vec.load('bible_word2vec_gensim')  # To reload the saved model
word_vectors = bible_kjv_word2vec_model.wv
#del bible_kjv_word2vec_model  # Once training is finished, the model can be deleted to free memory if only the word vectors are needed.
word_vectors.save_word2vec_format('bible_word2vec_org', fvocab='bible_word2vec_vocabulary')
len(word_vectors.key_to_index)  # Vocabulary size (`len(word_vectors.vocab)` in gensim < 4.0)

5279

In [21]:
word_vectors.most_similar(['egypt'])
# "Most similar" means highest cosine similarity between word vectors.
# Word2vec learns these vectors from word co-occurrence patterns in a large corpus.
# The main idea, called the distributional hypothesis, is that
# similar words appear in similar contexts of surrounding words.
# A famous demonstration is the analogy 'man is to woman as king is to X':
# simple vector differences are often enough to recover the answer 'queen'.

[('canaan', 0.7867099046707153),
 ('captivity', 0.7653553485870361),
 ('moab', 0.7472388744354248),
 ('jerusalem', 0.7398157119750977),
 ('judah', 0.7306808233261108),
 ('land', 0.7249926328659058),
 ('edom', 0.7220834493637085),
 ('canaanites', 0.712759256362915),
 ('assyria', 0.7048673629760742),
 ('inhabitants', 0.6912243366241455)]
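
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, the snippet below ranks neighbours by cosine similarity. The 3-dimensional vectors and the helpers `cosine_similarity` and `rank_neighbours` are invented for illustration; they are not taken from the trained model or from gensim.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors, invented for illustration only.
toy_vectors = {
    "egypt": [1.0, 0.0, 1.0],
    "canaan": [1.0, 0.0, 0.5],
    "food": [0.0, 1.0, 0.0],
}


def rank_neighbours(word, vectors):
    """Sort every other word by decreasing cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_neighbours("egypt", toy_vectors)
print(ranking)
```

In this toy setup `canaan` points in nearly the same direction as `egypt` and ranks first, while `food` is orthogonal to it and scores zero.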

In [22]:
word_vectors.most_similar(['heaven'], topn=3)

[('earth', 0.7694835662841797),
 ('heavens', 0.7296751737594604),
 ('darkness', 0.6714426875114441)]

In [23]:
word_vectors.most_similar(["moses"], topn=5)

[('joshua', 0.8443305492401123),
 ('jeremiah', 0.8026585578918457),
 ('david', 0.7863354682922363),
 ('messengers', 0.763727068901062),
 ('balaam', 0.7555711269378662)]

In [24]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('angel', 0.612388014793396),
 ('queen', 0.6083270311355591),
 ('daughter', 0.5992907881736755),
 ('chamberlains', 0.5949094295501709),
 ('prophet', 0.593009352684021)]

In [25]:
# The `_cosmul` variant combines the positive and negative examples multiplicatively instead of additively
# (the 3CosMul objective of Levy and Goldberg, 2014), which the authors report works better on analogy tasks:
word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=3)

[('queen', 0.9693329930305481),
 ('angel', 0.9502780437469482),
 ('messengers', 0.9317901730537415)]
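
The difference between the two analogy queries above can be sketched on toy data. The additive form scores each candidate by its cosine similarity to `woman + king - man`; the multiplicative form multiplies the (shifted) similarities to the positive words and divides by those to the negative words. The vectors, the helper names, and the small `eps` constant below are assumptions made for illustration; this is not gensim's code.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


# Toy unit vectors, invented for illustration only.
toy = {
    "man": unit([1.0, 0.0, 0.2]),
    "woman": unit([1.0, 1.0, 0.2]),
    "king": unit([1.0, 0.0, 1.0]),
    "queen": unit([1.0, 1.0, 1.0]),
    "angel": unit([0.2, 0.3, 1.0]),
}


def analogy_add(positive, negative, vectors):
    """Additive analogy: cosine similarity to the combined query vector."""
    query = unit(sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative))
    skip = set(positive) | set(negative)
    scores = {w: float(query @ v) for w, v in vectors.items() if w not in skip}
    return max(scores, key=scores.get)


def analogy_mul(positive, negative, vectors, eps=1e-6):
    """Multiplicative analogy: ratio of products of shifted similarities."""
    skip = set(positive) | set(negative)

    def shifted(a, b):
        # Map cosine from [-1, 1] to [0, 1] so products stay non-negative.
        return (1.0 + float(vectors[a] @ vectors[b])) / 2.0

    scores = {}
    for w in vectors:
        if w in skip:
            continue
        numerator = np.prod([shifted(w, p) for p in positive])
        denominator = np.prod([shifted(w, n) for n in negative]) + eps
        scores[w] = float(numerator / denominator)
    return max(scores, key=scores.get)


best_add = analogy_add(["woman", "king"], ["man"], toy)
best_mul = analogy_mul(["woman", "king"], ["man"], toy)
print(best_add, best_mul)
```

On these hand-built vectors both variants pick `queen`; on real embeddings, as the outputs above show, the two rankings can differ.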

In [26]:
word_vectors.similarity('lord', 'god')

0.7580022

In [28]:
word_vectors.doesnt_match("lord god salvation food spirit".split(" "))

'food'
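
The idea behind `doesnt_match` can be sketched as: average the unit-length vectors of the given words, then report the word least similar to that average. The snippet below is a rough illustration under that assumption; the toy vectors and the helper `odd_one_out` are invented and are not gensim's implementation.

```python
import numpy as np


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the group (hypothetical helper)."""
    units = np.array([
        np.asarray(vectors[w], dtype=float) / np.linalg.norm(vectors[w])
        for w in words
    ])
    centre = units.mean(axis=0)
    centre /= np.linalg.norm(centre)
    similarities = units @ centre  # cosine similarity of each word to the centre
    return words[int(np.argmin(similarities))]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "lord": [1.0, 0.9, 0.0],
    "god": [0.9, 1.0, 0.0],
    "spirit": [1.0, 1.0, 0.1],
    "food": [0.0, 0.1, 1.0],
}

odd_one = odd_one_out(["lord", "god", "spirit", "food"], toy_vectors)
print(odd_one)
```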

## Doc2Vec

### Exercise
Try to explore the Doc2Vec model implemented in gensim.

In [26]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize