For instance,
    from collections import defaultdict
    from math import sqrt
    import numpy
    import re

    # tokenize: lowercase and keep only runs of letters
    words = []
    with open("pg.txt") as f:
        for line in f:
            line = re.sub(r'[^a-z]+', " ", line.lower())
            words.extend(line.split())

    # word frequencies
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1

    # the N most common words serve as the vector dimensions
    N = 100
    common = sorted(counts.keys(), key=lambda w: -counts[w])
    topn = common[:N]
    topn_set = set(topn)

    # cooc[w][t]: ++ each time w appears immediately before common word t,
    # -- each time w appears immediately after it
    cooc = defaultdict(lambda: defaultdict(int))
    for wi in range(1, len(words) - 1):
        if words[wi] in topn_set:
            cooc[words[wi - 1]][words[wi]] += 1
            cooc[words[wi + 1]][words[wi]] -= 1

    vecs = {}
    for cw in common:
        vecs[cw] = [cooc[cw][tw] for tw in topn]

    def normalize(v):
        norm = sqrt(sum(x * x for x in v))
        if norm < 4:  # too little co-occurrence data to be interesting
            return None
        return [x / norm for x in v]

    for cw in common:
        nv = normalize(vecs[cw])
        if nv:
            vecs[cw] = nv
        else:
            del vecs[cw]

    # rank every word by dot product with the chosen target word
    target = "of"
    vtarget = vecs[target]
    ranked = sorted(vecs.keys(), key=lambda w: -numpy.vdot(vtarget, vecs[w]))
    for w in ranked:
        print(w, numpy.vdot(vtarget, vecs[w]))
thrown at some Project Gutenberg text yields a whole bunch of prepositions all clumped together at one end of the ranking:
    of 1.0
    into 0.978894370397
    from 0.978135291622
    for 0.978033825851
    upon 0.977359735558
    in 0.975798319051
    by 0.972774713053
    on 0.972298555885
    among 0.969156476196
    during 0.96721552238
    towards 0.966558119514
    with 0.962186062745
    between 0.954884847906
    to 0.946391749022
    through 0.942979756684
    respecting 0.940621244463
    at 0.939330654695
and a whole bunch of adjectives all clumped together at the other extreme:
    entire -0.959733960081
    first -0.962401201704
    fourth -0.96430285823
    next -0.964411719589
    following -0.964601889971
    principal -0.966878822836
    second -0.968995783247
    united -0.972999124908
    greatest -0.973458299043
    utmost -0.973928331974
    whole -0.978632110159
What I'm doing is taking the 100 most common words and assigning a length-100 vector to every word w in the vocabulary of the text, where the ith element gets a ++ every time w appears immediately before the ith most common word, and a -- every time w appears immediately after it.
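(To make that concrete, here's a toy version of the counting step on an invented seven-word "corpus", with just "the" and "of" standing in for the top-N common words; my own illustration, not part of the script above.)

    from collections import defaultdict

    words = "the house of the king of spain".split()
    topn = ["the", "of"]  # stand-ins for the N most common words

    cooc = defaultdict(lambda: defaultdict(int))
    for i in range(1, len(words) - 1):
        if words[i] in topn:
            cooc[words[i - 1]][words[i]] += 1  # neighbor precedes a common word
            cooc[words[i + 1]][words[i]] -= 1  # neighbor follows a common word

    # "king" occurs once right after "the" and once right before "of",
    # so its (unnormalized) vector over ["the", "of"] is [-1, 1]:
    print([cooc["king"][t] for t in topn])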
And then I normalize those vectors, throwing out the ones whose norm is too small to be interesting, and sort all the words by their dot product with the vector of some chosen word, in this case 'of'.
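(Since the vectors are unit length after normalization, that dot product is just cosine similarity. A quick sanity check of the equivalence, mine, not part of the script:)

    import numpy

    def cosine(u, v):
        return numpy.vdot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v))

    u = numpy.array([3.0, 4.0]) / 5.0  # unit vector
    v = numpy.array([1.0, 0.0])        # unit vector
    print(numpy.vdot(u, v), cosine(u, v))  # both print 0.6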
Dunno, it seems interesting that adjectives come out as "the opposite" of prepositions in this space, and determiners appear to be the opposite of (some?) nouns. And the adjective-preposition axis seems roughly orthogonal to the determiner-noun axis.
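(If you wanted to eyeball that orthogonality claim, a couple of lines tacked onto the end of the script above would do it. This is my untested suggestion, with "of"/"into" standing in for the preposition axis and "the" for the determiner axis:)

    # cross-axis pairs ("the" vs the prepositions) should come out near zero
    # if the axes are orthogonal; ("of", "into") is a same-axis contrast,
    # which we already know scores ~0.979
    for a, b in [("the", "of"), ("the", "into"), ("of", "into")]:
        if a in vecs and b in vecs:
            print(a, b, numpy.vdot(vecs[a], vecs[b]))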
---
omg and I tried seeing if that crazy vector-arithmetic thing from Google's word2vec paper did anything, and replaced the vtarget line with:
    vtarget = [vecs["take"][i] - vecs["taking"][i] + vecs["capturing"][i] for i in range(N)]
in order to ask the system to complete the analogy
taking:take::capturing:?
and the best hit (which couldn't be the correct answer "capture", since that token never occurred in the text) was nonetheless "seize". WHAT IS THIS SORCERY.
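(If you want to play with more analogies, a tiny helper like this is the natural generalization. My sketch, appended to the script above; the three input words are excluded since they trivially score near the top:)

    # a:b :: c:?  (reuses vecs, N, numpy from the script above)
    def analogy(a, b, c, k=5):
        v = [vecs[b][i] - vecs[a][i] + vecs[c][i] for i in range(N)]
        candidates = [w for w in vecs if w not in (a, b, c)]
        candidates.sort(key=lambda w: -numpy.vdot(v, vecs[w]))
        return candidates[:k]

    print(analogy("taking", "take", "capturing"))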