Jason (jcreed) wrote,

Somehow I keep finding myself playing naively in an NLP-ish sandbox without really knowing the first thing about legitimate academic NLP.

for instance,

from collections import defaultdict
from math import sqrt
import numpy
import re

words = []
with open("pg.txt") as f:
    for line in f:
        # Lowercase, turn every run of non-letters into a space, then split into tokens.
        words.extend(re.sub(r'[^a-z]+', " ", line.lower()).split())

counts = defaultdict(int)
for x in words:
    counts[x] += 1

N = 100

# Vocabulary sorted by descending frequency; the first N entries are the context words.
common = sorted(counts.keys(), key=lambda y: -counts[y])
topn = common[:N]

# after[w][c]: +1 each time w appears immediately before common word c,
# -1 each time w appears immediately after it.
after = defaultdict(lambda: defaultdict(int))
for wi in range(2, len(words) - 2):
    if words[wi] in topn:
        after[words[wi - 1]][words[wi]] += 1
        after[words[wi + 1]][words[wi]] -= 1

vecs = {}
for cw in common:
    vecs[cw] = [after[cw][tw] for tw in topn]

def normalize(v):
    # Scale v to unit length; vectors shorter than 4 are reported as too small.
    norm = sqrt(sum(x * x for x in v))
    if norm < 4:
        return None
    return [x / norm for x in v]

for cw in common:
    nv = normalize(vecs[cw])
    if nv:
        vecs[cw] = nv
    else:
        # Too small to be interesting: drop it so it isn't ranked unnormalized.
        del vecs[cw]

target = "of"
vtarget = vecs[target]

ranked = sorted(vecs.keys(), key=lambda x: -numpy.vdot(vtarget, vecs[x]))

for x in ranked:
    print("%s %s" % (x, numpy.vdot(vtarget, vecs[x])))

thrown at some Project Gutenberg text, yields a whole bunch of prepositions all clumped together at one end of the ranking by dot product:
of 1.0
into 0.978894370397
from 0.978135291622
for 0.978033825851
upon 0.977359735558
in 0.975798319051
by 0.972774713053
on 0.972298555885
among 0.969156476196
during 0.96721552238
towards 0.966558119514
with 0.962186062745
between 0.954884847906
to 0.946391749022
through 0.942979756684
respecting 0.940621244463
at 0.939330654695

and a whole bunch of adjectives all clumped together at the other extreme:
entire -0.959733960081
first -0.962401201704
fourth -0.96430285823
next -0.964411719589
following -0.964601889971
principal -0.966878822836
second -0.968995783247
united -0.972999124908
greatest -0.973458299043
utmost -0.973928331974
whole -0.978632110159

What I'm doing is taking the 100 most common words and assigning a length-100 vector to every word w in the vocabulary of the text. The ith element of w's vector gets a ++ every time you see w immediately before the ith most common word, and a -- every time you see w immediately after it.
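A minimal sketch of that construction on a made-up toy sentence rather than the Gutenberg text. The function name `context_vectors`, the alphabetical tie-break, and the choice to visit every token position are assumptions for illustration (the code above skips the first and last two positions).

```python
from collections import Counter, defaultdict

def context_vectors(tokens, n_common):
    """Return (topn, vecs): the n_common most frequent tokens and one signed count vector per word."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so the order is deterministic.
    topn = sorted(counts, key=lambda w: (-counts[w], w))[:n_common]
    index = {w: i for i, w in enumerate(topn)}
    vecs = defaultdict(lambda: [0] * len(topn))
    for i, tok in enumerate(tokens):
        if tok not in index:
            continue
        col = index[tok]
        if i > 0:
            vecs[tokens[i - 1]][col] += 1  # previous word sits immediately before a common word
        if i + 1 < len(tokens):
            vecs[tokens[i + 1]][col] -= 1  # next word sits immediately after a common word
    return topn, dict(vecs)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
topn, vecs = context_vectors(tokens, 3)
print(topn)
for w in sorted(vecs):
    print(w, vecs[w])
```

In this toy run "cat" and "dog" end up with the same vector, since both follow "the" and precede "sat".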

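And then I normalize those vectors, throwing out the ones whose length is too small (below 4) to be interesting, and sort all the words by their dot product with some chosen word, in this case 'of'. Since the vectors are unit length, that dot product is just the cosine of the angle between them.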
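Roughly, the normalize-and-rank step looks like the sketch below. The vectors in `toy` are invented for illustration, and `rank_by_similarity` and `min_norm` are hypothetical names, not anything from the script above.

```python
from math import sqrt

def normalize(v, min_norm=0.0):
    """Return v scaled to unit length, or None if it is zero or shorter than min_norm."""
    norm = sqrt(sum(x * x for x in v))
    if norm == 0 or norm < min_norm:
        return None
    return [x / norm for x in v]

def rank_by_similarity(vecs, target, min_norm=0.0):
    """Rank the surviving words by dot product of unit vectors with the target's unit vector."""
    unit = {}
    for w, v in vecs.items():
        nv = normalize(v, min_norm)
        if nv is not None:
            unit[w] = nv
    t = unit[target]
    scores = {w: sum(a * b for a, b in zip(t, v)) for w, v in unit.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Invented toy vectors: "in" points the same way as "of", "whole" the opposite way,
# and "rare" is too short to survive the min_norm cutoff.
toy = {"of": [3, 4, 0], "in": [6, 8, 0], "whole": [-3, -4, 0], "rare": [0, 0, 1]}
ranking = rank_by_similarity(toy, "of", min_norm=2)
for word, score in ranking:
    print(word, score)
```

Dropping short vectors before ranking matters: a vector left unnormalized would be compared on a different scale from the rest.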

Dunno, it seems interesting that adjectives seem like "the opposite" of prepositions in this space, and also determiners appear to be the opposite of (some?) nouns. And the adjective-preposition axis seems roughly orthogonal to the determiner-noun axis.
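One way to put a number on "roughly orthogonal" is to take each axis as the difference between two group centroids and look at the cosine between the two axes. This is only a sketch of one possible reading of "axis"; the four groups of vectors below are invented, not taken from the Gutenberg run.

```python
from math import sqrt

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def axis(group_a, group_b):
    """Direction pointing from the centroid of group_b to the centroid of group_a."""
    ca, cb = centroid(group_a), centroid(group_b)
    return [x - y for x, y in zip(ca, cb)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented stand-ins for the four word classes.
preps = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
adjs = [[-1.0, 0.0, 0.0], [-0.9, -0.1, 0.1]]
dets = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
nouns = [[0.0, -1.0, 0.1], [-0.1, -0.9, 0.0]]

prep_adj_axis = axis(preps, adjs)
det_noun_axis = axis(dets, nouns)
cos = cosine(prep_adj_axis, det_noun_axis)
print(cos)  # near 0 means the two axes are close to orthogonal
```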


omg and I tried seeing if that crazy thing from Google's word2vec paper did anything, and replaced the vtarget line with:

vtarget = [-vecs["taking"][i] + vecs["capturing"][i] + vecs["take"][i] for i in range(N)]

in order to ask the system to attempt to best complete the analogy

taking : take :: capturing : ?

and the best hit --- which couldn't be the correct answer "capture" since that token never occurred in the text --- was nonetheless "seize". WHAT IS THIS SORCERY.
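The analogy query can be sketched as a small function. The vectors here are invented integers (not normalized, not from the text), and skipping the three input words when ranking is an assumption; the post doesn't say whether the original run did that.

```python
def analogy(vecs, a, b, c, exclude_inputs=True):
    """Rank words by dot product with vecs[b] - vecs[a] + vecs[c], i.e. a : b :: c : ?"""
    n = len(vecs[a])
    query = [vecs[b][i] - vecs[a][i] + vecs[c][i] for i in range(n)]
    scored = []
    for w, v in vecs.items():
        if exclude_inputs and w in (a, b, c):
            continue  # assumption: don't let the query words answer their own question
        scored.append((sum(q * x for q, x in zip(query, v)), w))
    return [w for _, w in sorted(scored, reverse=True)]

# Invented toy vectors: the last component loosely separates "-ing" forms from bare verbs.
toy = {
    "taking":    [2, 0, 1],
    "take":      [2, 0, -1],
    "capturing": [0, 2, 1],
    "seize":     [0, 2, -1],
    "seizing":   [0, 2, 1],
    "whole":     [-1, -1, 0],
}
hits = analogy(toy, "taking", "take", "capturing")
print(hits)
```

With these made-up numbers the query vector is [0, 2, -1], which is exactly the toy vector for "seize", so it comes out on top.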
Tags: language
