Notes from a Medium-Sized Island
Jason

[Oct. 15th, 2013|08:56 pm]

Somehow I keep finding myself playing naively in an NLP-ish sandbox without really knowing the first thing about legitimate academic NLP.

for instance,

from collections import defaultdict
from math import sqrt
import numpy
import re

# crude tokenization: lowercase, replace non-letters with spaces, split on whitespace
words = []
with open("pg.txt") as f:
    for x in f.readlines():
        x = re.sub(r'[^a-z]+', " ", x.lower())
        words.extend(x.split())

# count word frequencies
counts = defaultdict(int)
for x in words:
    counts[x] += 1

N = 100

# vocabulary sorted by descending frequency; topn is the N most common words
common = sorted(counts, key=lambda y: -counts[y])
topn = common[:N]


# after[w][t] gets +1 each time w occurs immediately before top-N word t,
# and -1 each time w occurs immediately after it
after = defaultdict(lambda: defaultdict(int))
for wi in range(2, len(words) - 2):
    if words[wi] in topn:
        after[words[wi - 1]][words[wi]] += 1
        after[words[wi + 1]][words[wi]] -= 1

# each word's vector: its signed counts against the N most common words
vecs = {}
for cw in common:
    vecs[cw] = [after[cw][tw] for tw in topn]

def normalize(v):
    # scale to unit length; reject vectors that are too short (too few co-occurrences)
    norm = sqrt(sum(x * x for x in v))
    if norm < 4:
        return None
    return [x / norm for x in v]

# keep only words with enough signal, replacing their vectors with unit vectors
for cw in common:
    nv = normalize(vecs[cw])
    if nv:
        vecs[cw] = nv
    else:
        del vecs[cw]

target = "of"
vtarget = vecs[target]

# rank every surviving word by its dot product (cosine similarity) with the target
ranked = sorted(vecs.keys(), key=lambda x: -numpy.vdot(vtarget, vecs[x]))

for x in ranked:
    print x, numpy.vdot(vtarget, vecs[x])



thrown at some Project Gutenberg text yields a whole bunch of prepositions all clumped together at one end of the similarity ranking:
of 1.0
into 0.978894370397
from 0.978135291622
for 0.978033825851
upon 0.977359735558
in 0.975798319051
by 0.972774713053
on 0.972298555885
among 0.969156476196
during 0.96721552238
towards 0.966558119514
with 0.962186062745
between 0.954884847906
to 0.946391749022
through 0.942979756684
respecting 0.940621244463
at 0.939330654695

and a whole bunch of adjectives all clumped together at the other extreme:
entire -0.959733960081
first -0.962401201704
fourth -0.96430285823
next -0.964411719589
following -0.964601889971
principal -0.966878822836
second -0.968995783247
united -0.972999124908
greatest -0.973458299043
utmost -0.973928331974
whole -0.978632110159


What I'm doing is taking the 100 most common words and assigning a length-100 vector to each word w in the vocabulary of the text, where the ith element gets +1 every time w appears immediately before the ith most common word, and -1 every time w appears immediately after the ith most common word.

And then I normalize those vectors, throwing out the ones that are too small to be interesting, and sort all the words by their dot product with some chosen word, in this case 'of'.
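(To make the +1/-1 bookkeeping concrete, here's a toy run of just the counting step on a made-up sentence, with N=2 and pretending the two most common words are "the" and "of"; the names toy_words, toy_top, and ctx are only for illustration:)

from collections import defaultdict

# made-up sentence; pretend the 2 "most common" words are "the" and "of"
toy_words = "he put the cat of the house on the mat".split()
toy_top = ["the", "of"]

ctx = defaultdict(lambda: defaultdict(int))
for i in range(1, len(toy_words) - 1):
    if toy_words[i] in toy_top:
        ctx[toy_words[i - 1]][toy_words[i]] += 1   # +1: seen just before a common word
        ctx[toy_words[i + 1]][toy_words[i]] -= 1   # -1: seen just after a common word

# "cat" occurs once just before "of" and once just after "the",
# so its vector over ["the", "of"] comes out as [-1, 1]
print([ctx["cat"][t] for t in toy_top])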

Dunno, it seems interesting that adjectives seem like "the opposite" of prepositions in this space, and also determiners appear to be the opposite of (some?) nouns. And the adjective-preposition axis seems roughly orthogonal to the determiner-noun axis.

---

omg and I tried seeing if that crazy thing from Google's word2vec paper did anything, and replaced the vtarget line with:

vtarget = [-vecs["taking"][i] + vecs["capturing"][i] + vecs["take"][i] for i in range(N)]

in order to ask the system to attempt to best complete the analogy

taking:take::capturing:?

and the best hit --- which couldn't be the correct answer "capture" since that token never occurred in the text --- was nonetheless "seize". WHAT IS THIS SORCERY.
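(Wrapping that up as a generic helper over the vecs dict and N from the code above; this is just a rough sketch, and analogy() is an illustrative name rather than anything standard:)

# rough sketch: solve a:b :: c:? by ranking words near (-vec(a) + vec(b) + vec(c)),
# reusing the normalized vectors in vecs and the constant N from above
def analogy(a, b, c, k=5):
    v = [-vecs[a][i] + vecs[b][i] + vecs[c][i] for i in range(N)]
    scored = sorted(vecs.keys(), key=lambda x: -numpy.vdot(v, vecs[x]))
    # don't report the query words themselves
    return [w for w in scored if w not in (a, b, c)][:k]

print(analogy("taking", "take", "capturing"))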

Comments:
From: oniugnip
2013-10-16 05:09 am (UTC)
Distributional semantics; you're doing it! :D
(Reply) (Thread)
From: jcreed
2013-10-16 06:14 pm (UTC)
\o/
(Reply) (Parent) (Thread)
From: jcreed
2013-10-16 06:58 pm (UTC)
So, like, is there anything (cross-linguistically?) systematic to the fact that determiners seem very "after"-y? (or I guess viewed from the other end "before"-y, but whatever...) Like how "the" occurs extremely rarely before any other common word, and quite commonly after them?

And I seem to recall from a previous ad-hoc experiment, the entropy of the distribution of tokens occurring before "the", "a", etc. was much lower than after, since they were more likely to be drawn from a smaller set of prepositions and conjunctions and stuff, rather than the much more unbounded set of, e.g. nouns.
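(A rough sketch of that kind of measurement, reusing the words list from the code in the post; context_entropy is just an illustrative name for it:)

from collections import defaultdict
from math import log

# sketch: entropy of the distribution of tokens one position before (offset=-1)
# or after (offset=+1) a given word; reuses `words` from the post's code
def context_entropy(w, offset):
    dist = defaultdict(int)
    for i in range(1, len(words) - 1):
        if words[i] == w:
            dist[words[i + offset]] += 1
    total = float(sum(dist.values()))
    return -sum((c / total) * log(c / total, 2) for c in dist.values())

for w in ["the", "a", "of"]:
    print("%s before=%.3f after=%.3f" % (w, context_entropy(w, -1), context_entropy(w, 1)))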
(Reply) (Parent) (Thread)
From: jcreed
2013-10-16 06:59 pm (UTC)
I guess I should clarify that obviously I don't suspect determiners should always have the same directionality, since languages can be head-final or head-initial in various ways... I think what I'm trying to ask is whether it's known whether you can, across languages, get a decent read on lexical categories just by local distributional statistics like this?
(Reply) (Parent) (Thread)
From: oniugnip
2013-10-16 08:48 pm (UTC)
No yeah! I think your intuitions sound pretty much right! This is the nature of corpus linguistics, finding out stuff like that.

Some languages have determiners in different places, of course, and not all languages have determiners. Also I think sometimes definiteness is just marked on the noun itself...
(Reply) (Parent) (Thread)
From: eub
2013-10-16 07:52 am (UTC)
If you put up an online analogy solver, that would be awesome. (There isn't one, that I can see; I don't know if this is surprising?)
(Reply) (Thread)
From: jcreed
2013-10-16 06:13 pm (UTC)
sadly *most* of the answers it gives are total garbage. I got lucky on my first try.
(Reply) (Parent) (Thread)
(Deleted comment)
From: lindseykuper
2013-10-16 07:31 pm (UTC)
Yeah, I meant to say this, too. I think NLTK already has built-in stuff for some of what you're trying to do, jcreed. (Also, Alex is an NLTK committer, so you can bother him with questions!)
(Reply) (Parent) (Thread)
From: oniugnip
2013-10-16 08:49 pm (UTC)
Do it in SML and call it SNLTK and collide with the Scheme/Racket one!
(Reply) (Parent) (Thread)
From: queen_elvis
2013-10-17 08:22 pm (UTC)
My husband looked this over last night and said you're on the cutting edge with distributional semantics. Apparently it was the talk of the conference at ACL.
(Reply) (Thread)
From: eub
2013-10-18 08:53 am (UTC)
I wonder what the first few principal components of this would turn out to be. Though come to think, what's the right way to do PCA on S^n?
(Reply) (Thread)
From: jcreed
2013-10-18 10:49 pm (UTC)
Doesn't seem entirely unreasonable to just do PCA on the raw vectors and ignore the fact that they happen to be normalized. They still might be clustered around two antipodal points, and PCA would notice that.

Still, I feel you that there ought to be some compensating change in the choice of algorithm for the fact that the data don't actually live in all of R^n...
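(A minimal sketch of that, assuming the vecs dict from the code in the post; plain PCA via numpy's SVD, ignoring the sphere question entirely:)

import numpy

# stack the normalized context vectors into a matrix, one row per surviving word
ws = sorted(vecs.keys())
X = numpy.array([vecs[w] for w in ws])
X = X - X.mean(axis=0)  # center before PCA

# principal directions = right singular vectors of the centered matrix
U, s, Vt = numpy.linalg.svd(X, full_matrices=False)

# look at the words at the two extremes of the first principal component
pc1 = X.dot(Vt[0])
order = numpy.argsort(pc1)
print([ws[i] for i in order[:10]])   # one end
print([ws[i] for i in order[-10:]])  # the other end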
(Reply) (Parent) (Thread)