Notes from a Medium-Sized Island
Jason


[Oct. 15th, 2013|08:56 pm]
Jason
 [ Tags | language ]

Somehow I keep finding myself playing naively in an NLP-ish sandbox without really knowing the first thing about legitimate academic NLP.

for instance,

```
from collections import defaultdict
from math import sqrt
import numpy
import re

# tokenize: lowercase and keep only runs of letters
words = []
with open("pg.txt") as f:
    for line in f:
        line = re.sub(r'[^a-z]+', " ", line.lower())
        words.extend(line.split())

# word frequencies
counts = defaultdict(int)
for w in words:
    counts[w] += 1

N = 100

common = sorted(counts, key=lambda w: -counts[w])
topn = common[:N]
topset = set(topn)  # fast membership test inside the loop

# signed context counts: ++ when a word appears immediately before a
# top-N word, -- when it appears immediately after one
ctx = defaultdict(lambda: defaultdict(int))
for i in range(1, len(words) - 1):
    if words[i] in topset:
        ctx[words[i - 1]][words[i]] += 1
        ctx[words[i + 1]][words[i]] -= 1

vecs = {cw: [ctx[cw][tw] for tw in topn] for cw in common}

def normalize(v):
    length = sqrt(sum(x * x for x in v))
    if length < 4:  # too rare to be interesting
        return None
    return [x / length for x in v]

for cw in common:
    nv = normalize(vecs[cw])
    if nv:
        vecs[cw] = nv
    else:
        del vecs[cw]

target = "of"
vtarget = vecs[target]

ranked = sorted(vecs, key=lambda w: -numpy.vdot(vtarget, vecs[w]))

for w in ranked:
    print(w, numpy.vdot(vtarget, vecs[w]))
```

Thrown at some Project Gutenberg text, this yields a whole bunch of prepositions all clumped together at one end of the ranking:
```
of 1.0
into 0.978894370397
from 0.978135291622
for 0.978033825851
upon 0.977359735558
in 0.975798319051
by 0.972774713053
on 0.972298555885
among 0.969156476196
during 0.96721552238
towards 0.966558119514
with 0.962186062745
between 0.954884847906
to 0.946391749022
through 0.942979756684
respecting 0.940621244463
at 0.939330654695
```

and a whole bunch of adjectives all clumped together at the other extreme:
```
entire -0.959733960081
first -0.962401201704
fourth -0.96430285823
next -0.964411719589
following -0.964601889971
principal -0.966878822836
second -0.968995783247
united -0.972999124908
greatest -0.973458299043
utmost -0.973928331974
whole -0.978632110159
```

What I'm doing is taking the 100 most common words and assigning a length-100 vector to each word w in the text's vocabulary: the ith element gets a ++ every time w appears immediately before the ith most common word, and a -- every time w appears immediately after it.

And then I normalize those vectors, throwing out the ones that are too small to be interesting, and sort all the words by their dot product with some chosen word, in this case 'of'.
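
The whole pipeline shrinks to a toy you can check by hand; this sketch is mine (tiny corpus, two "top-N" words, and the function names `context_vectors`/`cosine` are not from the code above), but it implements the same ++/-- scheme and cosine comparison:

```python
from collections import defaultdict
from math import sqrt

def context_vectors(words, topn):
    """vec[w][i] += 1 each time w appears right before topn[i],
    and vec[w][i] -= 1 each time w appears right after topn[i]."""
    index = {t: i for i, t in enumerate(topn)}
    vecs = defaultdict(lambda: [0] * len(topn))
    for i, w in enumerate(words):
        if w in index:
            j = index[w]
            if i > 0:
                vecs[words[i - 1]][j] += 1
            if i + 1 < len(words):
                vecs[words[i + 1]][j] -= 1
    return dict(vecs)

def cosine(u, v):
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

# "cat" and "dog" both sit between "the" and "sat", so their vectors agree
words = "the cat sat and the dog sat".split()
vecs = context_vectors(words, ["the", "sat"])
print(cosine(vecs["cat"], vecs["dog"]))  # cosine is ~1.0: same contexts
```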

Dunno, it seems interesting that adjectives seem like "the opposite" of prepositions in this space, and also determiners appear to be the opposite of (some?) nouns. And the adjective-preposition axis seems roughly orthogonal to the determiner-noun axis.

---

omg and I tried seeing if that crazy thing from Google's word2vec paper did anything, and replaced the vtarget line with:

```
vtarget = [-vecs["taking"][i] + vecs["capturing"][i] + vecs["take"][i] for i in range(N)]
```

in order to ask the system to attempt to best complete the analogy

taking:take::capturing:?

and the best hit --- which couldn't be the correct answer "capture" since that token never occurred in the text --- was nonetheless "seize". WHAT IS THIS SORCERY.
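
That one-off vtarget line generalizes into a little helper. A sketch, assuming a dict of normalized vectors shaped like vecs above; the function name `analogy` and the toy vectors are mine, not from the post:

```python
import numpy

def analogy(vecs, a, b, c, topk=5):
    # for a:b :: c:?, rank candidates by dot product with (-a + b + c),
    # skipping the three query words themselves
    target = [-x + y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: -numpy.vdot(target, vecs[w]))[:topk]

# toy vectors standing in for the real normalized context vectors
vecs = {
    "taking":    [1.0, 0.0, 1.0],
    "take":      [1.0, 0.0, 0.0],
    "capturing": [0.0, 1.0, 1.0],
    "seize":     [0.0, 1.0, 0.0],
    "whole":     [-1.0, -1.0, 0.0],
}
print(analogy(vecs, "taking", "take", "capturing", topk=1))  # ['seize']
```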

From: 2013-10-16 05:09 am (UTC)
Distributional semantics; you're doing it! :D
From: 2013-10-16 06:14 pm (UTC)
\o/
From: 2013-10-16 06:58 pm (UTC)
So, like, is there anything (cross-linguistically?) systematic to the fact that determiners seem very "after"-y? (or I guess viewed from the other end "before"-y, but whatever...) Like how "the" occurs extremely rarely before any other common word, and quite commonly after them?

And I seem to recall from a previous ad-hoc experiment, the entropy of the distribution of tokens occurring before "the", "a", etc. was much lower than after, since they were more likely to be drawn from a smaller set of prepositions and conjunctions and stuff, rather than the much more unbounded set of, e.g. nouns.
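
Something like this, say (a minimal sketch of that measurement; the toy corpus and the name `neighbor_entropy` are made up here, not from the original experiment):

```python
from collections import Counter
from math import log2

def neighbor_entropy(words, anchor, offset):
    """Shannon entropy (bits) of the distribution of tokens at
    position i + offset relative to each occurrence of `anchor`."""
    neigh = Counter(
        words[i + offset]
        for i, w in enumerate(words)
        if w == anchor and 0 <= i + offset < len(words)
    )
    total = sum(neigh.values())
    return -sum((c / total) * log2(c / total) for c in neigh.values())

words = "of the cat of the hat of the mat".split()
# before "the" is always "of" (entropy 0 bits); after "the" are three
# distinct nouns (entropy log2(3) ~ 1.58 bits)
print(neighbor_entropy(words, "the", -1), neighbor_entropy(words, "the", 1))
```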
From: 2013-10-16 06:59 pm (UTC)
I guess I should clarify that obviously I don't suspect determiners should always have the same directionality, since languages can be head-final or head-initial in various ways... I think what I'm trying to ask is whether it's known whether you can, across languages, get a decent read on lexical categories just by local distributional statistics like this?
From: 2013-10-16 08:48 pm (UTC)
No yeah! I think your intuitions sound pretty much right! This is the nature of corpus linguistics, finding out stuff like that.

Some languages have determiners in different places, of course, and not all languages have determiners. Also I think sometimes definiteness is just marked on the noun itself...
From: 2013-10-16 07:52 am (UTC)
If you put up an online analogy solver, that would be awesome. (There isn't one, that I can see; I don't know if this is surprising?)
From: 2013-10-16 06:13 pm (UTC)
sadly *most* of the answers it gives are total garbage. I got lucky on my first try.
(Deleted comment)
From: 2013-10-16 07:31 pm (UTC)
Yeah, I meant to say this, too. I think NLTK already has built-in stuff for some of what you're trying to do, jcreed. (Also, Alex is an NLTK committer, so you can bother him with questions!)
From: 2013-10-16 08:49 pm (UTC)
Do it in SML and call it SNLTK and collide with the Scheme/Racket one!