Part-of-speech tagging (often just called “POS tagging”) is one of the few NLP tasks that routinely gets very high accuracy scores, usually in the high nineties. So the idea is something of an old chestnut in NLP. The specific tags used vary a lot, but here’s an example to give you an idea of what tagged text looks like, before and after:
Automatically add part-of-speech tags to words in text.
Automatically/adverb add/verb part-of-speech/adjective tags/noun to/to words/noun in/preposition text/noun./punctuation
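If you want to try this at home, a minimal sketch with NLTK (my choice of library, not something special about the example above) looks roughly like the following; note that its default tagger spits out Penn Treebank tags like RB and NN rather than the plain-English labels I used:

```python
# A minimal POS-tagging sketch using NLTK's default tagger.
import nltk

# Grab the tokenizer and tagger models if they aren't already installed.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Automatically add part-of-speech tags to words in text."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
# Prints something like: Automatically/RB add/VB part-of-speech/JJ tags/NNS to/TO ...
```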
So, assuming you’ve gone through some hocus-pocus to end up with a bunch of tagged text, what do you do with it?
Well, certain patterns of parts of speech tend to indicate terminology. If the same two nouns keep showing up next to each other in a text, for instance, chances are that the words are related, that is, they constitute a “term.” This approach is described in section 5.1 of this chapter of Manning and Schütze’s text. It’s a very intuitive approach, and even a very simple implementation will turn up some useful stuff.
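Just to make that concrete, here’s a hedged sketch of what a bare-bones version might look like, again leaning on NLTK. The candidate_terms function and the particular tag patterns are my own minimal guess at the filter, not the exact set from the book:

```python
# A bare-bones take on the section 5.1 idea: POS-tag the text, then
# count adjacent word pairs whose tags match a few terminology-friendly
# patterns (adjective+noun, noun+noun).
from collections import Counter
import nltk

def candidate_terms(text, top_n=10):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Penn Treebank tag pairs accepted as term candidates (a minimal subset).
    patterns = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"),
                ("NN", "NNS"), ("NNS", "NN")}
    counts = Counter()
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged):
        if (t1, t2) in patterns:
            counts[(w1.lower(), w2.lower())] += 1
    return counts.most_common(top_n)
```

Run over a decent-sized document, the pairs that bubble to the top of that list tend to be exactly the “same two nouns keep showing up together” cases.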
This kind of “shallow analysis” doesn’t attempt to find any long-distance relationships between phrases and words. There’s no parsing going on, in other words, just some pattern matching.
So part-of-speech tagging has its uses.
But what if we went even shallower than that? What if we tried to look exclusively at the statistical patterns of words? Can we get any useful information out?
So, I have in mind a small experiment, and I’ll just write it up here as I go.
As I’ve mentioned, one interest of mine is translation, of the human variety (as opposed to fully automated machine translation). I’m no pro at translation, so I don’t know if more qualified people feel the same way, but I find it can be a pretty tedious endeavor: I’ll be looking at a long sentence I need to translate, wishing I could somehow subdivide it into phrases. So here’s a sentence I translated recently:
Es natural que exista un gran contraste entre su estilo de vida original y la nueva sociedad, que debe ser superado paulatinamente, pero la manera de resolverlo de Tatum bordea la ilegalidad.
What I want to see is something along the lines of:
Es natural
que exista
un gran contraste
entre su estilo
de vida original
y la nueva sociedad,
que debe ser superado paulatinamente,
pero la manera
de resolverlo
de Tatum bordea la ilegalidad.
I’m not even looking for indentation; I just want to see sensible subphrases on their own lines so I can break down the translation process. Like I said, I don’t know if real translators work this way, but when I’m the one translating, an approach like this seems like it would be useful.
So how to automate it? If we had a part-of-speech tagger for Spanish (and I’m sure it wouldn’t be too difficult to find one) and a parser (which would be a bit harder to find), we could write some system to place noun phrases on their own lines, and so on.
But that seems like overkill… we’re just splitting a sentence into more manageable chunks, after all. So here’s my observation: in the sentence I split up manually above, the words that begin each line seem to be frequent ones. What if we “split on frequency”? We’ll take the most frequent words in the document and split the sentence into subphrases beginning with those words. Will it be useful? I have no idea; there’s a rough sketch of what I mean below, and I’ll try it out tomorrow. 8^)
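To pin down what I mean, here’s the sketch. Everything in it is a placeholder of my own invention: frequent_words, split_on_frequency, the top-25 cutoff, and whole_document, which stands in for the full text of whatever I’m translating.

```python
# "Split on frequency": find the most frequent words in the document,
# then start a new line whenever one of them shows up mid-sentence.
# The top-25 cutoff is a guess; no idea yet what a good value is.
from collections import Counter
import re

def frequent_words(document, top_n=25):
    words = re.findall(r"\w+", document.lower())
    return {word for word, _ in Counter(words).most_common(top_n)}

def split_on_frequency(sentence, frequent):
    lines, current = [], []
    for word in sentence.split():
        # Break before a frequent word, unless we're at the start of a line anyway.
        if word.lower().strip(".,;") in frequent and current:
            lines.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

# Usage would be something like:
#   print(split_on_frequency(sentence, frequent_words(whole_document)))
# where whole_document is the full text the sentence came from.
```

Whether the output looks anything like my hand-split version above is exactly the question.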