philosophi.ca : The Extraordinary Effectiveness Of Words

An outline of an argument about words and text tools.

1. What is interpretation? A simple model is that a user brings a question to the text (like "what is this about?") and interpretation (the activity) is the set of practices used in answering that question. An interpretation is typically a new text (or other type of artifact) that responds to questions about the source text.

2. Analytical tools help users interpret a text. They work by a) helping the user formalize their question into a "query", and b) then returning a response to the query that helps the interpreter.

3. Most of our analytical tools are built on pattern matching. Concordances, indexes (including Google) and visualizations help us formalize our question by letting us enter a pattern to match and then they return different genres of result displays that show the matches.

4. The patterns we enter, and that the computer matches, are almost always what we call words, whether single words or sequences of words. In English it is relatively easy to specify a pattern for a word and get the computer to find all the instances of that word.
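As a rough illustration of points 3 and 4, here is a minimal keyword-in-context (concordance) sketch in Python. The function name `concordance` and the `width` parameter are hypothetical choices made for this example, not part of any particular tool; it simply matches a word as a whole-word pattern and returns each match with some surrounding text.

```python
import re

def concordance(text, word, width=30):
    """Return one keyword-in-context line per whole-word match of `word`.

    `width` (an assumption of this sketch) is the number of characters
    of context kept on each side of the match.
    """
    # \b anchors the pattern at word boundaries, so "word" does not match "sword".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = "A word is a word. Words are not swords."
for line in concordance(sample, "word"):
    print(line)
```

The user's question has been formalized into a pattern, and the response is a display of matches that the interpreter still has to read.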

This then raises some interesting questions:

A. What is a word? Why do they work so well?

I take it linguists distinguish between orthographic words or OW (those things that are separated by spaces) and semantic words or SW (simple concepts). My first hypothesis is that:

i. The extraordinary and very particular interpretative effectiveness of pattern matching tools is due to the high correlation between OW (which can be parsed by computers or other procedural systems) and SW in English and other similar languages.

In other words, wherever the pattern OW is found, there is a high probability that the author meant the reader to think about the corresponding SW.

This leads to questions about the correlation:

B. What is the correlation between orthographic and semantic words? Can it be measured? Has it been measured?
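One way this question might be made operational, offered only as a sketch: take a sample of matches for an orthographic word, have a human label the sense of each occurrence, and compute the fraction that carry the intended semantic word. The function name `sense_precision` and the toy annotations below are invented for illustration; they are not measurements.

```python
def sense_precision(labelled_matches, intended_sense):
    """Fraction of matched occurrences whose annotated sense is the intended one.

    `labelled_matches` is a list of (passage, sense) pairs, where `sense`
    is a label assigned by a human reader.
    """
    if not labelled_matches:
        raise ValueError("no matches to score")
    hits = sum(1 for _, sense in labelled_matches if sense == intended_sense)
    return hits / len(labelled_matches)

# Toy, invented annotations for matches of the string "bank".
sample = [
    ("deposited the cheque at the bank", "finance"),
    ("sat on the river bank", "river"),
    ("the bank raised its rates", "finance"),
    ("bank charges apply", "finance"),
]
print(sense_precision(sample, "finance"))  # 0.75
```

A figure like this only describes one word in one sample. Whether it generalizes to a measure of OW/SW correlation for a language or a genre is exactly the open question.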

Another tactic is to look at the negative questions:

C. What would affect the correlation between OW and SW?

There are a number of things which are likely to affect the correlation:

ii. The ambiguity of words where the OW can mean more than one SW affects correlation.

Ambiguity reduces correlation, but surprisingly it doesn't seem to reduce the effectiveness of words for tools. How can we explain the ineffectiveness of ambiguity? Have we learned to bypass it with judicious choices of patterns or displays? Could ambiguity be less of an issue because most searches are for patterns and in documents that are less likely to be ambiguous?

iii. The genre of the document being interpreted affects correlation.

It would follow that if ambiguity and polysemy are problems for correlation, then genres of writing that exploit polysemy would be less easy to interpret with word matching tools. Genres like poetry, humour, and satire are hypothesized to be less amenable to interpretation, though in different ways.

iv. The amount of information in the target text affects interpretative effectiveness.

For various reasons the amount of text data seems to affect the effectiveness of interpretative tools based on word matching. First, the more words there are, the better the chance that statistical measurements of word frequency will accurately reflect the interpretation. Second, for many interpretative queries it is enough to find one significant passage, and the more text there is, the more likely it is that some passage will be responsive. Interestingly, this is the situation we now have with the web: lots of data. This also raises questions about the relationship between the quantity of data available for interpretation and its parsable quality. The title of this excursion draws on an essay by Google linguists to the effect that lots of data and statistical techniques outperform high-quality marked-up data and AI techniques. Could the same be true of literary interpretation?
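The statistical measurements of word frequency mentioned above can be sketched very simply. The following is a minimal example, assuming a crude English tokenizer (lowercase runs of letters and apostrophes); the function names are hypothetical and the tokenizing rule is an assumption of this sketch, not a claim about how any given tool works.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count orthographic words, using a crude lowercase letters-only tokenizer."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def relative_frequency(text, word):
    """Share of all tokens in `text` that are exactly `word` (case-insensitive)."""
    counts = word_frequencies(text)
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

text = "The cat sat on the mat"
print(word_frequencies(text).most_common(1))  # [('the', 2)]
print(relative_frequency(text, "the"))
```

With six tokens a relative frequency says very little. The hypothesis in iv is that such numbers become more trustworthy as the amount of text grows.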

v. The language of the text being interpreted affects correlation.

Many languages don't have orthographic words, or to use the computer term, "strings", that correspond to simple semantic words. Spaces don't divide what we call semantic words in all languages. Chinese and Japanese, for example, are normally written without spaces between words.
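A small sketch of the problem, assuming the simplest possible definition of an orthographic word (whatever is separated by whitespace). The Chinese sentence is an illustrative example chosen for this sketch; it means roughly "I like reading".

```python
def orthographic_words(text):
    """Split text into orthographic words, defined here as whitespace-separated strings."""
    return text.split()

english = "I like reading books"
chinese = "我喜欢读书"  # written without spaces between words

print(orthographic_words(english))  # ['I', 'like', 'reading', 'books']
print(orthographic_words(chinese))  # ['我喜欢读书']
```

For the English sentence the spaces hand the computer a usable list of words. For the Chinese sentence the same procedure returns the whole sentence as a single "word", so some other segmentation step would be needed before any word matching could begin.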

There are also languages that have heavily inflected words so that pattern matching a string is less effective for finding all the instances of a SW. English makes pattern matching fairly easy. If you want all the variants of the SW "word" you just do a truncation search for the string "word" and you get "word", "words", "wording" and so on. Can we measure the effectiveness of simple pattern matching for finding relevant orthographic words?
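The truncation search described above can be sketched with a regular expression. This is a minimal example; `truncation_search` is a hypothetical name, and the rule used (the stem at a word boundary followed by any further word characters) is one simple way to read "truncation", not the only one.

```python
import re

def truncation_search(text, stem):
    """Return every orthographic word in `text` that begins with `stem`."""
    # \b requires the stem to start a word; \w* accepts any ending.
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text)]

print(truncation_search("A word, two words, some wording, and a sword.", "word"))
# ['word', 'words', 'wording']

print(truncation_search("go went gold", "go"))
# ['go', 'gold']
```

The second call shows both kinds of error that a measurement would have to count: the pattern picks up "gold", which is unrelated, and misses "went", an irregular form of the same semantic word.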

vi. Word matching identifies where something is discussed not what is being said about it.

In other words, the effectiveness of OW/SW correlation lies partly in what is asked of it or how it is used. Tools that depend on matching and correlation generally tell you *that* something is discussed and *where* it is discussed, not *what* is said about it. We compensate for this by providing displays of the results, like concordances, that make it easy to see what is being said about something. Take Google: you enter the word(s) to search for, and then you skim the results, eliminating the items that don't have the sort of text you want. You end up reading the extracted concording passages and then selecting certain passages for deeper reading. You decide *what* each work is likely to be about from the choices offered.

In conclusion it would seem that the effectiveness of our analytical tools is due to:

Features of the English language such as the correlation of orthographic words and semantic words.
Our expectation that these tools just show us *that* and *where* something is discussed, not *what* is said about it.
Our use of these tools to find explicit information rather than to find poetic figures or propositions.

One of the side-effects of the effectiveness of words is that they have created a threshold of computer-based interpretation beyond which we haven't effectively gone. I contend that all our new techniques, from visualizations to data-mining, are still word based. Their innovations are still anchored in the matching of patterns for words; what is new are the statistical techniques applied to word counts in very large datasets.

Can we do better? Do we need to do better?