These are conference notes from Exploiting Text: A Text Research Workshop in Honour of Frank Wm. Tompa.
Note: these are being written live so there will be lots of typos and I won't cover everything. When the battery runs low or I get tired, I stop.
If you see any problems, please send me corrections.
Thursday, August 7th
John Simpson: Oxford's Big Break: Frank Tompa and the OED
Simpson was the editor of the OED during the important shift to digital form. The digitization led to the publication of the second edition and all sorts of other projects. Simpson reminisced about the digitization of the OED.
Simpson started with some history of the OED. The founding editor was Dr. Murray who died working on the letter T. He showed a picture of Murray in his gown and study.
He then led us through an entry, in this case for "workshop". Then he talked about how the OED got involved with U of Waterloo. The problem at Oxford University Press was that they knew they needed to look at computerization, but had no one locally to help. IBM came on board early, as the OED was seen as a major (big) dataset. Doug Wright (?), the President of Waterloo, saw an opportunity to build the reputation of computer science. The OED was also pushing the limits of SGML. Simpson ended up spending a semester in Kitchener working out how to structure the data, which at that time was really big. By the end he was sure that computerization would work.
By the end of the 1990s it was clear that the web was changing not only how the OED could be published but also how language could be tracked. They are working on a new project (OED 3?) that is revising the complete text of the OED. He showed some videos, including Revising the OED: Potato Salad. They can now search all sorts of historical dictionaries and full-text databases.
Taking the OED online not only makes it easier for users to consult, but also allows users to access data in different ways, especially visually. He talked about the Glasgow Historical Thesaurus of the OED and how they have connected the Thesaurus data to the online OED. There were problems because the Thesaurus was based on an older OED. He showed the historical Thesaurus entry for a sense of spoon. You can see the ontology of the Thesaurus and navigate by it.
Then he showed the timelines. You can see the distribution of words from Canada over time. You can see when words came in from different language communities like Inuit.
They are also worrying about the usual publishing issues, like whether the OED should be available free. He talked about how they are working with different communities, like the science fiction community, to update their entries for words like "viewphone." Scifi enthusiasts have been finding earlier citations than those the staff had found. They are also linking to other dictionaries, like the Dictionary of Old English, and they are working on mobile interfaces.
He ended with an animated visualization, The OED in Two Minutes, that shows words coming into English from other languages.
Some interesting questions were asked about:
Ian Munro: Succinct Data Structures for Text, Graphs and Other Stuff
Munro talked about what he learned from Tompa. "Mind your grammar" - he learned that you should use grammars for "nonstandard things". He also talked about suffix trees to speed up search and deal with storage issues. The general problem is representing structural information in as little room as possible. Then there are questions about how to insert information into such structures. With big text data there are problems of compressing the text and then operating on the compressed data.
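To give a flavour of the succinct data structure idea (this is my own toy sketch, not one of Munro's actual structures): instead of storing explicit positions of 1-bits, you keep a bitvector plus a small amount of precomputed block data, and answer rank queries from those.

```python
# Toy succinct-style rank: precompute 1-counts at block boundaries so
# rank queries only need a short scan inside one block.

def build_rank_blocks(bits, block=8):
    """Precompute the number of 1s before each block boundary."""
    counts, total = [], 0
    for i, b in enumerate(bits):
        if i % block == 0:
            counts.append(total)
        total += b
    return counts

def rank1(bits, counts, i, block=8):
    """Number of 1s in bits[0:i]: block count plus a scan within the block."""
    start = (i // block) * block
    return counts[i // block] + sum(bits[start:i])

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
counts = build_rank_blocks(bits)
print(rank1(bits, counts, 9))  # 5
```

Real succinct structures get this down to o(n) extra bits with constant-time queries; the sketch only shows the space/precomputation trade-off.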
Alastair Moffat: External Suffix Arrays for Large-Scale String Search
Moffat explained what a suffix array is and how it can be used to search for patterns. A suffix array of "she_sells_shells" is an array of the starting positions of all of its suffixes, from the full string down to the final "s", sorted lexicographically. He talked about efficiencies when working with indexes so large that they can't fit in memory. He described a recommended way of indexing and searching large strings.
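A minimal in-memory version of the idea on Moffat's example string (my own sketch of the standard technique, not his external-memory method): sort the suffix start positions, then binary-search for a pattern, since all matching suffixes are contiguous in the array.

```python
# Suffix array of "she_sells_shells" plus pattern search by lower-bound
# binary search over the sorted suffixes.

text = "she_sells_shells"
sa = sorted(range(len(text)), key=lambda i: text[i:])

def find(pattern):
    """Return the sorted start positions of all occurrences of pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound in the suffix array
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in sa[lo:]:                   # matches are contiguous in sa
        if text[i:i + len(pattern)] != pattern:
            break
        hits.append(i)
    return sorted(hits)

print(find("sh"))  # [0, 10]
```

The external-memory problem Moffat described starts exactly here: once `text` and `sa` no longer fit in RAM, each comparison in the binary search can cost a disk seek.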
Michael Lesk: From Searching to Researching
Lesk started with an anecdote about how the OED came to
His talk focused on three phases of digital scholarship.
Then he compared progress in different media. Image handling seems to be about 10 years behind text, and video another 10 years behind that. The average 19th-century book has been scanned half a dozen times thanks to all the book scanners.
People complain that digitization has changed how we read and is dumbing us down. This is an old story - Plato complained about writing.
He talked about tracking ideas and mentioned Schilit and Kolak's "Exploring a Digital Library through Key Ideas" - they were doing this without referencing PAT trees. He gave examples from the Google Ngram Viewer.
Then he talked about sense disambiguation. Words change over time. He talked about "train" and "engine" and how they have drifted.
He asked why digitizing all this text hasn't rescued more obscure authors. He asked if it is true that "big data beats better algorithms". He is worried that we will get too much triviality.
Searching is now the province of machines and digital texts. Reading is increasingly online. Amazon is stomping on the paper publishers of general books. Research is moving to algorithms with authorship studies, stylistic analysis, network analysis and crowdsourcing supplementing, but not yet replacing, traditional criticism.
Geoffrey Rockwell: On the Archaeology of Text Tools
I gave a talk about studying early text tools with some examples.
Charlie Clarke: Time Well Spent
Clarke talked about how we evaluate information access systems. To evaluate systems we have to base evaluation on users. There are two ways to do that: users in the wild (A/B testing, mouse movements ...) and users in a lab (eye tracking, think aloud ...). The problem with lab studies is that they are slow and conditions can't be exactly replicated.
He then talked about rankers for text tools. We don't know if these ranking systems actually tell us what is useful to users. We want to reflect meaningful units to users. The best query system should be like a newspaper - it shows you what you should read, starting with a short intro that expands.
He proposed measuring time well spent versus time spent. He talked about simulated users: he can produce a distribution of different user types and simulate the same type of user on different systems so as to compare them.
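A quick sketch of what I took the simulated-user idea to mean (the model, numbers and stopping rule here are my own invention, not Clarke's): a user scans a ranked list top-down, spends time on each result, and may stop after any one; "time well spent" is the fraction of time spent on relevant results, and the same simulated user can be run against different rankings.

```python
import random

def simulate(ranking, relevant, read_time=30, stop_prob=0.3,
             trials=2000, seed=0):
    """Fraction of simulated reading time spent on relevant documents."""
    rng = random.Random(seed)
    total = well_spent = 0
    for _ in range(trials):
        for doc in ranking:
            total += read_time
            if doc in relevant:
                well_spent += read_time
            if rng.random() < stop_prob:  # user may abandon the list
                break
    return well_spent / total

relevant = {"d1", "d4"}
good = ["d1", "d4", "d2", "d3"]   # relevant documents ranked first
bad = ["d2", "d3", "d1", "d4"]    # relevant documents ranked last
print(round(simulate(good, relevant), 2), round(simulate(bad, relevant), 2))
```

Because the same seeded user population is replayed on both systems, the comparison is replicable in a way that live user studies aren't.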
Evangelos Milios: Interactive Term-Supervised Text Document Clustering
Milios talked about visual text analysis and interactive text analysis. The normal view is that visualization is a way of presenting the results of text mining. If you introduce a user, then the user begins to interact with the text-mining back end to change the visualization. The user is typically not a data scientist but a domain expert.
Interactive document clustering is interesting because typically there is little metadata of any quality, and unsupervised clustering isn't satisfactory for domain experts. Users want to interact with the clustering to improve what they get back. Rather than improving the algorithm, they want to let users move terms around. They focus on terms. Their algorithm is Lexical Double Clustering.
He talked about the limits of the "bag of words" (BOW) model. We need a more abstract conceptual representation. We can use external knowledge sources (like WordNet and Wikipedia). WordNet is limited as it doesn't cover named entities. He talked about how they take advantage of Wikipedia to get a "bag of concepts" (BOC) method. He talked about "wikifying" documents; not sure what that is. BOW, however, works better than BOC alone, so they tried combining them.
He talked about the Sunflower system that they have developed, which is more concept-centred. To evaluate interactive algorithms you need to study users differently, as you need to look at what they do in their environment and over time.
Jody Palmer: Opportunities for Content Analytics
Palmer is at Open Text and talked about what can be done with analytics and the challenges. Content analytics is the purposeful analysis of content. There are all sorts of business needs, from policy automation, customer service, and search support to criminal tracking and so on. He talked about what analysis is and then what content is. Open Text can pull content from calendars, chat, bug postings, discussions and so on. Some of the things they do are concept extraction and entity extraction. He talked about "crisp rules" based on metadata that the user can edit.
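My guess at what an editable metadata-based "crisp rule" might look like (the rule names and metadata fields are my own invention, not Open Text's actual rule language): each rule is a predicate over a document's metadata plus a category to assign, evaluated in order.

```python
# Crisp rules: first matching predicate over document metadata wins.

rules = [
    ("legal-hold", lambda m: m.get("department") == "Legal"),
    ("archive", lambda m: m.get("age_years", 0) > 7),
    ("general", lambda m: True),  # fallback category
]

def categorize(metadata):
    """Return the first category whose rule matches the metadata."""
    return next(cat for cat, test in rules if test(metadata))

print(categorize({"department": "Legal", "age_years": 2}))   # legal-hold
print(categorize({"department": "Sales", "age_years": 10}))  # archive
```

Because the rules are data rather than code, a non-programmer could plausibly edit the list, which seems to be the point Palmer was making.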
They provide faceted search over concepts, entities and tone. They support companies like Evolve24 that provide sentiment analysis to other companies about events and products.
Categorization drives a lot of other stuff. It can help automate records management. It can act on legacy data. Categorization is important as the volume of data is too large to categorize manually.
It sounds like Open Text provides a complete document management and analytics environment. They have customers with documents with legal implications, operational implications and lots of legacy content.
The opportunities include:
The opportunities boil down to:
Characteristics for success
There was an interesting discussion of e-discovery systems and the value of false positives vs false negatives. I followed up with a question about whether ethics are being built in. Palmer talked about how legal issues do make a difference and are often based on ethical positions.
The future of content analytics is that "content grows, content lasts." There is opportunity around understanding content and legal requirements to manage it.
I asked Palmer at lunch if anyone has published about the history of Open Text and he mentioned a book that was published on the first 10 years. We talked a bit about what has happened to PAT and Lector. We also talked about Open Text and web searching.
Stan Matwin: Big [Text] Data - Does It Make Knowledge Obsolete?
Matwin's question is about the relationship between knowledge and data. How does machine learning learn knowledge from data, if at all? Is it possible to incorporate knowledge into machine learning? He has shown that incorporating knowledge can reduce the number of examples needed to train an ML model.
He talked about the Hegelian principle of the transition from quantity to quality. He was critical of the Mayer-Schonberger & Cukier book on big data. He believes that the simple Google idea that big data will outperform knowledge structures is now being superseded by ideas about hybrid approaches that combine data and knowledge. Bag of words has limits. He talked about word-to-vector ideas - see Google's word2vec page.
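word2vec itself trains a neural model, but the distributional idea behind it can be shown with something much cruder (a toy of my own, with an invented corpus and window size): build co-occurrence vectors and compare words by cosine similarity.

```python
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Count, for each word, which words appear within +/- window positions.
vectors = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            vectors[w][corpus[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# "cat" and "dog" occur in similar contexts, so their vectors are close.
print(round(cosine(vectors["cat"], vectors["dog"]), 2))
```

The hybrid approaches Matwin described would add knowledge on top of vectors like these rather than relying on the raw counts alone.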
There was a good question about why we have decent translation now, but not good summarization.
Friday, August 8th
Alex Lopez-Ortiz: Faster and Smaller Inverted Indices with Treaps
Alex presented a paper about weighted indexes. This touched on work that a number of people, including Charlie Clarke, have worked on.
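For readers who haven't met a treap: it is a tree whose nodes obey binary-search-tree order on one key and heap order on another. As I understood the connection to weighted inverted indices, the BST key can be the document id and the heap priority the term's weight, so the highest-weighted postings sit near the root. The code below is my own toy sketch, not the paper's structure.

```python
# A minimal treap: BST on doc_id, max-heap on weight, maintained by
# rotations on insert.

class Node:
    def __init__(self, doc_id, weight):
        self.doc_id, self.weight = doc_id, weight
        self.left = self.right = None

def insert(root, doc_id, weight):
    if root is None:
        return Node(doc_id, weight)
    if doc_id < root.doc_id:
        root.left = insert(root.left, doc_id, weight)
        if root.left.weight > root.weight:        # rotate right
            root, old = root.left, root
            old.left, root.right = root.right, old
    else:
        root.right = insert(root.right, doc_id, weight)
        if root.right.weight > root.weight:       # rotate left
            root, old = root.right, root
            old.right, root.left = root.left, old
    return root

root = None
for doc, w in [(3, 5), (1, 9), (4, 2), (2, 7)]:
    root = insert(root, doc, w)
print(root.doc_id, root.weight)  # 1 9
```

The payoff for ranked retrieval is that a top-k traversal can stop early: everything below a node has a lower weight than the node itself.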
Ken Church: More Substring Statistics
Church talked about what we can do with substrings (ngrams) from a few words to a million. You can do anything with substrings that you can do with words - create concordances, calculate frequencies and so on. He talked about suffix arrays. His example was searching for "Manuel Noriega" in AP News. He made an interesting point about the "burstiness" of words: two words might have the same frequency, but one bursts in the corpus while the other is spread evenly. "We are looking for deviations from chance." He talked about priming - how you expect "nurse" to be more likely after "doctor".
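To make the burstiness point concrete (the measure here is my own simple choice, not Church's statistic): two terms with the same total corpus frequency can look very different once you ask how they spread over documents.

```python
# Toy burstiness: average per-document count, over documents where
# the term actually appears.

def burstiness(doc_counts):
    """Mean occurrences per document among documents containing the term."""
    present = [c for c in doc_counts if c > 0]
    return sum(present) / len(present)

# Both terms occur 8 times across 8 documents overall...
spread = [1, 1, 1, 1, 1, 1, 1, 1]   # e.g., a function word
bursty = [8, 0, 0, 0, 0, 0, 0, 0]   # e.g., "Noriega" during one news story
print(burstiness(spread), burstiness(bursty))  # 1.0 8.0
```

Raw frequency treats the two rows identically; a burstiness measure separates the word that "deviates from chance" from the one that doesn't.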
Mark Chignell: Finding out What with Data and Why with Text: A Healthcare Data Mining Case Study
Mark talked about the trajectory he has taken from the HyperCard Jefferson project to health care, which brought him back to text analysis. He works with emergency physicians who have to deal with all sorts of problems, from hangnails to late-stage cancer. The systems they use are not very good: a very complex view of the world that is a mixture of text and data that is not integrated. There are attempts to create dashboards, but nothing really good that works with their interrupt-driven life. They may not even know where the patient is and don't know the status of lab tests. Too often they use Google.
There is a folklore about what doctors will or won't do that isn't based on what happens on the floor. They are very open to new tools that don't slow them down. They will be early adopters of the fast and easy.
How much data is locked up in health-care systems? Mark has looked at this. The moment you look at this, people get worried about privacy. He has looked at how you can export data that won't have privacy implications but will still be useful.
Doctors are taught in a case-based reasoning fashion. Physicians think in terms of cases. One can mine data to find similar patients and summarize them into an imagined case that doesn't represent a real person but shows what should be done. The problem is that no one is going to believe any one thing until you prove everything.
He has been working with data from Boston and trying to cluster it by patients/cases. He works closely with emergency physicians showing them results all along. Regression analysis on matching clusters provides useful predictions. Text analysis within clusters helps explain them. He believes that they can give.
Paul Turner et al.: Canadian Open Data at Work
Paul Turner from Open Text and Ray Sharma talked about the Canadian Open Data initiative. Ray started by talking about his background and then mobile apps. An interesting statistic is that app developers are now coming mostly from Asia. He showed another slide showing which operating systems allow you to make the most money (Apple). Apple is now trying to force ... Then he showed an "Entertainment cost per hour" chart. An app is cheap entertainment - 5 cents per hour compared to movies. Games drive the whole app phenomenon; they drive the app economy. The evolution of the freemium idea is interesting and exploding. In 2011 freemium companies were the top type of company being invested in. There are hundreds of models for how to make money from free.
In Ontario there are 21,000 people employed in apps. This is more than the number of people employed in games. Consumers are spending more time on mobile apps than on the web; they are surfing their apps, not the web. iOS is the most important platform for monetization.
The Canadian Open Data Experience (CODE) event was hosted by XMG and the Government of Canada. It was sponsored by Open Text, Google, and IBM. It was a hackathon with over 900 participants, 290 teams, and 110 apps. At least 3 companies have been formed and acquired. There is going to be an open data institute in Waterloo.
Government data is an asset that should be open and preserved. McKinsey has a study pegging the value of open data in the trillions. He talked about how the crowd can help industry when supported by open data. Twitter is built on the crowd.
Then we heard from some of the CODE-winning app developers, starting with New Roots, which allows new immigrants to find cities where there is work and which have desired features. The developers found all sorts of problems with the open data.
Another team of high school students, part of an Open Text dev camp, took data from the government's Canadian Termium database to create an app that mimics the original interface of the New OED.
Jose Blakeley: Analysis and Migration of Programs through Scope
Blakeley started his PhD in the early days of the New OED. He wrote an entity-relationship model for the New OED. He talked about how the New OED showed that regular relational databases didn't really work well for complex reference works like the OED.
Blakeley now works at Microsoft on a development environment for big data called Cosmos-SCOPE. This is a big project with 5000 developers and hundreds of thousands of jobs run a day. Cosmos is the underlying storage system; SCOPE is the scripting language (I think) of the distributed computation system. He gave some examples of SCOPE scripts.
They have a goal of evolving the language transparently to users. Because they have every job ever run, they can analyze what people are writing and get statistics on which language elements are used, so they can see what impact a change to the language would have. They can also automatically migrate user programs.
A neat feature is that they have a test corpus and can test automatic translations to see if they work.
One thing that is really interesting is that the nature of this language is such that they have stored every example of code written in SCOPE. What an opportunity to study the evolution of a language.
Mariano Consens: Structural Summaries of Semistructured Data or Structured Text
One thing that has changed in the data landscape is that it is no longer dominated by big vendors; there is a lot of experimentation and there are open tools. He talked about optimizations based on understanding the structure of, for example, HTML pages. Then he went back to the number of tools and the change from big vendors to a much more complex environment. There are non-SQL environments that have some SQL features.
He then talked about creating summaries from linked data. By summary he meant a summary graph.
Raymond Ng: Rhetorical Structure Analysis and its "New" Applications
He asked how we can find rhetorical structure and how to use it. There are inter-sentential and intra-sentential rhetorical structures. We need to break text up into discourse units and then create a rhetorical tree, which is different from a syntactic tree. They do sentence-level parsing and then multi-sentence parsing. The problem is that for larger documents this is not scalable. Why not use paragraphs?
They have found "leaky" sentences, where parts of a sentence are rhetorically connected to different other sentences. They have found that 5% to 12% of sentences are leaky. This seems obvious - I would expect sentences to connect to multiple other sentences.
He talked about the explosion of text, including how speech recognition is getting good enough to produce reasonable transcripts of meetings. They want to develop tools to extract, mine and summarize past and ongoing text conversations. Ng wants to summarize conversations, whether oral or written in social media. They want to allow people to search, extract, and see sentiment. They want to extract features, polarities and strengths from reviews.
He then gave an overview of what people are trying to do with sentiment analysis. Some are trying to provide reasons that explain sentiments, and Ng's work should help with this. They also want to extract summaries. Abstractive summaries are not clips from the original, but new automatically generated summaries. He gave a really impressive example of an abstractive summary. You could have a visualization of summaries and do other things with the summaries. He has a book, Methods for Mining and Summarizing Text Conversations.
Frank Tompa: Accessing Text Through Slices
Frank talked about something that is just an idea and a bit of a prototype. He wants to apply the concept of database views to text processing. A view is defined by a query, and the query/results may or may not be stored. What does that look like in text? You have a query and get some sort of subset of the text, in some order the system thinks you want. He showed something like a view from the online OED. Alas, you can't use this for a subsequent query. There are no user-defined subsets and rarely more than one order for lists.
What he wants to do is define a text slice, which is a) a subset and b) an order. He showed examples from the OED, like all the words whose first example is from August 8th, sorted by date of first example. He talked about all the sorting issues.
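My reading of the slice idea as a sketch (the entry fields and example data here are invented for illustration, not from Frank's prototype): a slice is a subset predicate plus an ordering over entries, and a new slice can be defined against an existing one.

```python
# A slice = subset predicate + ordering; slices compose because a slice
# is itself just a list of entries.

def slice_of(entries, subset, order):
    """Return the entries satisfying `subset`, sorted by `order`."""
    return sorted((e for e in entries if subset(e)), key=order)

entries = [
    {"word": "workshop", "first_use": 1562},
    {"word": "viewphone", "first_use": 1940},
    {"word": "spoon", "first_use": 1340},
]

# A slice: entries first attested after 1500, ordered by date of first use.
s1 = slice_of(entries, lambda e: e["first_use"] > 1500,
              lambda e: e["first_use"])

# A slice defined against a slice: same subset, reordered alphabetically.
s2 = slice_of(s1, lambda e: True, lambda e: e["word"])
print([e["word"] for e in s2])  # ['viewphone', 'workshop']
```

The interesting part is exactly what this toy leaves out: how to store, name, and query slices efficiently rather than recomputing them.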
The idea of slices is that one can pose queries against slices and then define new slices against slices. Then he asked what we could do with text slices that is new. Slices are connected to many other regions: an entry connects to the next entry, but "next" can mean many things, and he gave an example from WordSmith. I think he imagines being able to pivot on an item in a slice list. He then discussed the interesting research questions around slices.
That was the end of the conference.
|Page last modified on August 08, 2014, at 02:17 PM - Powered by PmWiki|