These are conference notes from Exploiting Text: A Text Research Workshop in Honour of Frank Wm. Tompa.
Note: these are being written live so there will be lots of typos and I won't cover everything. When the battery runs low or I get tired, I stop.
If you see any problems, please send me corrections.
Thursday, August 7th
John Simpson: Oxford's Big Break: Frank Tompa and the OED
Simpson was the editor of the OED during the important shift to digital form. The digitization led to the publication of the second edition and all sorts of other projects. Simpson reminisced about the digitization of the OED.
Simpson started with some history of the OED. The founding editor was Dr. Murray who died working on the letter T. He showed a picture of Murray in his gown and study.
He then led us through an entry, in this case for "workshop". Then he talked about how the OED got involved with U of Waterloo. The problem at Oxford University Press was that they knew they needed to look at computerization, but had no one locally to help. IBM came on board early, as the OED was seen as a major (big) dataset. Doug Wright (?), the President of Waterloo, saw an opportunity to build the reputation of computer science. The OED was also pushing the limits of SGML. Simpson ended up spending a semester in Kitchener working out how to structure the data, which at that time was really big. By the end he was sure that computerization would work.
By the end of the 1990s it was clear that the web was changing not only how the OED could be published but also how language could be tracked. They are working on a new project (OED 3?) that is revising the complete text of the OED. He showed some videos, including Revising the OED: Potato Salad. They can now search all sorts of historical dictionaries and full-text databases.
Taking the OED online not only makes it easier for users to consult, but also allows users to access data in different ways, especially visually. He talked about the Glasgow Historical Thesaurus of the OED and how they have connected the Thesaurus data to the online OED. There were problems because the Thesaurus was based on an older OED. He showed the historical Thesaurus entry for a sense of spoon. You can see the ontology of the Thesaurus and navigate by it.
Then he showed the timelines. You can see the distribution of words from Canada over time. You can see when words came in from different language communities like Inuit.
They are also worrying about the usual publishing issues, like whether the OED should be available free. He talked about how they are working with different communities, like the science fiction community, to update their entries for words like "viewphone." Scifi enthusiasts have been finding earlier citations than those the staff had found. They are also linking to other dictionaries, like the Dictionary of Old English, and they are working on mobile interfaces.
He ended with an animated visualization, The OED in Two Minutes, that shows words coming into English from other languages.
Some interesting questions were asked about:
Ian Munro: Succinct Data Structures for Text, Graphs and Other Stuff
Munro talked about what he learned from Tompa. "Mind your grammar" - he learned that you should use grammars for "nonstandard things". He also talked about suffix trees to speed up search and deal with storage issues. The general problem is representing structural information in as little room as possible. Then there are questions about how to insert information into such structures. With big text data there are problems of compressing the text and then operating on the compressed data.
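To give a flavour of the succinct data structure idea (this is my own toy sketch, not one of Munro's actual structures): instead of storing explicit positions of 1-bits, you keep a bitvector plus a small amount of precomputed block data, and answer rank queries from those.

```python
# Toy succinct-style rank: precompute 1-counts at block boundaries so
# rank queries only need a short scan inside one block.

def build_rank_blocks(bits, block=8):
    """Precompute the number of 1s before each block boundary."""
    counts, total = [], 0
    for i, b in enumerate(bits):
        if i % block == 0:
            counts.append(total)
        total += b
    return counts

def rank1(bits, counts, i, block=8):
    """Number of 1s in bits[0:i]: block count plus a scan within the block."""
    start = (i // block) * block
    return counts[i // block] + sum(bits[start:i])

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
counts = build_rank_blocks(bits)
print(rank1(bits, counts, 9))  # 5
```

Real succinct structures get this down to o(n) extra bits with constant-time queries; the sketch only shows the space/precomputation trade-off.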
Alastair Moffat: External Suffix Arrays for Large-Scale String Search
Moffat explained what a suffix array is and how it can be used to search for patterns. A suffix array of "she_sells_shells" is an array of the starting positions of all of its suffixes, from the full string down to the final "s", sorted lexicographically. He talked about efficiencies when working with indexes so large that they can't fit in memory. He described a recommended way of indexing and searching large strings.
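A minimal in-memory version of the idea on Moffat's example string (my own sketch of the standard technique, not his external-memory method): sort the suffix start positions, then binary-search for a pattern, since all matching suffixes are contiguous in the array.

```python
# Suffix array of "she_sells_shells" plus pattern search by lower-bound
# binary search over the sorted suffixes.

text = "she_sells_shells"
sa = sorted(range(len(text)), key=lambda i: text[i:])

def find(pattern):
    """Return the sorted start positions of all occurrences of pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound in the suffix array
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in sa[lo:]:                   # matches are contiguous in sa
        if text[i:i + len(pattern)] != pattern:
            break
        hits.append(i)
    return sorted(hits)

print(find("sh"))  # [0, 10]
```

The external-memory problem Moffat described starts exactly here: once `text` and `sa` no longer fit in RAM, each comparison in the binary search can cost a disk seek.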
Michael Lesk: From Searching to Researching
Lesk started with an anecdote about how the OED came to
His talk focused on three phases of digital scholarship.
Then he compared progress in different media. Image handling seems to be about 10 years behind text, and video another 10 years behind that. The average 19th-century book has been scanned half a dozen times thanks to all the book scanners.
People complain that digitization has changed how we read and is dumbing us down. This is an old story - Plato complained about writing.
He talked about tracking ideas and mentioned Schilit and Kolak's "Exploring a Digital Library through Key Ideas" - they were doing this without referencing PAT trees. He gave examples from the Google Ngram Viewer.
Then he talked about sense disambiguation. Words change over time. He talked about "train" and "engine" and how they have drifted.
He asked why digitizing all this text hasn't rescued more obscure authors. He asked if it is true that "big data beats better algorithms". He is worried that we will get too much triviality.
Searching is now the province of machines and digital texts. Reading is increasingly online. Amazon is stomping on the paper publishers of general books. Research is moving to algorithms with authorship studies, stylistic analysis, network analysis and crowdsourcing supplementing, but not yet replacing, traditional criticism.
Geoffrey Rockwell: On the Archaeology of Text Tools
I gave a talk about studying early text tools with some examples.
Charlie Clarke: Time Well Spent
Clarke talked about how we evaluate information access systems. To evaluate systems we have to base evaluation on users. There are two ways to do that: users in the wild (A/B testing, mouse movements ...) and users in a lab (eye tracking, think aloud ...). The problem with lab studies is that they are slow and conditions can't be exactly replicated.
He then talked about rankers for text tools. We don't know if these ranking systems actually tell us what is useful to users. We want to reflect meaningful units to users. The best query system should be like a newspaper - it shows you what you should read, starting with a short intro that expands.
He proposed measuring time well spent versus time spent. He talked about simulated users: he can produce a distribution of different user types and simulate the same type of user on different systems so as to compare them.
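A quick sketch of what I took the simulated-user idea to mean (the model, numbers and stopping rule here are my own invention, not Clarke's): a user scans a ranked list top-down, spends time on each result, and may stop after any one; "time well spent" is the fraction of time spent on relevant results, and the same simulated user can be run against different rankings.

```python
import random

def simulate(ranking, relevant, read_time=30, stop_prob=0.3,
             trials=2000, seed=0):
    """Fraction of simulated reading time spent on relevant documents."""
    rng = random.Random(seed)
    total = well_spent = 0
    for _ in range(trials):
        for doc in ranking:
            total += read_time
            if doc in relevant:
                well_spent += read_time
            if rng.random() < stop_prob:  # user may abandon the list
                break
    return well_spent / total

relevant = {"d1", "d4"}
good = ["d1", "d4", "d2", "d3"]   # relevant documents ranked first
bad = ["d2", "d3", "d1", "d4"]    # relevant documents ranked last
print(round(simulate(good, relevant), 2), round(simulate(bad, relevant), 2))
```

Because the same seeded user population is replayed on both systems, the comparison is replicable in a way that live user studies aren't.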
Evangelos Milios: Interactive Term-Supervised Text Document Clustering
Milios talked about visual text analysis and interactive text analysis. The normal view is that visualization is a way of presenting the results of text mining. If you introduce a user, then the user begins to interact with the text-mining back end to change the visualization. The user is typically not a data scientist but a domain expert.
Interactive document clustering is interesting because typically there is little metadata of any quality, and unsupervised clustering isn't satisfactory for domain experts. Users want to interact with the clustering to improve what they get back. Rather than improving the algorithm, they want to let users move terms around. They focus on terms. Their algorithm is Lexical Double Clustering.
He talked about the limits of the "bag of words" (BOW) model. We need a more abstract conceptual representation. We can use external knowledge sources (like WordNet and Wikipedia). WordNet is limited as it doesn't cover named entities. He talked about how they take advantage of Wikipedia to get a "bag of concepts" (BOC) method. He talked about "wikifying" documents; not sure what that is. BOW, however, works better than BOC alone, so they tried combining them.
He talked about the Sunflower system that they have developed, which is more concept-centred. To evaluate interactive algorithms you need to study users differently, as you need to look at what they do in their environment and over time.
Jody Palmer: Opportunities for Content Analytics
Palmer is at Open Text and talked about what can be done with analytics and the challenges. Content analytics is the purposeful analysis of content. There are all sorts of business needs, from policy automation, customer service, and search support to criminal tracking and so on. He talked about what analysis is and then what content is. Open Text can pull content from calendars, chat, bug postings, discussions and so on. Some of the things they do are concept extraction and entity extraction. He talked about "crisp rules" based on metadata that the user can edit.
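My guess at what an editable metadata-based "crisp rule" might look like (the rule names and metadata fields are my own invention, not Open Text's actual rule language): each rule is a predicate over a document's metadata plus a category to assign, evaluated in order.

```python
# Crisp rules: first matching predicate over document metadata wins.

rules = [
    ("legal-hold", lambda m: m.get("department") == "Legal"),
    ("archive", lambda m: m.get("age_years", 0) > 7),
    ("general", lambda m: True),  # fallback category
]

def categorize(metadata):
    """Return the first category whose rule matches the metadata."""
    return next(cat for cat, test in rules if test(metadata))

print(categorize({"department": "Legal", "age_years": 2}))   # legal-hold
print(categorize({"department": "Sales", "age_years": 10}))  # archive
```

Because the rules are data rather than code, a non-programmer could plausibly edit the list, which seems to be the point Palmer was making.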
They provide faceted search over concepts, entities and tone. They support companies like Evolve24 that provide sentiment analysis to other companies about events and products.
Categorization drives a lot of other stuff. It can help automate records management. It can act on legacy data. Categorization is important as the volume of data is too large to categorize manually.
It sounds like Open Text provides a complete document management and analytics environment. They have customers with documents with legal implications, operational implications and lots of legacy content.
The opportunities include:
The opportunities boil down to:
Characteristics for success
There was an interesting discussion of e-discovery systems and the value of false positives vs false negatives. I followed up with a question about whether ethics are being built in. Palmer talked about how legal issues do make a difference and are often based on ethical positions.
The future of content analytics is that "content grows, content lasts." There is opportunity around understanding content and legal requirements to manage it.
I asked Palmer at lunch if anyone has published about the history of Open Text and he mentioned a book that was published on the first 10 years. We talked a bit about what has happened to PAT and Lector. We also talked about Open Text and web searching.
Stan Matwin: Big [Text] Data - Does It Make Knowledge Obsolete?
Matwin's question is about the relationship between knowledge and data. How does machine learning learn knowledge from data, if at all? Is it possible to incorporate knowledge into machine learning? He has shown that incorporating knowledge can reduce the number of examples needed to train an ML model.
He talked about the Hegelian principle of the transition from quantity to quality. He was critical of the Mayer-Schonberger & Cukier book on big data. He believes that the simple Google idea that big data will outperform knowledge structures is now being superseded by ideas about hybrid approaches that combine data and knowledge. Bag of words has limits. He talked about word-to-vector ideas - see Google's word2vec page.
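word2vec itself trains a neural model, but the distributional idea behind it can be shown with something much cruder (a toy of my own, with an invented corpus and window size): build co-occurrence vectors and compare words by cosine similarity.

```python
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Count, for each word, which words appear within +/- window positions.
vectors = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            vectors[w][corpus[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# "cat" and "dog" occur in similar contexts, so their vectors are close.
print(round(cosine(vectors["cat"], vectors["dog"]), 2))
```

The hybrid approaches Matwin described would add knowledge on top of vectors like these rather than relying on the raw counts alone.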
There was a good question about why we have decent translation now, but not good summarization.
Friday, August 8th
Alex Lopez-Ortiz: Faster and Smaller Inverted Indices with Treaps
Alex presented a paper about weighted indexes. This touched on work that a number of people, including Charlie Clarke, have worked on.
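For readers who haven't met a treap: it is a tree whose nodes obey binary-search-tree order on one key and heap order on another. As I understood the connection to weighted inverted indices, the BST key can be the document id and the heap priority the term's weight, so the highest-weighted postings sit near the root. The code below is my own toy sketch, not the paper's structure.

```python
# A minimal treap: BST on doc_id, max-heap on weight, maintained by
# rotations on insert.

class Node:
    def __init__(self, doc_id, weight):
        self.doc_id, self.weight = doc_id, weight
        self.left = self.right = None

def insert(root, doc_id, weight):
    if root is None:
        return Node(doc_id, weight)
    if doc_id < root.doc_id:
        root.left = insert(root.left, doc_id, weight)
        if root.left.weight > root.weight:        # rotate right
            root, old = root.left, root
            old.left, root.right = root.right, old
    else:
        root.right = insert(root.right, doc_id, weight)
        if root.right.weight > root.weight:       # rotate left
            root, old = root.right, root
            old.right, root.left = root.left, old
    return root

root = None
for doc, w in [(3, 5), (1, 9), (4, 2), (2, 7)]:
    root = insert(root, doc, w)
print(root.doc_id, root.weight)  # 1 9
```

The payoff for ranked retrieval is that a top-k traversal can stop early: everything below a node has a lower weight than the node itself.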
Ken Church: More Substring Statistics
Church talked about what we can do with substrings (ngrams) from a few words to a million. You can do anything with substrings that you can do with words - create concordances, calculate frequencies and so on. He talked about suffix arrays. His example was searching for "Manuel Noriega" in AP News. He made an interesting point about the "burstiness" of words: two words might have the same frequency, but one bursts in the corpus while the other is spread evenly. "We are looking for deviations from chance." He talked about priming - how you expect "nurse" to be more likely after "doctor".
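To make the burstiness point concrete (the measure here is my own simple choice, not Church's statistic): two terms with the same total corpus frequency can look very different once you ask how they spread over documents.

```python
# Toy burstiness: average per-document count, over documents where
# the term actually appears.

def burstiness(doc_counts):
    """Mean occurrences per document among documents containing the term."""
    present = [c for c in doc_counts if c > 0]
    return sum(present) / len(present)

# Both terms occur 8 times across 8 documents overall...
spread = [1, 1, 1, 1, 1, 1, 1, 1]   # e.g., a function word
bursty = [8, 0, 0, 0, 0, 0, 0, 0]   # e.g., "Noriega" during one news story
print(burstiness(spread), burstiness(bursty))  # 1.0 8.0
```

Raw frequency treats the two rows identically; a burstiness measure separates the word that "deviates from chance" from the one that doesn't.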
Mark Chignell: Finding out What with Data and Why with Text: A Healthcare Data Mining Case Study
Mark talked about the trajectory he has taken from the HyperCard Jefferson project to health care, which brought him back to text analysis. He works with emergency physicians who have to deal with all sorts of problems, from hangnails to late-stage cancer. The systems they use are not very good: a very complex view of the world that is a mixture of text and data that is not integrated. There are attempts to create dashboards, but nothing really good that works with their interrupt-driven life. They may not even know where the patient is and don't know the status of lab tests. Too often they use Google.
There is a folklore about what doctors will or won't do that isn't based on what happens on the floor. They are very open to new tools that don't slow them down. They will be early adopters of the fast and easy.
How much data is locked up in health-care systems? Mark has looked at this. The moment you look at this, people get worried about privacy. He has looked at how you can export data that won't have privacy implications but will still be useful.
Doctors are taught in a case-based reasoning fashion. Physicians think in terms of cases. One can mine data to find similar patients and summarize them into an imagined case that doesn't represent a real person but shows what should be done. The problem is that no one is going to believe any one thing until you prove everything.
He has been working with data from Boston and trying to cluster it by patients/cases. He works closely with emergency physicians showing them results all along. Regression analysis on matching clusters provides useful predictions. Text analysis within clusters helps explain them. He believes that they can give.
Paul Turner et al.: Canadian Open Data at Work
Paul Turner from Open Text and Ray Sharma talked about the Canadian Open Data initiative. Ray started by talking about his background and then mobile apps. An interesting statistic is that app developers are now coming mostly from Asia. He showed another slide showing which operating systems allow you to make the most money (Apple). Apple is now trying to force ... Then he showed an "Entertainment cost per hour" chart. An app is cheap entertainment - 5 cents per hour compared to movies. Games drive the whole app phenomenon; they drive the app economy. The evolution of the freemium idea is interesting and exploding. In 2011 freemium companies were the top type of company being invested in. There are hundreds of models for how to make money from free.
In Ontario there are 21,000 people employed in apps. This is more than the number of people employed in games. Consumers are spending more time on mobile apps than on the web; they are surfing their apps, not the web. iOS is the most important platform for monetization.
The Canadian Open Data Experience (CODE) event was hosted by XMG and the Government of Canada. It was sponsored by Open Text, Google, and IBM. It was a hackathon with over 900 participants, 290 teams, and 110 apps. At least 3 companies have been formed and acquired. There is going to be an open data institute in Waterloo.
Government data is an asset that should be open and preserved. McKinsey has a study pegging the value of open data in the trillions. He talked about how the crowd can help industry when supported by open data. Twitter is built on the crowd.
Then we heard from some of the CODE-winning app developers, starting with New Roots, which allows new immigrants to find cities where there is work and which have desired features. The developers found all sorts of problems with the open data.
Another team of high school students, part of an Open Text dev camp, took data from the government's Canadian Termium database to create an app that mimics the original interface of the New OED.
Jose Blakeley: Analysis and Migration of Programs through Scope
Blakeley started his PhD in the early days of the New OED. He wrote an entity-relationship model for the New OED. He talked about how the New OED showed that regular relational databases didn't really work well for complex reference works like the OED.
Blakeley now works at Microsoft on a development environment for big data called Cosmos-SCOPE. This is a big project with 5000 developers and hundreds of thousands of jobs run a day. Cosmos is the underlying storage system; SCOPE is the scripting language (I think) of the distributed computation system. He gave some examples of SCOPE scripts.
They have a goal of evolving the language transparently to users. Because they have every job ever run, they can analyze what people are writing and get statistics on which language elements are used, so they can see what impact a change to the language would have. They can also automatically migrate user programs.
A neat feature is that they have a test corpus and can test automatic translations to see if they work.
One thing that is really interesting is that the nature of this language is such that they have stored every example of code written in SCOPE. What an opportunity to study the evolution of a language.
Mariano Consens: Structural Summaries of Semistructured Data or Structured Text
One thing that has changed in the data landscape is that it is no longer dominated by big vendors; there is a lot of experimentation and there are open tools. He talked about optimizations based on understanding the structure of, for example, HTML pages. Then he went back to the number of tools and the change from big vendors to a much more complex environment. There are non-SQL environments that have some SQL features.
He then talked about creating summaries from linked data. By summary he meant a summary graph.
Raymond Ng: Rhetorical Structure Analysis and its "New" Applications
He asked how we can find rhetorical structure and how to use it. There are inter-sentential and intra-sentential rhetorical structures. We need to break text up into discourse units and then create a rhetorical tree, which is different from a syntactic tree. They do sentence-level parsing and then multi-sentence parsing. The problem is that for larger documents this is not scalable. Why not use paragraphs?
They have found "leaky" sentences, where parts of a sentence are rhetorically connected to different other sentences. They have found that 5% to 12% of sentences are leaky. This seems obvious - I would expect sentences to connect to multiple other sentences.
He talked about the explosion of text, including how speech recognition is getting good enough to produce reasonable transcripts of meetings. They want to develop tools to extract, mine and summarize past and ongoing text conversations. Ng wants to summarize conversations, whether oral or written in social media. They want to allow people to search, extract, and see sentiment. They want to extract features, polarities and strengths from reviews.
He then gave an overview of what people are trying to do with sentiment analysis. Some are trying to provide reasons that explain sentiments, and Ng's work should help with this. They also want to extract summaries. Abstractive summaries are not clips from the original, but new automatically generated summaries. He gave a really impressive example of an abstractive summary. You could have a visualization of summaries and do other things with the summaries. He has a book, Methods for Mining and Summarizing Text Conversations.
Frank Tompa: Accessing Text Through Slices
Frank talked about something that is just an idea and a bit of a prototype. He wants to apply the concept of database views to text processing. A view is defined by a query, and the query/results may or may not be stored. What does that look like in text? You have a query and get some sort of subset of the text, in some order the system thinks you want. He showed something like a view from the online OED. Alas, you can't use this for a subsequent query. There are no user-defined subsets and rarely more than one order for lists.
What he wants to do is define a text slice, which is a) a subset and b) an order. He showed examples from the OED, like all the words whose first example is from August 8th, sorted by date of first example. He talked about all the sorting issues.
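My reading of the slice idea as a sketch (the entry fields and example data here are invented for illustration, not from Frank's prototype): a slice is a subset predicate plus an ordering over entries, and a new slice can be defined against an existing one.

```python
# A slice = subset predicate + ordering; slices compose because a slice
# is itself just a list of entries.

def slice_of(entries, subset, order):
    """Return the entries satisfying `subset`, sorted by `order`."""
    return sorted((e for e in entries if subset(e)), key=order)

entries = [
    {"word": "workshop", "first_use": 1562},
    {"word": "viewphone", "first_use": 1940},
    {"word": "spoon", "first_use": 1340},
]

# A slice: entries first attested after 1500, ordered by date of first use.
s1 = slice_of(entries, lambda e: e["first_use"] > 1500,
              lambda e: e["first_use"])

# A slice defined against a slice: same subset, reordered alphabetically.
s2 = slice_of(s1, lambda e: True, lambda e: e["word"])
print([e["word"] for e in s2])  # ['viewphone', 'workshop']
```

The interesting part is exactly what this toy leaves out: how to store, name, and query slices efficiently rather than recomputing them.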
The idea of slices is that one can pose queries against slices and then define new slices against slices. Then he asked what we could do with text slices that is new. Slices are connected to many other regions: an entry connects to the next entry, but "next" can mean many things, and he gave an example from WordSmith. I think he imagines being able to pivot on an item in a slice list. He then discussed the interesting research questions around slices.
That was the end of the conference.
|Page last modified on August 08, 2014, at 02:17 PM - Powered by PmWiki|