
Text Mining The Novel 2015

These are some general notes from the October 22nd-23rd meeting of the novelTM team. To be honest they don't reflect the quality of the conversation. Andrew Piper, the novelTM leader, had us all submit papers beforehand so we didn't waste time listening to papers being read out.

Ted Underwood: The Rhythms of Genre

Types of Methods: We reflected on the types of methods we were seeing:

  • Supervised methods where you start with a list of works in a genre and interrogate the traditions
  • Unsupervised methods where you don't start with a definition
  • Multilingual methods
  • Using features - using word lists - Most Distinctive Words?

Word Selection: We reflected on how all these methods depend on bags of words. For example, the 10,000 word selection - do you want a large set of words or do you want a small high-frequency set? With a large set you could get low-frequency words throwing things off.
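To make the trade-off concrete, here is a minimal sketch of frequency-based word selection. It is not taken from any of the papers discussed; the function names (`tokenize`, `select_vocabulary`) and the `min_docs` cutoff are assumptions added for illustration. The idea is that a document-frequency cutoff is one simple way to keep rare words from entering a large feature set.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def select_vocabulary(texts, size, min_docs=1):
    """Return up to `size` of the most frequent words.

    Only words that occur in at least `min_docs` texts are eligible
    (hypothetical parameter, not from the meeting notes).
    """
    term_counts = Counter()
    doc_counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per text
    eligible = [(w, c) for w, c in term_counts.items() if doc_counts[w] >= min_docs]
    # Sort by descending frequency, then alphabetically for a stable order.
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in eligible[:size]]
```

Calling `select_vocabulary(texts, 10000)` would give the large set; a small `size` gives the high-frequency set, and raising `min_docs` trims words that appear in only a few texts.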

Is the representation of the text by unigrams part of the problem? Should we think of n-grams? What other features would we want to follow?
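As a rough illustration of what moving beyond unigrams would mean, the sketch below counts n-grams up to a chosen length. The names `ngrams` and `ngram_counts` are hypothetical, and this is only one simple way to build such features.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=2):
    """Count every n-gram from unigrams up to `max_n`-grams."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts
```

With `max_n=1` this reduces to the ordinary bag of words; larger values keep some local word order at the cost of a much bigger and sparser feature set.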

Explaining text mining: How do our papers convince our colleagues or not? On the one hand our colleagues can understand word lists. On the other hand it seems reductive. What about philosophical argument for philosophers? Could the poor showing of philosophers in DH be due to the strong bag-of-words approach?

A lot of these papers seem to point to signalling in the market - titles, people intending to write for a market, people picking works similar to ... these seem to be ways authors/publishers signal to their consumers what they will get. How can we study that?

Matt Erlin: "From the Ideational to the Epistemic Novel?"

We discussed the issue of how he is using topic modelling. His topics didn't seem to be epistemological - only one did. I was also not convinced that the texts he chose were purely epistemological - they were important to epistemology, but not what I would topic model to get epistemological (as opposed to philosophical) topics. What was interesting was how he topic modelled all the texts (the sample of philosophy texts and the novels) and then found the epistemological topics from within a larger list. Then he traced those topics back through the novels.

Again there are issues of how to convince people reading our research.

This paper also raised a question on dictionary construction.

Matt Jockers: Genre, Gender, and Character in the 19th Century Novel

We discussed how it was disappointing to see gender stereotypes reinforced. How can we tease out subtler changes in the 19th century?

If the general trend is repeating the stereotypical trends, the outliers are then what is interesting. Could we look at specific subsets like sensation fiction where we might see more women with agency?

Could we think of other ways of performing femininity and masculinity? Could we play with the grey areas? Could we look at the context of the pronouns, as in look at pronouns in dialogue?

Laura Mandell and Nigel Lepianka: "Discovering Plot Structures through Topic Modelling and Supervised Topic


We talked about the hermeneutical uses of topic modelling. We need to do more cleaning - try to focus on nouns. The data makes a big difference to the topics. If you focus on verbs you get tense clusters. They used supervised LDA in ways I want to understand.

Chunking makes a difference when topic modelling. If you do whole documents you are less likely to get themes.
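A minimal sketch of the chunking step, under assumptions of my own: the notes do not record what chunk size or tail handling anyone used, so `chunk_size` and `min_tail` below are hypothetical parameters. Each chunk would then be fed to the topic model as its own "document".

```python
def chunk_tokens(tokens, chunk_size, min_tail=None):
    """Split a token list into consecutive chunks of `chunk_size` tokens.

    A final chunk shorter than `min_tail` (default: half a chunk) is
    merged into the previous chunk rather than kept as a tiny document.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if min_tail is None:
        min_tail = chunk_size // 2
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks
```

Treating a whole novel as one chunk corresponds to setting `chunk_size` to the length of the text, which is the case where themes were said to be less likely to emerge.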

Ben Schmidt's plot arcs could be an interesting way of looking at structure in texts.

We talked about how much optimization takes place. There is a level to which optimization takes place in big data. But in big data methods the optimization can be made transparent, open, recapitulated and challenged.

Mark Algee-Hewitt: "The Taxonomy of Titles"

Mark used titles in an interesting way to show that there were genre hints in the titles like "history" or "tale." Another issue was the use of MDW (Most Distinctive Words), which seemed to work better than high-frequency words. A third issue that came up was the hand sorting of the MDW into the categories that they saw and that interested them.

This again raised a question of how we deal with word lists. What is the role of expert categorizing?
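These notes do not record which distinctiveness measure the paper used, so the following is only an illustration of the general idea, not Mark's method. It ranks words by a smoothed log-ratio of their relative frequency in a target set against a reference set; the function name `distinctive_words` and the `smoothing` parameter are assumptions.

```python
import math
from collections import Counter


def distinctive_words(target_tokens, reference_tokens, top=10, smoothing=1.0):
    """Rank words by log(target rate / reference rate), most distinctive first.

    Add-`smoothing` smoothing keeps words missing from one side from
    producing a division by zero. This is one simple measure among many.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    vocab = set(target) | set(reference)
    t_total = sum(target.values()) + smoothing * len(vocab)
    r_total = sum(reference.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        t_rate = (target[word] + smoothing) / t_total
        r_rate = (reference[word] + smoothing) / r_total
        scores[word] = math.log(t_rate / r_rate)
    # Highest score first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top]
```

The ranked list is where the hand sorting would start: the measure proposes candidates, and the expert decides which categories they belong to.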

Matt Wilkens: "Genre, Computation, and the Weird Canonicity of Recently Dead White Men"

We discussed the feature selection. Two were not textual, like gender of author and date of publication. What else could we use? What extratextual features should be used?

How do we narrate the statistics like the PCA?

My view is:

  • We want to have essays aimed at humanists
  • We want to have notebooks or code for people who want to try our methods. This should be discoverable in TAPoR
  • We want to have data somewhere where it can be used by others

David Bamman: "A Bayesian Mixed Effects Model of Literary Characters"

We had a great workshop on how to open up topic modelling to think about character.

Day 2

Andrew Piper summarized the first day. Who are we talking to? What are we returning to the community?

What are our reporting standards? Many disciplines develop standards and practices for sharing research.

What is the relationship between significance, effect size, and classifier?

We keep ending up with non-semantic objects, visualizations, topic lists, and so on. To what extent is novelTM focusing on a particular way of reporting? How will we share data?

There was a great discussion about whether we want to engage the theoretical positions about literature. How does the statistical work we are doing challenge theoretical positions about fictionality?

Hardik Vala: "Building an Interface for the Reliable Extraction of Social Networks in Novels"

Vala showed how they have developed an annotated text that can be used as a baseline for testing automatic taggers. The steps were:

  • Human tagging of aliases - they got really good human reliability among taggers
  • Human resolution of aliases (who "he" is) - also good precision
  • Human interaction tagging, where the taggers tagged the interactions between characters (X agreed with Y) - here the precision dropped
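The notes do not say which agreement statistic was reported for the taggers. As an assumption for illustration only, here is a sketch of Cohen's kappa, a common chance-corrected agreement measure for two annotators labelling the same items; the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

A value near 1 would match the "really good" reliability on alias tagging, while lower values would be what one expects for harder judgments such as interaction tagging.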

He has sharable code coming:

See Vala et al. "Mr. Bennet, his coachman, and the Archbishop walk into a bar ..."

Planning Meeting

We discussed ways forward. Some points made:

  • We want to take the momentum of meetings and gathered papers and produce an issue in a journal. We may be working with a new journal.
  • We need a data management plan
  • We want to imagine larger interventions - how the publications, tools, data would intersect.
  • We also want to make way for projects that

Next year the theme is identity. We need to learn more about how social media are mined for identity.




Page last modified on October 24, 2015, at 01:50 PM