These are my notes about the Linked Infrastructure For Networked Cultural Scholarship Team Meeting in Banff, Alberta.

Note that these notes are erratic. I write when the battery is charged and when things make sense.

There are official notes at https://tinyurl.com/lincsnotes

Saturday, Sept. 14, 2019

Peter Patel-Schneider

Patel-Schneider started us off. He now works for Samsung Research on semantic web technologies.

He talked about how there are commercial knowledge graphs, like the Google KG that ingests everything.

Freebase (not any more)
DBpedia
Schema.org
Wikidata
Semantic Web (not really a dataset)
Linked Data Cloud (very diverse, but shows the idea)

Wikidata is a Knowledge Graph full of triple facts. It also has ranks (facts can be marked as deprecated or preferred). It has qualifiers (things attached to facts, as in "in the year 1897 the population of Berlin was X", where the year is a qualifier). There are references (that is, where the information came from). There is a simple ontology language (RDFS). There are lots of tools and there is the culture of Wikipedia.
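To make the statement model concrete for myself, here is a minimal sketch in Python of a fact with a rank, qualifiers and references. This is my own simplification and not the actual Wikibase data model or API: the class, the field names and the helper best_statements are hypothetical, and the population numbers are placeholders, not real figures.

```python
from dataclasses import dataclass, field

# Lower number = stronger rank. Deprecated statements are never returned.
RANK_ORDER = {"preferred": 0, "normal": 1, "deprecated": 2}


@dataclass
class Statement:
    subject: str
    prop: str
    value: object
    rank: str = "normal"
    qualifiers: dict = field(default_factory=dict)  # e.g. the year a value holds for
    references: list = field(default_factory=list)  # where the information came from


def best_statements(statements, subject, prop):
    """Return the preferred statements if any exist, otherwise the normal ones."""
    matching = [
        s for s in statements
        if s.subject == subject and s.prop == prop and s.rank != "deprecated"
    ]
    if not matching:
        return []
    top = min(RANK_ORDER[s.rank] for s in matching)
    return [s for s in matching if RANK_ORDER[s.rank] == top]


# Placeholder values only; the notes give the 1897 figure as "X".
statements = [
    Statement("Berlin", "population", 1000, rank="normal",
              qualifiers={"point in time": 1897},
              references=["hypothetical source A"]),
    Statement("Berlin", "population", 2000, rank="preferred",
              qualifiers={"point in time": 2019},
              references=["hypothetical source B"]),
    Statement("Berlin", "population", 999, rank="deprecated",
              qualifiers={"point in time": 1897}),
]

best = best_statements(statements, "Berlin", "population")
```

The point of the sketch is that a "fact" is more than a triple: the same subject and property can carry several values, and the rank and qualifiers tell you which one to use and when it held.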

In his opinion it is big but sprawling. He showed a demo using the entry for Douglas Adams. He commented on the complexity and issues in Wikidata. It is multilingual, but that means there are special ID numbers.

Wikidata gets used in Wikipedias. It isn't RDF, but there is an official RDF dump so it fits in the Linked Open Data Cloud.

The job at hand is to:

Find a good way to represent cultural information
Get it to fit in Wikidata and other collections
Write some software to do representation and reasoning
Do things over and over

He then talked about Powerful Ontologies. If you use powerful ontology languages like OWL that support inference and so on, you get a good representation, but you have to figure out the tokens/nodes.

He then got to what he wanted to say, which is how we can develop a logic that captures the intuitions in data like Wikidata. If we can build a useful logic then we can do useful reasoning.

I asked whether it was possible to have one logic. The history of philosophy seems to be one of discovering how good ideas about logic go wrong. He answered by saying that we have two different problems:

Choosing a logic (type of logic)
Choosing the objects in the ontology - he said there is always a sloppy set of edge cases

My sense is that the humanities are those disciplines that deal with the idiographic (the unique cases and exceptions), not the nomothetic (regular, lawful cases). We therefore often focus on exactly what won't work with a logic. It gets even worse in that we want to compare the idiographic to the nomothetic. We want an exception to logic and a standard logic too. We want to compare the exceptional epistemology to whatever passes as everyday epistemology.

My intuition is that

Peter mentioned Cyc as a project that has spent 30 years developing a master ontology of common knowledge, which is still incomplete.

Susan Brown: LINCS Overview

Susan Brown introduced us to the project and the team meeting.

LINCS was imagined to do things like

converting data
enhancing data
creating software
linking data
research on linked data
training people

We are a $5 million Cyberinfrastructure project (CFI and partner funding). It lasts 3 years with 6 university partners. There are 48 humanities researchers and an additional technical team. We anticipate training 200 HQP (highly qualified personnel).

She showed a number of useful visualizations of the project from different perspectives. She showed a flow chart that captures what the project has to do:

We take source datasets,
Convert them into linked data,
Which is stored (and therefore published to be accessible),
So that it can be accessed or searched or inferences can be drawn
Which means lots of interfaces
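The first steps of that flow (source data, conversion, storage, access) can be sketched in a few lines of Python. This is only my toy version of the idea and not anything LINCS has built: the CSV columns, the example.org namespace and the function names convert and search are all made up for illustration.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, not a LINCS URI


def convert(csv_text):
    """Turn each non-empty cell of a flat table into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + row["id"]
        for column, value in row.items():
            if column != "id" and value:
                triples.add((subject, BASE + column, value))
    return triples


def search(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Invented source dataset with one missing value.
source = "id,name,birthplace\np1,Example Author,Example Town\np2,Another Author,\n"

store = convert(source)                        # convert and store
hits = search(store, p=BASE + "birthplace")    # access
```

Even in a toy like this the maintenance question shows up: the converter has decided how to name things and what to do with an empty cell, and someone has to own those decisions.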

How will this be maintained?

She then showed a systems diagram with data sources at the bottom and flowing up to interfaces.

We then talked about the Project Charter and Membership Guidelines. A good bibliographic reference is Ruecker, S. and M. Radzikowska (2008). The Iterative Design of a Project Charter for Interdisciplinary Research. Designing Interactive Systems, ACM Press.

We talked about what has to be done and who has to do it. A lot of this came from the proposal and the letters of agreement.

She mentioned LOUD (Linked Open Usable Data) as what we want.

We were reminded that we want to administer ourselves in a respectful and non-hierarchical fashion.

Research questions panel: Deanna Reder, Janelle Jenstad, Diane Jakacki, Stacy Allison-Cassin, Jon Bath

Stacy Allison-Cassin talked about the challenges of abstract ideas in libraries, like what a "work" is. She talked about how this is especially a challenge in music. She works with the Mariposa Folk Festival to maintain an archive. Janelle Jenstad talked about the projects she is involved with, like the Map of Early Modern London. Jon Bath at the U of Saskatchewan started by talking about building silos. He now wants to stop building silos. Deanna Reder at SFU is the PI of The People and the Text. It is a project about Indigenous literatures widely defined. Diane Jakacki is the lead of REED London, which is using CWRC.

They then talked about what questions LINCS might help them answer:

Any resource in the humanities ends up generating questions that cross disciplines. Linked data might help connect to different perspectives on the same people, places, events and so on. Some of the questions LINCS should facilitate:
We don't always know if our data is correct. Linked Open Data (LOD) lets us check things.
How do we know if our data is valuable to others?
Where should our data live? How will it survive?
What can we learn about openness and the appropriate uses of data?

They had an interesting discussion about openness and how we often don't want it. Archives show and hide things. Some of the issues:

How do we recognize the perspectives built into any dataset and any link?
Some datasets are highly curated by experts.
Can we unlink data?
Authority

I believe that the value of LOD is in helping us check the claims being made. One way we check a claim is to follow the entities named in it.

What do we need to learn about the technologies:

What is LOD?
How to best structure data? What are best practices?
How to use it? How to draw inferences? How to reason about it?
How to talk about it?
How to weave into a research project?

Ichiro Fujinaga: SIMSSA and LOD project

SIMSSA stands for Single Interface for Music Score Searching and Analysis. Fujinaga talked about the challenges of music. In textual disciplines we have Google Books, but there is no search for music. We don't even really know what to search for. There are music recognition tools that can add information to images. IIIF provides an image interoperability framework. They have standards for music encoding (MEI). They have various tools.

They hope that if they build it, others will come.

OMR - Optical Music Recognition is a core technology that converts an image of a score into a computer-readable music file. He showed Cantus Ultimus, which lets one search across distributed score collections. He showed a search for a sequence of pitches, which was very cool. Then he showed MusicLibs, which lets one search all sorts of music.
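To get a feel for what searching for a sequence of pitches involves, here is a small Python sketch. It is not how Cantus Ultimus works (I don't know the details); it just assumes that OMR has already turned each score into a list of MIDI-style pitch numbers, and the collection names and melodies are invented.

```python
def find_pitch_sequence(score, query):
    """Return the start positions where `query` occurs as a contiguous run in `score`."""
    if not query:
        return []
    n, m = len(score), len(query)
    return [i for i in range(n - m + 1) if score[i:i + m] == query]


def search_collections(collections, query):
    """Search every score in every collection; keep only the scores with a hit."""
    results = {}
    for collection, scores in collections.items():
        for title, pitches in scores.items():
            hits = find_pitch_sequence(pitches, query)
            if hits:
                results[(collection, title)] = hits
    return results


# Invented data: two "distributed" collections of pitch lists.
collections = {
    "library_a": {
        "chant_1": [62, 64, 65, 64, 62],
        "chant_2": [60, 62, 64, 65, 67],
    },
    "library_b": {
        "chant_3": [67, 65, 64, 65, 64, 62],
    },
}

results = search_collections(collections, [64, 65, 64])
```

A real system would presumably also want matching that ignores transposition or rhythm, which is part of what he meant by not really knowing what to search for.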

He talked about provenance and the challenges of different types of provenance. There is the provenance of the musical work, provenance of source of instance, provenance of computer files, and who did the cataloging. To do this they use RDF Quads - named graphs that connect sources.
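As I understand it, a quad is a triple plus a fourth element naming the graph it belongs to, and you can then make statements about that graph. Here is a minimal sketch in plain Python tuples, not a real RDF library. All the URIs, property names and the helper provenance_of are hypothetical.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace

# Each quad is (subject, predicate, object, graph).
quads = [
    # Two claims about the same work, held in different named graphs.
    (EX + "work1", EX + "composer", EX + "person1", EX + "graph/catalogue_a"),
    (EX + "work1", EX + "key", "G major", EX + "graph/omr_run_1"),
    # Provenance: statements whose subject is itself a named graph.
    (EX + "graph/catalogue_a", EX + "catalogued_by", EX + "cataloguer1", EX + "graph/provenance"),
    (EX + "graph/omr_run_1", EX + "generated_by", EX + "omr_software", EX + "graph/provenance"),
]


def provenance_of(quads, subject, predicate):
    """For each matching claim, return (value, graph, what is said about that graph)."""
    about = defaultdict(list)
    for s, p, o, _graph in quads:
        about[s].append((p, o))
    return [
        (o, g, about.get(g, []))
        for s, p, o, g in quads
        if s == subject and p == predicate
    ]


key_claims = provenance_of(quads, EX + "work1", EX + "key")
```

The nice thing is that the different kinds of provenance he listed (work, source, file, cataloguer) can each hang off a different graph without touching the claims themselves.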

Then he talked about feature extraction - how they can extract neat sets of features and create study sets. The types of questions that they can ask are really cool, as in "select from fugues printed in London those that modulate to G Major." He talked about how he now wants to link to external data like prosopographical databases.

I asked about the cool queries that he described and what the interface was for designing queries. He pointed me to a paper about jSymbolic 2.2.

Heather Dunn: Canadian Heritage Information Network LOD initiatives (CHIN)

Dunn talked about CHIN, which has a lot of data about Canadian artefacts. A lot of their data is flat. They have lots of data which is not online as it is not bilingual.

Artists in Canada is one database they have that might be useful. They have developed a Nomenclature that is used for museum cataloguing. There is an overlap between object names between their different projects. They haven't yet connected their own data. They are using PoolParty to manage their nomenclature. They are trying to figure out how to scale up and make decisions about semantic data.

She talked about the Records Data Model that is meant to eventually cover all museums. They are starting with the "agents" - beyond just the artists. They are basing what they do on CIDOC-CRM which is widely used and is event-based (which is good for historical objects and actors.)

She showed a complex data pipeline. They are trying to figure out how to make it simpler for museums.

She was asked about reconciliation when you have different ontologies.

Deb Stacey: Ontology policy/strategy

Stacey started by talking about what an ontology is. It is a specification of a shared conceptualization. It should define a shared vocabulary in a coherent and consistent manner. It can guarantee consistency.

The ontologies we know and love:

Dublin Core
Prov-O - W3C provenance ontology
Schema.org
FaBiO - FRBR-aligned Bibliographic Ontology
CIDOC-CRM
TaDiRAH
Europeana Data Model
FOAF
Library of Congress

Some of the problems with ontologies are that they are hard to keep simple. People end up picking and choosing and disagreeing. We should reuse those of others. There is a balance between simplicity and semantics. We need to pick ontologies that are well known in our community of practice.

Ontologies provide structure that lets you see some things, but hides others. Does this fit with the exceptionalism of the humanities?

Reasoning is important and hard. If you get it right you can do neat reasoning. There is a limit to the current technology of reasoning.

Some of the issues include:

Reasoning
SHACL
Foundations

And reasoning issues are evolving.

With Reasoning we get certain things:

Satisfiability (of a concept)
Subsumption of concepts
Consistency of individuals
Checking individuals to see if they are instances of a concept
Retrieval of individuals
Realization of individuals
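Some of the services in that list can be illustrated with a toy class hierarchy. The sketch below is my own and is nothing like a real OWL reasoner: it only handles a single-parent subclass tree, it covers subsumption, instance checking, retrieval and realization (not satisfiability or consistency), and the class and individual names are invented.

```python
# Toy ontology: each class has at most one direct superclass.
SUBCLASS_OF = {
    "Novelist": "Writer",
    "Poet": "Writer",
    "Writer": "Person",
    "Person": "Agent",
}

# Classes explicitly asserted for each (hypothetical) individual.
ASSERTED = {
    "author_a": {"Person", "Novelist"},
    "author_b": {"Poet"},
    "archive_c": {"Agent"},
}


def ancestors(cls):
    """All superclasses of `cls`, nearest first."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def subsumes(general, specific):
    """Subsumption: is every `specific` also a `general`?"""
    return general == specific or general in ancestors(specific)


def is_instance(individual, cls):
    """Instance checking: does the individual belong to `cls`, directly or by inference?"""
    return any(subsumes(cls, asserted) for asserted in ASSERTED[individual])


def retrieve(cls):
    """Retrieval: all individuals that are instances of `cls`."""
    return sorted(i for i in ASSERTED if is_instance(i, cls))


def realize(individual):
    """Realization: the most specific asserted classes of the individual."""
    classes = ASSERTED[individual]
    return sorted(
        c for c in classes
        if not any(c != d and subsumes(c, d) for d in classes)
    )
```

For example, retrieve("Writer") finds author_b even though nobody ever said author_b was a Writer, which is the kind of neat reasoning you get when the hierarchy is right.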

She talked about shapes and SHACL (Shapes Constraint Language). This allows you to constrain and then validate the properties of graph data. A shape is a way to identify metadata about a particular type of resource. You can describe what has to be there and what is optional.
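The idea of a shape can be sketched without any SHACL tooling. What follows is plain Python, not SHACL syntax: the dictionary layout and the validate function are my own invention, and the min_count and max_count keys only loosely mirror SHACL's sh:minCount and sh:maxCount constraints.

```python
# A hypothetical "shape" for person records: what must be there and what is optional.
PERSON_SHAPE = {
    "target_class": "Person",
    "properties": {
        "name": {"min_count": 1, "max_count": 1},        # required, exactly one
        "birth_date": {"min_count": 0, "max_count": 1},  # optional, at most one
    },
}


def validate(node, shape):
    """Return a list of violation messages; an empty list means the node conforms."""
    violations = []
    for prop, rule in shape["properties"].items():
        count = len(node.get(prop, []))
        if count < rule["min_count"]:
            violations.append(
                f"{prop}: expected at least {rule['min_count']}, found {count}"
            )
        if rule["max_count"] is not None and count > rule["max_count"]:
            violations.append(
                f"{prop}: expected at most {rule['max_count']}, found {count}"
            )
    return violations


# Nodes map each property to a list of values (invented data).
good = {"name": ["Example Person"], "birth_date": ["1900"]}
bad = {"birth_date": ["1900", "1901"]}

good_report = validate(good, PERSON_SHAPE)
bad_report = validate(bad, PERSON_SHAPE)
```

Note that this is validation, not reasoning: the shape tells you whether the data is well formed, not what else follows from it.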

She talked about Foundational Ontologies that are basic, upper ontologies that describe superclasses that everyone agrees about. You then build on top of them. CIDOC-CRM, DOLCE and SUMO are examples that are big.

Dynamic Ontologies are driven by the structuring of the data. As one edits the data it can trigger changes to the ontology. This can be a version of versioning. It can be community driven, but it is also expensive in some ways. Above all, we have to accept that changes will happen and we have to change our ontologies.

We had a discussion about how much an ontology might force on us and whether we need them. To some extent you always have some structure if you have structured data; an ontology makes it explicit.

A great question asked was whether we should have so many people together in this project. What is the advantage of a large project?

Constance Crompton: Is Less More?

Constance Crompton discussed ontologies that could cross-walk with the TEI. She has been looking at what to pull from the TEI and how there is a fit with other ontologies. There are a number of challenges coming up:

Mixing ontologies decreases reasoning
TEI in the wild is stringy - it isn't URL-y
TEI elements gather their meaning from their parent and child relationships
TEI elements are inconsistently used across projects
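The "stringy, not URL-y" problem can be shown with a tiny example. This is a sketch under my own assumptions, not a LINCS tool: the TEI fragment is invented, the AUTHORITY table stands in for whatever reconciliation service would really be used, and the example.org URIs are placeholders. It uses persName elements with an optional ref attribute, which is one common way TEI projects point at an entity.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical lookup table standing in for a real authority or reconciliation service.
AUTHORITY = {"Margaret Atwood": "http://example.org/person/atwood"}

tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<persName ref="http://example.org/person/p1">A. Writer</persName> met
<persName>Margaret Atwood</persName> and <persName>Unknown Person</persName>.
</p></body></text></TEI>"""


def reconcile(tei_xml, authority):
    """Split tagged names into those we can link to a URI and those left as strings."""
    root = ET.fromstring(tei_xml)
    linked, unresolved = {}, []
    for element in root.iter(TEI_NS + "persName"):
        name = "".join(element.itertext()).strip()
        # Prefer a URI the encoder supplied; fall back to the lookup table.
        uri = element.get("ref") or authority.get(name)
        if uri:
            linked[name] = uri
        else:
            unresolved.append(name)
    return linked, unresolved


linked, unresolved = reconcile(tei, AUTHORITY)
```

The unresolved list is where the confidence question bites: do we ingest a guess for "Unknown Person" or leave the string alone?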

There are different options for TEI contributions. We could have tools that would take people through the transformations needed to fit in LINCS. That would be for the most involved. Or we could have lighter ways for people to connect. Or just run automated tools on TEI that is open.

She then presented questions:

What matters to ingest? Do we want only high confidence stuff or more guesses?
What do we want people to know about LOD when they start tagging in TEI?

I'm convinced that what we need is just-in-time markup where we can add interpretative markup and use it with other layers of markup.

Constance suggested that the paratextual information may be more important. What people want to know about is Atwood, not line numbers in a text.

Lisa Goddard and Stacy Allison-Cassin: Storage and Publishing Platforms

We then had an important discussion about storage and platforms. It is important as we need to make up our minds soon. Stacy Allison-Cassin started by talking about why she got involved in Wikidata. She didn't have a grant and needed something existing that she could add to. She found that there wasn't a lot of Canadian information in Wikipedia so she ran a Music in Canada @ 150 Wikimedia Project. It was a year-long project with pre-conference workshops, editathons and outputs. She talked about how you have to be invested in the community relationships with other metadata folk. You can't just use it.

She reminded us that there will always be mistakes in the data. She showed us how there is a mistake icon that shows up automatically. She talked about the Witchfinder General project where they tried to see what they could draw from Wikidata about witches in Scotland. They documented all sorts of problems they found. She discussed how one can propose new properties and they get voted on.

She then showed some tools like the Reasonator and Mix'n'match integration and then showed a bunch of projects, including projects using Wikibase like Rhizome.

Lisa Goddard talked about how we have to make a decision, and soonish. She believes we should use Wikibase. What are the strengths or weaknesses?

She talked about what we need (criteria):

RDF standards
Ontology customization
APIs
Ingest options and tools
Search and interface tools
Good interface
Performance and scalability
Community practice and documentation
Ease of integration with others
Versioning

She talked about APIs and whether we may have to extend the Wikidata API. She walked us through some of these criteria and how they play out in Wikibase. The Wikidata model is nice and humanly readable, but it doesn't do qualifications as well. Provenance is at the statement level. Wikibase ranks give a simple way to indicate certainty.

Lisa asked what the alternatives might be. I don't know much about platforms but here are some options:

Why not just use Wikidata?
Apache Marmotta: http://marmotta.apache.org/
Wikibase Repository: https://www.mediawiki.org/wiki/Extension:Wikibase_Repository - use a subset of
Whatever they are using in the music community
Whatever they use for linked data for the New York Times or Guardian
Just use a robust database and build stuff on top of it
What is the European Data Portal using: https://www.europeandataportal.eu/en/resources/training-companion/open-data-platforms

Lisa talked about Reification. This is obviously a complex issue.

The key is how we can make the decision. What do we need to know for this to work? Here are some things we think we need:

Easy to create properties
Easy to publish ontologies online
Is reasoning or publishing more important?
You can have conversations about materials
You can watch things
There is versioning, but it is hard to create
There are APIs for everything, but people complain that it is slow
There are a bunch of tools already being built
There are SPARQL end points and query builders
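For anyone who hasn't used a SPARQL endpoint, here is roughly what a query request looks like. This sketch only builds the URL and does not send it. I am assuming the public Wikidata Query Service endpoint as the example (a LINCS Wikibase would have its own address), and Q42 is the Wikidata item for Douglas Adams that Peter demoed. The function name build_query_url is mine.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint; a LINCS installation would differ.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(item_id, limit=10):
    """Build a GET URL asking for a few property/value pairs of one item.

    The wd: prefix is predeclared by the Wikidata Query Service,
    so the query does not need a PREFIX line there.
    """
    query = (
        f"SELECT ?property ?value "
        f"WHERE {{ wd:{item_id} ?property ?value }} "
        f"LIMIT {limit}"
    )
    # urlencode percent-escapes the query text and asks for JSON results.
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


url = build_query_url("Q42")
```

Fetching that URL (for instance with urllib.request, ideally with a descriptive User-Agent header) should return JSON bindings. A query builder is essentially a friendlier way of writing the query string.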

Denilson Barbosa: Conversion and Diffbot Text Understanding Quasi-Demo

Denilson talked about how we will always need to convert or extract metadata. There are datasets like the Internet Archive that would be great to be able to use to extract things.

He then talked about using machine-learned models for identifying entities and disambiguating entities to a reference KG. We need one or more reference Knowledge Graphs. We are partnering with Diffbot, which has clients like the NSA, eBay and Amazon. He talked about Diffbot's mission, which is ambitious. They will make a NER and disambiguation tool open for us. They may give us limited access to their Knowledge Graph.

He then demoed a simple tool that does NER using the Diffbot KG and then returns entities. The tool was good at figuring out who "she" and "her" referred to.

Lightning talks

Janelle Jenstad talked about the MoEML gazetteer (raw geojson)
Alison Hedley: Yellow 90s - late 19th century personography
Bryan Tarpley: ARC (meta)data - a superset of neat community review and aggregation platforms like NINES
Michael Frishkopf: Canadian Centre for Ethnomusicology -
Michelle Meagher: Born-digital Heresies - Heresies is a feminist publication on art and politics. They are studying relationships through the journal and in the journal.
Emmanuel Chateau-Dutier: TEI Paris Guidebooks - They are digitizing these architectural textbooks/guidebooks for which they have different editions.
Geoffrey Rockwell: Twitter data

Sunday, Sept. 15, 2019

Access - Tools session

Shawn Murphy talked about HuViz - He gave a prehistory of HuViz. HuViz is an RDF/OWL visualization tool. You get a circle (shelf) of nodes not being visualized. You can drag and drop nodes into the center and it pulls its adjacencies in. This lets you visualize subsets of a larger graph.
Laura Mandell talked about BigDIVA which lets one visualize large collections by genre, date range, disciplines and so on. It has a cool trace my path tool. She then showed an idea about how it could become a LOD viewer.
Susan Brown talked about CWRC that is a virtual editing environment. She showed CWRC-Writer that allows one to edit TEI text and link entities. She also showed Nerve (?) that does NER to identify candidate entities.
Stéfan Sinclair presented Spyral which is an extension of Voyant. It allows you to create (spiral) notebooks. We have prototyped this in Jupyter notebooks - The Art of Literary Text Analysis is available at Github: https://github.com/sgsinclair/alta
MJ Suhonos showed WordPress for LOD. He gave an example https://personography.1890s.ca . He uses WordPress as it has a good plug-in ecology.
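The HuViz shelf interaction stuck with me, so here is how I picture it as a data structure. This is my guess at the behaviour from the demo, not HuViz's code: the class ShelfView and its choose method are hypothetical, and the graph is invented.

```python
class ShelfView:
    """Nodes start on the shelf; choosing one shows it along with its neighbours."""

    def __init__(self, edges):
        self.neighbours = {}
        for a, b in edges:
            self.neighbours.setdefault(a, set()).add(b)
            self.neighbours.setdefault(b, set()).add(a)
        self.shelf = set(self.neighbours)  # nodes not being visualized
        self.shown = set()                 # nodes drawn in the center

    def choose(self, node):
        """Simulate dragging `node` to the center: pull in the node and its adjacencies."""
        pulled = {node} | self.neighbours[node]
        self.shelf -= pulled
        self.shown |= pulled
        return sorted(pulled)


# Invented graph with two unconnected clusters.
view = ShelfView([("a", "b"), ("a", "c"), ("d", "e")])
pulled = view.choose("a")
```

After choosing "a", the nodes "d" and "e" stay on the shelf, which is how you end up looking at a subset of a larger graph.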

Breakout Sessions

We then had reports from the breakout groups. I was in a Tools group. We came up with the following:

We need an API that can query, read and write to LINCS. This may, because of Wikimedia, be limited to writing one triple at a time. The API needs to support bots that can comb through and do cleaning/reconciliation.
Where there is lots of LOD to be uploaded we will need a batch upload/copy interface that may be much more controlled.
We imagine that there will be federated versions of LINCS, like one for CWRC. These will get a stream from LINCS central and people can test an upload of lots of data on the children versions. Data that has been checked on a child can then be copied over to the central. Thus one type of "tool" would be children versions of LINCS and sandbox versions spun up for some purpose.
The API needs to be designed for cleaning bots, research bots, and reasoning bots.
It seemed like we would need an API working group and a Conversion working group. A breakout group also discussed the conversion process.

My sense is that we will have the following types of tools:

Children versions of LINCS that are maintained to interact with large portals like CWRC. These will get a stream of updates from LINCS central and will be authorized to upload new data to the central.
Sandbox versions of LINCS that are fired up to test the ingestion of a large set of LOD or for testing a different tool.
Bots that might crawl the data adding or checking it.
Reasoners that might create new knowledge in some fashion that might be added back or used for research.
Conversion tools that create LOD that might then be ingested in a sandbox version of LINCS, to be copied into the central one when checked.
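The sandbox-then-central idea can be sketched as follows. Nothing here is a LINCS design decision: the TripleStore class, the promote function and the looks_valid check are placeholders I made up to show the shape of the workflow, with a single-triple write at the bottom and a batch upload layered on top.

```python
class TripleStore:
    """A stand-in for a LINCS instance (central, child or sandbox)."""

    def __init__(self, name):
        self.name = name
        self.triples = set()

    def write(self, triple):
        """Single-triple write, the lowest common denominator of the API."""
        self.triples.add(triple)

    def batch_write(self, triples):
        """Batch upload, built here on top of single writes."""
        for triple in triples:
            self.write(triple)


def looks_valid(triple):
    """Placeholder check: three non-empty strings. A cleaning bot would do far more."""
    return len(triple) == 3 and all(isinstance(part, str) and part for part in triple)


def promote(child, central, check=looks_valid):
    """Copy triples that pass the check from a child to the central; return the rejects."""
    rejected = []
    for triple in sorted(child.triples - central.triples, key=repr):
        if check(triple):
            central.write(triple)
        else:
            rejected.append(triple)
    return rejected


central = TripleStore("central")
central.write(("ex:a", "ex:knows", "ex:b"))

sandbox = TripleStore("sandbox")
sandbox.batch_write([
    ("ex:a", "ex:knows", "ex:b"),  # already in central
    ("ex:c", "ex:knows", "ex:d"),  # new and valid
    ("ex:e", "ex:knows", ""),      # new but fails the check
])

rejected = promote(sandbox, central)
```

The missing half is the stream of updates flowing the other way, from the central to the children, which is where the API working group would have to earn its keep.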

At this point I had to leave.

Linked Infrastructure For Networked Cultural Scholarship Team Meeting 2019