philosophi.ca : Text Database Preparation Model

This is a draft document describing in general terms how the Canada Foundation for Innovation programs can help with the development of text databases suitable for research. This came out of a discussion at the Digital Humanities and High Performance Computing workshop organized by SHARCNET. At that workshop we had conversations with a CFI representative about how to get humanities data cleaned up so that HPC methods could be used on it. CFI can be used as a source of funding for the "messy data" challenge. Please note that this is not a CFI document - it is my interpretation only.

CFI Funding for Text Database Preparation

CFI will fund the acquisition of a database or software if it is needed as research infrastructure.
CFI will not fund research or graduate stipends as that is not their mandate.
Therefore CFI would not fund the collection of data - for example the collection of linguistic data in the field, as that is research. They will, however, fund turning the raw data into a usable research database, since that constitutes setup. They would therefore fund the XML encoding, the proofing, the development of a delivery database, and the development of documentation and training. In other words, they will fund what it takes to get the raw data to the point where it is "set up" for research.
But CFI will not fund what looks like research activities using the database. Therefore they will not fund the graduate stipends for students to study it and write their theses. Nor will they pay for travel to conferences and other research-like activities. That's what SSHRC is for.
Nor will CFI pay for the maintenance of the infrastructure in the grant. Once it is set up, CFI expects the university to maintain it, and they provide Infrastructure Operating Fund (IOF) money to help. The idea is that if the university is unwilling to maintain a building built by CFI then the university doesn't really need it.

A Linguistic Database Case

Project X has been donated a large set of linguistic records. These records are all on tape. We apply to CFI to transcribe them, proof them, enrich them, and to build a multimedia database that allows people to search the text and hear the audio. We also ask for funds to develop the documentation for training researchers. The total budget is $100,000. The donation is valued at $20,000, based on the value of a similar linguistic database. The budget for this section of the node budget is therefore:

Cost of raw audio records: $20,000 - This is donated
Cost of one year of a full time transcription and preparation technician: $60,000 - This is a personnel cost
Cost of a database programming contract: $15,000 - This is a contract, but again is for personnel
Cost of one month of a technical writer for documentation: $5,000 - Again personnel

Timing: I would have the transcription and preparation done in year one. I would have the programming and documentation done in year two. I would then freeze the setup as a virtual machine so that it would not need support as long as the server is running.

Summary:

Cost	Description	Percentage
$100,000	Total Budget	100%
$40,000	CFI grant	40%
$40,000	Provincial match	40%
$20,000	Coinvestment (In-kind donation)	20%

This project is therefore appropriately and fully funded. It needs no cash coinvestment, as the donation covers the 20% and triggers the CFI funding. Note that the $20,000 is included in the final cost of the project even though it is entirely a donation. The point is that this project costs $100,000 and we have a coinvestment to help get there that is just as good as cash. If we didn't have the donation we would have to buy the materials. Note also that this project does not do research. It doesn't do the collection and it doesn't pay for people to study the database. It simply pays for the database to be prepared to the point where it can be studied. It is up to the project to get SSHRC or other funding to pay for the research. It is also up to the university to maintain the infrastructure, including managing the server on which it runs.
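
For anyone adapting this example to their own numbers, the arithmetic can be checked with a short script. This is only a sketch of the case above: the 40/40/20 split is the one used in this example, and the names (LINE_ITEMS, SHARES, funding_split) are made up for illustration and do not come from CFI.

```python
from fractions import Fraction

# Line items from the example budget above, in dollars.
DONATED = "raw audio records (donated in kind)"
LINE_ITEMS = {
    DONATED: 20_000,
    "transcription and preparation technician (1 year)": 60_000,
    "database programming contract": 15_000,
    "technical writer for documentation (1 month)": 5_000,
}

# Funding shares as used in this example only (an assumption of the sketch).
SHARES = {
    "CFI grant": Fraction(40, 100),
    "Provincial match": Fraction(40, 100),
    "Coinvestment (in-kind donation)": Fraction(20, 100),
}


def funding_split(total, shares):
    """Return the dollar amount each funding source contributes to total."""
    if sum(shares.values()) != 1:
        raise ValueError("funding shares must add up to 100%")
    return {source: int(total * share) for source, share in shares.items()}


total = sum(LINE_ITEMS.values())
split = funding_split(total, SHARES)

# Cash still needed for the coinvestment once the donation is counted.
cash_shortfall = split["Coinvestment (in-kind donation)"] - LINE_ITEMS[DONATED]

for source, amount in split.items():
    print(f"${amount:,}\t{source}\t{amount / total:.0%}")
print(f"${total:,}\tTotal Budget")
print(f"${cash_shortfall:,}\tCash coinvestment still needed")
```

If the donation were valued at less than 20% of the total, cash_shortfall would come out positive, and that is the amount the project would have to find in cash or other in-kind contributions.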