A computer cannot directly search for text in a scanned image. The text in an image first has to be extracted and stored in a separate character-based file. This process, called OCR (optical character recognition), works quite well on documents that contain consistently shaped letters, like those found in typed or typeset documents. However, OCR does not work well on handwritten materials. Consequently, we had to have a human transcribe the letters and words manually into text. We outsourced this transcription to a commercial firm called "Access Imagery" for our first project, but have subsequently used student workers to do it.
Transcribing a document results in a file containing all the words of a document, but in order to preserve the visual layout of the original, special encoding (called "markup" or "tagging") of the text is required. In a properly encoded text document, the layout will mimic the paragraph structure, line breaks, and pagination of the original document.
We chose to have the transcription of the materials in the Civil War collection encoded using the TEI encoding scheme (see sidebar) because it is a widely used means of preserving the layout of the original document and of tagging manuscript materials semantically and syntactically. Because we encoded the names of people and places that appear in these transcriptions, a user can generate lists of those names from a single document or across multiple documents.
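As a rough illustration of how such name lists could be generated, the sketch below gathers tagged names from one transcription and then merges them across several. It is not the project's own tooling: the element names `persName` and `placeName`, the sample letters, and the helper functions `collect_names` and `collect_across` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical transcription fragments. The element names are assumed
# for illustration and are not copied from the project's TEI files.
SAMPLE_DOCS = {
    "letter1.xml": (
        "<body><p>Dear <persName>Mary</persName>, we marched from "
        "<placeName>Utica</placeName> to <placeName>Albany</placeName>.</p></body>"
    ),
    "letter2.xml": (
        "<body><p><persName>John Smith</persName> wrote from "
        "<placeName>Albany</placeName>.</p></body>"
    ),
}

NAME_TAGS = ("persName", "placeName")


def collect_names(xml_text):
    """Return {tag: sorted unique names} for a single document."""
    root = ET.fromstring(xml_text)
    found = defaultdict(set)
    for tag in NAME_TAGS:
        for el in root.iter(tag):
            # itertext() also picks up text inside nested tags;
            # split/join normalizes line breaks and extra spaces.
            name = " ".join("".join(el.itertext()).split())
            if name:
                found[tag].add(name)
    return {tag: sorted(found[tag]) for tag in NAME_TAGS}


def collect_across(docs):
    """Merge the name lists of several documents, removing duplicates."""
    merged = defaultdict(set)
    for xml_text in docs.values():
        for tag, names in collect_names(xml_text).items():
            merged[tag].update(names)
    return {tag: sorted(merged[tag]) for tag in NAME_TAGS}


if __name__ == "__main__":
    print(collect_names(SAMPLE_DOCS["letter1.xml"]))
    print(collect_across(SAMPLE_DOCS))
```

Because the names are stored as sets before sorting, a place mentioned in several letters appears only once in the merged list.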
We also tagged any words that were misspelled or undecipherable so the reader would not think a misspelling was the fault of the transcriber. We could also have tagged other details mentioned in the manuscripts such as all military terms, domestic terms, or obsolete vocabulary, but we did not expect our primary users to be requesting lists of such terminology.
A TEI-conformant text contains a "header" section and a "body" section.
The TEI header tag (<teiHeader>) provides descriptive metadata about the document in a way that is similar to Dublin Core. In this project, however, we chose to put the descriptive metadata in a separate XML file in unqualified Dublin Core, which points to the TEI file containing the encoded transcription of the document. The teiHeader can hold more than descriptive metadata, but in this project it contains only a minimal set of descriptive metadata elements.
The TEI body tag (<body>) wraps the actual text of the document and may consist of a wide range of tags that wrap or specify kinds of layout structure, semantics, or syntax.
These tags are used to retain structural equivalence between the transcription and the original.
Tags from the "Additional Element Set for Names and Dates" (teind2.dtd)
When reading the text was problematic, the following editorial tags were used as appropriate:
<choose> <sic>text with error</sic> <corr>corrected text</corr> </choose>
<choose> <orig>text with error</orig> <reg>corrected text</reg> </choose>
<choose> <abbr>text with error</abbr> <expan>corrected text</expan> </choose>
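One benefit of pairing each problematic reading with its editorial counterpart is that either version of the text can be produced from the same file. The sketch below shows the idea under stated assumptions: the sample sentence is invented, the `reading` function is hypothetical, and the element names are taken as they appear in the list above.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence using the paired editorial tags listed above.
SAMPLE = (
    "<p>We <choose><sic>recieved</sic><corr>received</corr></choose> "
    "orders from <choose><abbr>Col.</abbr><expan>Colonel</expan></choose> "
    "Smith.</p>"
)

ORIGINAL_TAGS = {"sic", "orig", "abbr"}   # what the manuscript says
EDITED_TAGS = {"corr", "reg", "expan"}    # what the editor supplied


def _text(el, drop):
    """Concatenate the text of el, skipping any child whose tag is in drop."""
    parts = [el.text or ""]
    for child in el:
        if child.tag not in drop:
            parts.append(_text(child, drop))
        # Text that follows a child belongs to the parent, so always keep it.
        parts.append(child.tail or "")
    return "".join(parts)


def reading(xml_text, keep="original"):
    """Return the plain text, keeping the 'original' or the 'edited' readings."""
    if keep not in ("original", "edited"):
        raise ValueError("keep must be 'original' or 'edited'")
    drop = EDITED_TAGS if keep == "original" else ORIGINAL_TAGS
    root = ET.fromstring(xml_text)
    return " ".join(_text(root, drop).split())


if __name__ == "__main__":
    print(reading(SAMPLE, "original"))
    print(reading(SAMPLE, "edited"))
```

Keeping both readings in the file means the choice between a faithful transcription and a reader-friendly one is made at display time rather than at transcription time.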
View an example of a TEI file from this project (created in 2004).
We display the TEI-encoded texts on the Web by using an XSL stylesheet, which reads each TEI tag and formats it in XHTML, preserving the original's layout and color-coding the names of persons, organizations, places, and geographic features.
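The project performs this transformation with XSL. The sketch below uses Python instead, purely to illustrate the tag-by-tag mapping, and should not be read as the project's stylesheet. The element names (`persName`, `orgName`, `placeName`, `geogName`, `lb`), the CSS class names, and the function `tei_to_xhtml` are assumptions. The colors themselves would come from a separate CSS file keyed to the class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI name elements to CSS classes.
# A stylesheet could then assign one color per class.
NAME_CLASSES = {
    "persName": "person",
    "orgName": "organization",
    "placeName": "place",
    "geogName": "geographic",
}

# Invented fragment; <lb/> marks a line break in the original manuscript.
SAMPLE = (
    "<body><p>Camp near <placeName>Falmouth</placeName><lb/>"
    "Dear <persName>Mary</persName></p></body>"
)


def tei_to_xhtml(xml_text):
    """Rewrite a TEI fragment as XHTML-style markup, one tag at a time."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.tag in NAME_CLASSES:
            css_class = NAME_CLASSES[el.tag]
            el.attrib.clear()          # drop TEI-specific attributes
            el.set("class", css_class)
            el.tag = "span"            # names become color-coded spans
        elif el.tag == "lb":
            el.tag = "br"              # preserve the original line breaks
        elif el.tag == "body":
            el.tag = "div"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tei_to_xhtml(SAMPLE))
```

Only the element names change during the rewrite, so the paragraph structure and the text itself pass through untouched.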
Copyright © 2007-2010, The Trustees of Hamilton College. All rights reserved.