Creating Meta Information
Google stores massive amounts of words very well, and by using massive computing power it can find words very quickly.
But Google does not know what the words mean.
For example, you can't ask it to find all articles that use football terms, to show all articles that mention people whose names start with the letter "O," or to list the words and phrases that have something to do with "love."
In order to answer these questions, words and phrases need to be characterized with semantics; we call these semantics meta-information.
That is, you'd have to characterize the phrase "touch down" as a football term only when it is found in an article about football (and not articles about airplanes).
To do so, we wrap words and phrases and assign a semantic classification to them.
We can use a popular encoding language called XML (eXtensible Markup Language).
<term type="football">touch down</term>
In XML, these wrappers are called "elements."
When you've wrapped words or phrases in a text inside XML elements, you are said to have "marked up" the text or "encoded" the text.
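Once a phrase is wrapped in an element, a program can answer the kind of question a plain word search cannot. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The <article> and <sentence> wrappers, the second sentence, and the "aviation" type value are invented here for illustration and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text: only the football <term> comes from the example above.
SAMPLE = """<article>
  <sentence>He scored on a <term type="football">touch down</term> pass.</sentence>
  <sentence>The pilot made a smooth <term type="aviation">touch down</term>.</sentence>
</article>"""


def terms_of_type(xml_text, wanted_type):
    """Return the text of every <term> whose type attribute equals wanted_type."""
    root = ET.fromstring(xml_text)
    return [term.text for term in root.iter("term") if term.get("type") == wanted_type]


football_terms = terms_of_type(SAMPLE, "football")
print(football_terms)
```

The same phrase appears twice in the sample, but only the occurrence marked as a football term is returned, because the program selects on the attribute rather than on the words.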
A simple example of text marked up with XML might look like this:
Code Example 0: XML markup
<person type="president">John Adams</person>
(The markup above classifies "John Adams" as a president by means of an XML "attribute.")
But we can get more fine-grained than this by encoding the forename and surname separately.
<person type="president"><forename>John</forename> <surname>Adams</surname></person>
This example shows the use of an attribute: "type"
"Attributes" are used to clarify further the semantics of an element.
It is sometimes indented to improve readability.
Code Example 1:
XML rule: Element names may not contain spaces.
This example shows the use of three elements: "person," "firstName" and "lastName."
Elements are used to delimit the content being described by marking the beginning of the content and the end of the content.
Notice that the element name in the closing tag is always preceded by a slash.
Code Example 2:
- XML rule: Tags cannot overlap, but you can nest them within each other like Russian dolls, as in the following fragment of an XML-encoded article.
Code Example 3:
<title>Tragic Events in South Africa</title>
<date>February 19, 2013</date>
<sentence>It was the middle of the night, Oscar Pistorius says, and he thought an intruder was in the house.</sentence>
<sentence>He felt vulnerable in the pitch dark and too scared to turn on the lights.</sentence>
<sentence>The track star pulled his 9mm pistol from beneath his bed, moved toward the bathroom and fired into the door.</sentence>
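The XML rules described above (no spaces in element names, a slash in every closing tag, nesting without overlap) are what a parser checks when it decides whether a text is "well-formed." The sketch below, which assumes Python's standard xml.etree.ElementTree module, feeds the parser one correct fragment and three fragments that each break one rule. The fragments themselves are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <person> opens and closes inside <sentence>.
nested = "<sentence>The <person>track star</person> fired.</sentence>"
# Overlapping: <sentence> is closed while <person> is still open.
overlapping = "<sentence>The <person>track star</sentence></person>"
# Missing slash: the second <title> opens a new element instead of closing the first.
missing_slash = "<title>Tragic Events in South Africa<title>"
# Space in an element name: "last name" is not a legal name.
space_in_name = "<last name>Adams</last name>"

for label, fragment in [("nested", nested), ("overlapping", overlapping),
                        ("missing slash", missing_slash), ("space in name", space_in_name)]:
    print(label, is_well_formed(fragment))
```

Well-formedness is only the first hurdle. A validation service for a specific tag set checks, in addition, that the element names and their arrangement follow that tag set's rules.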
How do we know what names to use for the tags?
It would be very confusing if everyone used their own tags.
So people marking up texts are strongly encouraged to use a standard set of tag names, so that one person doesn't use a tag named "lastName," another "Surname," and a third "familyName."
Various tag sets have been devised over the years, and there is nothing to stop you from making up your own, but in the scholarly text analysis world only one tag set is used widely enough to be called the standard: the TEI. TEI stands for the Text Encoding Initiative, an initiative of a broad-based community of text encoders, some of whom were academic scholars, some librarians, and some publishers.
In the literary world, TEI is used for marking up old manuscript letters, poetry, entire poetry anthologies, and even whole books. The purposes of the encoding range from studying one author's style, to comparing different authors' styles, to comparing different versions of the same play over several centuries, to identifying all the names or places found in a text. Most recently, it has been used to help publish texts in EPUB format for mobile devices.
The TEI tag set is the one we are interested in and will be investigating for the rest of this talk.
We are interested in seeing what TEI can do for the analysis of literary texts, which means we will be using only a small subset of the TEI elements.
TEI encoding is accomplished with XML, which was chosen because it is easy for computers to understand.
This use of XML is very important because we need computer programmers to help us process the texts after we encode them. Since scholars of poetry are not usually also computer programmers, it is wise to use a markup language like TEI that is instantiated in XML, so we can take advantage of computer programs written to consume XML files.
What does TEI encoding look like? Translated into TEI elements and attributes, the previous example would look like this:
Code Example 4: TEI elements for personal names
Note: There are two errors in this code, a tag mismatch and a missing slash. Can you find them?
1. Copy the following line to the clipboard:
<div><p>My name is <persName><forename>___</forename> <surname>___</surname></persName></p></div>
2. Go to the TBE validation service, which checks your TEI encoding for errors
3. Replace "___" with your first name and last name
4. Add <forename>[your middle initial]</forename>
5. Keep the "TBE validation service" tab open and go back to the Exercises page.
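A validation service tells you whether the encoding is correct; a short program shows what the encoding buys you. The sketch below assumes Python's standard xml.etree.ElementTree module and a completed version of the exercise line. The name "John Q Adams" is a placeholder standing in for your own, and the function name full_names is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The exercise line with the blanks filled in and a middle initial added.
SNIPPET = (
    "<div><p>My name is <persName><forename>John</forename> "
    "<forename>Q</forename> <surname>Adams</surname></persName></p></div>"
)


def full_names(xml_text):
    """Return one string per <persName>, built from its forename and surname parts."""
    root = ET.fromstring(xml_text)
    names = []
    for pers in root.iter("persName"):
        parts = [el.text for el in pers
                 if el.tag in ("forename", "surname") and el.text]
        names.append(" ".join(parts))
    return names


names = full_names(SNIPPET)
surnames = [el.text for el in ET.fromstring(SNIPPET).iter("surname")]
print(names)
print(surnames)
```

Because the forename and surname are tagged separately, the program can list surnames alone, which would require guesswork if the name were only a string of words.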
- For TEI markup, it helps to determine what exactly you are going to mark up. Content can be marked up in many ways, but some elements may not be of interest for your project. For example, you can mark up word morphology -- such as all plural words -- but you might not be interested in that.
- There are several hundred elements and attributes defined in the TEI Guidelines, but when we narrow our focus down to encoding only poetry, we greatly reduce the number of elements that we need to use when encoding a poem.
Code Example 5: Basic Structure of a TEI Document
<!-- Insert bibliographic information about the text. -->
<!-- Insert the poem itself here. -->
1. Go back to the open "TBE validation service."
2. Click "Copy to Input."
3. Type in something for the <title> and <publisher> elements in the header, for example:
<title>My first TEI project</title>
Let's apply what we are learning to an actual poem by Agha Shahid Ali.
"Ghazal" taken from: http://www.poetryfoundation.org/poem/172051
Code Example 6: TEI markup of teiHeader of "Ghazal"
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
<title>Ghazal: a machine-readable transcription</title>
<name>Agha Shahid Ali</name>
<name>Peter J. MacDonald</name>
<pubPlace>Clinton, New York</pubPlace>
<p>"Ghazal," by Agha Shahid Ali, originally published in "Rooms Are Never Finished," Copyright ©
2002 by Agha Shahid Ali</p>
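TEI documents declare the TEI namespace on the root element, so a program has to use that namespace when it looks elements up. The sketch below assumes Python's standard xml.etree.ElementTree module. The wrapper elements in the sample (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) are standard TEI header elements, but their exact arrangement here is an assumption made for illustration, not a copy of the original header, and the sample would not necessarily pass TEI validation.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the lookups below; "tei" is an arbitrary prefix.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Abbreviated, assumed header layout for illustration only.
HEADER = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Ghazal: a machine-readable transcription</title>
      </titleStmt>
      <publicationStmt>
        <pubPlace>Clinton, New York</pubPlace>
      </publicationStmt>
      <sourceDesc>
        <p>"Ghazal," by Agha Shahid Ali</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

# Parse from bytes so the encoding declaration in the first line is honored.
root = ET.fromstring(HEADER.encode("utf-8"))

title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
place = root.findtext(".//tei:pubPlace", namespaces=TEI_NS)
# Without the namespace the lookup finds nothing.
title_without_ns = root.findtext(".//titleStmt/title")

print(title)
print(place)
print(title_without_ns)
```

The last lookup returning nothing is a common stumbling block: the element is called "title" in the file, but to the parser its full name includes the TEI namespace.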
Encoding a Poem with TEI
Poetic analysis takes place at several textual levels and there are TEI elements that we can use to mark up structures at any of these levels with varying degrees of success.
1. It can show the physical structure of the poem.
2. It can show the rhyme and meter of a line of poetry.
3. It can identify the semantic structure of the poem.
Code Example 7: Marking up the Text Layout of a Poem (stanzas, lines, fonts, underlining, indenting):
<l>I'll do what I must if I'm bold in real time.</l>
<l>A refugee, I'll be paroled in real time.</l>
<l>Cool evidence clawed off like shirts of hell-fire?</l>
<l>A former existence untold in real time ... </l>
<l>The one you would choose: Were you led then by him?</l>
<l>What longing, O <hi rend="italic">Yaar</hi>, is controlled in real time?</l>
<lg> = line group
<l> = line
<lb/> = line break
<hi rend="italic"> = render the text with italic font
More fine-grained structural items could be tagged as well in these lines: modal verbs (will, must, would), all personal pronouns (I, I, you, him), contractions (I'll, I'll), sentence type (statements, questions, fragments).
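Even layout-level markup is enough for simple counts. The sketch below assumes Python's standard xml.etree.ElementTree module and three of the lines above wrapped in a single <lg>. The wrapper and the selection of lines are choices made for this illustration. Sentence type is guessed from the final punctuation mark, which is a rough heuristic and not a substitute for explicit markup.

```python
import xml.etree.ElementTree as ET

# Three lines from the example, wrapped in one line group for illustration.
STANZA = """<lg>
  <l>I'll do what I must if I'm bold in real time.</l>
  <l>A refugee, I'll be paroled in real time.</l>
  <l>What longing, O <hi rend="italic">Yaar</hi>, is controlled in real time?</l>
</lg>"""

root = ET.fromstring(STANZA)

# itertext() gathers the text of a line including text inside child elements.
lines = ["".join(line.itertext()) for line in root.iter("l")]
italic_words = [hi.text for hi in root.iter("hi") if hi.get("rend") == "italic"]
questions = [text for text in lines if text.rstrip().endswith("?")]

print(len(lines))
print(italic_words)
print(questions)
```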
Code Example 8: Marking up the Poetic Form of a Poem (rhyme, meter):
<lg type="couplet" n="1">
<l n="1">I'll do what I must if I'm <rhyme label="a">bold</rhyme> <rhyme label="b" xml:id="A">in real time</rhyme>.</l>
<l n="2">A refugee, I'll be <rhyme label="a">paroled</rhyme> <rhyme label="b" corresp="#A">in real time</rhyme>.</l>
</lg>
<lg type="couplet" n="2">
<l n="3">Cool evidence clawed off like shirts of hell-fire?</l>
<l n="4">A former existence untold <rhyme label="b" corresp="#A">in real time</rhyme> ... </l>
</lg>
<lg type="couplet" n="3">
<l n="5">The one you would choose: Were you led then by him?</l>
<l n="6">What longing, O <hi rend="italic">Yaar</hi>, is controlled <rhyme label="b" corresp="#A">in real time</rhyme>?</l>
</lg>
<x @type=""> = type of x
<x @n=""> = unique number assigned to this x
<rhyme> = rhyme scheme: a, b, etc.
<x @xml:id="Y"> = unique identifier for this element
<x @corresp="#Y"> = content corresponds to the item whose @xml:id is "Y"
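With rhymes labeled and cross-referenced, a program can group rhyming words and follow the pointers. The sketch below assumes Python's standard xml.etree.ElementTree module and uses only the first couplet. The function names rhymes_by_label and resolve_corresp are made up for this illustration. Note that the parser reports xml:id under its full namespaced name, which is why the XML_ID constant is needed.

```python
import xml.etree.ElementTree as ET

# Full name under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

COUPLET = """<lg type="couplet" n="1">
  <l n="1">I'll do what I must if I'm <rhyme label="a">bold</rhyme> <rhyme label="b" xml:id="A">in real time</rhyme>.</l>
  <l n="2">A refugee, I'll be <rhyme label="a">paroled</rhyme> <rhyme label="b" corresp="#A">in real time</rhyme>.</l>
</lg>"""


def rhymes_by_label(root):
    """Group the text of every <rhyme> element under its label."""
    groups = {}
    for rhyme in root.iter("rhyme"):
        groups.setdefault(rhyme.get("label"), []).append(rhyme.text)
    return groups


def resolve_corresp(root):
    """Pair each element carrying corresp with the text of the element it points to."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID) is not None}
    pairs = []
    for el in root.iter():
        pointer = el.get("corresp")
        if pointer is not None:
            target = by_id.get(pointer.lstrip("#"))
            pairs.append((el.text, target.text if target is not None else None))
    return pairs


couplet = ET.fromstring(COUPLET)
rhyme_groups = rhymes_by_label(couplet)
corresp_pairs = resolve_corresp(couplet)
print(rhyme_groups)
print(corresp_pairs)
```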
Marking up the Semantic Entities of a Poem (themes, etc.):
So far we have only marked up the physical structure of the poem, but that is not really very interesting or revealing in itself.
We need to go beyond marking up the structure and start marking up the semantics of the text, which are not clearly indicated by structural features.
Marking up the meaning of a text requires a thorough understanding of the poem and its language use, and perhaps even of biography and history.
Code Example 9: Mark up the Semantics.
<interpGrp resp="PMacD" type="imagery">
<interp xml:id="uncertainty">uncertainty unknown hidden</interp>
<interp xml:id="assertiveness">assertiveness action</interp>
<interp xml:id="hope">desire hope longing</interp>
<interp xml:id="salvation">salvation safety renewal</interp>
<interp xml:id="temporalDislocation">dislocation, distancing, temporal</interp>
<interp xml:id="spatialDislocation">home, geographic terminology</interp>
<interp xml:id="socialDislocation">social dislocation, stranger, strangeness</interp>
<interp xml:id="nostalgia">nostalgia, memory, remembering</interp>
<interp xml:id="adversity">adversity, danger, opposition, challenges</interp>
</interpGrp>
<lg type="couplet" n="1">
<l n="1"><seg ana="#assertiveness">I'll do what I must</seg>
<seg ana="#uncertainty">if I'm bold</seg>
<seg ana="#temporalDislocation">in real time</seg>.</l>
<l n="2">A <seg ana="#socialDislocation">refugee</seg>,
<seg ana="#salvation">I'll be paroled</seg>
<seg ana="#temporalDislocation">in real time</seg>.</l>
</lg>
<lg type="couplet" n="2">
<l n="3"><seg ana="#adversity">Cool evidence clawed off like shirts of hell-fire?</seg></l>
<l n="4">A <seg ana="#temporalDislocation">former existence</seg> untold
<seg ana="#temporalDislocation">in real time</seg> ... </l>
</lg>
<lg type="couplet" n="3">
<l n="5">The one you <seg ana="#modality">would</seg> choose:
<seg ana="#uncertainty">Were you led then by him?</seg></l>
<l n="6">What <seg ana="#hope">longing</seg>, O <foreign><hi rend="italic">Yaar</hi></foreign>, is controlled
<seg ana="#temporalDislocation">in real time</seg>?</l>
</lg>
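Semantic markup of this kind relies on every ana pointer matching the xml:id of some interp, and a one-letter slip in either place silently breaks the link. The sketch below assumes Python's standard xml.etree.ElementTree module and a shortened sample. The <text> wrapper, the choice of entries, and the function name index_segments are made up for illustration, and the "#dislocation" pointer is deliberately left without a matching interp to show how such a slip can be caught.

```python
import xml.etree.ElementTree as ET

# Full name under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened sample; "#dislocation" deliberately has no matching <interp>.
ANALYSIS = """<text>
  <interpGrp resp="PMacD" type="imagery">
    <interp xml:id="uncertainty">uncertainty unknown hidden</interp>
    <interp xml:id="assertiveness">assertiveness action</interp>
    <interp xml:id="temporalDislocation">dislocation, distancing</interp>
  </interpGrp>
  <lg type="couplet" n="1">
    <l n="1"><seg ana="#assertiveness">I'll do what I must</seg>
      <seg ana="#uncertainty">if I'm bold</seg>
      <seg ana="#temporalDislocation">in real time</seg>.</l>
    <l n="2">A refugee, I'll be paroled
      <seg ana="#dislocation">in real time</seg>.</l>
  </lg>
</text>"""


def index_segments(root):
    """Return (segments grouped by interp id, list of pointers with no interp)."""
    defined = {interp.get(XML_ID) for interp in root.iter("interp")}
    by_theme = {}
    dangling = []
    for seg in root.iter("seg"):
        text = "".join(seg.itertext())
        # An ana attribute may hold several space-separated pointers.
        for pointer in (seg.get("ana") or "").split():
            theme = pointer.lstrip("#")
            if theme in defined:
                by_theme.setdefault(theme, []).append(text)
            else:
                dangling.append(theme)
    return by_theme, dangling


by_theme, dangling = index_segments(ET.fromstring(ANALYSIS))
print(by_theme)
print(dangling)
```

Running a check like this before any analysis is a cheap way to confirm that every interpretive label used in the poem is actually defined.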
1. Copy the following entire poem to the clipboard:
2. Go to the TEImatron.
3. Paste in the poem.
4. Click "Submit."
TEI gives us a standard vocabulary to make explicit the form and meaning of a poem so computers can process it.
Learning TEI Encoding
Processing Engines for TEI files
Hamilton College Library TEI Projects
A TEI Project with Static Output
- A Civil War Letter (our first attempt at encoding in TEI; it makes no dynamic use of the codes, only colored highlighting)
TEI Projects with Dynamic Output
TEI Project under development
(Reviewed: February 27, 2013)