Optical Character Recognition – Mary Ellen Pethel

At first glance, American Poetry might not catch your eye or seem overly impressive. However, users who look beyond its plain homepage will find over 40,000 poems by more than 200 American poets from the colonial period to the early twentieth century. It is also linked to African American, Canadian, and British poetry and literature. The database is hosted and published by ProQuest through its humanities publishing imprint, Chadwyck-Healey. A digital publishing specialist, Chadwyck-Healey is “synonymous with innovation in electronic publishing since the release of the English Poetry Full-Text Database in 1992” (“About Chadwyck-Healey”).

American Poetry debuted in 1996 and offers multiple search options, including keyword, first line/title, and poet/author. For each of these options, the database generates a metadata search index that lists searchable terms found within the collection. Researchers looking for a specific poet can use additional search fields to narrow results by gender, ethnicity, literary period, and years lived. Ethnicity and literary period also have indexes to help users find and select terms recognized by the database. Collections linked on another page are cross-searchable via the Literature Online interface; examples include the African Writers Series, Twentieth-Century Drama, and an upgraded edition of the King James Bible online. The collection is governed by a specially selected editorial board, whose members advise on the selection of texts and editions with the goals of comprehensiveness and inclusiveness.

After a search using the easily navigable search options, selecting an individual work brings up a great deal of information about the literary period and author. For each poem or work of literature, there is a link with information about the author: gender, birth/death dates, ethnicity, nationality, and literary period. The poem itself is available in full text, but it is transcribed directly onto the webpage and the original is not viewable. While those seeking the text alone (and its legibility) will be satisfied, this leaves a bit to be desired for the historian or digital humanist who wonders what was lost through digitization. There is no exportable image, and searching within the text can only be done using Control+F, as on any webpage. There are options for “Print View,” “Download Citation,” and “Text Only.”

Surprisingly, the “Download Citation” option is clunky compared to the database’s overall streamlined organization and presentation. The necessary information is there, but the export and formatting options require additional steps. Rather than go through this process, users would be better off creating the citation the old-fashioned way: formatted and entered manually in a document. There is also a “Durable URL” option, but it simply provides a link that can be saved or emailed. A recipient who does not have access to the database will not be able to view the linked material without signing in with a user name and password. However, this feature can help the researcher generate a quick list of links.

Chadwyck-Healey began publishing in 1973 and has spent over £50 million in the last decade. The bibliographic basis of American Poetry is the Bibliography of American Literature (Yale University Press, 1955-1991), supplemented with additional poets recommended by the Editorial Board to “provide a thorough representation.” Text conversion proceeded through four stages: selection of texts, encoding and indexing, re-keying and scanning, and preservation. The selection of texts involved a consortium of scholars, research libraries, national libraries, and a publishing team. The encoding method was Standard Generalised Mark-Up Language (SGML). As the publisher states, “SGML encoding of original texts allows works to be divided into content elements . . . and recognized accordingly that provides a route through vast amounts of data” (“Text Conversion”). The re-keying and scanning process compared the SGML-encoded text with text generated by Optical Character Recognition (OCR). Re-keying primarily rectifies spelling and punctuation discrepancies. During the digitization process, the entire text of each poem was included, as well as any accompanying text “written by the poet and forming an integral part of the poem” (“About American Poetry”). This allows for preservation of materials.
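
The publisher does not describe the tooling behind this comparison, so the following is only a minimal sketch of the general idea: a re-keyed transcription is compared word by word against OCR output, and any mismatch is flagged for review. It uses Python's standard difflib module. The function name find_discrepancies and both sample lines are hypothetical and invented for illustration; they are not taken from the database.

```python
import difflib


def find_discrepancies(rekeyed: str, ocr: str) -> list[tuple[str, str]]:
    """Return (rekeyed_text, ocr_text) pairs wherever the two versions differ.

    Hypothetical helper: an assumption for illustration, not the
    publisher's actual process.
    """
    rekeyed_words = rekeyed.split()
    ocr_words = ocr.split()
    matcher = difflib.SequenceMatcher(a=rekeyed_words, b=ocr_words, autojunk=False)

    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the differing stretch from each version side by side.
            discrepancies.append(
                (" ".join(rekeyed_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return discrepancies


# Invented sample lines: the OCR version misreads one word and one punctuation mark.
rekeyed_line = "The river runs beyond the hill,"
ocr_line = "The river rnns beyond the hill."

for rekeyed_part, ocr_part in find_discrepancies(rekeyed_line, ocr_line):
    print(f"re-keyed: {rekeyed_part!r}  OCR: {ocr_part!r}")
```

A comparison like this would only surface the spelling and punctuation discrepancies; deciding which version is correct would still require checking the printed source.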

Access to the collection is strictly by subscription; however, it can be accessed remotely. Since most databases are now primarily used remotely, this designation shows the age of the database a bit, harkening back to the days of library-only or on-campus databases. Other features also show its age, including notes on how to navigate JavaScript, recommendations on which internet browser to use (Internet Explorer is listed), 18 different step-by-step sample searches, an option to change the system color (for user preference), and shortcut keys to navigate the site “without using a mouse.” In today’s touchpad, cloud-based world, many of these features are antiquated, as students and faculty alike are more sophisticated and search-savvy.

American Poetry remains an early model of digitized databases, designed with students and educators (and paid subscriptions) in mind. The publisher, Chadwyck-Healey, boasts that it is used by “specialist researchers to undergraduates alike” and that its full-text primary source materials “create fresh avenues for critical debate, scholarly dialogue, and serendipitous discovery.” While this claim may be a bit far-fetched, this digital collection does make available to the digital world a vast amount of poetry and literature related to “America” and mother “Britain.” For this reason, American Poetry is still very much worth the price of an institutional subscription.

Module #4 focused on the different purposes, methods, and uses of digitization and issues related to it. In creating a Guide to Digitization, one must first consider three essential questions and the answers to them.

What can you capture, and not capture, when you digitize something?

Digitizing an image or object can help to create core content that can represent and disseminate information, text, and at times audio-visual content. However, according to Melissa Terras, “additional infrastructure (such as a database, a website front end, and an explanatory apparatus or additional teaching materials) is required in order to deliver the content successfully to users.”

Which forms of digitization make the most sense for different types of items?

Our activity nicely illustrates how the effectiveness of digitization differs based on the type of item. The following categories were used to digitally assess three images and corresponding videos of the “21st Century Kitchen”: size, weight, color, texture, all sides, sound, and smell. Digital images captured an average of 50 percent of these categories, while digital video captured an average of 90 percent. The objects included text, food products, and inanimate objects. Images worked best for text-heavy items, while videos worked best for objects or substances where texture, size, sound, or weight matter.

To what extent does working with digitized representations impact how we understand different kinds of items, and/or our ability to use them for different purposes?

Marlene Manoff identifies what she calls “Textual Scholarship,” which addresses the physical aspects of a source in addition to the text itself. I think this is important to consider even for sources whose physical aspects many researchers would largely dismiss as an inessential consideration. For artifacts, I think that descriptions, dimensions, and audio-visual aids are extremely helpful but not always practical or affordable for those (libraries, archives, etc.) doing the digitizing. OCR is also an extremely useful technology for translating and providing information from non-traditional texts.

While many digitized representations are used for research and/or learning in an educational environment, several of our readings pointed to the growing interest in and availability of digitized information for public audiences (such as Google Books), as well as commercial digitization ventures.

My “Guide to Digitization” would include the following components:

Follow the digitization guidelines of the Library of Congress for scanning, color settings, and formatting.
As Paul Conway argued, digital humanists should always be cognizant of the intellectual premise, goals, and “meaning-making” that is created through and by the digitization process. As Manoff stated, “If print and electronic versions are different objects, we should not treat them as if they are interchangeable.”
We must discern the cultural purpose, academic relevance, and historical significance of the original item(s) when considering the best method and format of digitization. This credo should guide the digitization of individual items as well as the overarching goals of collections, databases, or exhibits. It includes decisions about whether to enhance, zoom, crop, and so on.
Digitization should never be a substitute for preservation. While digitization can reduce wear and tear, digital surrogates should serve a greater purpose.
Within reason and with given resources, the digitization of images, objects, texts, and other forms of media should be accompanied by as much information and technology as possible including OCR, indexing, searchable terms, and cross-referencing.