About COHR Interview Collection Project Online Technical Summary

Introduction
How to Use this Document
System Overview
Content Files
    Audio Content
    Text Content
TEI Encoding
Archiving and Preservation
Related Documents
Software and Hardware
FLASH note
Resources

Introduction

This document describes the technical architecture of the UCLA Center for Oral History Research Interview Collection Project (COHR Interview Collection) and is organized into different sections representing each component of the system.

The UCLA Center for Oral History Research Interview Collection project is a collaborative project of the UCLA Center for Oral History Research and the UCLA Digital Library Program.

The COHR Interview Collection project is a multimedia project implemented with Java technology. The data are stored in and retrieved from an Oracle database. The metadata standard we follow is TEI P5. The interview audio files are streamed through a Helix Media Server hosted by the UCLA Library LIT department. The metadata and content files are uploaded into the Oracle database through DLCS (Digital Library Collection System), a system designed and implemented by the UCLA Digital Library Program.

The COHR Interview Collection web site is accessible at http://oralhistory.library.ucla.edu/ .
The project will be permanently preserved at the California Digital Library.

How to Use this Document

Readers should use this document as an overview of the system, not as a detailed technical guide. For more information about a specific technology or component, follow the related links located throughout the document.

System Overview

The COHR Interview Collection Project is a Web application and content publishing system.

Content Files

The text content is stored on a master server and in the Oracle database.

TEI Encoding

We selected the metadata for each interview and mapped each field to the corresponding TEI element.

CDL published a guideline for the TEI header in 2004. This guideline is based on TEI/P5. Considering that there are only minor changes between TEI P4 and P5, and that we plan to deposit our materials with CDL in the future, this guideline is still valuable. http://www.cdlib.org/inside/diglib/stwg/metadata/META_BPG.html

CDL also published a guideline for encoding oral histories, titled "California Digital Library TEI best practice guidelines for Encoding Oral Histories". I followed it while we worked on the pilot project. CDL has since removed it from their website for some reason, but I still have a printed copy. It targets TEI P4.

DLF published a guideline for TEI encoding: http://www.diglib.org/standards/tei.htm

Archiving and Preservation

The content files and metadata of the interviews are archived in house at the UCLA Library. Several servers host the master audio and text content files, and systematic backup policies are applied to these servers.

The California Digital Library, a unit of the UC Office of the President (based in Oakland), offers the Digital Preservation Repository (DPR: http://www.cdlib.org/inside/projects/preservation/dpr/ ). Besides in-house archiving, the UCLA Digital Library Program is the gateway for UCLA materials to be deposited in the CDL DPR.

Related Documents

XML Templates and Examples:

Before you convert a Word file into an XML file, you need to filter special characters. The following website gives clear directions on which special characters to filter: http://xml.silmaril.ie/authors/specials/
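
As a rough illustration of this filtering step (not taken from the linked page), the sketch below escapes the characters that have special meaning in XML markup. The function name escape_for_xml is hypothetical, and text copied from Word may need further cleanup that this sketch does not attempt.

```python
from xml.sax.saxutils import escape


def escape_for_xml(text: str) -> str:
    """Escape the characters that are significant in XML markup.

    escape() always replaces &, < and >; the extra mapping also
    replaces the two quote characters, which matters inside attributes.
    """
    return escape(text, {'"': "&quot;", "'": "&apos;"})


if __name__ == "__main__":
    print(escape_for_xml('Q&A: "oral history" < 60 min'))
```

Apply the function to plain text only, before it is wrapped in tags; running it over text that already contains markup would escape the tags as well.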

Biography/Interview History:
1. Simple biography: xml template 1    xml example 1    Apply XSLT on the example 1
2. Biography with headers: xml template 2    xml example 2    Apply XSLT on the example 2
3. Biography with list: xml template 3    xml example 3    Apply XSLT on the example 3
4. Stylesheet: You can download a simple XSLT stylesheet that can be applied to the examples above. To apply the stylesheet to your XML files, insert the following line at the top of each file (after the XML declaration, if there is one):
<?xml-stylesheet type="text/xsl" href="your style sheet.xsl"?>
See the three examples above with the XSLT file applied.
With an XSLT processor, or any software that can apply a stylesheet to an XML file, you can easily produce an HTML page. Modify the stylesheet above if you want a different HTML page.

Work flow to upload current interview files into DLCS

Step 1: Upload the WAV audio files into DLCS once they are ready. Create sessions under the interview, and set all sessions and the interview itself to the "in progress" working status, which is the default for any new item created in DLCS. (Follow the Center's directions for making an MP3 file and removing the WAV from the O drive.)

Step 2: Once the transcriber has produced the transcript, start working on the transcript XML files by following Convert InqScribe XML into UCLA XML . The final transcript, which includes every session, needs to be a simplified TEI/P5 XML file. (We have about 40 XML files that have a long TEI header. We would like to simplify this process in the future by having the system generate a detailed XML file that includes the metadata as the header.)

Step 3: Milestone tags: If the final transcript XML file doesn't have milestone tags for the timestamps, you need to carry out this step. There are two ways to apply milestone tags to timestamps:

  • Slow Way
  • Quick Way
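
Neither linked procedure is reproduced here. Purely as an illustration of what the step produces, the sketch below turns bracketed timestamps into empty TEI milestone elements. The timestamp form ([hh:mm:ss] or [hh:mm:ss.ff]), the unit value "timestamp", and the function name add_milestones are all assumptions; check them against a real transcript and the UCLA schema before relying on this.

```python
import re

# Assumed timestamp form: [hh:mm:ss] or [hh:mm:ss.ff].
# Adjust the pattern if the real transcripts use a different form.
TIMESTAMP = re.compile(r"\[(\d{2}:\d{2}:\d{2}(?:\.\d{2})?)\]")


def add_milestones(text: str) -> str:
    """Replace each bracketed timestamp with an empty milestone element.

    The attribute values (unit="timestamp", n=<the time>) are
    illustrative choices, not confirmed project conventions.
    """
    return TIMESTAMP.sub(r'<milestone unit="timestamp" n="\1"/>', text)


if __name__ == "__main__":
    print(add_milestones("[00:01:23] I was born in Los Angeles."))
```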

Step 4: Insert the above file into the body of the simplified TEI P5 XML template, then validate your document against our UCLA TEI Lite schema. Declare the schema on the root element as follows:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://digital2.library.ucla.edu/xslt/schema/teiliteOH.xsd">
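
Full validation against the schema needs a validating tool such as Oxygen. As a lighter pre-check, the sketch below only confirms that a file is well-formed XML and that its root element is TEI in the TEI namespace; it does not check the document against teiliteOH.xsd. The function name precheck_tei is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def precheck_tei(xml_text: str) -> bool:
    """Return True if the text is well-formed XML with a TEI root element.

    This is not schema validation: it cannot detect misplaced or
    missing elements inside the document.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # ElementTree writes namespaced tags as {namespace}localname.
    return root.tag == f"{{{TEI_NS}}}TEI"


if __name__ == "__main__":
    sample = f'<TEI xmlns="{TEI_NS}"><text><body/></text></TEI>'
    print(precheck_tei(sample))
```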

Each <sp> tag needs a unique xml:id. This ID is very important because it identifies each individual speech, so we have to keep it. To make the IDs unique across sessions, take each session file out and add the session number to the xml:id. For example, an sp tag with ID 5 in session one becomes <sp xml:id="sp1-5"> , and an sp tag with ID 5 in session two becomes <sp xml:id="sp2-5">. You can do this while working on step 5, and then copy each session's content back into the final transcript file. This is simple to do with Oxygen's find-and-replace function.

Step 5: Save each session as a short session XML file.
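
For anyone who prefers a script to find and replace, here is a minimal sketch of the same renumbering. It assumes the original IDs are either a bare number or "sp" followed by a number (for example xml:id="sp5"); the function name prefix_sp_ids is hypothetical.

```python
import re

# Matches the xml:id of an <sp> tag whose value is a number,
# optionally preceded by "sp" (assumed original form).
SP_ID = re.compile(r'(<sp\b[^>]*?\bxml:id=")(?:sp)?(\d+)"')


def prefix_sp_ids(session_xml: str, session: int) -> str:
    """Rewrite every <sp xml:id="..."> in one session as sp<session>-<n>."""
    return SP_ID.sub(rf'\g<1>sp{session}-\g<2>"', session_xml)


if __name__ == "__main__":
    print(prefix_sp_ids('<sp xml:id="sp5"><p>Hello.</p></sp>', 2))
```

IDs that already carry a session number, such as sp1-5, do not match the pattern and are left alone, so running the function twice on the same session is harmless.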

Step 6: Apply the following XSLT to generate an HTML file.

Step 7: Upload the full transcript XML file and the session XML files into DLCS. Be sure to comment out the XSLT link before you upload the full transcript XML file:
<!--<?xml-stylesheet type="text/xsl" href="http://digital2.library.ucla.edu/xslt/local/interviewDisplayInHouse.xsl"?> -->
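
If many files need this change, the commenting-out can be scripted. The sketch below is one possible approach, not part of the documented workflow; the function name comment_out_stylesheet is hypothetical. It wraps any xml-stylesheet processing instruction in a comment and leaves an already commented one unchanged.

```python
import re

# An xml-stylesheet processing instruction, with any comment markers
# that may already surround it.
STYLESHEET_PI = re.compile(
    r"(?:<!--\s*)?(<\?xml-stylesheet\b.*?\?>)(?:\s*-->)?",
    re.DOTALL,
)


def comment_out_stylesheet(xml_text: str) -> str:
    """Wrap every xml-stylesheet instruction in an XML comment."""
    return STYLESHEET_PI.sub(r"<!--\1 -->", xml_text)


if __name__ == "__main__":
    doc = '<?xml-stylesheet type="text/xsl" href="display.xsl"?>\n<TEI/>'
    print(comment_out_stylesheet(doc))
```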

You also need an HTML file for the transcript, generated by applying an XSLT stylesheet to the transcript above. Lisa, sorry, I am leaving this part of the work for you.
<?xml-stylesheet type="text/xsl" href="http://digital2.library.ucla.edu/xslt/local/interviewFinalDisplay.xsl"?>

If you get lost, just take a look at our example interview at: Interview of Francine Diamond

Work flow to upload previous interview files into DLCS

Step one: Locate the PDF file and convert it into a Word file, then into an XML file. The PDF files are located at: Lis35:\oral_history\UCLA Oral History Research Center\COHR Transcripts . They come in two formats: text PDFs and image PDFs. Text PDFs are easy to convert into text or Word files. Image PDFs produce many errors when converted to text, and you will need the original printed transcript book to correct them. Once you have a DOC file, follow the previous steps to convert it into an XML file and upload it into DLCS.
  • Steps for converting PDF files into DOC files.

Software and Hardware

License information is stored at Lis55d$:\APPS\DLP\
  • InqScribe: the software we chose for transcription. We bought several licenses, but it turns out only one is necessary: for opening an InqScribe file and converting it into XML files.
  • Oxygen: we use it to edit and transform XML files. DLP has a couple of licenses and COHR has two. (Oxygen has problems with the most recent Java Virtual Machine, so you may want to install the version that comes bundled with a virtual machine: http://www.oxygenxml.com/InstData/Windows/VM/oxygen.exe )
  • AVS Video Tool: DLP ordered only one copy, for me to work on video files. It is a small, neat tool for converting video formats.
  • Super: free software that converts (encodes) or plays any multimedia file. It seems to work well.
  • Adobe Web Premium CS3: the library has a site license for web programmers who work on Reddot (the library covers the fee). As the DLP Reddot programmer, I have one copy installed on my machine. COHR has one copy installed on one machine for anyone who does Reddot programming.
  • Sony Sound Forge 9: the audio editing tool we use. DLP bought two copies, and the CDs have been returned to Stephen.
  • Adobe Premiere Pro CS3: the video editing tool we use. DLP bought one copy, which is installed on the student computer in Stephen's office.

Resources