Ben Feist

Digitizing Apollo 17 Part 3 – New OCR Techniques

Posted on March 30, 2012 by Feist. Posted in Project Apollo 17.

As discussed in my previous post, the Apollo 17 PDFs contained an early attempt at recognizing the typewritten text using Adobe Acrobat’s built-in OCR functionality. Working from Adobe’s OCR output would have meant a huge amount of manual labour, which rather defeats the purpose of using OCR to begin with.

I did some research and found that ABBYY FineReader 11 offered many interesting features that I could use to extract the transcript data into a format suitable for further processing with Python cleanup scripts (I’ll cover that cleanup step in a later post). The key feature that made me select FineReader was its ability to detect patterns in a page and create table grids that compartmentalize the recognized data into regions. This auto-detection was only about 80% accurate and required manual intervention on almost every page of the transcript, but it was the best combination of automation and manual labour for getting a result that was free of the issues in the PDF and could be manipulated further.

In the screenshot above you can see the grid I established for the first page of the transcript. This grid breaks up the timecode, speaker, and verbiage data into three separate table columns. It also breaks each event into its own table row. Establishing the rows has the added benefit of allowing wrapped verbiage data to remain in the verbiage column without disrupting the timestamp data. The right side of the screenshot is the resulting OCR output from FineReader. Characters that are highlighted in cyan are considered “maybe” characters by FineReader. Theoretically, if I trained FineReader to understand these “maybe” occurrences, it would be more accurate on subsequent pages. I spent a lot of time training and retraining but never saw any benefit. I knew that I would have to read through the text manually later anyway, but this step wasn’t the time to do so.

This step lasted a few weeks. Each evening after work I put on some tunes and carefully established this grid for every page of the transcript. Some of the pages were scanned at a 5 – 10 degree angle, which caused FineReader to completely panic and interpret the entire page as garbage. This only happened 10 – 15 times though. Those pages will have to be re-keyed at a later stage.

Digital Output

Another great feature of FineReader is its ability to output table results directly to Excel. This allowed me to perform the first steps of testing and scrubbing the output data. Here’s a link to the first 500 pages of the TEC transcript as output directly by FineReader, if you’re interested. You’ll notice that the original Tape numbers are contained in the output. This was done deliberately so that I could refer back to the original typewritten page from any point in the transcript.

In the screen grab above you can see that within Excel the transcript is intact. Timestamp, speaker, and verbiage are in three columns, and the lines that were wrapped in the typewritten transcript are no longer wrapped. I should point out that this didn’t “just work”. I had made errors in tabling some of the pages, resulting in big mistakes in the output. Looking at the column sizes in Excel helped me to discover these errors and correct them in FineReader before outputting again, and again. I was careful not to make any manual changes to the resulting CSV (Excel) output. This was important because I wanted to ensure that if I had an insight somewhere down the line about additional OCR cleaning, I still had the option of doing something differently in FineReader and just generating the output again. If I had tampered with the output, any new output from FineReader would overwrite my manual changes.
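The column-size check described above was done by eye in Excel, but the same idea can be sketched in Python. This is only an illustration, not the process actually used: the function name, the length limits, and the sample rows are all hypothetical, and it assumes the output has been saved as a three-column CSV (timecode, speaker, verbiage).

```python
import csv
import io

# Hypothetical limits: a timecode or speaker cell longer than this has
# probably swallowed text that belongs in the verbiage column.
MAX_TIMECODE_LEN = 12
MAX_SPEAKER_LEN = 8


def find_suspect_rows(csv_text):
    """Return (row_number, reason) pairs for rows that look mis-tabled."""
    suspects = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if len(row) != 3:
            suspects.append((row_number, f"expected 3 columns, found {len(row)}"))
            continue
        timecode, speaker, _verbiage = (cell.strip() for cell in row)
        if len(timecode) > MAX_TIMECODE_LEN:
            suspects.append((row_number, "timecode cell too long"))
        if len(speaker) > MAX_SPEAKER_LEN:
            suspects.append((row_number, "speaker cell too long"))
    return suspects


# Made-up sample rows for illustration only, not real transcript data.
sample = (
    "00 00 00 03,CDR,Example line one.\n"
    "00 00 00 10 CC Example line that landed in the wrong column.,,\n"
    "00 00 00 15,LMP,Example line three.\n"
)

for row_number, reason in find_suspect_rows(sample):
    print(f"row {row_number}: {reason}")
```

A report like this only points at rows worth inspecting. The fix still belongs in the FineReader grid, so that the CSV can be regenerated rather than edited by hand.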

In this screenshot you can see the number of large milestones that were hit in the main effort to get clean CSV output from FineReader

The timestamps originally came out as all manner of different characters, not only digits. This and many other problems were to be cleaned in the next step, Python processing of the CSV output, which I will cover in a future post.
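As a rough illustration of the timestamp problem, here is a minimal Python sketch that tests whether a timecode cell is clean and lists the stray characters when it is not. The timecode layout (four two-digit groups separated by single spaces) is an assumption made for this example, as are the function names and the sample strings. The actual cleanup scripts are covered in the later post.

```python
import re

# Assumption: a clean timecode is four two-digit groups separated by spaces.
TIMECODE_PATTERN = re.compile(r"\d{2} \d{2} \d{2} \d{2}")


def is_clean_timecode(cell):
    """True if the whole cell matches the assumed timecode layout."""
    return TIMECODE_PATTERN.fullmatch(cell.strip()) is not None


def stray_characters(cell):
    """Return the distinct non-digit, non-space characters in a cell, sorted."""
    return sorted({ch for ch in cell if not ch.isdigit() and not ch.isspace()})


# Made-up examples of the kind of OCR confusion described above.
for cell in ["00 00 12 34", "0O 00 l2 34"]:
    print(repr(cell), is_clean_timecode(cell), stray_characters(cell))
```

Listing the stray characters, rather than only rejecting the cell, makes it easier to spot recurring confusions such as a letter standing in for a digit.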

 

 

© Ben Feist