Ben Feist

Digitizing Apollo 17 Part 5 – Python Processing

Posted on April 30, 2012 by Feist Posted in Project Apollo 17

Now that the TEC and PAO transcript data is in pipe-delimited CSV format, I can start using batch cleansing techniques to turn the raw OCR output CSV into a cleaned CSV. These processes are all automated, with no manual intervention. Once again, I did this on purpose: keeping the whole string of steps automated, from the original OCR through to the cleaned output CSV, means I can go back and change one of my earlier OCR settings if I ever need to. Any manual changes to the data would be wiped out by re-exporting from any of the earlier steps.

 

Choosing Python

Python was never in my stable of languages back when I was a full-time hands-on developer earlier in my career. It’s new to me, and to a coder like me it feels like Perl with a whole slew of PHP-like features, which makes it a great choice for automating content clean-up. I won’t go into much detail about Python, but I do think it’s cool that it forgoes braces for establishing code structure and uses indentation instead. Forced visual organization: I like it.

For example, there’s a little routine that someone in the Spacelog project wrote to clean up transcript timecode tokens. I used this routine against the TEC transcript and it worked perfectly. Here’s an excerpt from it:

It takes a timecode in the format 00 00 00 00 (day, hour, minute, second) and uses a list of possible OCR errors for each digit and replaces any of those characters with the corresponding digit. For example, if you find a “B” in the timecode, then it should have been the number “8”.
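As a rough sketch of the idea (not the Spacelog routine itself), the substitution could look something like the following. The `OCR_DIGIT_FIXES` table and the `fix_timecode` name are hypothetical; only the "B" to "8" mapping comes from the description above, and the other mappings are illustrative guesses at typical OCR misreads.

```python
import re

# Hypothetical table of OCR misreads -> intended digit.
# Only "B" -> "8" is taken from the post; the rest are illustrative assumptions.
OCR_DIGIT_FIXES = {
    "B": "8",
    "O": "0",
    "o": "0",
    "l": "1",
    "I": "1",
    "S": "5",
    "Z": "2",
}

# A valid timecode is four two-digit fields: day, hour, minute, second.
TIMECODE_RE = re.compile(r"^\d{2} \d{2} \d{2} \d{2}$")


def fix_timecode(token):
    """Return a cleaned 'DD HH MM SS' timecode, or None if it can't be repaired."""
    cleaned = "".join(OCR_DIGIT_FIXES.get(ch, ch) for ch in token.strip())
    # Collapse any runs of whitespace between the fields.
    cleaned = " ".join(cleaned.split())
    return cleaned if TIMECODE_RE.match(cleaned) else None


print(fix_timecode("0B 12 O4 5l"))  # 08 12 04 51
```

Returning `None` for anything that still doesn't look like a timecode keeps the unrepairable tokens visible instead of silently passing them through.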

The process of running a routine like this against the Raw CSV and outputting a Cleaned CSV makes thousands of changes to the transcript without any manual intervention. As I added new cleaning routines to the Python script I improved the cleanliness of the transcript. Some of the Python cleaning functions I wrote are:

  • Count page numbers to make sure no pages were skipped in the OCR step
  • Look for verbiage rows with no callsign that start with “Tape” and treat them as page metadata
  • Look for callsign OCR errors by checking each callsign against a list of known speakers (callsignList = [ "LAUNCH CNTL", "CAPCOM", "PAO", "SC", "AMERICA", "CERNAN", "SCHMITT", "EVANS", "CHALLENGER", "RECOVERY", "SPEAKER", "GREEN", "ECKER", "BUTTS", "JONES", "HONEYSUCKLE" ])
  • Look for out-of-order timestamps. I found thousands (more on this later)
  • A complex one: check whether the first entry on a given page has no callsign but isn’t a Page number or Tape number. If so, it is a continuation of the last entry on the previous page, so append its content to that entry. This one found and fixed hundreds of wrapped verbiage entries.
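To give a feel for three of the checks in the list above, here is a minimal sketch, not my actual script. The function names and the row layout (pages as lists of `(callsign, text)` pairs) are assumptions made for illustration, and the use of `difflib` to suggest a close match is my own choice for the example rather than something the original routines necessarily did.

```python
import difflib

# Known speakers, copied from the callsign list above.
CALLSIGNS = [
    "LAUNCH CNTL", "CAPCOM", "PAO", "SC", "AMERICA", "CERNAN", "SCHMITT",
    "EVANS", "CHALLENGER", "RECOVERY", "SPEAKER", "GREEN", "ECKER", "BUTTS",
    "JONES", "HONEYSUCKLE",
]


def suggest_callsign(callsign):
    """Return the callsign if known, else the closest known one (or None)."""
    if callsign in CALLSIGNS:
        return callsign
    matches = difflib.get_close_matches(callsign, CALLSIGNS, n=1, cutoff=0.7)
    return matches[0] if matches else None


def find_out_of_order(timestamps):
    """Return indexes of 'DD HH MM SS' rows that are earlier than the row before."""
    flagged = []
    previous = None
    for index, stamp in enumerate(timestamps):
        if not stamp:
            continue  # rows without a timestamp are skipped
        current = tuple(int(part) for part in stamp.split())
        if previous is not None and current < previous:
            flagged.append(index)
        previous = current
    return flagged


def merge_page_continuations(pages):
    """Join a callsign-less first row of a page onto the previous page's last row."""
    merged = []
    for page in pages:
        for position, (callsign, text) in enumerate(page):
            is_metadata = text.startswith(("Page", "Tape"))
            if position == 0 and not callsign and not is_metadata and merged:
                last_callsign, last_text = merged[-1]
                merged[-1] = (last_callsign, last_text + " " + text)
            else:
                merged.append((callsign, text))
    return merged
```

Note that `find_out_of_order` only compares each row with its immediate predecessor, so a single timestamp that is far too late will cause the row after it to be flagged rather than the outlier itself. Flagged rows still need a human look.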

Drawing the Line from OCR to Phase 2

At some point in the middle of writing and running these Python routines, slowly making the output cleaner and cleaner, I decided to stop using the OCR output CSV as the source. I drew a line in the sand that meant from this point forth I would be making manual changes to the cleaned TEC CSV file and would no longer export to CSV from FineReader. This next step meant making a “Cleaned CSV Phase 2” file. In other words, the input would now be the “Cleaned CSV” that contained partially Python-processed material based on the OCR output, and the output would be a new “Cleaned CSV Phase 2” file.

Simple Search and Replace

Making the move to Phase 2 cleaning allowed me to perform a series of carefully worded search and replace operations on the Cleaned CSV file. These targeted common OCR errors in the Tape titles, for example, and easily identifiable errors that recurred across verbiage entries. Using “replace all” often resulted in thousands of corrections per action, slowly but surely making the content cleaner.
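The same kind of pass could also be scripted so that each replacement reports how many corrections it made. This is only a sketch: the `REPLACEMENTS` pairs below are made-up examples, not the searches I actually ran, and `replace_all` is a hypothetical helper name.

```python
# Hypothetical (wrong, right) pairs; the real search terms are not listed here.
REPLACEMENTS = [
    ("Tapc ", "Tape "),
    ("Hous ton", "Houston"),
]


def replace_all(text, replacements):
    """Apply each literal replacement in order and count the hits for each one."""
    counts = {}
    for wrong, right in replacements:
        counts[wrong] = text.count(wrong)
        text = text.replace(wrong, right)
    return text, counts


cleaned, counts = replace_all("Tapc 1\nHous ton, Tapc 2", REPLACEMENTS)
print(counts)  # {'Tapc ': 2, 'Hous ton': 1}
```

Keeping the search strings literal and specific (note the trailing space in the first one) is what makes a blanket "replace all" safe to run across the whole file.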

Timestamp Token Transform

This Phase 2 process was where many of the Python cleaning routines from the bullet list above were written. One big scripting item accomplished in Phase 2 was converting the timestamp tokens from 00 00 00 00 format (day, hour, minute, second) to 000:00:00 format. This second format is hours:minutes:seconds and is referred to as GET, or Ground Elapsed Time, in the mission audio. It would be easy to convert back again if need be for a given output format, but for my purposes I wanted GET because that’s what the PAO transcripts use and what the astronauts themselves used when they spoke throughout the mission. I also wrote a routine to fill in missing timestamps by simply copying the last known timestamp onto every line that was missing one. This is an interim step on the way to the final cleaned output, but it makes every transcript row a complete record.
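Both steps are simple enough to sketch in a few lines. The function names `to_get` and `fill_missing` are hypothetical, and the sketch assumes a missing timestamp is represented as an empty string, which may not match the real CSV.

```python
def to_get(timecode):
    """Convert a 'DD HH MM SS' timecode to 'HHH:MM:SS' Ground Elapsed Time."""
    days, hours, minutes, seconds = (int(part) for part in timecode.split())
    # Days are folded into the hour count: 1 day 2 hours becomes 026 hours.
    return f"{days * 24 + hours:03d}:{minutes:02d}:{seconds:02d}"


def fill_missing(timestamps):
    """Copy the last known timestamp onto every row that has none."""
    filled = []
    last_known = ""  # rows before the first real timestamp stay empty
    for stamp in timestamps:
        if stamp:
            last_known = stamp
        filled.append(last_known)
    return filled


print(to_get("05 14 30 07"))  # 134:30:07
print(fill_missing(["000:00:10", "", "", "000:01:00"]))
```

Going the other way is just a `divmod` of the hour count by 24, which is why converting back for a given output format would be easy.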

The resulting CSV can be found here.

My thoughts then turned to how to address the timestamp issues as a whole. Some timestamps are approximate guesses, many are missing, and history is blurry in this regard. I slowly realized that to establish a true mission timeline I would have an even larger task ahead than simple transcript restoration could address.

© Ben Feist