Now that the TEC and PAO transcript data are in pipe-delimited CSV format, I can start using batch cleansing techniques to turn the raw OCR output (CSV) into second-phase cleaned CSV. These processes are all automated tasks with no manual intervention. Once again, I did this purposely to keep the chain of steps automated all the way from the original OCR through to the cleaned output CSV, in case I ever need to go back and change one of my earlier OCR settings. Any manual changes to the data would be wiped out by re-running any of the earlier steps.
Choosing Python
Python was never in my stable of languages back when I was a full-time hands-on developer earlier in my career. It’s a bit of a newcomer to me, and to a coder like me it feels like Perl with a whole slew of PHP-like features, making it a great choice for automating content clean-up. I won’t go into too much detail about Python, but I do think it’s cool that it forgoes braces for code structure and instead uses indentation itself. Forced visual organization—I like it.
For example, there’s a little routine that someone in the Spacelog project wrote to clean up transcript timecode tokens. I used this routine against the TEC transcript and it worked perfectly. Here’s an excerpt from it:
A Python routine for cleaning timecode
It takes a timecode in the format 00 00 00 00 (day, hour, minute, second) and uses a list of possible OCR errors for each digit, replacing any of those characters with the corresponding digit. For example, if you find a “B” in the timecode, then it should have been the number “8.”
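The Spacelog routine itself isn’t reproduced here, but a minimal sketch of the idea might look like this (the specific character-to-digit mapping is my own illustration, not the project’s actual table):

```python
# Illustrative sketch of an OCR timecode cleaner (not the actual
# Spacelog code): map common OCR misreads back to the digit they
# should have been, inside a "DD HH MM SS" timecode token.
OCR_DIGIT_FIXES = {
    "O": "0", "o": "0", "D": "0",
    "I": "1", "l": "1", "|": "1",
    "Z": "2",
    "S": "5",
    "G": "6", "b": "6",
    "T": "7",
    "B": "8",
    "g": "9", "q": "9",
}

def clean_timecode(token):
    """Replace likely OCR misreads in a 'DD HH MM SS' timecode token."""
    cleaned = []
    for ch in token:
        if ch.isdigit() or ch == " ":
            cleaned.append(ch)          # already fine, keep as-is
        else:
            cleaned.append(OCR_DIGIT_FIXES.get(ch, ch))
    return "".join(cleaned)

print(clean_timecode("03 12 4B 0S"))  # -> "03 12 48 05"
```

Because the routine only runs on tokens already identified as timecodes, it can be aggressive about substitution without risking damage to the spoken transcript text.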
The process of running a routine like this against the raw CSV and outputting a cleaned CSV makes thousands of changes to the transcript without any manual intervention. As I added new cleaning routines to the Python script, the transcript got progressively cleaner. Some of the Python cleaning functions I wrote are:
- Count page numbers to make sure no pages were skipped in the OCR step
- Treat verbiage rows with no callsign that start with “Tape” as page metadata
- Catch callsign OCR errors by checking each callsign against a list of known speakers:
```python
callsignList = [
    "LAUNCH CNTL", "CAPCOM", "PAO", "SC", "AMERICA", "CERNAN",
    "SCHMITT", "EVANS", "CHALLENGER", "RECOVERY", "SPEAKER",
    "GREEN", "ECKER", "BUTTS", "JONES", "HONEYSUCKLE",
]
```
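One way such a callsign check could work is a fuzzy match against the known-speaker list. This is a sketch of my own, not necessarily how the actual script does it (the original may simply flag unknown callsigns for review); it uses the standard library’s `difflib` to snap a near-miss OCR reading to the closest known callsign:

```python
# Sketch: correct OCR'd callsigns by fuzzy-matching against the
# known-speaker list from the post. The cutoff of 0.8 is an assumed
# threshold, tuned to accept one-character OCR slips but reject
# genuinely unknown speakers.
import difflib

callsignList = [
    "LAUNCH CNTL", "CAPCOM", "PAO", "SC", "AMERICA", "CERNAN",
    "SCHMITT", "EVANS", "CHALLENGER", "RECOVERY", "SPEAKER",
    "GREEN", "ECKER", "BUTTS", "JONES", "HONEYSUCKLE",
]

def fix_callsign(callsign):
    """Return the known callsign closest to the OCR'd one, or the
    original string unchanged if nothing is a near match."""
    matches = difflib.get_close_matches(
        callsign.upper(), callsignList, n=1, cutoff=0.8
    )
    return matches[0] if matches else callsign

print(fix_callsign("CAPC0M"))   # -> "CAPCOM"  (zero misread for "O")
print(fix_callsign("SCHM1TT"))  # -> "SCHMITT" (one misread for "I")
```

Running a check like this over every row surfaces the handful of rows where the speaker column was mangled, without having to eyeball thousands of lines.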