If you’re interested in this project, have a quick flip through the PDF of the technical air-to-ground mission audio transcript to get an idea of what the source material is like. The raw PDF document was published courtesy Stephen Garber (NASA HQ) and Glen Swanson (JSC) (55MB PDF). These transcripts were originally typed in 1972 by NASA typists.
“NASA Public Affairs employed legions of typists stationed in telephone booth-sized rooms whose single job was converting voice to paper. Armed with reel-to-reel tape players, electric typewriters, and reams of paper, these individuals hammered out transcripts within hours of when the astronauts first spoke the words.” – Glen E. Swanson
The PDF from NASA was digitized so that you can select text for copy/paste etc. This was probably done some time in the last decade. However, the way it was done makes the result almost useless for extracting the information for digital manipulation. NASA can’t be blamed for this; it was probably the native OCR (optical character recognition) function within Adobe Acrobat that did such a poor job. Plus, I’m sure making the text machine-readable wasn’t the primary purpose of turning the original typewritten pages into a PDF.
For example, take this excerpt of the TEC transcript:
When the content is selected and copied from the PDF it turns into this:
APOLLO 17 AIR-T0-GROUND VOICE TRANSCRIPTION 00 00 00 03 CDR Roger. The clock has started. We have yaw. 00 00 00 12 CDR Roger; tower. Yaw's complete. We're into roll, Bob. 00 00 00 17 CC Roger, Geno. Looking great. Thrust good on all five engines. CDR Okay, babe. It 's looking good here. 00 00 00 21 CDR Roll is complete. We are pitching. SC Wow woozle I
It doesn’t look too bad at first, but upon closer inspection you can see that there are a few problems:
- There are many OCR errors: the letter O is rendered as a 0 in many cases, the exclamation mark comes out as an “I”, and there are stray spacing errors.
- The line wrapping of the 3rd line contains a hard carriage return that puts the remainder of the line into the timecode column.
- There is no delimiter between timecode, speaker, and verbiage other than a space. Combined with the hard wrapping, that leaves no immediately evident way to automate the separation of the information. It would be yet another huge manual effort to clean it up.
You’ll also notice that the 4th line contains no timecode. In fact, throughout the Apollo 17 transcripts there are thousands of these missing timecodes. This appears to be an issue unique to the Apollo 17 transcripts. One person suggested that this might be due to NASA getting lazy: Apollo 17 was known to be the last flight, so there was no need to learn from it for the next one. Whatever the reason, it makes the resulting restoration effort even more difficult.
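To make the problem concrete, here is a minimal Python sketch of what a naive automated split looks like on the copied text above. It is only an illustration, not the approach used in this project: the regular expression and the `naive_parse` name are my own assumptions, and the pattern assumes every entry starts with a four-field timecode followed by a short uppercase speaker code.

```python
import re

# The copied text from the excerpt above, unchanged (OCR errors included).
RAW = (
    "APOLLO 17 AIR-T0-GROUND VOICE TRANSCRIPTION "
    "00 00 00 03 CDR Roger. The clock has started. We have yaw. "
    "00 00 00 12 CDR Roger; tower. Yaw's complete. We're into roll, Bob. "
    "00 00 00 17 CC Roger, Geno. Looking great. Thrust good on all five engines. "
    "CDR Okay, babe. It 's looking good here. "
    "00 00 00 21 CDR Roll is complete. We are pitching. SC Wow woozle I"
)

# Assumed entry layout: timecode, speaker code, then text running up to
# the next timecode (or the end of the input).
ENTRY = re.compile(
    r"(\d\d \d\d \d\d \d\d) ([A-Z]{2,4}) (.*?)(?= \d\d \d\d \d\d \d\d |$)"
)

def naive_parse(text):
    """Return (timecode, speaker, words) tuples found by the naive pattern."""
    return [m.groups() for m in ENTRY.finditer(text)]

for timecode, speaker, words in naive_parse(RAW):
    print(timecode, "|", speaker, "|", words)
```

The excerpt contains six utterances, but this split only finds four entries. The two utterances without a timecode ("CDR Okay, babe..." and "SC Wow woozle I") are silently absorbed into the text of the entry before them, which is exactly the kind of damage that would then have to be hunted down by hand.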
The Internet Archive includes high resolution JP2 images of the transcripts. JP2 is possibly the least helpful image format ever invented. It’s not widely supported and in my opinion is a dubious choice for archiving content. I wrote a batch job in Photoshop to convert all 2,460 JP2 pages to PNG format. PNG is a lossless format, so the conversion itself introduced no further loss beyond whatever compression was already applied to the JP2 files (JPEG 2000 supports both lossy and lossless modes). This folder of PNGs will serve as the input to the next step: re-OCRing every page of the Apollo 17 TEC transcript.
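I did the conversion in Photoshop, but for anyone without it, the same batch job could probably be scripted. The following is a rough Python sketch, not what I actually ran. It assumes the Pillow library is installed with JPEG 2000 support (which depends on OpenJPEG being available), and the function names and folder names are hypothetical.

```python
from pathlib import Path

def plan_conversions(src_dir, dst_dir):
    """Pair every .jp2 file in src_dir with a same-named .png path in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [(p, dst_dir / (p.stem + ".png")) for p in sorted(src_dir.glob("*.jp2"))]

def convert_all(src_dir, dst_dir):
    """Convert each planned JP2 page to PNG and return the number of pages written."""
    # Imported here so the planning step works even where Pillow is missing.
    from PIL import Image  # assumption: Pillow built with OpenJPEG

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    pairs = plan_conversions(src_dir, dst_dir)
    for src, dst in pairs:
        with Image.open(src) as img:
            img.save(dst, format="PNG")  # PNG is lossless, so nothing more is lost here
    return len(pairs)

# Example call with hypothetical folder names:
# convert_all("jp2_pages", "png_pages")
```

Keeping the file names identical apart from the extension means each PNG can still be traced back to its source page in the Internet Archive scan.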