At Trakstar Hire, one of the most common things users do is view candidate resumes. We show a quick preview of the resume so that users don’t have to download and then open each file. We want this preview to be fast and efficient, because a user typically screens scores of resumes in one sitting. It turns out that this is a tricky problem to solve.
Trakstar Hire accepts resumes in almost every common format (doc, docx, pdf, rtf, odt, etc.). One way to preview these is to use an embedded document viewer (like Google, Zoho, Scribd, etc.). This requires either (a) on-the-fly conversion, which makes the preview slow, or (b) storing a copy of the document with the viewer service, which is also slow the first time, increases costs, and means trusting another service with our clients’ data. We weren’t comfortable with either.
We decided to give our users a text preview of a resume. On the face of it, this seems pretty easy. A tool like Apache Tika is fairly reliable and fast at extracting text. However, resumes simply converted to plain text are not very readable, especially if the original file has a lot of formatting. Tables get distorted, and well-formatted resumes very often look jumbled. We wanted a better alternative.
The approach we finally chose is to first convert all documents to HTML, and then convert this HTML to text. HTML can preserve the important formatting, and it gives us a common base for the text conversion. Most of our early searches for document-to-HTML converters led to OpenOffice (run as a headless server). Unfortunately, OpenOffice was buggy and crashed often when batch converting documents to HTML. It caused more sleepless nights than the rest of my application code put together. We finally discovered AbiWord, which does this a lot more reliably.
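To make the first step concrete, here is a minimal sketch of calling AbiWord from Python to convert one document to HTML. It is an illustration rather than our production code: the function names, the output directory layout, and the 60-second timeout are assumptions made for this example, and it assumes an `abiword` binary on the `PATH` that accepts the `--to` and `--to-name` options.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def abiword_command(src: Path, out_html: Path) -> List[str]:
    """Build the command line for a headless AbiWord conversion to HTML."""
    return ["abiword", "--to=html", f"--to-name={out_html}", str(src)]


def convert_to_html(src: Path, out_dir: Path, timeout: int = 60) -> Optional[Path]:
    """Convert one document to HTML.

    Returns the path of the HTML file, or None if the conversion failed,
    timed out, or produced no output. The timeout value is an assumption.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    out_html = out_dir / (src.stem + ".html")
    try:
        subprocess.run(
            abiword_command(src, out_html),
            check=True,            # raise if AbiWord exits with an error
            capture_output=True,   # keep converter noise out of our logs
            timeout=timeout,       # do not let one bad file stall a batch
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return out_html if out_html.exists() else None
```

Running each conversion as its own short-lived process with a timeout means one bad file cannot take down a whole batch, which was the main problem with a long-running headless server.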
I am currently testing AbiWord with more than 100k docs. So far it’s working fine. I see a few glitches when it converts PDF to HTML, but I think I can live with that. I’ll use pdf2html instead of AbiWord for PDF documents.
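The routing decision described above can be sketched in a few lines. This only picks which converter to use from the file extension; the function name and constants are hypothetical, and actually invoking either tool is deliberately left out.

```python
from pathlib import Path

# Converter names as used in this post; how each tool is invoked is not shown here.
PDF_CONVERTER = "pdf2html"
DEFAULT_CONVERTER = "abiword"


def pick_converter(filename: str) -> str:
    """Choose the HTML converter for a file based on its extension."""
    if Path(filename).suffix.lower() == ".pdf":
        return PDF_CONVERTER
    return DEFAULT_CONVERTER
```

Matching on the lowercased suffix keeps files like `Resume.PDF` on the PDF path. A stricter version might sniff the file contents instead of trusting the extension.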
Converting HTML to (readable) text is the other piece of the puzzle. We solve this using w3m, a text-based web browser. It does a good job of converting markup to readable text (using dots, dashes, etc.). The end result is quite pretty.
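As a rough sketch of this second step, the snippet below pipes an HTML string through w3m and returns the rendered text. The function names, the 80-column width, and the timeout are assumptions for illustration. It assumes `w3m` is installed and relies on its `-dump`, `-T`, and `-cols` options.

```python
import subprocess
from typing import List


def w3m_command(cols: int = 80) -> List[str]:
    """Build a w3m command that renders HTML from stdin as plain text."""
    return ["w3m", "-dump", "-T", "text/html", "-cols", str(cols)]


def html_to_text(html: str, cols: int = 80, timeout: int = 30) -> str:
    """Render an HTML string to readable text using w3m.

    Raises subprocess.CalledProcessError if w3m exits with an error and
    subprocess.TimeoutExpired if it takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        w3m_command(cols),
        input=html,           # w3m reads the document from stdin
        capture_output=True,
        text=True,            # send and receive str rather than bytes
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Calling `html_to_text(open("cv.html").read())` on the output of the HTML conversion step would give the preview text. The column width is worth tuning to the width of the preview pane, since w3m wraps lines and lays out tables to fit it.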
Hope this is useful for others who encounter a similar problem.