Tuesday, 23 October 2007

PDF Manipulation; Tesseract OCR

Little bit of story to begin.

Working in a small print shop, I've had to deal with pdfs a lot more. One reason is setup: customers bring in all sorts of wonky files at the wrong finished size and layout. Customers make files in Word, which may spontaneously change formatting (hey... how are WE supposed to know what your document is supposed to look like?). We get a lot of bands printing 11x17 posters, but they also want handbills, which are not set up. "Ready to print" is rarely the case.

Word also likes to not print coloured backgrounds, even if you tell it to.

For pdf manipulation, I use three main tools: Adobe Photoshop, Adobe Acrobat Professional, and Xerox Freeflow Makeready.

The beauty of Photoshop is that you can actually open up a PDF. It will rasterize the file (essentially "make it into a picture"), which is great if you're getting issues with formatting. The downside is that it rasterizes the file. Gone is the editability (is that a word?) of the PDF. Always save as a copy before working on an image!

This sheds light on the structure of the pdf as well. PDFs are similar in concept to web pages (HTML, if you're familiar with it). The document isn't itself an image. It is a document describing the structure of different elements (text, fonts, images) and how they should be displayed. This is why you are able to search many pdfs.

With extended use of Freeflow, it becomes obvious too that making a document n-up (multiple copies of the same thing on a page) is a function of the markup in the PDF. If you make a document 4-up, the size of the file doesn't change (or changes by a minuscule amount). However, if you make a (rasterized) image 4-up and save it as an image again, it gets much larger.

Acrobat Professional is great for some purposes, but some of the tools are clunky to use. Worse still, some tools that look intuitive to use, aren't. Furthermore, it's still buggy. Crop kind of works. The "layers" tab is essentially useless.

Freeflow has some great tools. For example, you can view the contents of a pdf in its layers, and manipulate the layers (you could insert an image which is one colour and move it to the bottom of the layers... making it the background colour). It's also really easy to make things n-up. I'm surprised I don't see this more. Computers can do math much more quickly than we can. As humans, it's not that difficult for us to do (if nice round-numbered measurements are used), but it is time consuming.
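To show how little math n-up really is, here is a rough Python sketch. It is not how Freeflow does it; the function names (`grid`, `best_n_up`, `positions`) and the handbill sizes in the usage note are made up for illustration. It works out how many copies fit on a sheet, tries the piece rotated 90 degrees as well, and returns the lower-left corner of each copy with the block of copies centred on the sheet. Bleed, gutters and printer margins are ignored.

```python
def grid(sheet_w, sheet_h, item_w, item_h):
    """Return (columns, rows) of item-sized cells that fit on the sheet."""
    eps = 1e-9  # guards against float round-off on exact fits
    cols = int((sheet_w + eps) // item_w)
    rows = int((sheet_h + eps) // item_h)
    return cols, rows


def best_n_up(sheet_w, sheet_h, item_w, item_h):
    """Return ((columns, rows), rotated), choosing the orientation that fits more copies."""
    upright = grid(sheet_w, sheet_h, item_w, item_h)
    rotated = grid(sheet_w, sheet_h, item_h, item_w)
    if rotated[0] * rotated[1] > upright[0] * upright[1]:
        return rotated, True
    return upright, False


def positions(sheet_w, sheet_h, item_w, item_h):
    """Return the lower-left corner of every copy, with the grid centred on the sheet."""
    (cols, rows), rotated = best_n_up(sheet_w, sheet_h, item_w, item_h)
    w, h = (item_h, item_w) if rotated else (item_w, item_h)
    left = (sheet_w - cols * w) / 2
    bottom = (sheet_h - rows * h) / 2
    return [(left + c * w, bottom + r * h)
            for r in range(rows)
            for c in range(cols)]
```

For example, `best_n_up(11, 17, 5.5, 8.5)` says a 5.5x8.5 piece goes 4-up (2 across, 2 down) on an 11x17 sheet without rotating, and `positions(8.5, 11, 4.25, 5.5)` gives the four corners for quarter-page handbills on letter paper. All measurements are in inches here, but any unit works as long as it is the same throughout.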

Put into perspective the hundreds/thousands of dollars these programs cost and the fact that I still have to switch between them to solve some problems, plus there are still some problems they don't solve...

A large problem I would like to solve:
Collecting some of my images together at home into pdfs. Some of these are just images, some are images of text.

Using Linux, I wasn't about to run out and buy a $440 copy of Acrobat Professional. Kpdf is a great lightweight viewer, but that's about all it can do. I suppose I could use a workaround using LaTeX, OpenOffice, or ImageMagick, but that might get tedious or cumbersome. I asked on IRC about ImageMagick, but I couldn't get a definite answer.

I came across some solutions quite by accident.

PDF Hacks contained a solution. Apparently you can make all images in a directory into a pdf with ImageMagick ("hack" 48, p125).
convert -density 100 -quality 85 \
  -page "800x800>" -resize "800x800>" *.jpg album.pdf
Well, that's simple. Easy to write as a shell script or program, too.
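As a sketch of the "write it as a program" idea, here is one way the same command could be wrapped in Python. The options are copied from the command above; the function names (`build_album_command`, `make_album`) and the alphabetical sort order are assumptions of this example, not something from the book. It assumes ImageMagick's `convert` is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_album_command(image_dir, output_pdf="album.pdf"):
    """Build the ImageMagick command that turns every .jpg in image_dir into one PDF."""
    # Sort so the page order is predictable (alphabetical by file name).
    images = sorted(str(p) for p in Path(image_dir).glob("*.jpg"))
    if not images:
        raise ValueError(f"no .jpg files found in {image_dir}")
    return [
        "convert",
        "-density", "100",
        "-quality", "85",
        "-page", "800x800>",
        "-resize", "800x800>",
        *images,
        str(output_pdf),
    ]


def make_album(image_dir, output_pdf="album.pdf"):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(build_album_command(image_dir, output_pdf), check=True)
```

Calling `make_album("scans", "scans.pdf")` would then do the same job as the one-liner for a (hypothetical) `scans` directory. Because the arguments are passed as a list rather than through a shell, the `>` in `800x800>` needs no quoting.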

I came across Tesseract by accident, too. It was in a July 2007 Linux Journal article. Tesseract can take an uncompressed TIFF image and use OCR to turn it into an actual text document. It does it with an excellent success rate (97-100%).

The author says he uses it to scan in his textbooks, so he doesn't have to lug heavy books around campus.
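The two tools chain together naturally: ImageMagick can produce the uncompressed TIFF, and Tesseract can read it. Below is a minimal Python sketch of that pipeline, not the article's own script. The function names and the `_ocr.tif` file naming are invented for this example, and it assumes both `convert` and `tesseract` are installed. It relies on the basic `tesseract <image> <output base>` form, where Tesseract adds `.txt` to the output base itself.

```python
import subprocess
from pathlib import Path


def tiff_command(image, tiff):
    """ImageMagick command that writes an uncompressed TIFF copy of an image."""
    return ["convert", str(image), "-compress", "none", str(tiff)]


def ocr_command(tiff, output_base):
    """Tesseract command; tesseract appends '.txt' to output_base on its own."""
    return ["tesseract", str(tiff), str(output_base)]


def ocr_image(image):
    """Convert one image to TIFF, OCR it, and return the path of the text file."""
    image = Path(image)
    tiff = image.with_name(image.stem + "_ocr.tif")  # hypothetical naming scheme
    base = image.with_name(image.stem)
    subprocess.run(tiff_command(image, tiff), check=True)
    subprocess.run(ocr_command(tiff, base), check=True)
    return image.with_name(image.stem + ".txt")
```

With that, `ocr_image("page01.jpg")` would leave `page01_ocr.tif` and `page01.txt` next to the original scan. Looping it over a directory of scanned pages is the obvious next step.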

Plus: both ImageMagick and Tesseract are free.

1 comment:

uniquegeek said...

Interesting article with more goodies.
http://www.linux.com/feature/138511