PDF Extraction / Conversion
I don't often review commercial products, but here's a niche that simply has not been filled (at least on the Windows platform) in the Open Source community: PDF conversion programs.
There are plenty of packages out there that convert format X (Word, text, AutoCAD) to PDF files, in part because that's usually the direction of conversion we're going, and also because of the abominably expensive nature of the native Adobe products that would like to find their way to your desktop and into your wallet for this process. But going in the other direction, PDF to text, Word, or Rich Text, is just not something we do very often. When you do find yourself with a need to do this, the privilege is going to cost you from $25 to $100.
Thanks to my special relationship to these vendors' customers, I've been able to borrow the various programs for the purpose of taking a look at them. The "winner" of this standoff will be awarded something I rarely employ - real money for software.
Now, why would you want to deconstruct a PDF file into its component text and graphics so you can read it with something else? After all, PDF was created with the express purpose of having an open platform that can read documents in their native configurations without having to buy the corresponding software. There are at least two reasons I can think of.
First, one may wish to convert a PDF file to something else when the authors of that document have artificially fixed the document's information so that it cannot be changed or, if it's a form, filled out. Similarly, authors can embed various other restrictions into PDFs that might prevent one from saving or even printing the form out. Therefore, in some cases it's handy to convert the document into something that you actually can manipulate, save and transmit electronically. A perfect example of this is a PDF file that is made from an Excel table. By design, Excel output should be interactive, but it usually is not when distributed as a PDF. If we could find some utility to extract the table and put it back into Excel form, that would be a real solution.
Secondly, you may have occasion to view a PDF file on an incompatible device - say, an electronic book, a PDA, a cell phone, or even a game device such as the Sony PSP. All manner of devices are becoming reading tools these days, and they don't all support PDF. It was this latter reason that found me on the road to converting PDF to something else - almost anything else would do. As most of you know, I am a fan of JPT (just plain text), and that is usually the format I go for when I have a choice. It is as flexible as it can be, because there is no device that can't display text in one form or another.
I've had my eye on a new Sony E-book PRS-500, which actually can view PDFs, but the e-book I am using now, the REB1000 (made by RCA and Gemstar and now out of production), cannot. I want to bequeath this old standby reader to my son, but many of the titles I'd like to put on that machine are in PDF format, and need to be converted to text or DOC format, then subsequently converted to the .RB standard for the REB1000.
A couple of disclaimers are in order, before I go into this comparison. There is one Open Source product that may help you convert from PDF to text, and that is Ghostscript and its graphical front-end, Ghostview. Ghostscript and its viewer companion Ghostview provide the same PDF reading capability as Adobe's free PDF reader, which is now in version 8. When this product was in version 6, Adobe had added so much crap to the reader that it loaded only with difficulty on older machines, and I was advocating the use of Ghostscript as a replacement for the Adobe product at that time. Even now, Adobe Reader wants to phone home and update itself at an alarming rate, and software that does this, in my opinion, is tempting to bad people. Every time your computer ventures out to the internet intent upon downloading executable code, that process is fraught with opportunity for those people that would harm your computer just for kicks. Ghostscript does not do a great job rendering complex PDFs, but it works well enough and I use it in computing environments that absolutely have to remain stable and where I don't want the processing bandwidth to be interrupted by vendors that feel you "just gotta have" their latest update.
Anyway, Ghostscript and Ghostview do have the capability of extracting plain text from a PDF document. So if your document is plain, unformatted text, these two free programs might be just what you need and you would avoid the cost and hassle of having one more utility program to deal with.
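For the plain-text case, here is a minimal sketch in Python of driving Ghostscript from a script rather than clicking through the viewer. It assumes a reasonably recent Ghostscript build that includes the txtwrite output device, and that the console executable is on your PATH (gswin32c or gswin64c on Windows, gs elsewhere). The function names and file names are made up for illustration.

```python
import subprocess


def build_gs_text_command(pdf_path, txt_path, gs_exe="gswin32c"):
    """Return the Ghostscript argument list that dumps a PDF's text layer.

    gs_exe is an assumption: use "gswin64c" on 64-bit Windows or "gs" on Unix.
    """
    return [
        gs_exe,
        "-q",                  # suppress the startup banner
        "-dNOPAUSE",           # do not wait for a keypress between pages
        "-dBATCH",             # exit when the input is finished
        "-sDEVICE=txtwrite",   # text-extraction device (assumed to be available)
        "-sOutputFile=" + txt_path,
        pdf_path,
    ]


def pdf_to_text(pdf_path, txt_path, gs_exe="gswin32c"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_text_command(pdf_path, txt_path, gs_exe), check=True)
```

Calling pdf_to_text("book.pdf", "book.txt") would then write whatever text Ghostscript can find in book.pdf to book.txt. Keeping the command builder separate from the call that runs it makes it easy to print the command and try it by hand first.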
If you have more complex forms, such as spreadsheets, photos, and GIF files in your PDF, you need something more powerful to deal with them, and that is what this comparative review is about.
Note that if your original PDF file contains an image of text, as in someone simply taking a book and scanning it to a JPEG image, virtually nothing is going to be converted. For this case, you would need Optical Character Recognition (OCR) software to get the images of each letter translated into actual text. So if you get the warning on any of these programs that says "No text was converted," that means there was no text in your original PDF - what shows up on screen as text must be images of text instead of actual text. Perhaps someday soon we'll do a review of OCR software. If you own a scanner or multi-function printer, you may already have OCR software on your computer.
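If you are converting a batch of files, a rough check on the converter's output can flag the scanned ones before you waste time on them. The sketch below is only a heuristic of my own devising, not something any of the reviewed programs does: the function name and the threshold of 20 characters per page are arbitrary choices for illustration.

```python
def looks_like_scanned_pdf(extracted_text, page_count, min_chars_per_page=20):
    """Guess whether a PDF is image-only from the text a converter produced.

    min_chars_per_page is an arbitrary threshold chosen for illustration.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only letters and digits so stray whitespace and form feeds
    # in the converter output do not count as "text".
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful / page_count < min_chars_per_page
```

A file that comes back with nothing but page breaks and blank lines would be flagged as a candidate for OCR, while an ordinary text PDF would not.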
I reviewed these programs to convert PDF to other formats:
PDF Grabber 4.0 (Euro 32.77)
ABC Amber PDF ($12.95 or free with the purchase of their text converter software, $24.95)
Solid Converter PDF ($49.99 or $99.99 for the PRO version, which converts to more formats, including extracting tables to Excel)
PDF to Word v1.2 ($39.95)
SmartPDF Converter ($39.90 +$9.80 per year for support & maintenance)
Abbyy PDF Transformer ($99.99)
Each of these has a semi-functional demo version that can be downloaded to try for free, though some of them require that you give up an e-mail address for the privilege. Some of the vendors, I suspect, are rather lax about documenting the features of the program that do not work in the trial copy. For example, I took the Solid Converter PDF product for a spin and it seemed to work better than many of these, but it would not save the converted file - citing an error about how I may not have permissions on the target drive. If this is a means of crippling the trial version so that I'll pay for the right to use it, fine, but I would like to know this up front, because these error messages will be reported as errors in critical reviews.
I start with the least expensive, ABC Amber PDF, now in version 2.04. This software provider has dozens of different software products for conversions, and if you buy the one called ABC Amber Text converter, they throw in the PDF converter free. The text converter is $24.95.
This utility defaults to converting only one page, so you have to remember to set it to convert every page. It does fine for text, but does not have a lot of options for lining up other elements like JPEGs and charts. It progressed reasonably fast, converting 498 pages of one PDF in about a minute and a half. There is a handy log that shows by default at the bottom of the screen and tells you what it's doing.
ABC Amber's interface looks as if it's going to let you convert to 40 or more formats, but that's a little misleading, because you have to have Microsoft's conversion pack installed to get to most of those formats. It does do a great job of extracting text from a PDF, though. The input file size was just over 800 KB and the output RTF (Word) file was over 2 megabytes. I've noticed inflation of file size to be a problem in general with PDF extraction software. I did notice that Microsoft Word had trouble loading the finished file, but WordPad opened it fine and the formatting looked good.
PDF Grabber had a professional interface with many conversion options, particularly as it relates to formatting the results most like the original. It was slower, however, converting a 133-page file in about 6 minutes. By default, the resulting converted file is saved to a directory in My Documents, which can be a little disorienting. It would be better if the default save directory were the same as the import directory. The trial version places "X" at random locations in the converted file until you pay for the program. Before exiting the program, there is a confusing screen that asks whether you want to include the fonts it could not find. I think it's asking if it's okay to replace unknown fonts in the original with similar Windows fonts, but I'm not sure.
Looks like a good buy for the money though.
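To put the two timings reported so far on a common footing, here is a small Python sketch that turns them into pages per minute. The page counts and times are the ones given above; keep in mind they came from two different PDFs and were timed by hand, so this is a rough comparison, not a controlled benchmark.

```python
def pages_per_minute(pages, minutes):
    """Simple throughput: pages converted divided by elapsed minutes."""
    return pages / minutes


# (pages, minutes) as reported in the review; the times are approximate.
timings = {
    "ABC Amber PDF": (498, 1.5),
    "PDF Grabber": (133, 6.0),
}

rates = {name: pages_per_minute(p, m) for name, (p, m) in timings.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} pages per minute")

# How many times faster the first program was on these particular files.
ratio = rates["ABC Amber PDF"] / rates["PDF Grabber"]
```

On these numbers ABC Amber works out to about 332 pages per minute against roughly 22 for PDF Grabber, a gap of around fifteen to one, though PDF Grabber was also doing more formatting work.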
In spite of its name, PDF2WORD actually only converts a PDF to RTF, which is best viewed in Microsoft WordPad, but it does convert faithfully, including images. For the $40 price tag, though, I don't see a lot of features here. There is no log file to see what has been done and what is left to do, for example. This one initially threw out a lot of weird error messages too, but maybe I was using PDFs that were not 6.0 compatible.
SmartPDF Converter Pro does not allow you to save your converted file, but I guess that is the reason this is trial software. They should say that in the documentation, though, not let me discover it as an error. I also found this one too expensive for what it can do.
Abbyy PDF Transformer, like other Abbyy products, lays out a bunch of icons on the desktop that I didn't ask for. Sorry guys, that automatically disqualifies you. Some of these products offer to turn your converted files, in Word or some other format, back into PDFs. I couldn't really figure out how this was a feature. There are so many free applications that make PDFs that I didn't think this was a plus.
That leaves SolidConverterPDF. This one, I think, might be the winner. Rather than complain and crap out when it encounters fonts in the PDF for which it has no match, it simply replaces them with Arial. There are more formatting options to get your finished document to look like the original, and it will convert tables by default. And it writes to the .DOC final format without having Microsoft Word installed. There is also real-time logging of what it's doing, so you know what happened if it doesn't work. My only complaints are that it's a little slow and that it won't write the finished file, yielding an error message instead. If this is the software complaining it's not registered, then you should tell me that; don't just jury-rig the program to fail, because it looks like an error in the program.
Michael Moore
Labels: Conversion, Ebook, Ghostscript, Ghostview, GSview, PDF, reader, text, utility