Pdftotext
pdftotext is an open-source command-line utility for converting PDF files to plain text files, i.e. extracting text data from PDF-encapsulated files. It is freely available and included by default with many Linux distributions, and is also available on Windows (as part of the Xpdf distribution). Such text extraction is complicated because PDF files are internally built on page drawing primitives, meaning the boundaries between words and paragraphs often must be inferred from their position on the page.
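Because the text has to be reconstructed from positions on the page, the most useful output mode can differ from one document to another. The following is a minimal sketch, not a recommended workflow, that extracts the same file three ways for comparison. The function name extract_modes and the output file names are hypothetical; -layout and -raw are pdftotext options.

```shell
# extract_modes is a hypothetical helper: it writes three text versions of
# one PDF next to the input file so the results can be compared.
extract_modes() {
    pdf=$1
    base=${pdf%.pdf}   # strip the .pdf suffix to build output names

    # Default mode: pdftotext tries to put the text into reading order.
    pdftotext "$pdf" "$base.txt"

    # -layout: keep the physical layout of the page as far as possible,
    # which often suits tables and multi-column pages.
    pdftotext -layout "$pdf" "$base.layout.txt"

    # -raw: keep the text in the order it appears in the content stream.
    pdftotext -raw "$pdf" "$base.raw.txt"
}
```

Calling extract_modes report.pdf would produce report.txt, report.layout.txt and report.raw.txt; which of the three reads best depends on how the PDF was generated.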

$ pdftotext file.pdf
This usage produces a text file with the same base name as the input file and a .txt extension (here, file.txt). Wildcards (*) cannot be used to convert multiple files at once, as in $ pdftotext *.pdf, because pdftotext accepts only one input file; a second file name, if given, is taken as the name of the output text file. A loop in the shell can be used for batch conversions, as in

$ for f in *.pdf
> do
> pdftotext "$f"
> done

for the bash shell.
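The loop above writes each text file next to its PDF. As a variation, the sketch below collects the output in a separate directory by passing the output file name as the second argument. The function name convert_all and both directory arguments are hypothetical, and the error handling is only one possible choice.

```shell
# convert_all is a hypothetical helper: convert every PDF in src_dir and
# write the resulting .txt files into out_dir.
convert_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1

    for f in "$src_dir"/*.pdf; do
        # If no file matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue

        base=$(basename "$f" .pdf)

        # First argument: input PDF. Second argument: output text file.
        pdftotext "$f" "$out_dir/$base.txt" || echo "failed: $f" >&2
    done
}
```

For example, convert_all ./papers ./text would convert every PDF in ./papers and leave the originals untouched. Quoting "$f" keeps file names containing spaces intact.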

pdftotext is part of the Xpdf software suite. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext. On most Linux distributions, pdftotext is included as part of the poppler-utils package.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 