github.com/pdfminer/pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

Contributors: 122

Lines of Code: 3,232

From: 2007-12-30

To: 2022-11-06

About pdfminer/pdfminer.six

Pdfminer.six is a Python library for extracting and analyzing information from PDF documents. It works by parsing the PDF source code directly to retrieve text content along with detailed metadata like exact location, font properties, and color information. The tool supports a wide range of PDF features including the PDF-1.7 specification, various font types (Type1, TrueType, Type3, CID), embedded images in multiple formats, several compression algorithms, and both RC4 and AES encryption standards.
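To make the "text plus metadata" idea concrete, here is a minimal sketch that walks the layout objects returned by `extract_pages` and records each character's text, font name, font size, and bounding box. The helper names `describe_char` and `iter_chars` are hypothetical and chosen only for this illustration; the sketch also assumes pdfminer.six is installed and that the PDF contains extractable text.

```python
def describe_char(char):
    """Summarise one LTChar-like object as a plain dict.

    Hypothetical helper: it relies only on the attributes
    get_text(), fontname, size and bbox.
    """
    return {
        "text": char.get_text(),
        "font": char.fontname,
        "size": round(char.size, 2),
        # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
        "bbox": tuple(round(v, 2) for v in char.bbox),
    }


def iter_chars(pdf_path):
    """Yield one summary dict per character found in the PDF.

    pdfminer.six is imported lazily so that this module can be
    loaded even where the library is not installed.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            # Text boxes contain text lines, which contain characters.
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield describe_char(obj)
```

A call such as `list(iter_chars("example.pdf"))` (with a placeholder file name) would then give one record per character, which can be grouped by font or position as needed.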

The library is built with a modular architecture that allows developers to replace individual components or implement custom interpreters and rendering devices for specialized purposes beyond basic text extraction. It can export content in multiple formats including plain text, images, HTML, and hOCR, and includes specialized support for CJK languages, vertical writing systems, interactive form extraction, table of contents extraction, and automatic layout analysis.
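As an illustration of the multiple export formats, the following sketch uses `extract_text_to_fp` from `pdfminer.high_level` to write a PDF out as plain text, HTML, or XML. The suffix mapping and the function names `output_type_for` and `export_pdf` are assumptions made for this example and are not part of the library; only output types known to be accepted by `extract_text_to_fp` are listed.

```python
from io import BytesIO
from pathlib import Path

# Hypothetical mapping from a target file suffix to an output_type
# accepted by pdfminer.high_level.extract_text_to_fp.
SUFFIX_TO_OUTPUT_TYPE = {".txt": "text", ".html": "html", ".xml": "xml"}


def output_type_for(target):
    """Return the output_type string matching the target file's suffix."""
    suffix = Path(target).suffix.lower()
    try:
        return SUFFIX_TO_OUTPUT_TYPE[suffix]
    except KeyError:
        raise ValueError(f"unsupported export suffix: {suffix!r}") from None


def export_pdf(pdf_path, target):
    """Convert pdf_path to the format implied by target and write it to disk.

    pdfminer.six is imported lazily; it must be installed for this call.
    """
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # A binary buffer works with the default utf-8 codec for all three formats.
    buffer = BytesIO()
    with open(pdf_path, "rb") as pdf_file:
        extract_text_to_fp(
            pdf_file,
            buffer,
            output_type=output_type_for(target),
            laparams=LAParams(),  # enable automatic layout analysis
        )
    target_path = Path(target)
    target_path.write_bytes(buffer.getvalue())
    return target_path
```

For example, `export_pdf("example.pdf", "example.html")` (placeholder file names) would write an HTML rendering next to the source file. Passing a `LAParams` instance turns on layout analysis, which groups characters into lines and boxes before export.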

As a community-maintained fork of the original PDFMiner project, pdfminer.six is written entirely in Python and requires Python 3.10 or newer. The project welcomes community contributions and provides documentation, setup instructions, and an active discussion channel for users and developers looking to extend or improve the tool.