github.com/pdfminer/pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
Contributors: 122
Lines of Code: 3,232
From: 2007-12-30
To: 2022-11-06
About pdfminer/pdfminer.six
Pdfminer.six is a Python library for extracting and analyzing information from PDF documents. It parses the PDF source directly and retrieves text content together with detailed metadata such as exact location, font properties, and color information. It supports the PDF-1.7 specification, several font types (Type1, TrueType, Type3, CID), embedded images in multiple formats, several compression algorithms, and both RC4 and AES encryption.
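As a rough illustration of retrieving text together with its location and font properties, the sketch below walks the layout tree returned by `pdfminer.high_level.extract_pages` and collects one record per character. The names `CharInfo`, `iter_chars`, and `chars_in_pdf` are hypothetical helpers introduced for this example, not part of the library. The walker uses duck typing (any object with `fontname` and `get_text` is treated as a character) rather than the library's own class checks, which is an assumption made so that the traversal logic stays independent of the import.

```python
from collections.abc import Iterable
from typing import Iterator, NamedTuple, Tuple


class CharInfo(NamedTuple):
    """Hypothetical record type for this sketch: one extracted character."""
    text: str
    fontname: str
    size: float
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def iter_chars(layout_objects) -> Iterator[CharInfo]:
    """Recursively walk layout objects and yield a record per character.

    Assumption: character objects expose `get_text()`, `fontname`, `size`
    and `bbox`, as pdfminer.six's LTChar does. Objects without a font
    (for example inserted whitespace annotations) are skipped.
    """
    for obj in layout_objects:
        if hasattr(obj, "fontname") and hasattr(obj, "get_text"):
            yield CharInfo(obj.get_text(), obj.fontname, obj.size, obj.bbox)
        elif isinstance(obj, Iterable):
            # Containers such as pages, text boxes and text lines are iterable.
            yield from iter_chars(obj)


def chars_in_pdf(pdf_path: str) -> Iterator[CharInfo]:
    """Yield character records for every page of the PDF at `pdf_path`."""
    # Imported here so the traversal above can be used without the dependency.
    from pdfminer.high_level import extract_pages  # requires pdfminer.six

    for page in extract_pages(pdf_path):
        yield from iter_chars(page)
```

With pdfminer.six installed, `for ch in chars_in_pdf("example.pdf"): print(ch)` would print one record per character, where `example.pdf` is a placeholder path.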
The library is built with a modular architecture that allows developers to replace individual components or implement custom interpreters and rendering devices for specialized purposes beyond basic text extraction. It can export content in multiple formats including plain text, images, HTML, and hOCR, and includes specialized support for CJK languages, vertical writing systems, interactive form extraction, table of contents extraction, and automatic layout analysis.
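To make the multi-format export concrete, here is a minimal sketch that picks an output format from the target file's extension and passes it to `pdfminer.high_level.extract_text_to_fp`. The mapping `OUTPUT_TYPES` and the helpers `output_type_for` and `convert` are assumptions made for this example. Only the `text`, `html`, and `xml` output types are used here; other formats mentioned above are left out of the sketch.

```python
from pathlib import Path

# Hypothetical extension-to-format mapping used only in this sketch.
OUTPUT_TYPES = {".txt": "text", ".html": "html", ".htm": "html", ".xml": "xml"}


def output_type_for(out_path: str) -> str:
    """Choose an output type from the file extension, defaulting to plain text."""
    return OUTPUT_TYPES.get(Path(out_path).suffix.lower(), "text")


def convert(pdf_path: str, out_path: str) -> None:
    """Convert the PDF at `pdf_path` into the format implied by `out_path`."""
    # Imported here so the helper above works without the dependency installed.
    from pdfminer.high_level import extract_text_to_fp  # requires pdfminer.six
    from pdfminer.layout import LAParams

    # The output file is opened in binary mode because the converters
    # encode their output (UTF-8 by default) before writing.
    with open(pdf_path, "rb") as inf, open(out_path, "wb") as outf:
        extract_text_to_fp(
            inf,
            outf,
            output_type=output_type_for(out_path),
            laparams=LAParams(),  # enable automatic layout analysis
        )
```

For example, `convert("example.pdf", "example.html")` would write an HTML rendering, with both paths being placeholders.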
As a community-maintained fork of the original PDFMiner project, pdfminer.six is written entirely in Python and requires Python 3.10 or newer. The project welcomes community contributions and provides documentation, setup instructions, and an active discussion channel for users and developers looking to extend or improve the tool.