jusText

Heuristic-based boilerplate removal tool

Lines of Code: 593

From 2011-03-09 to 2020-07-01

About miso-belica/jusText

jusText is a Python tool designed to extract main content from HTML pages by removing boilerplate elements like navigation menus, headers, footers, and other extraneous markup. It uses heuristic-based algorithms to identify and preserve text containing full sentences, making it particularly well-suited for building linguistic corpora and other text processing tasks that require clean, meaningful content.

The tool works by parsing HTML with lxml and classifying each paragraph by features such as its length, link density, and stopword density to distinguish actual content from boilerplate. It can be used from the command line to process individual HTML files or URLs, or integrated into Python applications via its API. The project supports multiple languages through stopword lists and was originally developed at Masaryk University's Natural Language Processing Centre.
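
To make the paragraph classification idea concrete, here is a minimal sketch in plain Python. It is not jusText's actual implementation: the `Paragraph` class, the tiny `STOPWORDS` set, the helper names, and all threshold values are assumptions chosen for illustration, and the real tool also parses the HTML itself and takes neighbouring paragraphs into account.

```python
import re
from dataclasses import dataclass

# Tiny illustrative stopword set (an assumption); real per-language
# stopword lists are far larger.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "this", "by",
}


@dataclass
class Paragraph:
    """Hypothetical container for one block of text taken from a page."""
    text: str
    chars_in_links: int = 0  # characters of `text` that sit inside links


def link_density(p: Paragraph) -> float:
    """Share of the paragraph's characters that belong to links."""
    return p.chars_in_links / len(p.text) if p.text else 0.0


def stopword_density(p: Paragraph) -> float:
    """Share of the paragraph's words that are common function words."""
    words = re.findall(r"[a-z']+", p.text.lower())
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def is_boilerplate(
    p: Paragraph,
    max_link_density: float = 0.2,      # illustrative threshold
    min_length: int = 70,               # illustrative threshold
    min_stopword_density: float = 0.3,  # illustrative threshold
) -> bool:
    """Flag paragraphs that are link-heavy, very short, or lack stopwords."""
    if link_density(p) > max_link_density:
        return True  # mostly links: likely navigation
    if len(p.text) < min_length:
        return True  # too short to be a full sentence of content
    return stopword_density(p) < min_stopword_density


nav = Paragraph("Home | About | Contact | Login", chars_in_links=24)
body = Paragraph(
    "The tool works by parsing the page and keeping the paragraphs "
    "that look like full sentences in a natural language."
)
for p in (nav, body):
    print(is_boilerplate(p), p.text[:40])
```

The intuition behind the stopword check is that running prose in a given language contains many function words, while menus, link lists, and copyright lines contain few. This is also why the stopword list has to match the language of the page.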

The original jusText project has been adapted to several other programming languages, including C++, Go, and Java, and is used by various libraries and tools in the text processing ecosystem. This repository is a maintained fork of code from an older, unmaintained Google Code repository. Several alternative content extraction libraries exist today, such as trafilatura, newspaper, and inscriptis.