reperio

Simple, lightweight library to parse and scrape HTML pages.


Lines of Code: 3,274

From 2022-04-28 to 2025-04-11

About crimson-med/reperio

Reperio is a lightweight HTML parsing and scraping library designed for simplicity and performance. It provides a straightforward API for extracting structured data from web pages, supporting both direct string input and remote URLs through a promise-based interface. The library efficiently parses HTML content into organized components including metadata, images, videos, links, scripts, styles, and tables, with benchmarks showing sub-millisecond performance on moderately sized documents.

The library offers both high-level convenience methods and granular parsing functions for developers who need flexibility. Key features include URL extraction across multiple element types, automatic image downloading with deduplication, table conversion to JavaScript objects using headers as property keys, and sentence-level text search capabilities. Each parsed element type returns structured objects with relevant attributes, allowing programmatic access to href values, image sources, video sources, and other metadata.
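To make two of these features concrete, here is a minimal sketch of header-keyed table conversion and sentence-level text search. It is not Reperio's actual API: `tableToObjects` and `findSentences` are hypothetical helpers written for illustration, and they assume the table has already been reduced to arrays of cell text.

```typescript
// Hypothetical helpers for illustration only; these are NOT reperio's API.
type Row = Record<string, string>;

// Maps each table row to an object whose property keys are the header cells.
function tableToObjects(headers: string[], rows: string[][]): Row[] {
  return rows.map((cells) => {
    const record: Row = {};
    headers.forEach((header, i) => {
      // Missing cells become empty strings so every object has the same keys.
      record[header.trim()] = (cells[i] ?? "").trim();
    });
    return record;
  });
}

// Returns the sentences of `text` that contain `term` (case-insensitive).
// Sentence splitting is naive: a sentence ends at '.', '!' or '?'.
function findSentences(text: string, term: string): string[] {
  const needle = term.toLowerCase();
  const sentences: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return sentences
    .map((s) => s.trim())
    .filter((s) => s.length > 0 && s.toLowerCase().includes(needle));
}

const people = tableToObjects(
  ["Name", "Role"],
  [["Ada", "Engineer"], ["Grace"]],
);
const hits = findSentences("Reperio parses HTML. It also finds links!", "links");

console.log(people);
console.log(hits);
```

Using header text as property keys keeps each row self-describing, at the cost of assuming the headers are unique; a real implementation would also need to handle abbreviations and decimal points when splitting sentences.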

Reperio is published to npm and follows semantic versioning for updates. The project welcomes contributions and maintains a test suite in the source repository, with clear publishing guidelines for maintainers who need to release new versions.
