Module for automatic summarization of text documents and HTML pages.
Updated May 16, 2024 - Python
Rework of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative).
Automatically extract the main text content (and more) from an HTML document
Extracts the main body text from HTML, intended for news web pages.
PHP library that determines which CSS is used by HTML snippets.
Xtract-htmlV2 is a tool for fetching the HTML code of any website you choose; it is the successor to the previous version.
Xtract-html is a tool for extracting HTML display code from a website, which you can also use for your website.
Go package that cleans an HTML page for better readability.
Media Graper is an open-source tool for Linux that extracts all the images, links, and videos from a web page.
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and extract their text, placeholders, and other text content.