This idea will most likely be implemented in unified-doc-cli
Goals
The internet is a collection of files. unified-doc aims to bridge working with different files through unified document APIs. With a CLI implemented in unified-doc-cli, we can programmatically crawl/curl through web files and perform various useful processing on them, e.g. (a rough sketch of these operations follows the list):
- Searching for content.
- Sanitizing content.
- Extracting just the textContent (useful for NLP pipelines).
- Parsing to hast and continuing content processing with hast utilities in the unified ecosystem.
- Outputting the source file in different formats (.html, .txt, and eventually .pdf, .docx, etc.).
- Enriching the source file by attaching plugins, annotations, etc.
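To make these goals concrete, here is a rough sketch of what such operations could look like through the unified-doc document API that the CLI would wrap. The import and method names below (Doc, search, textContent, parse, file) are assumptions for illustration, not a confirmed API surface.

```js
// Hypothetical sketch of the document API the CLI would wrap.
// The factory/method names here are assumptions for illustration.
import Doc from 'unified-doc';

const doc = Doc({
  content: '<blockquote><strong>some</strong> content</blockquote>',
  filename: 'doc.html',
});

doc.search('some');   // search for content
doc.textContent();    // extract just the text content (e.g. for NLP pipelines)
doc.parse();          // hast tree, ready for hast utilities in the unified ecosystem
doc.file('.txt');     // output the source file in a different format
```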
Config file
Maybe a .unirc.js file? This config basically provides the input for unified-doc. You can attach/override default parsers/plugins/search-algorithms.
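For illustration only, a hypothetical .unirc.js could look something like the sketch below; the option names mirror the wording above (parsers/plugins/search algorithms) but are assumptions, not a finalized schema.

```js
// .unirc.js (hypothetical sketch): provides the input/options for unified-doc.
// Option names are illustrative assumptions, not a finalized schema.
module.exports = {
  // attach/override parsers keyed by file extension (placeholder returns an empty hast root)
  parsers: {
    '.md': (_content) => ({ type: 'root', children: [] }),
  },
  // attach plugins that run against the parsed hast tree (no-op placeholder)
  plugins: [() => (tree) => tree],
  // override the default search algorithm (placeholder returns no matches)
  searchAlgorithm: (_content, _query) => [],
};
```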
CLI wrapper around API methods
The entry point for the CLI should be either a URL or a local file path. From this entry point, we can determine the content and filename accordingly. The CLI wrapper should intuitively wrap familiar API methods.
Ideally, the CLI APIs should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the ideas:
```sh
# extract just the text content of a web file and save it
unified-doc https://some-webpage.html --text-content > myfile.txt

# repipe search results as annotations to the same file, and save the final html file
unified-doc https://some-webpage.html --search 'spongebob' \
  --annotate SEARCH_RESULTS \
  --file .html  # HTML file saved with annotations
```
Bulk processing
The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk-process them, keeping track of errors and providing a way to access the processed files.
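As a very rough sketch of how bulk processing could work, the snippet below fetches an explicit list of URLs in place of a glob pattern, processes each one, and records successes and errors. It assumes Node 18+ for the global fetch, running as an ES module, and reuses the hypothetical Doc API from earlier.

```js
// Hypothetical bulk-processing sketch: fetch a set of URLs, process each with
// unified-doc, and keep track of successes and errors. Names are assumptions.
import Doc from 'unified-doc';

const urls = [
  'https://example.com/a.html',
  'https://example.com/b.html',
];

const report = { processed: [], errors: [] };

for (const url of urls) {
  try {
    const content = await (await fetch(url)).text();
    const doc = Doc({ content, filename: new URL(url).pathname });
    report.processed.push({ url, text: doc.textContent() });
  } catch (error) {
    report.errors.push({ url, reason: error.message });
  }
}

console.log(report); // processed files and any errors encountered
```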
This part of the project excites me the most, given its immediate value once implemented.
Unfortunately, I have no experience writing CLI libraries. I will be tackling this in the future as I ramp up my own knowledge, but any help/advice from the community is greatly appreciated here.