Idea: API should return structured article json #38
Comments
Sounds good to me. But how do we identify the DOM position of the title, author and date? |
I found a Golang port of DOM-distiller, go-domdistiller, that also incorporates some improvements. I haven't played around with it, but I think this is a cool idea. If the API returns a filtered JSON blob, I'm wondering if it makes sense to replace |
@joncrangle I could work on the backend API if you want to work on the frontend display with nice typography. I'm thinking a
Then a |
I think an additional route to display a simplified article is a very good idea. It could become a big help for people with visual impairments. I'm not sure yet about adding a framework to display a simple article; this might be solvable with Tailwind. I'd rather add a separate button to the default form to open the page as an outline. |
I started work in a branch on the frontend piece. Rather than use a framework, I created a Get handler for the The data is mocked at the moment. If this is moving in the right direction, I plan to add a dropdown menu in the top right so the user can select visual preferences regarding font family (serif / sans serif) and increase or decrease the font size. Perhaps even a light/dark mode switch as well. I also made some improvements to the |
I like the simplicity. Maybe add an image to the article since the library is able to extract it. But feel free to change everything. |
@deoxykev How do you envision the API returning longer content / images, even if only a cover image? |
@joncrangle I like the frontend. I'm thinking to handle images, we could return it like this:
{
"success": true,
"error": {
"message": "This is an example message. If success is true, this shouldn't be here.",
"type": "example_error",
"cause": { ... recurse for nested errors ... }
},
"data": {
"url": "https://example.com",
"title": "Example Domain",
"author": "",
"date": "2023-11-15",
"content": [
{
"type": "h1",
"data": "This domain is for use in illustrative examples in documents."
},
{
"type": "img",
"url": "/https://example.com/header-image.jpg",
"alt": "header image alt text",
"position": "header"
},
{
"type": "p",
"data": "You may use this domain in literature without prior coordination or asking for permission."
},
{
"type": "img",
"url": "/https://example.com/inline-image.jpg",
"alt": "inline image alt text",
"position": "inline"
}
]
}
}
Then maybe the API could look more like this:
All APIs should contain the top level objects: {
"success": bool,
"error": {},
"data": {},
} |
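The envelope described above could be modelled in Go roughly as follows. This is only a sketch: the type and function names (`Response`, `APIError`, `OK`, `Fail`, `Encode`) are hypothetical, and dropping the unused `error` or `data` object with `omitempty` is an assumption based on the note in the example message that the error "shouldn't be here" when `success` is true.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIError mirrors the proposed "error" object. Cause points at another
// APIError so nested errors can recurse, as in the example above.
type APIError struct {
	Message string    `json:"message"`
	Type    string    `json:"type"`
	Cause   *APIError `json:"cause,omitempty"`
}

// Response is the shared top-level envelope. Assumption: the object that
// does not apply (error on success, data on failure) is omitted.
type Response struct {
	Success bool        `json:"success"`
	Error   *APIError   `json:"error,omitempty"`
	Data    interface{} `json:"data,omitempty"`
}

// OK wraps a payload in a successful envelope.
func OK(data interface{}) Response {
	return Response{Success: true, Data: data}
}

// Fail wraps an error in a failed envelope.
func Fail(err *APIError) Response {
	return Response{Success: false, Error: err}
}

// Encode serialises the envelope and falls back to a bare failure
// if the payload cannot be marshalled.
func Encode(r Response) string {
	b, err := json.Marshal(r)
	if err != nil {
		return `{"success":false}`
	}
	return string(b)
}

func main() {
	fmt.Println(Encode(OK(map[string]string{"title": "Example Domain"})))
	fmt.Println(Encode(Fail(&APIError{
		Message: "fetch failed",
		Type:    "upstream_error",
		Cause:   &APIError{Message: "timeout", Type: "network_error"},
	})))
}
```

Using a pointer for `Cause` keeps the recursion finite in the type definition and lets the innermost error simply leave it out.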
I've used the following mock json response to make some pretty good progress on the frontend: {
"success": true,
"error": {
"message": "This is an example message. If success is true, this shouldn't be here.",
"type": "example_error",
"cause": "recurse - for nested error - string for testing"
},
"data": {
"url": "https://example.com",
"title": "Example Domain",
"author": "John Doe",
"date": "2023-11-15",
"content": [
{
"type": "h1",
"data": "This domain is for use in illustrative examples in documents."
},
{
"type": "img",
"url": "/https://source.unsplash.com/random/900x700/?city,night",
"alt": "header image alt text",
"caption": "This is the image caption"
},
{
"type": "h2",
"data": "This is an h2."
},
{
"type": "p",
"data": "<a> tag: <a href=\"/https://example.com\">This is an example link</a>. This is <em>emphasized</em> text. This is <strong>bold</strong> text. These are example <kbd> tags <kbd>Ctrl</kbd> + <kbd>Shift</kbd>"
},
{
"type": "p",
"data": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
},
{
"type": "p",
"data": "Pulvinar etiam non quam lacus suspendisse faucibus. Et pharetra pharetra massa massa ultricies mi. Rhoncus dolor purus non enim praesent elementum facilisis leo vel. Phasellus vestibulum lorem sed risus ultricies tristique nulla. Duis tristique sollicitudin nibh sit amet commodo nulla. Eget aliquet nibh praesent tristique magna sit amet purus gravida. Sem fringilla ut morbi tincidunt augue interdum velit euismod. Amet consectetur adipiscing elit duis tristique sollicitudin nibh sit. Lobortis scelerisque fermentum dui faucibus in ornare quam viverra. Nunc sed blandit libero volutpat sed cras ornare. Sit amet purus gravida quis. Duis ut diam quam nulla porttitor massa."
},
{
"type": "p",
"data": "Integer malesuada nunc vel risus. Lobortis feugiat vivamus at augue eget arcu dictum varius. Pulvinar sapien et ligula ullamcorper malesuada. Vel quam elementum pulvinar etiam non quam. Magnis dis parturient montes nascetur ridiculus mus mauris vitae. Odio eu feugiat pretium nibh. Pretium nibh ipsum consequat nisl vel pretium lectus. Elementum curabitur vitae nunc sed velit dignissim sodales. Mauris sit amet massa vitae tortor condimentum lacinia quis. Orci porta non pulvinar neque laoreet suspendisse. Enim eu turpis egestas pretium aenean pharetra magna ac placerat."
},
{
"type": "h3",
"data": "This is an h3."
},
{
"type": "blockquote",
"data": "This is a blockquote."
},
{
"type": "h4",
"data": "This is an h4. Here comes a list:"
},
{
"type": "ul",
"data": "<li>Item 1</li><li>Item 2</li><li>Item 3</li>"
},
{
"type": "hr",
"data": ""
},
{
"type": "ol",
"data": "<li>Item 1</li><li>Item 2</li><li>Item 3</li>"
},
{
"type": "img",
"url": "/https://source.unsplash.com/random/900x700/?cat,dog",
"alt": "image alt text",
"caption": ""
},
{
"type": "table",
"data": "<tr><th>Person 1</th><th>Person 2</th><th>Person 3</th></tr><tr><td>Emil</td><td>Tobias</td><td>Linus</td></tr><tr><td>16</td><td>14</td><td>10</td></tr>"
},
{
"type": "code",
"data": "func main() { fmt.Println(\"Hello, World!\")}"
},
{
"type": "made-up",
"data": "You may use this domain in literature without prior coordination or asking for permission."
}
]
}
}
I'm handling a lot of different tags we might encounter. At the moment, I'm unescaping p, table, ul and ol tags in order to render the elements they contain and keep it simple. I've also started designing a dropdown for user preferences. Javascript to handle this remains to be written. I'm thinking I'll probably save the visual preferences to local storage so it persists. |
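For illustration, the per-type handling described in the comment above might look something like this on the Go side. It is a sketch under assumptions, not the code in the branch: `Block` and `RenderBlock` are hypothetical names, the fallback of rendering unknown types (such as the mock's "made-up") as an escaped paragraph is assumed, and passing `p`, `ul`, `ol` and `table` data through unescaped is only safe if the API has already sanitised that markup.

```go
package main

import (
	"fmt"
	"html"
)

// Block is one entry of the "content" array from the mock response.
type Block struct {
	Type    string
	Data    string
	URL     string
	Alt     string
	Caption string
}

// RenderBlock turns a content block into an HTML fragment.
func RenderBlock(b Block) string {
	switch b.Type {
	case "p", "ul", "ol", "table":
		// These carry inline markup, so the data is inserted as-is.
		// Assumption: it was sanitised upstream.
		return "<" + b.Type + ">" + b.Data + "</" + b.Type + ">"
	case "h1", "h2", "h3", "h4", "blockquote":
		return "<" + b.Type + ">" + html.EscapeString(b.Data) + "</" + b.Type + ">"
	case "code":
		return "<pre><code>" + html.EscapeString(b.Data) + "</code></pre>"
	case "hr":
		return "<hr>"
	case "img":
		img := `<img src="` + html.EscapeString(b.URL) + `" alt="` + html.EscapeString(b.Alt) + `">`
		if b.Caption == "" {
			return "<figure>" + img + "</figure>"
		}
		return "<figure>" + img + "<figcaption>" + html.EscapeString(b.Caption) + "</figcaption></figure>"
	default:
		// Unknown types fall back to an escaped paragraph.
		return "<p>" + html.EscapeString(b.Data) + "</p>"
	}
}

func main() {
	fmt.Println(RenderBlock(Block{Type: "h2", Data: "This is an h2."}))
	fmt.Println(RenderBlock(Block{Type: "ul", Data: "<li>Item 1</li><li>Item 2</li>"}))
	fmt.Println(RenderBlock(Block{Type: "made-up", Data: "x < y"}))
}
```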
This looks fantastic. I’m excited. I’m working on a refactor of the core proxy logic, but the API should be ready soon. |
I can only agree with that. I really like where this is going. |
I've made further progress in feat/outline. I'm still just rendering mock data from a The dropdown and its functionality work without adding any new dependencies. I've changed the route to When the major refactor work is complete and we can direct our attention to the API, I don't think it will be too much work to plug in this frontend piece. I've used switch/case statements in I've noticed that since I'm storing the font, font size and theme preferences in local storage, any rulesets in place that clear local storage clear those as well. Instead of clearing all of local storage in these cases, rules that clear local storage will need to be crafted more like this to avoid removing the user preference values:
const keysToKeep = ["font", "fontsize", "theme"];
Object.keys(localStorage).filter(key => !keysToKeep.includes(key)).forEach(key => localStorage.removeItem(key)); |
My two cents (if they're not irrelevant already thanks to the work other people have already put into this issue), is that we can use Potentially, different selectors for each site can be provided in |
Came across this issue and although yall seem pretty far along already, I wanted to share this web metadata scraping pkg / service I've used successfully before. Can return structured data of a bunch of metadata from websites The maintainer, kikobeats, has a few other similar services and runs some as a SaaS as well (https://microlink.io/meta) |
I'm still stuck on a refactor, so no real work has been done on it yet. Metascraper seems interesting, thanks for sharing. It seems to rely on a headless browser, with a latency of around 2 seconds. It does seem to return neat metadata such as background colors, which might be interesting for CSS. Almost there.... #50 |
In addition to go-domdistiller, there is also go-trafilatura, which seems to have fallback extractor functions to go-domdistiller and go-readability. I haven't tried these, but go-trafilatura seems to perform well in their benchmarks as a reading mode algorithm that can extract both the metadata and the content. |
The libraries you mentioned suffer from not having 100% accuracy. This would potentially make the API hard to test, and sometimes return incorrect data. Maybe we can use these libraries for the majority of the API response data, and "fill in the blanks" with hard-coded rules I mentioned in my previous comment. |
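A minimal sketch of the "fill in the blanks" idea, assuming both the extraction library and the hard-coded per-site rules produce the same small metadata struct. `Metadata` and `FillBlanks` are hypothetical names; nothing here is taken from the project's ruleset format.

```go
package main

import "fmt"

// Metadata holds the fields the API is expected to return.
type Metadata struct {
	Title  string
	Author string
	Date   string
}

// FillBlanks keeps every value the extraction library found and only
// falls back to the rule-based value where the library returned nothing.
func FillBlanks(extracted, fromRules Metadata) Metadata {
	out := extracted
	if out.Title == "" {
		out.Title = fromRules.Title
	}
	if out.Author == "" {
		out.Author = fromRules.Author
	}
	if out.Date == "" {
		out.Date = fromRules.Date
	}
	return out
}

func main() {
	extracted := Metadata{Title: "Example Domain", Date: "2023-11-15"}
	fromRules := Metadata{Title: "Rule title", Author: "John Doe"}
	fmt.Printf("%+v\n", FillBlanks(extracted, fromRules))
}
```

If the hard-coded rules should win whenever they exist, the same function works with the arguments swapped.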
I just tried go-trafilatura in b7a012d with pretty decent results. Here are two samples 1 2. The library returns everything as a DOM node, which can be rendered to HTML. Most articles have opengraph tags, so getting metadata such as title, description, tags, etc. should be trivial. Sometimes the article has some JS that makes the content disappear from view, but it's still in the DOM. In that case, this does a great job of extracting the content and thus "bypassing" the paywall. If there was a CSS selector in the ruleset, it could definitely pick the content out more accurately, using goquery or similar. |
This seems very promising. While exploring the go-trafilatura package, I saw that there was an output.go with a jsonExtractResult function, as well as helper.go that contains the CreateReadableDocument function. I sort of mashed them together below in a first attempt at a proof of concept that returns JSON similar to the API described above.
import (
"github.com/go-shiori/dom"
"github.com/markusmobius/go-trafilatura"
"golang.org/x/net/html"
)
type ImageContent struct {
Type string `json:"type"`
URL string `json:"url"`
Alt string `json:"alt"`
Caption string `json:"caption"`
}
type LinkContent struct {
Type string `json:"type"`
Href string `json:"href"`
Data string `json:"data"`
}
type TextContent struct {
Type string `json:"type"`
Data string `json:"data"`
}
type JSONDocument struct {
Success bool `json:"success"`
Error struct {
Message string `json:"message"`
Type string `json:"type"`
Cause string `json:"cause"`
} `json:"error"`
Metadata struct {
Title string `json:"title"`
Author string `json:"author"`
URL string `json:"url"`
Hostname string `json:"hostname"`
Description string `json:"description"`
Sitename string `json:"sitename"`
Date string `json:"date"`
Categories []string `json:"categories"`
Tags []string `json:"tags"`
License string `json:"license"`
} `json:"metadata"`
Content []interface{} `json:"content"`
Comments string `json:"comments"`
}
func CreateJSONDocument(extract *trafilatura.ExtractResult) *JSONDocument {
jsonDoc := &JSONDocument{}
// Populate success
jsonDoc.Success = true
// Populate metadata
jsonDoc.Metadata.Title = extract.Metadata.Title
jsonDoc.Metadata.Author = extract.Metadata.Author
jsonDoc.Metadata.URL = extract.Metadata.URL
jsonDoc.Metadata.Hostname = extract.Metadata.Hostname
jsonDoc.Metadata.Description = extract.Metadata.Description
jsonDoc.Metadata.Sitename = extract.Metadata.Sitename
jsonDoc.Metadata.Date = extract.Metadata.Date.Format("2006-01-02")
jsonDoc.Metadata.Categories = extract.Metadata.Categories
jsonDoc.Metadata.Tags = extract.Metadata.Tags
jsonDoc.Metadata.License = extract.Metadata.License
// Populate content
if extract.ContentNode != nil {
jsonDoc.Content = parseContent(extract.ContentNode)
}
// Populate comments
if extract.CommentsNode != nil {
jsonDoc.Comments = dom.OuterHTML(extract.CommentsNode)
}
return jsonDoc
}
func parseContent(node *html.Node) []interface{} {
var content []interface{}
for child := node.FirstChild; child != nil; child = child.NextSibling {
switch child.Data {
case "img":
image := ImageContent{
Type: "img",
URL: dom.GetAttribute(child, "src"),
Alt: dom.GetAttribute(child, "alt"),
Caption: dom.GetAttribute(child, "caption"),
}
content = append(content, image)
case "a":
link := LinkContent{
Type: "a",
Href: dom.GetAttribute(child, "href"),
Data: dom.InnerText(child),
}
content = append(content, link)
case "h1":
text := TextContent{
Type: "h1",
Data: dom.InnerText(child),
}
content = append(content, text)
case "h2":
text := TextContent{
Type: "h2",
Data: dom.InnerText(child),
}
content = append(content, text)
case "h3":
text := TextContent{
Type: "h3",
Data: dom.InnerText(child),
}
content = append(content, text)
// continue with other tags
default:
text := TextContent{
Type: "p",
Data: dom.InnerText(child),
}
content = append(content, text)
}
}
return content
} |
Sweet, thanks for that. Here are the preliminary API results: 1 2. I'll migrate this over to the |
The API is now ready for testing! FYI I changed the path from Usage is like: @joncrangle can you test your frontend in feat/outline with the origin/proxy_v2 refactor branch? There are tons of changes, so if you submit a PR to my branch I can sort out the conflicts. |
I've been tweaking the functions above to fix some issues. Right now, tags aren't escaped, so we end up losing a bunch of content (basically any tags nested within another text tag are just treated like text). I've been testing a recursive approach to generate JSON that is more like a nested DOM. The following is still buggy, but I'm sharing it so you can see what I've been experimenting with:
import (
"strings"
"github.com/go-shiori/dom"
"github.com/markusmobius/go-trafilatura"
"golang.org/x/net/html"
)
type ImageContent struct {
Type string `json:"type"`
URL string `json:"url"`
Alt string `json:"alt"`
Caption string `json:"caption"`
}
type LinkContent struct {
Type string `json:"type"`
Href string `json:"href"`
Data string `json:"data"`
}
type TextContent struct {
Type string `json:"type"`
Data string `json:"data"`
}
type JSONDocument struct {
Success bool `json:"success"`
Error struct {
Message string `json:"message"`
Type string `json:"type"`
Cause string `json:"cause"`
} `json:"error"`
Metadata struct {
Title string `json:"title"`
Author string `json:"author"`
URL string `json:"url"`
Hostname string `json:"hostname"`
Description string `json:"description"`
Sitename string `json:"sitename"`
Date string `json:"date"`
Categories []string `json:"categories"`
Tags []string `json:"tags"`
License string `json:"license"`
} `json:"metadata"`
Content Content `json:"content"`
Comments Content `json:"comments"`
}
type Content struct {
Type string `json:"type"`
Data string `json:"data,omitempty"`
URL string `json:"url,omitempty"`
Alt string `json:"alt,omitempty"`
Caption string `json:"caption,omitempty"`
Href string `json:"href,omitempty"`
Children []Content `json:"children,omitempty"`
}
func createJSONDocument(extract *trafilatura.ExtractResult) *JSONDocument {
jsonDoc := &JSONDocument{}
// Populate success
jsonDoc.Success = true
// Populate metadata
jsonDoc.Metadata.Title = extract.Metadata.Title
jsonDoc.Metadata.Author = extract.Metadata.Author
jsonDoc.Metadata.URL = extract.Metadata.URL
jsonDoc.Metadata.Hostname = extract.Metadata.Hostname
jsonDoc.Metadata.Description = extract.Metadata.Description
jsonDoc.Metadata.Sitename = extract.Metadata.Sitename
jsonDoc.Metadata.Date = extract.Metadata.Date.Format("2006-01-02")
jsonDoc.Metadata.Categories = extract.Metadata.Categories
jsonDoc.Metadata.Tags = extract.Metadata.Tags
jsonDoc.Metadata.License = extract.Metadata.License
// Populate content
if extract.ContentNode != nil {
jsonDoc.Content = parseContent(extract.ContentNode)
}
// Populate comments
if extract.CommentsNode != nil {
jsonDoc.Comments = parseContent(extract.CommentsNode)
}
return jsonDoc
}
func parseContent(node *html.Node) Content {
var content Content
switch node.Type {
case html.ElementNode:
switch node.Data {
case "img":
content = Content{
Type: "img",
URL: dom.GetAttribute(node, "src"),
Alt: dom.GetAttribute(node, "alt"),
Caption: dom.GetAttribute(node, "caption"),
}
case "a":
content = Content{
Type: "a",
Href: dom.GetAttribute(node, "href"),
Data: dom.InnerText(node),
}
case "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "code", "pre", "kbd":
content = Content{
Type: node.Data,
Data: dom.InnerText(node),
Children: parseChildren(node),
}
//TODO additional tags
default:
// For other HTML tags, recursively call parseContent
content = Content{
Type: "parent", // Use a default type for other HTML tags
Children: parseChildren(node),
}
}
case html.TextNode:
// Handle text nodes only if they contain non-whitespace characters.
// The type is "text" so that parseChildren can merge adjacent text nodes.
text := strings.TrimSpace(dom.InnerText(node))
if text != "" {
content = Content{
Type: "text",
Data: text,
}
}
}
return content
}
func parseChildren(node *html.Node) []Content {
var children []Content
var currentText string
for child := node.FirstChild; child != nil; child = child.NextSibling {
childContent := parseContent(child)
if childContent.Type == "text" {
// If the current child is a text node, concatenate its data with previous text nodes
currentText += childContent.Data
} else {
// If the current child is not a text node, append the previous text nodes as a single text node
if currentText != "" {
textNode := Content{
Type: "text",
Data: currentText,
}
children = append(children, textNode)
currentText = ""
}
// Append the current child content
children = append(children, childContent)
}
}
// If there are remaining text nodes after the loop, append them as a single text node
if currentText != "" {
textNode := Content{
Type: "text",
Data: currentText,
}
children = append(children, textNode)
}
return children
}
This provided the following output. There are still lots of mistakes (an a tag was missed, presumably because it had a nested em tag), and I've been trying to handle p tags before looking at other tags, so this still needs quite a bit of additional work. |
I think we should reconsider the nested DOM structure approach. There are actually two primary goals here:
Separation of Concerns
The current approach, where the user-facing endpoint depends on the API structure, might be too complicated. Mimicking HTML in JSON introduces unnecessary complexity and edge case handling. Because the DOM distillation algorithms tend to return an HTML DOM structure anyway, it doesn't make sense to translate the structure into JSON, only to have frontend JavaScript translate that back into HTML. Let's just return the DOM structure directly.
Design proposal
To that end, I think we should have two endpoints:
A split approach here would simplify the architecture and reduce the need for display logic in two places, in two different languages, making the code easier to maintain. Additionally, a server-side rendered approach enables us to keep the page navigable via href links. |
I agree that this nested DOM approach is getting too complex and I ran into a number of footguns just experimenting. If nested tags can be escaped and unescaped so that the result mirrors the following it keeps things simple and we probably wouldn't need two endpoints, just the {
"type": "p",
"data": "<a> tag: <a href=\"/https://example.com\">This is an example link</a>. This is <em>emphasized</em> text. This is <strong>bold</strong> text. These are example <kbd> tags <kbd>Ctrl</kbd> + <kbd>Shift</kbd>"
}, |
What if it just sent the rendered HTML directly? So imagine this: You have the “outlined” and distilled HTML, with nested markup, etc. Basic html element tags only, no classes. Then you have a shell, with a skeleton div space for the outlined HTML to be injected into via template rendering. This shell contains the menu from your frontend where you can tweak the global css styles, as well as the ladder logo header, and a print option. When the site is requested, the “outlined” html is injected into the shell template, and the whole blob is sent to the client. Perhaps images could be rendered inline so that users could easily save the entire document to their computer. |
That approach would probably be pretty quick to implement. If we update the templating to inject the outlined HTML content (potentially the metadata for headings) or any API errors into the main tag of outline.html, the styling for all the text elements can all be moved to input.css to apply global styles. The outline.html shell already has a ladder header, footer, dropdown and JavaScript for controlling user preferences from script.js. |
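As a rough sketch of the injection step discussed above, Go's standard html/template package can drop already-distilled HTML into a shell. The shell string below is only a stand-in for outline.html, and `Page` and `RenderOutline` are hypothetical names. Marking the body as `template.HTML` tells the template engine not to escape it, so this assumes the distilled markup has been reduced to basic, trusted tags beforehand.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// shell is a placeholder for the real outline.html template.
var shell = template.Must(template.New("outline").Parse(
	`<html><body><header>ladder</header><main>` +
		`{{if .Error}}<p>{{.Error}}</p>{{else}}<h1>{{.Title}}</h1>{{.Body}}{{end}}` +
		`</main></body></html>`))

// Page is the data handed to the shell. Title and Error are escaped by
// the template; Body is inserted verbatim because it is template.HTML.
type Page struct {
	Title string
	Error string
	Body  template.HTML
}

// RenderOutline executes the shell with the given page data.
func RenderOutline(p Page) (string, error) {
	var buf bytes.Buffer
	if err := shell.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := RenderOutline(Page{
		Title: "Example Domain",
		Body:  template.HTML("<p>Hello <em>world</em></p>"),
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

Because the element tags arrive without classes, the styling can live entirely in the global stylesheet, as suggested above.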
Alright, I’ll try integrating that when I have time later today. Will you join us on the discord @joncrangle? A few of us are already there. |
Using some “reading mode” algorithm (such as DOM-distiller) I think the API could return a json blob representing just the source URL, title, author, date and text content of the article, without the extra HTML.
This would make it feasible for web scraping tasks, for non-JS heavy sites.
In addition, this would open up the possibility of an endpoint that returned the cleaned content of the site, much like the old outline.org.