Create module parser #89

Open
nappex opened this issue May 6, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@nappex
Collaborator

nappex commented May 6, 2022

We expect input data that is known in advance and has been fetched by the downloader. The downloader fetches the text content of a page.
The data making up the text will have the form of one URL per line (as Gopher once did, for example).
The parser's output will be a JSON file containing a list of all URLs found in the text of the given page.
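
The behavior described above (one URL per line in, a JSON list of URLs out) could be sketched roughly as follows. This is only an illustration in Python, not the project's implementation; the function names `parse_urls` and `to_json` are hypothetical, and skipping blank lines is an assumption the description does not state.

```python
import json


def parse_urls(text: str) -> list[str]:
    """Treat every non-empty line of the downloaded text as one URL."""
    # Assumption: blank lines and surrounding whitespace carry no meaning.
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_json(urls: list[str]) -> str:
    """Serialize the list of URLs as a JSON array."""
    return json.dumps(urls, indent=2)


if __name__ == "__main__":
    # Hypothetical downloader output: one URL per line.
    downloaded = "https://example.com/a\nhttps://example.com/b\n\n"
    print(to_json(parse_urls(downloaded)))
```

Keeping the line splitting and the JSON serialization in separate functions leaves room to swap the output format later without touching the parsing step.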

@nappex
Collaborator Author

nappex commented May 6, 2022

Further open questions:

Which formats do we want to read?
Which output formats do we want?

@Glutexo
Owner

Glutexo commented Aug 1, 2022

#118 slightly blocks this because, without the rename, we’d end up with a module called Onigumo that is essentially a Downloader and then a new Parser module. Or even worse, to fit into the naming and architecture, we’d stuff everything into the existing Onigumo module.

@Glutexo Glutexo added the blocked label Aug 1, 2022
@Glutexo Glutexo removed the blocked label Aug 16, 2022
@Glutexo
Owner

Glutexo commented Aug 16, 2022

#118 got merged; I am removing the “blocked” label.

@Glutexo
Owner

Glutexo commented Aug 24, 2022

To answer your questions, @nappex:

The input of the parser will be anything downloaded from the Internet. Typically, this is an HTML page, but it can be a plain text file or a JSON API response. It is Spider’s job to know and discern the format; Onigumo can only provide a set of tools (like #87) to make everyday tasks easier.
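
As a rough illustration of the Spider discerning the format, here is a minimal Python sketch that extracts URLs from the three input kinds mentioned above. Everything in it is hypothetical: the `parse` function, the `fmt` labels, and the assumption that a JSON response is a top-level array of URL strings are not part of Onigumo.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse(body: str, fmt: str) -> list[str]:
    """Return the URLs found in `body`; `fmt` is chosen by the Spider."""
    if fmt == "html":
        collector = LinkCollector()
        collector.feed(body)
        return collector.links
    if fmt == "text":
        # One URL per line, blank lines ignored.
        return [line.strip() for line in body.splitlines() if line.strip()]
    if fmt == "json":
        # Assumption: the response is a JSON array; non-string items are skipped.
        data = json.loads(body)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of URLs")
        return [item for item in data if isinstance(item, str)]
    raise ValueError(f"unknown format: {fmt}")
```

The format is passed in explicitly rather than guessed from the content, which matches the idea that only the Spider knows what it downloaded.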

The output format will be a structured representation of the input data. JSON was my first pick for its universality, and I preferred it over YAML because the data will be processed only by a machine. The output can, however, be any kind of structured data, be it a serialized object in the Spider’s programming language, an XML document, or some binary format.

Theoretically, Onigumo doesn’t even need to force a concrete structure format. Having one would make it easy to define specific standardized nodes in the schema, allowing the core to be controlled without writing any code. On the other hand, a piece of Spider code that knows its file format can provide the same data just as well.

Feel free to share any thoughts on the topic!

@Glutexo Glutexo removed the need info label Aug 24, 2022
@Glutexo Glutexo linked a pull request Sep 5, 2022 that will close this issue
@Glutexo Glutexo removed a link to a pull request Sep 5, 2022
@Glutexo
Owner

Glutexo commented Sep 5, 2022

The work has begun: #146

@Glutexo Glutexo self-assigned this Jan 24, 2023
@Glutexo
Owner

Glutexo commented Jan 24, 2023

I extracted the steps described in your description (#89 (comment) and #89 (comment)), @nappex, to a new follow-up Issue #169. Feel free to take a look!


2 participants