Create module parser #89

nappex · 2022-05-06T18:47:10Z

We expect input data that is known in advance and has been downloaded by the Downloader. The Downloader fetches the text content of a page.
The data making up the text will be in the form of one URL per line (as it once was in, e.g., Gopher).
The output of the parser will be a JSON file containing a list of all URLs found in the text of the given page.
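
A minimal sketch of the behavior described in this comment, assuming Python for illustration only. The names `parse_urls` and `to_json` are hypothetical and not part of Onigumo. Skipping blank lines and emitting a bare top-level JSON list are also assumptions, since the comment only says the output is a JSON file with a list of URLs.

```python
import json


def parse_urls(text: str) -> list[str]:
    """Treat every non-empty line of the downloaded text as one URL."""
    # Assumption: blank lines and surrounding whitespace carry no meaning.
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_json(urls: list[str]) -> str:
    """Serialize the URL list as a JSON array (assumed output shape)."""
    return json.dumps(urls, indent=2)


if __name__ == "__main__":
    # Hypothetical downloaded page content, one URL per line.
    downloaded = "https://example.com/a\n\nhttps://example.com/b\n"
    print(to_json(parse_urls(downloaded)))
```

Keeping parsing and serialization in separate functions leaves room for other output formats later without touching the line-splitting logic.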

nappex · 2022-05-06T18:50:47Z

Further open questions:

Which input formats do we want to read?
Which output formats do we want?

Glutexo · 2022-08-01T14:07:41Z

#118 ~~slightly~~ blocks this because, without the rename, we’d end up with a module called Onigumo that is essentially a Downloader and then a new Parser module. Or even worse, to fit into the naming and architecture, we’d stuff everything into the existing Onigumo module.

Glutexo · 2022-08-16T09:54:17Z

#118 got merged; I am removing the “blocked” label.

Glutexo · 2022-08-24T12:28:34Z

To answer your questions, @nappex:

The input of the parser will be anything downloaded from the Internet. Typically, this is an HTML page, but it can be a plain text file or a JSON API response. It is Spider’s job to know and discern the format; Onigumo can only provide a set of tools (like #87) to make everyday tasks easier.

The output format will be a structured representation of the input data. JSON was my first pick for its universality, and I preferred it over YAML because the data will be processed only by a machine. The output can, however, be any kind of structured data, be it a serialized object in the Spider’s programming language, an XML document, or some binary format.

Theoretically, Onigumo doesn’t even need to enforce a concrete structured format. Having one would make it easy to define specific standardized nodes in the schema, making it possible to control the core without writing any code. On the other hand, a piece of Spider code that knows its file format can provide the same data just as well.

Feel free to share any thoughts on the topic!

Glutexo · 2022-09-05T09:17:53Z

The work has begun: #146

Glutexo · 2023-01-24T13:44:13Z

I extracted the steps from your comments (#89 (comment) and #89 (comment)), @nappex, into a new follow-up issue, #169. Feel free to take a look!

Glutexo added this to Onigumo Minimal Product Jul 25, 2022

Glutexo added the enhancement and need info labels Jul 26, 2022

Glutexo added the blocked label Aug 1, 2022

Glutexo removed the blocked label Aug 16, 2022

Glutexo mentioned this issue Aug 23, 2022

✨ Use the first argument as a module name #145

Merged

Glutexo removed the need info label Aug 24, 2022

Glutexo linked a pull request Sep 5, 2022 that will close this issue

Introduce a Parser module stub #146

Draft

Glutexo removed a link to a pull request Sep 5, 2022

Introduce a Parser module stub #146

Draft

Glutexo self-assigned this Jan 24, 2023

Glutexo mentioned this issue Jan 24, 2023

Create a dummy spider #169

Open

Glutexo mentioned this issue Jun 2, 2024

Add an integration test for the Downloader → Parser workflow #236

Open

Glutexo mentioned this issue Aug 16, 2024

Update mermaid graph of onigumo processing with a new approach #237

Merged
