Create module parser #89
Comments
Further issues: which formats do we want to load? |
#118 |
#118 got merged; I am removing the “blocked” label. |
To answer your questions, @nappex: The input of the parser will be anything downloaded from the Internet. Typically, this is an HTML page, but it can be a plain text file or a JSON API response. It is the Spider’s job to know and discern the format; Onigumo can only provide a set of tools (like #87) to make everyday tasks easier. The output format will be a structured representation of the input data. JSON was my first pick for its universality, and I preferred it over YAML because the data will be processed only by a machine. The output can, however, be any kind of structured data, be it a serialized object in the Spider’s programming language, an XML document, or some binary format. Theoretically, Onigumo doesn’t even need to enforce a concrete structured format. Having one would make it easy to define specific standardized nodes in the schema, making it possible to control the core without writing any code. On the other hand, a piece of Spider code that knows its file format can provide the same data just as well. Feel free to share any thoughts on the topic! |
The work has begun: #146 |
I extracted the steps described in your description (#89 (comment) and #89 (comment)), @nappex, to a new follow-up Issue #169. Feel free to take a look! |
We expect input data of a format known in advance, downloaded by the downloader. The downloader fetches the text content of a page.
The text will contain one URL per line (as Gopher, for example, once did).
The output of the parser will be a JSON file containing a list of all URLs found in the text of the given page.
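
The behaviour described above can be sketched roughly as follows. This is only an illustration, not Onigumo's actual code: the function name `parse_urls` is hypothetical, and it assumes that blank lines are skipped and that surrounding whitespace is not part of a URL.

```python
import json


def parse_urls(text: str) -> str:
    """Turn downloaded text with one URL per line into a JSON list.

    Hypothetical helper: the name and the handling of blank lines
    are assumptions, not something specified in this issue.
    """
    # Strip whitespace around each line and drop lines that are empty.
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    # Serialize the list of URLs as a JSON array.
    return json.dumps(urls, indent=2)


if __name__ == "__main__":
    downloaded = "https://example.com/a\n\nhttps://example.com/b\n"
    print(parse_urls(downloaded))
```

Writing the returned string to a file would give the JSON output file mentioned in the description.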