Provide incoming string for url or path from open-http/open-file as variable #533

Open
TobiasNx opened this issue May 23, 2024 · 6 comments

@TobiasNx
Contributor

TobiasNx commented May 23, 2024

At the moment, the incoming URL string is no longer accessible once open-http has consumed it.

Access to it would be useful when scraping a website that does not provide its own URL as metadata, because the source of each record could then be identified quickly. It would also help with error handling: if a later step fails, the error could name the _id as its source.

A more general approach might be worthwhile, since the same would be useful for open-file, which could provide the file name as _id.

e.g.:
https://metafacture.org/playground/?flux=%22https%3A//phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-html%0A%7C+fix%28%22copy_field%28%27_id%27%2C%27_id%27%29%22%29%0A%7C+encode-json%28prettyPrinting%3D%22true%22%29%0A%7C+print%0A%3B

I am not sure where the value of _id comes from in this example.

@blackwinter
Member

_id is the internal record identifier which is set automatically by some decoder/handler modules and which can be set manually (based on some literal value) with the change-id Flux command.

It cannot be set by input modules, because they don't know anything about records at that point. OTOH, the source location (URL, path) is no longer available when the decoder receives the stream, and there is (currently) no way to transport it out-of-band. Setting the ID to the source location would also mean that (potentially) multiple records would get the same ID, which would violate the uniqueness guarantee.
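The uniqueness concern can be illustrated with a minimal sketch. This is plain Python, not Metafacture code: `decode`, `assign_source_as_id`, and the example.org URL are hypothetical stand-ins, and the only assumption is that a single source may decode into more than one record.

```python
from collections import Counter


def decode(stream):
    """Stand-in decoder: turns one input stream into one record per non-empty line."""
    return [{"title": line} for line in stream.splitlines() if line.strip()]


def assign_source_as_id(source, records):
    """Give every record the source location as its ID (the approach under discussion)."""
    return [{**record, "_id": source} for record in records]


source = "https://example.org/page.html"  # placeholder URL, not from the issue
records = assign_source_as_id(source, decode("first\nsecond\nthird\n"))

# Count how often each ID occurs; any count above 1 is a collision.
id_counts = Counter(record["_id"] for record in records)
collisions = {record_id: n for record_id, n in id_counts.items() if n > 1}
print(collisions)
```

As soon as one source yields several records, they all share one ID, so the source location would only be safe as an ID where exactly one record per source is expected.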

It might, however, be possible to save the URL in a variable which can then be used in the transformation. Maybe along the following lines:

default inputUrl = "https://phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html";

inputUrl
| open-http(accept="application/xml")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

@TobiasNx
Contributor Author

I would be fine with a variable that could be used in the FIX and the FLUX.

It would help in this scenario.

Nice would be also to use the variable in e.g. logging contexts or in other scenarios as variable in the FLUX, but this would be an additional feature.

@blackwinter
Member

> I would be fine with a variable that could be used in the FIX and the FLUX.

So your initial use case is solved?

> Nice would be also to use the variable in e.g. logging contexts or in other scenarios as variable in the FLUX, but this would be an additional feature.

I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

@TobiasNx
Contributor Author

TobiasNx commented May 23, 2024

> > I would be fine with a variable that could be used in the FIX and the FLUX.
>
> So your initial use case is solved?

I think if I could use the variable in the fix my use case would be solved yes. :)

> > Nice would be also to use the variable in e.g. logging contexts or in other scenarios as variable in the FLUX, but this would be an additional feature.
>
> I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

One scenario where the variable could be handy is a configurable log message that includes the variable in its output. Another is file output: if the file name were passed on as a variable, I could use it to name the file that gets written.
But these are additional features. What matters first is having the variable available in FIX and in other FLUX commands.

@blackwinter
Member

> I think if I could use the variable in the fix my use case would be solved yes. :)

But you can. Doesn't the proposed solution work for you?

@TobiasNx
Contributor Author

Ahh, now I see the specific aspect of your approach.
I thought you were suggesting that the opener module would create the variable, but you were not.

I had something like this in mind:

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| open-http(input-to-variable="inputUrl")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

Instead, you would define the variable beforehand.

This would not solve my use case, since the variable has to be provided/configured outside of the Flux workflow itself.
In our scenario, we would read a sitemap with the sitemap reader in oersi, then open each HTML page and fetch the data.
I do not know the data beforehand.

Perhaps another, more general solution would be a Flux module that sets the incoming string as a variable:

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
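To make the intended behaviour of such a module concrete, here is a minimal sketch in plain Python rather than Metafacture code. `StringToVariable`, `run_pipeline`, `fake_fetch`, and the example.org URLs are all hypothetical; the sketch assumes a synchronous push pipeline in which each incoming string is fully processed before the next one arrives.

```python
class StringToVariable:
    """Hypothetical pass-through stage: stores each incoming string under a
    variable name, then forwards the string unchanged to the next stage."""

    def __init__(self, name, variables, receiver):
        self.name = name
        self.variables = variables  # shared, mutable variable store
        self.receiver = receiver    # next stage, called with the same string

    def process(self, value):
        self.variables[self.name] = value
        self.receiver(value)


def run_pipeline(urls, fetch):
    """Push each URL through the stage and tag the resulting records."""
    variables = {}
    results = []

    def open_and_transform(url):
        # Stand-in for open-http | decode-html | fix(...): the variable is
        # read while the URL that produced the record is still current.
        for record in fetch(url):
            results.append({**record, "_id": variables["inputUrl"]})

    stage = StringToVariable("inputUrl", variables, open_and_transform)
    for url in urls:
        stage.process(url)
    return results


def fake_fetch(url):
    """Stand-in for the HTTP request and HTML decoding; no network access."""
    return [{"title": "page at " + url}]


urls = ["https://example.org/course/1", "https://example.org/course/2"]
results = run_pipeline(urls, fake_fetch)
for record in results:
    print(record["_id"], record["title"])
```

The sketch relies on the variable being looked up at processing time, once per record. If variables were instead substituted once when the workflow is set up, or if a stage buffered its input, the stored value could no longer be matched to the right record.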

@TobiasNx changed the title from "Provide incoming url from open-http as _id" to "Provide incoming string for url or path from open-http/open-file as variable" on Jun 19, 2024