Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Step hooks #111

Open
tpluscode opened this issue Apr 26, 2023 · 16 comments
Open

Step hooks #111

tpluscode opened this issue Apr 26, 2023 · 16 comments

Comments

@tpluscode
Copy link
Contributor

tpluscode commented Apr 26, 2023

As discussed with @cristianvasquez @giacomociti @mchlrch, we propose that instead of implementing specialized monitoring steps, there should be a built-in feature to allow observing existing steps in a pipeline.

_Originally posted by @tpluscode in #121

@tpluscode
Copy link
Contributor Author

tpluscode commented Apr 26, 2023

For starters, I propose to all each step to be extended with at least two extension points:

  1. after or flush
  2. onChunk
  3. onError

The first one already has an issue: #71

The second could be defined in a similar fashion, by adding function(s) to a step. Those functions would be called for every chunk, similarly to a PassThrough step but without accessing the underlying stream.

Here's a step which produces the final set of triples. The onChunk hook is implemented by a hypothetical hook which counts the passing quads and reports the total when pipeline finishes

<#serialize>
  a :Step ;
  code:implementedBy [
    a code:EcmaScriptModule ;
    code:link <node:barnard59-formats/ntriples.js#serialize> ;
  ] ;
  :hook [ 
    code:implementedBy [
      a code:EcmaScriptModule ;
      code:link <file:monitoring.js#counter> ;
    ] ;
  ] ;
.
// monitoring.js

// code:arguments can be declared in pipeline definition
export function counter(...args) {
  // the pipeline context can be accessed as in steps
  const { logger } = this

  let total = 0

  // a map of hooks is returned to allow them to access shared state
  // and we might add more hooks in the future
  // all hooks optional
  return {
    data(quad) {
      total++
    },
    flush() {
      logger.info(`Total quads processed: ${total}`)
    }
  }
}

@cristianvasquez
Copy link
Contributor

I think I understand the spirit of the hooks. But wouldn't it be better that they were implicit? (not to be added to the code).

The ideal would be to observe a node of an arbitrary pipeline running

@tpluscode
Copy link
Contributor Author

Not sure I understand @cristianvasquez. It makes sense to attach observability to every single step, if that's what you are proposing. You'd make a choice to count at specific points. For example, in museum we process API objects and convert them to RDF so might count the number of input objects and then the final number of quads.

And how do without code? Even monitoring could be implemented in multiple ways, such as otel

@ludovicm67
Copy link
Member

I like the idea of having such kind of hooks, because we can do what we want, and it will help a lot for debugging, like printing quads, generate some metrics, and so on.

@cristianvasquez
Copy link
Contributor

And how do without code? Even monitoring could be implemented in multiple ways, such as otel

Yes, there can be many different implementations, with their design choices.

My point was related to the choice of adding more triples to the pipeline steps or not, especially if those triples are not part of the business logic.

Choice: If one has a UI, where one can click nodes, then a virtual hook is added to show information in a box.

Example UI idea in VSCode: https://hackmd.io/@KhLoxKJzSyWQgXHIpI2qdQ/BycIGrTv5 (outdated)

Choice: If one does not use a UI, one can imagine having a turtle file where one adds by hand the triples for the hook, and then run the pipeline each time from the command line.

less is better :)

@ludovicm67
Copy link
Member

I like the idea to have something that we can extend the way we want, and this for the steps we want.

In the way @tpluscode suggested, it's easy to do any logic we want, we are not stuck to a limited thing.

Always be explicit to avoid confusion, and make sure that we directly know what is happening.
Explicitly writing hooks to specific steps is great for this.

Maybe providing a default collection of hooks can be nice (printing quads that matches a pattern, generating metrics, …).

@cristianvasquez
Copy link
Contributor

Perhaps they can be run through the cli

> barnard59 run -v --pipeline=urn:pipeline -log=urn:serialize

@ludovicm67
Copy link
Member

But how do you handle custom logic (debug logs, metrics export, …) for a specific step that way?

@cristianvasquez
Copy link
Contributor

The Barnard runner reads the turtle file and composes the pipeline in memory, at that moment it might add also the hooks @tpluscode mentions, being 'dummy' at the start. At runtime, logic can be injected into such hooks, to do the counting, logging etc.

I used to do debugging in that way when I started using the barnard pipelines...

@tpluscode
Copy link
Contributor Author

The Barnard runner reads the turtle file

So there is one or multiple such pipeline files? For a moment I thought that you have in mind the proposal #93 where you could have multiple sources and use the CLI to combine them.

barnard59 run --source main.ttl --source console-debug.ttl
barnard59 run --source main.ttl --source otel.ttl

The second would extend the main graph with hooks at the desired steps

@ludovicm67
Copy link
Member

Can be an option also!

So as an example, this will be:

# main.ttl
<#serialize>
  a :Step ;
  code:implementedBy [
    a code:EcmaScriptModule ;
    code:link <node:barnard59-formats/ntriples.js#serialize> ;
  ] ;
# otel.ttl
<#serialize>
  :hook [ 
    code:implementedBy [
      a code:EcmaScriptModule ;
      code:link <file:monitoring.js#counter> ;
    ] ;
  ] ;

and running them using:

barnard59 run --source main.ttl --source otel.ttl

?

If so, this is also OK for me.

Is the example valid? If not, can you correct me?

@tpluscode
Copy link
Contributor Author

The --source option does not exist now. We can only load from a single source

@ludovicm67
Copy link
Member

But once it's added, would my example work?

@tpluscode
Copy link
Contributor Author

Yes, that's how I'd see it

@tpluscode tpluscode transferred this issue from zazuko/barnard59-core Jun 20, 2023
@tpluscode
Copy link
Contributor Author

I added an onError hook, which could potentially be used for logging and retries (re #94)

@tpluscode
Copy link
Contributor Author

tpluscode commented Nov 20, 2023

To add to this subject, I just had this idea of custom hooks to be defined by step implementors. For example, the filter operation could offer an optional hook to execute when a chunk has been filtered

Here in new syntax

[ 
  op:base\/filter ( "({ AnzahlRecords }) => Number(AnzahlRecords) > 0"^^code:EcmaScript ) ;
  p:onFiltered [ 
    code:implementedBy """
      function (chunk) {
        this.logger.info(`Skipping ${chunk.ExportFileName} because it is empty`)
      }
    """^^code:EcmaScript
  ]
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants