This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

Add option to not load user agent database #283

Open
tkochanek opened this issue Aug 4, 2019 · 7 comments

Comments

@tkochanek

Since the user agent database is not up to date, it should be possible to configure Divolte not to load it.

In my experience, loading the user agent database takes a long time during Divolte startup, so skipping it could reduce startup time and save memory.

@friso
Collaborator

friso commented Aug 4, 2019

I believe startup time is dominated by the compilation of the mapping script (Groovy).

Do you have a particular usage scenario where startup time is critical?

@Fokko
Contributor

Fokko commented Aug 5, 2019

I think we should remove, or at least deprecate, the user agent database. It is old and I'm not sure if it should be part of Divolte itself. I would run something like YAUAA in {Flink, Beam, Spark, Dataflow, ...} downstream to enrich the data.

@friso
Collaborator

friso commented Aug 5, 2019

Understandably, the outdated parser is problematic. Yet, I imagine there are users who rely on the UA enrichment happening before events hit their services. The default schema has UA information in it, so there is no backwards compatible way to remove it by default.

We can swap out the current UA parser in favour of YAUAA, but I believe one issue with YAUAA is that its output cannot populate all fields in the Divolte default schema (this was the case when I last looked at it), so you'd probably want to add a configuration construct to switch to this parser in order not to break existing configurations and dependencies on the default schema.

I don't believe the startup time overhead of loading the UA database is really an issue. There is no runtime overhead of the UA parser if you don't use it in mapping (evaluation is lazy). UA parsing is also cached, so most times it will be a hash map lookup and not a regex operation.
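To illustrate the two properties mentioned above (lazy evaluation and caching), here is a minimal, language-agnostic sketch in Python. It is not Divolte's actual implementation; `LazyEvent`, `make_cached_parser`, and the default cache size are hypothetical names and values chosen for the example.

```python
from functools import lru_cache


def make_cached_parser(expensive_parse, cache_size=1000):
    """Wrap a slow UA parsing function in an LRU cache.

    Repeated user agent strings become a cache lookup instead of a
    fresh (regex-heavy) parse. cache_size is an assumed value.
    """
    return lru_cache(maxsize=cache_size)(expensive_parse)


class LazyEvent:
    """Hypothetical event wrapper: the UA is parsed only on first access."""

    def __init__(self, raw_user_agent, parser):
        self._raw = raw_user_agent
        self._parser = parser
        self._parsed = None

    def user_agent(self):
        # No parsing cost is paid unless a mapping actually asks for the UA.
        if self._parsed is None:
            self._parsed = self._parser(self._raw)
        return self._parsed
```

If a mapping never calls `user_agent()`, the parser is never invoked; if it does, identical user agent strings hit the cache after the first parse.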

@friso
Collaborator

friso commented Aug 5, 2019

Also, this makes me wary:
[image]

From the docs.

@Fokko
Contributor

Fokko commented Aug 6, 2019

Yes, a single-threaded library doesn't look good. It works well with mapPartitions-style semantics, but it isn't ideal for streaming. I once had an assignment where the client wanted to drop the IP as soon as possible because of GDPR, but still keep some geographical information about their users. In that case, doing everything in Divolte itself would be ideal.

Getting back to the original question about startup time: I think we need some numbers here to see whether the UA database is really the cause. Apart from that, the JVM isn't known for its startup speed, so we could also introduce a readiness check. Then we can let k8s/Docker know when we're ready, and the instance can be added to the load balancer pool.
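As a rough idea of how such numbers could be gathered, here is a small per-phase timer sketched in Python. It only illustrates the measurement approach; `StartupTimer` and the phase names are hypothetical, and in Divolte itself the equivalent timing would have to wrap the real JVM startup steps.

```python
import time
from contextlib import contextmanager


class StartupTimer:
    """Record wall-clock time per named startup phase (hypothetical helper)."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store the elapsed time even if the phase raised an exception.
            self.phases[name] = time.perf_counter() - start

    def slowest(self):
        """Return the name of the phase that took the longest."""
        return max(self.phases, key=self.phases.get)
```

Usage would look like `with timer.phase("load_ua_database"): ...` around each step, followed by a look at `timer.phases` to see which step dominates.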

@friso
Collaborator

friso commented Aug 6, 2019

It basically means that the parser is stateful and for some reason the state needs to be kept in the implementing class as instance variables as opposed to something scoped to the method call. I can't imagine any justification for such an implementation. I chose to stay away from it.

Also, I am not sure if the parse tree approach that is documented is future proof, but time will tell.

The reason for providing ip2geo and UA parsing in Divolte is exactly that you can ditch the two most discriminating pieces of information on a client early in the pipeline.

I believe a better approach to offline parsing would be to get a dataset from something like this and build a parser tailored to it. This also has the benefit of only parsing the top-N most seen user agents and treating the rest as esoteric or otherwise unimportant. These databases, however, tend to come at a cost (for obvious reasons), so we could never deliver one as part of Divolte, the same way we don't ship the ip2geo database from MaxMind.
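A minimal sketch of the top-N idea, assuming you already have a labelled dataset that maps raw user agent strings to parsed values. The function names, the `labelled` mapping, and the `"other"` fallback are all hypothetical and only serve the illustration.

```python
from collections import Counter


def build_top_n_table(observed_user_agents, labelled, n):
    """Build a lookup table covering only the n most frequently seen UAs.

    observed_user_agents: iterable of raw UA strings from your own traffic.
    labelled: assumed dataset mapping raw UA string -> parsed value.
    """
    counts = Counter(observed_user_agents)
    return {
        ua: labelled[ua]
        for ua, _ in counts.most_common(n)
        if ua in labelled
    }


def classify(ua, table, fallback="other"):
    """Exact-match lookup; anything outside the top N gets the fallback."""
    return table.get(ua, fallback)
```

The table stays small and lookups are plain dictionary hits, at the price of lumping all rare user agents into a single bucket.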

@MrMoronIV

Divolte is complaining that http://user-agent-string.info/rpc/get_data.php?key=free&format=xml cannot be found, which is true, since the page doesn't exist.
I've set mapper.user_agent_parser.type = caching_and_updating in the hope that it would do what the docs say, but reading this conversation, does that mean the UA parsing is not reliable at all?

Should I let PHP parse the UA on my site and push it to Divolte as a custom field? And the same for country detection?
