Add option to not load user agent database #283

tkochanek · 2019-08-04T05:05:38Z

Since the user agent database is not up to date, it should be possible to configure Divolte not to load it.

In my experience, loading the user agent database takes a long time during Divolte startup, so skipping it could reduce startup time and save memory.

friso · 2019-08-04T18:14:59Z

I believe startup time is dominated by the compilation of the mapping script (Groovy).

Do you have a particular usage scenario where startup time is critical?

Fokko · 2019-08-05T13:03:54Z

I think we should remove, or at least deprecate, the user agent database. It is old and I'm not sure if it should be part of Divolte itself. I would run something like YAUAA in {Flink, Beam, Spark, Dataflow, ...} downstream to enrich the data.

friso · 2019-08-05T18:48:16Z

Understandably, the outdated parser is problematic. Yet, I imagine there are users who rely on the UA enrichment happening before events hit their services. The default schema has UA information in it, so there is no backwards compatible way to remove it by default.

We can swap out the current UA parser in favour of YAUAA, but I believe one issue with YAUAA is that its output cannot populate all fields in the Divolte default schema (this was the case when I last looked at it). So you'd probably want to add a configuration construct to switch to this parser, in order not to break existing configurations and dependencies on the default schema.

I don't believe the startup time overhead of loading the UA database is really an issue. There is no runtime overhead of the UA parser if you don't use it in mapping (evaluation is lazy). UA parsing is also cached, so most times it will be a hash map lookup and not a regex operation.
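To illustrate the caching point, here is a minimal sketch of a cache in front of a regex-based parser. It is not Divolte's actual implementation: the rule set, the function names, and the cache size of 1000 are all assumptions made for the example.

```python
import re
from functools import lru_cache

# Hypothetical, deliberately tiny rule set; a real UA database has far more patterns.
_RULES = [
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]


def _parse_uncached(user_agent: str) -> tuple:
    """Slow path: try each regex in turn until one matches."""
    for family, pattern in _RULES:
        match = pattern.search(user_agent)
        if match:
            return (family, match.group(1))
    return ("Unknown", "")


@lru_cache(maxsize=1000)  # assumed cache size, chosen only for illustration
def parse_user_agent(user_agent: str) -> tuple:
    """Fast path: a repeated UA string is answered from the in-memory cache."""
    return _parse_uncached(user_agent)
```

Since a small number of distinct user agent strings typically accounts for most traffic, repeated strings would skip the regex work entirely after the first request.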

friso · 2019-08-05T18:55:44Z

Also, this makes me wary:

From the docs.

Fokko · 2019-08-06T14:43:06Z

Yes, having a single-threaded library doesn't look really good. It works well with .mapPartitions kind of semantics, but for streaming it isn't ideal. Earlier I had an assignment where they wanted to drop the IP asap because of GDPR but wanted to keep some geographical information about their users. In this case, doing everything in Divolte itself would be ideal.

Getting back to the original question about startup: I think we need some numbers here to see whether it really is the UA database. Apart from that, the JVM isn't really known for its excellent startup speed, so we could also introduce some readiness check. Then we can let k8s/docker know that we're ready and it can be added to the load balancer pool.
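A readiness check along these lines could look roughly like the sketch below. It is a generic illustration and not an existing Divolte feature: the `/ready` path, the `ready` flag, and the helper names are assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by the application once startup work (script compilation, database load) is done.
ready = threading.Event()


def probe_status(path: str) -> int:
    """Return the HTTP status for a probe request; /ready is a hypothetical path."""
    if path != "/ready":
        return 404
    # 503 tells the orchestrator to keep the instance out of the load balancer pool.
    return 200 if ready.is_set() else 503


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_probe_server(port: int = 0) -> HTTPServer:
    """Serve the probe on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), ReadinessHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `ready.set()` as the last step of startup, and the container orchestrator would poll the endpoint until it answers 200.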

friso · 2019-08-06T18:33:55Z

It basically means that the parser is stateful and for some reason the state needs to be kept in the implementing class as instance variables as opposed to something scoped to the method call. I can't imagine any justification for such an implementation. I chose to stay away from it.
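As a generic illustration of the difference (this is not YAUAA's code; all class and function names here are hypothetical), compare scratch state held on the instance with state scoped to the call, plus the usual per-thread workaround when the class cannot be changed:

```python
import threading


class StatefulParser:
    """Stand-in for a parser that keeps scratch state on the instance.

    Two threads calling parse() on the same instance can overwrite each
    other's tokens, so the instance must not be shared.
    """

    def __init__(self):
        self._tokens = []  # shared between calls: the problematic part

    def parse(self, user_agent: str) -> list:
        self._tokens.clear()
        for token in user_agent.split():
            self._tokens.append(token)
        return list(self._tokens)


def parse_stateless(user_agent: str) -> list:
    """Same work with state scoped to the call; safe to call from any thread."""
    tokens = []
    for token in user_agent.split():
        tokens.append(token)
    return tokens


_local = threading.local()


def parse_per_thread(user_agent: str) -> list:
    """Workaround for a stateful class: lazily create one instance per thread."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = _local.parser = StatefulParser()
    return parser.parse(user_agent)
```

The per-thread workaround costs one parser instance, and one copy of its loaded rules, for every thread that parses.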

Also, I am not sure if the parse tree approach that is documented is future proof, but time will tell.

The reason for providing ip2geo and UA parsing in Divolte is exactly that you can ditch the two most discriminating pieces of information on a client early in the pipeline.

I believe a better approach to offline parsing would be to get a dataset from something like this and build a tailored parser to that. This also has the benefit of only parsing for the top-N most seen user agents and leaving the rest as esoteric or otherwise unimportant. These databases, however, tend to come at a cost (for obvious reasons), so we could never deliver it as part of Divolte, the same way we don't do this for ip2geo from MaxMind.
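A minimal sketch of the top-N idea, assuming you already have the observed user agent strings and a labelled dataset to look them up in. The function names, the `labels` mapping, and the `"other"` fallback are hypothetical.

```python
from collections import Counter


def build_top_n_table(observed, labels, n):
    """Map the n most frequently observed UA strings to their labels.

    observed: iterable of raw UA strings seen in your own traffic.
    labels:   dict of UA string -> label, e.g. derived from a purchased dataset.
    """
    counts = Counter(observed)
    return {ua: labels[ua] for ua, _ in counts.most_common(n) if ua in labels}


def classify(user_agent, table, default="other"):
    """Exact-match lookup; anything outside the top N falls into the default bucket."""
    return table.get(user_agent, default)
```

With a table built this way, classification is a single dictionary lookup, and everything outside the top N ends up in one bucket.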

MrMoronIV · 2019-10-18T10:54:06Z

Divolte is complaining that http://user-agent-string.info/rpc/get_data.php?key=free&format=xml cannot be found, which is true since the page doesn't exist.
I've set mapper.user_agent_parser.type = caching_and_updating, hoping it would do what the docs say, but reading this conversation, does this mean the UA parsing is not reliable at all?

Should I let PHP parse the UA on my site and push it as a custom field towards Divolte? Same goes for country detection?
