Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest field configuration helper cache #2614

Draft
wants to merge 11 commits into
base: integration
Choose a base branch
from

Conversation

dtspence
Copy link
Collaborator

@dtspence dtspence commented Oct 16, 2024

Implementation of a field helper cache using an LRU map.

  • The cache can be optionally enabled on the BaseIngestHelper instance and will wrap an existing FieldConfigHelper implementation.
  • By default the cache is disabled and must be enabled via configuration.

As part of the work, there was also a factory implementation in separate branch which would allow an implementation to be specified by configuration, which may be preferrable.

Closes #2612

@dtspence dtspence force-pushed the feature/ingest-cache-field-helper branch 3 times, most recently from db63c0c to 3f7d5b6 Compare October 17, 2024 14:10
@dtspence dtspence marked this pull request as ready for review October 17, 2024 14:10

private static class ResultEntry {
private final String fieldName;
private final EnumMap<AttributeType,Boolean> resultMap;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need to limit the size of this cache?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the map here will be limited to the size of the AttributeType (which was in the current class - maybe we should change the name). I tried an alternate version which used a switch statement on the enum and individual boolean variables, but it did not seem to impact the timing w/JMH. I will confirm the timing again - and we can also update to have a switch statement and explicit boolean results.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An approach with a switch statement and primitive boolean variables was slightly quicker on second pass. I updated the logic, please let me know if we should evaluate the previous approach.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed additional performance ideas, going to run additional tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LinkedHashMap implementation tested slightly better in benchmarks. Please indicate if we should switch the implementation or restructure the representation.

@dtspence dtspence force-pushed the feature/ingest-cache-field-helper branch from 3f7d5b6 to 47d1cf3 Compare October 25, 2024 18:45
@dtspence dtspence marked this pull request as draft October 25, 2024 20:12
@dtspence dtspence force-pushed the feature/ingest-cache-field-helper branch from 8e5b1fb to 038518c Compare January 8, 2025 17:11
@dtspence dtspence force-pushed the feature/ingest-cache-field-helper branch from 038518c to c999a78 Compare January 8, 2025 17:32
@dtspence dtspence marked this pull request as ready for review January 13, 2025 15:47
jschmidt10
jschmidt10 previously approved these changes Jan 22, 2025

private void debugLogState() {
if (resultCache.hasLimitExceeded()) {
log.info("Field cache LRU limit exceeded [limit={}, debug={}, size={}, uniq={}]",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this feature is "debugLogState" I was expecting this to be debug level logging. Not a deal breaker but I'd vote to make it debug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update - I set the level this way so it would show up in logs without having to also adjust logger/levels.

}

protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
boolean localLimitExceeded = size() > maxSize;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is size() threadsafe? Wondering if this could return a stale size in a multi-threaded scenario. Not sure how much we care.

Copy link
Collaborator Author

@dtspence dtspence Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a comment and/or refactor - the intent was to have the limitExceeded variable be thread-safe for the debug log output.

EDIT: The refactored the implementation no longer uses a thread for the debug/diagnostic log messages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setup() method is already too long. There are blocks of logic preceded by a comment explaining what's to come. I know this is a bit out of scope but it would help with readability if it were refactored so that each of those commented blocks turn into meaningfully named methods.

@@ -255,10 +259,19 @@ public void setup(Configuration config) {
// Load the field helper, which takes precedence over the individual field configurations
final String fieldConfigFile = config.get(this.getType().typeName() + FIELD_CONFIG_FILE);
if (fieldConfigFile != null) {
if (log.isDebugEnabled()) {
log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
final boolean fieldConfigCacheEnabled = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_ENABLED, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache truly is the better option do we really want it to be feature based?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference was to make it feature/opted-into based on the need to select an appropriate cache value and also so the capability could be turned off in the event of unexpected behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable. I suggest that once the confidence level is high, we remove the conditional logic.

log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
final boolean fieldConfigCacheEnabled = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_ENABLED, false);
final boolean fieldConfigCacheLimitDebug = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_KEY_LIMIT_DEBUG, false);
final int fieldConfigCacheLimit = config.getInt(this.getType().typeName() + FIELD_CONFIG_CACHE_KEY_LIMIT, 100);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how was the default of 100 chosen?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a value that felt was a reasonable/minimal default and also fit with the sample datasets. I can change it to something different, would a higher value at 1_000 or 10_000 be better?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a recommendation on size. I was just curious.

Comment on lines 258 to 259
if (log.isDebugEnabled()) {
log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably restore this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was elevated to an info - however, it is logged twice - there is another location within XMLFieldHelper which outputs the file as info. I will update this back to debug.

Copy link
Collaborator

@jalphonso jalphonso Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new diagnostic code if it makes better sense to have it as info then I'm ok with it.

Comment on lines 50 to 67
this.debugLimitsEnabled = debugLimitEnabled;
this.debugFieldUnique = new HashSet<>();
this.debugFieldComputes = new AtomicLong();

if (debugLimitEnabled) {
this.debugStateExecutor = Executors.newSingleThreadScheduledExecutor(
// @formatter:off
new ThreadFactoryBuilder()
.setPriority(NORM_PRIORITY)
.setDaemon(true)
.setNameFormat("CachedFieldConfigHelper.DebugState")
.build()
// formatter:off
);
this.debugStateExecutor.scheduleAtFixedRate(this::debugLogState, DEFAULT_DEBUG_STATE_SECS, DEFAULT_DEBUG_STATE_SECS, SECONDS);
} else {
this.debugStateExecutor = null;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this debug related code be left in?

Copy link
Collaborator Author

@dtspence dtspence Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update the name to be something different and/or refactor. The intent was to trace internals to view and help properly size the cache. It was meant to be conditionally enabled, so not to impact normal execution. I evaluated a way to link to Hadoop metrics, but I didn't immediately see a straight forward path to connect. In respect to the name, would generally changing this to an optional view-state semantic be better or is there another convention that would apply better?

EDIT: we could also update tracing/state logic to all be inline (i.e. remove threading), which when enabled would optionally log out the same information. That would simplify the maintenance/logic.

EDIT#2: refactored the logic to no longer have a background thread to log the message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnostic nomenclature works for me.

@dtspence dtspence marked this pull request as draft January 23, 2025 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FieldConfigHelper results cache
4 participants