Ingest field configuration helper cache #2614
base: integration
private static class ResultEntry {
    private final String fieldName;
    private final EnumMap<AttributeType,Boolean> resultMap;
Do we not need to limit the size of this cache?
I think the map here will be limited to the number of AttributeType values (the enum was in the current class - maybe we should change the name). I tried an alternate version which used a switch statement on the enum and individual boolean variables, but it did not seem to impact the timing with JMH. I will confirm the timing again - and we can also update to have a switch statement and explicit boolean results.
An approach with a switch statement and primitive boolean variables was slightly quicker on second pass. I updated the logic, please let me know if we should evaluate the previous approach.
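For readers comparing the two representations discussed above, here is a minimal sketch of both. It is not the PR's code: the `AttributeKind` enum, its three constants, and both entry classes are hypothetical stand-ins, since the actual `AttributeType` values are not shown in this thread.

```java
import java.util.EnumMap;

// Hypothetical attribute kinds; the PR's AttributeType values are not shown here.
enum AttributeKind { INDEXED, STORED, TOKENIZED }

// Variant 1: an EnumMap per cached field. Its size is bounded by the
// number of enum constants, so it cannot grow without limit.
final class EnumMapEntry {
    private final EnumMap<AttributeKind, Boolean> results = new EnumMap<>(AttributeKind.class);

    void put(AttributeKind kind, boolean value) {
        results.put(kind, value);
    }

    // Returns null when the attribute has not been computed yet.
    Boolean get(AttributeKind kind) {
        return results.get(kind);
    }
}

// Variant 2: a switch over the enum with primitive fields, which avoids
// boxing and the map lookup at the cost of more explicit code.
final class SwitchEntry {
    private boolean indexed, stored, tokenized;
    private boolean indexedSet, storedSet, tokenizedSet;

    void put(AttributeKind kind, boolean value) {
        switch (kind) {
            case INDEXED:
                indexed = value;
                indexedSet = true;
                break;
            case STORED:
                stored = value;
                storedSet = true;
                break;
            case TOKENIZED:
                tokenized = value;
                tokenizedSet = true;
                break;
            default:
                throw new IllegalArgumentException("Unknown kind: " + kind);
        }
    }

    // Returns null when the attribute has not been computed yet.
    Boolean get(AttributeKind kind) {
        switch (kind) {
            case INDEXED:
                return indexedSet ? indexed : null;
            case STORED:
                return storedSet ? stored : null;
            case TOKENIZED:
                return tokenizedSet ? tokenized : null;
            default:
                throw new IllegalArgumentException("Unknown kind: " + kind);
        }
    }
}
```

Both variants expose the same behavior, so which one is faster is a question only a benchmark on the real workload can answer.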
Discussed additional performance ideas, going to run additional tests.
The LinkedHashMap implementation tested slightly better in benchmarks. Please indicate if we should switch the implementation or restructure the representation.
...ouse/ingest-core/src/test/java/datawave/ingest/data/config/CachingFieldConfigHelperTest.java
private void debugLogState() {
    if (resultCache.hasLimitExceeded()) {
        log.info("Field cache LRU limit exceeded [limit={}, debug={}, size={}, uniq={}]",
Since this feature is "debugLogState" I was expecting this to be debug level logging. Not a deal breaker but I'd vote to make it debug.
I can update - I set the level this way so it would show up in logs without having to also adjust logger/levels.
}

protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
    boolean localLimitExceeded = size() > maxSize;
Is size() thread-safe? Wondering if this could return a stale size in a multi-threaded scenario. Not sure how much we care.
I can make a comment and/or refactor - the intent was to have the limitExceeded variable be thread-safe for the debug log output.
EDIT: The refactored implementation no longer uses a thread for the debug/diagnostic log messages.
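To make the thread-safety question concrete, here is a minimal sketch of an LRU map built on `java.util.LinkedHashMap` in access order. The class name `LruSketch` and its fields are hypothetical and only loosely modeled on the excerpt above; this is not the PR's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU map; names are illustrative and not taken from the PR.
final class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxSize;
    // volatile so that a reader on another thread sees the flag once it is set;
    // the map itself is still NOT safe for concurrent use.
    private volatile boolean limitExceeded;

    LruSketch(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
        this.maxSize = maxSize;
    }

    // Invoked by LinkedHashMap after each insertion, on the inserting thread.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        boolean exceeded = size() > maxSize;
        if (exceeded) {
            limitExceeded = true;
        }
        return exceeded;
    }

    boolean hasLimitExceeded() {
        return limitExceeded;
    }
}
```

`LinkedHashMap` is not synchronized, so `size()` is only reliable when the map is confined to one thread or guarded externally (for example with `Collections.synchronizedMap`). Inside `removeEldestEntry` the call runs on the thread performing the insertion, so a stale value is only a concern if other threads mutate the map at the same time.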
The setup() method is already too long. There are blocks of logic preceded by a comment explaining what's to come. I know this is a bit out of scope, but it would help with readability if it were refactored so that each of those commented blocks turns into a meaningfully named method.
@@ -255,10 +259,19 @@ public void setup(Configuration config) {
        // Load the field helper, which takes precedence over the individual field configurations
        final String fieldConfigFile = config.get(this.getType().typeName() + FIELD_CONFIG_FILE);
        if (fieldConfigFile != null) {
            if (log.isDebugEnabled()) {
                log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
            final boolean fieldConfigCacheEnabled = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_ENABLED, false);
If the cache truly is the better option do we really want it to be feature based?
My preference was to make it an opt-in feature, both because an appropriate cache limit has to be selected and so that the capability can be turned off in the event of unexpected behavior.
That's reasonable. I suggest that once the confidence level is high, we remove the conditional logic.
            log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
            final boolean fieldConfigCacheEnabled = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_ENABLED, false);
            final boolean fieldConfigCacheLimitDebug = config.getBoolean(this.getType().typeName() + FIELD_CONFIG_CACHE_KEY_LIMIT_DEBUG, false);
            final int fieldConfigCacheLimit = config.getInt(this.getType().typeName() + FIELD_CONFIG_CACHE_KEY_LIMIT, 100);
How was the default of 100 chosen?
I used a value that felt like a reasonable, minimal default and that also fit the sample datasets. I can change it to something different; would a higher value such as 1_000 or 10_000 be better?
I don't have a recommendation on size. I was just curious.
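As an illustration of the opt-in flag and key limit discussed in this thread, the sketch below wraps a field lookup in an LRU cache only when a property enables it. Everything here is an assumption made for the example: `java.util.Properties` stands in for the Hadoop `Configuration`, a `Function<String, Boolean>` stands in for the field config helper, and the property suffixes are invented rather than the PR's actual keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

// Hypothetical opt-in wiring; not the PR's setup() logic.
final class CacheOptInSketch {
    // Invented property suffixes, used only for this example.
    static final String CACHE_ENABLED = ".field.config.cache.enabled";
    static final String CACHE_LIMIT = ".field.config.cache.limit";

    static Function<String, Boolean> wrap(Properties config, String typeName, Function<String, Boolean> helper) {
        // Disabled by default, so existing behavior is unchanged unless opted in.
        boolean enabled = Boolean.parseBoolean(config.getProperty(typeName + CACHE_ENABLED, "false"));
        if (!enabled) {
            return helper;
        }
        final int limit = Integer.parseInt(config.getProperty(typeName + CACHE_LIMIT, "100"));
        // Access-ordered map that evicts the least recently used key past the limit.
        final Map<String, Boolean> cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            private static final long serialVersionUID = 1L;

            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > limit;
            }
        };
        // Not thread-safe: assumes the returned function is used from a single thread.
        return field -> cache.computeIfAbsent(field, helper);
    }
}
```

With the flag left unset, `wrap` returns the original helper untouched, which is what makes the feature easy to switch off if it misbehaves.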
            if (log.isDebugEnabled()) {
                log.debug("Field config file " + fieldConfigFile + " specified for: " + this.getType().typeName() + FIELD_CONFIG_FILE);
Should probably restore this
It was elevated to info. However, it is then logged twice, because there is another location within XMLFieldHelper which outputs the file as info. I will update this back to debug.
With the new diagnostic code if it makes better sense to have it as info then I'm ok with it.
this.debugLimitsEnabled = debugLimitEnabled;
this.debugFieldUnique = new HashSet<>();
this.debugFieldComputes = new AtomicLong();

if (debugLimitEnabled) {
    this.debugStateExecutor = Executors.newSingleThreadScheduledExecutor(
        // @formatter:off
        new ThreadFactoryBuilder()
            .setPriority(NORM_PRIORITY)
            .setDaemon(true)
            .setNameFormat("CachedFieldConfigHelper.DebugState")
            .build()
        // @formatter:on
    );
    this.debugStateExecutor.scheduleAtFixedRate(this::debugLogState, DEFAULT_DEBUG_STATE_SECS, DEFAULT_DEBUG_STATE_SECS, SECONDS);
} else {
    this.debugStateExecutor = null;
}
Should this debug related code be left in?
I can update the name to be something different and/or refactor. The intent was to trace internals to view and help properly size the cache. It was meant to be conditionally enabled, so as not to impact normal execution. I evaluated a way to link to Hadoop metrics, but I didn't immediately see a straightforward path to connect. With respect to the name, would changing this to an optional view-state semantic be better, or is there another convention that would apply better?
EDIT: we could also update the tracing/state logic to all be inline (i.e. remove threading), which when enabled would optionally log out the same information. That would simplify the maintenance/logic.
EDIT#2: refactored the logic to no longer have a background thread to log the message.
The diagnostic nomenclature works for me.
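One way the inline (non-threaded) diagnostics described in this thread might look is sketched below. This is an assumption-laden illustration, not the refactored PR code: the class name, the `logEvery` throttle, and the `Consumer<String>` sink (standing in for the logger) are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical inline diagnostics; intended to be called from the thread that owns the cache.
final class InlineDiagnosticsSketch {
    private final boolean enabled;
    private final int limit;
    private final long logEvery; // hypothetical throttle: report every N computes
    private final Consumer<String> sink; // stand-in for a logger
    private final Set<String> uniqueFields = new HashSet<>();
    private long computes;

    InlineDiagnosticsSketch(boolean enabled, int limit, long logEvery, Consumer<String> sink) {
        this.enabled = enabled;
        this.limit = limit;
        this.logEvery = logEvery;
        this.sink = sink;
    }

    // Record a cache miss and report once more unique fields than the limit have been seen.
    void onCompute(String fieldName, int cacheSize) {
        if (!enabled) {
            return; // no bookkeeping cost when diagnostics are off
        }
        uniqueFields.add(fieldName);
        computes++;
        if (uniqueFields.size() > limit && computes % logEvery == 0) {
            sink.accept(String.format("Field cache LRU limit exceeded [limit=%d, size=%d, uniq=%d]",
                            limit, cacheSize, uniqueFields.size()));
        }
    }
}
```

Because the bookkeeping happens on the calling thread, no executor, daemon thread, or cross-thread visibility guarantees are needed. The trade-off is that nothing is reported while the cache is idle.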
Implementation of a field helper cache using an LRU map.
The cache is configured per BaseIngestHelper instance and will wrap an existing FieldConfigHelper implementation.
As part of the work, there was also a factory implementation in a separate branch which would allow an implementation to be specified by configuration, which may be preferable.
Closes #2612