-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snippet search #25
base: main
Are you sure you want to change the base?
Snippet search #25
Conversation
Details to be worked out in testing. Particularly the search method and output styling. Co-authored-by: Etzeitet <5340057+Etzeitet@users.noreply.github.com>
Added a separate embed to summarise the snippet search results.
Snippet names which correspond to the same snippet content are now grouped together. Snippets are now scored as: percentage of query words in name + percentage of query words in content.
Will fix linting issues later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do with these what you want, in general it looks good tho 🧋
"""Return the number of words in common between the two strings.""" | ||
return sum( | ||
( | ||
Counter(map(str.casefold, words(s1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you casefold after using words
instead of s1.casefold()
? Would also result in the word pattern only needing lowercase but thats an aside.
names_by_content = defaultdict(set) | ||
for name, content in snippets.items(): | ||
names_by_content[content.strip()].add(name) | ||
grouped_snippets = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the strip
necessary here? Aliases shouldn't differ by leading/trailing whitespace unless someone intentionally modifies them, and you seem to subscribe to that idea given the content = snippets[name]
which grabs an unstripped version (ie if the differences mattered we would be stripping that one too).
Working with that added strip yields:
names_by_content = defaultdict(set)
for name, content in snippets.items():
names_by_content[content.strip()].add(name)
grouped_snippets = []
for content, group in names_by_content.items():
grouped_snippets.append((group, content))
return grouped_snippets
or even
return [
(v,k)
for k,v in names_by_content.items()
]
if query is None: | ||
return THRESHOLD | ||
return ( | ||
(common_word_count(query, name) + common_word_count(query, content)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common_word_count
function is being called multiple times redundantly. Perhaps modify to take a list of names: typing.Iterable[str]
, eg
return (
max(
common_word_count(query, name)
for name in names
)
+ common_word_count(query, content)
) / len(words(query))
If we're being even more pedantic, would try to minimise calls to words
but i guess we don't necessarily care about the efficiency that much. I just kept this as the heuristic function should probably be called as a whole on all the data available for it as opposed to name-by-name.
) | ||
.add_field( | ||
name="Raw Content", | ||
value=formatted_content, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatted_content
seems the same complexity as the other stuff in the fields, so could move it into the add_field
?
|
||
result_summary_embed = discord.Embed( | ||
color=self.bot.main_color, | ||
title=f"Found {num_results} Matching Snippet{'s' if num_results > 1 else ''}:", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title=f"Found {num_results} Matching Snippet{'s'*(num_results > 1)}:",
#134 \/
name=f"Name{'s'*(len(names) > 1)}",
unless too illegible (which is likely) 🤷
for i, (names, content) in enumerate(grouped_snippets): | ||
group_score = max(score(query, name, content) for name in names) | ||
scored_groups.append((group_score, i, names, content)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the enumerate is to ensure tuple sorting is done properly in the case of equal scores, idk how much it matters particularly tho
Could replace 89-99 with:
for i, (names, content) in enumerate(grouped_snippets):
group_score = max(score(query, name, content) for name in names)
#or score(query, names, content) if modifying score
if group_score >= THRESHOLD: #saves sorting time?
scored_groups.append((group_score, i, names, content))
scored_groups.sort(reverse=True)
Line 116 for _, _, names, content in matching_snippet_groups
would avoid the need for lines 95-99.
Implements #24
The search heuristic probably needs tuning, although we will get better feedback once mods start using the command.