Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snippet search #25

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Snippet search #25

wants to merge 12 commits into from

Conversation

lxnn
Copy link

@lxnn lxnn commented May 14, 2022

Implements #24

The search heuristic probably needs tuning, although we will get better feedback once mods start using the command.

lxnn and others added 10 commits May 13, 2022 20:41
Details to be worked out in testing. Particularly the search method and
output styling.

Co-authored-by: Etzeitet <5340057+Etzeitet@users.noreply.github.com>
Added a separate embed to summarise the snippet search results.
Snippet names which correspond to the same snippet content are now
grouped together.

Snippets are now scored as: percentage of query words in name + percentage
of query words in content.
@lxnn
Copy link
Author

lxnn commented May 14, 2022

Will fix linting issues later today.

Copy link

@LiquidPulsar LiquidPulsar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do with these what you want, in general it looks good tho 🧋

"""Return the number of words in common between the two strings."""
return sum(
(
Counter(map(str.casefold, words(s1)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you casefold after using words instead of s1.casefold()? Would also result in the word pattern only needing lowercase but thats an aside.

names_by_content = defaultdict(set)
for name, content in snippets.items():
names_by_content[content.strip()].add(name)
grouped_snippets = []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the strip necessary here? Aliases shouldn't differ by leading/trailing whitespace unless someone intentionally modifies them, and you seem to subscribe to that idea given the content = snippets[name] which grabs an unstripped version (ie if the differences mattered we would be stripping that one too).

Working with that added strip yields:

    names_by_content = defaultdict(set)
    for name, content in snippets.items():
        names_by_content[content.strip()].add(name)
    grouped_snippets = []
    for content, group in names_by_content.items():
        grouped_snippets.append((group, content))
    return grouped_snippets

or even

    return [
        (v,k)
        for k,v in names_by_content.items()
        ]

if query is None:
return THRESHOLD
return (
(common_word_count(query, name) + common_word_count(query, content))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The common_word_count function is being called multiple times redundantly. Perhaps modify to take a list of names: typing.Iterable[str], eg

return (
    max(
        common_word_count(query, name)
        for name in names
    )
+ common_word_count(query, content)
) / len(words(query))

If we're being even more pedantic, would try to minimise calls to words but i guess we don't necessarily care about the efficiency that much. I just kept this as the heuristic function should probably be called as a whole on all the data available for it as opposed to name-by-name.

)
.add_field(
name="Raw Content",
value=formatted_content,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatted_content seems the same complexity as the other stuff in the fields, so could move it into the add_field?


result_summary_embed = discord.Embed(
color=self.bot.main_color,
title=f"Found {num_results} Matching Snippet{'s' if num_results > 1 else ''}:",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title=f"Found {num_results} Matching Snippet{'s'*(num_results > 1)}:",
#134 \/
name=f"Name{'s'*(len(names) > 1)}",

unless too illegible (which is likely) 🤷

for i, (names, content) in enumerate(grouped_snippets):
group_score = max(score(query, name, content) for name in names)
scored_groups.append((group_score, i, names, content))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the enumerate is to ensure tuple sorting is done properly in the case of equal scores, idk how much it matters particularly tho

Could replace 89-99 with:

        for i, (names, content) in enumerate(grouped_snippets):
            group_score = max(score(query, name, content) for name in names)
            #or score(query, names, content) if modifying score

            if group_score >= THRESHOLD: #saves sorting time?
                scored_groups.append((group_score, i, names, content))

        scored_groups.sort(reverse=True)

Line 116 for _, _, names, content in matching_snippet_groups would avoid the need for lines 95-99.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants