Snippet search #25

Open

lxnn wants to merge 12 commits into main from snippet-search

lxnn commented May 14, 2022

Implements #24

The search heuristic probably needs tuning, although we will get better feedback once mods start using the command.

lxnn and others added 10 commits

May 13, 2022 20:41


          Initial commit on the snippet-search branch

aac8e9c


          Make a start on snippet-search

3fbfa76

Details to be worked out in testing. Particularly the search method and
output styling.

Co-authored-by: Etzeitet <5340057+Etzeitet@users.noreply.github.com>


          Fix command name

fb2c120


          Fix dict comprehension

e11a08b


          Mostly cosmetic changes

65d0f5b


          Cosmetic change; make embed fields not inline

9b2b19f


          Cosmetic changes

1f5f51d

Added a separate embed to summarise the snippet search results.


          Bug fixes

e611e1f


          Grouping of snippets, and improvements to search heuristic

d398f80

Snippet names which correspond to the same snippet content are now
grouped together.

Snippets are now scored as: percentage of query words in name + percentage
of query words in content.


          Bug fix; embed field value length shortened

29ebaba

Author

lxnn commented May 14, 2022

Will fix linting issues later today.

lxnn added 2 commits

May 14, 2022 21:01


          Fix linting issues.

53abbf9


          Remove Python 3.10 typing syntax that snuck in

831c1c7

LiquidPulsar reviewed

LiquidPulsar left a comment

Do with these what you want; in general it looks good, though 🧋

snippet_search/snippet_search.py

+                  """Return the number of words in common between the two strings."""
+                  return sum(
+                      (
+                          Counter(map(str.casefold, words(s1)))

LiquidPulsar Jul 1, 2022

Any reason you casefold after using words instead of calling s1.casefold() first? That would also mean the word pattern only needs to match lowercase, but that's an aside.
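A minimal sketch of the casefold-first variant, for comparison. The `words` helper and its pattern are not shown in this review, so the regex below is a hypothetical stand-in, and the `Counter` intersection is assumed from the visible fragment of the diff. Note that a lowercase-only ASCII pattern would drop non-ASCII letters, which may or may not matter for snippet names.

```python
import re
from collections import Counter
from typing import List

# Hypothetical pattern: the PR's real word pattern is not shown here.
# Because input is case-folded first, only lowercase needs to match.
WORD_PATTERN = re.compile(r"[a-z0-9]+")


def words(s: str) -> List[str]:
    """Split an already case-folded string into words."""
    return WORD_PATTERN.findall(s)


def common_word_count(s1: str, s2: str) -> int:
    """Return the number of words in common between the two strings."""
    counts1 = Counter(words(s1.casefold()))
    counts2 = Counter(words(s2.casefold()))
    # Counter & Counter keeps the minimum count of each shared word.
    return sum((counts1 & counts2).values())
```

For example, `common_word_count("Hello World", "hello there WORLD world")` counts `hello` and `world` once each.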

snippet_search/snippet_search.py

+                  names_by_content = defaultdict(set)
+                  for name, content in snippets.items():
+                      names_by_content[content.strip()].add(name)
+                  grouped_snippets = []

LiquidPulsar Jul 1, 2022

Is the strip necessary here? Aliases shouldn't differ by leading/trailing whitespace unless someone intentionally modifies them, and you seem to subscribe to that idea given the content = snippets[name], which grabs an unstripped version (i.e. if the differences mattered, we would be stripping that one too).

Working with that added strip yields:

    names_by_content = defaultdict(set)
    for name, content in snippets.items():
        names_by_content[content.strip()].add(name)
    grouped_snippets = []
    for content, group in names_by_content.items():
        grouped_snippets.append((group, content))
    return grouped_snippets

or even

    return [
        (v, k)
        for k, v in names_by_content.items()
    ]
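Put together as a standalone function, the grouping could look roughly like this. The function name `group_snippets` and the type hints are assumptions for illustration; only the body comes from the diff and the suggestion above. The hints use `typing` generics rather than Python 3.10 syntax, in line with the later commit on this branch.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def group_snippets(snippets: Dict[str, str]) -> List[Tuple[Set[str], str]]:
    """Group snippet names that share the same (stripped) content.

    `group_snippets` is a hypothetical name; the PR's enclosing function
    is not shown in the review.
    """
    names_by_content = defaultdict(set)
    for name, content in snippets.items():
        names_by_content[content.strip()].add(name)
    # One (names, content) pair per distinct content, in first-seen order.
    return [(names, content) for content, names in names_by_content.items()]
```

With the strip in place, `{"a": "hi", "b": "hi ", "c": "bye"}` yields two groups, with `a` and `b` treated as aliases.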

snippet_search/snippet_search.py

+                  if query is None:
+                      return THRESHOLD
+                  return (
+                      (common_word_count(query, name) + common_word_count(query, content))

LiquidPulsar Jul 1, 2022

The common_word_count function is being called multiple times redundantly: common_word_count(query, content) is recomputed for every name in a group. Perhaps modify score to take a list of names: typing.Iterable[str], e.g.

    return (
        max(
            common_word_count(query, name)
            for name in names
        )
        + common_word_count(query, content)
    ) / len(words(query))

If we're being even more pedantic, I would try to minimise calls to words, but I guess we don't necessarily care about the efficiency that much. I kept it this way because the heuristic function should probably be called as a whole on all the data available to it, as opposed to name-by-name.
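As a rough, self-contained sketch of the suggested signature: the `THRESHOLD` value, the `words` pattern, and the empty-query guard below are all assumptions, since none of them are visible in the diff. Because the content term and the divisor do not depend on the name, taking the maximum over names inside the function gives the same result as the current per-name maximum.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional

THRESHOLD = 1.0  # placeholder value; the PR's actual threshold is not shown


def words(s: str) -> List[str]:
    """Hypothetical stand-in for the PR's `words` helper."""
    return re.findall(r"\w+", s.casefold())


def common_word_count(s1: str, s2: str) -> int:
    """Return the number of words in common between the two strings."""
    return sum((Counter(words(s1)) & Counter(words(s2))).values())


def score(query: Optional[str], names: Iterable[str], content: str) -> float:
    """Score a whole snippet group: best name match plus content match."""
    if query is None:
        return THRESHOLD
    query_words = words(query)
    if not query_words:
        return 0.0  # assumed guard against division by zero
    best_name = max(
        (common_word_count(query, name) for name in names), default=0
    )
    # The content term is now computed once per group, not once per name.
    return (best_name + common_word_count(query, content)) / len(query_words)
```

Here `words(query)` is also called once per `score` call rather than once more for the division.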

snippet_search/snippet_search.py

+                              )
+                              .add_field(
+                                  name="Raw Content",
+                                  value=formatted_content,

LiquidPulsar Jul 1, 2022

formatted_content seems about as complex as the other expressions in the fields, so it could be moved into the add_field call?

snippet_search/snippet_search.py

+                      result_summary_embed = discord.Embed(
+                          color=self.bot.main_color,
+                          title=f"Found {num_results} Matching Snippet{'s' if num_results > 1 else ''}:",

LiquidPulsar Jul 1, 2022

title=f"Found {num_results} Matching Snippet{'s'*(num_results > 1)}:",
#134 \/
name=f"Name{'s'*(len(names) > 1)}",

unless too illegible (which is likely) 🤷
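A quick side-by-side of the two pluralisation forms, as a sketch. The function names are made up for illustration. The multiplication form relies on `bool` being a subclass of `int`, so `'s' * True` is `'s'` and `'s' * False` is `''`.

```python
def title_conditional(num_results: int) -> str:
    """Form used in the PR: conditional expression."""
    return f"Found {num_results} Matching Snippet{'s' if num_results > 1 else ''}:"


def title_multiply(num_results: int) -> str:
    """Suggested form: string repetition by a bool."""
    return f"Found {num_results} Matching Snippet{'s' * (num_results > 1)}:"
```

Both forms agree for every count, including zero, where each produces "Found 0 Matching Snippet:"; if that case can occur, `num_results != 1` may read better in either form.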

snippet_search/snippet_search.py

+                      for i, (names, content) in enumerate(grouped_snippets):
+                          group_score = max(score(query, name, content) for name in names)
+                          scored_groups.append((group_score, i, names, content))

LiquidPulsar Jul 1, 2022

Looks like the enumerate is there to ensure tuple sorting is done properly in the case of equal scores; I don't know how much it matters in particular, though.

Could replace 89-99 with:

    for i, (names, content) in enumerate(grouped_snippets):
        group_score = max(score(query, name, content) for name in names)
        # or score(query, names, content) if modifying score

        if group_score >= THRESHOLD:  # saves sorting time?
            scored_groups.append((group_score, i, names, content))

    scored_groups.sort(reverse=True)

Writing line 116 as for _, _, names, content in matching_snippet_groups would avoid the need for lines 95-99.
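The filter-then-sort idea can be sketched in isolation as follows. `rank_groups` and the `score_group` callback are hypothetical names, and `THRESHOLD` is a placeholder value; the real scoring function and surrounding command code are not reproduced here.

```python
from typing import Callable, List, Set, Tuple

THRESHOLD = 1.0  # placeholder value; the PR's actual threshold is not shown

Group = Tuple[Set[str], str]


def rank_groups(
    grouped_snippets: List[Group],
    score_group: Callable[[Set[str], str], float],
) -> List[Tuple[float, int, Set[str], str]]:
    """Keep groups scoring at least THRESHOLD, best first."""
    scored_groups = []
    for i, (names, content) in enumerate(grouped_snippets):
        group_score = score_group(names, content)
        if group_score >= THRESHOLD:  # drop weak matches before sorting
            scored_groups.append((group_score, i, names, content))
    # The index i breaks ties, so equal scores never fall through to
    # comparing the name sets or the content.
    scored_groups.sort(reverse=True)
    return scored_groups
```

One thing to be aware of: with `reverse=True`, groups with equal scores come out in descending index order, so later groups are listed before earlier ones.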

ChrisLovering force-pushed the main branch from 488b683 to 5134b24 Compare

July 31, 2023 20:25
