Text Index. Make MIN_WORD_PREFIX_SIZE configurable. #1391

Open
aindlq opened this issue Jul 10, 2024 · 4 comments

Comments

@aindlq

aindlq commented Jul 10, 2024

The default value of MIN_WORD_PREFIX_SIZE is 4, which can be too high for some searches. Is it possible to make it configurable? Would it work if I just changed it to 3? The code says here:

"If you need this to be changed, please contact the developers"

@hannahbast
Member

@aindlq Can you tell us how large your text corpus is, that is, how many word occurrences? And yes, as a very first step this should be configurable.

@aindlq
Author

aindlq commented Jul 10, 2024

Right now it is:

INFO: Statistics for text index: #words = 10,714,805, #blocks = 42,960

That is only a subset of the data we are going to load. I expect the final number to be 5 to 10 times larger.

@hannahbast
Member

@aindlq Have you tried just lowering the value? For 10M words, even MIN_WORD_PREFIX_SIZE = 1 should be fine. For 100M words, MIN_WORD_PREFIX_SIZE = 2 might still be OK.

@aindlq
Author

aindlq commented Jul 11, 2024

@hannahbast It seems to work fine with MIN_WORD_PREFIX_SIZE = 3, with a minor change to:

static cppcoro::generator<std::string> fourLetterPrefixes() {
....
  for (char a : chars()) {
    for (char b : chars()) {
      for (char c : chars()) {
        std::string s{a, b, c};
        co_yield s;
      }
    }
  }
}
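
For anyone who wants the prefix length to follow the configured value instead of hard-coding the number of nested loops, here is a minimal sketch in plain C++. It is not the project's actual implementation: `exampleChars` is a hypothetical stand-in for the `chars()` helper above (the real alphabet may differ), `fixedLengthPrefixes` is an illustrative name, and it returns a vector instead of a `cppcoro::generator` so that it has no dependencies beyond the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the chars() helper in the snippet above.
// Assumption: lowercase ASCII letters only; the real alphabet may differ.
inline std::vector<char> exampleChars() {
  std::vector<char> result;
  for (char c = 'a'; c <= 'z'; ++c) {
    result.push_back(c);
  }
  return result;
}

// Enumerate all strings of exactly `length` characters over `alphabet`.
// The result is ordered lexicographically with respect to the order of
// the characters in `alphabet`. For `length == 0` the result is {""}.
inline std::vector<std::string> fixedLengthPrefixes(
    const std::vector<char>& alphabet, std::size_t length) {
  assert(!alphabet.empty());
  std::vector<std::string> prefixes{""};
  for (std::size_t i = 0; i < length; ++i) {
    // Extend every prefix built so far by each character of the alphabet.
    std::vector<std::string> next;
    next.reserve(prefixes.size() * alphabet.size());
    for (const std::string& p : prefixes) {
      for (char c : alphabet) {
        next.push_back(p + c);
      }
    }
    prefixes = std::move(next);
  }
  return prefixes;
}
```

With the 26-letter alphabet assumed here, a length of 3 yields 26^3 = 17,576 prefixes and a length of 4 yields 26^4 = 456,976. Unlike the generator version, this sketch materializes all prefixes at once, which is fine for small lengths but would need a lazy variant for larger ones.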
