
Preparing corpus of documents #1

Open
RoyKashob opened this issue Apr 30, 2024 · 0 comments
I have been looking into this dataset and have a couple of questions.

  1. In the Oracle folder for the story dataset, each given article is a single long text. How should I compare retriever performance when the retrieved article is a list of texts separated by a token?
  2. I would like to construct a corpus of documents so that I can use a different retriever model to retrieve them. I found the divided_documents in the Raw folder, but the longest divided document is still quite long (6221 words according to nltk's word_tokenize), which exceeds the input length limit of the Contriever retriever. What is the best way to divide the documents so that each one fits within Contriever's input limit? (One possible chunking approach is sketched below for reference.)
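
For reference, here is a minimal sketch of one chunking approach: re-tokenize each document with Contriever's own tokenizer and cut it into overlapping token windows, so the length check matches what the model actually sees. The `facebook/contriever` checkpoint name, the 512-token limit, and the `chunk_document` helper are my assumptions based on the public HuggingFace model, not on this repository's code.

```python
from transformers import AutoTokenizer

# Assumption: the public HuggingFace checkpoint; this repo may use another.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")

MAX_TOKENS = 512  # assumed BERT-style input limit of Contriever
STRIDE = 64       # token overlap between consecutive chunks

def chunk_document(text: str) -> list[str]:
    """Split text into overlapping chunks that fit the model's limit."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Leave room for the special tokens added at encoding time ([CLS]/[SEP]).
    window = MAX_TOKENS - tokenizer.num_special_tokens_to_add()
    chunks = []
    for start in range(0, len(ids), window - STRIDE):
        piece = ids[start : start + window]
        chunks.append(tokenizer.decode(piece))
        if start + window >= len(ids):
            break
    return chunks
```

The same windowing can also be done in a single tokenizer call with `truncation=True, max_length=..., stride=..., return_overflowing_tokens=True`, if that is preferable.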

Thanks for this great work, and for your time. Looking forward to hearing from you.
