
Preparing corpus of documents #1

Open
RoyKashob opened this issue Apr 30, 2024 · 0 comments
I have been looking into this dataset and have a couple of questions.

  1. In the Oracle folder for the story dataset, each given article is a single long text. How should I compare retriever performance when the retrieved article is a list of texts separated by a token?
  2. I would like to construct a corpus of documents so that I can use a different retriever model to retrieve them. I found the divided_documents in the Raw folder, but the longest divided document is still quite long (6221 words according to nltk's word_tokenize), which exceeds the input length limit of the Contriever retriever. What is the best way to divide the documents so that each one fits within Contriever's input limit? (One possible chunking approach is sketched below for reference.)
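
For reference, here is a minimal sketch of one chunking approach: re-tokenize each document with Contriever's own tokenizer and cut it into overlapping token windows, so the length check matches what the model actually sees. The `facebook/contriever` checkpoint name, the 512-token limit, and the `chunk_document` helper are my assumptions based on the public HuggingFace model, not on this repository's code.

```python
from transformers import AutoTokenizer

# Assumption: the public HuggingFace checkpoint; this repo may use another.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")

MAX_TOKENS = 512  # assumed BERT-style input limit of Contriever
STRIDE = 64       # token overlap between consecutive chunks

def chunk_document(text: str) -> list[str]:
    """Split text into overlapping chunks that fit the model's limit."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Leave room for the special tokens added at encoding time ([CLS]/[SEP]).
    window = MAX_TOKENS - tokenizer.num_special_tokens_to_add()
    chunks = []
    for start in range(0, len(ids), window - STRIDE):
        piece = ids[start : start + window]
        chunks.append(tokenizer.decode(piece))
        if start + window >= len(ids):
            break
    return chunks
```

The same windowing can also be done in a single tokenizer call with `truncation=True, max_length=..., stride=..., return_overflowing_tokens=True`, if that is preferable.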

Thanks for this great work, and for your time. Looking forward to hearing from you.
