This step scrapes the Next.js documentation with Playwright. Cheerio then transforms and cleans the data and adds wrappers that help the AI make sense of it. Finally, the results are saved as separate files in the data/nextjs folder.
npm run scrap
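As an illustration of the last step (one file per page), here is a minimal sketch of how a documentation URL could be mapped to its own file under data/nextjs. The function name urlToFilePath, the naming scheme, and the .html extension are assumptions made for this example; the project's scraper may name its files differently.

```javascript
// Hypothetical helper: maps a documentation URL to a file path under data/nextjs.
// The naming scheme (path segments joined with "_") and the ".html" extension
// are assumptions for illustration only.
function urlToFilePath(url, baseDir = "data/nextjs") {
  const { pathname } = new URL(url);
  // Drop empty segments produced by leading or trailing slashes.
  const segments = pathname.split("/").filter(Boolean);
  // Fall back to "index" for the site root.
  const name = segments.length > 0 ? segments.join("_") : "index";
  // Replace characters that are unsafe in file names.
  const safeName = name.replace(/[^a-zA-Z0-9_-]/g, "-");
  return `${baseDir}/${safeName}.html`;
}

console.log(urlToFilePath("https://nextjs.org/docs/app"));
// prints "data/nextjs/docs_app.html"
```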
If you want statistics on the scraped data, run this command:
npm run scrapstat
- On Neon.tech, create a database to store the data (Neon is used because it supports vector data).
- Add the connection string to DATABASE_URL in .env. Be sure to fill in userName and replace ******* with your password.
- Create the tables with the SQL commands in database.sql:
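-- Warning: this drops every object in the public schema before recreating it.
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
-- pgvector extension, required for the vector column type.
CREATE EXTENSION IF NOT EXISTS vector;
-- One row per text chunk; 1536 matches the default output size of text-embedding-3-small.
CREATE TABLE IF NOT EXISTS documents (
  text text,
  n_tokens integer,
  file_path text,
  embeddings vector(1536)
);
-- Approximate nearest-neighbor index for cosine-distance searches.
CREATE INDEX ON documents USING ivfflat (embeddings vector_cosine_ops);
CREATE TABLE IF NOT EXISTS openai_ft_data (
  id SERIAL PRIMARY KEY,
  query TEXT NOT NULL,
  answer TEXT NOT NULL,
  suggested_answer TEXT,
  user_feedback BOOLEAN
);
CREATE TABLE IF NOT EXISTS usage (
  id SERIAL PRIMARY KEY,
  ip_address TEXT NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);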
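The index on documents uses vector_cosine_ops, so it is intended for searches ordered by cosine distance (the <=> operator in pgvector). As a rough illustration of what that distance measures, here is a minimal JavaScript sketch. The function name cosineDistance is hypothetical, and the three-element vectors stand in for the 1536-dimension embeddings; in the real setup the database computes this distance.

```javascript
// Cosine distance = 1 - cosine similarity.
// 0 means the vectors point in the same direction, 1 means they are orthogonal.
function cosineDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine distance is undefined for a zero vector");
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistance([1, 0, 0], [1, 0, 0])); // prints 0
console.log(cosineDistance([1, 0, 0], [0, 1, 0])); // prints 1
```

A smaller distance means the stored chunk is closer in meaning to the query embedding, which is why results are typically ordered by ascending distance.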
- Add your OpenAI API key to .env. It is needed to call the API that embeds the data.
npm run embedding
This command performs the following actions:
- Create an array of objects containing each text and its file name, and save it to a JSON file (texts.json).
- Tokenize all texts with tiktoken to get their token counts, and save the result to a JSON file (textsTokens.json).
- Split the texts into chunks of at most 1500 tokens. When a text must be split, split it at its subtitles (h2 tags), and save the result to a JSON file (textsTokensSplited.json).
- Embed all split texts with OpenAI's text-embedding-3-small model and save the result to a JSON file (textsTokensSplitedEmbedding.json).
- Save the embedding data to the database.
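The splitting step in the list above can be sketched as follows. This is a simplified illustration, not the project's code: the function names are hypothetical, and countTokens approximates token counts by counting whitespace-separated words, whereas the project counts tokens with tiktoken. A text is kept whole when it fits within the limit; otherwise it is cut before each h2 tag.

```javascript
// Stand-in tokenizer (assumption): counts whitespace-separated words.
// The real pipeline counts tokens with tiktoken instead.
function countTokens(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Splits a text before each <h2> tag, but only when it exceeds maxTokens.
function splitByH2(text, maxTokens = 1500) {
  if (countTokens(text) <= maxTokens) {
    return [text];
  }
  // The lookahead keeps each <h2> heading at the start of its own section.
  return text
    .split(/(?=<h2[\s>])/i)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}

const sample =
  "<h1>Routing</h1> intro text <h2>Pages</h2> about pages <h2>Layouts</h2> about layouts";
console.log(splitByH2(sample, 5));
// prints an array of three sections: the intro, the "Pages" section, and the "Layouts" section
```

A section that is still longer than the limit after this split would need further handling, which this sketch leaves out.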
The tiktoken library is used to convert text into tokens. The token counts determine where the texts must be split so that they can be embedded with OpenAI.
⏳ Link to tiktoken on npm / Link to tiktoken on GitHub
You can uncomment the displayTokenLengthStats function if you want to check the token statistics before calling saveToDatabase. In that case, remember to comment out the saveToDatabase function.
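For reference, a statistics check of this kind could look like the sketch below. It is not the project's displayTokenLengthStats implementation; the function name summarizeTokenLengths and the n_tokens field on each item are assumptions (the field name mirrors the n_tokens column of the documents table).

```javascript
// Hypothetical summary of token counts for a list of split texts.
// Each item is assumed to carry a numeric n_tokens field.
function summarizeTokenLengths(items) {
  if (items.length === 0) {
    return { count: 0, min: 0, max: 0, mean: 0, total: 0 };
  }
  let min = Infinity;
  let max = -Infinity;
  let total = 0;
  for (const item of items) {
    min = Math.min(min, item.n_tokens);
    max = Math.max(max, item.n_tokens);
    total += item.n_tokens;
  }
  return { count: items.length, min, max, mean: total / items.length, total };
}

const stats = summarizeTokenLengths([
  { n_tokens: 200 },
  { n_tokens: 1500 },
  { n_tokens: 700 },
]);
console.log(stats);
// prints { count: 3, min: 200, max: 1500, mean: 800, total: 2400 }
```

The total is a useful number to look at before sending anything, since the amount of text sent to the embedding API grows with it.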