Questions about the free-law data used in the paper "Adapt LLM to domains" #164
Dear authors, you have undoubtedly done an excellent job with the domain-specific post-pre-training, but I have a small question about the size of the free-law data used in the original paper. I downloaded the data from https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train, yet it seems to be much smaller than the 35G (16B tokens) described in Table 7 of the paper: tokenizing it with the LLaMA tokenizer yields only 1.4B tokens. May I ask whether you used the data from this link or from another one?

Comments

Hi, the raw data size mentioned in Table 7 of our paper (51.2 GiB) is copied from Table 1 of the Pile paper: https://arxiv.org/pdf/2101.00027.pdf. I downloaded the data from Hugging Face (https://huggingface.co/datasets/EleutherAI/pile), which should be the same source as yours. However, I noticed the term "partial" in your link. Does it mean you only downloaded a partial set of the entire dataset 😂? I downloaded the data using the following Python code; perhaps you can try it to get the full dataset:

```python
from datasets import load_dataset

free_law_data = load_dataset('EleutherAI/pile', 'free_law')
```
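As a follow-up to the snippet above, here is a minimal sketch of how the token count could be checked against the ~16B figure from Table 7, assuming the subset still loads as shown and using a LLaMA tokenizer from `transformers` (the `huggyllama/llama-7b` checkpoint is an assumption; the thread does not name a specific one):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: 'huggyllama/llama-7b' stands in for whichever LLaMA
# tokenizer was used; the thread does not name a specific checkpoint.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# Stream the subset so the ~51 GiB of raw data need not sit on disk at once.
free_law = load_dataset("EleutherAI/pile", "free_law",
                        split="train", streaming=True)

total_tokens = 0
for example in free_law:
    # Pile examples carry the raw document under the "text" field.
    total_tokens += len(tokenizer(example["text"])["input_ids"])

print(f"~{total_tokens / 1e9:.2f}B tokens")  # Table 7 reports ~16B
```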
Thank you very much for your reply. I'll try your method again.
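For comparison, the "partial" parquet shards from the link in the question can also be loaded directly with the generic parquet loader. A sketch, assuming the auto-converted `refs/convert/parquet` branch uses the usual numbered shard naming (the exact file name below is an assumption; adjust it to the files actually listed on the branch):

```python
from datasets import load_dataset

# Assumption: shards follow the usual 0000.parquet, 0001.parquet, ... naming.
url = ("https://huggingface.co/datasets/EleutherAI/pile/resolve/"
       "refs%2Fconvert%2Fparquet/free_law/partial/train/0000.parquet")

partial = load_dataset("parquet", data_files={"train": url}, split="train")
print(partial)  # the row count covers only a fraction of the full subset
```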