Is there a problem with the split_by_user.py script? #58

Open
peterzhang2029 opened this issue Jan 13, 2023 · 1 comment

Comments

@peterzhang2029
import random

fi = open("local_test", "r")
ftrain = open("local_train_splitByUser", "w")
ftest = open("local_test_splitByUser", "w")

while True:
    rand_int = random.randint(1, 10)
    noclk_line = fi.readline().strip()
    clk_line = fi.readline().strip()
    if noclk_line == "" or clk_line == "":
        break
    if rand_int == 2:
        print >> ftest, noclk_line
        print >> ftest, clk_line
    else:
        print >> ftrain, noclk_line
        print >> ftrain, clk_line
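
This script takes the test set and splits it into train and test. Isn't that a mistake? Besides, it looks like the earlier step, local_aggretor.py, has already done the train/test split.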
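
For readers on Python 3, the same pairing logic can be sketched as below. This is an illustrative port, not code from the repository: the function name `split_pairs` and the `seed` parameter are hypothetical, and it assumes the input alternates one non-click line and one click line, which is how the original loop reads it.

```python
import random


def split_pairs(lines, seed=None):
    """Split lines, read two at a time, into (train, test) lists.

    Hypothetical helper: each consecutive (non-click, click) pair is kept
    together, and a pair goes to test with probability 1/10, mirroring
    `random.randint(1, 10) == 2` in the script above.
    """
    rng = random.Random(seed)
    train, test = [], []
    it = iter(lines)
    for noclk_line in it:
        clk_line = next(it, "")
        noclk_line, clk_line = noclk_line.strip(), clk_line.strip()
        # Same stop condition as the original: an empty line ends the input.
        if noclk_line == "" or clk_line == "":
            break
        target = test if rng.randint(1, 10) == 2 else train
        target.extend([noclk_line, clk_line])
    return train, test


def split_file(src, train_path, test_path, seed=None):
    """Read `src` and write the two splits to the given paths."""
    with open(src, "r") as fi:
        train, test = split_pairs(fi, seed=seed)
    with open(train_path, "w") as ftrain:
        ftrain.writelines(line + "\n" for line in train)
    with open(test_path, "w") as ftest:
        ftest.writelines(line + "\n" for line in test)
```

With the file names from the script, a call would look like `split_file("local_test", "local_train_splitByUser", "local_test_splitByUser")`. Whichever version is used, the input is still `local_test`, so the question raised here applies equally to both.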

@weikangliang
weikangliang commented Feb 19, 2023

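I have the same impression. The DIN, DIEN, and CAN code all split the data this way, and local_aggretor.py has already done the split: for a user with n historical behaviors, the training set uses n-1 of them, and at prediction time the first n-1 behaviors are used to predict the target n-th behavior.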
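
The per-user split described above can be illustrated with a small sketch. This is not the code in local_aggretor.py; the function name `leave_last_out` and the list-of-items input are assumptions made for illustration only.

```python
def leave_last_out(behaviors):
    """Split one user's chronologically ordered behaviors.

    Hypothetical helper. Each sample is (history, target): the target item
    and the items that preceded it. Training samples only touch the first
    n-1 behaviors; the single test sample predicts the n-th behavior from
    the first n-1.
    """
    if len(behaviors) < 2:
        raise ValueError("need at least two behaviors to build a sample")
    samples = [(behaviors[:i], behaviors[i]) for i in range(1, len(behaviors))]
    return samples[:-1], samples[-1]


train_samples, test_sample = leave_last_out(["a", "b", "c", "d"])
# train_samples == [(["a"], "b"), (["a", "b"], "c")]
# test_sample == (["a", "b", "c"], "d")
```

Under this scheme every user appears in both training and test, which is what keeps a user_id feature meaningful at prediction time.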

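As for split_by_user.py, I personally don't think it is of much use. Split this way, each user yields only two samples (one positive, one negative), and the users at prediction time are entirely different from the users seen in training, so the user_id feature can no longer be used. The paper does use this feature, so in practice we do not need split_by_user.py; the author may have left in a file that was used for testing. Split this way, generalization would appear to be higher.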
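
One way to check whether two split files share any users is to compare the user IDs on each side. This is a minimal sketch: the function name `user_overlap` is hypothetical, and it assumes tab-separated lines with the user ID in the second column, which may not match the actual file layout.

```python
def user_overlap(train_lines, test_lines, user_col=1):
    """Return the set of user IDs that appear in both splits.

    Hypothetical helper; `user_col` is the assumed zero-based index of the
    user ID field in each tab-separated line.
    """
    def users(lines):
        return {
            line.rstrip("\n").split("\t")[user_col]
            for line in lines
            if line.strip()
        }

    return users(train_lines) & users(test_lines)


train = ["0\tu1\tm1", "1\tu1\tm2"]
test = ["0\tu2\tm3", "1\tu2\tm4"]
shared = user_overlap(train, test)  # empty set: no user is in both splits
```

An empty result means no test user was seen in training, which is the situation in which a learned user_id embedding carries no information for the test users.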

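I suggest trying both splitting approaches, though the model input may need to change. Note also that the first approach makes the training set grow sharply (from 30M to 28.8GB).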
