首先请确保可以访问Quora.
Please make sure you can connect to Quora.
在命令行运行下列指令.
Run the following scripts.
git clone https://github.com/ZizhenWang/QuoraCrawler.git
cd QuoraCrawler/
pip install -r requirements.txt
从官网下载与浏览器对应版本的selenium,并添加至系统路径.
Download selenium from website and add it to system PATH.
从可选id中选择一个或多个id
Select one or more ids from available id pool.
对于一个id,如0
,请运行
For single id as 0
, please run
python crawl.py -i 0
对于多个id,如0,1,2,3,4
,请运行
For single id as 0,1,2,3,4
, please run
python crawl.py -i 0,1,2,3,4
- python版本为3.6
- python version is 3.6
- 如果你准备或已经运行了部分数据的爬虫,请将对应的id邮件告知我或通过issue提交,避免重复采集,再次感谢你的帮助!
- If you are ready to run some ids' crawler, please informs me by email or issue to avoid duplicate crawling, thanks for your help again!
- 一个id对应的文件包含6k篇文档,一篇文档需要进行三次渲染,大概需要6h,具体用时视网速而定.
- An id file contains 6.6k documents, and a document needs to be rendered three times. It takes about 8 hours, depending on the speed of the network.
0~89
range | path | state |
---|---|---|
0,1 | local | |
2~9 | tp | |
10~19 | yl | |
20~29 | lili | |
30~39 | vm | |
50~59 | wlh | |
60~64 | local |
- 更新了待爬取的问题表
- 更新了部分爬取代码
range | path | state |
---|---|---|
0,3 | mbp | done |
4,9 | carbon | in progress |
10,14 | mbp | done |
15,19 | carbon | in progress |
20,29 | 185 | in progress |
30,39 | mbp | in progress |
81 | ------- | train/test boundary |
185 | mbp | in progress |
190,199 | pc | in progress |