Sina Weibo Scraper by Yong HU and Mingyang LI. 2,000 posts per second.
WeiboSpider

This is a Sina Weibo spider built by nghuyong, largely tailored by Mingyang Li to run on WWBP's servers.

A detailed explanation, written by nghuyong, can be found at 微博爬虫总结：构建单机千万级别的微博爬虫系统 ("Weibo spider summary: building a single-machine spider system at the ten-million scale").

A description of the data structure can be found at 数据字段说明与示例 ("Data field descriptions and examples").

Other Branches

The original repo by nghuyong has three branches:

Branch   Structure          Posts per Day
simple   single account     100,000
master   account pool       1,000,000
senior   distributed pool   10,000,000
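
For a rough sense of scale, the per-day figures above convert to per-second rates with plain arithmetic (86,400 seconds per day):

```python
# Convert the branches' posts-per-day figures into posts per second.
SECONDS_PER_DAY = 86_400
rates_per_day = {"simple": 100_000, "master": 1_000_000, "senior": 10_000_000}
rates_per_second = {
    branch: round(per_day / SECONDS_PER_DAY, 1)
    for branch, per_day in rates_per_day.items()
}
# The "senior" branch works out to roughly 116 posts per second sustained.
```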

Usage

  1. Clone the repo and install its dependencies:
    git clone git@github.com:nghuyong/WeiboSpider.git
    cd WeiboSpider
    pip install -r requirements.txt
  2. Install PhantomJS, MongoDB, and Redis. Start the latter two.
  3. Write the usernames and passwords of some Sina Weibo accounts into sina/account_build/account.txt, following the format shown in account_sample.txt.
  4. Populate the account pool by running python sina/account_build/login.py.
  5. Seed the URLs to start scraping from by running python sina/redis_init.py.
  6. Start the scraper with scrapy crawl weibo_spider.
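
Step 5 seeds the Redis queue that the Scrapy spider reads its start URLs from. A minimal sketch of that idea is below; the URL pattern and the queue key are illustrative assumptions, not taken from the repo's actual sina/redis_init.py.

```python
# Hypothetical sketch of seeding start URLs (step 5). The weibo.cn profile-page
# URL pattern and the queue key below are assumptions for illustration only.

def start_urls(user_ids):
    """Turn seed user IDs into profile-page URLs (assumed pattern)."""
    return ["https://weibo.cn/%s/info" % uid for uid in user_ids]

# With Redis running, the URLs would then be pushed onto the queue,
# e.g. (requires the redis package; key name is a guess):
#   import redis
#   r = redis.StrictRedis()
#   r.lpush("weibo_spider:start_urls", *start_urls(["12345"]))
```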

Data Storage

Posts, user profiles, and user relationships (and, optionally, comments) are stored in MongoDB.

Performance

With the default settings on an Ubuntu machine with 16 GB of memory and an 8-core CPU, running 36 processes, we average 2,000 posts per second.
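
If you want to check your own throughput against this figure, sampling the stored document count twice is enough; the arithmetic is storage-agnostic (how you obtain the counts, e.g. from MongoDB, is up to you):

```python
# Compute observed posts per second from two document-count samples.
def throughput(count_start, count_end, seconds_elapsed):
    """Posts per second between two count samples taken seconds_elapsed apart."""
    return (count_end - count_start) / seconds_elapsed

# e.g. 120,000 new posts observed over one minute:
# throughput(0, 120_000, 60) -> 2000.0
```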
