Twitter-intact-stream is a tool for collecting a (nearly) complete Twitter filtered stream. It was first used to construct the datasets in the following paper:
Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. Variation across Scales: Measurement Fidelity under Twitter Data Sampling. AAAI International Conference on Weblogs and Social Media (ICWSM), 2020.
All code was developed and tested with Python 3.7 and tweepy 3.7.0.
Before using the Twitter-intact-stream crawler, you need to register a Twitter developer account and your apps.
Then rename `conf/developer.key-backup` to `conf/developer.key`, and set the registered tokens in `conf/developer.key`.
Twitter-intact-stream splits a set of filtering keywords and languages across multiple subcrawlers. Because each subcrawler tracks only part of the filter, it can retrieve a larger share of its matching tweets; combined, the subcrawlers raise the overall sampling rate of the target stream.
Next, configure the monitored predicates by creating a `conf/*_crawler.conf` file.
We provide an example configuration for COVID-19 (see `conf/covid_crawler.conf`; keywords extended from this paper):
{"app_name": "covid",
"crawler0": {"crawler_name": "covid_all",
"key_set": "key0",
"keywords": ["coronavirus", "covid19", "covid", "covid–19", "COVIDー19", "pandemic", "covd", "ncov", "corona", "corona virus", "sars-cov-2", "sarscov2", "koronavirus", "wuhancoronavirus", "wuhanvirus", "wuhan virus", "chinese virus", "chinesevirus", "china", "wuhanlockdown", "wuhan", "kungflu", "sinophobia", "n95", "world health organization", "cdc", "outbreak", "epidemic", "lockdown", "panic buying", "panicbuying", "socialdistance", "social distance", "socialdistancing", "social distancing"],
"languages": []},
"crawler1": {"crawler_name": "covid_1",
"key_set": "key1",
"keywords": ["wuhanlockdown", "wuhan", "kungflu", "sinophobia", "n95", "world health organization", "cdc", "outbreak", "epidemic"],
"languages": []},
"crawler2": {"crawler_name": "covid_2",
"key_set": "key2",
"keywords": ["lockdown", "panic buying", "panicbuying", "socialdistance", "social distance", "socialdistancing", "social distancing"],
"languages": []},
"crawler3": {"crawler_name": "covid_3",
"key_set": "key3",
"keywords": ["pandemic", "covd", "ncov"],
"languages": ["en", "es"]},
"crawler4": {"crawler_name": "covid_4",
"key_set": "key4",
"keywords": ["coronavirus"],
"languages": ["en"]},
"crawler5": {"crawler_name": "covid_5",
"key_set": "key5",
"keywords": ["covid"],
"languages": ["en"]},
"crawler6": {"crawler_name": "covid_6",
"key_set": "key6",
"keywords": ["covid–19", "COVIDー19", "covid19"],
"languages": ["en"]}
}
The value of `key_set` must match one of the keys that index `conf/developer.key`.
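A quick sanity check along these lines can catch a mismatched `key_set` before any crawler starts. This is a minimal sketch and not part of the repository: the helper name `find_missing_key_sets` is hypothetical, and it assumes that `conf/developer.key` parses into a dictionary whose top-level keys are the key-set names (the actual file layout may differ).

```python
def find_missing_key_sets(crawler_conf, developer_keys):
    """Return {crawler_id: key_set} for crawlers whose key_set is undefined.

    crawler_conf   -- dict parsed from a conf/*_crawler.conf file
    developer_keys -- dict parsed from conf/developer.key (ASSUMED layout:
                      top-level keys are the key-set names, e.g. "key0")
    """
    missing = {}
    for crawler_id, crawler in crawler_conf.items():
        # Skip scalar entries such as "app_name".
        if not isinstance(crawler, dict):
            continue
        key_set = crawler.get("key_set")
        if key_set not in developer_keys:
            missing[crawler_id] = key_set
    return missing


# Toy data mirroring the structure of the example configuration.
conf = {
    "app_name": "covid",
    "crawler0": {"crawler_name": "covid_all", "key_set": "key0",
                 "keywords": ["coronavirus"], "languages": []},
    "crawler1": {"crawler_name": "covid_1", "key_set": "key1",
                 "keywords": ["wuhan"], "languages": []},
}
keys = {"key0": {}}  # placeholder: the token fields are omitted here

print(find_missing_key_sets(conf, keys))  # {'crawler1': 'key1'}
```

An empty result means every subcrawler refers to a key set that is defined.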
`crawler0` tracks all 35 keywords in all languages.
`crawler1` tracks 9 keywords in all languages.
`crawler2` tracks 7 keywords in all languages.
`crawler3` tracks 3 keywords in English or Spanish.
`crawler4` tracks 1 keyword in English.
`crawler5` tracks 1 keyword in English.
`crawler6` tracks 3 keywords in English.
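The per-crawler summary above can be derived mechanically from a configuration file. The sketch below assumes the conf file is plain JSON, as the example suggests, and treats an empty `languages` list as "all languages", matching the description of `crawler0`. The function name `describe_crawlers` is hypothetical.

```python
import json


def describe_crawlers(conf):
    """Return one summary line per subcrawler entry in a parsed conf dict."""
    lines = []
    for crawler_id, crawler in conf.items():
        # Skip scalar entries such as "app_name".
        if not isinstance(crawler, dict):
            continue
        n_keywords = len(crawler["keywords"])
        noun = "keyword" if n_keywords == 1 else "keywords"
        languages = crawler["languages"]
        # An empty language list is read as "no language filter".
        scope = " or ".join(languages) if languages else "all languages"
        lines.append("{0} tracks {1} {2} in {3}".format(
            crawler_id, n_keywords, noun, scope))
    return lines


# Two entries copied from the example configuration above.
conf_text = """
{"app_name": "covid",
 "crawler3": {"crawler_name": "covid_3",
              "key_set": "key3",
              "keywords": ["pandemic", "covd", "ncov"],
              "languages": ["en", "es"]},
 "crawler4": {"crawler_name": "covid_4",
              "key_set": "key4",
              "keywords": ["coronavirus"],
              "languages": ["en"]}}
"""

for line in describe_crawlers(json.loads(conf_text)):
    print(line)
# crawler3 tracks 3 keywords in en or es
# crawler4 tracks 1 keyword in en
```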
Next, replace the conf file path and the output directory in the main script (`crawlers/multi_process_crawlers.py`) with your own values:
line 57: with open('../conf/covid_crawler.conf', 'r') as config_file:
line 66: output_dir = '../data/{0}'.format(app_name)
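If you would rather not edit the script each time you switch datasets, the two values can be derived from a single argument. The following is only a sketch of that idea, not the code in `multi_process_crawlers.py`: the function `load_crawler_conf` and its `data_root` parameter are hypothetical, and the conf file is assumed to be plain JSON.

```python
import json
import os
import tempfile


def load_crawler_conf(conf_path, data_root="../data"):
    """Read a crawler conf file and derive the output directory from app_name."""
    with open(conf_path, "r") as config_file:
        conf = json.load(config_file)
    # Same pattern as the main script: '../data/{0}'.format(app_name)
    output_dir = "{0}/{1}".format(data_root, conf["app_name"])
    return conf, output_dir


# Demonstration with a temporary file standing in for conf/covid_crawler.conf.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "covid_crawler.conf")
    with open(demo_path, "w") as demo_file:
        json.dump({"app_name": "covid"}, demo_file)
    conf, output_dir = load_crawler_conf(demo_path)

print(output_dir)  # ../data/covid
```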
We provide analysis code to compute the number of missing tweets and the sampling rates. The scripts should be executed in order.
`extract_tweet_status.py`: This script converts the collected tweets from JSON format to text format. See the comments in the script for details.
`merge_subcrawlers.py`: This script removes duplicate tweet ids and sorts them chronologically.
`count_ratemsg_in_subcrawlers.py`: This script computes the sampling rate of each subcrawler.
`plot_tweet_volume.py`: This script plots the temporal tweet counts and sampling rates at hourly granularity.
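To illustrate the merge step, the sketch below deduplicates tweet ids from several subcrawlers and orders them by time. It is not the implementation in `merge_subcrawlers.py`, whose input format is not described here. It relies on the fact that Twitter snowflake ids store a millisecond timestamp in their upper bits, so sorting ids numerically orders tweets by creation time down to the millisecond. The function names and the toy ids are hypothetical.

```python
from datetime import datetime, timezone

# Offset (in ms) that Twitter snowflake ids use as their time origin.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id):
    """Recover the UTC creation time encoded in a snowflake tweet id."""
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000.0, tz=timezone.utc)


def merge_tweet_ids(*id_lists):
    """Union the id lists of several subcrawlers and sort them by id."""
    unique_ids = set()
    for ids in id_lists:
        unique_ids.update(ids)
    return sorted(unique_ids)


def make_toy_id(when, sequence=0):
    """Build a synthetic snowflake-style id for the demonstration only."""
    timestamp_ms = int(when.timestamp() * 1000)
    return ((timestamp_ms - TWITTER_EPOCH_MS) << 22) | sequence


early = make_toy_id(datetime(2020, 3, 23, tzinfo=timezone.utc))
late = make_toy_id(datetime(2020, 3, 24, tzinfo=timezone.utc))

# The same tweet can be delivered to more than one subcrawler.
merged = merge_tweet_ids([late, early], [early])
print(len(merged))                       # 2
print(snowflake_to_datetime(merged[0]))  # 2020-03-23 00:00:00+00:00
```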
Output:
>>> subcrawler covid_1 , 11725467 retrieved tweets, 216 rate limit track, indicating 6912 missing tweets, yielding 99.94% sampling rate
>>> subcrawler covid_2 , 24415704 retrieved tweets, 75692 rate limit track, indicating 1053052 missing tweets, yielding 95.87% sampling rate
>>> subcrawler covid_3 , 16753021 retrieved tweets, 4397 rate limit track, indicating 33454 missing tweets, yielding 99.80% sampling rate
>>> subcrawler covid_4 , 33952506 retrieved tweets, 604400 rate limit track, indicating 24173145 missing tweets, yielding 58.41% sampling rate
>>> subcrawler covid_5 , 27998308 retrieved tweets, 182773 rate limit track, indicating 1880251 missing tweets, yielding 93.71% sampling rate
>>> subcrawler covid_6 , 20592431 retrieved tweets, 39503 rate limit track, indicating 336820 missing tweets, yielding 98.39% sampling rate
>>> subcrawler covid_7 , 21653393 retrieved tweets, 24647 rate limit track, indicating 137600 missing tweets, yielding 99.37% sampling rate
>>> subcrawler covid_8 , 18046113 retrieved tweets, 8426 rate limit track, indicating 67938 missing tweets, yielding 99.62% sampling rate
>>> subcrawler covid_9 , 15407936 retrieved tweets, 2125 rate limit track, indicating 40456 missing tweets, yielding 99.74% sampling rate
>>> subcrawler covid_10 , 17576365 retrieved tweets, 20193 rate limit track, indicating 187806 missing tweets, yielding 98.94% sampling rate
>>> subcrawler covid_11 , 17894990 retrieved tweets, 11863 rate limit track, indicating 103313 missing tweets, yielding 99.43% sampling rate
>>> subcrawler covid_12 , 18766409 retrieved tweets, 8838 rate limit track, indicating 55686 missing tweets, yielding 99.70% sampling rate
>>> subcrawler covid_all, 34545906 retrieved tweets, 691059 rate limit track, indicating 193510821 missing tweets, yielding 15.15% sampling rate
>>> complete_set covid , 209463069 retrieved tweets, NaN rate limit track, estimating 228056727 total tweets, yielding 91.85% sampling rate
>>> Elapsed time: 0:34:35.998
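The percentages in the output above are consistent with a simple ratio: retrieved tweets divided by retrieved plus missing tweets. The sketch below reproduces the figures for `covid_all` and for the complete set from the printed counts. It assumes, based on the numbers matching exactly, that the estimated total is the sum of retrieved and missing tweets of `covid_all`. The function name `sampling_rate` is hypothetical.

```python
def sampling_rate(retrieved, missing):
    """Fraction of the stream that was actually retrieved."""
    return retrieved / (retrieved + missing)


# Counts copied from the output above.
retrieved_all, missing_all = 34545906, 193510821
retrieved_complete_set = 209463069

# Estimated size of the full stream, as seen by the single crawler.
estimated_total = retrieved_all + missing_all
print(estimated_total)  # 228056727

# One crawler tracking every keyword.
print("{:.2%}".format(sampling_rate(retrieved_all, missing_all)))  # 15.15%

# Union of all subcrawlers, measured against the same estimated total.
print("{:.2%}".format(retrieved_complete_set / estimated_total))  # 91.85%
```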
As the output shows, the sampling rate increases from 15.15% with a single crawler to 91.85% with 12 subcrawlers (209M tweets in 8 days, or roughly 26M per day on average). This is significantly more than the current Twitter threshold (4.32M per day). The following image plots temporal tweet counts and sampling rates for a dataset collected from 2020-03-23 to 2020-03-31.