Bentley Historical Library's implementation of twarc, used to capture searches of hashtags and mentions using the Twitter API
- Clone the repository
- cd bhl_twarc
- Create a configuration file called feeds.txt
- Entries in the configuration file should look like this:
[examplehashtag]
crawl:True
name:Example Hashtag (#examplehashtag)
crawl_type:hashtag
search_string:#examplehashtag
[examplementions]
crawl:False
name: Example Mentions (@examplementions)
crawl_type:mentions
search_string:@examplementions
- Create an application at apps.twitter.com
- Note the consumer key, consumer secret, access token, and access token secret associated with the application
- Run bhl_twarc.py
- The script will parse the entries in feeds.txt and initiate a Twitter search for every entry that has a crawl setting of True
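The parse-and-filter step above can be sketched with Python's standard configparser module, which accepts the key:value syntax shown in the example entries. This is a minimal illustration, not the code in bhl_twarc.py; the function name feeds_to_crawl is hypothetical.

```python
import configparser


def feeds_to_crawl(config_path="feeds.txt"):
    """Return {section: settings} for every feed whose crawl setting is True.

    Hypothetical helper for illustration; bhl_twarc.py may do this differently.
    """
    # interpolation=None keeps characters such as '%' in search strings literal.
    parser = configparser.ConfigParser(interpolation=None)
    with open(config_path, encoding="utf-8") as handle:
        parser.read_file(handle)

    selected = {}
    for section in parser.sections():
        # getboolean reads True/False case-insensitively; a feed with no
        # crawl key is treated as "do not crawl" in this sketch.
        if parser.getboolean(section, "crawl", fallback=False):
            selected[section] = dict(parser[section])
    return selected
```

With the example configuration above, only the examplehashtag entry would be selected, because examplementions has crawl set to False.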
bhl_twarc will create the following directory structure if it does not already exist (shown here for the example configuration above):
feeds
  examplehashtag
    html
    json
    logs
    media
      profile_images
      tweet_images
- The raw JSON returned by the Twitter API will be saved to the feed's json directory
- Logs for the API search will be stored in a twarc.log file in the logs directory
- An HTML file created using the Twitter JSON will be stored in the html directory
  - This is based heavily on twarc's wall.py
- Profile images and embedded images from tweets will be fetched and stored in the corresponding folders in the media directory
  - The paths to images in the converted HTML files will point to the images stored in the media directory
  - CSV files will also be created and stored in the media directory, indicating each image's original URL and download location
- An index.html file will be created in the feed's root directory containing a table pointing to the raw JSON and converted HTML for each crawl
- The README.txt from bhl_twarc\lib will be copied to the feed's root directory
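The directory layout described above could be created along the following lines. This is a sketch under the assumption that the layout matches the tree shown for the example configuration; the names FEED_SUBDIRS and ensure_feed_dirs are hypothetical and not taken from bhl_twarc.py.

```python
import os

# Sub-directories of one feed, relative to the feed's root directory.
FEED_SUBDIRS = (
    "html",
    "json",
    "logs",
    os.path.join("media", "profile_images"),
    os.path.join("media", "tweet_images"),
)


def ensure_feed_dirs(feed_name, base_dir="feeds"):
    """Create the feed's directories if they are missing; return the feed root."""
    feed_root = os.path.join(base_dir, feed_name)
    for sub in FEED_SUBDIRS:
        # exist_ok=True makes repeated runs safe: existing directories are kept.
        os.makedirs(os.path.join(feed_root, sub), exist_ok=True)
    return feed_root
```

Calling ensure_feed_dirs("examplehashtag") a second time leaves the existing directories and their contents untouched.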
The first time bhl_twarc.py is run, it will prompt you for your consumer key, consumer secret, access token, and access token secret, which will then be stored in a file called .twarc.
This file is ignored by default in this repo's .gitignore. Make sure not to commit this file to GitHub, as it will contain your Twitter API secret keys.
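A prompt-once-then-reuse pattern like the one described could look roughly like this. The on-disk format of .twarc is not documented here, so the INI layout, the [twitter] section name, and the function name below are all assumptions made for illustration.

```python
import configparser
import os

CREDENTIAL_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_or_prompt_credentials(path=".twarc", ask=input):
    """Return the four credentials, prompting only when they are not stored yet.

    Hypothetical sketch: the INI layout and section name are assumptions,
    not the actual .twarc format.
    """
    parser = configparser.ConfigParser(interpolation=None)
    if os.path.exists(path):
        parser.read(path, encoding="utf-8")
        if parser.has_section("twitter") and all(
            parser.has_option("twitter", key) for key in CREDENTIAL_KEYS
        ):
            return dict(parser["twitter"])

    # Nothing usable on disk: ask for each value and save it for the next run.
    creds = {key: ask(f"{key}: ").strip() for key in CREDENTIAL_KEYS}
    parser["twitter"] = creds
    with open(path, "w", encoding="utf-8") as handle:
        parser.write(handle)
    return creds
```

Passing the prompt function as the ask parameter keeps the sketch testable without keyboard input.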
The following command line arguments can be passed to bhl_twarc.py:
- To search a single feed from feeds.txt, and only that feed:
  bhl_twarc.py -f examplehashtag
- To exclude one or more feeds in feeds.txt from a crawl:
  bhl_twarc.py -e examplehashtag examplementions
- To run a test crawl using a configuration file called feeds_test.txt, the results of which will be saved to a directory called test_crawls:
  bhl_twarc.py -t
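For readers who want to see how the three flags might fit together, here is a minimal argparse sketch. Only the short flags -f, -e, and -t come from this README; the long option names, the helper functions, and the way the flags combine are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for the -f, -e and -t flags (long names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="bhl_twarc.py")
    parser.add_argument("-f", "--feed", help="crawl only this feed")
    parser.add_argument("-e", "--exclude", nargs="+", default=[],
                        help="feeds to leave out of the crawl")
    parser.add_argument("-t", "--test", action="store_true",
                        help="run a test crawl")
    return parser


def select_feeds(all_feeds, args):
    """Apply -f or -e to the list of feed names read from the configuration."""
    if args.feed:
        return [name for name in all_feeds if name == args.feed]
    return [name for name in all_feeds if name not in args.exclude]


def paths_for(args):
    """Return (configuration file, output directory) for a normal or test crawl."""
    if args.test:
        return "feeds_test.txt", "test_crawls"
    return "feeds.txt", "feeds"
```

For example, build_parser().parse_args(["-e", "examplehashtag", "examplementions"]) collects both names into args.exclude, so select_feeds drops them from the crawl.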