Animal Wildlife Estimator Using Social Media (A.W.E.S.O.M.E.) is an ongoing project and stems mainly from Sreejith Menon's MS thesis

Animal Wildlife Estimator Using Social Media (A.W.E.S.O.M.E.)

Author: Sreejith Menon

Project Description:

Identify and quantify the self-reporting or behavioral bias of human photographers when it comes to sharing pictures of animals on social media. The project broadly attempts to build a population estimation model for a particular species in a particular region using pictures shared as public albums on social media. To simulate pictures shared on social media, data was generated from Amazon Mechanical Turk jobs: random virtual workers from across the globe are asked whether they will or will not share a particular image in a group of 20 images.
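The aggregation behind such a job is simple: each image's share proportion is the fraction of workers who said they would share it. A minimal sketch with made-up responses (the field names here are illustrative; the project's actual .results schema differs):

```python
from collections import defaultdict

# Hypothetical worker responses: (image_id, shared?) pairs, one per worker.
# The real Mechanical Turk .results format differs; this only illustrates
# the aggregation into a per-image share proportion.
responses = [
    ("img_1", True), ("img_1", True), ("img_1", False),
    ("img_2", False), ("img_2", False), ("img_2", True),
]

counts = defaultdict(lambda: [0, 0])  # image_id -> [shares, total exposures]
for image_id, shared in responses:
    counts[image_id][0] += int(shared)
    counts[image_id][1] += 1

share_proportion = {img: s / n for img, (s, n) in counts.items()}
```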

Two experiments have been conducted to date.

Overview of features implemented

  • Extract features from IBEIS through a function call.
  • Completely automated the process for selection of images and creation of Amazon Mechanical Turk jobs.
  • Completely automated deployment, approval and download for all the mechanical turk jobs.
  • Completely automated parsing of .results file from the mechanical turk engine and return a python object/csv/json ready for processing.
  • Methods available for extracting single as well as multiple features for an image or a list of images (which can be specified as a CSV).
  • Methods available to join features with the results and return python data-frames/csv/json for statistical calculation.
  • Methods available for generating rank list of most shared pictures.
  • Methods available for generating rank list by share proportion based on ecological features like species, sex, age, view_point of the animal.
  • Methods available for generating rank lists for a specific feature across all albums or individual albums (number of shares for zebra in a particular album versus number of shares for giraffes etc.)
  • Methods available to append the results from Amazon Mechanical Turk API with tags from Microsoft Image Tagging API and generate a rank list of most shared tags.
  • Added functionality to generate reports of all statistics in HTML format with bar charts wherever necessary.
  • Methods available for data preparation for applying classifiers using the Bag-of-Words methodology.
  • Methods available for building learning models like Logistic Regression, Support Vector Machines, Decision Trees and Random Forests and returning the predictions as well as performance metrics for the classifier.
  • Methods available for visualizing the shared/not-shared pictures on a map and creating clusters of homogeneous share/no-share regions.
  • Methods available for generating heat maps of the regions of shared and not-shared photos.
  • Methods available for applying mark-recapture and calculate the Petersen-Lincoln Index on the GZC data-set.
  • Methods available for applying mark-recapture and calculate the Petersen-Lincoln Index on the shared GZC data-set obtained from the Mechanical Turk experiments.
  • Methods available for extracting images from Flickr when the tags and text are specified.
  • Methods available for parallel download of images from Flickr and concurrent execution of image detection tasks on IBEIS.
  • Methods available for extracting beauty features from images.
  • Methods available for directly predicting the share proportion using different regression techniques. Regression methods used include Linear, Ridge, Lasso, Elastic Nets, SVR etc.
  • Methods to estimate population using synthetic albums which are formed using the predicted shared data. Mark-Recapture models are then applied to this predicted data, to study effects of individual photographers in the population estimation steps.
  • The synthetic experiments use both the probability scores generated by the classifiers and the share likelihoods generated by the regression algorithms to rank images.
  • Synthetic experiments simulate population estimates when every contributor shares their top k images, bottom k images, random k images and image above a certain likelihood threshold.
  • All the synthetic experiments can be recreated for the GZGC, GGR or Flickr datasets with minor parameter changes.
  • Method available to upload images to IBEIS instance running on pachy.
  • Method available to run detection on all images hosted on IBEIS instance with no manual intervention.
  • Method available to run identification with exemplar image logic for all valid annotations without any human intervention.
  • Module available to interact with a mongod instance to manipulate and view data corresponding to Mark-Recapture calculations.
  • A web module interface available to use a web UI to upload some JSON files, which is required to estimate population using simple mark-recapture method for species.
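As an illustration of the rank-list idea, ranking by share proportion over an ecological feature such as species can be sketched like this (the records are made up; the project's actual methods operate on data frames built by joining IBEIS features with Mechanical Turk results):

```python
# Made-up (species, shares, exposures) records; in the real pipeline these
# come from joining extracted features with the Mechanical Turk results.
records = [
    {"species": "zebra_plains", "shares": 12, "exposures": 20},
    {"species": "giraffe_reticulated", "shares": 15, "exposures": 20},
    {"species": "zebra_grevys", "shares": 8, "exposures": 20},
]

# Rank list: highest share proportion first.
ranked = sorted(records, key=lambda r: r["shares"] / r["exposures"], reverse=True)
for r in ranked:
    print(r["species"], r["shares"] / r["exposures"])
```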

Scraping Basics

This readme describes the steps for scraping images from Flickr and Bing. Assuming you have Python 3.x installed, install the dependency below by running the command directly on your command line (urllib already ships with the Python 3 standard library, so it does not need a separate install):

pip install flickrapi

Using the config file

The file WebScrapeConfig.xml has details on where the key file is stored etc. Below is a snippet of the XML file.
You have to change the parameters depending on where you store the key file, where you want the downloaded files to be stored etc.

Some comments have been added to the config directly to remind what the parameters mean. Similar parameters exist for Bing as well.

<flickr_config>
    <flickr_api_key_file location="<specify where the flickr key (in JSON form) is stored>"></flickr_api_key_file>
    <flickr_download_dir dir="/tmp/"></flickr_download_dir> <!-- directory where you want to store your downloaded images-->
</flickr_config>
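For reference, the two attributes shown above can be read with the standard library's xml.etree.ElementTree. This is only a sketch of the parsing, not the repo's actual xml_parser() implementation, and the key-file path is a placeholder:

```python
import xml.etree.ElementTree as ET

# Inline copy of the config structure shown above; in practice you would
# call ET.parse("WebScrapeConfig.xml") instead. The key-file path here is
# a made-up placeholder.
xml_snippet = """
<flickr_config>
    <flickr_api_key_file location="/path/to/flickr_key.json"></flickr_api_key_file>
    <flickr_download_dir dir="/tmp/"></flickr_download_dir>
</flickr_config>
"""

root = ET.fromstring(xml_snippet)
key_file = root.find("flickr_api_key_file").attrib["location"]
download_dir = root.find("flickr_download_dir").attrib["dir"]
```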

To start, simply clone this branch using the below command and switch to this current branch.

git clone https://github.com/CompBioUIC/BabyZebras.git
git checkout scraping_branch

Scraping images from Flickr

Open a new file in any text editor and save it as <filename of your choice>.py and add the below code snippet.

import SocialMediaImageExtracts as SE

def __main__():
  SE.scrape_flickr(10, "links.dat", ["grevy's zebra"])
  '''
    This step will scrape the first 10 pages of Flickr when you search using the query "grevy's zebra"
    and then store the URLs of all the images appearing in these 10 pages to links.dat
  '''

  SE.download_imgs("links.dat") # this step simply downloads every link in the links.dat file
  

if __name__ == "__main__":
  __main__()

Scraping EXIF data from Flickr

After downloading the images from Flickr and manually filtering them to remove all junk images, we also need to download the metadata, or EXIF information, for each image. EXIF information includes the date when the image was taken, the width and height of the image, etc.

Create a file with the list of images that you downloaded and filtered, and save it with a name of your choice, say, imageList.dat. Then create a new file with the below snippet and save it in the folder where SocialMediaImageExtracts.py exists.

import SocialMediaImageExtracts as SE

def __main__():
    configDict = SE.xml_parser()

    flickrObj = configDict["flickr_api_key_file"]

    with open("imageList.dat", "r") as fl:
        fileList = fl.read().split("\n")

    SE.getExif(flickrObj, "<name of output file>.json", fileList = fileList)

if __name__ == "__main__":
    __main__()

This script will give you a JSON file named "<name of the output file>.json"; we will be using it later for population estimation.

Scraping images from Bing

Similar to what we did for the Flickr scrapes, create a file with a name of your choice and paste the below snippet into it.

import SocialMediaImageExtracts as SE

def __main__():
  SE.bing_search_pipeline("grevy's zebra", 10)
  '''
    This step will scrape the first 10 pages of Bing when you search using the query "grevy's zebra" 
    And then also downloads all the images found in the first 10 pages to the download directory you specified in the config file
    
    This script will also generate a bunch of exif files in json form 
    (typically, width & height of the image and when the image was published)
    Look at the DataStructsHelperAPI.py file for help in combining these JSONs into 1 big file. 
  '''

if __name__ == "__main__":
  __main__()

Combining JSONs outputted by the Bing scrape

As you may have already observed, the Bing scrape generates a bunch of JSON files of the form ../data/bing_img_exif_giraffe_* (assuming you kept <bing_exif_prefix prefix="../data/bing_img_exif_giraffe_"></bing_exif_prefix> in the config file).

The below script can be used to combine multiple JSON files into a single JSON. Assume you have the below JSON files outputted by the Bing scraping method.

../data/bing_img_exif_giraffe_150.json
../data/bing_img_exif_giraffe_300.json
../data/bing_img_exif_giraffe_450.json
../data/bing_img_exif_giraffe_600.json

import DataStructsHelperAPI as DS, json

def __main__():
    combinedDict = DS.appendJSON("../data/bing_img_exif_giraffe_150.json",
                                 "../data/bing_img_exif_giraffe_300.json",
                                 "../data/bing_img_exif_giraffe_450.json",
                                 "../data/bing_img_exif_giraffe_600.json")

    with open("../data/bing_img_exif_giraffe_combined.json", "w") as fl:
        json.dump(combinedDict, fl, indent=4)

if __name__ == "__main__":
    __main__()

Expected Errors

The download step is known to fail occasionally. We use 2 processes to simultaneously download images, and the resulting network congestion can make the script fail. Due to this uncertainty, you might observe the failure on every run or never. The best workaround is to scrape in small batches: for instance, instead of scraping 50 pages at once, try the first 10 and then move ahead. You might have to change the code accordingly. I will add this to my to-do list and see if there is a way to do smooth restarts after a failure.
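One way to script the small-batch workaround is a wrapper that chunks the page range and retries each chunk with a short back-off. scrape_fn here is a stand-in for whatever scraping call you use (e.g. a function wrapping SE.scrape_flickr); its (start_page, end_page) signature is an assumption of this sketch:

```python
import time

def scrape_in_batches(scrape_fn, total_pages, batch_size=10, retries=3):
    """Run scrape_fn(start_page, end_page) over small chunks of pages,
    retrying each chunk a few times before giving up."""
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        for attempt in range(retries):
            try:
                scrape_fn(start, end)
                break  # chunk succeeded, move on to the next one
            except OSError:
                time.sleep(0.1 * 2 ** attempt)  # brief back-off before retrying
        else:
            raise RuntimeError("pages %d-%d failed after %d tries" % (start, end, retries))
```

A transient network error then costs only a retry of one small chunk instead of the whole run.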

Creating a wildbook instance

Login to pachy using your credentials. On Windows you can use PuTTY; on Mac/Ubuntu/*nix you can use the default terminal to SSH.

TMUX

You use tmux to keep something running in the background, so that even after exiting the prompt (or even logging off a remote machine) you do not break a running script.
More about tmux on: https://en.wikipedia.org/wiki/Tmux
tmux cheat sheet : https://gist.github.com/MohamedAlaa/2961058

Things we are going to use:

tmux new -s <session_name> Choose any session name you like (preferably without spaces), this will start a new tmux terminal, anything you do remains within this new virtual terminal

Control+B (on both Mac and Windows), then D, will take you out of tmux. Remember that whatever you are running in that virtual terminal is still running; you have simply detached from that session.

tmux attach -t <session_name> To go back to the same session.

Running an instance of pachy (one time setup):

Not everyone can run an instance of pachy directly. For permission issues contact admin.

  • Login to pachy
  • Create a new TMUX session
    tmux new -s ibeis_5001
  • Inside the tmux session, go to the directory where you want to create your instance. The preferred directory is /home/shared_ibeis/data/work/
cd /home/shared_ibeis/data/work    
mkdir BABYMOM    
python2.7 /opt/ibeis/ibeis/dev.py --dbdir /home/shared_ibeis/data/work/BABYMOM/ --web --port 5000

Any images you upload will now go inside the above directory. After you execute the above steps, a web instance is set up and you can access the web interface at https://pachy.cs.uic.edu:5000

Getting copy of a code to pachy:

Clone this Git repository to your machine and also to your home directory on pachy. Let the admin know if you are unable to clone.

git clone https://github.com/CompBioUIC/BabyZebras.git
git checkout scraping_branch

You can directly make changes to the code on pachy and run it from there.

Uploading images to pachy

The script UploadAndDetectIBEIS.py has all the required methods to upload images to pachy. Steps to upload:

  • Create a new script file locally on your computer with the below snippet:
    The upload process using the below script will generate a JSON, make sure you save it.
import UploadAndDetectIBEIS as UD
import json

def __main__():
    list_of_images_to_be_uploaded = [] # fill in the full paths to the images you want to upload
    img_file_gid_map = {} # stores the mapping between the image file and the GID*
    for img in list_of_images_to_be_uploaded:
        img_file_gid_map[img] = UD.upload(img)

    with open("name of the mapping file", "w") as mapping_fl:
        json.dump(img_file_gid_map, mapping_fl, indent=4)

if __name__ == "__main__":
    __main__()

Running detection - Recognizing bounding boxes and species of the animal in the bounding box

UploadAndDetectIBEIS.py has the methods to run detection on the images uploaded to a running instance.
You can run these steps only after your upload step is complete.
Steps to trigger detection module:

  • Download UploadAndDetectIBEIS.py to your local computer and add the below code snippet to it.

from functools import partial      # these imports may already exist in the file
from multiprocessing import Pool
from contextlib import closing

def __main__():
    gidList = [i for i in range(start_gid, end_gid+1)]
    detect = partial(run_detection_task)

    with closing(Pool(processes=2)) as p:
        p.map(detect, gidList)
        p.terminate()

if __name__ == "__main__":
    __main__()

start_gid and end_gid specify the range of GIDs you want to run the detection for.

  • Login to pachy
  • Start a new tmux session
  • Simply run python UploadAndDetectIBEIS.py
  • Close the tmux session.
  • Exit pachy.
  • To check progress you can login back to pachy and attach to the tmux session you created.

Running identification pipeline - Recognizing individuals across different images

The identification pipeline, unlike the detection pipeline, looks at annotations instead of the images themselves. Each annotation is uniquely identified by a unique ID, the AID. Each new annotation is matched against existing annotations in the database. (There is a little more to the logic: not every annotation is matched, only the "exemplar" ones.) We will do a cold start here since our database is empty. We only specify end_gid, and the identification pipeline will run through GIDs 1 through end_gid.

  • Download UploadAndDetectIBEIS.py to your local computer and add the below code snippet to it. (You should remove the earlier detection snippet from the file before running.)
def __main__():
    run_id_pipeline(end_gid, 'species for which you are running the detection') # zebra_plains, zebra_grevys, giraffe_reticulated etc. are some of the supported species. 
        
if __name__ == "__main__":
    __main__()
  • Login to pachy
  • Start a new tmux session
  • Simply run python UploadAndDetectIBEIS.py
  • Close the tmux session.
  • Exit pachy.
  • To check progress you can login back to pachy and attach to the tmux session you created.

Notes:

  • GID is nothing but an ID assigned by Wildbook to each individual image. A GID uniquely identifies an image.

Running Population Estimation using Web UI

A very simple UI was created to help estimate the population of a species using images scraped from social media platforms.

Features:

  • Connects to a local mongod instance and, based on input from the user (species name, source, date range), estimates the population.
  • A user can also upload EXIF and IBEIS files directly using the UI; these files are uploaded to the locally running mongod instance and can be used for population estimation.

Requirements:

  • Python-3
  • Latest version of mongod.
  • A GUI to access mongod (I recommend using RoboMongo - very friendly UI).

Population Estimation:

Enter all the required fields. (Screenshot: Population Estimation home page.)

Final output. (Screenshot: Population Estimation home page.)

Uploading files for Population Estimation

There are 3 main files that are required to perform population estimation using the simple Mark-Recapture formula.

  • EXIF files Sample record:
{
"509": {
        "long": -81.643459,
        "orientation": 1,
        "height": 4608,
        "lat": 30.404526,
        "date": "2014-10-29 13:21:13",
        "width": 3456
    },
...
}
  • Mapping between GID-AID
{
"385": [
        [
            525
        ]
    ],
"1449": [
        [
            2030,
            2031
        ]
    ],
...
}
  • Mapping between AID-Features
{
"1": {
        "sex": "UNKNOWN SEX",
        "name": "1",
        "yaw": null,
        "age": "infant",
        "quality": "UNKNOWN",
        "NID": 1,
        "SPECIES": "giraffe_reticulated",
        "exemplar": "1",
        "contributor": null
    },
...
}
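These three files feed the simple Mark-Recapture calculation, i.e. the Lincoln-Petersen estimator: with n1 individuals sighted on the first occasion, n2 on the second, and m sighted on both, the estimated population is N ≈ n1·n2/m. A minimal sketch with made-up NIDs (not the repo's actual implementation):

```python
def lincoln_petersen(occasion_1, occasion_2):
    """Estimate population size from two sighting occasions.
    occasion_1 and occasion_2 are sets of individual IDs (e.g. NIDs)."""
    n1, n2 = len(occasion_1), len(occasion_2)
    m = len(occasion_1 & occasion_2)  # individuals re-sighted on both occasions
    if m == 0:
        raise ValueError("no recaptures; estimate undefined")
    return n1 * n2 / m

# Made-up NIDs sighted on two days of a census:
day_1 = {1, 2, 3, 4}
day_2 = {3, 4, 5, 6}
print(lincoln_petersen(day_1, day_2))  # 4 * 4 / 2 = 8.0
```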

The important thing to note here is that all the files are indexed by GID and not by file names. This was done deliberately for ease of calculation. While there is some logic that can handle mismatching indexes, it is currently disabled for the sake of simplicity. The EXIF file you have will primarily be indexed by the actual file name, and when you uploaded your images to an active IBEIS/WB instance, you should have gotten back a mapping between the actual file names and GIDs. You can re-index your EXIF file using code similar to:

import json
import DataStructsHelperAPI as DS       # This helper has a lot of helpful methods
                                        # for JSON manipulation etc. 
                                        # located inside the scripts folder

if __name__ == "__main__":
    fl_nm_gid_map = DS.flipKeyValue(DS.json_loader("gid-filename-map.json"))

    exif_map = DS.json_loader("exif-file.json")

    corrected_exif = { fl_nm_gid_map[fl_nm] : exif_map[fl_nm] for fl_nm in exif_map.keys() }

    with open("corrected-exif-json-fl.json", "w") as corrected_json:
        json.dump(corrected_exif, corrected_json, indent=4) 

The web instance can be triggered by running
python awesome_app.py

All the files relating to the web UI are located inside repo/web_files.