A tool that scrapes Frequently Asked Questions (FAQs) from websites and presents them in a portal where users can suggest possible paraphrases of the questions. The purpose of the tool is to assist in the data-gathering phase of Natural Language Processing (NLP) projects.
This tool was created as part of a final year project at Middlesex University, under the supervision of Prof. Franco Raimondi (@fraimondi) and in conjunction with Kare Knowledgeware (formerly Gluru).
- Create a Google Cloud Account (if you don't already have one: here)
- Create a new project (and take note of the project id)
- Create a new database instance here (make sure to use Cloud Firestore in Native Mode)
- Create a 'Service account key' [here](https://console.cloud.google.com/apis/credentials) with at least the role 'Cloud Datastore User', and store the resulting JSON file in a secure location
- Set the required environment variables:

  ```sh
  export GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_json_file>
  export GOOGLE_PROJECT_ID=<project_id>
  ```
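Before starting either component, it can help to confirm that the two environment variables point at a valid service-account key file. The following sketch is an illustration, not part of the repo; the variable names come from the steps above, while the helper name and checks are assumptions:

```python
import json
import os


def check_gcp_env():
    """Verify the two required environment variables and the key file they reference.

    Returns (ok, message). This is a local sanity check only; it does not
    contact Google Cloud.
    """
    creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    project_id = os.environ.get("GOOGLE_PROJECT_ID")
    if not creds_path or not project_id:
        return False, "GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_PROJECT_ID must both be set"
    if not os.path.isfile(creds_path):
        return False, "no file found at {}".format(creds_path)
    with open(creds_path) as f:
        key = json.load(f)  # raises ValueError if the file is not valid JSON
    if key.get("type") != "service_account":
        return False, "key file does not look like a service-account key"
    return True, "credentials OK for project {}".format(project_id)
```

Running this once before setting up the server or the cron job can save a round of debugging opaque authentication errors later.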
- Install GoLang here (it was developed using `go1.9.1`)
- Navigate to the `/go-server` directory within this repo
- Run `go get` to install the application dependencies
- Run `go run main.go` to run the server (runs on port 9090 by default)
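Once `go run main.go` is running, a quick way to confirm the server is listening is a plain TCP probe. This helper is an illustration, not part of the repo; the default host and port are assumptions based on the port mentioned above:

```python
import socket


def is_server_up(host="localhost", port=9090, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the probe fails, check that nothing else is bound to port 9090 and that the server process did not exit with an error.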
- Install Python 3 here (it was developed using `Python 3.5.0`)
- (optional) Create a virtual environment to isolate modules from the global instance
- Navigate to the `/fypScraper` directory within this repo
- Run `pip install -r requirements.txt` to install the application dependencies
- Configure a cron job to run the `/fypScraper/fypScraper/spiders/spiderLauncher.py` script at a given interval (at each execution, the script will find and process any unscraped websites found in the datastore). For example:

  ```
  * * * * * <python3_path> <path_to_repo>/fypScraper/fypScraper/spiders/spiderLauncher.py
  ```

  will run the scraper every minute.
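The five cron fields above are minute, hour, day-of-month, month, and day-of-week; `* * * * *` makes every field a wildcard, so the launcher fires once a minute. As an illustration of how a schedule maps to a timestamp (not part of the repo, and supporting only `*` and plain numbers, not ranges or steps):

```python
from datetime import datetime


def cron_matches(expr, when):
    """Check a five-field cron expression against a datetime.

    Supports only '*' and plain numeric fields. Day-of-week uses cron's
    0=Sunday convention, so Python's Monday=0 weekday is shifted.
    """
    fields = expr.split()
    values = [when.minute, when.hour, when.day, when.month,
              (when.weekday() + 1) % 7]  # Python Monday=0 -> cron Sunday=0
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))
```

For instance, `cron_matches("30 12 * * *", now)` is true only at 12:30, whichever interval you pick, and quieter schedules (e.g. `0 * * * *` for hourly) reduce load on the sites being scraped.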