This is an automatic FAQ scraper for help-center articles, built with Scrapy. By "automatic" I mean that you provide a list of companies and their help-center URLs, and the scraper automatically follows all the internal articles and extracts FAQs as (Question, Answer) pairs.
Currently, the scraper supports two types of operations:
- general: scraping generic help-center content.
- zendesk: scraping companies that run their help center on Zendesk.
The scraper reads a list of companies and starts scraping their content; it writes the results to JSON files, with a folder for each company.
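The exact input format is not spelled out here; as a purely hypothetical illustration, the companies list could be a small CSV mapping each company to its help-center URL (the real file and column names may differ):

```text
company,help_center_url
acme,https://support.acme.com
globex,https://help.globex.com
```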
The zendesk operation is the straightforward one: Zendesk help centers share a common URL pattern,
f'{company_domain}/api/v2/help_center/en-us/sections.json',
f'{company_domain}/api/v2/help_center/en-us/articles.json'
By simply telling the spider to follow those links, you can get all the articles and their sections, which is what zendesk_spider does.
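As an illustration, a minimal Scrapy spider built on that pattern could look like the sketch below. This is a hedged sketch with assumed names (ZendeskFaqSpider, company_domain), not the project's actual zendesk_spider; it pages through articles.json via the API's next_page field and yields each article's title and body as a (Question, Answer) pair.

```python
import scrapy


class ZendeskFaqSpider(scrapy.Spider):
    # Hypothetical sketch, not the project's zendesk_spider.
    name = "zendesk_faq_sketch"

    def __init__(self, company_domain="https://example.zendesk.com", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f"{company_domain}/api/v2/help_center/en-us/articles.json"]

    def parse(self, response):
        data = response.json()
        for article in data.get("articles", []):
            # The article title serves as the question, its HTML body as the answer.
            yield {
                "question": article.get("title"),
                "answer": article.get("body"),
                "section_id": article.get("section_id"),
            }
        # The Help Center API paginates; follow next_page until it is null.
        next_page = data.get("next_page")
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```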
The general operation is the tricky one. The objective here is to scrape any other help-center URL by following the tree pattern, if it exists; the tree pattern is simply start_url >> start_url/categories >> category >> article. To do that, the spider keeps recursively following the pattern while being careful to avoid hitting URLs that are not help-center articles. After that, all the HTML content it collects is stored and processed as a tree, looking for the last HTML page that contains the article and extracting FAQs from it.
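A rough sketch of that idea is shown below (assumed names and link-filtering heuristics, not the project's actual spider): it recursively follows links whose URLs look like part of the help-center tree, stays on the start domain, and stores each page's raw HTML for the processing step.

```python
import scrapy
from urllib.parse import urlparse


class GeneralHelpCenterSpider(scrapy.Spider):
    # Hypothetical sketch of the "general" operation.
    name = "general_help_center_sketch"
    # Only follow links whose URL looks like part of the help-center tree.
    TREE_KEYWORDS = ("categor", "section", "article", "faq", "/hc/")

    def __init__(self, start_url="https://help.example.com", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]

    def parse(self, response):
        # Store the raw HTML; the processing step later decides which pages are
        # leaf articles and extracts (Question, Answer) pairs from them.
        yield {"url": response.url, "html": response.text}

        # Recursively follow candidate links; Scrapy's duplicate filter
        # prevents revisiting pages that were already crawled.
        for href in response.css("a::attr(href)").getall():
            if any(keyword in href.lower() for keyword in self.TREE_KEYWORDS):
                yield response.follow(href, callback=self.parse)
```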
To build and run the scraper:
docker build -f scraping_docker -t scraper .
docker run scraper -f filename -t operation_type
To build and run the processor:
docker build -f processing_docker -t processor .
docker run processor -f filename -t operation_type
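Here filename is the companies list and operation_type is one of the operations above (general or zendesk). For example, assuming the list lives in a hypothetical companies.csv and you want the Zendesk operation:

```bash
docker run scraper -f companies.csv -t zendesk     # scrape and store the raw content
docker run processor -f companies.csv -t zendesk   # extract (Question, Answer) pairs from it
```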
You can use AWS and MongoDB with this project. That will require more configuration, but I added writers and pipelines that can help with it.
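For the MongoDB side, such a pipeline typically follows the standard Scrapy/pymongo pattern. The sketch below uses assumed class and setting names (MongoFaqPipeline, MONGO_URI, MONGO_DATABASE) rather than this project's actual pipeline code; it would be enabled via ITEM_PIPELINES in the Scrapy settings.

```python
import pymongo


class MongoFaqPipeline:
    # Hypothetical MongoDB pipeline following the standard Scrapy pattern.
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Assumed setting names; read the connection details from settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "faqs"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One document per (Question, Answer) pair.
        self.db["faqs"].insert_one(dict(item))
        return item
```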