Amazon product web pages contain rich information spread across text and tables. Amazon Textual and TAbulaR Information extractIon (ATTARII) scrapes Amazon product web pages and extracts the sections of interest. These sections fall into two categories:
- Textual information
  - Product titles
  - Bullet points
  - Product descriptions
- Tabular information
  - Product detail tables
  - Product overview tables
These sections are marked in the figure below:
Given the URL of an Amazon product web page, ATTARII retrieves the page content with the WebDriver from the Selenium library. It then parses the HTML with the Beautiful Soup library and extracts the desired sections by their HTML tags and ids. There is an excellent tutorial for the Beautiful Soup library.
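The two steps above can be sketched roughly as follows. This is a minimal illustration, not ATTARII's actual code: the function names and the element ids in `TEXT_SECTION_IDS` are assumptions, and real product pages may use different ids.

```python
from bs4 import BeautifulSoup


def fetch_page_source(url: str) -> str:
    """Load a page in headless Chrome via Selenium and return its HTML."""
    # Imported lazily so the parsing helper below can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Assumed element ids for the textual sections (hypothetical, for illustration).
TEXT_SECTION_IDS = {
    "title": "productTitle",
    "bullet_points": "feature-bullets",
    "description": "productDescription",
}


def extract_textual_sections(html: str) -> dict:
    """Return the text of every section whose id is present in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for name, element_id in TEXT_SECTION_IDS.items():
        element = soup.find(id=element_id)
        if element is not None:
            # Join nested text nodes with single spaces and trim whitespace.
            sections[name] = element.get_text(" ", strip=True)
    return sections
```

Keeping the fetch and parse steps separate means the parser can be exercised on saved HTML, for example `extract_textual_sections(fetch_page_source(url))` for a live page or a local file's contents for offline checks.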
Different suppliers and developers may use different HTML tags and ids to embed the product data. In my tests on the Amazon-PQA dataset, the tool developed here extracted the desired sections for the majority of Amazon products.
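One common way to cope with this variation is to try several candidate ids in order and keep the first one that yields rows. The sketch below illustrates that idea only; the ids in `CANDIDATE_TABLE_IDS` and both function names are hypothetical and are not taken from ATTARII's source.

```python
from bs4 import BeautifulSoup

# Hypothetical candidate ids, tried in order; not ATTARII's actual list.
CANDIDATE_TABLE_IDS = [
    "productDetails_techSpec_section_1",
    "productDetails_detailBullets_sections1",
    "productOverview_feature_div",
]


def table_to_dict(container) -> dict:
    """Turn every two-cell row under `container` into a key/value pair."""
    rows = {}
    for tr in container.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        if len(cells) == 2:
            key = cells[0].get_text(" ", strip=True)
            rows[key] = cells[1].get_text(" ", strip=True)
    return rows


def extract_first_table(html: str, candidate_ids=CANDIDATE_TABLE_IDS) -> dict:
    """Return the rows of the first candidate element that contains any."""
    soup = BeautifulSoup(html, "html.parser")
    for element_id in candidate_ids:
        container = soup.find(id=element_id)
        if container is None:
            continue  # this page does not use that id; try the next one
        rows = table_to_dict(container)
        if rows:
            return rows
    return {}  # no candidate matched
```

Returning an empty dictionary when nothing matches lets the caller decide whether a missing table is an error or simply a product page without that section.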
To get started, you'll need Python and pip installed.
- Clone the Git repository
git clone https://github.com/anaeim/ATTARII.git
- Navigate to the project directory
cd ATTARII
- Create a directory for the extracted textual and tabular information
mkdir extracted_info
- Install the requirements
pip install -r requirements.txt
- Run ATTARII on an Amazon product web page
python ATTARII.py --URL https://www.amazon.com/dp/B08KHR6B3W/ \
--info-type tabular \
--verbosity-enabled \
--dump-info-enabled \
--dump-info-path extracted_info
The meaning of the flags:
- --URL: the URL of the Amazon product web page.
- --info-type: the type of information ATTARII extracts. You can choose between tabular and textual data.
- --verbosity-enabled: display the extracted information.
- --dump-info-enabled: dump and store the extracted information as a JSON file.
- --dump-info-path: the directory in which to dump and store the extracted information.
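For readers curious how flags like these are typically wired up, here is a minimal argparse sketch. It is an illustration under assumptions rather than ATTARII's real implementation: the defaults, the `report` helper, and the output file name are all hypothetical.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the five flags described above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Extract information from an Amazon product page"
    )
    parser.add_argument("--URL", required=True,
                        help="URL of the Amazon product web page")
    parser.add_argument("--info-type", choices=["tabular", "textual"],
                        default="tabular",
                        help="type of information to extract")
    parser.add_argument("--verbosity-enabled", action="store_true",
                        help="display the extracted information")
    parser.add_argument("--dump-info-enabled", action="store_true",
                        help="store the extracted information as JSON")
    parser.add_argument("--dump-info-path", default="extracted_info",
                        help="directory for the dumped JSON file")
    return parser


def report(info: dict, args: argparse.Namespace) -> None:
    """Print and/or dump the extracted info according to the flags."""
    text = json.dumps(info, indent=2)
    if args.verbosity_enabled:
        print(text)
    if args.dump_info_enabled:
        out_dir = Path(args.dump_info_path)
        out_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical file name; the real tool may name its output differently.
        out_file = out_dir / f"{args.info_type}_info.json"
        out_file.write_text(text, encoding="utf-8")
```

Note that argparse exposes `--info-type` as `args.info_type` and `--URL` as `args.URL`, since dashes inside option names become underscores while letter case is preserved.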
Here is an example of extracted tabular info for the Apple Watch Series 6 on Amazon: