Data-Scraper

A data scraper that filters the most relevant information out of large amounts of text-based data.

The script asks for a keyword to search for. It compares the keyword with each file's name and its contents. As soon as the keyword is found in either, the file is recorded as a match, and all matches are printed at the end.
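The matching step described above can be sketched as follows. This is a minimal illustration, not the code in Scraper.py: the function name `find_matches`, the case-insensitive comparison, and the restriction to plain-text contents are assumptions made for the example.

```python
from pathlib import Path


def find_matches(directory, keyword):
    """Return the names of files whose name or text contains the keyword."""
    needle = keyword.lower()  # assumption: matching is case-insensitive
    matches = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Check the file name first; only read the contents if the name misses.
        if needle in path.name.lower():
            matches.append(path.name)
            continue
        # Assumption for this sketch: only .txt contents are inspected here.
        if path.suffix.lower() == ".txt":
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text.lower():
                matches.append(path.name)
    return matches
```

Collecting the matches in a list and returning them at the end mirrors the behaviour described above, where results are output only after every file has been checked.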

File Content Read

The scraper can read only the following text-based file types:

.docx
.pdf
.txt
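One way to handle these three formats is to dispatch on the file extension. The sketch below is an assumption about how this could look, not the actual contents of Scraper.py: `read_text` is a hypothetical helper, and the `docx` import is assumed to come from the python-docx package.

```python
from pathlib import Path


def read_text(path):
    """Return the text of a .txt, .pdf, or .docx file, or None if unsupported."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="ignore")
    if suffix == ".pdf":
        import pdfplumber  # imported lazily so .txt files work without it

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer.
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # assumption: provided by the python-docx package

        document = docx.Document(str(path))
        return "\n".join(p.text for p in document.paragraphs)
    return None  # any other extension is skipped
```

Returning None for unsupported extensions lets the caller skip those files without raising an error.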

Usage

The scraper searches the ./DATA directory by default. To change that, edit the variable directory in Scraper.py.

Line 9: directory = "./DATA"

Note

The scraper iterates through every file in the directory. To speed up the search, limit the number of files in it.
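If moving files out of the directory is inconvenient, the file list could also be narrowed before anything is read. The following is a hedged sketch only: `candidate_files`, the `SUPPORTED` set, and the `limit` parameter are hypothetical names and do not exist in Scraper.py.

```python
from pathlib import Path

# The three extensions the scraper is documented to read.
SUPPORTED = {".docx", ".pdf", ".txt"}


def candidate_files(directory="./DATA", limit=None):
    """List supported files in the directory, optionally capped at `limit`."""
    files = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
    # A limit of None means "no cap"; otherwise keep only the first entries.
    return files if limit is None else files[:limit]
```

Filtering by extension first avoids opening files the scraper cannot read anyway.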

Requirements

Install the required libraries with pip:

pip install pdfplumber

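pip install docx

If importing docx fails on Python 3, note that the maintained package is published as python-docx (pip install python-docx) and is also imported as docx.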

Improving

Suggestions for improvements are welcome.
