Feature/updating indeed scraper (#166) #170

sammytheindi · 2024-09-13T02:10:38Z

Description

Merging fixed indeed scraper into master branch. For more information, see #166
Additional changes include updates to formatting, adding black and prettier configuration files, and performing minor updates to the repo (versioning, updating modules, etc.).

The biggest point of discussion is whether to keep the current method of scraping each individual job page, or to scrape the job list with its description summaries. The latter (the default implementation in this PR) is much faster, though it yields less information in the description. It is possible to switch back manually, but do we want to keep it as the default?
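The trade-off described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `JobStub`, `build_jobs`, and the `fetch_full_description` callback are hypothetical names, and the real scraper in `jobfunnel/backend/scrapers/indeed.py` may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class JobStub:
    """What one entry on a search-results page exposes (hypothetical shape)."""

    job_id: str
    title: str
    summary: str  # short description snippet shown in the job list


def build_jobs(
    stubs: List[JobStub],
    fetch_full_description: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, str]]:
    """Turn list-page stubs into job records.

    Without a fetcher, the list-page summary is used as the description
    (fast: no extra requests). With a fetcher, it is called once per job
    (slower: one extra request each, but a fuller description).
    """
    jobs = []
    for stub in stubs:
        if fetch_full_description is None:
            description = stub.summary
        else:
            description = fetch_full_description(stub.job_id)
        jobs.append(
            {"id": stub.job_id, "title": stub.title, "description": description}
        )
    return jobs
```

Making the per-job fetch an optional callback keeps the fast path as the default while still letting a user opt back in to full descriptions.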

Context of change

Software (software that runs on the PC)
Library (library that runs on the PC)
Tool (tool that assists coding development)
Other

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Same tests as in #166

Checklist:

I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
Any dependent changes have been merged and published in downstream modules.

* - Updated to mobile endpoints and user agents to prevent CAPTCHA
  - Updated parsing of indeed scraper
  - Fixed tags not being parsed correctly
  - Fixed remoteness not being parsed correctly
  - Changed to only scrape the first page of each search by default for speed
* - Updated method of loading user agent files
  - Updated user agent file of indeed scraper
* - Updated versions in requirements.txt
  - Added in black configuration file for formatting
  - Added a pre-commit hook so all contributors will have consistent formatting on upload
  - Updated all python files to conform to black formatter
* Updated Python version
* More black formatting updates
* - Added prettierrc and prettierignore
  - Formatted all files other than python
* Updated prettierignore so prettier can search through subdirectories
* Reset formatting to longer line width
* Reverted to previous commit
* Updating again to longer line width after accounting for missing files
* Updated prettierrc and prettierignore files and reran formatting
* Updated version
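For readers unfamiliar with the user-agent approach mentioned in these commits, here is a rough sketch of loading a user-agent list from a text file and picking one per request. It assumes one user agent per line and that blank lines and `#` comment lines are ignored; the function names are hypothetical and this is not the loader used in the PR.

```python
import random
from pathlib import Path
from typing import Dict, List, Optional


def load_user_agents(path: Path) -> List[str]:
    """Read one user agent per line, skipping blanks and '#' comment lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    agents = [
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not agents:
        raise ValueError(f"no user agents found in {path}")
    return agents


def build_headers(
    agents: List[str], rng: Optional[random.Random] = None
) -> Dict[str, str]:
    """Build request headers with a randomly chosen user agent."""
    rng = rng or random.Random()
    return {"User-Agent": rng.choice(agents)}
```

The resulting dictionary can be passed as the `headers` argument of whichever HTTP client the scraper uses.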

PaulMcInnis

Let's put back the markdown + demo file, but other than this I'm approving it.

.github/ISSUE_TEMPLATE/feature_request.md

demo/settings_USA.yaml

jobfunnel/backend/scrapers/indeed.py

jobfunnel/resources/user_agent_list_mobile.txt

readme.md

- Reverted settings_USA changes
- Updated readme
- Removed extra user-agent from phone user agents list
- Removed extra comments

PaulMcInnis

wow makes me so happy to see this working again :)

a few things

We should update the readme to say that at least Python 3.11 is required.
We need to add user_agent_list_mobile.txt to the MANIFEST.in file so that it is properly packaged when you install and run using pip install . and the funnel load -s... per the instructions for new users in the readme.
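One way to confirm that a data file actually ships with an installed package is to look it up through `importlib.resources` (Python 3.9+). This is only a sketch: `resource_is_packaged` is a hypothetical helper, and the example call in the comment assumes the file lives at `resources/user_agent_list_mobile.txt` inside an importable `jobfunnel` package.

```python
from importlib.resources import files


def resource_is_packaged(package: str, resource: str) -> bool:
    """Return True if `resource` exists inside the installed `package`."""
    try:
        root = files(package)
    except ModuleNotFoundError:
        # The package itself is not installed or not importable.
        return False
    return root.joinpath(resource).is_file()


# Assumed usage after `pip install .` (package and path are assumptions):
# resource_is_packaged("jobfunnel", "resources/user_agent_list_mobile.txt")
```

If this returns False after a fresh install, the file is probably missing from MANIFEST.in or from the package data configuration.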

sammytheindi · 2024-09-17T11:43:34Z

Thanks @PaulMcInnis, we should be good to go!

As I was updating, I was wondering why this project still uses setup.py, instead of the newer standard of pyproject.toml + setup.cfg (PEP-518).

I will open a new issue (#172) on this as I believe we should probably switch while the project is still small, but would love to hear if there are any reasons not to.

PaulMcInnis · 2024-09-17T15:06:28Z

> Thanks @PaulMcInnis, we should be good to go!
>
> As I was updating, I was wondering why this project still uses setup.py, instead of the newer standard of pyproject.toml + setup.cfg (PEP-518).

Hey there, no reason other than age really - this is an old project and it needs some love

sammytheindi self-assigned this Sep 13, 2024

sammytheindi added the bug label Sep 13, 2024

PaulMcInnis self-requested a review September 13, 2024 15:53

PaulMcInnis requested changes Sep 13, 2024

PaulMcInnis linked an issue Sep 13, 2024 that may be closed by this pull request

CAPTCHA has broken all scraping of Indeed #154

Closed

- Reverted Markdown changes

080e1e6

PaulMcInnis requested changes Sep 14, 2024

Changed readme to refer to python 3.11 instead of 3.8, and added the mobile user agent list to the MANIFEST.in

9950298

PaulMcInnis merged commit 72faea2 into master Sep 17, 2024
1 check failed
