Feature/updating indeed scraper (#166) #170

Merged
merged 3 commits into from
Sep 17, 2024

Conversation

sammytheindi
Collaborator

Description

Merges the fixed Indeed scraper into the master branch. For more information, see #166.
Additional changes include formatting updates, new black and prettier configuration files, and minor repo maintenance (versioning, updating modules, etc.).

The biggest point of discussion is whether to keep the existing method of scraping each individual job page, or to scrape the job list with its description summaries. The latter (the default implementation in this PR) is much faster, though it yields less information in the description. It is possible to switch back manually, but do we want to keep the list-based approach as the default implementation?
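
To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not the code in this PR: `CountingFetcher`, `scrape_list_only`, `scrape_each_job`, and `make_fake_site` are hypothetical names, and the "site" is an in-memory dict rather than Indeed. It only illustrates why the list-based approach needs one request per results page while the per-job approach needs one extra request for every posting.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CountingFetcher:
    """Hypothetical stand-in for an HTTP client; records every URL requested."""

    pages: Dict[str, dict]
    requested: List[str] = field(default_factory=list)

    def get(self, url: str) -> dict:
        self.requested.append(url)
        return self.pages[url]


def scrape_list_only(fetcher: CountingFetcher, list_url: str) -> List[dict]:
    """One request: keep only the short summary shown on the results page."""
    listing = fetcher.get(list_url)
    return [
        {"title": job["title"], "description": job["summary"]}
        for job in listing["jobs"]
    ]


def scrape_each_job(fetcher: CountingFetcher, list_url: str) -> List[dict]:
    """1 + N requests: follow every job link to get the full description."""
    listing = fetcher.get(list_url)
    jobs = []
    for job in listing["jobs"]:
        detail = fetcher.get(job["url"])
        jobs.append({"title": job["title"], "description": detail["description"]})
    return jobs


def make_fake_site(n_jobs: int) -> Dict[str, dict]:
    """Build an in-memory results page plus one detail page per job."""
    pages: Dict[str, dict] = {"list": {"jobs": []}}
    for i in range(n_jobs):
        url = f"job/{i}"
        pages["list"]["jobs"].append(
            {"title": f"Job {i}", "summary": "short summary", "url": url}
        )
        pages[url] = {"description": "a much longer, full job description"}
    return pages
```

With 15 postings on a results page, the list-only function records a single request and the per-job function records 16, which is the speed difference being discussed; the cost is that only the summary text ends up in the description.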

Context of change

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Same tests as in #166

Checklist:

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

* - Updated to mobile endpoints and user agents to prevent CAPTCHA
- Updated parsing of indeed scraper
- Fixed tags not being parsed correctly
- Fixed remoteness not being parsed correctly
- Changed to only scrape the first page of each search by default for speed

* - Updated method of loading user agent files
- Updated user agent file of indeed scraper

* - Updated versions in requirements.txt
- Added in black configuration file for formatting
- Added a pre-commit hook so all contributors will have consistent
  formatting on upload
- Updated all python files to conform to black formatter

* Updated Python version

* More black formatting updates

* - Added prettierrc and prettierignore
- Formatted all files other than python

* Updated prettierignore so prettier can search through subdirectories

* Reset formatting to longer line width

* Reverted to previous commit

* Updating again to longer line width after accounting for missing files

* Updated prettierrc and prettierignore files and reran formatting

* Updated version
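
For readers unfamiliar with the user-agent changes listed in the commits above, the general idea can be sketched as follows. This is an illustration only, not the loader used in jobfunnel: `load_user_agents` and `build_headers` are hypothetical helpers, and the file format (one user agent per line, blank lines and `#` comments ignored) is an assumption.

```python
import random
from pathlib import Path
from typing import List, Optional


def load_user_agents(path: Path) -> List[str]:
    """Read one user agent per line, skipping blank lines and '#' comments."""
    agents = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            agents.append(line)
    if not agents:
        raise ValueError(f"no user agents found in {path}")
    return agents


def build_headers(agents: List[str], rng: Optional[random.Random] = None) -> dict:
    """Pick one user agent at random to use as the User-Agent header."""
    rng = rng or random.Random()
    return {"User-Agent": rng.choice(agents)}
```

The returned dict can be passed as request headers to whatever HTTP client the scraper uses; passing a seeded `random.Random` makes the choice reproducible in tests.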
@sammytheindi sammytheindi self-assigned this Sep 13, 2024
@PaulMcInnis PaulMcInnis self-requested a review September 13, 2024 15:53
Owner

@PaulMcInnis PaulMcInnis left a comment

let's put back the markdown + demo file, but other than this I'm approving it

Review threads (all resolved):
  • .github/ISSUE_TEMPLATE/feature_request.md (outdated)
  • demo/settings_USA.yaml (outdated)
  • jobfunnel/backend/scrapers/indeed.py
  • jobfunnel/resources/user_agent_list_mobile.txt (outdated)
  • readme.md (outdated)
@PaulMcInnis PaulMcInnis linked an issue Sep 13, 2024 that may be closed by this pull request
- Reverted settings_USA changes
- Updated readme
- Removed extra user-agent from phone user agents list
- Removed extra comments
Owner

@PaulMcInnis PaulMcInnis left a comment

wow makes me so happy to see this working again :)

a few things

  1. we should update the readme to say that Python 3.11 or later is required
  2. we need to add user_agent_list_mobile.txt to the MANIFEST.in file so that it is properly packaged when you install and run using `pip install .` and `funnel load -s...`, per the instructions for new users in the readme.
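
One way to confirm the second point after editing MANIFEST.in is to build a source distribution and check that the resource file is actually inside it. The sketch below is a hypothetical helper, not part of jobfunnel; it assumes a gzip-compressed sdist tarball and uses the resource path shown in the review threads on this PR.

```python
import tarfile
from pathlib import Path


def sdist_contains(sdist_path: Path, relative_name: str) -> bool:
    """Return True if a member of the .tar.gz sdist ends with relative_name.

    sdist members are normally prefixed with a top-level "<name>-<version>/"
    directory, so the check matches on the path suffix.
    """
    with tarfile.open(sdist_path, "r:gz") as archive:
        return any(
            name == relative_name or name.endswith("/" + relative_name)
            for name in archive.getnames()
        )
```

For example, `sdist_contains(path_to_sdist, "jobfunnel/resources/user_agent_list_mobile.txt")` should return True once the file is listed in MANIFEST.in; here `path_to_sdist` stands for whatever tarball your build step produces.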

@sammytheindi
Collaborator Author

sammytheindi commented Sep 17, 2024

Thanks @PaulMcInnis, we should be good to go!

As I was updating, I was wondering why this project still uses setup.py instead of the newer standard of pyproject.toml + setup.cfg (PEP 518).

I will open a new issue (#172) on this as I believe we should probably switch while the project is still small, but would love to hear if there are any reasons not to.

@PaulMcInnis
Owner

> Thanks @PaulMcInnis, we should be good to go!
>
> As I was updating, I was wondering why this project still uses setup.py instead of the newer standard of pyproject.toml + setup.cfg (PEP 518).

Hey there, no reason other than age really - this is an old project and it needs some love

@PaulMcInnis PaulMcInnis merged commit 72faea2 into master Sep 17, 2024
1 check failed
Successfully merging this pull request may close these issues.

CAPTCHA has broken all scraping of Indeed