Scraping Fails: class names has been changed #4

Open
HesamKorki opened this issue Jun 3, 2020 · 2 comments

@HesamKorki

The target website has changed its CSS class names, and the routing between some pages has changed as well. The changes are listed below in the order they appear in the scraping script:

  • {'class': 'category-object'} --> {'class': 'subCategory___BRUDy'}

  • name = category.find('h3', {'class': 'sub-category__header'}).text --> name = category.get('id')

  • {'class': 'sub-category-list'} --> {'class': 'subCategoryList___r67Qj'}

  • {'class': 'child-category'} --> {'class': 'subCategoryItem___3ksKz'}

  • sub_category_name = sub_category.find('a', {'class': 'sub-category-item'}).text --> sub_category_name = sub_category.find('a', {'class': 'navigation___2Efid'}).find('span').text

  • {'class': 'sub-category-item'} --> {'class': 'navigation___2Efid'}

  • '//a[@class="category-business-card card"]' --> '//a[@class="wrapper___2rOTx"]'

  • '//a[@class="button button--primary next-page"]' --> '//a[@class="paginationLinkNormalize___scOgG paginationLinkNext___1LQ14"]'

  • (By.CLASS_NAME, 'category-business-card card') --> (By.CLASS_NAME, 'wrapper___2rOTx')

  • next_url = base_url + data[category][sub_category] + "?numberofreviews=0&timeperiod=0&status=all" + f'&page={c}' --> next_url = base_url + data[category][sub_category] + "?numberofreviews=0&"+ f'&page={c}'+"&status=all&timeperiod=0"

  • (By.CLASS_NAME, 'category-business-card card') --> (By.CLASS_NAME, 'wrapper___2rOTx')
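
The updated next_url in the list above concatenates "?numberofreviews=0&" with '&page={c}', which leaves an empty parameter between two ampersands. Servers typically ignore that, but the query string can also be built from a dict. The following is only a sketch: the function name build_page_url and the example base URL and path are hypothetical, and the parameter order is copied from the updated URL.

```python
from urllib.parse import urlencode


def build_page_url(base_url: str, path: str, page: int) -> str:
    """Return the listing URL for one results page.

    The parameter order mirrors the updated URL in the list above;
    urlencode joins the pairs with single ampersands.
    """
    params = {
        "numberofreviews": 0,
        "page": page,
        "status": "all",
        "timeperiod": 0,
    }
    return f"{base_url}{path}?{urlencode(params)}"


# Hypothetical values, for illustration only.
print(build_page_url("https://example.com", "/categories/some_category", 2))
```

In the script this would stand in for the string concatenation, e.g. next_url = build_page_url(base_url, data[category][sub_category], c).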

Also, tqdm_notebook raises an AttributeError saying it has no 'sp' attribute. That is understandable, since the notebook project is just experimental. Just replace tqdm_notebook with tqdm and it works!
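
A minimal sketch of that swap, assuming the script imported tqdm_notebook from the tqdm package. The no-op fallback and the pages range are additions for illustration only, so the snippet also runs where tqdm is not installed.

```python
try:
    # Plain console progress bar instead of tqdm_notebook.
    from tqdm import tqdm
except ImportError:
    # Assumption for this sketch only: without tqdm installed,
    # fall back to a wrapper that just returns the iterable.
    def tqdm(iterable, **kwargs):
        return iterable


pages = range(1, 4)  # hypothetical page numbers
visited = [page for page in tqdm(pages, desc="pages")]
```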

@HesamKorki HesamKorki changed the title Scraping Fails: class names has been changes Scraping Fails: class names has been changed Jun 3, 2020
@markovivl

Thank you very much for the thorough list of changes. At first I could not figure out why the original script returned empty data.

@mhbl3

mhbl3 commented Dec 5, 2020

Most recent class names as of 12/05/2020:

BUSINESS_CARD = "internal___1jK0Z wrapper___26yB4"
NEXT_PAGE = "paginationLinkNormalize___scOgG paginationLinkNext___1LQ14"
CATEGORY = "subCategory___BRUDy"
SUB_CATEGORY = "subCategoryList___r67Qj"
SUB_CATEGORY_ITEM = "subCategoryItem___3ksKz"
SUB_CATEGORY_NAME = "internal___1jK0Z typography___lxzyt weight-inherit___229vl navigation___2n5P8"
NAME = "subCategoryHeader___36ykD"

# Build {category name: {sub-category name: URI}} from the categories page.
for category in soup.find_all('div', {'class': CATEGORY}):
    name = category.find('h3', {'class': NAME}).text.strip()
    data[name] = {}
    sub_categories = category.find('div', {'class': SUB_CATEGORY})
    for sub_category in sub_categories.find_all('div', {'class': SUB_CATEGORY_ITEM}):
        link = sub_category.find('a', {'class': SUB_CATEGORY_NAME})
        data[name][link.text] = link['href']

# Selenium lookups for the business cards and the next-page button.
a_list = driver.find_elements_by_xpath(f'//a[@class="{BUSINESS_CARD}"]')
button = driver.find_element_by_xpath(f'//a[@class="{NEXT_PAGE}"]')
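
Between the two lists in this thread the hashed suffixes changed (wrapper___2rOTx became wrapper___26yB4, navigation___2Efid became navigation___2n5P8) while the part before the triple underscore stayed the same. One possible way to make the script less brittle is to match on that prefix only. This is a sketch under the assumption that the prefixes stay stable; the helper names class_startswith and xpath_class_contains are hypothetical and are not part of the original script.

```python
from typing import Callable, Optional


def class_startswith(prefix: str) -> Callable[[Optional[str]], bool]:
    """Return a predicate that is True for class names starting with prefix."""
    def matches(css_class: Optional[str]) -> bool:
        return css_class is not None and css_class.startswith(prefix)
    return matches


def xpath_class_contains(tag: str, prefix: str) -> str:
    """Return an XPath matching tags whose class attribute contains prefix.

    contains() is a substring test, so it can also match unrelated
    classes that happen to include the same text.
    """
    return f'//{tag}[contains(@class, "{prefix}")]'


# Possible wiring into the script (untested against the live site):
#   soup.find_all('div', class_=class_startswith("subCategory___"))
#   driver.find_elements_by_xpath(xpath_class_contains("a", "wrapper___"))
is_business_card = class_startswith("wrapper___")
print(is_business_card("wrapper___26yB4"))
print(xpath_class_contains("a", "wrapper___"))
```

A prefix match like this would have survived the suffix changes seen in this thread, but it would still break if the site renames the prefixes themselves.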
