Scraping Fails: class names has been changed #4

HesamKorki · 2020-06-03T23:14:21Z

The target website has changed its CSS files and class names, and the routing between pages has changed too. I will list the changes in order, from the start of the scraping script to the end:

{'class': 'category-object'} --> {'class': 'subCategory___BRUDy'}
name = category.find('h3', {'class': 'sub-category__header'}).text --> name = category.get('id')
{'class': 'sub-category-list'} --> {'class': 'subCategoryList___r67Qj'}
{'class': 'child-category'} --> {'class': 'subCategoryItem___3ksKz'}
sub_category_name = sub_category.find('a', {'class': 'sub-category-item'}).text --> sub_category_name = sub_category.find('a', {'class': 'navigation___2Efid'}).find('span').text
{'class': 'sub-category-item'} --> {'class': 'navigation___2Efid'}
'//a[@class="category-business-card card"]' --> '//a[@class="wrapper___2rOTx"]'
'//a[@class="button button--primary next-page"]' --> '//a[@class="paginationLinkNormalize___scOgG paginationLinkNext___1LQ14"]'
(By.CLASS_NAME, 'category-business-card card') --> (By.CLASS_NAME, 'wrapper___2rOTx')
next_url = base_url + data[category][sub_category] + "?numberofreviews=0&timeperiod=0&status=all" + f'&page={c}' --> next_url = base_url + data[category][sub_category] + "?numberofreviews=0&"+ f'&page={c}'+"&status=all&timeperiod=0"
(By.CLASS_NAME, 'category-business-card card') --> (By.CLASS_NAME, 'wrapper___2rOTx')
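
The query-string change in the list above could also be expressed with `urllib.parse.urlencode` rather than string concatenation. This is only a sketch: `build_next_url` is a hypothetical helper, the example host and path are placeholders, and it emits a single `&` before `page`, on the assumption that the site treats that the same as the doubled `&&` produced by the concatenated version.

```python
from urllib.parse import urlencode


def build_next_url(base_url: str, path: str, page: int) -> str:
    """Build a listing URL with the query parameters in the new order."""
    # Dicts preserve insertion order, so the parameters are emitted
    # as numberofreviews, page, status, timeperiod.
    params = {
        "numberofreviews": 0,
        "page": page,
        "status": "all",
        "timeperiod": 0,
    }
    return f"{base_url}{path}?{urlencode(params)}"


# Placeholder values; in the script these would come from base_url
# and data[category][sub_category].
example_url = build_next_url("https://example.com", "/categories/example", 2)
print(example_url)
```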

Also, tqdm_notebook throws an AttributeError saying that it has no 'sp' attribute. That is understandable, since the notebook variant is still experimental. Just replace tqdm_notebook with tqdm and it works!
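
A minimal sketch of that swap follows. The `ImportError` fallback and the `pages` list are assumptions added for illustration, so that the loop still runs without a progress bar if tqdm is not installed.

```python
try:
    # Plain console progress bar instead of tqdm_notebook
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback (assumption): iterate without showing a progress bar."""
        return iterable

# Hypothetical work items standing in for the pages being scraped
pages = ["page-1", "page-2", "page-3"]
visited = [page for page in tqdm(pages, desc="scraping")]
```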

markovivl · 2020-06-09T13:15:59Z

Thank you very much for the thorough list of changes. At first I could not figure out why the original script returned empty data.

mhbl3 · 2020-12-05T19:45:32Z

Most recent class names as of 12/05/2020:

BUSINESS_CARD = "internal___1jK0Z wrapper___26yB4"
NEXT_PAGE = "paginationLinkNormalize___scOgG paginationLinkNext___1LQ14"
CATEGORY = "subCategory___BRUDy"
SUB_CATEGORY = "subCategoryList___r67Qj"
SUB_CATEGORY_ITEM = "subCategoryItem___3ksKz"
SUB_CATEGORY_NAME = "internal___1jK0Z typography___lxzyt weight-inherit___229vl navigation___2n5P8"
NAME = "subCategoryHeader___36ykD"

    for category in soup.find_all('div', {'class': CATEGORY}):
        # The category heading text is used as the top-level key
        name = category.find('h3', {'class': NAME}).text.strip()
        data[name] = {}
        sub_categories = category.find('div', {'class': SUB_CATEGORY})
        for sub_category in sub_categories.find_all('div', {'class': SUB_CATEGORY_ITEM}):
            # Look the link up once, then read both its text and its href
            link = sub_category.find('a', {'class': SUB_CATEGORY_NAME})
            data[name][link.text] = link['href']

a_list = driver.find_elements_by_xpath(f'//a[@class="{BUSINESS_CARD}"]')

button = driver.find_element_by_xpath(f'//a[@class="{NEXT_PAGE}"]')
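
Since only the hashed suffix seems to change between updates (for example `wrapper___2rOTx` became `wrapper___26yB4`), one possible way to make the XPath lookups less brittle is to match on the stable prefix with `contains()`. This is an untested sketch, not something the original script does: `xpath_for_class_prefix` is a hypothetical helper, and a substring match may also pick up unrelated elements whose class happens to contain the same prefix.

```python
def xpath_for_class_prefix(tag: str, prefix: str) -> str:
    """Return an XPath matching <tag> elements whose class contains `prefix`."""
    # contains() is a substring test on the whole class attribute,
    # so the hashed suffix after the prefix no longer matters.
    return f'//{tag}[contains(@class, "{prefix}")]'


BUSINESS_CARD_XPATH = xpath_for_class_prefix("a", "wrapper___")
NEXT_PAGE_XPATH = xpath_for_class_prefix("a", "paginationLinkNext___")

# These strings would be passed to the same calls as above, e.g.
# driver.find_elements_by_xpath(BUSINESS_CARD_XPATH)
print(BUSINESS_CARD_XPATH)
print(NEXT_PAGE_XPATH)
```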

HesamKorki changed the title ~~Scraping Fails: class names has been changes~~ Scraping Fails: class names has been changed Jun 3, 2020
