Adding question mark to the sample fails #40

entrptaher · 2023-05-01T08:15:50Z

The following code,

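    from bs4 import BeautifulSoup
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
    training_set = TrainingSet()
    page = Page(training_file)
    sample = Sample(page, {'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)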

throws this error:

mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'

But the same code works when the question mark is removed from both the HTML and the sample value:

    training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
    training_set = TrainingSet()
    page = Page(training_file)
    sample = Sample(page, {'title': 'with a question mark'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)


entrptaher · 2023-05-01T08:26:20Z

The same file worked with autoscraper without any issue.

lorey · 2023-05-01T14:19:53Z

Thanks, that's very weird.

Which version are you using? Since generate_all_value_matches just calls BeautifulSoup's find_all in the latest version, I have no answer yet.

lorey · 2023-05-01T14:21:00Z

Does the same happen for <html><body><p>what?</p></body></html>?

lorey · 2023-05-01T14:32:17Z

This is how it's meant to be called, not sure what you're trying to achieve.

    training_set = TrainingSet()
    html = "<html><body><p>with a question mark?</p></body></html>"
    page = Page(html)
    sample = Sample(page, {
        'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)
    print(scraper)

lorey · 2023-05-01T14:34:34Z

Prettify creates whitespace that mlscraper currently is sensitive to. I know this is not perfect, but it's on the roadmap.
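
To illustrate the whitespace point, here is a minimal sketch using only plain strings. The `prettified_text` value is an assumption about what the text node roughly looks like after `prettify()` (text moved onto its own indented line); it is not taken from mlscraper's matching code.

```python
# Text of the <p> node in the compact markup.
compact_text = "with a question mark?"
# Assumed text of the same node after prettify(): surrounded by
# a newline and indentation (hypothetical, for illustration only).
prettified_text = "\n   with a question mark?\n  "

value = "with a question mark?"

exact_compact = compact_text == value                # exact comparison succeeds
exact_prettified = prettified_text == value          # extra whitespace breaks it
stripped_prettified = prettified_text.strip() == value  # stripping restores the match

print(exact_compact, exact_prettified, stripped_prettified)
```

Any matcher that compares text exactly would behave like the second comparison, which is why passing the raw HTML string to Page avoids the problem.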

lorey · 2023-05-01T14:36:01Z

Related: #15

entrptaher · 2023-05-02T04:29:23Z

I found the issue here,

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(item))

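This generates a wrong regex: the unescaped ? is treated as a quantifier that makes the preceding k optional, so the literal question mark in the text never matches. Printing the item and the compiled pattern gives:

    with a question mark?
    re.compile('^\\s*with a question mark?\\s*$')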

Using re.escape fixes this issue,

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))
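
The difference can be reproduced outside mlscraper with the standard library alone. The sketch below is an illustration, not the library's code: it leaves out the html.escape call and uses hypothetical variable names, and only contrasts the unescaped pattern with the re.escape version.

```python
import re

text = "with a question mark?"

# Unescaped: "?" is a quantifier, so the pattern means "mar" plus an optional "k".
unescaped = re.compile(r"^\s*%s\s*$" % text)

# Escaped: re.escape turns "?" into a literal "\?".
escaped = re.compile(r"^\s*%s\s*$" % re.escape(text))

# The unescaped pattern cannot match the text it was built from...
print(unescaped.match(text))
# ...but it does match the text with the final "k" missing.
print(unescaped.match("with a question mar"))
# The escaped pattern matches, including surrounding whitespace.
print(escaped.match("  with a question mark?\n"))
```

The first call returns None, while the other two return match objects.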

lorey · 2023-05-02T07:57:50Z

Good catch!

lorey added the "bug" label · May 1, 2023

lorey added the "invalid" label and removed the "bug" label · May 1, 2023

lorey closed this as completed · May 1, 2023

lorey reopened this · May 2, 2023

lorey added the "bug" label and removed the "invalid" label · May 2, 2023
