Adding question mark to the sample fails #40
Comments
The same file worked with autoscraper without any issue.
Thanks, that's very weird.
Does the same happen for |
This is how it's meant to be called, not sure what you're trying to achieve.

    training_set = TrainingSet()
    html = "<html><body><p>with a question mark?</p></body></html>"
    page = Page(html)
    sample = Sample(page, {'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)
    print(scraper)
Prettify creates whitespace that mlscraper is currently sensitive to. I know this is not perfect, but it's on the roadmap.
Related: #15
I found the issue here:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"
        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(item))

This generates a wrong regex, because the sample text is inserted into the pattern without regex escaping, so the ? is treated as a quantifier instead of a literal character. Using re.escape on the item avoids that:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"
        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))
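For anyone who wants to check the difference outside of mlscraper, here is a minimal standalone sketch. It only uses the standard library and copies the two pattern expressions from the snippets above. The variable names (text, unescaped, escaped) are made up for illustration and are not part of mlscraper.

```python
import html
import re

# Sample text from this issue (illustrative variable name, not mlscraper code).
text = "with a question mark?"

# Pattern built as in the original code: the "?" becomes a quantifier
# that makes the preceding "k" optional.
unescaped = re.compile(r"^\s*%s\s*$" % html.escape(text))

# Pattern built as in the proposed fix: re.escape() turns "?" into a literal "\?".
escaped = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(text)))

# The unescaped pattern does not match the sample text itself ...
print(unescaped.match(text))
# ... but it does match the text with the "k" and "?" dropped.
print(unescaped.match("with a question mar") is not None)
# The escaped pattern matches the sample, also with surrounding whitespace.
print(escaped.match(text) is not None)
print(escaped.match("  with a question mark?\n") is not None)
```

The first print shows None, which is why the sample could not be found, while the other three show True. The same would presumably apply to other regex metacharacters in a sample, such as parentheses, +, * or a dot.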
Good catch!
The following code,
Throws error
But the following code works just without the question mark in the html,